Apache Hadoop
Distribute your data and your application
Bernd Fondermann, freelance software architect
bernd.fondermann@brainlounge.de
berndf@apache.org

Apache
The Apache Software Foundation
Community and code
Apache Software License
Free + Open + Source
Mission: software free of charge
Can be used in closed-source products without problems

Speaker
Independent software architect, Frankfurt
Member, The Apache Software Foundation
Active in Apache James and Apache Labs (PMC Chair)
Vysper: XMPP server

Contents
Hadoop overview
Distributed computing
Architecture
Distributed file system: HDFS
Distributed database: HBase
Distributed programs: Map/Reduce
The Hadoop ecosystem

Hadoop Products
Hadoop HDFS: Distributed File System
Hadoop HBase: Distributed Database
MapReduce: Distributed Data Processing
ZooKeeper: Coordinate Distributed Processes
Pig: Data Analysis Language

Why Hadoop?
Improves...
Scalability (data volume)
Throughput
Reliability: design for failure
...by using COTS hardware
Trade-offs: latency, consistency

Classic: 3-tier

3-tier Properties
Data scales to... gigabytes
Relational DB: dozens of columns, millions of rows
Redundant data is not live
Scalability is very limited
Several single points of failure

Distributed Computing

Hadoop Properties
Data scales to... terabytes
Distributed DB: millions of columns, billions of rows
All redundant data is accessible
Scales to 10,000+ DataNodes
DataNode = fail-over, NameNode = SPOF (single point of failure)

Distributed Write (diagram)
Participants: Client, NameNode (replication control), nodes labeled NYC and Zürich
1. Coordinate write
2. Initial write
3. Replication coordination
4a. Replicate
4b. Replicate

HDFS
Modeled on the Google File System
Distributed FS
Software FS that uses the file systems of the underlying operating systems (Linux)
R/W: the client accesses the DataNode directly
The FS handles distribution and fail-over
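
The write path and fail-over behavior described for the distributed file system can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the classes DataNode and NameNode and the functions write_block and read_block are hypothetical names invented for this example and are not the Hadoop API. The model keeps just the two ideas from the slides: the NameNode coordinates while the client talks to DataNodes directly, and a read falls back to another replica when a node is down.

```python
class DataNode:
    """Toy stand-in for a DataNode (hypothetical): keeps blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        if not self.alive:
            raise IOError(f"{self.name} is down")
        return self.blocks[block_id]


class NameNode:
    """Toy stand-in for the NameNode (hypothetical): tracks block locations only."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.locations = {}  # block_id -> list of DataNode objects

    def allocate(self, block_id):
        # Pick the first live nodes as replica targets (a real system is smarter).
        targets = [n for n in self.datanodes if n.alive][: self.replication]
        if not targets:
            raise IOError("no live DataNode available")
        self.locations[block_id] = targets
        return targets

    def locate(self, block_id):
        return self.locations[block_id]


def write_block(namenode, block_id, data):
    targets = namenode.allocate(block_id)   # coordinate the write
    first, *rest = targets
    first.store(block_id, data)             # initial write, client to DataNode
    for node in rest:                       # replicate to the remaining targets
        node.store(block_id, first.blocks[block_id])


def read_block(namenode, block_id):
    # Try each replica in turn; skip nodes that are down (fail-over).
    for node in namenode.locate(block_id):
        try:
            return node.read(block_id)
        except IOError:
            continue
    raise IOError("no live replica")


nyc, zurich = DataNode("NYC"), DataNode("Zürich")
namenode = NameNode([nyc, zurich])
write_block(namenode, "blk_1", b"Hi!")
nyc.alive = False                           # simulate a DataNode failure
recovered = read_block(namenode, "blk_1")
```

In this sketch the NameNode never touches the data itself; it only hands out locations, which is why losing a DataNode is survivable while losing the NameNode is not.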

HBase
Modeled on Google's BigTable
Built on HDFS
Every cell is versioned
Sparse matrix
Schema-free, no foreign keys
Rows are sorted by a defined key
Every column belongs to a ColumnFamily

RDB: Storing Mail
key  h_from            h_to               body                             type  read           prio
M1   info@openexpo.ch  berndf@apache.org  Hi!                              text  08.9.1. 08:24  2
M2   spam@spammer.de   berndf@apache.org  <a href="scam.html">Buy me!</a>  html  08.9.1. 00:00  -1

HBase: Storing Mail
key  timestamp  header:from       header:to          body:text  body:html                        tag:read       tag:prio
M1   t3         info@openexpo.ch  berndf@apache.org  Hi!        <b>hi!</b>                       -              -
M2   t4         spam@spammer.de   berndf@apache.org  -          <a href="scam.html">Buy me!</a>  Yes            -
M1   t5         -                 -                  -          -                                08.9.1. 08:24  1
M1   t6         -                 -                  -          -                                -              2
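
The mail example can be read as a nested map: row key, then column (family:qualifier), then timestamp, then value. The following is a minimal in-memory sketch of that idea, not the HBase client API. The class ToyTable and its methods are hypothetical, and the slide's timestamps t3 to t6 are represented as the integers 3 to 6 for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model of a sparse, versioned table."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> column name -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, ts, value):
        # Every column must belong to a declared column family.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column][ts] = value

    def get(self, row, column, ts=None):
        # Return the newest version, or the newest one at or before ts.
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        # Rows come back sorted by key; only cells that exist are listed.
        for key in sorted(self.rows):
            yield key, {col: self.get(key, col) for col in sorted(self.rows[key])}


mail = ToyTable(["header", "body", "tag"])
mail.put("M1", "header:from", 3, "info@openexpo.ch")
mail.put("M1", "body:text", 3, "Hi!")
mail.put("M2", "header:from", 4, "spam@spammer.de")
mail.put("M1", "tag:prio", 5, 1)
mail.put("M1", "tag:prio", 6, 2)

latest_prio = mail.get("M1", "tag:prio")         # newest version
older_prio = mail.get("M1", "tag:prio", ts=5)    # version as of t5
row_keys = [key for key, _ in mail.scan()]
```

Cells that were never written simply do not exist in the map, which is what makes the matrix sparse, and an update adds a new version rather than overwriting the old one.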

Map/Reduce
Modeled on Google's Map/Reduce paper
Runs programs on Hadoop
Keeps code and data close together
Distributes and parallelizes data and code

Map (diagram): a big problem with many data is split across several DataNodes; each DataNode holds both code and data.

Reduce (diagram): each DataNode produces a partial result; the partial results are combined into the result of the big problem.

Map/Reduce Applications
Aggregations over large amounts of data
For each web page, count how many other pages link to it
Monte Carlo simulation
Inverting large matrices
See Apache Mahout

Map/Reduce: Link Counter
Map Job 1, sites [a-g].ch: openexpo.ch 2, apache.org 1
Map Job 2, sites [h-l].ch: openexpo.ch 1
Map Job 3, sites [m-z].ch: openexpo.ch 1
Reduce: openexpo.ch 4, apache.org 1
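
The link counter can be sketched in plain Python to show the shape of the computation. This is not Hadoop's MapReduce API: the function names and the linking sites (a.ch, b.ch, h.ch, m.ch) are made up for illustration, chosen so that the partial counts match the numbers on the slide. Each partition stands for the input of one map job.

```python
from collections import defaultdict


def map_links(pages):
    """Map step: emit (target, 1) for every outgoing link in one partition."""
    for _source, targets in pages.items():
        for target in targets:
            yield target, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each link target."""
    totals = defaultdict(int)
    for target, count in pairs:
        totals[target] += count
    return dict(totals)


# Hypothetical input, one dict per map job: source site -> sites it links to.
partitions = [
    {"a.ch": ["openexpo.ch", "apache.org"], "b.ch": ["openexpo.ch"]},  # [a-g].ch
    {"h.ch": ["openexpo.ch"]},                                         # [h-l].ch
    {"m.ch": ["openexpo.ch"]},                                         # [m-z].ch
]

# Each map job aggregates locally and yields a partial result.
partials = [reduce_counts(map_links(pages)) for pages in partitions]

# The reduce step merges the partial results into the final counts.
result = reduce_counts(
    pair for partial in partials for pair in partial.items()
)
```

Because addition is associative, the same summing function can serve both for the local partial results and for the final merge, which is what lets the map jobs run independently on separate nodes.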

Managing Map/Reduce
TaskTracker
Starts and monitors nodes
Coordinates the transition from Map to Reduce
Fail-over of individual tasks: onto replacement nodes

Related Apache Products
Apache Nutch: Internet crawler
Apache Mahout: machine learning
Hama (Incubator): matrix operations
CouchDB (Incubator): distributed DB (Erlang)

Links
http://apache.org
http://hadoop.apache.org
http://incubator.apache.org/pig
http://lucene.apache.org/mahout
http://incubator.apache.org/hama
http://labs.apache.org/
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/bigtable.html

Thank you!
Visit us at the ASF booth!
Questions and answers
Thu 10:10  Brian Fitzpatrick: Do you believe in the Users?
Thu 13:10  Lars Eilebrecht: Behind the Scenes of the ASF
Thu 13:50  Ceki Gülcü: SLF4j and logback projects
Thu 15:10  Bertrand Delacrétaz: Taming content repositories with Sling