Hadoop: An Open-Source Implementation of MapReduce and BigTable, by Philipp Kemkes




2 Hadoop

- Framework for scalable, distributed software
- Built to process large volumes of data (terabytes to petabytes)
- Open-source project of the Apache Software Foundation
- Developed in Java
  - Growing share of C++ code for speed optimization
- Modular system made up of (independent) components


3 Hadoop Components

- Hadoop Common
- Hadoop Distributed File System
- MapReduce
- HBase
- Hive

4 Hadoop Common

- Serialization
- Data types
- File system interface
  - Supported file systems: local file system, HDFS (GFS), Amazon S3, CloudStore (GFS), FTP / HTTP
  - Operations: create(Path f), delete(Path f), exists(Path f), listStatus(Path f), mkdirs(Path f), open(Path f), rename(Path src, Path dst)
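The point of a common file system interface is that client code can stay the same while the storage backend changes. The following is only a minimal sketch of that idea: `MiniFileSystem`, `InMemoryFileSystem`, and `FileSystemDemo` are hypothetical names invented for illustration and are not Hadoop classes, and the operations are simplified to work on strings.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical miniature of a file system abstraction; not Hadoop's actual classes.
interface MiniFileSystem {
    void create(String path, String content);
    boolean exists(String path);
    boolean rename(String src, String dst);
    boolean delete(String path);
    String open(String path);
}

// One possible backend: all files live in a sorted in-memory map.
final class InMemoryFileSystem implements MiniFileSystem {
    private final Map<String, String> files = new TreeMap<>();

    @Override public void create(String path, String content) { files.put(path, content); }
    @Override public boolean exists(String path) { return files.containsKey(path); }
    @Override public boolean rename(String src, String dst) {
        String content = files.remove(src);
        if (content == null) return false;   // nothing to rename
        files.put(dst, content);
        return true;
    }
    @Override public boolean delete(String path) { return files.remove(path) != null; }
    @Override public String open(String path) { return files.get(path); }
}

final class FileSystemDemo {
    // Client code depends only on the interface, so the backend can be swapped.
    static String archive(MiniFileSystem fs, String path) {
        String target = path + ".bak";
        fs.rename(path, target);
        return target;
    }

    public static void main(String[] args) {
        MiniFileSystem fs = new InMemoryFileSystem();
        fs.create("/home/user/alt.txt", "hello");
        String target = archive(fs, "/home/user/alt.txt");
        System.out.println(fs.exists(target) + " " + fs.open(target));
    }
}
```

A second backend (for example one that talks to a remote store) would only need to implement the same five methods; `archive` would not change.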



6 Hadoop Distributed File System

- Distributed file system
- Inspired by the Google File System
- Built to withstand hardware failures
- Scalable
- Intended range: terabytes to petabytes

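7 HDFS Architecture

- NameNode: the table of contents (which file consists of which blocks, and where they are stored)
- DataNode: the actual storage
- Secondary NameNode: keeps periodic checkpoints of the NameNode's metadata (a backup, not a hot standby)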
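The division of labor between NameNode and DataNodes can be sketched in a few lines. This is a toy model under stated assumptions, not Hadoop code: `MiniHdfs` is a hypothetical class, the block size of 4 characters and the replication factor of 2 are chosen only to keep the example small, and replicas are placed round-robin (which needs at least two nodes to be useful).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature of the NameNode/DataNode split; not Hadoop code.
final class MiniHdfs {
    static final int BLOCK_SIZE = 4;    // tiny on purpose; real blocks are many megabytes
    static final int REPLICATION = 2;   // number of copies kept of every block (assumption)

    // "DataNodes": each one maps block id -> block contents.
    private final List<Map<Integer, String>> dataNodes = new ArrayList<>();
    // "NameNode": file name -> ordered block ids, and block id -> nodes holding a copy.
    private final Map<String, List<Integer>> fileToBlocks = new HashMap<>();
    private final Map<Integer, List<Integer>> blockToNodes = new HashMap<>();
    private int nextBlockId = 0;

    MiniHdfs(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) dataNodes.add(new HashMap<>());
    }

    void write(String name, String content) {
        List<Integer> blocks = new ArrayList<>();
        for (int pos = 0; pos < content.length(); pos += BLOCK_SIZE) {
            String chunk = content.substring(pos, Math.min(content.length(), pos + BLOCK_SIZE));
            int blockId = nextBlockId++;
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                int node = (blockId + r) % dataNodes.size();   // simple round-robin placement
                dataNodes.get(node).put(blockId, chunk);
                holders.add(node);
            }
            blockToNodes.put(blockId, holders);
            blocks.add(blockId);
        }
        fileToBlocks.put(name, blocks);
    }

    // Simulates a hardware failure: the node loses all of its blocks.
    void failNode(int node) { dataNodes.get(node).clear(); }

    String read(String name) {
        StringBuilder out = new StringBuilder();
        for (int blockId : fileToBlocks.get(name)) {
            String chunk = null;
            for (int node : blockToNodes.get(blockId)) {
                chunk = dataNodes.get(node).get(blockId);
                if (chunk != null) break;   // first surviving replica wins
            }
            if (chunk == null) throw new IllegalStateException("all replicas lost: block " + blockId);
            out.append(chunk);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        MiniHdfs fs = new MiniHdfs(3);
        fs.write("neu.txt", "hello hadoop");
        fs.failNode(1);                           // one machine dies
        System.out.println(fs.read("neu.txt"));   // still readable from the other replicas
    }
}
```

The metadata maps play the NameNode's role and never hold file contents; the per-node maps play the DataNodes' role and know nothing about file names.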

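8 HDFS Example

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.IOUtils;

    // Copy a file from the local file system into HDFS.
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    FileSystem local = FileSystem.getLocal(conf);
    Path infile = new Path("/home/user/alt.txt");
    Path outfile = new Path("/home/user/neu.txt");
    FSDataInputStream in = local.open(infile);
    FSDataOutputStream out = hdfs.create(outfile);
    IOUtils.copyBytes(in, out, conf);  // copies the stream contents, then closes both streams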



10 Hadoop MapReduce

- Implementation of the MapReduce model
- Parallelization
- Scalability
- Large volumes of data
- Fault tolerance when hardware fails

11 MapReduce Architecture

- JobTracker
  - Accepts the job
  - Starts tasks on the TaskTrackers
  - Monitors the TaskTrackers
- TaskTracker
  - Performs the actual computation

12 MapReduce & HDFS

- The JobTracker communicates with the NameNode
  - It determines where the data to be processed is stored
- It looks for a free TaskTracker close to the DataNode
  - Ideally, DataNode and TaskTracker run on the same machine
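The placement rule described above (prefer a free TaskTracker on a machine that already stores the data) can be illustrated with a small selection routine. This is a hedged sketch, not Hadoop's scheduler: `LocalityScheduler`, `Tracker`, and `pick` are hypothetical names, and "close" is reduced to "same host", ignoring any notion of racks.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical miniature of locality-aware task placement; not Hadoop's scheduler.
final class LocalityScheduler {
    // A worker: the host it runs on and whether it can accept a task right now.
    static final class Tracker {
        final String host;
        final boolean free;
        Tracker(String host, boolean free) { this.host = host; this.free = free; }
    }

    // Prefer a free tracker on a host that stores the data; otherwise fall back to any free one.
    static Optional<Tracker> pick(List<Tracker> trackers, List<String> hostsWithData) {
        Tracker fallback = null;
        for (Tracker t : trackers) {
            if (!t.free) continue;
            if (hostsWithData.contains(t.host)) return Optional.of(t);   // data-local worker
            if (fallback == null) fallback = t;                          // first free remote worker
        }
        return Optional.ofNullable(fallback);
    }

    public static void main(String[] args) {
        List<Tracker> trackers = Arrays.asList(
                new Tracker("node1", false),
                new Tracker("node2", true),
                new Tracker("node3", true));
        // The block is stored on node1 and node3; node1 is busy, so node3 is chosen.
        System.out.println(pick(trackers, Arrays.asList("node1", "node3")).get().host);
    }
}
```

Moving the computation to the data in this way avoids shipping large blocks over the network, which is the reason DataNode and TaskTracker usually share a machine.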


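13 MapReduce Example

    // Word count, map step (a method of a Mapper<LongWritable, Text, Text, IntWritable>):
    // emits (word, 1) for every token of the input line.
    void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            Text word = new Text(tokenizer.nextToken());
            context.write(word, new IntWritable(1));
        }
    }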


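14 MapReduce Example

    // Word count, reduce step (a method of a Reducer<Text, IntWritable, Text, IntWritable>):
    // sums all counts that were emitted for one word.
    void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }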
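The map and reduce methods only make sense together with the step in between, in which the framework groups all emitted values by key. The following single-process sketch imitates that data flow with plain Java collections so it can be run without a cluster. `LocalWordCount` is a hypothetical class, and plain `String` and `Integer` stand in for Hadoop's `Text` and `IntWritable`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the MapReduce data flow; not Hadoop code.
final class LocalWordCount {
    // Map step: emit (word, 1) for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce step: sum all counts that were emitted for one word.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle step: group the mappers' output by key, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // One reduce call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(Arrays.asList("to be or not to be", "to do")));
    }
}
```

In a real job the map calls run in parallel on different machines and the grouping happens across the network; the per-key contract seen by `reduce` is the same.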

15 Programming MapReduce

- Java
- C/C++ (via the Java Native Interface)
- Hadoop Streaming
  - Python
  - Bash
  - Other shell scripts
- Chaining MapReduce jobs
  - Solves more complex problems
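Chaining means that the output of one job becomes the input of the next. As a rough illustration under simplifying assumptions, the sketch below runs two steps in one process: the first counts words, the second groups the words by their count. `ChainedJobs`, `countWords`, and `groupByCount` are hypothetical names; on a cluster each step would be a separate MapReduce job that reads the previous job's output files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-process illustration of job chaining; not Hadoop code.
final class ChainedJobs {
    // Job 1: word -> number of occurrences.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Job 2: consumes job 1's output and groups the words by their count.
    static Map<Integer, Set<String>> groupByCount(Map<String, Integer> counts) {
        Map<Integer, Set<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "to do");
        // Chaining: the result of the first job is fed directly into the second.
        System.out.println(groupByCount(countWords(input)));
    }
}
```

Neither step alone answers the question "which words occur equally often?"; the combination does, which is the sense in which chaining handles more complex problems.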


16 Hadoop Components

[Diagram of the components: HBase, Hive, MapReduce, HDFS, Common]

17 HBase

- Inspired by Google's BigTable
- Designed for large volumes of data
  - Several billion rows
  - Several million columns
- Not a relational database
  - No joins, transactions, or updates (appends instead)
  - Only a primary key, no further indexes


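18 HBase Example

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // Write one cell: row "zeilen_id", column "spalten_familie:spalte", value "wert".
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "tabelle");
    Put p = new Put(Bytes.toBytes("zeilen_id"));
    p.add(Bytes.toBytes("spalten_familie"), Bytes.toBytes("spalte"), Bytes.toBytes("wert"));
    table.put(p);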
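The "appends instead of updates" model can be pictured as a sorted map of rows in which every cell keeps several timestamped versions. The sketch below is a toy model, not the HBase API: `MiniTable` and its methods are hypothetical, values are plain strings, and timestamps are passed in by the caller to keep the example deterministic.

```java
import java.util.TreeMap;

// Hypothetical miniature of a BigTable-style table; not the HBase API.
final class MiniTable {
    // row key -> "family:qualifier" -> timestamp -> value, each level kept sorted
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    // Stores the value as a version under the given timestamp; older versions stay in place.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Looks the row up by its key (the only index) and returns the newest version, or null.
    String get(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Number of versions currently stored for one cell.
    int versionCount(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) return 0;
        TreeMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? 0 : versions.size();
    }

    public static void main(String[] args) {
        MiniTable table = new MiniTable();
        table.put("zeilen_id", "spalten_familie", "spalte", 1L, "wert");
        table.put("zeilen_id", "spalten_familie", "spalte", 2L, "neuer_wert");
        System.out.println(table.get("zeilen_id", "spalten_familie", "spalte"));
        System.out.println(table.versionCount("zeilen_id", "spalten_familie", "spalte"));
    }
}
```

Because lookups go through the row key only, any other access path (for example "all rows with a given value") would need a full scan or a separately maintained table.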



20 Hive

- Abstracts away HDFS and MapReduce
- Developed by Facebook because programming MapReduce applications directly was too laborious
- Supports a subset of the SQL-92 standard plus its own extensions
- Use case: data analysis
- No updates or transactions
- Only a primary key, no further indexes (but these are planned)


21 Hive Example

    CREATE TABLE cite (citing INT, cited INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'cite75_99.txt'
    OVERWRITE INTO TABLE cite;

    SELECT * FROM cite LIMIT 10;

22 Conclusion

- Hadoop scales extremely well and can manage huge volumes of data
- It is mostly used for data analysis
- It is not a direct replacement for an RDBMS
- The right tool for the right job!
