DduP - Towards a Deduplication Framework utilising Apache Spark

Viktor Willi Buchholz

1 DduP - Towards a Deduplication Framework utilising Apache Spark
Universität Hamburg, Department of Informatics

2 Outline
1. Duplicate detection
2. Apache Spark
3. DduP - Interactive Big Data Deduplication Framework

3 Section 1: Duplicate detection

4 Duplicate detection - Example
Corpus: Set[Tuple]
Herbert Maier; Hebert Maier; Jan Müller; Joachim Gause; Herbert Mayer; Peter Wüst; Joachim Gausse

5 Duplicate detection - Example
Result: Set[Entity], where Entity: Set[Tuple]
Herbert Maier; Hebert Maier; Jan Müller; Joachim Gause; Herbert Mayer; Peter Wüst; Joachim Gausse
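To make the example concrete, here is a minimal sketch in plain Scala (no Spark) that compares every pair of names from the corpus. It assumes Levenshtein edit distance as the similarity measure and a hypothetical threshold `maxDist`. The slides do not say which measure or threshold DduP uses, so both are illustrative choices.

```scala
object NaiveDedup {
  /** Levenshtein edit distance via the classic dynamic-programming recurrence. */
  def levenshtein(a: String, b: String): Int = {
    // prev(j) = distance between the previous prefix of a and the first j chars of b
    var prev = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  /** All unordered pairs whose edit distance is at most maxDist (hypothetical threshold). */
  def duplicatePairs(corpus: Seq[String], maxDist: Int): Seq[(String, String)] =
    for {
      (x, i) <- corpus.zipWithIndex
      y <- corpus.drop(i + 1)
      if levenshtein(x, y) <= maxDist
    } yield (x, y)
}
```

With the seven names from the slide and `maxDist = 1`, this performs 21 comparisons and reports three pairs: Herbert Maier / Hebert Maier, Herbert Maier / Herbert Mayer, and Joachim Gause / Joachim Gausse. Hebert Maier and Herbert Mayer are two edits apart, so they only end up in the same entity once pairs are merged transitively.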

6 Motivation
Problems
- Quadratic complexity
- Large data volumes cannot be handled on a standard machine
- Long run times on a standard machine
Hadoop MapReduce
- Dedoop project at Leipzig University
- Duplicate detection has so far been batch processing
- No relevant open-source projects available
Origin
- The project was developed as part of my master's thesis: DduP - Interactive Big Data Deduplication Framework
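The quadratic complexity can be quantified with a short calculation. The sketch below counts the unordered pairs a naive comparison has to inspect. For contrast, it also counts the pairs left when the data splits into equally sized blocks, which is an idealised assumption made only for this illustration.

```scala
object PairCount {
  /** Unordered pairs among n tuples: n * (n - 1) / 2. */
  def allPairs(n: Long): Long = n * (n - 1) / 2

  /** Pairs left if the data splits into equally sized blocks (idealised assumption). */
  def blockedPairs(n: Long, blocks: Long): Long = blocks * allPairs(n / blocks)
}
```

For one million tuples, `PairCount.allPairs(1000000L)` gives 499,999,500,000 comparisons, while `PairCount.blockedPairs(1000000L, 1000L)` gives 499,500,000.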


7 Section 2: Apache Spark

8 Apache Hadoop MapReduce - Word count example
[Figure: the input words Lion, Goose, Goose, Lion, Mouse, Goose pass through the stages Partition, Map, Shuffle, Reduce and Merge. Map emits pairs such as (Lion, 1) and (Goose, 1); the merged result is (Lion, 2), (Goose, 3), (Mouse, 1). Data is loaded from disk at the start and written to disk at the end.]
- Disk I/O overhead
- Restricted to the MapReduce API
- Batch processing
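The stages of the word count example can be imitated with plain Scala collections. This is only an in-memory sketch of the data flow. It is not Hadoop code, and it deliberately leaves out the disk reads and writes between stages that the slide names as overhead.

```scala
object MapReduceSketch {
  def wordCount(partitions: Seq[String]): Map[String, Int] = {
    // Map: emit a (word, 1) pair for every word in every partition
    val mapped: Seq[(String, Int)] =
      partitions.flatMap(_.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq)
    // Shuffle: bring all pairs with the same key together
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }
    // Reduce: sum the counts per key; the resulting Map plays the role of the merged output
    shuffled.map { case (word, ones) => word -> ones.sum }
  }
}
```

Calling `MapReduceSketch.wordCount(Seq("Lion Goose", "Goose Lion", "Mouse Goose"))` yields the counts from the slide: Lion 2, Goose 3, Mouse 1. The split into three partitions is an assumption, since the exact partitioning on the slide is not recoverable.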


9 Apache Spark - Overview
- Cluster computing framework
- Widely regarded as the successor to Hadoop MapReduce
- Relatively young (research project since 2009; version 1.0 in May 2014)
- Deployment options: Standalone, Hadoop (YARN), Mesos, Amazon EC2
- Possible data sources: HDFS, HBase, Cassandra
- Usable via API (Scala, Java, Python) or the Spark shell


10 Apache Spark - Cluster layout
Spark principle: ship the code, not the data.
- Master: Spark Driver, Spark Master, HDFS NameNode
- Workers 1 to 4: Apache Spark Executors, HDFS DataNodes
- Arrows labelled "task" and "result" connect the master and the workers


11 RDD - Resilient Distributed Dataset
RDD: the core concept of Spark
- Resilient: fault-tolerant
- Distributed: partitioned (A1, ..., A4)
- Dataset: generic collection
- Read-only
- Lazy: transformations, actions
- Various persistence strategies (in-memory, on-disk, persistence priority)
- Various partitioning strategies to optimise data locality
[Figure: RDD A with partitions A1, A2, A3, A4]
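The split into lazy transformations and eager actions can be imitated with a Scala collection view. This is an analogy in plain Scala, not the RDD API: `map` and `filter` on a view only describe the computation, and `toList` plays the role of an action that forces it.

```scala
object LazySketch {
  def run(): (Int, List[Int]) = {
    var evaluated = 0
    // "Transformations": only describe the computation, nothing runs yet
    val pipeline = (1 to 10).view
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)
    val evaluatedBeforeAction = evaluated
    // "Action": forces evaluation of the whole chain
    val result = pipeline.toList
    (evaluatedBeforeAction, result)
  }
}
```

`LazySketch.run()` returns `(0, List(6, 12, 18))`: no element has been mapped before the action runs.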


12 RDD Lineage
Lineage: directed acyclic graph
[Figure: RDDs A, B, C, D and E with their partitions, linked by the operations filter, groupByKey, join and count]

13 Spark Shell - Interactive Big Data Mining
- Scala shell
- SparkContext

14 Spark word count example

    package de.unihamburg.vsis.ddup.exe.testing

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object WordCount extends App {
      // Run locally; the application name shows up in the Spark UI
      private val conf = new SparkConf().setAppName("WordCount").setMaster("local")
      val sc = new SparkContext(conf)
      // hdfs://master:9000/test.txt
      val file = sc.textFile("/lorem ipsum.txt")
      // Split each line into words, pair each word with 1, then sum per word
      val words = file.flatMap(line => line.split(" "))
      val wordnumbers = words.map(word => (word, 1))
      val counts = wordnumbers.reduceByKey(_ + _)
      counts.foreach(println(_))
    }


15 Section 3: DduP - Interactive Big Data Deduplication Framework

16 Duplicate detection - Process overview
1. Parsing / Preprocessing (input: tuple source <<File>>) -> RDD[Tuple]
2. Gold Standard (input: gold standard <<File>>) -> RDD[Tuple]
3. Blocking -> RDD[SymPair[Tuple]]
4. Similarity -> RDD[(SymPair[Tuple], Array[Double])]
5. Decision Model -> RDD[SymPair[Tuple]]
6. Clustering -> RDD[Set[Tuple]] (duplicate clusters)
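The blocking step is what keeps the number of candidate pairs manageable. The sketch below shows standard blocking on plain Scala collections, under assumptions: the helper `candidatePairs` and the blocking key are hypothetical and are not taken from DduP, which operates on RDDs and SymPair values.

```scala
object BlockingSketch {
  /** Standard blocking: only records that share a blocking key become candidate pairs. */
  def candidatePairs[T](records: Seq[T])(key: T => String): Seq[(T, T)] =
    records.groupBy(key).values.toSeq.flatMap { block =>
      // all unordered pairs inside one block (assumes the records in a block are distinct)
      block.combinations(2).map { case Seq(a, b) => (a, b) }.toSeq
    }
}
```

With the seven names from the example corpus and the first two characters as the blocking key (`_.take(2)`), only 4 candidate pairs remain instead of 21. The price is that duplicates whose keys differ are never compared.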


17 Goal
Design of a library or toolkit for interactive duplicate detection in large data volumes.
Requirements
- Modular
- Extensible
- Performant
- In-memory
- Scalable
- Flexible
- Usable interactively (e.g. in a shell)
- Batch-capable (configurable pipelines)
- Usable for exploration
- Rapid prototyping of pipeline configurations


18 Why Apache Spark?
- In-memory processing, even though deduplication is not an iterative process
- Many process steps
- Hadoop trade-off: modularity <-> efficiency (disk I/O)
- Repeated use of intermediate results (e.g. multi-blocking)
- Interactive shell
Alternatives
- Apache Pig: no interactive shell


19 PipE - Linear Pipelining Framework
Properties
- Pipes and Filters pattern
- Type-safe
- Modular
- Linear
- PipeContext (for non-linear pipelines)
Example: List[String] -> PipeParser -> List[Tuple] -> PipeBlocker -> List[(Tuple, Tuple)]



22 PipE - Linear Pipelining Framework
Scala code example

    val input = sc.textFile("testfile.csv")
    // Chain a parser and a blocker into one typed pipeline
    val pipe = PipeParser() append PipeBlocker(param1, param2)
    val pc = new DdupPipeContext()
    val result = pipe.run(pc, input)
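How a type-safe `append` might work can be sketched with a small trait. Everything here is hypothetical: `Pipe`, `ParserPipe`, `BlockerPipe` and `Person` are illustrative names rather than PipE's actual classes, plain lists stand in for RDDs, and the PipeContext argument of `run` is left out.

```scala
object PipeSketch {
  /** A typed pipeline stage from I to O (hypothetical; not PipE's actual trait). */
  trait Pipe[I, O] { self =>
    def run(input: I): O

    /** Chains two stages; the compiler rejects stages whose types do not line up. */
    def append[P](next: Pipe[O, P]): Pipe[I, P] = new Pipe[I, P] {
      def run(input: I): P = next.run(self.run(input))
    }
  }

  case class Person(first: String, last: String)

  /** Parses "first, last" lines and silently drops malformed ones. */
  object ParserPipe extends Pipe[List[String], List[Person]] {
    def run(lines: List[String]): List[Person] =
      lines.map(_.split(",").map(_.trim)).collect { case Array(f, l) => Person(f, l) }
  }

  /** Pairs up people who share a first name (an arbitrary blocking key for this sketch). */
  object BlockerPipe extends Pipe[List[Person], List[(Person, Person)]] {
    def run(people: List[Person]): List[(Person, Person)] =
      for {
        (a, i) <- people.zipWithIndex
        b <- people.drop(i + 1)
        if a.first == b.first
      } yield (a, b)
  }

  val pipeline: Pipe[List[String], List[(Person, Person)]] = ParserPipe.append(BlockerPipe)
}
```

Running `PipeSketch.pipeline.run(List("Herbert, Maier", "Herbert, Mayer", "Jan, Müller"))` returns the single candidate pair of the two Herberts. Appending the stages in the wrong order, as in `BlockerPipe.append(ParserPipe)`, does not compile, which is the benefit of encoding the stage types in the pipe.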

23 DduP - Pipe types
Pipeline stages and their output types:
- Parsing / Preprocessing -> RDD[Tuple]
- Gold Standard -> RDD[Tuple]
- Blocking -> RDD[SymPair[Tuple]]
- Similarity -> RDD[(SymPair[Tuple], Array[Double])]
- Decision Model -> RDD[SymPair[Tuple]]
- Clustering
Further pipe types: Analyse, Print, Write, Optimise

24 DduP - Pipe implementations
- Parsing / Preprocessing: CSV parser, regex removal, trim, etc.
- Gold Standard
- Blocking: standard blocking, sorted neighbourhood, suffix array
- Similarity: library StringMetrics
- Decision Model: threshold; machine learning (decision tree, SVM, naive Bayes); library MLlib
- Clustering: transitive closure; library GraphX
- Analyse: recall, precision, counts
- Print: generic, false negatives, false positives
- Write: clustering result
- Optimise: cache / persist
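The threshold decision model and the transitive-closure clustering can be sketched on plain Scala collections. This sketch makes two assumptions. First, the slides do not say how the similarity vector is aggregated, so the mean is used here as one possible choice. Second, the closure is computed with a small union-find structure rather than GraphX, which the slides name as the library actually used.

```scala
object ClusteringSketch {
  /** Threshold decision: keep pairs whose mean similarity reaches the threshold. */
  def decide[T](scored: Seq[((T, T), Array[Double])], threshold: Double): Seq[(T, T)] =
    scored.collect {
      case (pair, sims) if sims.nonEmpty && sims.sum / sims.length >= threshold => pair
    }

  /** Transitive closure via union-find: records linked by any chain of pairs share a cluster.
    * Every record mentioned in `duplicates` must also appear in `records`. */
  def cluster[T](records: Seq[T], duplicates: Seq[(T, T)]): Set[Set[T]] = {
    val parent = scala.collection.mutable.Map[T, T]()
    records.foreach(r => parent(r) = r)

    // Follow parent links until reaching a record that is its own parent
    def find(x: T): T = {
      var root = x
      while (parent(root) != root) root = parent(root)
      root
    }

    // Union: attach the root of a's cluster to the root of b's cluster
    duplicates.foreach { case (a, b) => parent(find(a)) = find(b) }
    records.groupBy(find).values.map(_.toSet).toSet
  }
}
```

Given the pairs Herbert Maier / Hebert Maier and Herbert Maier / Herbert Mayer, `cluster` puts all three names into one set, although Hebert Maier and Herbert Mayer were never matched directly. Records without any match form singleton clusters.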

25 Preliminary conclusion
Pro
- The framework scales (10^6 tuples, 4.5 min, 5-node cluster)
- The framework can be used interactively
- Modular and extensible thanks to pipelines
- Lean development workflow thanks to Apache Spark (testing, execution)
Contra
- The Spark libraries (GraphX, MLlib) make debugging and optimisation harder
- Scala IDE

26 Q & (hopefully) A
