Apriori Algorithm. Final Presentation. Stefan George, Felix Leupold

1 Apriori Algorithm: Final Presentation. Stefan George, Felix Leupold

2 Outline: Apriori recap. Extension: parallelization (parallel programming in Python, parallelization scenarios, implementation, results/benchmarks). Use case (lift, conviction, results).

3 Apriori recap: An algorithm for finding association rules. Key parameters: support and confidence. Apriori observation: if a subset of a set M is small (infrequent, i.e. below the support threshold), then M itself is necessarily small as well. Candidates are generated iteratively in two steps: join and prune. Rules are generated from itemsets with sufficient support: for such an itemset L and a subset A, emit the rule $A \Rightarrow (L \setminus A)$ if $\frac{\mathrm{sup}(L)}{\mathrm{sup}(A)} > \mathit{minconf}$.
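The recap above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' Apriori.py: the function names (`support`, `generate_candidates`, `apriori`, `rules`) and the toy transactions are assumptions made for this example, and the support counting is deliberately naive.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune every
    candidate that has an infrequent (k-1)-subset (Apriori property)."""
    joined = {a | b for a in frequent_prev for b in frequent_prev
              if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, k - 1))}


def apriori(transactions, minsup):
    """Return {frequent itemset: support} for all frequent itemsets."""
    singles = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in singles if support(c, transactions) >= minsup}
    frequent, k = {}, 1
    while level:
        for c in level:
            frequent[c] = support(c, transactions)
        k += 1
        candidates = generate_candidates(level, k)
        level = {c for c in candidates
                 if support(c, transactions) >= minsup}
    return frequent


def rules(frequent, minconf):
    """Emit (A, L - A, confidence) whenever sup(L) / sup(A) > minconf.
    The strict '>' mirrors the slide; '>=' is also common."""
    out = []
    for itemset, sup_l in frequent.items():
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, r)):
                conf = sup_l / frequent[a]  # every subset of a frequent set is frequent
                if conf > minconf:
                    out.append((a, itemset - a, conf))
    return out


# Hypothetical toy data, not taken from the presentation.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, minsup=0.6)
found = rules(frequent, minconf=0.7)
```

With these toy transactions, every single item and every pair reaches the support threshold, while the triple {a, b, c} does not, so the loop stops after the third level.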

4 Motivation for parallelization

5 Parallelization in Python: Multiprocessing vs. multithreading: threads share memory, whereas a context switch between processes carries considerably more overhead. Python's Global Interpreter Lock (GIL) allows only one thread to execute bytecode at a time, so CPU-bound code gains no parallelism from threads. For us, only multiprocessing is an option.
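As a rough illustration of the multiprocessing route, the sketch below spreads a CPU-bound toy workload over worker processes with the standard library's `multiprocessing.Pool`. The workload function `sum_of_squares` and the pool size are assumptions chosen for the example; they do not come from the presentation.

```python
from multiprocessing import Pool


def sum_of_squares(n):
    """CPU-bound toy workload: pure-Python arithmetic that holds the GIL,
    so running it in several threads would not run in parallel."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    workloads = [200_000] * 4  # hypothetical job sizes
    # Each worker is a separate process with its own interpreter and GIL.
    with Pool(processes=4) as pool:
        results = pool.map(sum_of_squares, workloads)
    print(results)
```

The arguments and results are pickled and copied between processes, which is the communication cost the scenario analyses on the following slides try to minimize.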

6-7 Parallelization scenarios (1/3). [Diagrams, labels only: Apriori.py, T, candidates, Transactions, Process 1, Process 2, ..., Process n.]

8 Analysis (1/3): No initialization cost (0). Data transferred in step k: $|D| \cdot |C_k|$ (in the slide's example, $|D|$ is 5 million). Result size: $|C_k|$. Total cost: $\sum_k (|D| + 1)\,|C_k| \approx 2$ billion. The parallelization must therefore be more coarse-grained.

9-11 Parallelization scenarios (2/3). [Diagram sequence, labels only: Apriori.py, Transactions, candidates, Process 1, Process 2, ..., Process n holding the transaction chunks T1, T2, ..., Tn.]

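12 Analysis (2/3): Initialization cost: $|D|$ (5 million in the slide's example). Cost in step k: $n \cdot |C_k|$. Result size: $n \cdot |C_k|$ (the example figures on the slide include 4 and 400). Total cost: $|D| + \sum_k 2n\,|C_k| \approx 5$ million.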

13-15 Parallelization scenarios (3/3). [Diagram sequence, labels only: Apriori.py, Transactions, candidates, Process 1, Process 2, ..., Process n, each holding T.]

16 Analysis (3/3): Initialization cost: $n \cdot |D|$ (example: 4 times 5 million). Cost in step k: $|C_k|$. Result size: $|C_k|$. Total cost: $n\,|D| + \sum_k 2\,|C_k| \approx 20$ million.
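The three cost models can be compared with a few lines of arithmetic. The sketch below encodes the total-cost formulas as read from the analysis slides; the function names are made up for this example, and the inputs (5 million transactions, 4 processes, a single step with 400 candidates) are assumptions pieced together from the example figures that survive on the slides.

```python
def cost_embarrassingly_parallel(d, candidate_counts):
    """Scenario 1: sum over k of (|D| + 1) * |C_k|."""
    return sum((d + 1) * c for c in candidate_counts)


def cost_transaction_chunks(d, n, candidate_counts):
    """Scenario 2: |D| + sum over k of 2 * n * |C_k|."""
    return d + sum(2 * n * c for c in candidate_counts)


def cost_candidate_chunks(d, n, candidate_counts):
    """Scenario 3: n * |D| + sum over k of 2 * |C_k|."""
    return n * d + sum(2 * c for c in candidate_counts)


D = 5_000_000          # number of transactions (slide example)
N_PROCS = 4            # number of processes (slide example)
CANDIDATES = [400]     # assumed candidate counts per step

print(cost_embarrassingly_parallel(D, CANDIDATES))      # 2000000400
print(cost_transaction_chunks(D, N_PROCS, CANDIDATES))  # 5003200
print(cost_candidate_chunks(D, N_PROCS, CANDIDATES))    # 20000800
```

Under these assumptions the results land at roughly 2 billion, 5 million, and 20 million transferred units, which matches the orders of magnitude quoted in the three analyses.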

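17 Theory vs. practice. [Chart: execution time (sec) on 1, 2, and 4 cores for the three scenarios.] Theoretical total cost: embarrassingly parallel: $\sum_k (|D| + 1)\,|C_k|$; transaction chunks: $|D| + \sum_k 2n\,|C_k|$; candidate chunks: $n\,|D| + \sum_k 2\,|C_k|$.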

18 Implementation (1): Map/Reduce. Map(candidates, transactions) -> list(candidate, count). Reduce(candidate, list(counts)) -> list(candidates).
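A minimal sketch of this map/reduce split, assuming the transaction-chunk layout: each worker counts the candidates in its own chunk (map), and the main process sums the partial counts and keeps the candidates that reach a minimum count (reduce). All names (`map_counts`, `reduce_counts`, `split`, `count_support_parallel`) and the toy data are hypothetical; the slide's Reduce works per candidate, whereas this version merges all candidates in one pass for brevity.

```python
from collections import Counter
from multiprocessing import Pool


def map_counts(args):
    """Map: count how often each candidate occurs in one transaction chunk."""
    candidates, chunk = args
    counts = Counter()
    for transaction in chunk:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset
                counts[candidate] += 1
    return counts


def reduce_counts(partials, min_count):
    """Reduce: sum the partial counts and keep the frequent candidates."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {c: n for c, n in total.items() if n >= min_count}


def split(transactions, n):
    """Distribute the transactions round-robin over n chunks."""
    return [transactions[i::n] for i in range(n)]


def count_support_parallel(candidates, transactions, min_count, n_procs=2):
    """Run the map step in n_procs worker processes, then reduce locally."""
    jobs = [(candidates, chunk) for chunk in split(transactions, n_procs)]
    with Pool(processes=n_procs) as pool:
        partials = pool.map(map_counts, jobs)
    return reduce_counts(partials, min_count)


# Hypothetical toy data.
tx = [frozenset("ab"), frozenset("abc"), frozenset("bc"), frozenset("a")]
cands = [frozenset("a"), frozenset("b"), frozenset("ab")]

if __name__ == "__main__":
    print(count_support_parallel(cands, tx, min_count=3))
```

Only the candidates and the per-chunk counts cross process boundaries in each step; the transactions themselves are shipped once per call in this sketch, whereas a longer-lived worker pool could keep its chunk between steps.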

19 Implementation (2). [Diagram, labels only: Apriori.py; Process 1, Process 2, ..., Process n, each holding T.] Updating the transactions would require sending all transactions back. We therefore no longer use AprioriTID but plain Apriori.

20 Conclusion & ideas: In other programming languages, the parallelization could be implemented with threads: less communication overhead and lower context-switch costs, but possibly waiting times due to locking. Parallelization on Hadoop: high data parallelism. CUDA/OpenCL.

21 Datasets. [Charts: number of items, number of transactions, and number of items per transaction for the Facebook, Netflix, Health, and Twitter datasets.]

22 Benchmark (Netflix): AVG(|T|) = 52.6 [values for D and N missing]. Hardware: Core i quad-core, 16 GB RAM. [Chart: time (s) and speedup (%) versus number of cores.]

23 Benchmark (Twitter): AVG(|T|) = 2.4 [values for D and N missing]. Hardware: Core i quad-core, 16 GB RAM. [Chart: time (s) and speedup (%) versus number of cores.]

24 Finding interesting rules: Pirates of the Caribbean -> LoR II. [Chart: number of ratings per film; x-axis: film ID, y-axis: number of ratings, 0 to 120,000.]

25 Lift: $\mathrm{lift}(X \to Y) = \frac{\mathrm{sup}(X \to Y)}{\mathrm{sup}(X) \cdot \mathrm{sup}(Y)}$. Example: $\frac{5.78\%}{7.2\% \cdot 18\%} = 4.46$. Computed values: 3.7 to 7.5.

26 Conviction: $\mathrm{conv}(X \to Y) = \frac{1 - \mathrm{sup}(Y)}{1 - \mathrm{conf}(X \to Y)}$. Computed values: 3,
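Both measures are simple ratios and can be checked numerically. The sketch below uses the standard definitions of lift and conviction; the function names are chosen for this example, and only the lift inputs (5.78%, 7.2%, 18%) come from the slides. The conviction call reuses those same three values as an assumption, since the slide's own conviction example did not survive.

```python
def lift(sup_xy, sup_x, sup_y):
    """lift(X -> Y) = sup(X -> Y) / (sup(X) * sup(Y))."""
    return sup_xy / (sup_x * sup_y)


def conviction(sup_y, conf_xy):
    """conv(X -> Y) = (1 - sup(Y)) / (1 - conf(X -> Y)).
    A rule that always holds (confidence 1) has infinite conviction."""
    if conf_xy >= 1.0:
        return float("inf")
    return (1.0 - sup_y) / (1.0 - conf_xy)


# Lift example from the slide.
print(round(lift(0.0578, 0.072, 0.18), 2))  # 4.46

# Conviction for the same rule, with conf(X -> Y) = sup(X -> Y) / sup(X).
confidence = 0.0578 / 0.072
print(round(conviction(0.18, confidence), 2))
```

A lift above 1 means X and Y occur together more often than independence would predict; conviction grows without bound as the confidence approaches 1.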

27 Results: Support: 5.3%, confidence: 96.5%, lift: 7.5, conviction: 25.3. With support 5% and confidence 75%: 746 rules.

28 Influence of lift and conviction. [Chart: rule position according to lift, conviction, and confidence; x-axis: rule ID.]

29 Comparison with IMDb: Sample (N=30): 21/30 = 70% agree with IMDb. Categories on the slide: sequel, same genre.

30 References
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data.
P. Becuzzi, M. Coppola, and M. Vanneschi. Mining of Association Rules in Very Large Databases: A Structured Parallel Approach.
Ali Tarhini. Parallel Apriori algorithm for frequent pattern mining. [As of 3 July 2012]
Anuradha.T, Satya Pasad R, S.N.Tirumalarao. Parallelizing Apriori on Dual Core using OpenMP. International Journal of Computer Applications, Volume 43.
R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
