2. Parallel Random Access Machine (PRAM)
Massively Parallel Models and Architectures
R. Hoffmann, FB Informatik, TU Darmstadt, WS 10/11

Contents
- Metrics (review)
- Optimal algorithms
- Classification and types
- Algorithms: Initialize, Broadcasting, All-to-All Broadcasting, Reduction/Sum, Pointer Jumping, Naive Sorting, Prefix Sum
- Accelerated Cascading
- Simulations between PRAM models
2-1
Sources
- J. JaJa: An Introduction to Parallel Algorithms, 1992
- Robert van Engelen (slides): The PRAM Model and Algorithms, Advanced Topics, Spring 2008
- Behrooz Parhami: Introduction to Parallel Processing: Algorithms and Architectures, Plenum Series, Kluwer 2002 (slides)
- Arvind Krishnamurthy (slides): PRAM Algorithms, Fall 2004
- Keller, Kessler, Träff: Practical PRAM Programming. Wiley Interscience, New York, 2000
2-2
MMA - WS 10/11, R. Hoffmann, Rechnerarchitektur, TU Darmstadt

Metrics (Review)
p                          provided processors
q(t)                       degree of parallelism: processors in use at time t; plotted over all t: parallelism profile
P(n)                       processors required by a given algorithm as a function of the problem size n (see PRAM)
T(p)                       total number of time steps
S(p) = T(p=1)/T(p)         speedup
E = S/p = T(1)/Cost(p)     efficiency
Work(p) = Σ_t q(t)         work performed with p provided processors
Cost(p) = p * T(p)         cost
I(p) = Work(p)/T(p)        parallelism index, mean degree of parallelism
Utilization = Work/Cost    mean utilization of the processors
R(p) = Work(p)/Work(1)     overhead due to parallel execution with p processors
2-3
PRAM: Cost, Work
Notation also used in connection with the PRAM algorithms:
Cost(n, P(n)) = P(n) * T(n)
Work(n, P(n)) = work performed (processors/operations x time) as a function of the problem size n and the number P(n) of processors required by the PRAM algorithm
2-4
Optimal Algorithms
Sequential algorithm
- Time-optimal: a sequential algorithm whose (sequential) time complexity O(T_seq(n)) cannot be improved is called time-optimal.
Parallel algorithm
- (Work-)optimal: the work complexity lies within the time complexity of the best sequential algorithm, i.e. the number of operations of the parallel algorithm is asymptotically equal to the number of operations of the sequential algorithm: Work(n) = O(T_seq(n)). [An even stricter definition: the total number of operations in the parallel algorithm is equal to the number of operations in the sequential algorithm.]
- Work-time-optimal (WT-optimal, strictly optimal): the number of parallel steps T_par(n) cannot be reduced by any other work-optimal algorithm. (There is no faster work-optimal one.)
- Cost-optimal: Cost(n) = O(T_seq(n))
2-5
Parallel Random Access Machine (PRAM) Model
- PRAM abstracts away the algorithmic details of synchronization and communication, allowing the algorithm designer to focus on the properties of the problem
- A PRAM algorithm describes explicitly the operations performed at each time unit
- PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models
2-6
The PRAM Model of Parallel Computation
- Earliest and best-known model of parallel computation
- Natural extension of the RAM 1): each processor is a RAM
- Shared memory with m locations
- P processors P_1, P_2, ..., P_p, each with its own private memory
- All processors operate synchronously, executing loads, stores, and operations on data
1) Random Access Machine: an instruction may access any memory cell (not to be confused with RAM = Random Access Memory)
2-7
Synchronous PRAM
- Synchronous PRAM is an SIMD-style model
- All processors execute the same program; the model is therefore also called SPMD (Single Program Multiple Data)
- All processors execute the same PRAM step instruction stream in lock-step
- The effect of an operation depends on the individual or common data that processor(i) accesses, on the processor index i, and on local data
- Instructions can be selectively disabled (if-then-else flow)
Asynchronous PRAM
- Several competing models
- No lock-step
2-8
PRAM Instruction Set
A typical PRAM instruction set looks like this:
(1) Constants: L[x] := constant, L[x] := input size, L[x] := processor number
(2) Write: G[L[x]] := L[y]
(3) Read: L[x] := G[L[y]]
(4) Local assignment: L[x] := L[y]
(5) Conditional assignment: if L[x] > 0 then assignment
(6) Operations: L[x] := L[y] op L[z]
(7) Jumps: goto label y; if L[x] > 0 goto label y
(8) Miscellaneous: if L[x] > 0 then Halt; if L[y] > 0 then NoOperation
2-9
Classification of PRAM Models
A PRAM step ("clock cycle") consists of three phases:
1. Read: each processor may read a value from shared memory
2. Compute: each processor may perform operations on local data
3. Write: each processor may write a value to shared memory
The model is refined by its concurrent read/write capability:
- Exclusive Read Exclusive Write (EREW)
- Concurrent Read Exclusive Write (CREW)
- Concurrent Read Concurrent Write (CRCW)
- Concurrent Read Owner Write (CROW)
2-10
Types of CRCW PRAM
- Common: all processors must write the same value
- Arbitrary (Random): one of the processors succeeds in writing
- Priority: the processor with the lowest index succeeds
- Undefined: the value written is undefined
- Ignore: no new value is written in case of conflict
- Detecting: a special collision-detection code is written
- Max/Min: the largest/smallest of the values is written
- Reduction: the SUM, AND, OR, XOR, or another function of the values is written
2-11
Comparison of PRAM Models
A model B is more powerful than a model A if either
- the time complexity for solving a problem is asymptotically less in model B than in A, or
- the time complexity is the same and the work complexity is asymptotically less in model B than in A
From weakest to strongest: EREW, CREW, Common CRCW, Arbitrary CRCW, Priority CRCW
2-12
Initialize
Initializing an n-vector M (base address B) to all 0s with P < n processors:

for t = 0 to ceil(n/P) - 1 do       // segments
  parallel [i = 0 .. P-1]           // processors
    if t*P + i < n then M[B + t*P + i] := 0
  endparallel
endfor

Question 1): Work = ..., Cost = ...
2-13
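The segmented initialization above can be simulated sequentially; a minimal sketch, assuming illustrative names of my own (the slides define no concrete API). The outer loop runs over the ceil(n/P) segments, and the inner loop plays the role of one parallel step of P processors, with the guard for the last, possibly partial, segment.

```python
def pram_initialize(n, P):
    M = [None] * n                  # shared memory vector (base address 0)
    steps = -(-n // P)              # ceil(n/P) segments = time steps
    for t in range(steps):          # sequential outer loop over segments
        for i in range(P):          # one parallel step: processors 0..P-1
            if t * P + i < n:       # guard for the last segment
                M[t * P + i] = 0
    return M, steps

M, steps = pram_initialize(10, 4)   # n = 10 cells, P = 4 processors
```

With n = 10 and P = 4 this takes ceil(10/4) = 3 steps, so Cost = 3 * 4 = 12 while Work = 10, which is the point of Question 1.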
Data Broadcasting
Making P copies of B[0]:

for t = 0 to ceil(log2 P) - 1 do
  parallel [i = 0 .. 2^t - 1]        // active
    B[i + 2^t] := B[i]               // proc(i) writes
  endparallel
endfor

for t = 0 to ceil(log2 P) - 1 do
  parallel [i = 2^t .. 2^(t+1) - 1]  // active
    B[i] := B[i - 2^t]               // proc(i) reads
  endparallel
endfor

EREW PRAM data broadcasting without redundant copying. T = ?
If the array bound is not a power of 2, extend the algorithms with a bounds check.
2-14
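The write-based variant can be sketched as follows (function name and loop structure are mine, not from the slides): in step t the active processors 0 .. 2^t - 1 each copy B[i] to B[i + 2^t], doubling the number of copies; the bounds check handles a P that is not a power of 2, as the slide suggests.

```python
def broadcast(B, P):
    t = 0
    while (1 << t) < P:                  # ceil(log2 P) steps
        for i in range(1 << t):          # active processors; all reads/writes exclusive
            if i + (1 << t) < P:         # bounds check for non-powers of 2
                B[i + (1 << t)] = B[i]   # proc(i) writes
        t += 1
    return B

B = [42] + [None] * 11
broadcast(B, 12)                         # all 12 cells now hold 42
```

Since the copies double each step, T = ceil(log2 P) answers the slide's "T = ?".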
Data Broadcasting, Data-Parallel Formulation

parallel [i = 0 .. 2^t - 1]
  B[i + 2^t] := B[i]
endparallel

corresponds to the data-parallel notation (I = index vector)

I := [0 .. 2^t - 1]
B[I + 2^t] := B[I]

which is equivalent to B[2^t .. 2^(t+1) - 1] := B[0 .. 2^t - 1].
(EREW PRAM data broadcasting without redundant copying.)
2-15
All-to-All Broadcasting on the EREW PRAM

parallel [i = 0 .. P-1]
  write own data value into B[i]
endparallel
for t = 1 to P-1 do
  parallel [i = 0 .. P-1]
    read the data value from B[(i + t) mod P]
  endparallel
endfor

In step t, each processor i reads the value of its neighbor i+t.
This O(P)-step algorithm is time-optimal. Question 2): Why?
2-16
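A sketch of the rotation pattern (helper names are illustrative): each processor first publishes its own value, then in step t reads cell (i + t) mod P. Within any step no two processors address the same cell, so every access is exclusive, as EREW requires.

```python
def all_to_all(values):
    P = len(values)
    B = list(values)                          # parallel write of own value into B[i]
    received = [[B[i]] for i in range(P)]     # each processor's local copies
    for t in range(1, P):                     # P-1 read steps
        for i in range(P):                    # one parallel step; addresses are distinct
            received[i].append(B[(i + t) % P])
    return received

r = all_to_all([10, 20, 30, 40])              # every processor ends up with all 4 values
```

After P-1 steps every processor has seen all P values, which is why fewer than P-1 exclusive-read steps cannot suffice.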
Reduction on the EREW PRAM
- Reduces P values on the P-processor EREW PRAM in O(log P) time
- The reduction algorithm uses only exclusive reads and writes
- The algorithm is the basis of other EREW algorithms
[Figure: memory cells, processors, data flow of the reduction tree]
2-17
Sum on the EREW PRAM (1)
Sum of n values using n/2 processors. Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. n]
  B[i] := A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2]
    if i <= n/2^t then B[i] := B[2i-1] + B[2i]
  endparallel
endfor
s := B[1]

Cost = (n/2) log n
Work = n - 1
Utilization = Work/Cost = (n-1)/((n/2) log n) = O(1/log n)
E = T(1)/Cost(p) = (n-1)/((n/2) log n) = O(1/log n)

Question 3) for the example (n = 8): Cost, Work, Utilization?
2-18
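The tree reduction can be sketched like this (function name is mine; the slide's 1-based B[2i-1], B[2i] becomes 0-based B[2i], B[2i+1]): in step t only the first n/2^t processors are active, each adding one adjacent pair.

```python
def erew_sum(A):
    n = len(A)                      # n = 2^k assumed, as on the slide
    B = list(A)                     # parallel copy B[i] := A[i]
    active = n // 2
    while active >= 1:              # log n steps
        for i in range(active):     # active processors of this step
            B[i] = B[2 * i] + B[2 * i + 1]
        active //= 2
    return B[0]                     # s := B[1] in the slide's indexing
```

For the n = 8 example this confirms Question 3: T = 3 steps, p = 4, so Cost = 12 while Work = 7.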
Sum on the EREW PRAM (2)
Here the number of processors is managed dynamically (not standard PRAM!).
Sum of n values using n/2, n/4, ..., 1 processors. Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. n]
  B[i] := A[i]
endparallel
for t = 1 to log n do
  dynparallel [i = 1 .. n/2^t]    // dynamic activity
    B[i] := B[2i-1] + B[2i]
  endparallel
endfor
s := B[1]

Cost = n - 1, Work = n - 1, Utilization = 1
2-19
Pointer Jumping Finding the roots of a forest using pointer-jumping 2-20
Pointer Jumping on the CREW PRAM
Input: a forest of trees, each with a self-loop at its root, consisting of arcs (i, P(i)) and nodes i, where 1 <= i <= n
Output: for each node i, the root S[i] of the tree containing i

parallel [i = 1 .. n]
  S[i] := P[i]
endparallel
for h = 1 to log n do   // while there exists a node such that S[i] != S[S[i]] do
  parallel [i = 1 .. n]
    if S[i] != S[S[i]] then S[i] := S[S[i]]
  endparallel
endfor

T(n) = O(log h), where h is the maximum height of the trees
Cost(n) = O(n log h)
2-21
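A runnable sketch (names are mine): P maps each node to its parent, with roots pointing to themselves; repeatedly setting S[i] := S[S[i]] halves each node's distance to its root, so O(log h) synchronous rounds suffice. A snapshot copy emulates the lock-step read phase.

```python
def find_roots(P):
    n = len(P)
    S = list(P)                      # parallel copy S[i] := P[i]
    changed = True
    while changed:                   # "while there exists S[i] != S[S[i]]"
        changed = False
        S_new = list(S)              # snapshot: one synchronous PRAM step
        for i in range(n):
            if S[i] != S[S[i]]:
                S_new[i] = S[S[i]]   # pointer jump
                changed = True
        S = S_new
    return S

# Forest: 0 -> 1 -> 2 (root with self-loop), 3 (root), 4 -> 3
roots = find_roots([1, 2, 2, 3, 3])
```

The concurrent reads of S[S[i]] (several nodes may share a parent) are why this runs on a CREW and not an EREW PRAM.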
Example
[Figure: a forest over the nodes 1..8; in each pointer-jumping step the pointers move closer to the roots]
2-22
Naive EREW PRAM Sorting Algorithm (all read from all)

parallel [i = 0 .. P-1]
  R[i] := 0
endparallel
for t = 1 to P-1 do
  parallel [i = 0 .. P-1]
    neighbor := (i + t) mod P
    if S[neighbor] < S[i]
       or (S[neighbor] = S[i] and neighbor < i)
    then R[i] := R[i] + 1     // increment for every smaller element
  endparallel
endfor
// R[i] finally gives the number of smaller elements,
// which corresponds to the target position.
parallel [i = 0 .. P-1]
  S[R[i]] := S[i]
endparallel

P = n values in S, target position in R.
Exercise: work through an example.
This O(P)-step sorting algorithm is far from optimal; sorting is possible in O(log P) time.
2-23
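The rank-sort idea above can be sketched as follows (function name is mine): R[i] counts the elements smaller than S[i], with ties broken by index so that every rank is hit exactly once, and the final step writes each value to its rank.

```python
def rank_sort(S):
    P = len(S)
    R = [0] * P
    for t in range(1, P):                     # P-1 comparison steps
        for i in range(P):                    # one parallel step
            nb = (i + t) % P
            if S[nb] < S[i] or (S[nb] == S[i] and nb < i):
                R[i] += 1                     # one more smaller element
    out = [None] * P                          # fresh array stands in for the
    for i in range(P):                        # final parallel permutation step
        out[R[i]] = S[i]
    return out
```

The tie-break (neighbor < i) makes R a permutation even with duplicate keys, so no write conflicts occur in the last step.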
Prefix Sum
Applications: deleting marked elements from an array (stream compaction), radix sort, solving recurrence equations, solving tridiagonal linear systems, quicksort

s_i = Σ_{k=1..i} x_k, i.e. s_1 = x_1, s_i = s_{i-1} + x_i

Example:
i = 1  2  3  4  5  6  7  8
x = 7  6  5  4  3  2  1  0
s = 7 13 18 22 25 27 28 28
2-24
Horn's Algorithm (CREW)

for t = 1 to log2 n do
  parallel [i = 2 .. n]
    if i > 2^(t-1) then x[i] := x[i - 2^(t-1)] + x[i]
  endparallel
endfor

Problem: multiple read accesses to the same variable can lead to access delays (read congestion at a memory module)
Work = 17 (for n = 8)
2-25
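A sketch of Horn's scan (function name is mine; indices are 0-based here): in step t every position with a neighbor 2^(t-1) places to its left adds that neighbor's value; a snapshot of x emulates the synchronous read phase before the writes.

```python
def horn_prefix_sum(x):
    x = list(x)
    n = len(x)
    d = 1
    while d < n:                   # ceil(log2 n) steps, d = 2^(t-1)
        old = list(x)              # snapshot: synchronous read phase
        for i in range(d, n):      # active positions of this step
            x[i] = old[i - d] + old[i]
        d *= 2
    return x

s = horn_prefix_sum([7, 6, 5, 4, 3, 2, 1, 0])   # the slide's example
```

Counting the active positions for n = 8 gives 7 + 6 + 4 = 17 additions, matching the Work = 17 stated above.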
Prefix Sum on the EREW PRAM
The processors and the memory cells M[i] hold the x-values; each P_i has a local register y_i. Complexity as for the EREW reduction.

parallel [i = 0 .. n-1]
  y[i] := M[i]
endparallel
for t = 0 to log n - 1 do
  dynparallel [i = 2^t .. n-1]
    y[i] := y[i] + M[i - 2^t]
    M[i] := y[i]
  endparallel
endfor

Differences to the previous algorithm:
- no multiple read accesses to global variables
- dynamic parallelism
2-26
Prefix Sum Algorithm with n/2 Processors (CREW)
Read and write accesses (n = 8; er/ew = exclusive read/write, cr = concurrent read):
step 1: 8 er1 + 4 ew1
step 2: 2 cr2 + 4 er1 + 4 ew1
step 3: 1 cr4 + 4 er1 + 4 ew1

log n steps, n/2 processors = const.
Work(p = n/2) = (n/2) * log n
Cost = Work
Overhead of the parallel execution:
R(p = n/2) = Work(p)/T(1) = ((n/2) log n)/n = (1/2) log n = O(log n)

Question 4): overhead for this example
2-27
Accelerated Cascading
Goal: Cost = O(T_seq(n)). Combination of a slow cost-optimal algorithm with a fast non-cost-optimal one.
Example: maximum.
- Use p = n/log n processors instead of n
- Have a local phase followed by the global phase
- Local phase: compute the maximum over log n values using the simple sequential algorithm; time for the local phase = O(log n), in parallel for all p
- Global phase: take the n/log n local maxima and compute the global maximum using the tree reduction algorithm; time for the global phase = O(log(n/log n)) = O(log n - log log n) = O(log n)
2-28
Accelerated Cascading: Computing the Maximum
n elements, processors P_1 .. P_{n/log n}
- local phase: sequential maximum over log n elements each (log n - 1 comparisons)
- global phase: tree reduction over the n/log n local maxima, log(n/log n) steps

T(p) = (log n - 1) + log(n/log n) = 2 log n - log log n - 1
Cost = p T(p) = (n/log n) T(p)
Work = (n/log n)(log n - 1) + (n/log n) - 1 = n - 1
Utilization = Work/Cost
Speedup = (n-1)/T(p)
2-29
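The two phases can be sketched as follows (function name and block handling are mine, assuming n = 2^k): p = n/log n processors each take a block of log n elements; the local phase is a plain sequential maximum per block, the global phase is the tree reduction over the p local maxima.

```python
import math

def cascaded_max(A):
    n = len(A)                                   # n = 2^k assumed
    block = max(1, int(math.log2(n)))            # log n elements per processor
    # local phase: sequential maximum inside each block
    local = [max(A[j:j + block]) for j in range(0, n, block)]
    # global phase: tree reduction over the local maxima
    while len(local) > 1:
        pairs = [local[i:i + 2] for i in range(0, len(local), 2)]
        local = [max(p) for p in pairs]          # one parallel reduction step
    return local[0]
```

Because the local phase does (n/log n)(log n - 1) comparisons and the tree adds n/log n - 1 more, the total work is n - 1, matching the slide.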
Cost = pt(p) = (n/log n) T(p) E= Cost(n)/n= n log n n/log n log(n/log n) D/log n 1-1/log n +E 4 2 2,00 1,00 0,50 1,00 8 3 2,67 1,42 0,47 1,14 16 4 4,00 2,00 0,50 1,25 32 5 6,40 2,68 0,54 1,34 64 6 10,67 3,42 0,57 1,40 128 7 18,29 4,19 0,60 1,46 256 8 32,00 5,00 0,63 1,50 512 9 56,89 5,83 0,65 1,54 1024 10 102,40 6,68 0,67 1,57 2048 11 186,18 7,54 0,69 1,59 1048576 20 52428,80 15,68 0,78 1,73 Cost = O(n) = 2n? für sehr große n 2,0? 2-30
Numerical Example
Example: n = 256, number of processors p = n/log n = 32
- Divide the 256 elements into 32 groups of 8 each
- Local phase: each processor computes the maximum of its 8 local elements
- Global phase: performed among the maxima computed by the 32 processors
Question 5): compute T, Cost, Work, Utilization, Speedup
2-31
Simulations between PRAM Models
- An algorithm designed for a weaker model can be executed with the same time complexity and work complexity on a stronger model
- An algorithm designed for a stronger model can be simulated on a weaker model, either with asymptotically more processors (more work) or with asymptotically more time
2-32
Summary
- PRAM: P(n) processors, global memory, lock-step; EREW, CREW, CRCW, CROW
- Broadcasting: write, read, all-to-all
- Reduction: tree
- Sum: tree, n/2 active processors
- Pointer Jumping
- Sorting
- Prefix Sum: T_seq = n; Horn's algorithm; variants with n/2 active or n/2 processors
- Accelerated Cascading: Cost = O(T_seq)
2-33
Answers
1) Work = n; Cost = ceil(n/P) * P
2) The sequential execution requires just as many copy operations
3) Cost = 4 * 3 = 12; Work = 7; Utilization = 7/12
4) Sequential: 8 operations; parallel: 12 operations; R = 12/8
5) T = 7 + 5 = 12; Cost = 32 * 12 = 384; Work = 255; Utilization = 255/384 ≈ 66%; Speedup = 255/12 ≈ 21.3
2-34