1 The Sun Fire SMP Cluster at RWTH Aachen University: Configuration and Parallel Programming
  Dieter an Mey

2 Contents
  The Sun Fire SMP cluster at RWTH Aachen University
    Current configuration and computer architecture
  Parallelization with OpenMP and MPI, illustrated by an example
    MPI
    OpenMP
    Hybrid
  Summary

3 High-Performance Computing at RWTH in Fast Motion
  Languages: Fortran77, C, Fortran90/95, C++, Java (?)
  Parallelization: vectorization, MPI, MPI-2, OpenMP, autopar, nested OpenMP, hybrid
  Hardware: CDC Cyber 175, CDC Cyber 205 (Bochum), IBM 3090, Fujitsu VP/VPP, HP V (SMP), workstations, Sun Fire (US III/IV), PCs, Opteron
  Operating systems: NOS, VSOS, VM/CMS, VSP, Unix (AIX, HPUX, IRIX, UXPM, Solaris, Linux), Linux (Fedora, RedHat, SuSE, Debian), Solaris, Windows (?)

4 HPC Resources at Aachen University: History
  [Chart: peak performance [GFlop/s] and memory [GByte] over time for Cyber, Cyber (Bochum), IBM 3090 VF (1990), SNI S-600, SNI VPP300, Sun Fire Cluster (June 2001), Sun Fire Cluster (April 2002), Sun Fire + Opteron (Sept 2004)]


6 Sun Fire SMP Cluster at RWTH Aachen University
  Total: 4630 GF peak performance, 3584 GB memory
  Interconnects: Sun Fire Link, Gigabit Ethernet

  Model        SF E25K    SF E6900   SF E6900   SF E2900   SF V40z
  Processors   72 US IV   24 US IV   24 US IV   12 US IV   4 Opteron
  Clock        1050 MHz   1200 MHz   1200 MHz   1200 MHz   2200 MHz
  Peak perf.   ... GF     ... GF     ... GF     57.6 GF    17.6 GF
  Memory       288 GB     96 GB      96 GB      48 GB      8 GB

8 Sun Fire SMP Cluster: Networks

  Network            Nodes                             Latency    Total bandwidth between 2 nodes
  Gigabit Ethernet   all nodes                         ... µs     ~100 MB/s
  Sun Fire Link      4 x SF E25K and SF E6900 nodes    ~4 µs      ~2 GB/s

10 Sun Fire SMP Cluster: Sun Fire E2900
  [Diagram: processor boards with interleaved memory, connected by crossbar / bus]
  SF E2900 nodes:
    3 processor boards
    4 UltraSPARC IV 1200 MHz processors per board
    local memory on board
    memory latency: ... ns

12 Sun Fire SMP Cluster: Sun Fire E6900
  [Diagram: processor boards with interleaved memory, connected by crossbar / bus; nodes coupled by Sun Fire Link]
  SF E6900 nodes:
    6 processor boards
    4 UltraSPARC IV 1200 MHz processors per board
    local memory on board
    memory latency: ... ns

14 Sun Fire SMP Cluster: Sun Fire E25K
  [Diagram: processor boards with interleaved memory, connected by crossbar / bus; nodes coupled by Sun Fire Link]
  SF E25K: 4 nodes
    18 processor boards
    4 UltraSPARC IV 1050 MHz processors per board
    local memory on board
    memory latency: 250 ns local, 500 ns remote

16 Sun Fire SMP Cluster: Sun Fire V40z
  [Diagram: four processors with local memory, connected by coherent HyperTransport links; nodes coupled by Gigabit Ethernet]
  SF V40z: 64 nodes
    4 AMD Opteron 2200 MHz processors
    local memory attached to each processor
    memory latency: ~36 ns local, ... ns remote

17 Uni Processor Board of SF E25K and SF E6900
  [Diagram: processors and interleaved memory on one board, connected by crossbar / bus]

18 UltraSPARC III (before) versus UltraSPARC IV (since Sept 2004)
  Clock: 900 MHz (UltraSPARC III), 1050/1200 MHz (UltraSPARC IV)
  1 Add + 1 Mult per cycle
  Per processor core:
    L2 cache 8 MB
    L1 cache 64 KB, instruction cache 32 KB
    prefetch cache 2 KB, write cache 2 KB
    address cache (TLB)
  Latencies and bandwidths annotated in the diagram: ~... ns / ~1 GB/s and ~24 ns / ~2-4 GB/s

19 Sun Fire V40z
  Local memory: ~36 ns, ~3 GB/s
  Remote memory over coherent HyperTransport links: ~44-57 ns, ~2-3 GB/s

20 Opteron 848 Processor
  2200 MHz, 1 Add + 1 Mult per cycle
  L2 cache 1 MB
  L1 cache 64 KB, instruction cache 64 KB
  address cache (TLB)
  coherent HyperTransport links
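As a consistency check, the peak figures listed for the V40z and E2900 nodes can be reproduced from the clock rate and the two floating-point operations per cycle (1 Add + 1 Mult). The sketch below assumes that peak performance is simply processors times cores times operations per cycle times clock rate, and it treats the Opteron 848 as single core and the UltraSPARC IV as dual core. It is a back-of-the-envelope calculation, not a vendor formula.

```latex
\[
P_{\text{V40z}} = 4 \times 1 \times 2 \times 2.2\ \text{GHz} = 17.6\ \text{GFlop/s}
\]
\[
P_{\text{E2900}} = 12 \times 2 \times 2 \times 1.2\ \text{GHz} = 57.6\ \text{GFlop/s}
\]
```

Both results agree with the values given in the configuration table.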

22 Multiprocessor System with Distributed Memory
  Parallelization by message passing with MPI
  [Diagram: MPI processes, each on its own processor with its own memory, connected by an external network]

23 Multiprocessor System with Shared Memory: Uniform Memory Access (UMA)
  Message passing is also possible on shared-memory machines
  [Diagram: MPI processes on processors that share interleaved memory through a crossbar / bus]

24 Cluster of Shared-Memory Multiprocessor Systems (SMP Cluster)
  Message passing therefore also works on SMP clusters
  [Diagram: two shared-memory nodes, each running several MPI processes, connected by an external network]

25 Multiprocessor System with Shared Memory: Uniform Memory Access (UMA)
  On shared-memory machines OpenMP can be used as an alternative, and loops can also be parallelized automatically
  [Diagram: one OpenMP process whose threads run on processors that share interleaved memory through a crossbar / bus]

26 Cluster of Shared-Memory Multiprocessor Systems (SMP Cluster)
  Message passing can be combined with OpenMP (and automatic parallelization)
  [Diagram: two shared-memory nodes connected by an external network]

27 Sun Fire SMP Cluster: Operating Systems
  Interconnects: Sun Fire Link, Gigabit Ethernet

  Model        SF E25K    SF E6900   SF E6900   SF E2900   SF V40z
  Processors   72 US IV   24 US IV   24 US IV   12 US IV   4 Opteron

  Operating systems: Solaris 9, Solaris 10, Fedora Linux, Windows 2003 (in preparation)

29 Compilers / Libraries, Debugging, Analysis / Tuning
  Platforms: 1 = SPARC-Solaris, 2 = Opteron-Solaris, 3 = Opteron-Linux, 4 = Opteron-Windows
  Legend on the slide: most important tools (bold), less important tools (regular), installation/test planned (italic)

  Serial
    Compiler / Library: Sun Studio F95/C/C++ (1,2); Intel F95/C++ (3,4); MS Visual Studio C++ (4); GNU C/C++ (1,2,3,4); PGI F77/F90/C/C++ (3)
    Debugging: Etnus TotalView (1,3); Sun IDE (1,2,(3)); Sun dbx (1,2); GNU gdb (3,4); GNU ddd (3); PGI pgdbg (3); Intel idb (3); MS Visual Studio (4); Allinea DDT (1,3)
    Analysis / Tuning: Sun Performance Analyzer (1,2); Sun gprof (1,2); GNU gprof (3,4); PGI pgprof (3); Intel VTune (3,4)

  OpenMP / Autopar
    Compiler / Library: Sun Studio F95/C/C++ (1,2); Intel Guide F77/F90/C/C++ (1); MS Visual Studio C++ (4); PGI F77/F90/C/C++ (3); Intel F95/C++ (3,4)
    Debugging: Etnus TotalView (1,3); Sun IDE (1,2); Sun dbx (1,2); Intel Assure (1); Intel Thread Checker (3,4); (Allinea DDT (1,3))
    Analysis / Tuning: Sun Performance Analyzer (1,2); Intel GuideView (1); Intel Thread Profiler (3,4)

  MPI
    Compiler / Library: Sun MPI (1); mpich (1,2,3,4); mpich2 (1,2,3,4); Windows HPC (4)
    Debugging: Etnus TotalView (1,3); Sun Prism (1); Allinea DDT (1,3); Windows HPC (4)
    Analysis / Tuning: Intel Trace Collector and Analyzer (Vampir) (1,3); Sun Performance Analyzer (1); Sun mpprof (1); GNU jumpshot (1,2,3,4)



38 Example: Numerical Integration
  The number π can be computed as an integral:
    \[ \pi = \int_0^1 f(x)\,dx, \qquad f(x) = \frac{4}{1 + x^2} \]
  This integral can be approximated numerically with a quadrature formula (midpoint rule):
    \[ \pi \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i), \qquad x_i = \frac{i - \tfrac{1}{2}}{n}, \quad i = 1, \dots, n \]
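The midpoint rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name `midpoint_pi` is made up here, and the caller is assumed to pass n >= 1.

```c
#include <assert.h>

/* Integrand f(x) = 4 / (1 + x^2); its integral over [0,1] equals pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint rule with n subintervals of width h = 1/n.
   midpoint_pi is a hypothetical name chosen for this sketch. */
double midpoint_pi(int n)
{
    assert(n >= 1);
    const double h = 1.0 / (double)n;
    double sum = 0.0;
    for (int i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);  /* midpoint of subinterval i */
        sum += f(x);
    }
    return h * sum;
}
```

Calling `midpoint_pi(1000)` from a small driver and printing the result with `printf("%.10f\n", ...)` is enough to see the approximation converge as n grows.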

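39 Serial Fortran90 Program
  program serial_pi
    implicit none
    integer :: i, n
    double precision :: a, f, h, x, sum, pi
    ! statement function
    f(a) = 4.d0 / (1.d0 + a*a)
    read *, n
    h = 1.0d0 / n
    sum = 0.0d0
    do i = 1, n
      x = h * ( i - 0.5d0 )
      sum = sum + f(x)      ! function evaluation: usually the compute-intensive part
    end do
    pi = h * sum
    print *, pi
  end program serial_pi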

40 MPI: History
  1994: MPI Version 1.0 (1995 Version 1.1, 1997 Version 1.2)
    Standard for message-passing programming, in particular on machines with distributed main memory
    Programming interface for C and Fortran77
    Now available on all parallel computers
    Powerful and complex: well over 100 calls
    Simple: often only a few calls suffice
    Has replaced PVM in technical and scientific computing
    No dynamic creation of tasks
  1997: MPI Version 2.0 (substantial extensions to MPI-1)
    One-sided communication
    MPI-IO
    Dynamic creation of tasks
    Support for Fortran90 and C++
    mpich2 (beta), Sun MPI

41 MPI Program, Version 1, in Fortran90
  program main
    include 'mpif.h'
    ...
    ! statement function
    f(a) = 4.d0 / (1.d0+a*a)
    ! initialize the MPI environment
    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, ntasks, ierr )
    ...
    ! leave the MPI environment
    call MPI_FINALIZE(ierr)
  end program main
  [Diagram: send and receive between two processors, each with its own memory, over the network]

42 MPI Program, Version 1 (continued)
  ! Point-to-point communication
  if ( myid == 0 ) then                  ! master only
    read *, n
    do islave = 1, ntasks-1
      call MPI_Send(n,1,MPI_INTEGER,islave,...)
    end do
  else                                   ! all slaves
    call MPI_Recv(n,1,MPI_INTEGER,master,...)
  end if
  h = 1.0d0 / n                          ! real division; 1 / n would be integer division
  sum = 0.0d0
  ! work sharing: iterations are dealt out cyclically to the tasks
  do i = myid+1, n, ntasks
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  mypi = h * sum
  if ( myid /= 0 ) then                  ! slaves
    call MPI_Send(mypi,1,MPI_DOUBLE_PRECISION,master,...)
  else                                   ! master
    pi = mypi
    do islave = 1, ntasks-1
      call MPI_Recv(mypi,1,MPI_DOUBLE_PRECISION,islave,...)
      pi = pi + mypi
    end do
    print *, pi
  end if

43 MPI Program, Version 2, in Fortran90
  ! Collective communication
  if ( myid == 0 ) then                  ! master only
    read *, n
  end if
  call MPI_BCAST(n,1,MPI_INTEGER,master,...)
  h = 1.0d0 / n
  sum = 0.0d0
  do i = myid+1, n, numprocs
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  mypi = h * sum
  call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,master,...)
  if (myid == 0) then
    print *, pi
  end if

44 MPI Program, Version 2, in C
  #include "mpi.h"
  ...
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0) { scanf("%d", &n); }        /* only rank 0 reads n */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  h = 1.0 / (double) n;
  sum = 0.0;
  for (i = myid + 1; i <= n; i += numprocs) {
    x = h * ((double)i - 0.5);
    sum += f(x);
  }
  mypi = h * sum;
  MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) { printf("%f\n", pi); }
  MPI_Finalize();

45 One-sided Communication (Fortran)
  if ( myrank .eq. master ) then
    read (*,*) n
    allocate ( fi(n) )
    ! the master exposes fi as a window of n 8-byte elements
    call MPI_Win_create ( fi, 8*n, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr )
  else
    ! all other ranks expose an empty window
    call MPI_Win_create ( fi, 0, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr )
  end if
  call MPI_Bcast ( n, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr )
  h = 1.0d0/dble(n)
  call MPI_Win_fence ( 0, win, ierr )
  do i = myrank+1, n, nprocs             ! simple approach of work sharing
    x = h * (dble(i) - 0.5d0)
    fh = f(x)
    call MPI_Put ( fh, 1, MPI_DOUBLE_PRECISION, master, i-1, 1, &
                   MPI_DOUBLE_PRECISION, win, ierr )
  end do
  call MPI_Win_fence ( assert, win, ierr )
  if ( myrank .eq. 0 ) then
    pi = h * sum(fi)
    print *, pi
    deallocate ( fi )
  end if
  call MPI_Win_free ( win, ierr )

46 Visualizing the MPI Performance: Vampir(trace) / Intel Trace Collector + Analyzer
  [Screenshots: Global Timeline, Global Activity Chart]

47 OpenMP: History
  1997: OpenMP Version 1.0 for Fortran
    Standard for shared-memory programming
    Now available for all major SMP machines
    Will replace proprietary directives and the direct use of pthreads in technical and scientific computing
  1998: OpenMP Version 1.0 for C and C++
  2000: OpenMP Version 2.0 for Fortran
    Support for the Fortran90 module concept


49 OpenMP Program, Version 1, in Fortran90
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  read *, n
  h = 1.0d0 / n
  sum = 0.0d0
  !$omp parallel
  !$omp do private(i,x)
  do i = 1,n
    x = h * ( i - 0.5d0 )
  !$omp critical
    sum = sum + f(x)
  !$omp end critical
  end do
  !$omp end do
  !$omp end parallel
  pi = h * sum

  Data scope: n, h, sum are shared; each processor has private copies of i and x.
  Default schedule(static) with two processors: one gets i = 1, 2, ..., n/2, the other i = n/2+1, ..., n.
  Alternative (for example) schedule(dynamic): one gets i = 1, 2, 7, 13, ..., n, the other i = 3, 4, ..., n-1.

50 OpenMP Program, Version 2, in Fortran90
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  double precision, allocatable, dimension(:) :: fx
  read *, n
  allocate (fx(n))
  h = 1.0d0 / n
  sum = 0.0d0
  !$omp parallel private(i,x) shared(h,fx)
  !$omp do
  do i = 1,n
    x = h * ( i - 0.5d0 )
    fx(i) = f(x)              ! each iteration writes its own element, no critical region needed
  end do
  !$omp end do
  !$omp end parallel
  do i = 1,n                  ! serial summation
    sum = sum + fx(i)
  end do
  pi = h * sum
  deallocate (fx)

  Data scope: n, h, fx are shared; each processor has private copies of i and x.

51 OpenMP Program, Version 3, in Fortran90
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  read *, n
  h = 1.0d0 / n
  sum = 0.0d0
  !$omp parallel private(i,x,sum_local)
  sum_local = 0.0d0           ! executed redundantly by every thread
  !$omp do
  do i = 1,n                  ! work-sharing: iterations are divided among the threads
    x = h * ( i - 0.5d0 )
    sum_local = sum_local + f(x)
  end do
  !$omp end do
  !$omp critical
  sum = sum + sum_local       ! executed redundantly by every thread, one at a time
  !$omp end critical
  !$omp end parallel
  pi = h * sum

  Execution model: the master thread runs the serial regions; in the parallel region it is joined by the slave threads.
  Data scope: n, h, sum are shared; each processor has private copies of i, x, sum_local.

54 OpenMP Program, Version 4, in Fortran90 (excerpt)
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  read *, n
  h = 1.0d0 / n
  sum = 0.0d0
  !$omp parallel private(i,x)
  !$omp do reduction(+:sum)
  do i = 1,n
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  !$omp end do
  !$omp end parallel
  pi = h * sum

  Data scope: n, h, sum are shared; each processor has private copies of i and x and, for the reduction, a private copy of sum.

55 OpenMP Program, Version 4, in Fortran90 (excerpt, combined directive)
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  read *, n
  h = 1.0d0 / n
  sum = 0.0d0
  !$omp parallel do private(i,x) reduction(+:sum)
  do i = 1,n
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  !$omp end parallel do
  pi = h * sum
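For comparison with the C version of the MPI program, the reduction variant can be sketched in C as well. The function name `omp_pi` is an assumption made for this sketch. If the compiler is invoked without OpenMP support, the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>

/* Integrand f(x) = 4 / (1 + x^2). */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint rule for pi with an OpenMP reduction.
   omp_pi is a hypothetical name chosen for this sketch. */
double omp_pi(int n)
{
    assert(n >= 1);
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    /* Every thread accumulates into its own private copy of sum;
       the copies are added to the shared sum at the end of the loop.
       The loop index i and the local x are private to each thread. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += f(x);
    }
    return h * sum;
}
```

Because floating-point addition is not associative, the last digits of the result may differ slightly between runs with different thread counts.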

56 Automatic Parallelization
  ! statement function
  f(a) = 4.d0 / (1.d0+a*a)
  read *, n
  h = 1.0d0 / n
  sum = 0.0d0
  do i = 1, n
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  pi = h * sum

  Because the statement function f(a) is simple, the compiler can parallelize the loop by itself:
    f90 -fast -xarch=v8plusb -xautopar -xreduction -xloopinfo serial_pi.f90
    "serial_pi.f90", line xx: PARALLELIZED, reduction, and serial version generated

57 Visualizing the Behavior of an OpenMP program GuideView (KAI) / Intel Thread Visualizer

58 Debugging with Assure (Fortran) / Intel Thread Checker

59 Hybrid Program in Fortran90
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  if ( myid .eq. 0 ) read *, n
  call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
  h = 1.0d0 / n
  sum = 0.0d0
  ! MPI distributes the iterations over the processes,
  ! OpenMP distributes each process's share over its threads
  !$omp parallel do reduction(+:sum) private(i,x)
  do i = myid+1, n, numprocs
    x = h * ( i - 0.5d0 )
    sum = sum + f(x)
  end do
  mypi = h * sum
  call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,...)
  if (myid .eq. 0) print *, pi
  call MPI_FINALIZE(ierr)

60 Example: Adaptive Numerical Integration
  The number π can be computed as an integral:
    \[ \pi = \int_0^1 f(x)\,dx, \qquad f(x) = \frac{4}{1 + x^2} \]


63 Recursive Fortran90 Function for Adaptive Integration
  recursive function integral (f, a, b, tolerance) &
            result(integral_result)
    h = b - a
    mid = (a + b) / 2
    one_trapezoid_area = h * (f(a) + f(b)) / 2
    two_trapezoid_area = h/2 * (f(a) + f(mid)) / 2 + &
                         h/2 * (f(mid) + f(b)) / 2
    if (abs(one_trapezoid_area - two_trapezoid_area) < &
        3 * tolerance) then
      ! the two estimates agree: accept the finer one
      integral_result = two_trapezoid_area
    else
      ! otherwise bisect the interval and halve the tolerance
      left_area  = integral (f, a, mid, tolerance / 2)
      right_area = integral (f, mid, b, tolerance / 2)
      integral_result = left_area + right_area
    end if
  end function integral

64 Tuning: Fewer Function Calls (do not parallelize a bad program!)
  recursive function integral (f, a, fa, b, fb, tolerance) &
            result (integral_result)
    ...
    h = b - a
    mid = (a + b) / 2
    fmid = f(mid)             ! the only new function evaluation per call
    one_trapezoid_area = h * (fa + fb) / 2
    two_trapezoid_area = h/2 * (fa + fmid) / 2 + &
                         h/2 * (fmid + fb) / 2
    if (abs(one_trapezoid_area - two_trapezoid_area) < &
        3 * tolerance) then
      integral_result = two_trapezoid_area
    else
      left_area  = integral (f, a, fa, mid, fmid, tolerance / 2)
      right_area = integral (f, mid, fmid, b, fb, tolerance / 2)
      integral_result = left_area + right_area
    end if
  end function integral
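The tuned recursion, which passes the already computed endpoint values down so that each call evaluates the integrand only once, can be sketched in C as follows. The names `integral_rec` and `adaptive_pi` are assumptions for this sketch, and, like the Fortran original, it has no recursion depth limit, so it relies on the integrand being smooth enough for the acceptance test to succeed eventually.

```c
#include <assert.h>

/* Integrand f(x) = 4 / (1 + x^2). */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Adaptive trapezoid integration of fn over [a,b].
   fa and fb are fn(a) and fn(b), supplied by the caller so that
   each call needs only one new function evaluation (at the midpoint). */
double integral_rec(double (*fn)(double), double a, double fa,
                    double b, double fb, double tolerance)
{
    double h = b - a;
    double mid = (a + b) / 2.0;
    double fmid = fn(mid);
    double one = h * (fa + fb) / 2.0;
    double two = h / 2.0 * (fa + fmid) / 2.0 + h / 2.0 * (fmid + fb) / 2.0;
    double diff = one - two;
    if (diff < 0.0)
        diff = -diff;

    if (diff < 3.0 * tolerance)
        return two;                        /* estimates agree: accept */

    /* Bisect the interval and give each half half of the tolerance. */
    return integral_rec(fn, a, fa, mid, fmid, tolerance / 2.0)
         + integral_rec(fn, mid, fmid, b, fb, tolerance / 2.0);
}

/* Convenience wrapper (hypothetical): integrate f over [0,1]. */
double adaptive_pi(double tolerance)
{
    assert(tolerance > 0.0);
    return integral_rec(f, 0.0, f(0.0), 1.0, f(1.0), tolerance);
}
```

A call such as `adaptive_pi(1e-6)` refines the grid only where the one-trapezoid and two-trapezoid estimates disagree, which is what makes the work per subinterval irregular and the later parallel versions interesting.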

65 Serial Implementation with a Stack (prepare the serial program for parallelization!)
  function integral (f, ah, bh, tolerance) result (integral_result)
    call new_stack ( stack )
    call push ( stack, ah, f(ah), bh, f(bh), tolerance )
    integral_result = 0.0
    do
      if ( empty_stack ( stack ) ) exit
      call pop ( stack, a, fa, b, fb, tolerance )
      h = b - a
      mid = (a + b) / 2
      fmid = f(mid)
      one_trapezoid_area = h * (fa + fb) / 2
      two_trapezoid_area = h/2 * (fa + fmid) / 2 + &
                           h/2 * (fmid + fb) / 2
      if (abs(one_trapezoid_area - two_trapezoid_area) &
          < 3 * tolerance) then
        integral_result = integral_result + two_trapezoid_area
      else
        ! replace the interval by its two halves
        call push ( stack, a, fa, mid, fmid, tolerance / 2 )
        call push ( stack, mid, fmid, b, fb, tolerance / 2 )
      end if
    end do
  end function integral
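The slide leaves the stack routines (new_stack, push, pop, empty_stack) unspecified. The following C sketch shows one way such a work stack might look, using a growable array of interval records. The type `interval_t` and the functions `integral_stack` and `stack_pi` are assumptions for this illustration, not the author's implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* One unit of work: an interval with its endpoint values and tolerance. */
typedef struct {
    double a, fa, b, fb, tol;
} interval_t;

/* Integrand f(x) = 4 / (1 + x^2). */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Adaptive trapezoid integration with an explicit stack instead of recursion. */
double integral_stack(double (*fn)(double), double a0, double b0, double tolerance)
{
    size_t cap = 64, top = 0;
    interval_t *stack = malloc(cap * sizeof *stack);
    assert(stack != NULL);                 /* sketch: no real error handling */

    stack[top++] = (interval_t){ a0, fn(a0), b0, fn(b0), tolerance };
    double result = 0.0;

    while (top > 0) {                      /* until the stack is empty */
        interval_t w = stack[--top];       /* pop */
        double h = w.b - w.a;
        double mid = (w.a + w.b) / 2.0;
        double fmid = fn(mid);
        double one = h * (w.fa + w.fb) / 2.0;
        double two = h / 2.0 * (w.fa + fmid) / 2.0 + h / 2.0 * (fmid + w.fb) / 2.0;
        double diff = one - two;
        if (diff < 0.0)
            diff = -diff;

        if (diff < 3.0 * w.tol) {
            result += two;                 /* accept this interval */
        } else {
            if (top + 2 > cap) {           /* grow the stack if needed */
                cap *= 2;
                interval_t *grown = realloc(stack, cap * sizeof *stack);
                assert(grown != NULL);
                stack = grown;
            }
            /* push both halves, each with half the tolerance */
            stack[top++] = (interval_t){ w.a, w.fa, mid, fmid, w.tol / 2.0 };
            stack[top++] = (interval_t){ mid, fmid, w.b, w.fb, w.tol / 2.0 };
        }
    }
    free(stack);
    return result;
}

/* Convenience wrapper (hypothetical): integrate f over [0,1]. */
double stack_pi(double tolerance)
{
    assert(tolerance > 0.0);
    return integral_stack(f, 0.0, 1.0, tolerance);
}
```

Once the pending work lives in one data structure rather than on the call stack, several threads or processes can take intervals from it, provided that access to the stack is protected.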

66 First OpenMP Version
  Access to the stack must be protected in critical regions.

  function integral (f, ah, bh, tolerance) result (integral_result)
    call new_stack ( stack )
    call push ( stack, ah, f(ah), bh, f(bh), tolerance )
    integral_result = 0.0; busy_cnt = 0; ready = .false.
  !$omp parallel default(none) shared(stack,integral_result,busy_cnt) &
  !$omp private(a,fa,b,fb,tolerance,h,mid,fmid,one_trapezoid_area,two_trapezoid_area) &
  !$omp private(idle,ready,private_result)
    ! private variables are undefined on entry, so ready is initialized here as well
    idle = .true.; ready = .false.; private_result = 0.0
    do
  !$omp critical (stack)
      if ( empty_stack ( stack ) ) then
        if ( .not. idle ) then
          idle = .true.; busy_cnt = busy_cnt - 1
        end if
        ! finished only when the stack is empty and no thread is still working
        if ( busy_cnt .eq. 0 ) ready = .true.
      else
        call pop ( stack, a, fa, b, fb, tolerance )
        if ( idle ) then
          idle = .false.; busy_cnt = busy_cnt + 1
        end if
      end if
  !$omp end critical (stack)
      if ( idle ) then
        if ( ready ) exit
        cycle
      end if
      h = b - a; mid = (a + b) / 2; fmid = f(mid)
      one_trapezoid_area = h * (fa + fb) / 2
      two_trapezoid_area = h/2 * (fa + fmid) / 2 + h/2 * (fmid + fb) / 2
      if (abs(one_trapezoid_area - two_trapezoid_area) < 3 * tolerance) then
        private_result = private_result + two_trapezoid_area
      else
  !$omp critical (stack)
        call push ( stack, a, fa, mid, fmid, tolerance / 2 )
        call push ( stack, mid, fmid, b, fb, tolerance / 2 )
  !$omp end critical (stack)
      end if
    end do
  !$omp critical (result)
    integral_result = integral_result + private_result
  !$omp end critical (result)
  !$omp end parallel
  end function integral

67 Second OpenMP Version: reduction Clause
  function integral (f, ah, bh, tolerance) result (integral_result)
    call new_stack ( stack )
    call push ( stack, ah, f(ah), bh, f(bh), tolerance )
    integral_result = 0.0; busy_cnt = 0; ready = .false.
  !$omp parallel default(none) shared(stack,busy_cnt) &
  !$omp private(a,fa,b,fb,tolerance,h,mid,fmid,one_trapezoid_area,two_trapezoid_area) &
  !$omp private(idle,ready) reduction(+:integral_result)
    ! private variables are undefined on entry, so ready is initialized here as well
    idle = .true.; ready = .false.
    do
  !$omp critical (stack)
      if ( empty_stack ( stack ) ) then
        if ( .not. idle ) then
          idle = .true.; busy_cnt = busy_cnt - 1
        end if
        if ( busy_cnt .eq. 0 ) ready = .true.
      else
        call pop ( stack, a, fa, b, fb, tolerance )
        if ( idle ) then
          idle = .false.; busy_cnt = busy_cnt + 1
        end if
      end if
  !$omp end critical (stack)
      if ( idle ) then
        if ( ready ) exit
        cycle
      end if
      h = b - a; mid = (a + b) / 2; fmid = f(mid)
      one_trapezoid_area = h * (fa + fb) / 2
      two_trapezoid_area = h/2 * (fa + fmid) / 2 + h/2 * (fmid + fb) / 2
      if (abs(one_trapezoid_area - two_trapezoid_area) < 3 * tolerance) then
        ! each thread adds to its private copy; the copies are summed at the end of the region
        integral_result = integral_result + two_trapezoid_area
      else
  !$omp critical (stack)
        call push ( stack, a, fa, mid, fmid, tolerance / 2 )
        call push ( stack, mid, fmid, b, fb, tolerance / 2 )
  !$omp end critical (stack)
      end if
    end do
  !$omp end parallel
  end function integral

68 Nested Parallelism (new in Sun Studio 10)
  Environment settings:
    export OMP_NESTED=true
    export SUNW_MP_MAX_POOL_THREADS=23
    export SUNW_MP_MAX_NESTED_LEVELS=8

  recursive function integral (f, a, fa, b, fb, tolerance) &
            result (integral_result)
    ...
    h = b - a
    mid = (a + b) / 2
    fmid = f(mid)
    one_trapezoid_area = h * (fa + fb) / 2
    two_trapezoid_area = h/2 * (fa + fmid) / 2 + &
                         h/2 * (fmid + fb) / 2
    if (abs(one_trapezoid_area - two_trapezoid_area) &
        < 3 * tolerance) then
      integral_result = two_trapezoid_area
    else
      ! the two recursive calls run as parallel sections, nested at every level
  !$omp parallel sections
  !$omp section
      left_area  = integral (f, a, fa, mid, fmid, tolerance / 2)
  !$omp section
      right_area = integral (f, mid, fmid, b, fb, tolerance / 2)
  !$omp end parallel sections
      integral_result = left_area + right_area
    end if
  end function integral

69 Adaptive Integration with MPI
  program integrate
    call MPI_INIT( ierror )
    call MPI_COMM_RANK( ..., myid, ierror )
    if ( myid == 0 ) then
      call master
    else
      call slave
    end if
    call MPI_FINALIZE(ierror)
  end program integrate

  ! The master manages the stack
  subroutine master
    x_min = 0.0
    x_max = 1.0
    answer = integral (f, x_min, x_max, ...)
    print *, answer
  end subroutine master

70 Adaptive Integration with MPI (continued)
  ! The slaves compute the integrals
  subroutine slave
    private_result = 0.0
    do
      ! wait for work
      call MPI_Send(v,0,MPI_REAL,master,waitforworktag,...)
      call MPI_Recv(v,5,MPI_REAL,master,MPI_ANY_TAG,...,status,...)
      select case (status(mpi_tag))
      case(readytag)
        exit
      case(poptag)
        a = v(1); fa = v(2); b = v(3); fb = v(4); tolerance = v(5)
      end select
      h = b - a
      mid = (a + b) / 2
      fmid = f(mid)
      one_trapezoid_area = h * (fa + fb) / 2.0
      two_trapezoid_area = h/2 * (fa+fmid)/2 + h/2 * (fmid+fb)/2
      if (abs(one_trapezoid_area-two_trapezoid_area) < 3*tolerance) then
        private_result = private_result + two_trapezoid_area
      else
        ! send both halves back to the master
        v(1)=a; v(2)=fa; v(3)=mid; v(4)=fmid; v(5)=tolerance/2
        call MPI_Send(v,5,MPI_REAL,master,pushtag,...)
        v(1)=mid; v(2)=fmid; v(3)=b; v(4)=fb; v(5)=tolerance/2
        call MPI_Send(v,5,MPI_REAL,master,pushtag,...)
      end if
    end do
    call MPI_Reduce (private_result,integral_result,1,...)
  end subroutine slave

71 Adaptive Integration with MPI (continued)
  function integral (f, ah, bh, tolh) result (integral_result)
    ! the master manages the stack
    call new_stack ( stack )
    call push ( stack, ah, f(ah), bh, f(bh), tolh )
    integral_result = 0.0; private_result = 0.0
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierror )
    allocate(slave_status(numprocs-1))
    busycount = numprocs-1; slave_status = idle
    do
      call MPI_Recv(v,5,MPI_REAL,MPI_ANY_SOURCE,MPI_ANY_TAG,...,status,...)
      select case (status(mpi_tag))
      case(waitforworktag)                 ! a slave is waiting for work
        busycount = busycount - 1
        if ( empty_stack(stack) ) then     ! nothing available
          slave_status(status(mpi_source)) = idle
          if ( busycount .eq. 0 ) exit
        else                               ! send an interval
          call popv(stack,v)
          call MPI_Send(v,5,MPI_REAL,status(MPI_SOURCE),poptag,...)
          slave_status(status(mpi_source)) = busy; busycount = busycount+1
          cycle
        end if
      case(pushtag)                        ! a slave sends a new interval
        if ( busycount < numprocs-1 ) then ! forward it if possible
          do iproc = 1, numprocs-1
            if ( slave_status(iproc) == idle ) then
              call MPI_Send(v,5,MPI_REAL,iproc,poptag,...)
              busycount = busycount+1; slave_status(iproc) = busy; exit
            end if
          end do
        else                               ! otherwise keep it on the stack
          call pushv(stack,v)
        end if
        cycle
      end select
    end do
    do iproc = 1, numprocs-1               ! all done: tell the slaves to stop
      call MPI_Send(v,0,MPI_REAL,iproc,readytag,...)
    end do
    call MPI_Reduce(private_result,integral_result,1,MPI_REAL,MPI_SUM,root,...)
  end function integral


73 Adaptive Integration with MPI and OpenMP (hybrid)
  program integrate
    call MPI_INIT( ierror )
    call MPI_COMM_RANK( ..., myid, ierror )
    if ( myid == 0 ) then
      call master
    else
      call slave
    end if
    call MPI_FINALIZE(ierror)
  end program integrate

  ! The master only delegates the computation
  subroutine master
    x_min = 0.0
    x_max = 1.0
    answer = integral(f, 0.0, 1.0, ...)
    print *, answer
  end subroutine master

  subroutine slave
    ! only performs the function evaluations
    ...
    do
      call MPI_Recv(a,1,MPI_REAL, &
                    master,mpi_any_tag,...,status,...)
      select case (status(mpi_tag))
      case(readytag)
        exit                               ! done
      case(argtag)
        fa = f(a)
        mymaster = status(mpi_source)      ! rank that sent the argument
        call MPI_Send(fa,1,MPI_REAL, &
                      master,functag,...)
      end select
    end do
  end subroutine slave

74 Adaptive Integration with MPI and OpenMP (hybrid, continued)
  [Diagram: MPI process 0 (master) runs OpenMP threads 0, 1, 2; MPI processes 1, 2, 3 (workers 1, 2, 3) each run a single thread 0]

  function integral (f, ah, bh, tolerance) result (integral_result)
    call new_stack ( stack )
    call push ( stack, ah, f(ah), bh, f(bh), tolerance )
    integral_result = 0.0; busy_cnt = 0; ready = .false.
    call MPI_Comm_size( MPI_COMM_WORLD, numprocs, ierror )
  !$omp parallel default(none) shared(stack,busy_cnt,readytag) &
  !$omp private(a,fa,b,fb,tolerance,h,mid,fmid,one_trapezoid_area,two_trapezoid_area) &
  !$omp private(idle,ready,myworker,...) reduction(+:integral_result) num_threads(numprocs-1)
    ! private variables are undefined on entry, so ready is initialized here as well
    idle = .true.; ready = .false.
    myworker = omp_get_thread_num() + 1    ! each master thread drives one worker process
    do
  !$omp critical (stack)
      if ( empty_stack ( stack ) ) then
        ...
      else
        call pop ( stack, a, fa, b, fb, tolerance )
        ...
      end if
  !$omp end critical (stack)
      if ( idle ) then
        if ( ready ) then
          call MPI_Send(mid,0,MPI_REAL,myworker,readytag,...); exit
        end if
        cycle
      end if
      h = b - a; mid = (a + b) / 2
      ! fmid = f(mid): the function evaluation is left to the worker
      call MPI_Send(mid,1,MPI_REAL,myworker,argtag,...)
      call MPI_Recv(fmid,1,MPI_REAL,myworker,functag,...)
      one_trapezoid_area = h * (fa + fb) / 2
      two_trapezoid_area = h/2 * (fa + fmid) / 2 + h/2 * (fmid + fb) / 2
      if (abs(one_trapezoid_area - two_trapezoid_area) < 3 * tolerance) then
        integral_result = integral_result + two_trapezoid_area
      else
  !$omp critical (stack)
        call push ( stack, a, fa, mid, fmid, tolerance / 2 )
        call push ( stack, mid, fmid, b, fb, tolerance / 2 )
  !$omp end critical (stack)
      end if
    end do
  !$omp end parallel
  end function integral

76 Summary 1
  Since September 2004 the Sun Fire SMP cluster consists of
    28 nodes (E25K, E6900, E2900), each with 12 to 72 UltraSPARC IV dual-core processors (1050 or 1200 MHz) and 48 to 288 GB of main memory, running Solaris
    64 nodes (V40z), each with 4 Opteron 848 processors (2200 MHz) and 8 GB of memory, running Linux (production) / Solaris (test) / Windows (in preparation)

77 Summary 2
  Parallelization is a must for HPC
  Message passing with MPI can be used on all parallel computers
  MPI is extensive, yet relatively easy to understand and learn
  MPI parallelization can require a great deal of effort
  OpenMP programs need machines with shared memory
  OpenMP is compact and easy to learn
  OpenMP parallelization can be carried out in small steps
  Verification with Assure / Thread Checker; errors are sometimes hard to find
  Hybrid parallelization (MPI + OpenMP) is of interest for large applications