AP II: Further Development of LB Methods for Practical Applications on Highly Scalable Systems

Transcript

1 AP II: Further Development of LB Methods for Practical Applications on Highly Scalable Systems. C. Feichtinger, Chair of Computer Science 10 (System Simulation), University of Erlangen-Nürnberg, www10.informatik.uni-erlangen.de. SKALB final workshop, Erlangen, April 3

2 Project staff at LSS: Christian Feichtinger (WaLBerla, data structures, parallelization, GPGPU, performance modeling); Florian Schornbaum (data structures, dynamic load balancing, adaptivity); Stefan Donath (parallelization, checkpoint-restart, free surfaces); Jan Götz (parallelization, checkpoint-restart, particulate flows); Dominik Bartuschat (WaLBerla, parallelization); Kristina Pickl (parallel analysis); Frank Deserno (data reduction); Simon Bogner (visualization, free surfaces, particulate flows)

3 AP II. II.a: Data structures. II.b: Domain decomposition and dynamic/adaptive load balancing (II.b.1: static domain partitioning for large domains, taking hardware properties into account; II.b.2: dynamic/adaptive load balancing; II.b.3: adaptivity). II.c: Preprocessing and visualization. II.d: Data compression and restart mechanisms, MPI-IO

8 PhDs: WaLBerla. 1st generation (2006): Klaus Iglberger, Stefan Donath, Jan Götz, Christian Feichtinger. 2nd generation (~2010): Dominik Bartuschat, Kristina Pickl, Florian Schornbaum, Christian Godenschwager, Johannes Habich (RRZE), Simon Bogner, Felipe Aristizabal (McGill, Canada). 3rd generation (~2012): Regina Ammer, Martin Bauer, Daniela Anderl (LSTM), Matthias Markel (WTM). Current projects: simulation of protein foams (LSTM, LFG, DFG, FEI (AIF)); FastEBM (WTM, EU); self-propelled particle complexes (EAM); simulation of myocardial perfusion (Siemens AG, Healthcare); nano fluids (McGill, Canada)

9 Data structures. Requirements: dynamically load-balanceable, adaptively refinable, specializable, scalable, high-performance. Five different data structure approaches were evaluated

10 Homogeneous blocks (LSS, irmb)

11 Heterogeneous blocks (LSS): specialization of subdomains for the hardware; setup based on a node description; hybrid parallelization; static load balancing. Node description:
    Block { Hardware GPU; Load 240; Threads 1; Device 0; }
    Block { Hardware CPU; Load 40; Threads 5; }
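
The Load entries in the node description suggest a weighted static split. The following is a minimal sketch, assuming the Load values act as relative weights that decide how many lattice cells each block receives; the function name split_cells and this interpretation are illustrative assumptions, not the WaLBerla implementation.

```python
def split_cells(total_cells, loads):
    """Split total_cells among blocks in proportion to integer load weights."""
    total_load = sum(loads)
    # Integer share for every block (rounded down).
    shares = [total_cells * load // total_load for load in loads]
    # Hand out the cells lost to rounding, largest remainder first.
    remainders = [(total_cells * load % total_load, i) for i, load in enumerate(loads)]
    leftover = total_cells - sum(shares)
    for _, i in sorted(remainders, reverse=True)[:leftover]:
        shares[i] += 1
    return shares


# Weights taken from the node description: GPU block 240, CPU block 40.
gpu_cells, cpu_cells = split_cells(180**3, [240, 40])
print(gpu_cells, cpu_cells)
```

With weights 240 and 40 the GPU block receives six times as many cells as the CPU block, up to rounding.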

13 Dynamic octree (LSS, irmb)

14 Patch-block (irmb)

19 List-based (RRZE): storage of the fluid cells; geometry list and PDF list; optimal for static simulations
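
A list-based layout can be sketched as follows. This is a hedged illustration only: it assumes that the geometry list is an adjacency list of fluid-cell indices and that the PDF list holds one entry per fluid cell, and it uses a 2D D2Q9 stencil to stay short. The names build_lists and DIRECTIONS are hypothetical and do not describe the RRZE code.

```python
# D2Q9 lattice directions (a 2D stand-in that keeps the sketch short).
DIRECTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
              (1, 1), (-1, -1), (1, -1), (-1, 1)]


def build_lists(geometry):
    """geometry[y][x] is True for a fluid cell and False for a solid cell."""
    # Enumerate fluid cells only; solid cells get no storage at all.
    index = {}
    for y, row in enumerate(geometry):
        for x, is_fluid in enumerate(row):
            if is_fluid:
                index[(x, y)] = len(index)
    # neighbors[i][q] is the list index of the fluid cell reached from
    # cell i in direction q, or -1 if that cell is solid or outside.
    neighbors = [
        [index.get((x + dx, y + dy), -1) for dx, dy in DIRECTIONS]
        for (x, y) in index
    ]
    # One PDF entry per fluid cell and direction.
    pdfs = [[0.0] * len(DIRECTIONS) for _ in index]
    return index, neighbors, pdfs


index, neighbors, pdfs = build_lists([[True, False],
                                      [True, True]])
```

Because the adjacency list is built once, this layout fits geometries that do not change during the run, which matches the slide's remark about static simulations.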

20 Requirements: parallelization (WaLBerla). Minimize the parallelization effort for new simulation scenarios; suitable for massively parallel simulations; buffer management and flow control; suitable for heterogeneous simulations

23 Tsubame: compute nodes equipped with GPUs; 3 NVIDIA Tesla M2050 per node; peak performance: 2.2 PFlop/s; 633 TB/s memory bandwidth; total performance: 2.4 PFlop/s; 5th in the TOP500 list; located at Tokyo Institute of Technology, Japan; collaboration with Prof. Takayuki Aoki

24 Pure LBM Performance on Tsubame 2. MLUPS: mega lattice updates per second. Pure LBM performance is limited by bandwidth. Implementation in CUDA. Scenario: lid-driven cavity.

                                      NVIDIA Tesla M2050          Xeon X5670 Westmere (2 sockets, 12 cores)   Factor
    Flops [TFlop/s]                   1.0 / 0.5                   0.25 / 0.13                                 x 4
    Theoretical peak bandwidth [GB/s] (missing)                   (missing)                                   x 2-3
    Stream copy bandwidth [GB/s]      100 (+ECC) / 115 (-ECC)     43                                          (missing)

30 Single GPU and CPU Node Performance. Performance estimates based on Stream bandwidth: CPU: 142 MLUPS (ECC, DP, -BC); GPU: 330 MLUPS (ECC, DP, -BC). Resulting performance: 75 % of estimate (+BC). CPU kernel: SSE intrinsics, non-temporal stores, padding. GPU kernel: register usage optimized, memory layout: SoA, padding. Kernels implemented by RRZE
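
The bandwidth-based estimates can be approximated with a short calculation. This sketch assumes a D3Q19 stencil in double precision, where every cell update loads and stores 19 PDFs (304 bytes, no write-allocate traffic); the stencil is an assumption that the slides do not state. The bandwidths are the Stream copy values from slide 24.

```python
BYTES_PER_PDF = 8     # double precision (DP)
PDFS_PER_CELL = 19    # assumption: D3Q19 stencil, not stated on the slides


def mlups_estimate(bandwidth_gb_per_s):
    """Upper bound on MLUPS if memory bandwidth is the only limit."""
    # Each update loads and stores every PDF once: 2 * 19 * 8 = 304 bytes.
    bytes_per_update = 2 * PDFS_PER_CELL * BYTES_PER_PDF
    return bandwidth_gb_per_s * 1e9 / bytes_per_update / 1e6


gpu = mlups_estimate(100.0)   # Tesla M2050 Stream copy bandwidth with ECC
cpu = mlups_estimate(43.0)    # Xeon X5670 node Stream copy bandwidth
print(f"GPU: {gpu:.0f} MLUPS, CPU: {cpu:.0f} MLUPS")
```

Under these assumptions the result is about 329 MLUPS for the GPU and about 141 MLUPS for the CPU node, close to the 330 and 142 MLUPS quoted on the slide.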

31 Weak Scaling Performance [Figure: performance in GLUPS over the number of GPGPUs, 180^3; series: weak scaling 3 GPUs]

34 Performance Model for Parallel Simulations. Goal: better understanding of the communication overhead; performance predictions for WaLBerla. Input: bytes per message and number of messages. Bandwidths depend on: message size, type of transfer, hardware, libraries, and the number of processes/threads that share a resource
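
The inputs listed above suggest a simple additive model: the communication time is the sum over all messages of bytes divided by the bandwidth that applies to that message size and transfer type. The sketch below follows that idea. The table BANDWIDTH_TABLE and all of its values are hypothetical placeholders, not measurements from Tsubame, and the step-wise lookup is an assumption.

```python
import bisect

# Hypothetical bandwidths in GB/s, keyed by transfer type and message size
# in bytes. Real values would come from benchmarks on the target machine.
BANDWIDTH_TABLE = {
    "mpi": [(1_024, 0.2), (65_536, 1.5), (1_048_576, 3.0)],
    "pcie": [(1_024, 0.1), (65_536, 2.0), (1_048_576, 5.0)],
}


def bandwidth(kind, message_bytes):
    """Bandwidth in bytes/s of the largest table entry not above the size."""
    table = BANDWIDTH_TABLE[kind]
    sizes = [size for size, _ in table]
    # Messages smaller than the first entry use the first entry.
    i = max(bisect.bisect_right(sizes, message_bytes) - 1, 0)
    return table[i][1] * 1e9


def communication_time(messages):
    """messages: iterable of (kind, bytes_per_message, number_of_messages)."""
    return sum(count * size / bandwidth(kind, size)
               for kind, size, count in messages)


# Example: six MPI messages of 1 MiB plus the matching PCIe transfers.
t = communication_time([("mpi", 1_048_576, 6), ("pcie", 1_048_576, 6)])
print(f"{t * 1e3:.2f} ms per time step")
```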

35 Weak Scaling Performance [Figure: performance in GLUPS over the number of GPGPUs, 180^3; series: weak scaling 3 GPUs, kernel, estimate weak scaling 3 GPUs]

37 Overlapping communication and computation

39 Overlapping communication and computation [Figure: performance in GLUPS over the number of GPGPUs, 180^3; series: weak scaling 3 GPUs, weak scaling 3 GPUs (overlap), kernel, kernel (overlap), estimate weak scaling 3 GPUs]
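
The effect of overlapping can be illustrated with an idealized timing model. It assumes that the kernel is split into an outer part (cells whose data must be sent) and an inner part, and that communication runs concurrently with the inner part. The function step_time, its parameters, and the example timings are hypothetical; the model ignores any overhead caused by splitting the kernel.

```python
def step_time(t_inner, t_outer, t_comm, overlap):
    """Idealized time of one time step, in arbitrary time units."""
    if overlap:
        # The outer cells are updated first, then communication proceeds
        # while the inner cells are updated.
        return t_outer + max(t_inner, t_comm)
    # Without overlap, all three phases run one after another.
    return t_inner + t_outer + t_comm


print(step_time(8, 2, 5, overlap=False))  # serial phases
print(step_time(8, 2, 5, overlap=True))   # communication hidden
print(step_time(3, 2, 5, overlap=True))   # communication dominates
```

In this model overlapping hides communication completely only while the inner kernel takes at least as long as the communication.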

40 Weak Scaling Performance [Figure: performance in GLUPS over the number of GPGPUs; series: estimate 60^3 (overlap), estimate 100^3 (overlap), estimate 140^3 (overlap), estimate 180^3 (overlap), kernel 100^3, kernel 60^3]

41 Heterogeneous node performance [Figure: performance in MLUPS over the cubic domain size; series: 1 GPU + 1 CPU, 1 GPU, 1 CPU]

42 Weak scaling: comparison of the approaches [Figure: performance in GLUPS over the number of GPGPUs, 180^3; series: weak scaling 3 GPUs - MPI (overlap), weak scaling 3 GPUs - hybrid (overlap), weak scaling 3 GPUs - hetero (overlap)]

43 Simulations in porous media. Performance evaluation; domain sizes: 100x100x1504 and 256x256x3900; voxelization: irmb; strong scaling

44 Simulations in porous media, 100x100x1504 [Figure: performance in GLUPS over the number of GPGPUs / CPU nodes; series: strong scaling - plain, strong scaling - geometry - LUPS, strong scaling - geometry - FLUPS, strong scaling - geometry - FLUPS (overlap), strong scaling - CPU - FLUPS]

45 Simulations in porous media, 256x256x3900 [Figure: performance in GLUPS over the number of GPGPUs / CPU nodes; series: strong scaling - 2D (overlap), strong scaling - 1D (overlap)]

46 Prototype for dynamic load balancing. Why develop a prototype? Testing large-scale partitionings on the desktop; simplified testing of different approaches; not algorithm-specific; qualitative evaluation; verification. Description model: domain size; workload per block; bytes per block; communicated bytes per face, edge, and corner. In development: performance modeling. Planned: open source
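
The description model can be pictured as a small record per block. The sketch below is an assumption about how such a record might look: the class BlockModel, its field names, and the example numbers are hypothetical. It only uses the quantities named on the slide and the fact that a cubic block has 6 faces, 12 edges, and 8 corners.

```python
from dataclasses import dataclass


@dataclass
class BlockModel:
    workload: float          # e.g. cells updated per time step
    memory_bytes: int        # bytes per block
    bytes_per_face: int      # communicated bytes per face neighbor
    bytes_per_edge: int      # communicated bytes per edge neighbor
    bytes_per_corner: int    # communicated bytes per corner neighbor

    def communication_bytes(self, faces=6, edges=12, corners=8):
        """Bytes sent per time step for the given number of neighbors."""
        return (faces * self.bytes_per_face
                + edges * self.bytes_per_edge
                + corners * self.bytes_per_corner)


# Hypothetical numbers for a 40^3 block.
block = BlockModel(workload=40**3, memory_bytes=40**3 * 19 * 8 * 2,
                   bytes_per_face=40 * 40 * 5 * 8,
                   bytes_per_edge=40 * 8,
                   bytes_per_corner=0)
print(block.communication_bytes())
```

A block on the domain boundary has fewer neighbors, which the optional arguments can express, for example block.communication_bytes(faces=5, edges=8, corners=4).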

47 Prototype for dynamic load balancing. Initial load distribution: LSS: space-filling curves or greedy approach; irmb: Metis or Zoltan. Optimization variables: workload, memory, communication. Optimization strategies: number of processes, average memory usage, load distribution
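
An initial distribution along a space-filling curve can be sketched as follows. The example uses a Morton (Z-order) curve as one possible choice; the slide does not say which curve the LSS prototype uses. The functions morton_index and partition are hypothetical names, and only the workload is balanced here, not memory or communication.

```python
def morton_index(x, y, z, bits=10):
    """Interleave the bits of x, y, z into a single Z-order index."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def partition(blocks, num_processes):
    """blocks maps (x, y, z) to a positive workload; returns (x, y, z) -> rank."""
    ordered = sorted(blocks, key=lambda coord: morton_index(*coord))
    total = sum(blocks.values())
    assignment, done = {}, 0.0
    for coord in ordered:
        # The rank follows from the workload midpoint of the block on the
        # curve, so every rank receives a contiguous piece of the curve.
        midpoint = done + blocks[coord] / 2
        rank = min(int(midpoint * num_processes / total), num_processes - 1)
        assignment[coord] = rank
        done += blocks[coord]
    return assignment


blocks = {(x, y, z): 1.0 for x in range(2) for y in range(2) for z in range(2)}
assignment = partition(blocks, 4)
```

Cutting the curve into contiguous pieces tends to keep neighboring blocks on the same process, which limits communication without optimizing it explicitly.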

48 Prototype for dynamic load balancing. Dynamic load balancing: diffusive load balancing; global load balancing, master-slave
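
Diffusive load balancing can be illustrated with a first-order diffusion sweep on the process graph: every process shifts a fraction of the load difference to each neighbor. This is a simplified sketch that treats load as a continuous quantity, whereas a block-based code would have to move whole blocks. The function diffusion_step and the factor alpha are illustrative assumptions, not the prototype's algorithm.

```python
def diffusion_step(loads, neighbors, alpha=0.25):
    """One diffusion sweep; loads[i] is the workload of process i."""
    new_loads = list(loads)
    for i, adjacent in neighbors.items():
        for j in adjacent:
            if i < j:  # visit every edge of the process graph once
                flow = alpha * (loads[i] - loads[j])
                new_loads[i] -= flow
                new_loads[j] += flow
    return new_loads


# Four processes connected in a ring; all load starts on process 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
loads = [8.0, 0.0, 0.0, 0.0]
for _ in range(50):
    loads = diffusion_step(loads, ring)
print(loads)
```

Each sweep needs only neighbor-to-neighbor information, in contrast to the global master-slave variant, and the total load is conserved.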

49 Prototype for dynamic load balancing. Use case: rising bubbles; processes (count not preserved); main memory per core: 800 MB; blocks: 40^3

51 Prototype for dynamic load balancing [Figure: workload (in cells updated per coarsest time step) over the time step; series: maximum, average]

52 Prototype for dynamic load balancing [Figure: memory in MB over the time step; series: maximum, average]

53 Prototype for dynamic load balancing [Figure: workload and memory gain due to refinement over the time step; series: workload gain, memory gain]

54 Thank you for your attention! Slides, reports, theses, and animations are available for download at: www10.informatik.uni-erlangen.de