Introduction Workshop, 11th/12th November 2013. Lecture I: Hardware and Applications. Dr. Andreas Wolf, Head of the High-Performance Computing Group, Hochschulrechenzentrum (University Computing Centre)
Overview: Current and next system; Hardware sections; From PC to HPC; ccNUMA architectures; Accelerators; Applications
Up next: Current and next system; Hardware sections
The new high-performance computer
- A total of €15 million for the hardware and about €7.5 million for the building; the contract went to IBM, with a tender sum of €13.5 million
- Delivery by IBM in two phases: Phase I (2013) as a heterogeneous system with different sections; Phase II (end of 2014)
- Around 800 compute nodes; over 250 TFlops (peak) from processors; over 128 TFlops (peak) from accelerator cards (GPU, MIC); Phase II once more doubles the compute power
- Independent of the compute hardware: the water cooling is designed so that the waste heat can be used to heat the building
New building (October 2013) (photos)
Hardware 2013, currently (diagram: MPI and MEM sections, SCRATCH, Ethernet, Infiniband FDR-10)
- 704(+2) x MPI: 2 processors (Intel Sandy Bridge), 8 cores each, 2.6 GHz; 32 GByte (10% with 64 GByte); includes the UCluster
- 4 x MEM: 8 processors, 8 cores each; 1024 GByte
- File systems: Scratch 768 TByte, 20 GB/s
Hardware 2013, Phase I (diagram: MPI, MEM, ACC sections, HOME and SCRATCH, Ethernet, Infiniband FDR-10)
- 704(+2) x MPI: 2 processors (Intel Sandy Bridge), 8 cores each, 2.6 GHz; 32 GByte (10% with 64 GByte); includes the UCluster
- 4 x MEM: 8 processors, 8 cores each; 1024 GByte
- 44+24(+2) x ACC (in preparation): 2 processors + 2 accelerators, Nvidia Kepler or Intel Xeon Phi (formerly MIC); 32 GByte
- File systems: Scratch 768 TByte, 20 GB/s; Home 500 TByte, 5 GB/s
Hardware 2013, finally (diagram: MPI, ACC, SMP, MEM sections, HOME and SCRATCH, Ethernet, Infiniband)
- 32 x SMP: 4 processors (AMD Opteron), 12 cores each, 2.6 GHz; 64 GByte (8 nodes with 128 GByte); Infiniband QDR
Hardware 2014, Phase II (diagram: all sections extended)
- Additional MPI: 2 processors, successor architecture
- 4x additional MEM: 4 processors, successor architecture; 1024 GByte
- Additional ACC: 2 processors + 2 accelerators, successor architecture
- File systems: Scratch +768 TByte, 1.5 PByte overall
- Infiniband FDR, 54 Gbit/s, ~1 µs latency
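The quoted FDR latency of roughly 1 µs can be sanity-checked with a ping-pong test between two nodes. Below is a minimal sketch, assuming a working MPI installation (compiled with mpicc, started with two ranks on different nodes, e.g. mpirun -np 2 ./pingpong); the repetition count is illustrative.

```c
/* Minimal MPI ping-pong sketch to estimate point-to-point latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char byte = 0;
    const int reps = 10000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency is half the round-trip time */
        printf("approx. one-way latency: %.2f us\n",
               (t1 - t0) / reps / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```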
Hardware details (MPI, MEM, ACC, SMP)
- The details matter for efficient use
- In what follows: what makes these machines fast? Today it is the number of cores, not the clock rate
- What must you absolutely keep in mind?
- As a user: which resources do I request?
- As a programmer: what do I have to consider when writing code?
Up next: From PC to HPC
From the simple computer to the high-performance computer (figure sequence)
Instead of many ordinary PCs: large machines with several processors, or many particularly small machines
Multiprocessor system: MPI vs. SMP (figure)
MPI vs. SMP section (figure)
Up next: ccNUMA architectures
NUMA / ccNUMA
- Concerns data transfer between processors/cores
- Non-Uniform Memory Access: each processor has its own local memory; fast access only to that local memory!
- There is still one global main-memory address space
- Cache-coherent NUMA (ccNUMA): coherence is maintained with respect to the cache memory
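Because only local memory is fast on a ccNUMA node, data placement matters. On Linux the usual idiom is "first touch": a page is physically placed on the NUMA node of the thread that first writes it, so data should be initialized in parallel with the same thread layout as the later computation. A minimal OpenMP sketch of that idiom (array size is illustrative; compile e.g. with gcc -fopenmp):

```c
/* First-touch placement on a ccNUMA node with OpenMP. */
#include <stdio.h>
#include <stdlib.h>

#define N 100000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* parallel first touch: each thread places "its" pages locally */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }

    /* the compute loop now reads/writes mostly local memory */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

    printf("a[0] = %f\n", a[0]);
    free(a); free(b);
    return 0;
}
```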
ccNUMA system: MPI vs. SMP (figure: processors, modules, and their NUMA nodes)
ccNUMA system
- MPI: one AVX unit per core; one core = two hyperthreads; one NUMA node per processor
- SMP: one AVX unit per module; one module = two cores; two NUMA nodes per processor
MPI vs. SMP section
- MPI: 16 cores (AVX units) per node; two NUMA nodes per node; 32-64 GByte main memory
- SMP: 48 modules (AVX units) per node; eight NUMA nodes per node; 64-128 GByte main memory
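To check where threads actually land on such a node (cores, hyperthreads, NUMA nodes), a small probe can print each OpenMP thread's logical CPU. This sketch uses the glibc-specific sched_getcpu(); the pinning itself is assumed to be configured outside the program, e.g. via OMP_PROC_BIND=true or the batch system:

```c
/* Report the placement of OpenMP threads on logical CPUs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d of %d runs on logical CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }
    return 0;
}
```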
MEM vs. SMP section (figure)
MEM vs. MEM+MAX5 (figure: MEM node vs. MEM node with MAX5 memory-expansion units)
MEM vs. MEM+MAX5
- MEM: 64 cores (no AVX units) per node; eight NUMA nodes per node; 1024 GByte main memory; good for latency-dependent applications
- MEM+MAX5: 64 cores (no AVX units) per node; ten NUMA nodes per node; 1024 GByte main memory; good for memory-bandwidth-dependent applications
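Whether a code falls into the latency-bound or the bandwidth-bound class can be probed with a triad loop in the spirit of the STREAM benchmark. This is only a rough sketch, not the official benchmark, and the array size is illustrative:

```c
/* STREAM-triad-style loop to estimate sustained memory bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 50000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);

    #pragma omp parallel for schedule(static)   /* parallel first touch */
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* triad */
    double t1 = omp_get_wtime();

    /* three arrays of N doubles cross the memory bus */
    printf("approx. bandwidth: %.1f GB/s (a[0]=%f)\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9, a[0]);

    free(a); free(b); free(c);
    return 0;
}
```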
Up next: Accelerators
MPI vs. ACC section (figure)
Nvidia K20X vs. Intel Xeon Phi: ACC-G (K20X GPUs) vs. ACC-M (Xeon Phi) (figure)
Nvidia K20X vs. Intel Xeon Phi
- ACC-G: 16 cores (AVX units) per node; two NUMA nodes per node; 32 GByte main memory; two Nvidia Tesla K20X; programmed via CUDA, OpenACC, OpenCL
- ACC-M: 16 cores (AVX units) per node; two NUMA nodes per node; 32 GByte main memory; two Intel Xeon Phi; programmed via OpenMP, MPI, OpenCL
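For a flavour of the directive-based models listed above, here is a minimal OpenACC sketch (a saxpy-style kernel in C). It assumes an OpenACC-capable compiler of that era, e.g. PGI's pgcc -acc; the same loop could instead be offloaded to the Xeon Phi with OpenMP:

```c
/* Directive-based GPU offload with OpenACC: y = 2*x + y. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* data is copied to the accelerator and the loop runs there */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```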
Up next: Applications
Main targets for the sections
- MPI: mainly for MPI (Message Passing Interface); scalable, distributed-memory applications up to hundreds of cores (see the hybrid sketch after this list)
- MEM: serial applications that need hundreds of GByte of main memory; shared-memory parallelized applications
- ACC: only when the use of accelerators is possible and efficient
- SMP: shared-memory parallelized applications; the need for hundreds of GByte of main memory; efficient use of hundreds of cores
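A minimal sketch of the hybrid layout these targets suggest: MPI ranks across nodes (MPI section) combined with OpenMP threads within a node (SMP/MEM sections). It only reports the process/thread structure and assumes an MPI library with MPI_THREAD_FUNNELED support:

```c
/* Hybrid MPI + OpenMP skeleton: ranks across nodes, threads per node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* distributed memory between ranks, shared memory between threads */
        printf("rank %d/%d, thread %d/%d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```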
Main applications for the sections
- MPI: several CFD simulation codes, e.g. FASTEST, OpenFOAM...
- MEM: applications for grid generation and database searching, e.g. Matlab...
- ACC: applications from chemistry and biology
- SMP: similar to MEM
Main applications for the sections: ask your software developer about the implemented parallelization technique
- MPI: several CFD simulation codes, e.g. FASTEST, OpenFOAM...
- MEM: applications for grid generation and database searching, e.g. Matlab...; it depends on the technique: distributed- vs. shared-memory parallelization (MPI, OpenMP, POSIX threads; see the sketch after this list)
- ACC: it depends on the additional support for accelerators, based on CUDA, OpenCL, OpenACC, OpenMP...
- SMP: similar to MEM
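Of the shared-memory techniques named above, POSIX threads is the most explicit. A minimal sketch of a threaded array sum (thread count and array size are illustrative; compile with -pthread):

```c
/* Shared-memory parallelism with POSIX threads: partial array sums. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define T 4

static double data[N];
static double partial[T];

static void *worker(void *arg)
{
    long t = (long)arg;
    double s = 0.0;
    for (long i = t * (N / T); i < (t + 1) * (N / T); i++)
        s += data[i];
    partial[t] = s;   /* each thread writes its own slot: no race */
    return NULL;
}

int main(void)
{
    pthread_t th[T];
    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long t = 0; t < T; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);

    double sum = 0.0;
    for (long t = 0; t < T; t++) {
        pthread_join(th[t], NULL);
        sum += partial[t];
    }
    printf("sum = %f\n", sum);
    return 0;
}
```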
What it looks like (photos)
Thank you for your attention. Questions?