Multicore Parallelism in Modern CPUs

Johannes Hofmann, 21.5.2014
Seminar "Architekturen von Multi- und Vielkern-Prozessoren" (Architectures of Multi- and Many-Core Processors)
Universität Erlangen-Nürnberg, Lehrstuhl für Rechnerarchitektur (Chair of Computer Architecture)

Informatik 3 HPC Cluster
Cluster documentation: https://faui36a.informatik.uni-erlangen.de/trac/puppet
Excerpt from the list of systems (https://faui36a.informatik.uni-erlangen.de/trac/puppet/wiki/systemlist):
  Nomad (dual-socket Sandy Bridge-EP), 2x Xeon Phi
faui36a and faui36b are the cluster head nodes.
    faui36a:~ hofmann$ lsgpu
    faui36g ; geforce_gtx580 ; 1
    faui36g ; geforce_gtx670 ; 1
    faui36g ; tesla_c1060 ; 1
    faui36i ; tesla_c2050 ; 1
    faui36i ; tesla_c2050ecc ; 1
    faui36i ; tesla_k20 ; 4
    faui36j ; altera_de4 ; 1
    faui36j ; geforce_gtx480 ; 1
    faui36j ; radeon7970 ; 1
    faui36j ; tesla_c1060 ; 1

SLURM
Log in to a head node:
    ssh faui36b
Submit an interactive job:
    srun -w nomad -c1 -t100:00 --pty bash
Hints:
  -w nomad    Host to request
  -c1         Number of requested cores (1 to develop/compile, 32 to measure)
  -t100:00    Reservation time; 100 minutes is the maximum. Don't forget to save your work: when your time runs out, you'll be kicked from the system!
  --pty bash  Interactive job. Problems with the terminal? Connect manually from another terminal: ssh nomad

Parallelism in Modern x86 CPUs
  Multicore parallelism
  Single Instruction Multiple Data (SIMD) vectorization
  Instruction-level parallelism (ILP)
  Simultaneous multithreading (SMT)

Structure of Our Sandy Bridge-EP System nomad
  Two-socket Sandy Bridge-EP system (2x Xeon E5-2670)
  Two NUMA domains
  Eight cores per socket
  Nominal CPU frequency: 2.6 GHz
  2-way SMT per core
  Six execution units per core (superscalar design)
  Advanced Vector Extensions (AVX)
  Dedicated L1 and L2 cache per core
  Cores communicate via the shared L3 cache
[Figure: Package 0 and Package 1, each with cores 0-7, a shared LLC, and a memory controller (MC) attached to its own memory; the packages are linked via QPI.]

8-Core Sandy Bridge-EP: Processor Package

Exploiting Multicore Parallelism
OpenMP is based on the fork-join programming model:
  Programs start with a single thread
  Additional threads (the thread team) are forked for parallel regions
  Implicit barrier at the end of a parallel region
  Join after the parallel region
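The fork-join behavior can be observed with a small sketch. The function name count_team_threads is hypothetical and not part of the slides. Each thread of the team increments a shared counter once; after the join, the initial thread sees the team size. Built without OpenMP support, the pragmas are ignored and the function returns 1.

```c
/* Hypothetical helper: returns how many threads executed the parallel region. */
int count_team_threads(void)
{
    int team_size = 0;      /* declared outside the region, hence shared */

    /* fork: the runtime creates the thread team here */
    #pragma omp parallel
    {
        /* atomic update so that concurrent increments are not lost */
        #pragma omp atomic
        team_size += 1;
    }   /* implicit barrier, then join: only the initial thread continues */

    return team_size;
}
```

Compiled with gcc -fopenmp and run with OMP_NUM_THREADS=4, the function would be expected to return 4, provided the runtime grants the requested team size.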

OpenMP
Goals of OpenMP:
  Very easy to use
  Sequential equivalence: the parallel program produces the same results as the sequential variant
  Incremental parallelization: an existing program should be easy to parallelize step by step

OpenMP
Example: Hello World
Sequential variant:
    $ cat hello.c
    #include <stdio.h>
    int main(int argc, char **argv)
    {
        printf("Hello World!\n");
        return 0;
    }
    $ gcc -o hello hello.c
    $ ./hello
    Hello World!
Parallel variant:
    $ cat hello_omp.c
    #include <omp.h>
    #include <stdio.h>
    int main(int argc, char **argv)
    {
        #pragma omp parallel
        {
            printf("Hello World!\n");
        }
        return 0;
    }
    $ gcc -fopenmp -o hello_omp hello_omp.c
    $ ./hello_omp
    Hello World!
    Hello World!
    ...

OpenMP
The number of threads is set by the runtime environment. The programmer can influence it:
  Via the environment variable OMP_NUM_THREADS:
    $ OMP_NUM_THREADS=3 ./hello_omp
    Hello World!
    Hello World!
    Hello World!
  Via a clause on the compiler directive in the C code:
    #pragma omp parallel num_threads(3)

OpenMP Thread Count at Runtime
    #include <stdio.h>
    #include <omp.h>
    int main(int argc, char **argv)
    {
        #pragma omp parallel num_threads(4)
        {
            int my_num, total_threads;
            my_num = omp_get_thread_num();
            total_threads = omp_get_num_threads();
            printf("Hello World from thread %d of %d!\n", my_num, total_threads);
        }
        return 0;
    }
Possible output (the order varies from run to run):
    Hello World from thread 2 of 4!
    Hello World from thread 0 of 4!
    Hello World from thread 3 of 4!
    Hello World from thread 1 of 4!

OpenMP Private vs. Shared Variables
    #include <stdio.h>
    #include <omp.h>
    int main(int argc, char **argv)
    {
        int a; /* shared variable */
        #pragma omp parallel
        {
            int b; /* private variable */
            a = omp_get_thread_num(); // Probably not what you want: race condition
            b = omp_get_thread_num(); // Okay
        }
        return 0;
    }

OpenMP Worksharing Constructs
OpenMP provides several pragmas that make work distribution and thread management easier for the programmer, e.g.
  Parallel for construct
  Sections construct
  Single construct

OpenMP For Construct
    void add(double *A, double *B, double *C, int N)
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i=0; i<N; ++i)
                A[i] = B[i] + C[i];
        }
    }
Distribution with 4 threads, N = 100,000:
    Thread   Start    End
    0        0        24999
    1        25000    49999
    2        50000    74999
    3        75000    99999
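The distribution in the table corresponds to splitting the iteration space into one contiguous block per thread. The following sketch computes such blocks; block_range is a hypothetical helper, and the way the remainder is spread over the first threads is an assumption, since an actual OpenMP runtime may divide leftover iterations differently.

```c
/* Hypothetical helper: computes the inclusive iteration range [*start, *end]
   that thread `tid` would get when n iterations are split into
   num_threads contiguous blocks of nearly equal size. */
void block_range(int n, int num_threads, int tid, int *start, int *end)
{
    int base = n / num_threads;   /* iterations every thread gets */
    int rest = n % num_threads;   /* assumption: first `rest` threads get one extra */

    *start = tid * base + (tid < rest ? tid : rest);
    *end   = *start + base + (tid < rest ? 1 : 0) - 1;
}
```

For n = 100000 and four threads, thread 1 gets the range 25000 to 49999, matching the table.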

OpenMP Section Construct
The program code is divided into several sections
Each section is executed by exactly one thread
    #pragma omp sections
    {
        #pragma omp section
        {
            // executed by one thread
            // ...
        }
        #pragma omp section
        {
            // executed by another thread
            // ...
        }
        // ...
    }

OpenMP Section Construct Example
Sequential:
    for (int i=0; i<n; ++i) {
        sum+=a[i];
        prod*=a[i];
    }
Parallel with OpenMP:
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                for (int i=0; i<n; ++i)
                    sum+=a[i];
            }
            #pragma omp section
            {
                for (int i=0; i<n; ++i)
                    prod*=a[i];
            }
        }
    }

OpenMP Single Construct
Program code inside a parallel region that is executed by only one thread, for example to write to shared variables
    #include <stdio.h>
    #include <omp.h>
    int main(int argc, char **argv)
    {
        int a; /* shared variable */
        #pragma omp parallel
        {
            int b; /* private variable */
            #pragma omp single
            {
                a = omp_get_thread_num(); // Okay
                b = omp_get_thread_num(); // Okay
            }
        }
        return 0;
    }

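OpenMP Worksharing Constructs
Shorthand notation for worksharing constructs: a parallel region that contains nothing but one for or sections construct,
    #pragma omp parallel
    {
        #pragma omp <for | sections>
        { ... }
    }
can be abbreviated to
    #pragma omp parallel <for | sections>
    { ... }
OpenMP defines no combined form for single; it always has to be nested inside a parallel region.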

OpenMP Worksharing Constructs
Example, long form:
    void add(double *A, double *B, double *C, int N)
    {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i=0; i<N; ++i)
                A[i] = B[i] + C[i];
        }
    }
Short form:
    void add(double *A, double *B, double *C, int N)
    {
        #pragma omp parallel for
        for (int i=0; i<N; ++i)
            A[i] = B[i] + C[i];
    }

OpenMP Memory Management
Variables declared outside the parallel region are shared by default
    int my_id=0;
    #pragma omp parallel
    {
        my_id=omp_get_thread_num(); // Bad
    }
private(<list>) makes the compiler create private copies of the listed variables (initial value undefined!)
    int my_id=0;
    #pragma omp parallel private(my_id)
    {
        my_id=omp_get_thread_num(); // Okay
    }

OpenMP Memory Management
Further options:
  firstprivate(<list>)
    Like private(<list>), but each copy is initialized with the value the original variable has outside the region
  lastprivate(<list>)
    For for loops only. The value from the last loop iteration is copied back to the original variable
    int value;
    #pragma omp parallel for lastprivate(value)
    for (int i=0; i<N; ++i)
        value=i;
    // value is now set to N-1
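Both clauses can be combined on one loop. In the following sketch, last_shifted_index is a hypothetical function: firstprivate gives every thread its own copy of offset initialized with the caller's value, and lastprivate copies the value written in the sequentially last iteration back to the original variable. The sketch assumes n > 0, because with zero iterations nothing is written back.

```c
/* Hypothetical example: returns offset + (n - 1), assuming n > 0. */
int last_shifted_index(int n, int offset)
{
    int value = -1;   /* overwritten by the last iteration via lastprivate */

    #pragma omp parallel for firstprivate(offset) lastprivate(value)
    for (int i = 0; i < n; ++i)
        value = i + offset;   /* each thread writes its private copy */

    return value;             /* value from iteration i == n - 1 */
}
```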

OpenMP Reduction Clause
A reduction aggregates values into a single result using a binary operator
    double sum=0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i=0; i<n; ++i)
        sum+=a[i];
Possible operators: +, -, *, &, |, ^, &&, ||
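A loop may carry more than one reduction clause. As a sketch (sum_and_product is a hypothetical function name), the sum and the product of an array can be computed in one pass, each with its own operator:

```c
/* Hypothetical example: computes the sum and the product of a[0..n-1]. */
void sum_and_product(const double *a, int n, double *sum_out, double *prod_out)
{
    double sum = 0.0;    /* neutral element of + */
    double prod = 1.0;   /* neutral element of * */

    /* every thread works on private copies of sum and prod;
       the copies are combined when the loop ends */
    #pragma omp parallel for reduction(+:sum) reduction(*:prod)
    for (int i = 0; i < n; ++i) {
        sum += a[i];
        prod *= a[i];
    }

    *sum_out = sum;
    *prod_out = prod;
}
```

Because floating-point addition and multiplication are not associative, the parallel result may differ from the sequential one in the last bits when the values are not exactly representable.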

OpenMP Synchronization
Synchronization constructs:
  #pragma omp flush [(<list>)]
    Ensures a consistent view of memory across all threads, either for all visible variables or for those in the list
    Normally inserted automatically by the compiler
  #pragma omp barrier
    Threads wait for each other at this point
  #pragma omp master {...}
    The section is executed only by the master thread
  #pragma omp critical {...}
    The section may be entered by only one thread at a time, which serializes its execution
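As an illustration of critical, the following sketch finds the maximum of an array. The function max_value is hypothetical and assumes n >= 1. Each thread first computes a private maximum without any synchronization; only the final merge into the shared result is placed in a critical section, so the serialized part stays short.

```c
/* Hypothetical example: returns the largest element of a[0..n-1], n >= 1. */
double max_value(const double *a, int n)
{
    double best = a[0];              /* shared result */

    #pragma omp parallel
    {
        double local = a[0];         /* private per-thread maximum */

        #pragma omp for
        for (int i = 1; i < n; ++i)
            if (a[i] > local)
                local = a[i];

        /* only one thread at a time may update the shared result */
        #pragma omp critical
        {
            if (local > best)
                best = local;
        }
    }

    return best;
}
```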

OpenMP Synchronization
Synchronization constructs:
  In most cases the high-level synchronization constructs are sufficient
  If they cannot express the required synchronization, low-level locking routines are available, e.g.
    void omp_init_lock(omp_lock_t *lock)
    void omp_set_lock(omp_lock_t *lock)
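A minimal sketch of the lock routines follows. Besides omp_init_lock and omp_set_lock it uses omp_unset_lock and omp_destroy_lock, which are also part of the OpenMP lock API. The function count_even and the fallback definitions in the #else branch are assumptions added so that the sketch also builds without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumed fallback for builds without -fopenmp:
   with a single thread the lock calls can be trivial. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *lock)    { *lock = 0; }
static void omp_set_lock(omp_lock_t *lock)     { *lock = 1; }
static void omp_unset_lock(omp_lock_t *lock)   { *lock = 0; }
static void omp_destroy_lock(omp_lock_t *lock) { (void)lock; }
#endif

/* Hypothetical example: counts the even elements of a[0..n-1]. */
long count_even(const int *a, int n)
{
    omp_lock_t lock;
    long count = 0;                  /* shared counter protected by the lock */

    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (a[i] % 2 == 0) {
            omp_set_lock(&lock);     /* blocks until the lock is free */
            ++count;
            omp_unset_lock(&lock);   /* release for the next thread */
        }
    }

    omp_destroy_lock(&lock);
    return count;
}
```

For a simple counter like this, a reduction clause would be the more idiomatic choice; the lock is used here only to show the call sequence.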

OpenMP Scheduling
The #pragma omp parallel for construct distributes the loop iterations among the threads
Choosing a suitable scheduling strategy pays off:
  If every iteration performs the same amount of work, static scheduling is a good fit
  If the amount of work varies, dynamic scheduling is usually the better choice
The scheduling strategy can be selected by the programmer:
    #pragma omp parallel for schedule(<sched> [,chunk])
  sched: static, dynamic, guided, runtime
  chunk: integer value

OpenMP Scheduling
schedule(static[,chunk])
  Iterations are grouped into blocks of chunk iterations each
  Blocks are distributed evenly among the threads
schedule(dynamic[,chunk])
  Iterations are grouped into blocks of chunk iterations each
  Initially every thread receives one block; the remaining blocks are kept in a queue
  Whenever a thread has finished a block, it fetches a new one from the queue
  Unlike static scheduling, dynamic scheduling causes overhead at runtime

OpenMP Scheduling
schedule(guided[,chunk])
  Like dynamic scheduling, but starts with a large block size (larger blocks mean fewer accesses to the queue)
  The block size approaches chunk over time
schedule(runtime)
  The scheduling strategy is not yet fixed when the program is compiled
  It is set at runtime via the environment variable OMP_SCHEDULE
    $ OMP_SCHEDULE="static,1024" ./program
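A loop whose iterations differ in cost is a typical candidate for dynamic scheduling. In the sketch below, counting primes by trial division gets more expensive for larger numbers, so equally sized static blocks would leave the thread with the largest numbers working longest. The function names and the chunk size of 64 are assumptions chosen for illustration, not tuned values.

```c
/* Hypothetical helper: trial division, cost grows with k. */
static int is_prime(int k)
{
    if (k < 2)
        return 0;
    for (int d = 2; (long)d * d <= k; ++d)
        if (k % d == 0)
            return 0;
    return 1;
}

/* Hypothetical example: counts the primes in [0, limit). */
int count_primes(int limit)
{
    int count = 0;

    /* threads fetch blocks of 64 iterations from the queue as they finish */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (int k = 0; k < limit; ++k)
        count += is_prime(k);

    return count;
}
```

Replacing schedule(dynamic, 64) with schedule(runtime) would allow comparing strategies through OMP_SCHEDULE without recompiling.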

OpenMP Further Reading
OpenMP specification: http://openmp.org/wp/openmp-specifications/
Online OpenMP tutorial: https://computing.llnl.gov/tutorials/openmp/

Thread Pinning
The processor is used by several programs
  Even if no other user is logged in and no other applications are running, the kernel is active and regularly interrupts your program
Ideally, your program then resumes on the same core it was running on before the interruption
  The cache of that core still holds the data your program accesses
This is not guaranteed, however; your program may also continue on a different core
  That core has none of the data in its cache; the data first has to be transferred from the other cache
With thread pinning you give the operating system a list of cores (more precisely: hardware threads) on which your program may run
  This lets you bind different threads of your program to specific cores

Thread Pinning
LIKWID: developed at the RRZE of FAU; a Swiss Army knife for HPC
  likwid-topology    Shows information about the node
  likwid-pin         Sets thread affinity
  likwid-bench       Microbenchmarks
  likwid-perfctr     Reads the hardware performance counters
  likwid-powermeter  Measures energy consumption
As an alternative to LIKWID, the Thread Affinity Interface can be used:
https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm

Thread Pinning
Without arguments, all logical cores are used:
    likwid-pin ./binary
Examples:
    likwid-pin -c S0:0-7     (one socket, also called package or chip)
    likwid-pin -c N:0-15     (both sockets, i.e. the whole node)
    likwid-pin -c N:0-31     (whole node with SMT)
[Figure: Package 0 and Package 1, each with cores 0-7, a shared LLC, and a memory controller (MC) attached to its own memory; the packages are linked via QPI.]
More information: https://code.google.com/p/likwid/

Single Instruction Multiple Data (SIMD) Vectorization
Starting with Sandy Bridge we get Advanced Vector Extensions (AVX) with 256-bit SIMD registers, which extend the previous Streaming SIMD Extensions (SSE) with 128-bit SIMD registers.
[Figure: register layout. A 256-bit ymm register holds eight 32-bit floats A[0]..A[7] or four 64-bit doubles A[0]..A[3]; its lower 128 bits form the xmm register, which holds four floats or two doubles.]
Using AVX, we can perform four double-precision (or eight single-precision) operations with a single instruction.
[Figure: data paths between core, L1 cache, and L2 cache, labeled 16 B, 32 B, and 2x16 B.]
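The number of operations per instruction follows directly from the register width and the element size. As a worked sketch (simd_lanes is a hypothetical helper, and the sizes assume the usual 4-byte float and 8-byte double):

```c
#include <stddef.h>

/* Hypothetical helper: number of elements of the given size (in bytes)
   that fit into a SIMD register of the given width (in bits). */
size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (8 * element_bytes);
}
```

With these assumptions, simd_lanes(256, sizeof(double)) gives 4 and simd_lanes(256, sizeof(float)) gives 8 for AVX, while a 128-bit SSE register yields 2 doubles or 4 floats.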

ILP / Superscalarity
Superscalar design: six issue ports/execution units that can execute instructions simultaneously
  E.g. floating-point add (Port 1) and floating-point multiply (Port 0)
[Figure: block diagram of a Sandy Bridge core.
  Front end: 32 KB L1 instruction cache, predecode, instruction queue, one complex and three simple decoders, MSROM, 1536 uop (L0) cache, decoded instruction queue; 4 uop/cycle front-end limit
  Renamer / Scheduler / Dispatcher
  Port 0: ALU, V-MUL, V-SHUF, Fdiv, 256-bit FP MUL, 256-bit FP Blend
  Port 1: ALU, V-ADD, V-SHUF, 256-bit FP ADD
  Port 5: ALU, JMP, 256-bit FP Shuf, 256-bit FP Bool, 256-bit FP Blend
  Port 2, Port 3: Load Data, AGU
  Port 4: Store Data
  Memory control, 32 KB L1 data cache, 256 KB unified L2 cache; data paths labeled 128 bit and 256 bit]

Simultaneous Multithreading
Control logic per core is duplicated (program counter, registers, etc.)
2-way SMT: core resources (execution units) are shared between two hardware threads
[Figure: block diagram of a Sandy Bridge core with front end (32 KB L1 instruction cache, predecode, instruction queue, decoders, MSROM, 1536 uop (L0) cache, decoded instruction queue), Renamer / Scheduler / Dispatcher, execution ports 0-5, memory control, 32 KB L1 data cache, and 256 KB unified L2 cache.]