
CUDA by Example
Parallel Computing on the Graphics Card
Leipzig, 31 March 2017
Paul Jähne, SethosII

Why?

Architecture
CPU: low latency, large caches, better suited to serial execution
GPU: high compute throughput, high memory bandwidth, better suited to parallel execution

What is CUDA?
Compute Unified Device Architecture
API for programming Nvidia GPUs
C/C++
Getting started:
http://docs.nvidia.com/cuda/index.html
https://developer.nvidia.com/cuda-downloads

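Hello World

    #include <stdio.h>

    // kernel: runs on the GPU
    __global__ void greet() {
        printf("hello World!\n");
    }

    int main() {
        // launch one block with one thread
        greet<<<1, 1>>>();
        // destroys the context, which also flushes the device-side printf buffer
        cudaDeviceReset();
        return 0;
    }

Compilation: nvcc hello_world.cu -o hello_world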

Execution model

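Parallel Hello World

    #include <stdio.h>

    __global__ void greet() {
        // global thread number (1-based) and total number of threads in the grid
        printf("thread %d of %d\n",
               blockIdx.x * blockDim.x + threadIdx.x + 1,
               blockDim.x * gridDim.x);
    }

    int main() {
        // launch two blocks with two threads each
        greet<<<2, 2>>>();
        cudaDeviceReset();
        return 0;
    }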
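The index arithmetic in the kernel can be tried out without a GPU. The following is a minimal host-side sketch in plain C++; the helper names globalThreadIndex and threadLabels are hypothetical and only mirror the expression blockIdx.x * blockDim.x + threadIdx.x from the kernel.

```cpp
#include <vector>

// Hypothetical helper: 0-based global index of a thread in a 1D launch,
// mirroring blockIdx.x * blockDim.x + threadIdx.x from the kernel.
int globalThreadIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Lists the 1-based thread numbers that a launch <<<gridDim, blockDim>>>
// would print. The loops visit blocks in order; a real GPU gives no such
// ordering guarantee between blocks.
std::vector<int> threadLabels(int gridDim, int blockDim) {
    std::vector<int> labels;
    for (int b = 0; b < gridDim; b++) {
        for (int t = 0; t < blockDim; t++) {
            labels.push_back(globalThreadIndex(b, blockDim, t) + 1);
        }
    }
    return labels;
}
```

For the launch greet<<<2, 2>>>() this enumerates the numbers 1 to 4, one per thread, although the order of the printed lines on the device may differ.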

Memory model

General program flow
1. Allocate resources
2. Copy data to the GPU
3. Launch the kernel / run the computation
4. Copy the results back
5. Release resources
CUDA 6.0: Unified Memory (steps 2 and 4 no longer need explicit copies)

Example code, CUDA 6.0

    int main() {
        // declare pointer
        int* a;
        // allocate managed memory, accessible from both GPU and CPU
        cudaMallocManaged((void**) &a, SIZE);
        // fill a with data on host...
        // kernel invocation
        kernel<<<nBlocks, NTHREADS>>>(a);
        // wait for the kernel before the host touches a again
        cudaDeviceSynchronize();
        // do something with a on host...
        // clean up
        cudaFree(a);
        return 0;
    }

Matrix multiplication
$C_{ij} = \sum_k A_{ik} B_{kj}$

Procedure

CPU code

    // c = a * b for square dim x dim matrices in row-major storage
    void matmul(double *a, double *b, double *c, int dim) {
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                double sum = 0;
                for (int k = 0; k < dim; k++) {
                    sum += a[i * dim + k] * b[k * dim + j];
                }
                c[i * dim + j] = sum;
            }
        }
    }

GPU host code

    // device pointers
    double *d_a, *d_b, *d_c;
    // allocate memory on GPU
    cudaMalloc((void**) &d_a, dim * dim * sizeof(double));
    cudaMalloc((void**) &d_b, dim * dim * sizeof(double));
    cudaMalloc((void**) &d_c, dim * dim * sizeof(double));
    // copy data to the GPU
    cudaMemcpy(d_a, a, dim * dim * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, dim * dim * sizeof(double), cudaMemcpyHostToDevice);
    // kernel invocation
    kernel<<<nBlocks, NTHREADS>>>(d_a, d_b, d_c, dim);
    // copy result back to the CPU
    cudaMemcpy(c, d_c, dim * dim * sizeof(double), cudaMemcpyDeviceToHost);
    // clean up
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

GPU device code

    // each thread computes one element of d_c
    __global__ void kernel(double *d_a, double *d_b, double *d_c, int dim) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // threads outside the matrix do nothing
        if (x < dim && y < dim) {
            double sum = 0;
            for (int i = 0; i < dim; i++) {
                sum += d_a[x * dim + i] * d_b[i * dim + y];
            }
            d_c[x * dim + y] = sum;
        }
    }

Timing

    float cuElapsedTime;  // elapsed time in milliseconds
    cudaEvent_t cuStart, cuStop;
    cudaEventCreate(&cuStart);
    cudaEventCreate(&cuStop);
    cudaEventRecord(cuStart, 0);
    // code to measure
    cudaEventRecord(cuStop, 0);
    // wait until the stop event has actually been reached
    cudaEventSynchronize(cuStop);
    cudaEventElapsedTime(&cuElapsedTime, cuStart, cuStop);
    cudaEventDestroy(cuStart);
    cudaEventDestroy(cuStop);

Measurement
Tesla K20, double precision
expected: 1.2 Tflop/s
measured: 8 Gflop/s :(

All just marketing?
$1.2 \cdot 10^{12}$ flop/s, but these operations need their data
here fused multiply-add (FMA): $c \mathrel{+}= a \cdot b$
three loads and one store for two arithmetic operations
$4 \cdot 8$ bytes $= 32$ bytes for two arithmetic operations
$1.2 \cdot 10^{12} \cdot 16$ bytes $= 19.2$ TB/s memory bandwidth required
145 GB/s memory bandwidth available

Memory bandwidth is the bottleneck
Optimizations: caching, loop unrolling, ...
necessary for high-performance code
make the program code harder to read
What does shared memory gain?

Idea

GPU device code

    // assumes 32 x 32 thread blocks and dim divisible by 32
    __global__ void kernelShared(double *a, double *b, double *c, int dim) {
        // one tile of a and one tile of b per thread block
        __shared__ double as[32][32];
        __shared__ double bs[32][32];
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int xdim = x * dim;
        int blocks = dim / 32;
        double sum = 0;
        for (int m = 0; m < blocks; m++) {
            int m32 = m * 32;
            // each thread loads one element of the current tiles
            as[threadIdx.x][threadIdx.y] = a[xdim + (m32 + threadIdx.y)];
            bs[threadIdx.x][threadIdx.y] = b[(m32 + threadIdx.x) * dim + y];
            // wait until both tiles are completely loaded
            __syncthreads();
            for (int i = 0; i < 32; i++) {
                sum += as[threadIdx.x][i] * bs[i][threadIdx.y];
            }
            // wait until all threads are done before the tiles are overwritten
            __syncthreads();
        }
        c[xdim + y] = sum;
    }
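The tiling scheme of the kernel can be checked on the host. The following is a minimal sketch in plain C++ that replays the same tile loads and partial sums sequentially and can be compared against the naive triple loop. It is an illustration of the idea, not the talk's code: the names matmulNaive, matmulTiled and TILE are hypothetical, and dim is assumed to be a multiple of 32, as in the kernel.

```cpp
#include <vector>

constexpr int TILE = 32;  // matches the 32 x 32 shared-memory tiles

// Naive reference with the same loop structure as the CPU code.
void matmulNaive(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c, int dim) {
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++) {
            double sum = 0;
            for (int k = 0; k < dim; k++) sum += a[i * dim + k] * b[k * dim + j];
            c[i * dim + j] = sum;
        }
}

// Tiled version: the two outer loops play the role of the thread blocks,
// the tx/ty loops play the role of the threads inside one block.
void matmulTiled(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c, int dim) {
    for (int bx = 0; bx < dim; bx += TILE)
        for (int by = 0; by < dim; by += TILE) {
            double sum[TILE][TILE] = {};
            for (int m = 0; m < dim; m += TILE) {
                double as[TILE][TILE], bs[TILE][TILE];
                // load phase: one element of each tile per (tx, ty)
                for (int tx = 0; tx < TILE; tx++)
                    for (int ty = 0; ty < TILE; ty++) {
                        as[tx][ty] = a[(bx + tx) * dim + m + ty];
                        bs[tx][ty] = b[(m + tx) * dim + by + ty];
                    }
                // compute phase: partial sums from the tiles only
                for (int tx = 0; tx < TILE; tx++)
                    for (int ty = 0; ty < TILE; ty++)
                        for (int i = 0; i < TILE; i++)
                            sum[tx][ty] += as[tx][i] * bs[i][ty];
            }
            for (int tx = 0; tx < TILE; tx++)
                for (int ty = 0; ty < TILE; ty++)
                    c[(bx + tx) * dim + by + ty] = sum[tx][ty];
        }
}
```

In this scheme every element of a and b is fetched from main memory once per tile row or column, that is dim / 32 times rather than dim times, which is where the reduction in memory traffic comes from.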

Measurement
Tesla K20, double precision
expected: 1.2 Tflop/s
measured: 160 Gflop/s

Alternative: optimized libraries

Library     Application area
cuBLAS      matrix/vector computations
cuSPARSE    sparse matrix/vector computations
cuFFT       fast Fourier transform
cuRAND      random number generation
cuDNN       neural networks
cuSOLVER    systems of equations
Thrust      Standard Template Library for CUDA

and many more: https://developer.nvidia.com/gpu-accelerated-libraries

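Matrix multiplication with cuBLAS

    // allocate memory on GPU...
    // copy data to the GPU...
    // initialize cuBLAS
    cublasHandle_t cublasHandle;
    cublasCreate(&cublasHandle);
    // the scaling factors are passed by pointer
    const double alpha = 1, beta = 0;
    // call the library function; cuBLAS assumes column-major storage, so for
    // row-major arrays d_b is passed before d_a to obtain a row-major A * B
    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                &alpha, d_b, dim, d_a, dim, &beta, d_c, dim);
    // copy result back to the CPU...
    // clean up
    cublasDestroy(cublasHandle);
    ...

cuBLAS computes: $C = \alpha \, \mathrm{op}(A) \, \mathrm{op}(B) + \beta C$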
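cuBLAS follows the BLAS convention of column-major storage, whereas the CPU code and the kernels in this talk index their arrays row-major, so the order of the two operands matters. The following is a small host-side sketch in C++ of what a call without transposition computes on square matrices; gemmColumnMajor is a hypothetical reference function for illustration, not part of cuBLAS.

```cpp
#include <vector>

// Reference for C = alpha * A * B + beta * C on square n x n matrices in
// column-major storage: element (row, col) lives at index col * n + row.
void gemmColumnMajor(int n, double alpha, const std::vector<double>& A,
                     const std::vector<double>& B, double beta,
                     std::vector<double>& C) {
    for (int col = 0; col < n; col++) {
        for (int row = 0; row < n; row++) {
            double sum = 0;
            for (int k = 0; k < n; k++) {
                sum += A[k * n + row] * B[col * n + k];  // A(row, k) * B(k, col)
            }
            C[col * n + row] = alpha * sum + beta * C[col * n + row];
        }
    }
}
```

A row-major buffer read as column-major is the transposed matrix. Passing the row-major buffers in the order b, a therefore computes the transpose of A * B in column-major terms, which is exactly A * B when the result buffer is read row-major again. Passing them in the order a, b yields B * A instead.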

Measurement
Tesla K20, double precision
expected: 1.2 Tflop/s
measured: 1 Tflop/s :)

Remarks on the example
the GPU is only well utilized for matrix dimensions above 1000
the example parallelizes well
the example is very simple
nobody multiplies matrices all day long

What is suited to GPGPU?
data parallelism
little branching
large problem sizes
highly parallel problems
the more arithmetic operations, the better
limited by the size of the GPU memory

Sources
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
CUDA by Example (2010)
The CUDA Handbook (2013)

Applications
Hash collisions

1. Charlie is to write a report $K$ and send it to Bob.
2. Alice is to certify $K$ with a digital signature.
3. Alice signs with $E_A(M(K))$ and sends the result to Bob.
4. In reality, Charlie sends two different reports $K$ and $K'$ with $M(K) = M(K')$.
5. Bob believes that Alice signed $K'$, but Alice signed $K$.
Implemented in CUDA with a 32-bit hash function based on SHA-256

Procedure
compute $K$
compute $K'$
check $K$ and $K'$ for collisions

Implementation
1. $2^{16}$ threads
2. for every thread $t$:
   1. $t$ computes one $k \in K$ and writes it to memory common to all threads
   2. global synchronization
   3. $t$ computes one $k' \in K'$
   4. for every block of $K$:
      1. the thread group loads the block of $K$
      2. local synchronization
      3. each thread compares the block with its $k'$
      4. on a collision, $k$ and $k'$ are recorded
      5. local synchronization
3. finally, all texts belonging to the collisions found are printed
https://github.com/sethosii/birthday-attack
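The steps above can be sketched sequentially on the host. This is a simplified illustration in plain C++ under stated assumptions, not the code from the repository: FNV-1a stands in for the SHA-256-based 32-bit hash, text variants are built by choosing one of two wordings per slot, and the names fnv1a, makeVariant and findCollisions are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in 32-bit hash (FNV-1a); an assumption for this sketch, the talk
// uses a hash function based on SHA-256 instead.
std::uint32_t fnv1a(const std::string& text) {
    std::uint32_t h = 2166136261u;
    for (unsigned char ch : text) {
        h ^= ch;
        h *= 16777619u;
    }
    return h;
}

// Builds variant number index of a text: bit j of index selects the first
// or the second wording of slot j.
std::string makeVariant(
        const std::vector<std::pair<std::string, std::string>>& slots,
        std::uint32_t index) {
    std::string text;
    for (std::size_t j = 0; j < slots.size(); j++) {
        text += ((index >> j) & 1u) ? slots[j].second : slots[j].first;
    }
    return text;
}

// Hashes all good texts once, then compares every bad text against all of
// them, like a thread comparing its k' with the blocks of K.
// Returns pairs (index of good text, index of bad text).
template <typename Hash>
std::vector<std::pair<std::size_t, std::size_t>> findCollisions(
        const std::vector<std::string>& good,
        const std::vector<std::string>& bad, Hash hash) {
    std::vector<std::uint32_t> goodHashes;
    for (const std::string& k : good) goodHashes.push_back(hash(k));
    std::vector<std::pair<std::size_t, std::size_t>> collisions;
    for (std::size_t j = 0; j < bad.size(); j++) {
        std::uint32_t h = hash(bad[j]);
        for (std::size_t i = 0; i < goodHashes.size(); i++) {
            if (goodHashes[i] == h) collisions.emplace_back(i, j);
        }
    }
    return collisions;
}
```

With 16 two-way slots per text, makeVariant yields $2^{16}$ variants of each report, which gives $2^{32}$ pairs to compare against a 32-bit hash. The hash is passed in as a parameter so that the stand-in can be replaced by another 32-bit function.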

Result
This birthday attack took 165.8 ms.
Collisions found: 2

Collision with good text #17470 and bad text #21144
Good plaintext: Linux is awesome! It is build on top of reliable Software like X11. Also there are several projects within the Linux community which have mostly the same objective like Wayland and Mir - and that's great. Why shouldn't you? Duplicate work is no problem. And with different approaches you will propably find better solutions. Thanks to the many projects you can choose what suits you best. The next
Bad plaintext: Linux sucks! It is build on top of very old Software like X11 which makes them hard to maintain. Also there are several projects within the Linux community which have mostly identical aim like Wayland and Mir. Therefore there is duplicate work. This shows how divided the community is. The next point are the codenames. They suck. What should Trusty Thar stand for? Also the whole development
Hash value of both is: 640cb7c7

Collision with good text #51660 and bad text #26306
Good plaintext: Linux is awesome! It is build on top of reliable Software like X11. Also there are several projects within the Linux community which have mostly identical aim like Wayland and Mir - and that's great. Why shouldn't you? Duplicate work is no problem. And with more approaches you will propably find better solutions. Because of the many projects you can choose what fits you best. The next point are the
Bad plaintext: Linux sucks! It is build on top of very old Software like X11 which makes them difficult to maintain. Also there are several projects within the Linux community which have mostly identical aim like Wayland and Mir. Because of such things there is a duplication of effort. This shows how divided the community is. The next point are the code names. They suck. What should Trusty Thar stand for? Also the
Hash value of both is: 91ada029