CUDA (Compute Unified Device Architecture)
Thomas Trost
May 31st, 2016


Introduction and Overview
- platform and API for parallel computing on GPUs by NVIDIA
- relatively straightforward general-purpose use of GPUs
- developed and maintained by NVIDIA
- around since 2007 and still rapidly changing (current version: 7.5)
- extension to languages like C, C++, Fortran
- freeware
- a number of clusters are equipped with NVIDIA GPUs
- alternative: OpenCL

Performance: Operations per Time

Performance: Memory Bandwidth

Difference between CPUs and GPUs

Basics
host (CPU + memory) <-> CUDA API <-> device (GPU + memory)
Typical programming pattern:
1. initialize data on host
2. copy data from host to device
3. invoke kernel on device for processing data
4. copy data from device to host
5. output of data from host

CUDA API
- Get number of available devices:
    cudaGetDeviceCount(int *count)
- Tell program to use particular device:
    cudaSetDevice(int device)
- Copy data between host and device and between different devices:
    cudaMemcpy(void *dest, const void *source, size_t count, enum cudaMemcpyKind kind)
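The five-step pattern from the Basics slide can be sketched without a GPU. The following is a host-only emulation in plain C++, not CUDA code: the "device" is simply a second buffer, `std::memcpy` stands in for `cudaMemcpy`, and `scale_by_two` and `run_pattern` are hypothetical names chosen for illustration. It is meant to show the data flow and the fact that the copy count is given in bytes, as in the `cudaMemcpy` signature above.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a kernel: doubles every element in place.
void scale_by_two(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f;
}

// Emulates the five-step pattern on the CPU; "device" is just a second buffer.
std::vector<float> run_pattern(std::size_t n) {
    if (n == 0) return {};

    // 1. initialize data on the host
    std::vector<float> host(n);
    for (std::size_t i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    // 2. copy host -> "device"; the count is in BYTES, not in elements
    std::vector<float> device(n);
    const std::size_t bytes = n * sizeof(float);
    std::memcpy(device.data(), host.data(), bytes);

    // 3. process the data on the "device"
    scale_by_two(device.data(), n);

    // 4. copy "device" -> host
    std::memcpy(host.data(), device.data(), bytes);

    // 5. the caller outputs the result from host memory
    return host;
}
```

In a real CUDA program the second buffer would live in GPU memory and step 3 would be a kernel launch; the order of the steps stays the same.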

Kernel
Execute the same code many times in parallel, organized as a grid of thread blocks.
In the program:
    kernelname<<<dimgrid, dimblocks>>>(args)
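How the two launch parameters map to array indices is easier to see in a small example. The sketch below is a CPU emulation in plain C++ of a one-dimensional launch, under the assumption that each thread handles one array element; `blocks_for` and `emulate_launch` are hypothetical helper names, not part of the CUDA API. The inner expression mirrors the usual CUDA index `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cstddef>
#include <vector>

// Number of blocks so that blocks * threads_per_block >= n (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// CPU emulation of a 1D launch: every (block, thread) pair runs the same body.
std::vector<int> emulate_launch(std::size_t n, std::size_t threads_per_block) {
    std::vector<int> out(n, 0);
    const std::size_t num_blocks = blocks_for(n, threads_per_block);
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            // In CUDA: blockIdx.x * blockDim.x + threadIdx.x
            const std::size_t i = block * threads_per_block + thread;
            // Guard: the last block may contain surplus threads.
            if (i < n) out[i] = static_cast<int>(i) * 2;
        }
    }
    return out;
}
```

On a GPU the two loops disappear: the hardware runs the body once per thread, and only the index computation and the bounds guard remain inside the kernel.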

Memory Hierarchy on Device

Pros and Cons
Pros:
- de facto standard
- well-documented
- active community
- good coverage of platforms and languages
- many libraries
Cons:
- monopoly of NVIDIA
- rapidly changing standard
- specific (expensive) hardware required
- difficult and costly to write really efficient code

Source and Further Reading
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Ruhr-Universität Bochum 31.05.2016

Why Libraries?
- Standard operations are needed frequently
- Optimizing CUDA code is laborious!
- Libraries take over that work
- Examples: sorting, matrix multiplication, FFT, eigenvalues, ...

Which Libraries?
- Thrust: various data structures and functions for the GPU
- cuFFT: fast Fourier transform for the GPU

Thrust - Introduction
- Data structures and algorithms for parallel computing on the GPU
- STL-like structure
- Two core areas:
  - Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, ...
  - Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, ...

Thrust - Containers
- Simplify common operations
- No need for cudaMalloc, cudaMemcpy, cudaFree, etc.
- Compatible with STL data structures
Example:
    #include <iostream>
    #include <list>
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>

    // list container on host
    std::list<int> h_list;
    h_list.push_back(13);
    h_list.push_back(37);
    // copy list to device vector
    thrust::device_vector<int> d_vec(h_list.size());
    thrust::copy(h_list.begin(), h_list.end(), d_vec.begin());
    // alternative method (separate name: d_vec is already declared above)
    thrust::device_vector<int> d_vec2(h_list.begin(), h_list.end());
    // print elements from device vector
    for (auto elem : d_vec)
        std::cout << elem << std::endl;

Thrust - Algorithms
- Many standard algorithms:
  - Transformations
  - Reductions
  - Prefix sums
  - Sorting
- Templated for use with arbitrary data types
- User-defined operations possible
Example:
    // assumes: using namespace thrust;
    // declare storage
    device_vector<int> i_vec =...
    device_vector<float> f_vec =...
    // sum of integers (equivalent calls)
    reduce(i_vec.begin(), i_vec.end());
    reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());
    // sum of floats (equivalent calls)
    reduce(f_vec.begin(), f_vec.end());
    reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());
    // maximum of integers (the initial value 0 takes part in the reduction,
    // so the result is only correct if at least one element is >= 0)
    reduce(i_vec.begin(), i_vec.end(), 0, maximum<int>());
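The role of the initial value in a reduction, and what an exclusive prefix sum computes, can be checked on the host. The sketch below uses the sequential `std::accumulate` from the C++ standard library as a stand-in, not Thrust itself; all function names are hypothetical. The initial value takes part in the reduction just as it does in the `reduce` calls above, which is why a maximum started at 0 fails for all-negative input.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Sum: the initial value 0 is the identity of +.
int sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Maximum with initial value 0: wrong when every element is negative.
int max_init_zero(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0,
                           [](int a, int b) { return std::max(a, b); });
}

// Maximum with the lowest int as initial value (the identity of max).
int max_init_lowest(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(),
                           std::numeric_limits<int>::lowest(),
                           [](int a, int b) { return std::max(a, b); });
}

// Exclusive prefix sum: out[i] = v[0] + ... + v[i-1], with out[0] = 0.
std::vector<int> exclusive_prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    int running = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = running;   // sum of all elements before position i
        running += v[i];
    }
    return out;
}
```

For the input {-5, -2, -9}, `max_init_zero` yields 0 while `max_init_lowest` yields -2; for {3, 1, 4}, the exclusive prefix sum is {0, 3, 4}.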

Thrust - Performance

cuFFT - Introduction
- Optimized FFT on the GPU
- 1D, 2D and 3D transforms
- Real and complex data types
- Single and double precision
- In-place and out-of-place
- Thread-safe
- Interface similar to FFTW

cuFFT - Usage
- Execution of FFTs is based on a plan
- A plan must be created before execution
- The plan contains information about the kind of transform (size, data type, hardware, etc.)
- The algorithm is optimized based on the plan
- One plan can be used for several FFTs
- The optimization only has to be done once, at the start
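The idea of paying the setup cost once and reusing it can be illustrated with a toy plan in plain C++. This is a sketch of the concept only: it is not the cuFFT API, it uses a naive O(n^2) discrete Fourier transform instead of an FFT, and `DftPlan`, `make_plan` and `execute` are hypothetical names. The plan precomputes the size-dependent factors exp(-2*pi*i*k/n) once, and every later transform of the same size reuses them.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical plan type: holds everything that depends only on the size n.
struct DftPlan {
    std::size_t n;
    std::vector<std::complex<double>> twiddles;  // exp(-2*pi*i*k/n), k = 0..n-1
};

// Setup step, done once per transform size.
DftPlan make_plan(std::size_t n) {
    const double pi = std::acos(-1.0);
    DftPlan plan{n, std::vector<std::complex<double>>(n)};
    for (std::size_t k = 0; k < n; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        plan.twiddles[k] = std::polar(1.0, angle);
    }
    return plan;
}

// Forward transform using the precomputed factors.
// Assumes in.size() == plan.n.
std::vector<std::complex<double>> execute(
        const DftPlan& plan, const std::vector<std::complex<double>>& in) {
    std::vector<std::complex<double>> out(plan.n);
    for (std::size_t j = 0; j < plan.n; ++j)
        for (std::size_t k = 0; k < plan.n; ++k)
            out[j] += in[k] * plan.twiddles[(j * k) % plan.n];
    return out;
}
```

Calling `make_plan(n)` once and then `execute(plan, x)` for several inputs of length n follows the same create, execute, execute pattern as the slide describes, only without the hardware-specific tuning a real library plan would contain.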

cuFFT - Example
    #include <cufft.h>

    #define NX 64
    #define NY 64
    #define NZ 128

    cufftHandle plan;
    cufftComplex *data1, *data2;
    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
    cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
    /* Create a 3D FFT plan. */
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

    /* Transform the first signal in place. */
    cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

    /* Transform the second signal using the same plan. */
    cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

    /* Destroy the cuFFT plan. */
    cufftDestroy(plan);
    cudaFree(data1); cudaFree(data2);

cuFFT - Further Options
- Callback routines: allow user code to run on the data as it is loaded before or stored after each FFT
- Custom memory layout
- Asynchronous execution of multiple FFTs
- FFTs on multiple GPUs

Further Libraries
- cuSOLVER: solving systems of linear equations
- cuSPARSE: linear algebra for sparse matrices
- cuBLAS: linear algebra routines
- cuRAND: pseudorandom numbers
- CUDA Math Library: various mathematical functions
- etc.