GPGPU Architectures, CUDA, CUDA Example, OpenCL, OpenCL Example. CUDA & OpenCL. Ralf Seidler. Friedrich-Alexander-Universität Erlangen-Nürnberg

Transcript

1 CUDA and OpenCL. Friedrich-Alexander-Universität Erlangen-Nürnberg, 24 April 2012

2 Outline: 1 GPGPU architectures, 2 CUDA, 3 CUDA example, 4 OpenCL, 5 OpenCL example


4 A brief history of graphics cards
- Originally: the video card only drives the monitor
- Mid-1980s: graphics cards with 2D acceleration, modeled on arcade machines and home computers
- Mid-1990s: first 3D acceleration (Matrox Mystique, 3dfx Voodoo)
  - Rasterization of polygons
  - Mapping of textures onto polygons

5 A brief history of graphics cards (figure: Direct3D 10 pipeline)

6 A brief history of graphics cards
- 2000s: at first only a fixed-function pipeline
- Shader programs offer more flexibility
- Shader programs were originally just simple lists of instructions
- 2002: the ATI Radeon 9700 can execute loops in shaders
- Today: shaders are Turing-complete
- Vendors: Intel, ATI and NVIDIA
- Mass market, hence low prices

7 GPGPUs
- GPGPU = General Purpose Graphics Processing Unit
- Graphics cards are becoming increasingly flexible to program
- Steadily growing performance
- Well suited to stream processing:
  - Low ratio of I/O to compute load
  - Data parallelism (SIMD processing)
  - Single precision matters more than double precision
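The "low ratio of I/O to compute load" can be made concrete for the matrix multiplication used later in the slides. The following is a minimal sketch, not taken from the slides: the function name `matmul_flops_per_byte` is hypothetical, and it assumes the naive algorithm (2 * n^3 floating-point operations), one upload each of the two input matrices, one download of the result, and 4-byte floats.

```c
#include <assert.h>

/* Hypothetical helper: floating-point operations per byte moved over the
 * bus for an n x n single-precision matrix multiplication.
 * Assumptions: naive algorithm with 2 * n^3 operations; a and b are
 * uploaded once and c is downloaded once (3 * n * n floats, 4 bytes each). */
double matmul_flops_per_byte(double n)
{
    assert(n > 0.0);
    double flops = 2.0 * n * n * n;
    double bytes = 3.0 * n * n * 4.0;
    return flops / bytes;
}
```

Under these assumptions the ratio simplifies to n / 6, so it grows with the matrix size: for n = 1024 roughly 170 operations are performed per transferred byte, which is the kind of workload the slide describes as well suited.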

8 Structure of a GPGPU (figure: a bus interface, e.g. PCIe, connects to the GPGPU; the GPGPU contains several multiprocessors, each with several shaders and local memory, and one global memory)

9 Properties of GPGPUs
- Many simple cores, called scalar processors (SPs)
  - No branch prediction etc.
- Grouped into multiprocessors (vector processors)
  - Problems with non-uniform branches, i.e. when the cores of one multiprocessor take different paths
- Many registers
- Large, slow global memory (latency of many clock cycles)
- Small, fast on-chip shared-memory blocks
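Why non-uniform branches are a problem can be illustrated with a deliberately simplified CPU model. The sketch below is an assumption-laden illustration, not a description of any specific hardware: it treats a group of lanes as executing in lock-step, so an if/else costs one pass when all lanes agree and two passes when they diverge. The name `branch_passes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified lock-step model (an assumption, not a hardware spec):
 * returns how many passes a SIMD group needs for one if/else.
 * 1 if all lanes take the same side, 2 if the lanes diverge, because
 * both sides are then executed one after the other with some lanes idle. */
int branch_passes(const int *takes_branch, size_t lanes)
{
    assert(lanes > 0);
    int any_taken = 0;
    int any_not_taken = 0;
    for (size_t i = 0; i < lanes; i++) {
        if (takes_branch[i]) {
            any_taken = 1;
        } else {
            any_not_taken = 1;
        }
    }
    return any_taken + any_not_taken;
}
```

In this model a group whose lanes all evaluate the condition identically needs one pass, while a single deviating lane doubles the cost of the branch for the whole group.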

10 GPGPU: GeForce G80 / GeForce GT200

11 GPGPU: GeForce GF100

12 GPGPU: AMD Cayman

13 Programming
- Very many short-lived threads
- Threads are grouped into blocks
- Blocks are distributed across the multiprocessors
- Standards:
  - CUDA (NVIDIA, market leader)
  - OpenCL (open standard, the counterpart to OpenGL)
  - FireStream (AMD)
  - DirectCompute (Microsoft)
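Grouping threads into blocks raises the practical question of how many blocks a problem of a given size needs. A minimal sketch in plain C follows; the helper `blocks_needed` is hypothetical and not part of CUDA or OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any GPU API): number of blocks needed
 * so that n elements are covered by blocks of block_size threads each.
 * The division rounds up, so the last block may be only partially used. */
size_t blocks_needed(size_t n, size_t block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

For 1024 elements and 16 threads per block this gives 64 blocks; for 1000 elements it gives 63, and the surplus threads of the last block would have to be masked out by a bounds check in the kernel.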

14 GPU systems at the chair
- faui36i and faui36j, each with 2 Intel Xeon 2.66 GHz and 24 GB DDR3 RAM
- faui36i (Fermi system):
  - 3 NVIDIA Tesla C2050: 448 cores (14 x 32 cores), 3 GB GDDR5 (144 GB/s), 1.03 TFlops SP, 515 GFlops DP
  - 1 NVIDIA GeForce GTX 480: 480 cores (15 x 32 cores), 1.5 GB GDDR5 (177.4 GB/s), 1.34 TFlops SP, 100 GFlops DP
- faui36j:
  - 1 AMD Radeon HD 6970: 1536 cores (24 x 16 cores, 4-way VLIW), 2 GB GDDR5 (174 GB/s), 2.73 TFlops SP, 676 GFlops DP
  - 1 NVIDIA Tesla C1060: 240 cores (30 x 8 cores), 4 GB GDDR3 (102 GB/s), 1.05 TFlops SP, 74 GFlops DP
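The single-precision peak figures of the two Fermi cards can be reproduced with a short calculation. This is a sketch under stated assumptions: each core is taken to complete one fused multiply-add (counted as 2 operations) per shader clock, and the shader clocks of 1.15 GHz (Tesla C2050) and 1.401 GHz (GeForce GTX 480) are not on the slide but are assumed here. The function name `peak_gflops_sp` is hypothetical.

```c
#include <assert.h>

/* Theoretical single-precision peak in GFlops.
 * Assumption: every core retires one fused multiply-add, i.e. 2
 * floating-point operations, per shader clock cycle.
 * shader_clock_ghz is an assumed input that the slide does not state. */
double peak_gflops_sp(int cores, double shader_clock_ghz)
{
    assert(cores > 0 && shader_clock_ghz > 0.0);
    return cores * shader_clock_ghz * 2.0;
}
```

With the assumed clocks, 448 cores yield about 1030 GFlops and 480 cores about 1345 GFlops, which matches the 1.03 TFlops and 1.34 TFlops listed for the C2050 and the GTX 480.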

16 Getting started with CUDA I
- Programming in C
- Individual functions run on the GPU as so-called kernels (function offloading)
- The nvcc compiler separates the code and builds on gcc
- A program is split into host code (standard C) and device code (CUDA)
- Functions are distinguished by qualifiers:
  - __host__: CPU functions (optional)
  - __device__: GPGPU functions (important)
  - __global__: entry points into CUDA code

17 Getting started with CUDA: memory management
- Memory is likewise described by qualifiers:
  - (no qualifier): ordinary memory in system RAM
  - __device__: in global memory on the GPU
  - __shared__: in shared memory on the multiprocessors
- CUDA API for memory operations:
  - Allocation/deallocation of global memory (cudaMalloc(), cudaFree())
  - Transfer between system RAM and GPU RAM (cudaMemcpy())
- Inside kernels: transfer between global memory and shared memory
- Additional special memory areas:
  - Constants (cached)
  - Textures (cached, multi-dimensional addressing, filtering)

18 NVCC in detail
- The most important switches:
  - All non-CUDA switches are passed on to gcc/g++ (e.g. -g)
  - -G: generates debugging information for cuda-gdb
  - --ptxas-options=-v: prints kernel information (registers per thread, shared memory, local memory, ...)
  - -arch=sm_xx: sets the target architecture (xx ∈ {10, 11, 12, 13, 20})

20 Matrix multiplication: kernel

    #define DIM 1024 // size of the matrix

    __global__ void matmul(float* a, float* b, float* c)
    {
        // Each thread computes one element of c; its coordinates follow
        // from the block index and the thread index within the block.
        int2 coord;
        coord.x = blockDim.x * blockIdx.x + threadIdx.x;
        coord.y = blockDim.y * blockIdx.y + threadIdx.y;
        float sum = 0;
        for (int z = 0; z < DIM; z++) {
            sum += a[coord.y * DIM + z] * b[z * DIM + coord.x];
        }
        c[coord.y * DIM + coord.x] = sum;
    }

21 Matrix multiplication: main

    int main()
    {
        dim3 blockDim(16, 16, 1);                            // 16 x 16 threads per block
        dim3 gridDim(DIM / blockDim.x, DIM / blockDim.y, 1); // blocks covering DIM x DIM
        // static: three 4 MiB arrays would overflow a typical stack
        static float a[DIM * DIM], b[DIM * DIM], res[DIM * DIM];
        fill(a, b);                                          // helper, not shown on the slide
        float *devA, *devB, *devRes;
        int byteSize = DIM * DIM * sizeof(float);
        cudaMalloc((void**)&devA, byteSize);
        cudaMalloc((void**)&devB, byteSize);
        cudaMalloc((void**)&devRes, byteSize);
        cudaMemcpy(devA, a, byteSize, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, b, byteSize, cudaMemcpyHostToDevice);
        matmul<<<gridDim, blockDim>>>(devA, devB, devRes);
        cudaDeviceSynchronize();                             // wait for the kernel to finish
        cudaMemcpy(res, devRes, byteSize, cudaMemcpyDeviceToHost);
        cleanup();                                           // helper, not shown on the slide
        return 0;
    }
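A result copied back from the GPU is usually checked against a CPU computation. The following plain C sketch is not part of the slides: `matmul_cpu` and `results_match` are hypothetical helpers that use the same row-major index scheme as the kernel and compare two result arrays within a relative tolerance, since GPU and CPU may round differently.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference: c = a * b for row-major n x n matrices, using the same
 * index scheme as the kernel (row * n + column). */
void matmul_cpu(const float *a, const float *b, float *c, size_t n)
{
    for (size_t y = 0; y < n; y++) {
        for (size_t x = 0; x < n; x++) {
            float sum = 0.0f;
            for (size_t z = 0; z < n; z++) {
                sum += a[y * n + z] * b[z * n + x];
            }
            c[y * n + x] = sum;
        }
    }
}

/* Returns 1 if all count entries agree within rel_tol (relative to the
 * magnitude of the expected value, but at least 1), otherwise 0. */
int results_match(const float *expected, const float *actual,
                  size_t count, float rel_tol)
{
    assert(rel_tol >= 0.0f);
    for (size_t i = 0; i < count; i++) {
        float diff = expected[i] - actual[i];
        if (diff < 0.0f) diff = -diff;
        float scale = expected[i] < 0.0f ? -expected[i] : expected[i];
        if (scale < 1.0f) scale = 1.0f;
        if (diff > rel_tol * scale) return 0;
    }
    return 1;
}
```

In the program above one would compute a reference with `matmul_cpu(a, b, ref, DIM)` into an additional array `ref` and then call `results_match(ref, res, DIM * DIM, 1e-4f)`; the tolerance value is an arbitrary choice for illustration.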

23 Brief overview: what is OpenCL?
- Open standard for programming parallel architectures
- Any architecture for which drivers are provided is supported
- Popular ones:
  - AMD: Radeon HD 4000 and later, x86 architectures from the Phenom II onwards
  - NVIDIA: all GPUs that also support CUDA
  - Intel: all recent Intel CPUs, and from Ivy Bridge onwards the GPUs as well
- Intel and AMD CPUs are supported by both the AMD and the Intel drivers; the Intel driver is usually considerably faster

24 Platform model

25 Platform model (II)
- Host
  - Runs the OpenCL program
  - Manages the compute devices
- Compute device
  - The compute resource being used
  - Examples: graphics card, processor, Cell blade
- Compute unit
  - A group of individual processing elements
  - Examples: processor core, arithmetic unit
  - CUDA: streaming multiprocessor
- Processing element
  - The actual computing element
  - CUDA: scalar processor

26 Execution model
- Index space (NDRange)
  - 1-, 2- or 3-dimensional
- Global ID of a work-item
  - No direct equivalent in CUDA: it can be computed from the block ID and the thread ID
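The relationship between the OpenCL global ID and the CUDA block and thread IDs can be sketched for one dimension in plain C. The function names are hypothetical; the formula is the one used in the CUDA kernel of this talk (block size times block index plus thread index), and a global offset of zero is assumed on the OpenCL side.

```c
#include <assert.h>

/* One dimension of the index computation: the global index of a work-item
 * expressed through CUDA-style block index, block size and thread index.
 * Assumes a global offset of zero. */
unsigned global_id(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Inverse direction: split a global index into block and thread index. */
void split_global_id(unsigned gid, unsigned block_dim,
                     unsigned *block_idx, unsigned *thread_idx)
{
    assert(block_dim > 0);
    *block_idx = gid / block_dim;
    *thread_idx = gid % block_dim;
}
```

For example, thread 5 of block 2 with 16 threads per block has the global index 37, and splitting 37 with the same block size gives block 2 and thread 5 again.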

27 Execution model
- Work-item
  - Corresponds to one kernel instance
  - CUDA: thread
- Work-group
  - A group of work-items
  - Shared memory
  - CUDA: thread block

28 Memory model
- Global memory
  - The main memory of the compute device
- Constant memory
  - Part of global memory
  - Constant during program execution

29 Memory model (II)
- Data cache
  - Optional data cache for accesses to global/constant memory
  - On NVIDIA only from compute capability 2.0 (sm_20) onwards

30 Memory model (III)
- Local memory
  - Local memory of a work-group
  - CUDA: shared memory
- Private memory
  - Private memory area of a work-item
  - CUDA: registers, or CUDA local memory
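How a work-group typically uses its local memory can be illustrated with a tiled matrix multiplication, emulated on the CPU. This is a sketch, not the kernel from the slides: each (by, bx) pair stands for one work-group, the arrays `tile_a` and `tile_b` stand in for local (shared) memory, `acc` stands in for the private sums of the work-items, and the tile size of 4 is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* assumed work-group edge length, chosen for illustration */

/* CPU emulation of a tiled multiplication c = a * b (row-major, n x n,
 * n a multiple of TILE). Tiles of a and b are first staged into small
 * arrays that play the role of local memory, then reused TILE times. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(n % TILE == 0);
    for (size_t by = 0; by < n; by += TILE) {
        for (size_t bx = 0; bx < n; bx += TILE) {
            float acc[TILE][TILE] = {{0}}; /* private sum per work-item */
            for (size_t t = 0; t < n; t += TILE) {
                float tile_a[TILE][TILE], tile_b[TILE][TILE];
                /* Stage one tile of a and of b: global -> local memory. */
                for (size_t y = 0; y < TILE; y++) {
                    for (size_t x = 0; x < TILE; x++) {
                        tile_a[y][x] = a[(by + y) * n + (t + x)];
                        tile_b[y][x] = b[(t + y) * n + (bx + x)];
                    }
                }
                /* Accumulate using only the staged tiles. */
                for (size_t y = 0; y < TILE; y++)
                    for (size_t x = 0; x < TILE; x++)
                        for (size_t z = 0; z < TILE; z++)
                            acc[y][x] += tile_a[y][z] * tile_b[z][x];
            }
            /* Write the finished tile of c back to global memory. */
            for (size_t y = 0; y < TILE; y++)
                for (size_t x = 0; x < TILE; x++)
                    c[(by + y) * n + (bx + x)] = acc[y][x];
        }
    }
}
```

The point of the staging step is that every value loaded from global memory is reused TILE times from the fast tile arrays, instead of being fetched from slow global memory for each multiplication.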

31 Memory model (IV): access rights

33 OpenCL: matrix multiplication, kernel

    #define DIM 1024 // size of the matrix

    __kernel void matmul(__global float* a, __global float* b, __global float* c)
    {
        // The global ID directly gives the element this work-item computes.
        int coordx = get_global_id(0);
        int coordy = get_global_id(1);
        float sum = 0;
        for (int z = 0; z < DIM; z++) {
            sum += a[coordy * DIM + z] * b[z * DIM + coordx];
        }
        c[coordy * DIM + coordx] = sum;
    }

34 OpenCL: matrix multiplication, main

    int main()
    {
        ...
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        ctx = clCreateContext(0, 1, &device, NULL, NULL, NULL);
        queue = clCreateCommandQueue(ctx, device, 0, NULL);
        // file() and length come from the elided part of the slide
        prog = clCreateProgramWithSource(ctx, 1, file("kernel.cl"), length, NULL);
        clBuildProgram(...);
        kernel = clCreateKernel(prog, "matmul", NULL);
        cl_mem a_dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
        ....
        // CL_TRUE: blocking write, returns once a_host has been copied
        clEnqueueWriteBuffer(queue, a_dev, CL_TRUE, 0, size, a_host, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_dev);
        ...
        // global_ws and local_ws are assumed to be size_t[2] arrays (2 dimensions)
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws, local_ws, 0, NULL, NULL);
        clEnqueueBarrier(queue);
        read_back();
        output();
        clReleaseKernel(..);
        return 0;
    }

35 OpenCL: C++ bindings
- Handling OpenCL from C is cumbersome
- Hence simple C++ wrappers for encapsulation
- Nothing different happens under the hood
- In return, programming is considerably more relaxed
- C and C++ documentation at:

36 OpenCL: example with the OpenCL C++ bindings

    int main()
    {
        ...
        vector<cl::Platform> platform;
        vector<cl::Device> device;
        cl::Platform::get(&platform);
        platform[0].getDevices(CL_DEVICE_TYPE_GPU, &device);
        cl::Context ctx = cl::Context(device);
        cl::CommandQueue queue = cl::CommandQueue(ctx, device[0]);
        cl::Program::Sources src = getsources("kernel.cl"); // helper, not shown on the slide
        cl::Program prog = cl::Program(ctx, src);
        prog.build(device);
        cl::Kernel kernel = cl::Kernel(prog, "matmul");
        cl::NDRange global = cl::NDRange(DIM, DIM);
        // cl.hpp 1.1: the functor is obtained via bind(); NullRange leaves
        // the work-group size to the implementation
        cl::KernelFunctor matmul = kernel.bind(queue, global, cl::NullRange);
        cl::Buffer a_dev = cl::Buffer(ctx, CL_MEM_READ_ONLY, bytesize);
        // assuming a is a host array: write it into the device buffer
        queue.enqueueWriteBuffer(a_dev, CL_TRUE, 0, bytesize, a);
        ...
        matmul(a_dev, b_dev, c_dev);
        queue.enqueueBarrier();
        // assuming c is a host array: read the result back
        queue.enqueueReadBuffer(c_dev, CL_TRUE, 0, bytesize, c);
        ...
    }