Volumenrendering mit CUDA

Volumenrendering mit CUDA Arbeitsgruppe Visualisierung und Computergrafik http://viscg.uni-muenster.de

Überblick Volumenrendering allgemein Raycasting-Algorithmus Volumen-Raycasting mit CUDA Optimierung 2

Volumenrendering Aufgabe: Visualisierung volumetrischer Daten Einsatzgebiete: Medizin, Geologie, Meterologie, Materialprüfung,... Datenquellen: CT, MR, PET, Ultraschall,... 3

www.voreen.org

Volumenrendering Ziel: interaktives Rendering (>10 fps) Problem: große Datenmengen 2.048 MB 32 MB 256 MB 256³ 512³ 1024³ CPU-Implementierungen nicht interaktiv GPUs bieten höhere Speicherbandbreite 5

Volumen-Raycasting Strahlen sind voneinander unabhängig Verfahren gut parallelisierbar 6

GPU-based Raycasting GPUs unterstützen Volumengrafik nicht direkt Ansatz: Krüger/Westermann (2003) Datensatz als 3D-Textur (lineare Filterung) Strahl-Traversierung im Fragment Shader Hohe Frameraten, gute Bildqualität 7

GPU-based Raycasting Strahlen werden mittels einer Proxy-Geometrie spezifiziert, nicht analytisch Erzeugung durch Rendern eines Würfels 8

Raycasting mit CUDA Motivation: Shader-Implementierung ist Umweg Mehr Kontrolle über Ausführung gewünscht Zusätzliche Hardware-Features nutzen Bessere Dokumentation 9

Raycasting mit CUDA 1:1-Umsetzung der Shader-Implementierung: Thread Strahl Block Bildschirmregion Strahlparameter mit OpenGL erzeugen Volumen-Datensatz als CUDA-Textur 10

CUDA-Texturen 1D/2D/3D lineare Interpolation (kostenlos!) Randbehandlung (clamp/wrap) Caching (optimiert für 2D-Lokalität) keine Coalescing-Regeln zu beachten 11

CUDA-Implementierung 12

CUDA-Implementierung texture<ushort, 3, cudareadmodenormalizedfloat> volumetex; void cuda_raycast_bindvolumearray(cudaarray* array) { volumetex.normalized = true; volumetex.filtermode = cudafiltermodelinear; volumetex.addressmode[0] = cudaaddressmodeclamp; volumetex.addressmode[1] = cudaaddressmodeclamp; volumetex.addressmode[2] = cudaaddressmodeclamp; } channeldescvolume = cudacreatechanneldesc<ushort>(); cudabindtexturetoarray(volumetex, array, channeldescvolume); 13

global void simpleraycast(float4* entryparams, float4* exitparams, float4* output, uint width, uint height, float qualityfactor, float3 camerapos, float3 lightpos) { uint x = umul24(blockidx.x, blockdim.x) + threadidx.x; uint y = umul24(blockidx.y, blockdim.y) + threadidx.y; if (x >= width y >= height) return; uint index = ( umul24(y, width) + x); // enforce reading 4 floats although only 3 are accessed to get coalescing volatile float4 first4 = entryparams[index]; volatile float4 last4 = exitparams[index]; float3 first = { first4.x, first4.y, first4.z }; float3 last = { last4.x, last4.y, last4.z }; 14

while (t <= tend) { float3 sample = first + t * direction; float intensity = tex3d(volumetex, sample.x, sample.y, sample.z); float3 gradient = calcgradient(sample); float4 color = tex1d(transferfunctex, intensity); float3 shadedcolor = phong(sample, color, gradient, lightpos, camerapos); color.x = shadedcolor.x; color.y = shadedcolor.y; color.z = shadedcolor.z; t += stepincr; } // perform compositing color.w *= qualityfactor; result.x = result.x + (1.0f - result.w) * color.w * color.x; result.y = result.y + (1.0f - result.w) * color.w * color.y; result.z = result.z + (1.0f - result.w) * color.w * color.z; result.w = result.w + (1.0f - result.w) * color.w; // write output color output[index] = result; 15

Vergleich Shader/CUDA Shader CUDA frames per second 100,2 82,7 70,5 64,6 +42% +28% -4% 18,0 17,2 basic TF Phong technique fetches registers basic 1 15 TF 2 19 Phong 8 33 komplexe Kernel benötigen viele Register, damit weniger Threads gleichzeitig möglich 16

Vergleich Shader/CUDA 520 500 8x8 12x8 8x16 16x20 occupancy 480 16x10 16x12 16x14 16x16 16x18 32 fps 460 440 420 400 380 16x8 16x22 16x24 16x26 16x28 16x30 16x32 Blockgröße beachten! CUDA 24 16 multiprocessor warp occupancy GLSL 360 8 340 simpleraycast<<gridsize, blocksize>>(...) 320 8x4 0 50 100 150 200 250 300 350 400 450 500 550 0 threads per block 17

Raycasting beschleunigen Problem: auf Voxel wird mehrfach zugegriffen (Gradienten-Berechnung) Idee: Datensatz in slabs aufteilen und im Shared Memory zwischenspeichern nur 16 kb Shared Memory: max.16 3 Voxel Overhead: Randbehandlung, Preprocessing, Koordinatensystemwechsel 18

Slab-based Raycasting basic slab-based 70,0 frames per second 52,5 35,0 17,5 0 +78% +26% +13% 512² 786² 1024² viewport size 19

Optimierungs-Hinweise Raycasting ist Sonderfall: einfache Berechnungen, viele Speicherzugriffe "if" ist teuer: Branch-Coherence beachten, Präprozessor benutzen Kernel klein halten, große Kernel aufteilen Blockgrößen beachten 20

Fazit Auch vorhandene GPGPU-Verfahren können von CUDA profitieren Grafikhardware ist auch für nicht-grafikanwendungen nützlich Aber: Verfahren müssen an die CUDA- Architektur angepasst werden 21