Zach Arend Final Project for Intro to Computer Graphics (CPE 471)
Parallel Rasterizer in CUDA

For my final project, I redid program 1, this time parallelizing it with CUDA, the parallel programming framework for NVIDIA graphics cards, which gave roughly a 2x speedup. The biggest challenge I faced was allocating memory on the host (CPU) for the pixel array and copying the results back to the host. The biggest lesson I learned is that memory is expensive.

A big challenge in parallelizing a rasterizer is dealing with the z-buffer. Because many threads can write to the same location at the same time, managing this memory can be tricky. To get around this, I sorted the triangles by depth, then used the atomicMax operation to fill in the z-buffer with the highest triangle number at each pixel. atomicMax is a function built into CUDA that atomically replaces the value at a memory address with the maximum of that value and a given argument.
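As an illustration of the z-buffer fill described above, here is a minimal sketch, not the project's actual code. The Tri2D struct, the fillZBuffer kernel name, the pixel-center sampling, and the convention that a higher triangle number wins are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical screen-space triangle (field names are assumptions).
struct Tri2D { float x0, y0, x1, y1, x2, y2; };

// Twice the signed area of triangle (a, b, p).
__device__ float edge(float ax, float ay, float bx, float by, float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// One thread per triangle. Triangles are assumed to be sorted already,
// so the highest triangle number covering a pixel should win.
__global__ void fillZBuffer(const Tri2D* tris, int numTris,
                            int* zbuf, int width, int height) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;
    Tri2D tr = tris[t];
    float area = edge(tr.x0, tr.y0, tr.x1, tr.y1, tr.x2, tr.y2);
    if (area == 0.0f) return;  // skip degenerate triangles

    // Bounding box ("hit area") of the triangle, clamped to the image.
    int minX = max(0, (int)floorf(fminf(tr.x0, fminf(tr.x1, tr.x2))));
    int maxX = min(width - 1, (int)ceilf(fmaxf(tr.x0, fmaxf(tr.x1, tr.x2))));
    int minY = max(0, (int)floorf(fminf(tr.y0, fminf(tr.y1, tr.y2))));
    int maxY = min(height - 1, (int)ceilf(fmaxf(tr.y0, fmaxf(tr.y1, tr.y2))));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel center
            // Barycentric weights; dividing by area handles either winding.
            float w0 = edge(tr.x1, tr.y1, tr.x2, tr.y2, px, py) / area;
            float w1 = edge(tr.x2, tr.y2, tr.x0, tr.y0, px, py) / area;
            float w2 = edge(tr.x0, tr.y0, tr.x1, tr.y1, px, py) / area;
            if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f) {
                // Many threads may hit this pixel; atomicMax keeps the largest.
                atomicMax(&zbuf[y * width + x], t);
            }
        }
    }
}

int main() {
    const int W = 8, H = 8, N = 2;
    Tri2D hTris[N] = { {0, 0, 8, 0, 0, 8},    // triangle 0: large
                       {0, 0, 4, 0, 0, 4} };  // triangle 1: small, overlaps 0
    Tri2D* dTris; int* dZ;
    cudaMalloc(&dTris, sizeof(hTris));
    cudaMalloc(&dZ, W * H * sizeof(int));
    cudaMemcpy(dTris, hTris, sizeof(hTris), cudaMemcpyHostToDevice);
    cudaMemset(dZ, 0xFF, W * H * sizeof(int));  // every entry becomes -1 (empty)

    fillZBuffer<<<1, 32>>>(dTris, N, dZ, W, H);

    int hZ[W * H];
    cudaMemcpy(hZ, dZ, sizeof(hZ), cudaMemcpyDeviceToHost);
    printf("pixel (1,1) -> triangle %d\n", hZ[1 * W + 1]);
    printf("pixel (5,1) -> triangle %d\n", hZ[1 * W + 5]);
    printf("pixel (7,7) -> triangle %d\n", hZ[7 * W + 7]);
    cudaFree(dTris); cudaFree(dZ);
    return 0;
}
```

Storing a triangle number instead of a depth value is what lets a single integer atomicMax resolve visibility, provided the triangle numbers already reflect depth order.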

Although I thought the atomic operations would be expensive, their cost paled in comparison to the overhead of memory allocation. My parallel rasterizer spends about half of its time allocating pinned memory on the host for the pixel array. I chose to use pinned memory because data can be copied across the bus to it much faster. This speedup outweighs the time required to allocate the memory.
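To make the pinned-memory trade-off concrete, here is a rough sketch that times a device-to-host copy into pageable and into pinned memory. It is an illustration only: the buffer size, the 4-byte pixel format, and the timeCopy helper are assumptions, and the timings it prints will vary by machine.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: time one device-to-host copy, in milliseconds.
static float timeCopy(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    // Assumed pixel array: 1920x1080 pixels, 4 bytes per pixel.
    const size_t bytes = 1920u * 1080u * sizeof(unsigned int);

    unsigned int* dPixels = nullptr;
    cudaMalloc(&dPixels, bytes);
    cudaMemset(dPixels, 0, bytes);

    // Pageable host memory from the ordinary C allocator.
    unsigned int* pageable = static_cast<unsigned int*>(malloc(bytes));

    // Pinned (page-locked) host memory; time the allocation itself too.
    unsigned int* pinned = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaHostAlloc(reinterpret_cast<void**>(&pinned), bytes,
                                    cudaHostAllocDefault);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess || pageable == nullptr) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    double allocMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    printf("cudaHostAlloc:      %.3f ms\n", allocMs);
    printf("copy into pageable: %.3f ms\n", timeCopy(pageable, dPixels, bytes));
    printf("copy into pinned:   %.3f ms\n", timeCopy(pinned, dPixels, bytes));

    cudaFreeHost(pinned);  // pinned memory must be released with cudaFreeHost
    free(pageable);
    cudaFree(dPixels);
    return 0;
}
```

The first CUDA call in a process also pays for context creation, so the cudaMalloc is placed before the timed allocation to keep that cost out of the measurement.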

Summary of algorithm:
allocate host memory for pixel array
copy mesh and indices from host to device
populate array of triangles with vertices and indices
calculate the depth of each triangle by averaging its three points
key-value sort the triangles by depth
create a mapping from pixel to triangle number
    -> launch a thread for each triangle
        for each pixel in its hit area
            atomic max with its triangle number at pixel loc in z-buffer
rasterize each pixel
    -> launch a thread for each pixel
        get the triangle from z-buffer and rasterize it
copy results back to host
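
The depth and key-value sort steps in the summary above could be sketched as follows. This is an assumption-heavy illustration: the write-up does not say which sort was used, so Thrust's sort_by_key stands in for it, and the Vec3 struct, the triangleDepths kernel, and the choice of an ascending sort (so that the triangle with the largest average z gets the highest number) are all hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical vertex type (an assumption for this sketch).
struct Vec3 { float x, y, z; };

// One thread per triangle: depth is the average z of its three vertices.
__global__ void triangleDepths(const Vec3* verts, const int3* indices,
                               int numTris, float* depth) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;
    int3 idx = indices[t];
    depth[t] = (verts[idx.x].z + verts[idx.y].z + verts[idx.z].z) / 3.0f;
}

int main() {
    // A tiny made-up mesh: three triangles at different depths.
    std::vector<Vec3> hVerts = {
        {0, 0, 0.5f}, {1, 0, 0.5f}, {0, 1, 0.5f},
        {0, 0, -1.0f}, {1, 0, -1.0f}, {0, 1, -1.0f},
        {0, 0, 2.0f}, {1, 0, 2.0f}, {0, 1, 2.0f}};
    std::vector<int3> hIdx = {make_int3(0, 1, 2), make_int3(3, 4, 5),
                              make_int3(6, 7, 8)};
    const int numTris = static_cast<int>(hIdx.size());

    // Copy mesh and indices from host to device.
    thrust::device_vector<Vec3> dVerts = hVerts;
    thrust::device_vector<int3> dIdx = hIdx;

    // Keys are depths; values are the original triangle numbers.
    thrust::device_vector<float> dDepth(numTris);
    thrust::device_vector<int> dOrder(numTris);
    thrust::sequence(dOrder.begin(), dOrder.end());  // 0, 1, 2, ...

    int threads = 128;
    int blocks = (numTris + threads - 1) / threads;
    triangleDepths<<<blocks, threads>>>(
        thrust::raw_pointer_cast(dVerts.data()),
        thrust::raw_pointer_cast(dIdx.data()), numTris,
        thrust::raw_pointer_cast(dDepth.data()));
    cudaDeviceSynchronize();

    // Key-value sort: ascending depth, carrying triangle numbers along.
    thrust::sort_by_key(dDepth.begin(), dDepth.end(), dOrder.begin());

    thrust::host_vector<int> hOrder = dOrder;
    thrust::host_vector<float> hDepth = dDepth;
    for (int rank = 0; rank < numTris; ++rank) {
        printf("rank %d: original triangle %d (depth %.2f)\n",
               rank, hOrder[rank], hDepth[rank]);
    }
    return 0;
}
```

After the sort, a triangle's rank in the sorted order can serve as the triangle number passed to atomicMax, so the highest rank covering a pixel is the one that survives.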

Results
resolution	serial time (ms)	parallel time (ms)	speedup
5760x3600	528.36	243.60	2.17
1920x1080	54.26	26.51	2.05
400x600	9.28	5.55	1.67

The serial bunny is on the left and the parallel bunny is on the right. As you can see, the bunny came out unscathed.

A PDF printout of the NVIDIA Visual Profiler's (nvvp) findings for the most expensive kernel can be found at http://users.csc.calpoly.edu/~zarend/bunnies/nvvp_report.pdf
nvprof printout:

==3112== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 38.14%  75.289ms        16  4.7056ms     922ns  75.256ms  cudaEventCreate
 29.49%  58.211ms         9  6.4679ms  1.4290us  27.593ms  cudaDeviceSynchronize
 20.44%  40.347ms         5  8.0693ms  17.504us  39.741ms  cudaMemcpy
  6.96%  13.731ms         1  13.731ms  13.731ms  13.731ms  cudaHostAlloc
  3.35%  6.6162ms         1  6.6162ms  6.6162ms  6.6162ms  cudaFreeHost
  0.67%  1.3262ms         6  221.04us  5.5920us  1.1869ms  cudaFree
  0.31%  611.85us        12  50.987us  5.6280us  200.42us  cudaMalloc
  0.18%  353.61us        33  10.715us  6.8450us  29.593us  cudaLaunch
  0.16%  308.59us         2  154.29us  132.51us  176.08us  cudaGetDeviceProperties
  0.11%  212.33us        83  2.5580us     172ns  92.901us  cuDeviceGetAttribute
  0.06%  115.29us         8  14.411us  7.6690us  31.970us  cudaEventSynchronize
  0.04%  70.112us        16  4.3820us  1.8930us  16.000us  cudaEventRecord
  0.03%  67.232us        20  3.3610us  2.8270us  4.4960us  cudaFuncGetAttributes
  0.02%  35.397us         1  35.397us  35.397us  35.397us  cuDeviceGetName
  0.02%  29.657us        16  1.8530us     759ns  11.621us  cudaEventDestroy
  0.01%  21.588us       149     144ns      95ns  1.3660us  cudaSetupArgument
  0.01%  20.985us         1  20.985us  20.985us  20.985us  cuDeviceTotalMem
  0.01%  18.717us         8  2.3390us  1.0820us  8.8850us  cudaEventElapsedTime
  0.00%  9.8140us        33     297ns     156ns  3.3980us  cudaConfigureCall
  0.00%  4.6100us         7     658ns     313ns  1.3190us  cudaGetDevice
  0.00%  3.6240us         2  1.8120us     272ns  3.3520us  cuDeviceGetCount
  0.00%  1.4890us         7     212ns     160ns     307ns  cudaGetLastError
  0.00%     732ns         2     366ns     245ns     487ns  cuDeviceGet