Zach Arend Final Project for Intro to Computer Graphics (CPE 471)
Parallel Rasterizer in CUDA

For my final project, I did program 1 again. This time, I achieved roughly a 2x speedup by parallelizing it with CUDA, the parallel programming framework for NVIDIA graphics cards. The biggest challenge I faced was allocating memory on the host (CPU) for the pixel array and copying results back to the host. The biggest lesson learned is that memory is expensive.

A big challenge in parallelizing a rasterizer is managing the z-buffer: many threads may try to write to the same location at the same time, so those writes must be coordinated. To get around this, I sorted the triangles by depth, then used the atomicMax operation to fill in the z-buffer with the nearest triangle at each point. atomicMax is a function provided by the CUDA runtime that atomically replaces an array element with the maximum of its current value and a new value.
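A sketch of what this z-buffer fill step could look like, assuming the triangles have already been sorted so that a larger index means a nearer triangle. The Triangle layout, the coverage test, and all names here (fillZBuffer, covers, etc.) are illustrative stand-ins, not the project's actual code:

```cuda
struct Triangle { float x[3], y[3]; };  // screen-space vertices (sketch)

__device__ float min3(float a, float b, float c) { return fminf(a, fminf(b, c)); }
__device__ float max3(float a, float b, float c) { return fmaxf(a, fmaxf(b, c)); }

// Edge-function coverage test: a pixel is inside the triangle when it
// falls on the same side of all three edges.
__device__ bool covers(const Triangle &t, float px, float py)
{
    float e0 = (t.x[1] - t.x[0]) * (py - t.y[0]) - (t.y[1] - t.y[0]) * (px - t.x[0]);
    float e1 = (t.x[2] - t.x[1]) * (py - t.y[1]) - (t.y[2] - t.y[1]) * (px - t.x[1]);
    float e2 = (t.x[0] - t.x[2]) * (py - t.y[2]) - (t.y[0] - t.y[2]) * (px - t.x[2]);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

// One thread per triangle: for every pixel in the triangle's bounding
// box, atomicMax keeps the largest triangle index seen so far, which
// after the depth sort corresponds to the nearest triangle at that pixel.
__global__ void fillZBuffer(int *zBuffer, const Triangle *tris,
                            int numTris, int width, int height)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    int x0 = max(0,          (int)floorf(min3(tris[t].x[0], tris[t].x[1], tris[t].x[2])));
    int x1 = min(width - 1,  (int)ceilf (max3(tris[t].x[0], tris[t].x[1], tris[t].x[2])));
    int y0 = max(0,          (int)floorf(min3(tris[t].y[0], tris[t].y[1], tris[t].y[2])));
    int y1 = min(height - 1, (int)ceilf (max3(tris[t].y[0], tris[t].y[1], tris[t].y[2])));

    for (int y = y0; y <= y1; y++)
        for (int x = x0; x <= x1; x++)
            if (covers(tris[t], (float)x, (float)y))
                atomicMax(&zBuffer[y * width + x], t);
}
```

The z-buffer would be initialized to -1 (meaning "no triangle") before launching this kernel.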

Although I thought the atomic operations would be expensive, they paled in comparison to the overhead of memory allocation. My parallel rasterizer spends about half of its time allocating pinned memory on the host for the pixel array. I chose pinned memory because data can be copied across the bus to it much faster; this speedup outweighs the extra time required to allocate it.
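The pinned-allocation pattern looks roughly like this (buffer names and the RGB size calculation are illustrative, not copied from the project):

```cuda
// cudaHostAlloc returns page-locked (pinned) host memory, which the DMA
// engine can copy across the bus faster than pageable malloc'd memory.
unsigned char *pixels = NULL;
size_t bytes = (size_t)width * height * 3;  // RGB, one byte per channel
cudaHostAlloc((void **)&pixels, bytes, cudaHostAllocDefault);

// ... launch kernels that render into a device-side buffer d_pixels ...

// Device-to-host copy back into the pinned buffer.
cudaMemcpy(pixels, d_pixels, bytes, cudaMemcpyDeviceToHost);

cudaFreeHost(pixels);  // pinned memory must be freed with cudaFreeHost
```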

summary of algorithm:
  allocate pinned host memory for the pixel array
  copy mesh vertices and indices from host to device
  populate an array of triangles from the vertices and indices
  calculate the depth of each triangle by averaging its three vertices
  key-value sort the triangles by depth
  create a mapping from pixel to triangle number
    -> launch a thread for each triangle
       for each pixel in its hit area:
         atomicMax with its triangle number at that pixel's location in the z-buffer
  rasterize each pixel
    -> launch a thread for each pixel
       get the triangle number from the z-buffer and rasterize that pixel
  copy results back to host
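The key-value sort step above maps naturally onto Thrust's sort_by_key. A minimal sketch, assuming per-triangle depths have already been computed on the device (function and variable names are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sortTrianglesByDepth(thrust::device_vector<float> &depths,
                          thrust::device_vector<int> &triangleIds)
{
    // Sorts depths ascending and permutes triangleIds the same way, so
    // a triangle's position in the sorted order encodes its depth rank.
    thrust::sort_by_key(depths.begin(), depths.end(), triangleIds.begin());
}
```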

resolution    serial time (ms)    parallel time (ms)    speedup
5760x3600     528.36              243.60                2.17
1920x1080     54.26               26.51                 2.05
400x600       9.28                5.55                  1.67

The serial bunny is on the left and the parallel bunny is on the right. As you can see, the bunny came out unscathed.

A PDF printout of the NVIDIA Visual Profiler's (nvvp) findings for the most expensive kernel can be found at
nvprof printout: