Multi Threaded Ray Tracing using Intel Threading Building Blocks
Description
For this class we had the option of developing our projects using either just C++ or Cuda/OpenCL. Going with either of these options had their advantages and disadvantages. Using Cuda would have enabled me to write a very highly parallel application, but there were many drawbacks since it is much more difficult to implement advanced object oriented features, it would have also been written only for Nvidia's C/C++ compiler and only able to run on Nvidia hardware. I could have also went with OpenCL to be less platform specific, but the amount of work to implement it in a quarter would have been too much. For my project, I choose to go with the straight C++ route for taking advantage of the advanced object oriented features such as polymorphism. The big disadvantage of going this route was that I was completely CPU bound, but by using Intel's Threading Building Blocks library I was able to take advantage of multiple cores on the CPU and still have some parallelism to speed up ray tracing. Taking this approach also allowed me to use linked structures and handle large scenes with many objects where it would have been either very difficult or impossible to implement with Cuda/OpenCL. Even though this approach would not yield the fastest in ray tracing, it did prove effective at handling large tasks for distributive ray tracing such as Monte Carlo global illumination, depth of field, stratified anti aliasing, and ambient occlusion.
Implementation
In ray tracing, rays are often dependent of sub rays before being evaluated for a single pixel forming a tree of rays. The most straight forward way to create and evaluate these ray trees is by using recursive functions, but these functions are very difficult to implement in parallel. Right from the start I implemented an iterative solution that required a number of steps for creating and evaluating rays. The following steps are taken for each ray.
- Creation- The ray object created, initialized, and scheduled to be casted.
- Casting- The nearest collision is determined and sub rays are created. If no sub rays are created, the ray is immediately evaluated.
- Evaluation- The ray determines its color and draws it to either its parent ray or a draw buffer for the final image. If the ray is the last dependent ray of its parent it schedules the parent to be evaluated. If the ray is the root ray, it attempts to get the next pixel in the draw buffer and schedules its self for another casting step if successful.
In the serial version of this algorithm of these steps, the rays are scheduled using linked stacks on the individual rays. After each step on a ray object the head of the linked stack is next to be processed and when it is empty the whole image is done. This algorithm works by creating the breath of dependent rays on the casting step and evaluating rays in depth first before any new casts casts. This keeps the number of ray objects used in the tree to a minimum and for performance these objects are reused in a pool rather than being created and deleted many times. The parallel form of this algorithm makes heavy use of tbb's task system. Each tbb task can be seen as a lightweight thread that gets assigned to a real thread using tbb's scheduling. The default scheduling used by tbb is depth first with breath stealing which operates much like how the serial version worked with linked stacks. In the parallel algorithm, every time a ray object is scheduled it is given to a new tbb task to be run later. For each ray that is created two tbb tasks are typically needed for casting and evaluation unless the ray has no sub rays where one task casts and evaluates. Like the serial version, the number of ray objects is kept very minimal and each task does a small part which is desirable for tbb tasks to run efficiently.
The behavior of the ray objects used in either the serial or parallel algorithms are determined by shader objects which are reminiscent of glsl shaders used in OpenGL programming, but for rays instead. When each ray is created it is associated with a shader object where the shader dictates what the ray does for each step. A number of different shaders can be used from those used by primary drawing rays, reflection/refraction rays, and light rays. This custom shader system allows for a great deal of flexibility in how and what the ray tracer does with out the need to modify the core functionality of the program. The way ambient occlusion works is a good example of how this happens where there is only one ambient occlusion shader that is used for drawing just ambient occlusion on the image. This flexibility opens the door for this program to be used for a great variety of ray tracing techniques.
Conclusion
The following images show examples of Monte Carlo global illumination, depth of field, and ambient occlusion. The bunny scenes in particular took about 30 minutes each on machines with 24 cores. I estimated several billion rays were needed to create each one. In a comparison between serial and parallel on a 24 core machine, a simplified bunny scene took the serial algorithm about 81 seconds while the parallel version took 9 seconds giving a 9x speedup. Although it was not a 24x speedup, the larger renders that took 30 minutes would have taken four and a half hours. I suspect much of this wasted time is spent creating and freeing tbb tasks, but tbb does provide a way to recycle tasks to help improve these times which I did not have time to implement. In the end I was quite happy with the results of this project.