Rusinkiewicz and Levoy present a multiresolution point-based rendering system for large meshes in their paper, QSplat. Their system favors parallelization because multiple processors can process different subtrees of a model’s sphere hierarchy at the same time. In this project, the rendering algorithm of the author's original source code was modified to take advantage of multiprocessing using threads. Benchmarks run on a Dual Pentium 4 Xeon 2.8Ghz using software rendering resulted in an increase of the original performance of QSplat for larger models.
.qs
format models from the original QSplat homepage..qs
file in step 1.Per request, here are a number of images with Parallelized QSplat running with different number of threads rendering only part of the whole Buddha model.
Model | Threading | Splats Rendered | Render time (sec.) | Splats per second | Speed up | |
---|---|---|---|---|---|---|
Buddha | Non-Threaded | 1,466,944 | 1.151 | 1,274,495 | 262% | |
Threaded | 1,281,384 | 0.384 | 3,336,937 | |||
Bunny | Non-Threaded | 110,488 | 0.384 | 287,729 | 92% | |
Threaded | 102,089 | 0.384 | 265,856 | |||
Dragon | Non-Threaded | 1,707,981 | 1.342 | 1,272,713 | 134% | |
Threaded | 1,470,871 | 0.864 | 1,702,396 | |||
Lion | Non-Threaded | 384,780 | 0.576 | 668,020 | 109% | |
Threaded | 349540 | 0.480 | 728,208 | |||
Lucy | Non-Threaded | 2,101,164 | 1.534 | 1,369,728 | 133% | |
Threaded | 1,745,465 | 0.960 | 1,818,192 |
Detailed numerical results can be found here.
Looking at the perfomance chart, models with more vertices (i.e. Buddha, Dragon, and Lucy) benefit more from the threading of the QSplat rendering algorithm. Models with less vertices (i.e. Bunny and Lion) do not benefit as much. Rendering the Bunny model using the threaded algorithm actually decreases its performance.
An explanation for this behavior would be that the threaded algorithm has extra overhead when setting up threads to run the rendering. There exists a threshold of vertex points that should be met before an increase in performance is apparent.
Currently, the naive implementation of Parallelized QSplat looks at the number of children of the root node of a model's sphere hierarchy and kicks off a thread for each child. This causes a lot of overheaded when using level-of-detail rendering because threads are made each time a model is refined. A better implementation of the parallelization (maybe using a static amount of threads or some other parallelization mechanism) would further increase performance.
Another optimization would consist of finding the number of points such that rendering a smaller model would not cause a decrease in performance. Once found, the rendering algorithm can be changed so that if the number of vertices in a model is less than the found threshold, then a non-threaded branch of the program would render the model.