Peter Godkin and Ryan Quinlan

The N-Body Gravitational Problem

According to Newton's law of universal gravitation, the force between two bodies is

    F = G·m1·m2 / r²

where m1 and m2 are the masses of the two bodies, r is the distance between them, and G is the gravitational constant. This means that with n bodies, calculating the total gravitational force exerted on each body takes n − 1 calculations, or roughly n² calculations per frame overall. This leaves us with a problem of O(n²) computational complexity, which is usually not ideal for real-time calculation, as performance tends to suffer with larger n.

Solution

There are many techniques used to approximate these forces, such as the Barnes-Hut method [1], which builds a tree and yields an O(n log n) algorithm, and the fast multipole method [2], which can reduce the complexity to O(n). We decided to use the exact but brute-force method of calculating every force between every pair of bodies. Within each frame, the calculation to find the force on one body is independent of the calculations needed to find the force of gravity on any other body. So this approach is not only simple to parallelize but also the most accurate, as it is not an approximation.
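
As a concrete illustration, a serial sketch of this brute-force step might look like the following. The identifiers (compute_accelerations, EPS, and so on) are illustrative, not our actual code:

    #include <math.h>

    /* Serial sketch of the brute-force step: for each body i, sum the
     * gravitational acceleration contributed by every other body j.
     * This is the O(n²) work that each frame must perform. */
    void compute_accelerations(int n, float* x, float* y, float* z,
                               float* mass, float* ax, float* ay, float* az)
    {
        const float G   = 6.674e-11f;  /* gravitational constant */
        const float EPS = 1e-9f;       /* softening term so r never hits zero */

        for (int i = 0; i < n; ++i) {
            float axi = 0.f, ayi = 0.f, azi = 0.f;
            for (int j = 0; j < n; ++j) {    /* n - 1 interactions per body */
                if (j == i) continue;
                float dx = x[j] - x[i];
                float dy = y[j] - y[i];
                float dz = z[j] - z[i];
                float r2 = dx*dx + dy*dy + dz*dz + EPS;
                /* a_i += G * m_j / r², directed along the vector to j */
                float s = G * mass[j] / (r2 * sqrtf(r2));
                axi += s * dx;  ayi += s * dy;  azi += s * dz;
            }
            ax[i] = axi;  ay[i] = ayi;  az[i] = azi;
        }
    }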

Approach

Our application used OpenMP to parallelize the code, for its simplicity and ease of integration. Our real-time visualizations used OpenGL and a particle system. We collected timing measurements after each new performance technique we applied, in order to determine where the majority of the performance gains came from.


We decided to offload all of the gravity calculations to the Xeon Phi because of its great potential for parallelization. We used the Knights Corner model, which has about 60 cores. For our approach, we sent the current positions of each particle to the card, did all the necessary calculations there, and sent back the accelerations of each particle. We improved on this further by sending the positions simultaneously to two different Xeon Phi cards and calculating half of the accelerations on each. Figure 1 below shows the method we used to offload to a single card.

Figure 1: Offloaded Code
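
In rough outline, the offload pattern works like this. The sketch below uses the Intel compiler's offload pragma with illustrative identifiers; it is not a verbatim copy of the figure:

    #include <math.h>

    /* Sketch of offloading the gravity step to one Xeon Phi (mic:0).
     * Positions and masses are copied in, accelerations are copied out,
     * and the outer loop over bodies runs across the coprocessor's cores. */
    void step_on_phi(int n, float* x, float* y, float* z,
                     float* mass, float* ax, float* ay, float* az)
    {
        #pragma offload target(mic:0) \
            in(x, y, z, mass : length(n)) \
            out(ax, ay, az : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                float axi = 0.f, ayi = 0.f, azi = 0.f;
                for (int j = 0; j < n; ++j) {
                    if (j == i) continue;
                    float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
                    float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;
                    float s  = 6.674e-11f * mass[j] / (r2 * sqrtf(r2));
                    axi += s*dx;  ayi += s*dy;  azi += s*dz;
                }
                ax[i] = axi;  ay[i] = ayi;  az[i] = azi;
            }
        }
    }

    /* For two cards, the same region can be issued once per card
     * (target(mic:0) for the first half of the bodies, target(mic:1)
     * for the second), asynchronously via the offload signal/wait
     * clauses. */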

When offloading, the data transfer rate can often be a bottleneck for this model of Xeon Phi. If you aren't giving the card enough work to justify the data you are sending it, you likely won't see any speedup. This is not the case for our approach, because for every acceleration we send back from the card, we do n calculations on the card. For example, at n = 16,000 each acceleration returned represents roughly 16,000 body-body interactions' worth of arithmetic. So the time it takes to do those n floating-point operations is more of a limit than the time it takes to send a single point of data to and from the card.


Further improvements might have come from sending all the initial positions to the Xeon Phi once at the beginning and only sending back the newly calculated positions each frame. By keeping track of the positions on the card, we might have been able to cut the data transfer time in half. However, given that the data transfer rate wasn't a limiting factor, we believe this wouldn't have made a significant difference in runtime.

Vectorization

The other form of parallelism we used besides multithreading was vectorization, also known as SIMD (single instruction, multiple data) parallelism. This works by loading the data for multiple points into a single wide register and then applying the same operation to all of them simultaneously. We were able to get extra speedup by taking advantage of the Xeon Phi's 512-bit SIMD vector registers.

Figure 2: Initializing Aligned Arrays
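
The allocation pattern in figure 2 looks roughly like the following sketch. The names are illustrative, and the n = 16,000 size is just the largest problem size from our tests:

    #include <immintrin.h>  /* _mm_malloc, _mm_free */

    const int n = 16000;  /* problem size used in our tests */

    /* Allocate each coordinate array on a 64-byte boundary so it starts
     * on a cache line, which is also the width of the Phi's 512-bit
     * vector registers (16 floats). */
    float* x = (float*)_mm_malloc(n * sizeof(float), 64);
    float* y = (float*)_mm_malloc(n * sizeof(float), 64);
    float* z = (float*)_mm_malloc(n * sizeof(float), 64);

    /* In the function containing the hot loop, restate the alignment so
     * the compiler can emit aligned vector loads and stores: */
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);
    __assume_aligned(z, 64);

    /* ... use the arrays, then release them: */
    _mm_free(x);  _mm_free(y);  _mm_free(z);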

There are numerous things you have to take into account when vectorizing code, and it's easy to accidentally cause a slowdown instead. This is the main reason our code looks a bit strange. You can see in figure 2 that we declared our arrays aligned to the cache line size and later call __assume_aligned to let the compiler know. Another thing we did was use __builtin_expect in our if statement. When doing SIMD parallelization, everything in the loop body is typically executed even when a given branch is not actually taken, with the unwanted results masked out. The __builtin_expect hint tells the compiler which way a branch will almost always go, so it can generate code that favors that path and only falls back when the condition goes the other way. This didn't end up giving us much speedup, because we expected the big chunk of code in the if statement in figure 1 to run almost every time anyway.


Often the compiler can predict that vectorizing will give a speedup and will do it automatically. But for us, even after all the changes we made, we still had to explicitly force vectorization with the pragma simd directive.
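
Put together, forcing vectorization of the inner interaction loop looks roughly like this sketch (Intel compiler, reusing the illustrative names from the earlier sketches, with <math.h> assumed for sqrtf):

    /* Force vectorization of the inner interaction loop. The reduction
     * clause gives each vector lane a private partial sum, and
     * __builtin_expect marks the j != i branch as almost always taken. */
    float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
    #pragma simd reduction(+:axi,ayi,azi)
    for (int j = 0; j < n; ++j) {
        if (__builtin_expect(j != i, 1)) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;
            float s  = 6.674e-11f * mass[j] / (r2 * sqrtf(r2));
            axi += s * dx;  ayi += s * dy;  azi += s * dz;
        }
    }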

Data

As expected, our performance gains scaled with the problem size. For our baseline we ran the program with no parallelism on a 4-core, 2.5 GHz Intel Xeon E5-2609 v2 processor, without offloading. Adding vectorization gave up to a 1.6x speedup at n = 16,000. Running with 2 threads, along with vectorization, gave nearly a 3.2x speedup. Next we offloaded to a single Xeon Phi card, ran the offloaded function in parallel, and achieved over a 22x speedup. Finally, by offloading to two cards at once we achieved a final speedup of nearly 42x at n = 16,000.

Discussion

The problem we chose lends itself well to vectorization, parallelization, and offloading to the Xeon Phi. However, there are many other problems where one could achieve speedup by taking a similar approach. Not only are many other real-world equations similar in nature and complexity to ours, but fields such as simulation, modelling, and high-performance computing can apply the same techniques. Essentially, any algorithm that can be run in parallel and has an arithmetic intensity greater than O(n) is a good candidate.

Conclusion

Looking at the graph of our results, doing the work in parallel makes it almost seem as though we started with an O(n²) problem and ended with an O(n) one. That isn't truly the case, and with many more particles the curve would certainly start to bend upward again, but you can also add more cards to offload to. This lets the hardware scale with the problem: with a massive number of points, instead of offloading to two cards, you could offload to however many you need to reach the speed you want.

Citations

  1. Barnes, J. and Hut, P. (1986). "A Hierarchical O(N log N) Force-Calculation Algorithm." Nature, Vol. 324, pp. 446-449.
  2. Rokhlin, V. (1985). "Rapid Solution of Integral Equations of Classical Potential Theory." Journal of Computational Physics, Vol. 60, pp. 187-207.