n-body simulation on a ring

The following code demonstrates an n-body simulation on a ring. This n-body simulation predicts how n massive points (bodies) in space move when affected by gravity. All bodies pull on one another, causing them to accelerate toward each other. Each body starts with a known mass and position and is initially stationary. The plots below trace the coordinates of the bodies over time: each line represents the path of a single body, and the colour of the line encodes the passage of time, starting at blue and ending at red. The simulation is on a ring because it is designed for parallel computation on a set of machines capable of communicating in a ring network topology. That is, machine \( i \) can communicate with machines \( i-1 \) and \( i+1 \), except that the first and last machines communicate with each other instead of with the nonexistent machines beyond either end of the ring.
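
For concreteness, the wrap-around neighbour rule can be expressed with modular arithmetic. The small R sketch below is illustrative only (the helper name ring_neighbours is not from the original code) and assumes the \( p \) machines are indexed \( 0 \) to \( p-1 \).

# Illustrative only: left and right neighbours on a ring of p machines indexed 0..(p-1).
ring_neighbours = function(i, p) {
    list(left = (i - 1 + p) %% p,  # machine 0 wraps around to machine p-1
         right = (i + 1) %% p)     # machine p-1 wraps around to machine 0
}
ring_neighbours(0, 4)  # left = 3, right = 1
ring_neighbours(3, 4)  # left = 2, right = 0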

How it works

The software works by assigning each of the \( n \) bodies to an MPI process, one of which is a master process that does a little more work than the rest. There are three major phases. The first is initialization, where the master node sends the initial coordinates around the ring for every process to observe. Immediately after initialization, the master node enters the second phase, simulation. During simulation there is always a latest version of the coordinates travelling around the ring. While the coordinates are in transit, each process calculates its own updated coordinates; this way, once the latest coordinates reach a node, that node can apply its update and pass them on before calculating again. Thus the coordinates travel without needing to stop, unless they pass around the entire ring before a node has finished updating its position. In the third phase, once the desired number of iterations is reached, the master node writes the updated coordinates to standard output.
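
The "calculating its own updated coordinates" step is ordinary Newtonian gravity. The serial R sketch below performs one simple integration step for all bodies at once; the function name step_bodies, the gravitational constant G, the timestep dt, and the softening term eps are illustrative assumptions and are not taken from the MPI code.

# Illustrative serial sketch of one simulation step (not the MPI code itself).
# pos: n x 2 matrix of positions, mass: length-n vector, vel: n x 2 matrix of velocities.
step_bodies = function(pos, mass, vel, dt = 0.01, G = 1, eps = 1e-09) {
    n = nrow(pos)
    acc = matrix(0, n, 2)
    for (i in 1:n) {
        for (j in 1:n) {
            if (i == j) next
            d = pos[j, ] - pos[i, ]    # vector from body i to body j
            r2 = sum(d * d) + eps      # softened squared distance
            acc[i, ] = acc[i, ] + G * mass[j] * d/r2^1.5
        }
    }
    vel = vel + dt * acc               # accelerate, then drift with the new velocities
    list(pos = pos + dt * vel, vel = vel)
}

The MPI version distributes exactly this work: while the coordinates circulate around the ring, each process computes the update for its own body only.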

Examples

The following data set is generated from two sparsely sampled Gaussians. The smaller cluster has more total mass than the more numerous cluster.

setwd("./sim/")
data = read.table("data.dat")
# data # see appendix for data
source("script.R")
sim(0, 500, 5)

plot of chunk unnamed-chunk-1

The following data set is a collection of equally massed bodies, equally spaced on the perimeter of a circle. As time progresses, the bodies move together along what appears to be a parabolic path. Once the bodies become arbitrarily close (because collisions are not calculated), they gain very high velocities.

setwd("./temp/")
circle = read.table("circle.dat")
# circle # see appendix for data
sim(0, 400, 5)

plot of chunk unnamed-chunk-2
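
For reference, the circle.dat values listed in the appendix are consistent with 50 equally spaced points on a circle of radius 10, each of mass 5. The original generator is not shown; a sketch that would reproduce comparable values is:

theta = 2 * pi * (1:50)/50                          # 50 equally spaced angles
circle = data.frame(x = round(10 * cos(theta), 4),  # radius 10, as in the appendix
                    y = round(10 * sin(theta), 4),
                    mass = 5)
# write.table(circle, "circle.dat", row.names = FALSE, col.names = FALSE)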

The following data set is the circle set with a single additional body initialized to the top-right of the circle. The additional body is about 300 times more massive than the other bodies and thus strongly dictates their movement, resulting in beautiful oscillations.

setwd("./tube")
tube = read.table("tube.dat")
# tube # see appendix for data
source("script.R")
sim(0, 1000, 5)

plot of chunk unnamed-chunk-3
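
Per the appendix, tube.dat is simply the circle data with one extra row: a body of mass 1500 (300 times the mass-5 bodies) placed at (20, 20), to the top-right of the circle. Building it from the circle data could look like:

tube = rbind(circle, c(20, 20, 1500))  # single heavy body at the top-right
# write.table(tube, "tube.dat", row.names = FALSE, col.names = FALSE)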

Parallel algorithm analysis

The MPI n-body software was rewritten in a single-threaded form and run on the same cluster as the MPI implementation for benchmarking. The parallel implementation was given 3 nodes X 2 processors X 4 cores = 24 cores (each processor a quad-core 2.67 GHz Intel Xeon X5550). Communication was over 1 Gb/s Ethernet.

Compute times

From the compute times, we can see that the parallel computation is decisively faster for all but the smallest inputs.

N = c(2, 3, 5, 9, 17, 33, 65, 129, 257)
mpi = c(3763, 7673, 12721, 78104, 113275, 113275, 334473, 1401958, 7020787)
serial = c(2023, 5901, 17728, 59752, 207549, 763433, 2946789, 11653092, 46534030)
plot(N, serial, type = "l", col = "blue", lwd = 2, ylim = c(0, max(serial)), 
    xlab = "Problem size, N", ylab = "Microseconds", main = "Compute time\nSerial (blue) vs Parallel (red)")
lines(N, mpi, col = "red", lwd = 2)

plot of chunk unnamed-chunk-4

SpeedUp

The following speedup plot shows some interesting behaviour. Initially, the overhead of the extra communication is burdensome, and the serial implementation is faster. There is then an intermediate phase, for problem sizes from 5 to 65 (nearly triple the number of available workers, 24), over which the speedup climbs steadily. Finally, as each worker is given several threads to manage, the speedup declines.

\[ S_p = \frac{T_1}{T_p} \]

plot(N, serial/mpi, type = "l", col = "red", lwd = 2, xlab = "Problem size, N", 
    ylab = "serial / parallel time", main = "SpeedUp")

plot of chunk unnamed-chunk-5

Efficiency

This efficiency plot tells much the same story as the SpeedUp plot, but also highlights the fact that loading each worker with several threads to manage is inefficient. A better design would be to assign each worker several bodies and only a single thread, which would reduce thread-management and message-passing costs; a sketch of one such distribution follows the plot below.

\[ E_p = S_p/p \]

proc = c(2, 3, 5, 9, 17, 24, 24, 24, 24)  # processors actually used per N, capped at the 24 cores
plot(N, serial/(proc * mpi), type = "l", col = "red", lwd = 2, xlab = "Problem size, N", 
    ylab = "serial time /( p * parallel time )", main = "Efficiency")

plot of chunk unnamed-chunk-6
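
As a rough illustration of the alternative design suggested above, the following sketch deals the \( n \) bodies out to \( p \) single-threaded workers in near-equal contiguous blocks. The helper name block_assignment and the blocking scheme are illustrative assumptions, not part of the original code.

# Hypothetical sketch: map n bodies onto p single-threaded workers in
# near-equal contiguous blocks, instead of one process (or thread) per body.
block_assignment = function(n, p) {
    split(1:n, cut(1:n, breaks = p, labels = FALSE))
}
lengths(block_assignment(257, 24))  # each of the 24 workers holds 10 or 11 bodies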

Appendix

data
##         V1      V2 V3
## 1   1.0936  1.6491 10
## 2   0.2488  0.4840  2
## 3  -0.3090  0.8258  5
## 4  -0.5656  0.8170  5
## 5   0.4137  1.8089  6
## 6   1.2256  1.1387  3
## 7  -0.7973 -0.3504  4
## 8   7.4821  6.6419 20
## 9   6.8572  7.3097  6
## 10  5.3781  6.0461 26
circle
##          V1     V2 V3
## 1    9.9211  1.253  5
## 2    9.6858  2.487  5
## 3    9.2978  3.681  5
## 4    8.7631  4.818  5
## 5    8.0902  5.878  5
## 6    7.2897  6.845  5
## 7    6.3742  7.705  5
## 8    5.3583  8.443  5
## 9    4.2578  9.048  5
## 10   3.0902  9.511  5
## 11   1.8738  9.823  5
## 12   0.6279  9.980  5
## 13  -0.6279  9.980  5
## 14  -1.8738  9.823  5
## 15  -3.0902  9.511  5
## 16  -4.2578  9.048  5
## 17  -5.3583  8.443  5
## 18  -6.3742  7.705  5
## 19  -7.2897  6.845  5
## 20  -8.0902  5.878  5
## 21  -8.7631  4.818  5
## 22  -9.2978  3.681  5
## 23  -9.6858  2.487  5
## 24  -9.9211  1.253  5
## 25 -10.0000  0.000  5
## 26  -9.9211 -1.253  5
## 27  -9.6858 -2.487  5
## 28  -9.2978 -3.681  5
## 29  -8.7631 -4.818  5
## 30  -8.0902 -5.878  5
## 31  -7.2897 -6.845  5
## 32  -6.3742 -7.705  5
## 33  -5.3583 -8.443  5
## 34  -4.2578 -9.048  5
## 35  -3.0902 -9.511  5
## 36  -1.8738 -9.823  5
## 37  -0.6279 -9.980  5
## 38   0.6279 -9.980  5
## 39   1.8738 -9.823  5
## 40   3.0902 -9.511  5
## 41   4.2578 -9.048  5
## 42   5.3583 -8.443  5
## 43   6.3742 -7.705  5
## 44   7.2897 -6.845  5
## 45   8.0902 -5.878  5
## 46   8.7631 -4.818  5
## 47   9.2978 -3.681  5
## 48   9.6858 -2.487  5
## 49   9.9211 -1.253  5
## 50  10.0000  0.000  5
tube
##          V1     V2   V3
## 1    9.9211  1.253    5
## 2    9.6858  2.487    5
## 3    9.2978  3.681    5
## 4    8.7631  4.818    5
## 5    8.0902  5.878    5
## 6    7.2897  6.845    5
## 7    6.3742  7.705    5
## 8    5.3583  8.443    5
## 9    4.2578  9.048    5
## 10   3.0902  9.511    5
## 11   1.8738  9.823    5
## 12   0.6279  9.980    5
## 13  -0.6279  9.980    5
## 14  -1.8738  9.823    5
## 15  -3.0902  9.511    5
## 16  -4.2578  9.048    5
## 17  -5.3583  8.443    5
## 18  -6.3742  7.705    5
## 19  -7.2897  6.845    5
## 20  -8.0902  5.878    5
## 21  -8.7631  4.818    5
## 22  -9.2978  3.681    5
## 23  -9.6858  2.487    5
## 24  -9.9211  1.253    5
## 25 -10.0000  0.000    5
## 26  -9.9211 -1.253    5
## 27  -9.6858 -2.487    5
## 28  -9.2978 -3.681    5
## 29  -8.7631 -4.818    5
## 30  -8.0902 -5.878    5
## 31  -7.2897 -6.845    5
## 32  -6.3742 -7.705    5
## 33  -5.3583 -8.443    5
## 34  -4.2578 -9.048    5
## 35  -3.0902 -9.511    5
## 36  -1.8738 -9.823    5
## 37  -0.6279 -9.980    5
## 38   0.6279 -9.980    5
## 39   1.8738 -9.823    5
## 40   3.0902 -9.511    5
## 41   4.2578 -9.048    5
## 42   5.3583 -8.443    5
## 43   6.3742 -7.705    5
## 44   7.2897 -6.845    5
## 45   8.0902 -5.878    5
## 46   8.7631 -4.818    5
## 47   9.2978 -3.681    5
## 48   9.6858 -2.487    5
## 49   9.9211 -1.253    5
## 50  10.0000  0.000    5
## 51  20.0000 20.000 1500