R Performance Study: Summing in Parallel

This performance study compares a simple task, the summation of a vector of integers, using various techniques in the R language including the old standard for loop, base R’s vectorized sum function, and a doParallel package implementation. The microbenchmark package is used to perform the timing measurements and visualization of the timing measurements is presented via ggplot2.

require(microbenchmark, quietly=TRUE)
require(doParallel, quietly=TRUE)
require(ggplot2, quietly=TRUE)

I am running on a 4 core laptop, so I will use 3 cores for this experiment, leaving one for keeping my machine reasonably responsive.

detectedCores <- parallel::detectCores()
registerDoParallel(cores=detectedCores - 1) 
print(detectedCores)

## [1] 4

Summing with For Loop

The for loop to be used in this study is wrapped in the function forLoopSum and is defined as follows:

forLoopSum <- function(x)
{
  sum <- 0
  for( a in x){
    sum <- sum + a
  }
  return (sum)
}

Summing the R Way

The base R sum function will be used as is.

sum(x)

Summing in Parallel

For parallelization, the doParallel package will provide the interface to base R’s parallel package as described in Weston and Calaway (2014). The split approach used to allocate data amongst various batches is taken from a StackOverflow post (O’Hanlon, 2013). Finally, the base R foreach function is used to to iterate over the batches and dispatch them to the parallelizer (Team, 2014).

parallelSum <- function(x) {
  
  items <- length(x)
  batches <- detectedCores * 4
  batchSets <- split(x, rep(1:batches, length.out=items))
  
  finalSum <- foreach(b=iter(batchSets, by='row'), .combine="+") %dopar% sum(b)
  
  return (finalSum)
}

Performance Experiment

Using the functions described above, a microbenchmark experiment is excuted as shown in the following code. Powers of ten (10) are used to measure differences in magnitude of vector size up to 10 million, with 5 executions per size. The results are saved in a data.frame for use in visualization and the appendix.

setSize <- c(1, 2, 3, 4, 5, 6, 7, 8)
resultsDF <- data.frame()

# Loop through the set sizes and perform sub-experiment
for(s in setSize) {
  size <- 10 ^ s
  a <- sample(1:20, size, replace=TRUE)
  
  # Validate
  checkSum <- as.integer(sum(a))
  forSum <- as.integer(forLoopSum(a))
  stopifnot(identical(forSum, checkSum))
  stopifnot(identical(parallelSum(a), checkSum))
  
  # Run performance test
  results <- microbenchmark::microbenchmark(forLoopSum(a), 
                                            sum(a), 
                                            parallelSum(a), 
                                            times=5, unit="ms")
  
  # Save results
  agSum <- summary(results)
  agSumPlus <- data.frame(agSum, count = 10 ^ s) 
  
  resultsDF <- rbind(resultsDF, agSumPlus)
}

Results

The following chart shows the execution times for the various summation techniques.

plot of chunk unnamed-chunk-8

Conslusions

As you can see, the base R sum function is very scalable compared to a standard for loop. A bit surprisingly, batching for parallelization does not improve the execution. Possibly base R is doing something similar under the covers and is optimized for this already, in contrast to my simple batch oriented parallelization which may suffer from non-performant batch determination (maybe the split and rep functions are eating up time?). This is an area for further investigation.

See the appendix that follows for more detailed statistics for each of the benchmark executions.

Appendix

The following table lists the raw microbenchmark statistics from the experiment.

## 
## 
## |expr           |       min|        lq|      mean|    median|        uq|       max| neval| count|
## |:--------------|---------:|---------:|---------:|---------:|---------:|---------:|-----:|-----:|
## |forLoopSum(a)  | 1.630e-02| 1.690e-02| 1.980e-02| 1.750e-02| 2.170e-02| 2.660e-02|     5| 1e+01|
## |sum(a)         | 0.000e+00| 0.000e+00| 1.300e-03| 1.800e-03| 1.800e-03| 3.000e-03|     5| 1e+01|
## |parallelSum(a) | 1.845e+01| 1.886e+01| 1.988e+01| 1.986e+01| 2.026e+01| 2.200e+01|     5| 1e+01|
## |forLoopSum(a)  | 7.000e-02| 7.060e-02| 8.120e-02| 7.300e-02| 9.600e-02| 9.660e-02|     5| 1e+02|
## |sum(a)         | 0.000e+00| 6.000e-04| 1.800e-03| 2.400e-03| 3.000e-03| 3.000e-03|     5| 1e+02|
## |parallelSum(a) | 2.203e+01| 2.304e+01| 2.417e+01| 2.374e+01| 2.530e+01| 2.675e+01|     5| 1e+02|
## |forLoopSum(a)  | 6.481e-01| 6.590e-01| 6.903e-01| 6.759e-01| 6.807e-01| 7.875e-01|     5| 1e+03|
## |sum(a)         | 1.200e-03| 1.800e-03| 3.100e-03| 2.400e-03| 4.800e-03| 5.400e-03|     5| 1e+03|
## |parallelSum(a) | 2.281e+01| 2.283e+01| 2.524e+01| 2.396e+01| 2.688e+01| 2.969e+01|     5| 1e+03|
## |forLoopSum(a)  | 6.507e+00| 6.646e+00| 6.798e+00| 6.897e+00| 6.918e+00| 7.024e+00|     5| 1e+04|
## |sum(a)         | 1.510e-02| 1.930e-02| 1.890e-02| 1.930e-02| 1.990e-02| 2.110e-02|     5| 1e+04|
## |parallelSum(a) | 2.393e+01| 2.442e+01| 2.524e+01| 2.447e+01| 2.564e+01| 2.775e+01|     5| 1e+04|
## |forLoopSum(a)  | 6.802e+01| 7.105e+01| 7.854e+01| 7.364e+01| 8.340e+01| 9.658e+01|     5| 1e+05|
## |sum(a)         | 1.497e-01| 1.750e-01| 2.083e-01| 2.136e-01| 2.142e-01| 2.891e-01|     5| 1e+05|
## |parallelSum(a) | 3.380e+01| 3.694e+01| 4.119e+01| 3.951e+01| 4.597e+01| 4.974e+01|     5| 1e+05|
## |forLoopSum(a)  | 7.043e+02| 7.176e+02| 7.497e+02| 7.291e+02| 7.511e+02| 8.462e+02|     5| 1e+06|
## |sum(a)         | 1.678e+00| 1.681e+00| 1.733e+00| 1.690e+00| 1.729e+00| 1.888e+00|     5| 1e+06|
## |parallelSum(a) | 1.321e+02| 1.382e+02| 1.424e+02| 1.415e+02| 1.462e+02| 1.539e+02|     5| 1e+06|
## |forLoopSum(a)  | 6.949e+03| 7.253e+03| 7.387e+03| 7.465e+03| 7.601e+03| 7.666e+03|     5| 1e+07|
## |sum(a)         | 1.783e+01| 1.789e+01| 1.820e+01| 1.798e+01| 1.803e+01| 1.924e+01|     5| 1e+07|
## |parallelSum(a) | 1.429e+03| 1.462e+03| 1.653e+03| 1.607e+03| 1.858e+03| 1.909e+03|     5| 1e+07|
## |forLoopSum(a)  | 7.306e+04| 7.342e+04| 7.643e+04| 7.425e+04| 7.563e+04| 8.578e+04|     5| 1e+08|
## |sum(a)         | 1.776e+02| 1.801e+02| 1.891e+02| 1.814e+02| 1.926e+02| 2.140e+02|     5| 1e+08|
## |parallelSum(a) | 1.365e+04| 1.384e+04| 1.445e+04| 1.412e+04| 1.461e+04| 1.604e+04|     5| 1e+08|

Source Code

The raw R markdown code used to produce this performance study can be found on GitHub, in my DataAcqMgmt repository.

References

O’Hanlon, S. Split a vector into three vectors of unequal length in R. 2013. URL: http://stackoverflow.com/a/18406749.

Team, R. C. package foreach, version 1.4.1. R Foundation for Statistical Computing, 2014. URL: http://www.inside-r.org/packages/cran/foreach/docs/foreach.

Weston, S. and R. Calaway. Getting Started with doParallel and foreach. 2014. URL: http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf.