Parallel computing with the parallel R package

Most computers these days come with multiple computational units, commonly called CPUs. In addition, each CPU typically contains multiple processing units, commonly called "cores". Each of these processing units is capable of working nearly independently of the others, so a computer can typically carry out several sets of computations at the same time; we call this parallelism. When we program a computer to perform several tasks at the same time, we are practicing parallel programming. The opposite of parallel computing, performing only one task at a time, is called serial computing.

The R programming environment generally runs on only one processor. However, the parallel package, which ships with base R (since version 2.14.0), allows R to use multiple processors simultaneously. To take advantage of this functionality, though, we may need to make some adjustments to our scripts.
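
As a small illustration (a sketch, not part of the original analysis), the following asks parallel workers for their process IDs; on a unix-alike system, where the forking that the parallel package uses is available, you should see two distinct PIDs:

library(parallel)
# each element of 1:4 is handled by a forked worker process; with two
# workers, the results report two distinct process IDs
mclapply(1:4, function(i) Sys.getpid(), mc.cores = 2)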

The background for our parallel programming exercise

In this little exercise, we will be running a computationally intensive calculation: segmenting copy number data using the DNAcopy R package. Copy number data along the genome are expected to be piecewise constant. In other words, regions of the genome may have different copy number states, particularly in cancer, but the data in a small region of the genome will likely share the same copy number state; between regions of common copy number state, there will be discrete, step-like changes. However, the data from copy number arrays or from count-based sequencing are often noisy. The circular binary segmentation algorithm implemented in the DNAcopy package analyzes copy number data and defines the regions of common copy number as well as the breakpoints between those regions; in effect, it performs "smart smoothing" of copy number data. Feel free to read the DNAcopy vignette for details.

The DNAcopy algorithm is rather computationally intensive, so we would like to use parallel programming strategies to speed it up. To do so, we need to define a unit of work over which we will parallelize. In the simulated data defined below, there will be 20 samples, so I will choose to parallelize over samples.

Running the serial example

library(DNAcopy)
# simulate log-ratio data: 100,000 probes by 20 samples
dat = matrix(rnorm(2e+06), ncol = 20)
dim(dat)
## [1] 100000     20
# 20 chromosomes of 5000 probes each, with probe positions within each chromosome
chroms = paste0("chr", rep(1:20, each = 5000))
pos = rep(1:5000, 20)
# CNA() names its arguments chrom and maploc, so these argument names fail to match:
cna = CNA(dat, Chromosome = chroms, Position = pos, data.type = "logratio")
## Error: unused arguments (Chromosome = chroms, Position = pos)
# passing chroms and pos positionally works:
cna = CNA(dat, chroms, pos, data.type = "logratio")
cna
## Number of Samples 20 
## Number of Probes  100000 
## Data Type         logratio
# wrapping the assignment in parentheses keeps `=` from being parsed
# as a named argument to system.time()
system.time((segmented = segment(cna, verbose = 1)))
## Analyzing: Sample.1 
## Analyzing: Sample.2 
## Analyzing: Sample.3 
## Analyzing: Sample.4 
## Analyzing: Sample.5 
## Analyzing: Sample.6 
## Analyzing: Sample.7 
## Analyzing: Sample.8 
## Analyzing: Sample.9 
## Analyzing: Sample.10 
## Analyzing: Sample.11 
## Analyzing: Sample.12 
## Analyzing: Sample.13 
## Analyzing: Sample.14 
## Analyzing: Sample.15 
## Analyzing: Sample.16 
## Analyzing: Sample.17 
## Analyzing: Sample.18 
## Analyzing: Sample.19 
## Analyzing: Sample.20
##    user  system elapsed 
##   53.37    0.51   53.94

The important detail to notice is that the processing occurred one sample at a time. The system.time() function reports how long the computation took. The "elapsed" time is how long we actually waited for the result. The "user" time is the time the CPU spent executing our calculations, while the "system" time is the time the operating system spent supporting the work (copying memory, starting processes, etc.).
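
To see the distinction, we can time a call that merely waits (an illustrative sketch):

system.time(Sys.sleep(2))
# "elapsed" will be about 2 seconds, while "user" and "system" stay
# near zero: the R process spent the time waiting, not computing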

Parallel workflow

First, load the parallel package.

library(parallel)

To determine the number of “cores”, or computational units, on your machine, you can use the detectCores() function.

detectCores()
## [1] 4
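
Note that detectCores() counts logical cores, which on many machines includes hyperthreading. Where the platform can report it, you can ask for physical cores instead:

detectCores(logical = FALSE)
# physical core count; on platforms that cannot distinguish physical
# from logical cores, this may simply match detectCores() or be NA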

Later, when we perform our parallel calculations, we are going to use all the available cores, so we can set that option to the number of cores now (the mclapply function introduced below consults this option for its default number of worker processes):

options(mc.cores = detectCores())
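
You can confirm the setting with getOption(); an explicit mc.cores argument in an individual call would override this default.

getOption("mc.cores")
# the value we just stored, e.g. 4 on a four-core machine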

In order to run on each sample in parallel, we are going to use the mclapply function. As the name suggests, it is a parallel counterpart of lapply, so we need to make a list of objects, one per sample, that can then be processed independently and in parallel across all of our available cores. The code to make the list of objects for segmentation is:

# build a list with one single-sample CNA object per column of dat;
# apply() returns a list here because CNA() returns a complex object
cna2 = apply(dat, 2, function(z) {
    CNA(z, chroms, pos, data.type = "logratio")
})
length(cna2)
## [1] 20

You can see that the length of cna2 is 20: it is a list with one single-sample CNA object per element.
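
Equivalently (a sketch of an alternative, not from the original post), the same list can be built with lapply over column indices, which also lets us attach sample names through CNA()'s optional sampleid argument:

cna2 = lapply(seq_len(ncol(dat)), function(i) {
    CNA(dat[, i], chroms, pos, data.type = "logratio",
        sampleid = paste0("Sample.", i))
})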

If we now run the segmentation using mclapply, we can see whether using all of our available compute resources speeds up this simple analysis.

system.time((res = mclapply(cna2, segment)))
##    user  system elapsed 
##  38.811   0.867  23.361
length(res)
## [1] 20

We can see that, indeed, our "elapsed" time decreased quite nicely, from roughly 54 seconds to 23, a speedup factor of about 2.3. Why did it not decrease by the factor of 4 that our 4 cores might suggest? There is some overhead associated with starting the new R processes (which is what the parallel package is doing behind the scenes), as well as with communicating R data objects between those processes.
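
One final portability note: mclapply relies on process forking, which is not available on Windows (mc.cores greater than 1 is not supported there). A socket cluster driven by parLapply works on all platforms; the sketch below assumes the cna2 list from above. Because each worker is a fresh R session, every worker must load DNAcopy itself:

cl = makeCluster(detectCores())      # start a cluster of worker R sessions
clusterEvalQ(cl, library(DNAcopy))   # load DNAcopy on every worker
res2 = parLapply(cl, cna2, segment)  # distribute the 20 samples across workers
stopCluster(cl)                      # shut the workers down when finished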