good resources:

- CRAN Task View: High-Performance and Parallel Computing with R
- Jonathan’s course

First step - detect your bottleneck

When things run slower than you want or need, the first step is to find the bottlenecks in your code.

Usually, there are two options:

  1. you don’t have enough memory (RAM bottleneck)
  2. you don’t have enough processing power (CPU bottleneck)

How to diagnose?

RAM bottleneck

Principles that can help: know how much memory your objects take, free what you no longer need, and monitor memory use as you compute.
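For instance, you can check from within R how much memory individual objects take and how much R has allocated overall (a minimal sketch using base R; at the system level, htop or free -h give the full picture):

x <- matrix(rnorm(1e7), ncol = 100)
print(object.size(x), units = "MB")  ## size of this one object (~76 MB)
gc()                                 ## memory currently used by R
rm(x)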

CPU bottleneck

which processes can/should be parallelised?

Look for problems that are amenable to parallelism. The most obvious involve repetition (“embarrassingly parallel”), e.g. cross-validation, bootstrapping, and simulations.

First, note that some computations can be done in parallel without explicitly defining a parallel setup, e.g. via the BLAS/LAPACK libraries for linear-algebra operations. These libraries make it possible for many common R operations to use all of the processing power available.
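For instance, matrix multiplication is handed off to whichever BLAS your R is linked against (e.g. OpenBLAS or MKL); a sketch, assuming a multi-threaded BLAS is installed:

m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)  ## watch htop: with a multi-threaded BLAS several cores light up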

Yet, be careful when you explicitly parallelize on top of functions that use implicit parallelism: the two levels multiply and can oversubscribe your cores. In that case you can limit the number of threads the BLAS uses, e.g.:

RhpcBLASctl::blas_set_num_threads(1)
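RhpcBLASctl can also report the machine and the current BLAS settings, which helps decide how to split cores between BLAS threads and foreach workers:

RhpcBLASctl::get_num_cores()       ## physical cores on this machine
RhpcBLASctl::blas_get_num_procs()  ## threads the BLAS is currently using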

The goal: use as many cores as possible while still having enough RAM, and with low enough overhead that the parallelism is actually beneficial.

On Linux, explicit parallelism can conveniently be done with foreach and doMC.

for example, timing a sequential loop against its parallel counterpart:

library(foreach)
library(doMC)
## Loading required package: iterators
## Loading required package: parallel
## sequential: 100 iterations, one after the other
time_seq <- system.time(
  for (i in 1:100) {
    sum(rnorm(1e6))
  }
)

## parallel: the same 100 iterations spread over 100 workers
registerDoMC(100)
time_par <- system.time(
  foreach(i = 1:100) %dopar% {
    sum(rnorm(1e6))
  }
)

rbind(time_seq,time_par)
##          user.self sys.self elapsed user.child sys.child
## time_seq     6.451    0.363   6.814       0.00     0.000
## time_par     0.032    0.170   0.263       7.16     1.892
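Note that registerDoMC(100) asks for more workers than most machines have cores; a common choice is to derive the worker count from parallel::detectCores(), e.g.:

library(parallel)
detectCores()                    ## number of (logical) cores available
registerDoMC(detectCores() - 1)  ## leave one core free for the OS/session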

example: nested loops

As a rule of thumb, it is usually better to parallelize the outer loop.

This results in larger individual tasks, and larger tasks can often be performed more efficiently than smaller ones.

registerDoSEQ()  ## reset to a sequential backend first

cols <- 1:10
rows <- 1:10

system.time({
  registerDoMC(10)
  some_matrix <- 
  foreach(i=cols, .combine = "cbind") %dopar% {
    foreach(j=rows, .combine = "rbind") %do% {
      mean(rnorm(1e6, mean = rep(i*j,1e6)))
    }
    }
  })
##    user  system elapsed 
##   6.292   0.588   0.787
class(some_matrix)
## [1] "matrix"
image(some_matrix)

However, if the outer loop doesn’t have many iterations and the tasks are already large, parallelizing only the outer loop results in a small number of huge tasks, which may not allow you to use all of your processors…

For that problem we have the %:% operator, which turns multiple foreach loops into a single loop.

With %:% and %dopar% we create a single stream of tasks that can all be executed in parallel:

registerDoSEQ()

system.time({
  registerDoMC(10)
  some_matrix <- 
  foreach(i=cols, .combine = "cbind") %:%
    foreach(j=rows, .combine = "rbind") %dopar% {
      mean(rnorm(i*j))
    }
  })
##    user  system elapsed 
##   0.073   0.180   0.067

note: you can always change a %dopar% to %do% to run the same loop sequentially
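This is handy for debugging, since %do% runs everything sequentially in the current session (so errors and browser() behave normally), e.g.:

foreach(i = 1:3, .combine = "c") %do% sqrt(i)
## [1] 1.000000 1.414214 1.732051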

we can pass multicore options in the foreach call according to our goal; see ?mclapply for details:

mcoptions <- list(preschedule = TRUE,
                  set.seed = TRUE,
                  cores = 6)  ## override the registered number of workers
registerDoMC(2)
foreach(i = 1:6, .options.multicore = mcoptions) %dopar% sum(rnorm(1e7))
## [[1]]
## [1] 4598.252
## 
## [[2]]
## [1] -1515.41
## 
## [[3]]
## [1] -1280.445
## 
## [[4]]
## [1] -925.9409
## 
## [[5]]
## [1] 770.6542
## 
## [[6]]
## [1] 1300.151

if preschedule is set to TRUE, the computation is first divided into (at most) as many jobs as there are cores, and then the jobs are started, each job possibly covering more than one value. If set to FALSE, one job is forked for each value of i. The former is better for short computations or a large number of values of i; the latter is better for jobs that have high variance of completion time and not too many values of i compared to the number of cores.

The “cores” option allows us to temporarily override the number of workers for a single foreach call. This is more convenient than having to re-register doMC.

another useful option is the .errorhandling argument of the foreach function; possible values are “stop”, “remove”, and “pass”.
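For example (a minimal sketch): with “pass” the error object itself is returned in the result list, while “remove” silently drops the failed iterations:

res <- foreach(i = 1:3, .errorhandling = "pass") %dopar% {
  if (i == 2) stop("iteration 2 failed") else i
}
## res[[2]] is a simpleError; res[[1]] and res[[3]] are 1 and 3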

When might parallelism not be necessary?

Overhead

Sending tasks out to another processor and having the work done there takes time and involves some extra work, so we don’t want to parallelize things that already run really fast sequentially. When parallelizing code, keep in mind that for very short jobs the parallelization overhead will diminish, or even cancel, the benefits.
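A quick illustration (a sketch; exact timings depend on your machine): each task below is far too small to justify the cost of distributing it to a worker, so the parallel version ends up slower:

registerDoMC(4)
system.time(for (i in 1:1000) sqrt(i))            ## essentially instantaneous
system.time(foreach(i = 1:1000) %dopar% sqrt(i))  ## slower: overhead dominates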

many modeling functions in which resampling is the primary approach for optimizing tuning parameters already have an efficient parallel mechanism built in (e.g. xgboost, rf, and other models available via caret::train)
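For example, caret::train parallelizes its resampling loop over whatever backend is currently registered, and this can be switched on/off per call via trainControl (a sketch):

library(caret)
registerDoMC(4)
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
m.knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)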

more on caret and parallelism here

note about the shared RAM constraint

RAM is shared among all workers, like sand in a sandbox. When you parallelize your code and each core demands a lot of RAM (e.g. for a huge numerical optimization), you need to take the amount of available RAM into account. For instance, if a single-core computation takes 51% of the RAM, you will not be able to use more than one core, even if you have many cores available, because your constraint is RAM.

suggestion: run a small experiment to check what works best on your machine (CPU vs. RAM)
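A back-of-the-envelope version of that experiment (a sketch; both numbers below are assumptions to replace with your own measurements):

ram_total_gb   <- 64  ## RAM on the machine (assumed)
ram_per_job_gb <- 5   ## measured peak RAM of one sequential run (assumed)
floor(ram_total_gb / ram_per_job_gb)  ## upper bound on the number of workers
## [1] 12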

an example of bad parallelism (overhead); open htop and you will see all cores working, with a lot of red (system time) in the bars:

library(foreach)
library(doMC)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(data.table)  ## for data.table()
library(magrittr)    ## for %>%
registerDoSEQ()

registerDoMC(cores = 50)
system.time(
    result <- foreach(i=1:50, .combine = "rbind") %dopar% {
      
      intrain <- sample(1:nrow(iris),nrow(iris)*0.7)

      m.lda <- train(Species~., data=iris[intrain,], method='lda')
      m.knn <- train(Species~., data=iris[intrain,], method='knn')
      m.svm <- train(Species~., data=iris[intrain,], method='svmRadial')

      p.lda <- predict(m.lda,newdata=iris[-intrain,])
      p.knn <- predict(m.knn,newdata=iris[-intrain,])
      p.svm <- predict(m.svm,newdata=iris[-intrain,])

      c(lda = mean(p.lda==iris[-intrain,"Species"]),
        knn = mean(p.knn==iris[-intrain,"Species"]),
        svm = mean(p.svm==iris[-intrain,"Species"]))
    } %>% data.table()
)
##    user  system elapsed 
## 244.153 729.093   9.011

alternatively, limit the parallelism that caret uses inside each worker:

registerDoMC(cores = 50)
system.time(
    result <- foreach(i=1:50, .combine = "rbind") %dopar% {
      
      intrain <- sample(1:nrow(iris),nrow(iris)*0.7)
      
      registerDoMC(cores = 2)  ## each worker's caret calls now use only 2 cores
       
      m.lda <- train(Species~., data=iris[intrain,], method='lda')
      m.knn <- train(Species~., data=iris[intrain,], method='knn')
      m.svm <- train(Species~., data=iris[intrain,], method='svmRadial')

      p.lda <- predict(m.lda,newdata=iris[-intrain,])
      p.knn <- predict(m.knn,newdata=iris[-intrain,])
      p.svm <- predict(m.svm,newdata=iris[-intrain,])

      c(lda = mean(p.lda==iris[-intrain,"Species"]),
        knn = mean(p.knn==iris[-intrain,"Species"]),
        svm = mean(p.svm==iris[-intrain,"Species"]))
    } %>% data.table()
)
##    user  system elapsed 
## 168.034  48.352   4.346

by the way…

boxplot(result)


spatial context

todo