Good resources:
CRAN Task View: High-Performance and Parallel Computing with R
Jonathan's course
When things go slower than you want/need, you might want to find the bottlenecks in your code. Usually, the bottleneck is one of two resources: CPU or RAM.
How to diagnose? Run htop in a terminal (or via Chrome's Secure Shell App) and watch both at a glance.
Packages that can help when RAM is the constraint are:
bigmemory
ff
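Before tuning anything, it also helps to know how many cores you actually have. A minimal sketch using the base parallel package:
library(parallel)
detectCores(logical = FALSE)  # physical cores
detectCores(logical = TRUE)   # logical (hyper-threaded) cores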
What processes can/should be parallelised? Problems that are amenable to parallelism: the most obvious involve repetition ("embarrassingly parallel"), e.g. cross-validation, bootstrapping, or running the same simulation many times.
First, note that some computations are done in parallel without explicitly defining a parallel setup, e.g. algebraic operations backed by multithreaded BLAS/LAPACK libraries. These libraries make it possible for many common R operations to use all of the processing power available.
Still, be careful when you explicitly parallelize code that also calls functions with implicit parallelism; in that case you can cap the number of threads those libraries use, e.g.:
RhpcBLASctl::blas_set_num_threads(1)
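RhpcBLASctl also lets you query the current setup before changing it; a minimal sketch, assuming the RhpcBLASctl package is installed:
library(RhpcBLASctl)
get_num_cores()          # physical cores detected
blas_get_num_procs()     # threads the BLAS is currently configured to use
blas_set_num_threads(1)  # pin BLAS to one thread before explicit parallelism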
The goal: use as many cores as possible while staying within the available RAM, and with overhead low enough that the parallelization actually pays off.
On Linux, explicit parallelism is most conveniently done with foreach and doMC.
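(doMC relies on forking, which is unavailable on Windows; there, a socket-based backend such as doParallel is the usual substitute. A minimal sketch:)
library(doParallel)
cl <- makeCluster(4)       # socket cluster: works on Windows as well
registerDoParallel(cl)
foreach(i = 1:4) %dopar% sum(rnorm(1e6))
stopCluster(cl)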
For example, timing a sequential loop against its parallel equivalent:
library(foreach)
library(doMC)
## Loading required package: iterators
## Loading required package: parallel
time_seq <- system.time(
  for (i in 1:100) {              # sequential baseline
    sum(rnorm(1e6))
  }
)
registerDoMC(100)                 # register 100 workers
time_par <- system.time(
  foreach(i = 1:100) %dopar% {    # the same work, spread across workers
    sum(rnorm(1e6))
  }
)
rbind(time_seq,time_par)
## user.self sys.self elapsed user.child sys.child
## time_seq 6.451 0.363 6.814 0.00 0.000
## time_par 0.032 0.170 0.263 7.16 1.892
An example with nested loops. As a rule of thumb, it is usually better to parallelize the outer loop: this results in larger individual tasks, and larger tasks can often be performed more efficiently than smaller ones.
registerDoSEQ()   # reset to a sequential backend
cols <- 1:10
rows <- 1:10
system.time({
  registerDoMC(10)
  some_matrix <-
    foreach(i = cols, .combine = "cbind") %dopar% {   # outer loop in parallel
      foreach(j = rows, .combine = "rbind") %do% {    # inner loop sequential
        mean(rnorm(1e6, mean = rep(i*j, 1e6)))
      }
    }
})
## user system elapsed
## 6.292 0.588 0.787
class(some_matrix)
## [1] "matrix"
image(some_matrix)
However, if the outer loop doesn't have many iterations and the tasks are already large, parallelizing the outer loop yields a small number of huge tasks, which may not allow you to use all of your processors. For that problem we have the %:% operator: it turns multiple foreach loops into a single loop. With %:% and %dopar% we create a single stream of tasks that can all be executed in parallel:
registerDoSEQ()
system.time({
  registerDoMC(10)
  some_matrix <-
    foreach(i = cols, .combine = "cbind") %:%
      foreach(j = rows, .combine = "rbind") %dopar% {
        mean(rnorm(i*j))
      }
})
## user system elapsed
## 0.073 0.180 0.067
Note: you can always change a %dopar% back to %do% to run the same loop sequentially. We can also pass multicore options to foreach according to our goal; see ?mclapply for details:
mcoptions <- list(preschedule = TRUE,
                  set.seed = TRUE,
                  cores = 6)   # override the registered worker count for this call
registerDoMC(2)
foreach(i = 1:6, .options.multicore = mcoptions,
        .errorhandling = "stop") %dopar% sum(rnorm(1e7))  # "stop" is the default handler
## [[1]]
## [1] 4598.252
##
## [[2]]
## [1] -1515.41
##
## [[3]]
## [1] -1280.445
##
## [[4]]
## [1] -925.9409
##
## [[5]]
## [1] 770.6542
##
## [[6]]
## [1] 1300.151
If preschedule is set to TRUE, the computation is first divided into (at most) as many jobs as there are cores, and then the jobs are started, each job possibly covering more than one value. If set to FALSE, one job is forked for each value of i. The former is better for short computations or a large number of values of i; the latter is better for jobs that have a high variance of completion time and not too many values of i relative to the number of cores.
The "cores" option allows us to temporarily override the number of workers for a single foreach operation, which is more convenient than re-registering doMC.
Another useful option is the .errorhandling argument of foreach; its possible values are "stop" (the default), "remove", and "pass".
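A minimal sketch of the difference between "pass" and "remove" (the failing iteration is invented for illustration):
library(foreach)
library(doMC)
registerDoMC(2)
# "pass": the error object is returned in place of the failed result
foreach(i = 1:3, .errorhandling = "pass") %dopar% {
  if (i == 2) stop("simulated failure") else sqrt(i)
}
# "remove": the failing iteration is silently dropped from the result
foreach(i = 1:3, .errorhandling = "remove") %dopar% {
  if (i == 2) stop("simulated failure") else sqrt(i)
}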
When might parallelism not be necessary?
Overhead
Sending tasks out to another process and getting the results back takes time and involves some work, so we don't want to parallelize things that already run really fast sequentially. When parallelizing code, keep in mind that for very short jobs the parallelization overhead will diminish, or even cancel, the benefits, as the sketch below illustrates.
Many models for which resampling is the primary approach to optimizing tuning parameters already use an efficient parallel mechanism (e.g. xgboost, rf, and other caret::train models).
More on caret and parallelism here.
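caret picks up whatever backend is registered; the allowParallel flag of trainControl switches this on or off per model. A minimal sketch (the method and resampling settings here are arbitrary choices):
library(caret)
library(doMC)
registerDoMC(4)
# resampling folds are farmed out to the registered workers
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
m <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)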
A note about the shared-RAM constraint
RAM is like sand in a sandbox: when you parallelize your code and each core demands a lot of RAM (e.g. for a huge numerical optimization), you need to take the amount of available RAM into account. For instance, if a single-core computation takes 51% of the RAM, you will not be able to use more than one core, even if many more are available, because your constraint is RAM.
Suggestion: run a small experiment to check what works best for your machine (CPU vs RAM), for example along the lines of the sketch below.
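A back-of-the-envelope sizing rule for that experiment (all numbers below are hypothetical placeholders you would measure on your own machine):
library(doMC)
ram_total_gb    <- 64  # hypothetical: total machine RAM
ram_per_task_gb <- 20  # hypothetical: peak RAM of one single-core run
n_workers <- min(parallel::detectCores(),
                 floor(ram_total_gb / ram_per_task_gb))
n_workers              # here: 3 workers, even if many more cores exist
registerDoMC(n_workers)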
An example of bad parallelism (overhead): open htop while this runs and you will see all cores busy with a lot of red, i.e. kernel/system time:
library(foreach)
library(doMC)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(data.table)   # for data.table()
library(magrittr)     # for %>%
registerDoSEQ()
registerDoMC(cores = 50)
system.time(
  result <- foreach(i = 1:50, .combine = "rbind") %dopar% {
    # each of the 50 workers inherits the 50-core backend, so caret::train
    # parallelizes its resampling on top of it -- heavy oversubscription
    intrain <- sample(1:nrow(iris), nrow(iris)*0.7)
    m.lda <- train(Species~., data = iris[intrain,], method = 'lda')
    m.knn <- train(Species~., data = iris[intrain,], method = 'knn')
    m.svm <- train(Species~., data = iris[intrain,], method = 'svmRadial')
    p.lda <- predict(m.lda, newdata = iris[-intrain,])
    p.knn <- predict(m.knn, newdata = iris[-intrain,])
    p.svm <- predict(m.svm, newdata = iris[-intrain,])
    c(lda = mean(p.lda == iris[-intrain, "Species"]),
      knn = mean(p.knn == iris[-intrain, "Species"]),
      svm = mean(p.svm == iris[-intrain, "Species"]))
  } %>% data.table()
)
## user system elapsed
## 244.153 729.093 9.011
Alternatively, shrink the backend inside each worker to reduce the oversubscription:
registerDoMC(cores = 50)
system.time(
  result <- foreach(i = 1:50, .combine = "rbind") %dopar% {
    intrain <- sample(1:nrow(iris), nrow(iris)*0.7)
    registerDoMC(cores = 2)   # limit the backend that train() sees inside this worker
    m.lda <- train(Species~., data = iris[intrain,], method = 'lda')
    m.knn <- train(Species~., data = iris[intrain,], method = 'knn')
    m.svm <- train(Species~., data = iris[intrain,], method = 'svmRadial')
    p.lda <- predict(m.lda, newdata = iris[-intrain,])
    p.knn <- predict(m.knn, newdata = iris[-intrain,])
    p.svm <- predict(m.svm, newdata = iris[-intrain,])
    c(lda = mean(p.lda == iris[-intrain, "Species"]),
      knn = mean(p.knn == iris[-intrain, "Species"]),
      svm = mean(p.svm == iris[-intrain, "Species"]))
  } %>% data.table()
)
## user system elapsed
## 168.034 48.352 4.346
By the way, we can compare the resulting accuracies visually:
boxplot(result)