Introduction

As problems grow more complex, we need more processing power to fit and tune machine learning models. Unfortunately, an R session does not use parallelism automatically; most R code runs on a single core unless parallelism is set up explicitly.

Two packages make it straightforward to implement parallelism in R: the base parallel package, which creates clusters of worker processes, and doParallel, which provides a parallel backend for the %dopar% operator from the foreach package. Combining the two allows for the creation of a cluster of computer cores. Once the cluster is created, it must be registered as the parallel backend before the task is run. In this example, the task is model training with the caret package, configured through trainControl.
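The create, register, run sequence described above can be sketched with a toy task. This is a minimal illustration rather than the code timed later in this post: the two-worker cluster size, the toy computation, and the names cl and roots are assumptions made for the example.

```r
library("parallel")
library("doParallel")   # also attaches foreach, which supplies %dopar%

# Assumption: two workers are enough for a toy example
cl = makeCluster(2)
registerDoParallel(cl)

# Each iteration is sent to a worker; .combine = c gathers the results into a vector
roots = foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
print(roots)

# Shut the workers down and return to sequential execution
stopCluster(cl)
registerDoSEQ()
```

caret relies on the same registered backend internally, so no explicit %dopar% call is needed when training with train.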

We start by loading the mlbench package and the Pima Indians Diabetes dataset. The data will be split into 70% training and 30% testing. The focus of this example is the timing of k-fold cross-validation with parallelism versus a single core.

library("mlbench")

data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)

## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

summary(PimaIndiansDiabetes)

##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           mass          pedigree           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  diabetes 
##  neg:500  
##  pos:268  
##           
##           
##           
##

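set.seed(143)
# Row indices for the 70% training split (named train_idx to avoid masking base::sample)
train_idx = sample.int(n = nrow(PimaIndiansDiabetes),
                       size = floor(.70*nrow(PimaIndiansDiabetes)),
                       replace = FALSE)

train = PimaIndiansDiabetes[train_idx, ]
test  = PimaIndiansDiabetes[-train_idx, ]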
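As a quick sanity check on the split arithmetic, the sketch below reproduces the index sampling on its own. It assumes only the row count of 768 reported by str above; the names n_rows, idx, n_train, and n_test are introduced here for illustration.

```r
# Assumption: 768 rows, as reported by str(PimaIndiansDiabetes)
n_rows = 768

set.seed(143)
idx = sample.int(n = n_rows, size = floor(.70 * n_rows), replace = FALSE)

n_train = length(idx)                             # floor(537.6) = 537
n_test  = length(setdiff(seq_len(n_rows), idx))   # 768 - 537 = 231

# Every row lands in exactly one of the two sets
stopifnot(n_train + n_test == n_rows, anyDuplicated(idx) == 0)
```

Because floor is applied to 0.70 * 768 = 537.6, the training set holds 537 rows and the test set holds the remaining 231.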

Execution

After loading parallel, doParallel, and caret, a data frame named Timedf is initialized to hold the time each model takes to complete. The system.time function measures execution time, and its elapsed (wall-clock) component is the value stored. Each model is trained with 100-fold cross-validation (method = "cv", number = 100).
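As a small illustration of the timing approach, system.time returns several components, and the elapsed time is the third one. In the sketch below, the Sys.sleep call is a stand-in workload and timing_demo is a hypothetical name; neither is part of the original analysis.

```r
# Stand-in workload: sleep for roughly a quarter of a second
timing_demo = system.time({
  Sys.sleep(0.25)
})

print(names(timing_demo)[1:3])   # "user.self" "sys.self" "elapsed"

# Position 3 and the name "elapsed" refer to the same value
elapsed_by_position = timing_demo[3]
elapsed_by_name     = timing_demo[["elapsed"]]
```

Indexing by name is a little more self-documenting than [3], although both give the same number.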

Parallelism

To apply parallelism in R, a cluster must be created. The current system has a total of 12 cores, and it is customary to size the cluster at one less than the number of available cores so that one core stays reserved for the operating system. The code below is slightly more conservative and uses detectCores() - 2. After the cluster is created, registerDoParallel registers its workers as the parallel backend. The allowParallel option of trainControl is enabled by default in the caret package, which lets train distribute the resampling work across the registered workers. After training, stopCluster shuts down the workers and registerDoSEQ switches back to a sequential (single-threaded) backend.
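Since the same setup and teardown steps are repeated for every model, they could be wrapped in a small helper. The sketch below is one possible way to do that, not the approach used in this post: with_cluster is a hypothetical function, and the cap of two workers is an assumption that keeps the demo light.

```r
library("parallel")
library("doParallel")

# Hypothetical helper: run `task` on a temporary cluster, then always clean up
with_cluster = function(n_workers, task) {
  cl = makeCluster(n_workers)
  registerDoParallel(cl)
  on.exit({
    stopCluster(cl)    # shut the workers down
    registerDoSEQ()    # back to the sequential backend
  })
  task()
}

# detectCores() may return NA, so fall back to a single worker (assumption);
# otherwise leave one core free and cap the demo at two workers
n_cores = detectCores()
n_workers = if (is.na(n_cores)) 1 else max(1, min(2, n_cores - 1))

workers_seen = with_cluster(n_workers, function() getDoParWorkers())
print(workers_seen)      # number of registered workers while the task ran
print(getDoParName())    # "doSEQ" once the helper has returned
```

Using on.exit means the cluster is stopped even if the task fails partway through, which avoids leaving idle worker processes behind.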

SVM Radial

#base
library("parallel")
library("doParallel")
library("caret")

# One-row data frame; the placeholder column is dropped before the results are printed
Timedf = data.frame(time = "")

# Create the cluster and register it as the parallel backend
Mycluster = makeCluster(detectCores() - 2)
registerDoParallel(Mycluster)

# Parallel run; [3] keeps the elapsed (wall-clock) time
stime = system.time({
  set.seed(143)
  SVM_Radial_Fit = train(diabetes ~ ., train, method = "svmRadial", verbose = FALSE,
                         preProc = c("center", "scale"),
                         trControl = trainControl(method = "cv", number = 100, allowParallel = TRUE))
})[3]

Timedf$SVM_Para = stime[1]

# Shut down the workers and return to sequential execution
stopCluster(Mycluster)
registerDoSEQ()

# Sequential run of the same model
stime = system.time({
  set.seed(143)
  SVM_Radial_Fit = train(diabetes ~ ., train, method = "svmRadial", verbose = FALSE,
                         preProc = c("center", "scale"),
                         trControl = trainControl(method = "cv", number = 100, allowParallel = FALSE))
})[3]

Timedf$SVM_NON_Para = stime[1]

GBM models

Mycluster = makeCluster(detectCores() - 2)
registerDoParallel(Mycluster)

stime = system.time({
  set.seed(143)
  GBM_Fit = train(diabetes ~ ., train, method = "gbm", verbose = FALSE,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 100, allowParallel = TRUE))
})[3]

Timedf$GBM_Para = stime[1]

stopCluster(Mycluster)
registerDoSEQ()

stime = system.time({
  set.seed(143)
  GBM_Fit = train(diabetes ~ ., train, method = "gbm", verbose = FALSE,
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 100, allowParallel = FALSE))
})[3]

Timedf$GBM_NON_Para = stime[1]

knn

Mycluster = makeCluster(detectCores() - 2)
registerDoParallel(Mycluster)

stime = system.time({
  set.seed(143)
  KNN_Fit = train(diabetes ~ ., train, method = "knn",
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 100, allowParallel = TRUE))
})[3]

Timedf$KNN_Para = stime[1]

stopCluster(Mycluster)
registerDoSEQ()

stime = system.time({
  set.seed(143)
  KNN_Fit = train(diabetes ~ ., train, method = "knn",
                  preProc = c("center", "scale"),
                  trControl = trainControl(method = "cv", number = 100, allowParallel = FALSE))
})[3]

Timedf$KNN_NON_Para = stime[1]

Tree model

Mycluster = makeCluster(detectCores() - 2)
registerDoParallel(Mycluster)

stime = system.time({
  set.seed(143)
  Tree_Fit = train(diabetes ~ ., train, method = "rpart",
                   preProc = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 100, allowParallel = TRUE))
})[3]

Timedf$Tree_Para = stime[1]

stopCluster(Mycluster)
registerDoSEQ()

stime = system.time({
  set.seed(143)
  Tree_Fit = train(diabetes ~ ., train, method = "rpart",
                   preProc = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 100, allowParallel = FALSE))
})[3]

# Store the elapsed time the same way as for the other models
Timedf$Tree_NON_Para = stime[1]

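The models that benefited significantly from parallelism were the radial SVM and GBM. KNN and the tree model did not benefit in this testing environment; their parallel runs were in fact slower than the single-core runs, most likely because the overhead of managing the cluster outweighs the gain for models that fit this quickly.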

row.names(Timedf) = c("Time")
Timedf = Timedf[, -1]   # drop the placeholder column
library("dplyr")
library("kableExtra")

# bootstrap_options styles HTML tables; latex_options only affects LaTeX output
knitr::kable(Timedf, format = "html") %>%
  kable_styling(bootstrap_options = "striped", position = "center")

Elapsed time in seconds:

       SVM_Para  SVM_NON_Para  GBM_Para  GBM_NON_Para  KNN_Para  KNN_NON_Para  Tree_Para  Tree_NON_Para
Time   6.63      11.63         6.9       14.58         5.02      1.94          5.14       2.09
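To make the comparison easier to read, the timings can be turned into speedup ratios (single-core time divided by parallel time). The sketch below simply copies the numbers from the table above; the vector names para, nonpara, and speedup are introduced here for illustration.

```r
# Elapsed times in seconds, copied from the results table
para    = c(SVM = 6.63,  GBM = 6.9,   KNN = 5.02, Tree = 5.14)
nonpara = c(SVM = 11.63, GBM = 14.58, KNN = 1.94, Tree = 2.09)

# Speedup > 1 means the parallel run was faster
speedup = round(nonpara / para, 2)
print(speedup)
#  SVM  GBM  KNN Tree
# 1.75 2.11 0.39 0.41
```

On this machine the parallel runs were roughly 1.75 times faster for the SVM and about twice as fast for GBM, while KNN and the tree model took more than twice as long in parallel.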

Sources

Parallelism & Caret Blog2

Christopher Estevez