As the complexity of problems increases, we need more processing power to train machine learning models. Unfortunately, an R session does not use parallelism automatically: by default, code runs on a single core.
Two packages make parallelism straightforward in R: the base parallel package, which creates clusters of worker processes, and doParallel, which provides a parallel backend for the %dopar% operator from foreach. Combining them lets you spread work across several CPU cores. After the cluster is created, it must be registered as the parallel backend before the task can use it. In this example, the task is model training with the caret package, configured through trainControl.
We start by loading the mlbench package and the Pima Indians Diabetes dataset. The data is split into 70% training and 30% testing. The focus of this example is the timing of k-fold cross-validation with parallelism versus a single core.
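Before turning to caret, here is a minimal sketch of the create, register, run, and stop cycle using foreach directly. The worker count of 2 and the toy squaring task are assumptions chosen only to keep the example small; they are not part of the benchmark in this post.

```r
library(doParallel)  # also attaches foreach, iterators and parallel

# Two workers keep the sketch small; the count is an arbitrary assumption.
cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration is sent to a worker; .combine = c collects the
# results into a single vector, in iteration order.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

# Shut the workers down and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()

print(squares)
```

caret uses the same registered backend internally, which is why no explicit %dopar% call appears in the training code later on.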
library("mlbench")
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
summary(PimaIndiansDiabetes)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin mass pedigree age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## diabetes
## neg:500
## pos:268
##
##
##
##
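set.seed(143)
# Draw row indices for the 70% training split. The result is named train_idx
# rather than `sample` so it does not shadow base::sample().
train_idx = sample.int(n = nrow(PimaIndiansDiabetes),
                       size = floor(.70*nrow(PimaIndiansDiabetes)),
                       replace = FALSE)
train = PimaIndiansDiabetes[train_idx, ]
test = PimaIndiansDiabetes[-train_idx, ]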
After loading parallel, doParallel, and caret, a data frame named Timedf is initialized to hold the time each model takes to complete. The system.time function measures execution time, and its third element (elapsed wall-clock seconds) is the value that gets stored. Each model is trained with 100-fold cross-validation.
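As a quick illustration of what system.time returns, the sketch below times a short pause. Sys.sleep is only a stand-in for a model fit, and the 0.2 second duration is an arbitrary assumption.

```r
# Time a short pause; Sys.sleep stands in for a model fit here.
timing <- system.time({
  Sys.sleep(0.2)
})

# system.time returns several timings; the third one is named "elapsed".
print(names(timing))

# Accessing the value by name gives the same number as timing[3].
elapsed <- timing[["elapsed"]]
print(elapsed)
```

Elapsed time is the right measure for this comparison because it reflects wall-clock time, whereas the user and system components only count CPU time spent in the main R process.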
To apply parallelism in R, a cluster must be created. The current system has a total of 12 cores, and it is customary to leave at least one core free for the operating system; the code below uses detectCores()-2, which reserves two. After the cluster is created, registerDoParallel registers its workers as the parallel backend. The allowParallel option of trainControl is enabled by default in caret, which lets the registered workers share the resampling work. After training the model, stopCluster shuts down the workers and registerDoSEQ returns the R session to sequential execution.
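The same lifecycle can be wrapped in a small function so the cluster is always cleaned up, even if the task fails. This is only a sketch: time_in_parallel is a hypothetical helper that does not appear in the original code, and the default of two workers is an assumption.

```r
library(doParallel)

# Hypothetical helper (not part of the original post): runs task() with a
# temporary cluster registered and returns the elapsed time in seconds.
time_in_parallel <- function(task, workers = 2) {
  cl <- makeCluster(workers)
  registerDoParallel(cl)
  # on.exit guarantees cleanup even when task() throws an error.
  on.exit({
    stopCluster(cl)
    registerDoSEQ()
  }, add = TRUE)
  system.time(task())[["elapsed"]]
}

# Usage with a toy task; a caret train() call could be passed the same way.
elapsed <- time_in_parallel(function() {
  foreach(i = 1:4, .combine = c) %dopar% sqrt(i)
}, workers = 2)
print(elapsed)
```

Using on.exit avoids leaving orphaned worker processes behind, which can happen when stopCluster is only reached on the success path.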
#base
library("parallel")
library("doParallel")
library("caret")
#detectCores()
Timedf = data.frame(time="")
#stime = data.frame(stime)
Mycluster = makeCluster(detectCores()-2)
registerDoParallel(Mycluster)
stime = system.time({
set.seed(143)
SVM_Radial_Fit = train(diabetes~.,train, method = "svmRadial",verbose = F,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = T))
})[3]
Timedf$SVM_Para =stime[1]
stopCluster(Mycluster)
registerDoSEQ()
stime = system.time({
set.seed(143)
SVM_Radial_Fit = train(diabetes~.,train, method = "svmRadial",verbose = F,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = F))
})[3]
Timedf$SVM_NON_Para =stime[1]
Mycluster = makeCluster(detectCores()-2)
registerDoParallel(Mycluster)
stime = system.time({
set.seed(143)
GBM_Fit = train(diabetes~.,train, method = "gbm",verbose = F,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = T))
})[3]
Timedf$GBM_Para =stime[1]
stopCluster(Mycluster)
registerDoSEQ()
stime = system.time({
set.seed(143)
GBM_Fit = train(diabetes~.,train, method = "gbm",verbose = F,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = F))
})[3]
Timedf$GBM_NON_Para =stime[1]
Mycluster = makeCluster(detectCores()-2)
registerDoParallel(Mycluster)
stime = system.time({
set.seed(143)
KNN_Fit = train(diabetes~.,train, method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = T))
})[3]
Timedf$KNN_Para =stime[1]
stopCluster(Mycluster)
registerDoSEQ()
stime = system.time({
set.seed(143)
KNN_Fit = train(diabetes~.,train, method = "knn",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = F))
})[3]
Timedf$KNN_NON_Para =stime[1]
Mycluster = makeCluster(detectCores()-2)
registerDoParallel(Mycluster)
stime = system.time({
set.seed(143)
Tree_Fit = train(diabetes~.,train, method = "rpart",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = T))
})[3]
Timedf$Tree_Para =stime[1]
stopCluster(Mycluster)
registerDoSEQ()
stime = system.time({
set.seed(143)
Tree_Fit = train(diabetes~.,train, method = "rpart",
preProc = c("center", "scale"),
trControl = trainControl(method = "cv",number = 100,allowParallel = F))
})[3]
Timedf$Tree_NON_Para =stime[1]
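The models that benefited significantly from parallelism were the radial SVM (6.63 s versus 11.63 s) and GBM (6.9 s versus 14.58 s). The KNN and tree models did not benefit in this testing environment: both ran slower in parallel (about 5 s versus about 2 s), most likely because the overhead of distributing work to the workers outweighs the cost of fitting such cheap models.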
# Label the single row and drop the placeholder column created at initialization
row.names(Timedf)=c("Time")
Timedf=Timedf[,-1]
library("dplyr")       # provides the %>% pipe
library("kableExtra")
knitr::kable(Timedf,format = "html", booktabs = T) %>%
kable_styling(latex_options = c("striped", "scale_down"),position = "center")
| | SVM_Para | SVM_NON_Para | GBM_Para | GBM_NON_Para | KNN_Para | KNN_NON_Para | Tree_Para | Tree_NON_Para |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 6.63 | 11.63 | 6.9 | 14.58 | 5.02 | 1.94 | 5.14 | 2.09 |
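To make the comparison easier to read, the timings can be turned into speedup ratios (sequential time divided by parallel time). The sketch below copies the numbers from the table by hand; the times data frame and its column names are introduced here for illustration only.

```r
# Elapsed times in seconds, copied from the table above.
times <- data.frame(
  model      = c("SVM", "GBM", "KNN", "Tree"),
  parallel   = c(6.63, 6.90, 5.02, 5.14),
  sequential = c(11.63, 14.58, 1.94, 2.09)
)

# A speedup above 1 means the parallel run was faster;
# below 1 means parallelism slowed the model down.
times$speedup <- round(times$sequential / times$parallel, 2)
print(times)
```

A ratio above 1 for SVM and GBM and below 1 for KNN and the tree matches the conclusion above. These are single runs on one machine, so the exact ratios should be treated as indicative rather than precise.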