Computers get faster and more powerful every day, and multi-core processors capable of parallel processing are now standard on even modest machines.
However, that extra power often goes unused unless a program is already configured to take advantage of it automatically.
It can also be confusing to decide exactly how to set up one's system to use parallel processing when it is available.
Here, we will discuss one easy-to-use approach to setting up parallel processing in R and conduct a timing test to gauge its performance.
First of all, to measure performance, we will use the package "tictoc" to time each run from start to finish. Next, we will use the libraries "doMC" and "parallel" to actually set up our parallel backend.
Many packages, such as "caret", will then automatically make use of the registered cores. In other instances, work can be spread across cores explicitly, for example with the mclapply
function from "parallel". For the sake of convenience, we will use a simpler example here.
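As a quick illustration of explicit parallelism, here is a minimal sketch (not part of the timed experiment below) using mclapply, which acts as a multi-core, drop-in replacement for lapply on Unix-like systems:
# Minimal mclapply sketch: square the numbers 1 to 8 across all cores.
# On Windows, mclapply falls back to running on a single core.
library(parallel)
squares <- mclapply(1:8, function(x) x^2, mc.cores = detectCores())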
# Options
set.seed(100)
options(scipen = 3)
# The libraries
library(tictoc)
library(readxl)
library(data.table)
library(randomForest)
library(caret)
library(parallel)
library(doMC)
# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
  return(error)
}
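As a quick sanity check (a toy example, not part of the experiment), a confusion matrix with 170 correct predictions out of 200 should give an error of 0.15:
# Toy confusion matrix: 90 + 80 correct on the diagonal, 200 predictions total
toy_conf_mat <- matrix(c(90, 10, 20, 80), nrow = 2)
classification_error(toy_conf_mat)
## [1] 0.15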
Now that our libraries are loaded and our helper function is defined, we are ready to do our machine learning. Let's start with a standard approach using a single core.
# Start timer
tic("Standard Approach:")
# Load the data and create a data.table
data <- setDT(read_xlsx("Training Dataset.xlsx"))
# Remove the unwanted characters
names(data) <- gsub(" ", "", names(data))
names(data) <- gsub("&", "and", names(data))
# Lower-casing
setnames(data, names(data), tolower(names(data)))
# Fix the factor names
data$employmentstatus <- as.factor(make.names(data$employmentstatus))
# Factor columns
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")
# := converts the columns in data by reference, so no reassignment is needed
data[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
# Create training and validation sets (an 80/20 split; taking the validation
# rows as the complement of the training rows keeps the two sets from overlapping)
trainObs <- sample(nrow(data), 0.8 * nrow(data), replace = FALSE)
valObs <- setdiff(seq_len(nrow(data)), trainObs)
train <- data[trainObs, ]
val <- data[valObs, ]
# Model
rf <- randomForest(employmentstatus ~ ., data = train, importance = TRUE)
print(rf)
##
## Call:
## randomForest(formula = employmentstatus ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.38%
## Confusion matrix:
## Employed Not.in.labor.force Unemployed class.error
## Employed 32370 460 11 0.014341829
## Not.in.labor.force 5 15581 149 0.009787099
## Unemployed 2 2129 497 0.810882801
# Predictions
rf_val_pred <- predict(rf, val)
val$rf_pred <- rf_val_pred
# Results
rf_conf_mat <- table(true = val$employmentstatus, pred = val$rf_pred)
cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01124912
# End timer
toc()
## Standard Approach: 173.896 sec elapsed
Next, we will conduct the same experiment, but this time we will register all of the computer's cores and see if it makes a difference.
# Parallel processing
library(parallel)
library(doMC)
# Detect the cores available in the computer
numCores <- detectCores()
# Register the cores for use by R
registerDoMC(cores = numCores)
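# Note: a registered doMC backend is used by code that relies on foreach's
# %dopar% operator. For example, caret::train() will automatically spread
# its resampling loops across the registered cores (a hedged sketch,
# assuming trainControl's default allowParallel = TRUE):
#   ctrl <- trainControl(method = "cv", number = 5)
#   fit  <- train(employmentstatus ~ ., data = train,
#                 method = "rf", trControl = ctrl)
# randomForest() itself runs single-threaded, so it does not pick up the
# backend on its own.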
# Start timer
tic("Parallel Processing Approach:")
# Load the data and create a data.table
data <- setDT(read_xlsx("Training Dataset.xlsx"))
# Remove the unwanted characters
names(data) <- gsub(" ", "", names(data))
names(data) <- gsub("&", "and", names(data))
# Lower-casing
setnames(data, names(data), tolower(names(data)))
# Fix the factor names
data$employmentstatus <- as.factor(make.names(data$employmentstatus))
# Factor columns
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")
# := converts the columns in data by reference, so no reassignment is needed
data[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
# Create training and validation sets (80/20 split, non-overlapping as before)
trainObs <- sample(nrow(data), 0.8 * nrow(data), replace = FALSE)
valObs <- setdiff(seq_len(nrow(data)), trainObs)
train <- data[trainObs, ]
val <- data[valObs, ]
# Model
rf <- randomForest(employmentstatus ~ ., data = train, importance = TRUE)
print(rf)
##
## Call:
## randomForest(formula = employmentstatus ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.33%
## Confusion matrix:
## Employed Not.in.labor.force Unemployed class.error
## Employed 32402 453 12 0.014147930
## Not.in.labor.force 6 15563 134 0.008915494
## Unemployed 1 2124 509 0.806757783
# Predictions
rf_val_pred <- predict(rf, val)
val$rf_pred <- rf_val_pred
# Results
rf_conf_mat <- table(true = val$employmentstatus, pred = val$rf_pred)
cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01054605
# End timer
toc()
## Parallel Processing Approach: 165.808 sec elapsed
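As noted above, randomForest() does not use the registered backend by itself. If we wanted the forest itself to be grown in parallel, one common pattern (a sketch, assuming the doMC backend registered earlier is still active; this is not what was timed above) is to grow chunks of trees on separate cores with foreach and merge them with randomForest's combine():
# Sketch: grow 500 trees as four chunks of 125, one chunk per worker, then
# merge them into a single forest. Note that the merged forest does not
# carry a valid OOB error estimate.
library(foreach)
rf_par <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar% {
  randomForest(employmentstatus ~ ., data = train,
               ntree = ntree, importance = TRUE)
}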
From these results, we can see that registering multiple cores shaved roughly eight seconds off of our completion time for the entire process (165.8 versus 173.9 seconds). It stands to reason that the savings would be larger with bigger data and with algorithms that actually hand their work to the registered backend.
So, in the end, the take-home message from all this should be: if you have multiple cores on your computer, make sure that you take advantage of them. The parallel/doMC
approach is a simple, easy way to improve the efficiency of your data science projects.