Introduction:

Computers get faster and more powerful every day.

One notable development in recent years has been the introduction of parallel processing.

However, this capability is seldom used unless a program is already configured to take advantage of it automatically.

It can also be confusing to decide exactly how to set up one’s system to use parallel processing when it is available.

Here, we will discuss the use of one easy-to-use library for setting up parallel processing in R and conduct a time test to gauge its performance.


The libraries:

First of all, to measure performance, we will use the package “tictoc” to time the whole process. Next, we will use the libraries “doMC” and “parallel” to set up our processing. Note that “doMC” relies on forking, so it works on Linux and macOS but not on Windows.

Many packages, such as “caret”, will then pick up the registered cores and use them automatically. In other cases, the cores can be used explicitly through the mclapply function from “parallel”. For the sake of convenience, we will use a simpler example here.
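
As a minimal illustration of the explicit route, the sketch below applies a function to each element of a vector with mclapply from the parallel package. The toy function and the choice of two workers are assumptions made for the example. mclapply relies on forking, so on Windows it only runs with a single core.

```r
library(parallel)

# Toy task (hypothetical): square each input
square <- function(x) x^2

# mclapply forks worker processes, which Windows does not support
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Same interface as lapply, with the work spread across n_workers
squares <- mclapply(1:4, square, mc.cores = n_workers)

unlist(squares)
```

For a task this small the cost of starting the workers outweighs any gain; the pattern pays off when each call does a substantial amount of work.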

# Options 
set.seed(100)

options(scipen = 3)

# The libraries
library(tictoc) 
library(readxl)
library(data.table)
library(randomForest)
library(caret)
library(parallel)
library(doMC)

# Function to compute classification error (1 - accuracy) from a confusion matrix
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)

  # Correct predictions sit on the diagonal
  1 - sum(diag(conf_mat)) / sum(conf_mat)
}
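
To see what the helper returns, here is a small sketch on made-up labels. The ten observations are hypothetical and chosen so that two predictions are wrong.

```r
# Same helper as above, repeated so the snippet runs on its own
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  1 - sum(diag(conf_mat)) / sum(conf_mat)
}

# Hypothetical true and predicted labels for ten observations
truth <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes", "no", "yes"))
pred  <- factor(c("yes", "no",  "yes", "no", "no", "yes", "no", "yes", "no", "yes"))

# Rows are true classes, columns are predicted classes
conf_mat <- table(true = truth, pred = pred)

classification_error(conf_mat)
```

Two of the ten predictions fall off the diagonal, so the error is 0.2.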

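Now that our libraries and helper function are loaded, we are ready to do our machine learning. Let’s start with a standard approach using a single core. The timer covers the whole pipeline, from reading the Excel file to scoring the validation set.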

# Start timer
tic("Standard Approach:")

# Load the data and create a data.table
data <- setDT(read_xlsx("Training Dataset.xlsx"))

# Remove the unwanted characters
names(data) <- gsub(" ", "", names(data))

names(data) <- gsub("&", "and", names(data))

# Lower-casing
setnames(data, names(data), tolower(names(data)))

# Fix the factor names
data$employmentstatus <- as.factor(make.names(data$employmentstatus))

# Factor columns 
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")

# := converts the columns in place, so no assignment is needed
data[, (factors) := lapply(.SD, as.factor), .SDcols = factors]

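# Create training and validation sets
# Note: valObs is sampled independently of trainObs, so the two sets can overlap
# and the validation error below is optimistic
trainObs <- sample(nrow(data), .8 * nrow(data), replace = FALSE)
valObs <- sample(nrow(data), .2 * nrow(data), replace = FALSE)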

train <- data[trainObs,]
val <- data[valObs,]

# Model
rf <- randomForest(employmentstatus ~ ., data = train, importance = TRUE)

print(rf)
## 
## Call:
##  randomForest(formula = employmentstatus ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.38%
## Confusion matrix:
##                    Employed Not.in.labor.force Unemployed class.error
## Employed              32370                460         11 0.014341829
## Not.in.labor.force        5              15581        149 0.009787099
## Unemployed                2               2129        497 0.810882801
# Predictions
rf_val_pred <- predict(rf, val)
val$rf_pred <- rf_val_pred

# Results
rf_conf_mat <- table(true = val$employmentstatus, pred = val$rf_pred)

cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01124912
# End timer
toc()
## Standard Approach:: 173.896 sec elapsed
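
The split in the code above draws trainObs and valObs with two separate sample() calls, so some rows can land in both sets, which is one likely reason the validation error comes out lower than the OOB estimate. A minimal sketch of a disjoint split follows. It uses the built-in iris data rather than the training file, and the names train_idx, val_idx, train_demo and val_demo are made up for the example.

```r
set.seed(100)

n <- nrow(iris)

# Draw 80% of the row indices for training
train_idx <- sample(n, floor(0.8 * n))

# Validation rows are everything that was not drawn for training
val_idx <- setdiff(seq_len(n), train_idx)

train_demo <- iris[train_idx, ]
val_demo   <- iris[val_idx, ]

# Number of rows shared by the two sets
length(intersect(train_idx, val_idx))
```

Because val_idx is the complement of train_idx, no row is used both to fit and to score the model.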

Next, we will run the same experiment, but this time we will register all of the computer’s cores and see if it makes a difference.

# Parallel processing
library(parallel)
library(doMC)

# Detect the cores available in the computer
numCores <- detectCores()

# Register the cores for use by R
registerDoMC(cores = numCores)

# Start timer
tic("Parallel Processing Approach:")

# Load the data and create a data.table
data <- setDT(read_xlsx("Training Dataset.xlsx"))

# Remove the unwanted characters
names(data) <- gsub(" ", "", names(data))

names(data) <- gsub("&", "and", names(data))

# Lower-casing
setnames(data, names(data), tolower(names(data)))

# Fix the factor names
data$employmentstatus <- as.factor(make.names(data$employmentstatus))

# Factor columns 
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")

# := converts the columns in place, so no assignment is needed
data[, (factors) := lapply(.SD, as.factor), .SDcols = factors]

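# Create training and validation sets
# Note: valObs is sampled independently of trainObs, so the two sets can overlap
# and the validation error below is optimistic
trainObs <- sample(nrow(data), .8 * nrow(data), replace = FALSE)
valObs <- sample(nrow(data), .2 * nrow(data), replace = FALSE)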

train <- data[trainObs,]
val <- data[valObs,]

# Model
rf <- randomForest(employmentstatus ~ ., data = train, importance = TRUE)

print(rf)
## 
## Call:
##  randomForest(formula = employmentstatus ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.33%
## Confusion matrix:
##                    Employed Not.in.labor.force Unemployed class.error
## Employed              32402                453         12 0.014147930
## Not.in.labor.force        6              15563        134 0.008915494
## Unemployed                1               2124        509 0.806757783
# Predictions
rf_val_pred <- predict(rf, val)
val$rf_pred <- rf_val_pred

# Results
rf_conf_mat <- table(true = val$employmentstatus, pred = val$rf_pred)

cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01054605
# End timer
toc()
## Parallel Processing Approach:: 165.808 sec elapsed
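
The randomForest() function does not hand work to the registered cores on its own, so one common pattern is to grow several smaller forests in parallel with foreach and merge them with randomForest::combine(). The sketch below is an illustration, not the run timed above. It uses the built-in iris data, and the two workers and the four forests of 125 trees each are assumptions made for the example.

```r
library(randomForest)
library(foreach)
library(doMC)

# Register two worker processes (an assumption for this example)
registerDoMC(cores = 2)

# Grow four forests of 125 trees each in parallel, then merge them
rf_par <- foreach(ntree = rep(125, 4),
                  .combine = randomForest::combine,
                  .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

# Number of trees in the merged forest
rf_par$ntree
```

The merged object can be passed to predict() as usual, but combine() drops the OOB error rate and confusion matrix, so those have to be estimated on a separate validation set.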

Conclusion:

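From these results, we can see that registering multiple cores shaved about 8 seconds off our completion time for the entire process (173.9 versus 165.8 seconds, roughly 5%). The gain is modest, and some of it may be run-to-run variation, since randomForest() itself grows its trees on a single core and steps such as reading the Excel file are serial. It stands to reason that the results would be more impressive with larger data and with algorithms that split their work across the registered cores.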

So in the end, the take-home message from all this should be: if you have multiple cores on your computer, make sure that you take advantage of their power and use them. The parallel/doMC approach is a simple, easy way to improve the efficiency of your data science projects.