Introduction:

In data science, powerful machine learning algorithms are used to make predictions from user-supplied input. These predictions can be numeric and continuous or discrete and categorical. Each algorithm is best suited to particular types of input, and the output often changes depending on which model is used.

Sometimes one algorithm is not enough, especially if the dataset is large and has numerous features and outputs. For example, one algorithm may perform perfectly when predicting the majority class label while being abysmal at identifying the minority classes. Another algorithm might be superb at detecting members of one minority class, but not so great at predicting any of the other categories. These issues are unique to each dataset, and there is no universal method for choosing the best performer in a given situation.

In these situations, it is often wise to put together what is commonly referred to as an “ensemble”. Dictionary.com defines this word as “all the parts of a thing taken together, so that each part is considered only in relation to the whole.” Indeed, that is exactly how data scientists use the term in this context.

An ensemble is a collection of algorithms whose individual outputs are combined in a systematic fashion to create a single output. When used properly, this method is one of the most powerful and effective means of improving a model’s accuracy.
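
As a toy illustration of the idea (using made-up predictions), three classifiers can be combined with a simple majority vote:

# Hypothetical predictions from three classifiers on three observations
p1 <- c("A", "A", "B")
p2 <- c("A", "B", "B")
p3 <- c("B", "A", "B")

votes <- data.frame(p1, p2, p3)

# The ensemble's output is the most common label in each row
majority <- apply(votes, 1, function(row) names(which.max(table(row))))
print(majority)  # "A" "A" "B"

Real ensembles use more sophisticated combination schemes, such as weighted averaging of class probabilities or the stacking approach demonstrated below.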

Unfortunately, creating an ensemble of algorithms can be an extremely complicated and time-consuming process. One has to create the individual models and take the time to train them all, and then a technique must be developed to unify their predictions in a cohesive manner. Luckily, the caretEnsemble package, which builds on caret in R, streamlines this process into a convenient set of functions that we will use here to create our own ensemble.


Method:

The first thing we need to do is establish some sort of baseline by which we will be able to judge the improvement of our ensemble model. For this task, we will use a random forest. The data consist of employment information from 2005-2012, and the outcome is the variable employmentstatus, which takes one of three categorical values: “Employed”, “Unemployed”, and “Not.in.labor.force”.
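
The construction of the train/validation/test split is not shown here; a minimal sketch of one plausible way to build it with caret follows, assuming the full dataset lives in a data frame named employment (a hypothetical name):

# Hypothetical split: 'employment' stands in for the full dataset, whose
# construction is not shown. createDataPartition preserves class proportions.
library(caret)

set.seed(42)

# 80% train / 20% test
train_idx <- createDataPartition(employment$employmentstatus, p = 0.8, list = FALSE)
train_full <- employment[train_idx, ]
test <- employment[-train_idx, ]

# Hold out 20% of the training data as a validation set
val_idx <- createDataPartition(train_full$employmentstatus, p = 0.2, list = FALSE)
val <- train_full[val_idx, ]
train <- train_full[-val_idx, ]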

Let's get started:

# Load the necessary libraries
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
# Set up parallel processing to take advantage of multiple machine cores
library(parallel)
library(doMC)

numCores <- detectCores()
registerDoMC(cores = numCores)

# Function to compute the overall classification error from a confusion matrix
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  
  # One minus the proportion of correct (diagonal) predictions
  error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
  
  return(error)
}

# Baseline model (worker, a helper column created in the caretList section
# below, is excluded so the baseline uses only the original features)
rf_model <- randomForest(employmentstatus ~ . - worker, data = train, importance = TRUE)

# Perform predictions on the validation set (20% of training data)
rf_pred <- as.factor(predict(rf_model, val))

rf_conf_mat <- table(true = val$employmentstatus, pred = rf_pred)

# Results 
print(rf_model)
## 
## Call:
##  randomForest(formula = employmentstatus ~ . - worker, data = train,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.33%
## Confusion matrix:
##                    Employed Not.in.labor.force Unemployed class.error
## Employed              32343                449         11 0.014023108
## Not.in.labor.force        6              15631        132 0.008751348
## Unemployed                2               2127        503 0.808890578
cat("\n", "RF Classification Error Rate, Validation:", classification_error(rf_conf_mat), "\n")
## 
##  RF Classification Error Rate, Validation: 0.01007734

Thus, we have established a baseline. The classification error rate of this simple random forest model is the metric against which we will judge the outcome of our ensemble model. Hopefully, we will see an improvement using the caretEnsemble technique.
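
In particular, the overall rate hides how poorly the baseline handles the minority class. The per-class error rates can be pulled straight from the fitted randomForest object:

# Per-class OOB error rates from the baseline model. Note the ~81% error
# on the minority "Unemployed" class despite the low overall error rate.
rf_model$confusion[, "class.error"]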


caretList:

# Load the required libraries
library(caret)
library(nnet)
library(e1071)
library(caretEnsemble)

# Prepare a Phase 1 model by reducing the outcome to a binary worker variable

# Create a new variable for workers
train$worker <- factor(ifelse(train$employmentstatus == "Not.in.labor.force", 0, 1))
val$worker <- factor(ifelse(val$employmentstatus == "Not.in.labor.force", 0, 1))

# make.names converts the 0/1 levels to syntactically valid names (X0/X1),
# which caret requires when classProbs = TRUE
train$worker <- as.factor(make.names(train$worker))
val$worker <- as.factor(make.names(val$worker))

# Model to predict workers 
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        search = "grid", savePredictions = "final",
                        index = createResample(train$worker, 10),
                        summaryFunction = twoClassSummary, classProbs = TRUE,
                        verboseIter = TRUE)

# List of algorithms to use in ensemble
alg_list <- c("rf", "glm", "gbm", "glmboost", "nnet", "treebag", "svmLinear")

multi_mod <- caretList(worker ~ . - employmentstatus, data = train,
                       trControl = control, methodList = alg_list, metric = "ROC")

# Results
res <- resamples(multi_mod)
summary(res)

In the code above, we have used the caretList function to train a set of algorithms simultaneously. These algorithms will be combined to help us separate the minority class from the other two, giving us a means of improving our final classification rate.
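
Before stacking, it is also worth checking how correlated the base models' predictions are, since stacking helps most when the models make different mistakes. A quick check, continuing from the res object above (modelCor is provided by caretEnsemble):

# Pairwise correlations of the base models' resampled ROC scores.
# Low correlations suggest the models err in different places,
# which is exactly what makes combining them worthwhile.
modelCor(res)

# Visual comparison of the resampling distributions
dotplot(res)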


caretStack:

# Stack 
stackControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                             savePredictions = TRUE, classProbs = TRUE,
                             verboseIter = TRUE)

stack <- caretStack(multi_mod, method = "rf", metric = "Accuracy", trControl = stackControl)

# Predict
stack_val_preds <- data.frame(predict(stack, val, type = "prob"))
stack_test_preds <- data.frame(predict(stack, test, type = "prob"))

The caretList function allowed us to create several different models simultaneously without having to call each one manually. The next step in the process is to aggregate their results using caretStack, which is done in the code above. This approach uses a random forest as the meta-model that combines the predictions of the models we created previously; of course, any other model supported by caret could be used in its place.
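
As an aside, if a full meta-model feels like overkill, the caretEnsemble package also offers the caretEnsemble function, which combines the base models with a simple linear (GLM) blend instead. A minimal sketch using the same multi_mod list:

# Alternative to caretStack: a simple GLM blend of the base models'
# out-of-fold predictions
linear_blend <- caretEnsemble(multi_mod)
summary(linear_blend)

# Prediction works the same way as with caretStack
blend_val_preds <- data.frame(predict(linear_blend, newdata = val, type = "prob"))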

Returning to our stacked model, the output is a set of class probabilities determined by the combined results of all of the models in the list. To reach our final classifications, we must decide upon a threshold value for converting those probabilities into class outcomes.

# Search for the probability threshold that minimizes classification error

# Candidate threshold values
thresholds <- seq(0, 1, .05)
num_thresh <- length(thresholds)

# Empty vector to store the error at each threshold
errors <- rep(0, num_thresh)

iter <- 1

for (i in thresholds) {

  cat("Calculating error for threshold value-", i, "\n")
  
  # Classify as a worker whenever the stacked probability exceeds the threshold
  val_work_pred <- ifelse(stack_val_preds > i, "Yes", "No")
  
  conf_mat <- table(true = val$worker, pred = val_work_pred)
  
  errors[iter] <- classification_error(conf_mat)
  
  iter <- iter + 1
}
## Calculating error for threshold value- 0 
## Calculating error for threshold value- 0.05 
## Calculating error for threshold value- 0.1 
## Calculating error for threshold value- 0.15 
## Calculating error for threshold value- 0.2 
## Calculating error for threshold value- 0.25 
## Calculating error for threshold value- 0.3 
## Calculating error for threshold value- 0.35 
## Calculating error for threshold value- 0.4 
## Calculating error for threshold value- 0.45 
## Calculating error for threshold value- 0.5 
## Calculating error for threshold value- 0.55 
## Calculating error for threshold value- 0.6 
## Calculating error for threshold value- 0.65 
## Calculating error for threshold value- 0.7 
## Calculating error for threshold value- 0.75 
## Calculating error for threshold value- 0.8 
## Calculating error for threshold value- 0.85 
## Calculating error for threshold value- 0.9 
## Calculating error for threshold value- 0.95 
## Calculating error for threshold value- 1
# Compute the final threshold value
library(data.table)

result <- data.table(cbind(thresholds, errors))

# Keep the row with the smallest error
final_value <- result[which.min(errors)]

val_worker_pred <- ifelse(stack_val_preds >= final_value$thresholds, 1, 0)

# Report error rate
phase1_conf <- table(true = val$worker, pred = val_worker_pred)

cat("Classification Error for Worker Predictions:", classification_error(phase1_conf), "\n")
## Classification Error for Worker Predictions: 0.03288806

Now that we have created our “worker” predictions for the data, we can re-run our original model to see if the ensemble has been able to improve our classification rate.

# Include the phase 1 predictions as part of the model, matching the
# factor levels used in the training data (X0/X1)
val$worker <- factor(make.names(val_worker_pred), levels = levels(train$worker))

# Model
rf <- randomForest(employmentstatus ~ ., data = train, importance = TRUE)

# Predictions
rf_val_pred <- predict(rf, val)

# Results
rf_conf_mat <- table(true = val$employmentstatus, pred = rf_val_pred)

print(rf)
## 
## Call:
##  randomForest(formula = employmentstatus ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 0.8%
## Confusion matrix:
##                    Employed Not.in.labor.force Unemployed class.error
## Employed              32479                  0        324 0.009877145
## Not.in.labor.force        0              15769          0 0.000000000
## Unemployed               84                  0       2548 0.031914894
cat("\n", "Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## 
##  Random Forest Classification Error, Validation: 0.03421608

Conclusion:

The creation of a two-tiered ensemble dramatically reduced the out-of-bag classification error of our model, which was already low to begin with. Because the baseline was already fairly accurate, the overall difference can be hard to see; the benefit of the ensemble becomes much clearer when looking at the confusion matrices. Initially, the minority class had an error rate of roughly 81%, with over 2,100 misclassified examples in the training data! The overall training error fell from 5.33% to 0.8%, a drop of about 4.5 percentage points, and, more importantly, the number of misclassified minority cases dropped to only 84. (Keep in mind that the final validation error of 3.4% also folds in the phase 1 worker predictions, which themselves carry a roughly 3.3% error rate.) Caret provides a useful and convenient interface for creating these kinds of models.