Random Forests (RF) are an ensemble method designed to improve the performance of the Classification and Regression Tree (CART) algorithm. Random Forest is a supervised learning method, where the target class is known a priori, and we seek to build a model (classification or regression) to predict future responses.

tl;dr

This tutorial serves as an introduction to the Random Forest and covers the following material:

  1. Packages Used: Packages used throughout this tutorial

  2. Data Sets: Data sets used in this tutorial

  3. Decision Trees: Overview of decision trees and the CART algorithm

  4. Bias-Variance Tradeoff: Discussion of the bias-variance tradeoff and the drawbacks of decision trees

  5. Bagging: Overview of bagging and its utility in Random Forest

  6. Random Forest: Overview of the Random Forest algorithm

  7. Exercises: Exercises with Random Forest

Packages Used

library(tidyverse)
library(ISLR)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)

tidyverse is a collection of data manipulation tools, from which we use functions from dplyr.

ISLR is the companion package to An Introduction to Statistical Learning with Applications in R (James et al. 2013), providing the data sets used in that text.

caret is a standard machine learning library in R, with numerous built-in features. From this package, we implement createDataPartition to create training and validation sets and confusionMatrix to generate relevant metrics from our model. Although caret can fit both models through its unified syntax, we use the following two packages directly for this tutorial.

randomForest is the standard package to implement the Random Forest algorithm.

rpart is one of many packages based on the original CART algorithm, used to generate decision trees.

rpart.plot is a fancy wrapper for the plot command to generate a decision tree plot.

Data sets

Two data sets will be used for this tutorial.

Carseats, from ISLR, is a simulated data set containing sales of child car seats at 400 different stores. This data set is used to generate a decision tree for reference prior to discussion of the Random Forest algorithm.

Caravan, from ISLR, is a data set containing 5,822 customer records and 85 predictor variables relating to the purchase of a caravan insurance policy (Yes/No response). It is used to generate the Random Forest model.

Decision Trees

Discussion of Random Forests is not possible without first introducing the Decision Tree and the CART algorithm.

Classification and Regression Trees (CART) were first introduced in 1984 by a group led by Leo Breiman (Breiman et al. 1984). The CART algorithm provides a means to sequentially conduct binary splits on the variables provided to it, resulting in a decision structure that resembles its namesake, a tree. The algorithm can model both qualitative (categorical) and quantitative (numeric) responses, recursively stratifying the decision space (the set of input variables) using classification error or node purity for qualitative responses and the residual sum of squares for quantitative ones. For this tutorial, we focus on qualitative responses, seeking to predict the class to which a set of predictors belongs.
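
For a two-class qualitative response, a common node purity measure is the Gini index, G = sum over classes of p(1 - p). A minimal sketch (the gini function below is illustrative, not from any package):

#Gini index for a node with class proportions p: G = sum(p * (1 - p))
gini <- function(p) sum(p * (1 - p))

gini(c(1, 0))      # 0.00 -> perfectly pure node
gini(c(0.9, 0.1))  # 0.18 -> mostly pure node
gini(c(0.5, 0.5))  # 0.50 -> maximally impure two-class node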

While discussing decision trees, it is important to note that not every model provides the best fit to every problem.

Figure 1 - Decision Space Stratification (James et al. 2013)

As seen in Figure 1, the upper left example is best suited to a regression function, while a similar regression would perform poorly on the lower left example. Alternatively, stratification of the decision space by a tree would perform poorly in the upper right, while it is aptly suited to the example in the lower right. Model selection matters: although we could apply CART to any example above, performance under an incorrect model choice would be poor.

#Example modified from ISLR, page 324 (James et al. 2013)
attach(Carseats)
#code High as a factor so rpart fits a classification tree
High <- factor(ifelse(Sales<8, "No","Yes"))
Carseats <- data.frame(Carseats,High)
#decision tree for depiction
d_tree <- rpart(High ~ .-Sales, data=Carseats)
rpart.plot(d_tree, main="Full Data Set Decision Tree", fallen.leaves=FALSE, extra=104, box.palette="GnBu")

Various decision tree plotting tools are available; using rpart.plot, we are able to present the decision tree nicely. At each node above, we see the predicted class (Yes/No), the proportion of each class at that node, and finally the percentage of observations from the original data set that reside in that split. The split on the right contains only 21% of observations, and we see that its leaf nodes terminate earlier than those on the larger portion of the data.

Bias-Variance Tradeoff

With any classification method, we seek results such that our overall classification rate on the test set is 100% (no false positives or false negatives), where each predicted class is correctly placed according to its true outcome. High variance, however, is one drawback of the CART algorithm: data used in one training/validation split may produce results dissimilar to another random split, and a model that performed well on the test set may not produce good results given additional data. As James et al. (2013) state, "a small change in the data can cause a large change in the final estimated tree." These results exemplify the bias-variance tradeoff: decreasing model bias (a more flexible fit) tends to produce large variance in the final results, while increasing bias lowers variance but can also produce an oversimplification of the final model. While decision trees are useful visual aids for a decision-making process, their performance is hindered by high variance, depicted below.

Figure 2 - Bias-Variance Tradeoff ("Understanding the Bias-Variance Tradeoff: An Overview," n.d.)

Figure 2 shows the difference between high/low bias and variance. One would ideally seek to build a model that both decreases variance, the large shot group, and decreases bias, the off-center shot group. Results that reside in the upper left corner of this chart, where the model is not overfit and the variance is low, are ideal. When we look at decision trees with regard to bias and variance, we see the results below: the decision tree structure changes with the random seed used to generate the training sets.

#modification of ISLR, p324

seats <- ISLR::Carseats
seats$High <- factor(ifelse(seats$Sales<8, "No","Yes")) #factor response for rpart and caret

iter <- 2

for(i in 1:iter){
  set.seed(i+43)
  # create train/test sets
  train_index <- createDataPartition(seats$High, p=.70,
                                     list=FALSE,
                                     times = 1)
  
  train_DF <- seats[train_index,]
  validate_DF <- seats[-train_index,]
  
  train_y <- train_DF$High
  train_x <- select(train_DF, -High)
  
  validate_y <- validate_DF$High
  validate_x <- select(validate_DF, -High)
  
  d_tree <- rpart(High ~ .-Sales,train_DF, na.action = na.exclude)
  
  # graphs

  rpart.plot(d_tree, main=paste0("Decision Tree ", i), fallen.leaves=FALSE, extra=104, box.palette="GnBu") 
  
  validate_preds <- predict(d_tree, validate_x)
  
  pred_y <- vector()
  
  #classify each observation from its predicted "No" probability
  for(j in 1:nrow(validate_preds)){
    if (validate_preds[j, "No"] >= 0.5){
      pred_y[j] = "No"
    } else {
      pred_y[j] = "Yes"
    }
  }
  
  pred_y <- factor(pred_y, levels = levels(validate_y))
  conf_mat <- confusionMatrix(pred_y, validate_y)
  print(paste0("Classification Accuracy: ", conf_mat$overall[1]))
  
}

## [1] "Classification Accuracy: 0.705882352941177"

## [1] "Classification Accuracy: 0.789915966386555"

As seen above, two trees built with random seeds offset by only 1 produce differing tree structures and differing classification accuracy on a test set. Similarly, these trees differ significantly from the original tree generated above using the full data set without partitions. For this reason, we turn to random forests, beginning with a discussion of bagging.

Bagging

Bagging (bootstrap aggregating) builds on the bootstrap, presented in the resampling methods tutorial. Using this method, n bootstrapped samples are drawn, and n models are trained, one per sample. For a classification problem, the class probabilities predicted by the trained models are averaged (or a majority vote is taken) to classify each new sample. In doing so, variance is decreased, as we now move from one original tree to n unique trees. A minimal hand-rolled sketch of this procedure, reusing the objects from the loop above, is shown below.
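
#hand-rolled bagging sketch (illustrative only): B bootstrapped trees with
#averaged class probabilities; reuses train_DF/validate_x/validate_y from above
set.seed(1)
B <- 25
n <- nrow(train_DF)
prob_yes <- rep(0, nrow(validate_x))

for(b in 1:B){
  boot_index <- sample(n, replace=TRUE)                      #bootstrap sample
  boot_tree <- rpart(High ~ .-Sales, train_DF[boot_index,])  #one tree per sample
  prob_yes <- prob_yes + predict(boot_tree, validate_x)[,"Yes"]/B #average probabilities
}

bag_pred <- factor(ifelse(prob_yes >= 0.5, "Yes", "No"), levels=levels(validate_y))
confusionMatrix(bag_pred, validate_y)$overall[1]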

Random Forests

Figure 3 - 3 Tree Random Forest Example

Building on the concept of bagging, random forests exploit an important property of the bootstrap: each tree is grown on a bootstrap sample of the original data set, and on average only about two thirds of the observations appear in each sample. The remaining one third of observations, which the tree never saw during training, are used to test each newly created tree, producing the out-of-bag (OOB) error, with which we can gauge the accuracy of each tree. A quick simulation of the two-thirds figure is sketched below.
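
#quick check of the two-thirds figure: the chance a given observation is left
#out of a bootstrap sample of size n is (1 - 1/n)^n, approaching 1/e ~ 0.368
set.seed(1)
n <- 10000
boot_index <- sample(n, replace=TRUE)
mean(!(1:n %in% boot_index))   #~0.368 of observations are out-of-bag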

That much, however, is achievable with bagging alone, and here the second distinction of random forests emerges: subsets of variables as well. Within the random forest model, only a fraction of the variables are considered when building each tree node (by default, the square root of the number of predictors for classification), which not only de-correlates the trees for each model fit but also ensures that strong predictor variables do not overpower the model at each split. In essence, we create multiple trees in the forest, each a unique combination of variables and samples from the original set. The resulting model proves more robust to changes in future data and has reduced variance relative to a single tree model. A short sketch comparing bagging and a random forest through the mtry argument follows.
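
#bagging is the special case mtry = p, while the randomForest classification
#default is floor(sqrt(p)); a sketch reusing the seats data from above, with
#Sales dropped because High is derived from it
x <- select(seats, -High, -Sales)
y <- seats$High
p <- ncol(x)

set.seed(1)
bag_fit <- randomForest(x=x, y=y, mtry=p)              #bagging
rf_fit <- randomForest(x=x, y=y, mtry=floor(sqrt(p)))  #random forest

#OOB error of each full forest (last row of the error-rate matrix)
tail(bag_fit$err.rate[,"OOB"], 1)
tail(rf_fit$err.rate[,"OOB"], 1)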

Parameter Tuning

In the random forest model, there are a number of tunable hyperparameters; however, we limit discussion to the most important, with a short sketch of passing them to randomForest after this list.

  • ntree - number of trees in the forest
  • mtry - number of variables used to generate splits at each node
  • classwt - prior class probabilities
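
A minimal sketch of passing these arguments to randomForest, reusing the x and y objects from the section above (the values are illustrative, not tuned):

#illustrative hyperparameter values only -- not tuned for this data
rf_custom <- randomForest(x=x, y=y,
                          ntree=1000,       #grow 1000 trees
                          mtry=5,           #consider 5 variables per split
                          classwt=c(1, 2))  #prior class weights for (No, Yes)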

Within the randomForest package, there are also methods by which to select the optimal condition for certain parameters.

  • rfcv - this method calculates the cross-validated prediction performance of models with a sequentially reduced number of predictors
  • tuneRF - this method uses the OOB error to search for the optimal mtry value

Further details about Random Forests, including all methods and parameters, are available in the randomForest package documentation.

Exercises

Prepare the data set - build a train and test set:

Using the function createDataPartition, training and validation sets are created, splitting the data 80/20 on the variable Purchase. This method, the validation set approach, allows us to build a model from 80% of the data and use the remaining 20% to validate our results and tune parameters as necessary.

set.seed(1)
#load the data
my_data <- ISLR::Caravan
#split data into predictors and response, train and validate
#use times = 2 for train, val and test

# create train and validate data sets using caret 80/20 split
train_index <- createDataPartition(my_data$Purchase, p=.8,
                                   list=FALSE,
                                   times = 1)

train_DF <- my_data[train_index,]
validate_DF <- my_data[-train_index,]

train_y <- train_DF$Purchase
train_x <- select(train_DF, -Purchase)

validate_y <- validate_DF$Purchase
validate_x <- select(validate_DF, -Purchase)

Build the model:

The function randomForest, given the minimal inputs necessary, builds a random forest model based on the training data set created above. Using the function varImpPlot from the package, we graphically display the variable importance within our model, showing both the mean decrease in classification accuracy (from OOB samples) and the mean decrease in the Gini index, a measure of node purity at each split.

#build the model with default parameters

rf_model1 <- randomForest(x=train_x,
                          y=train_y,
                          importance=TRUE,
                          na.action=na.exclude)

#display model variable importance
#importance(rf_model1, type=1)
# plot variable importance
varImpPlot(rf_model1)
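
Printing the fitted model object also reports the OOB estimate of the error rate, along with a confusion matrix built from the OOB predictions:

#inspect the OOB error estimate for the fitted model
print(rf_model1)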

Using the model to predict from a testing set:

The predict function takes as inputs the random forest model generated above, the validation set data (without the target class), and a specific call to return prediction probabilities by class. Using these probabilities, the vector pred_y is generated, assigning a class to each new observation at a 50% probability threshold. We then pass this new vector and the actual target class responses into the confusionMatrix function from caret to generate accuracy measures, shown below.

#use the validation set to build predictions
#preds for validation set, confuion matrix
validate_preds <- predict(rf_model1, newdata=validate_x, type="prob")

pred_y <- vector()

#classify each observation from its predicted "No" probability
for(i in 1:nrow(validate_preds)){
  if (validate_preds[i, "No"] >= 0.5){
    pred_y[i] = "No"
  } else {
    pred_y[i] = "Yes"
  }
}
pred_y <- factor(pred_y, levels = c("No", "Yes")) 
# build confusion matrix from predictions
confusionMatrix(pred_y, validate_y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1081   66
##        Yes   13    3
##                                           
##                Accuracy : 0.9321          
##                  95% CI : (0.9161, 0.9459)
##     No Information Rate : 0.9407          
##     P-Value [Acc > NIR] : 0.9018          
##                                           
##                   Kappa : 0.0494          
##  Mcnemar's Test P-Value : 4.902e-09       
##                                           
##             Sensitivity : 0.98812         
##             Specificity : 0.04348         
##          Pos Pred Value : 0.94246         
##          Neg Pred Value : 0.18750         
##              Prevalence : 0.94067         
##          Detection Rate : 0.92949         
##    Detection Prevalence : 0.98624         
##       Balanced Accuracy : 0.51580         
##                                           
##        'Positive' Class : No              
## 

Hyperparameter tuning:

Although our model achieves 93% accuracy, it is possible to improve performance by tuning the parameters passed into the model build. The two functions below, provided by the randomForest package, do just that.

# This function shows the cross-validated prediction performance of models with sequentially reduced
# number of predictors (ranked by variable importance) via a nested cross-validation procedure.
cross_val <- rfcv(train_x, train_y, cv.fold=5, scale="log", step=0.8,
     mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE)
#show only cross validation error and corresponding number of predictors
cross_val$error.cv
##         85         68         54         44         35         28 
## 0.06675252 0.06911354 0.06954282 0.06997210 0.07447950 0.07362095 
##         22         18         14         11          9          7 
## 0.07297703 0.07276240 0.07211848 0.07254776 0.07040137 0.06439150 
##          6          5          4          3          2          1 
## 0.06245976 0.06310367 0.06245976 0.05988410 0.05988410 0.05988410
# Starting with the default value of mtry, search for the optimal value (with respect to Out-of-Bag
# error estimate) of mtry for randomForest.
tuneRF(train_x, train_y, 55, ntreeTry=35, stepFactor=2, improve=0.005,
                  trace=TRUE, plot=TRUE, doBest=FALSE)
## mtry = 55  OOB error = 7.9% 
## Searching left ...
## mtry = 28    OOB error = 7.94% 
## -0.005434783 0.005 
## Searching right ...
## mtry = 85    OOB error = 8.07% 
## -0.02173913 0.005

##        mtry   OOBError
## 28.OOB   28 0.07941618
## 55.OOB   55 0.07898691
## 85.OOB   85 0.08070401

Improvement in classification accuracy may be garnered by using the results of these function calls in the original model build; this process should increase the overall classification accuracy of the final model.
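
For instance, a minimal sketch of refitting with the mtry value favored by tuneRF above (mtry = 55 had the lowest OOB error):

#refit using the mtry value suggested by tuneRF
rf_model2 <- randomForest(x=train_x,
                          y=train_y,
                          mtry=55,
                          importance=TRUE,
                          na.action=na.exclude)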

References

Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. First. Boca Raton, FL: Chapman & Hall/CRC Press.

James, G, D Witten, T Hastie, and R Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Sixth. New York: Springer.

Mayo, Matthew. n.d. “Understanding the Bias-Variance Tradeoff: An Overview.” KDnuggets. http://www.kdnuggets.com/2016/08/bias-variance-tradeoff-overview.html.