SHORT NOTES: Random Forests

Introduction

The following notes are from the Coursera Practical Machine Learning course and are intended to help others understand the concepts and code behind the math. Random Forests are an extension of Bootstrap Aggregating (bagging). The basic idea is simple:

Draw bootstrap samples of the data
At each split, bootstrap the variables (only a random subset is considered)
Grow multiple trees and vote
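
The first two steps rely on nothing more than random sampling. The base R sketch below shows both on the iris data; the seed, the value of mtry, and the variable names are illustrative assumptions, not part of the course code.

```r
# Sketch of the two sources of randomness; all names here are illustrative.
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

n          <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# Step 1: a bootstrap sample draws n row indices with replacement.
boot_rows <- sample(n, size = n, replace = TRUE)
oob_rows  <- setdiff(seq_len(n), boot_rows)  # rows never drawn ("out of bag")

# Step 2: at a split, only a few randomly chosen predictors are candidates.
mtry       <- 2  # assumed subset size
candidates <- sample(predictors, size = mtry)

length(boot_rows)      # always n
length(oob_rows) / n   # typically close to 1/e, roughly 0.37
candidates
```

A real forest repeats step 1 once per tree and step 2 at every split of every tree.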

The main advantage of Random Forests is their accuracy. The method also has several drawbacks: speed, since training heavy Random Forest models takes a toll on your computer; a risk of overfitting; and interpretability, since it is hard to tell which branch of which tree explains which effect.

The method works by building multiple trees from resampled data. At each split, only a random subset of the variables is considered. To predict a new observation, we run it through every tree; each tree gives its own prediction, and the forest combines them by majority vote for classification (or by averaging for regression). This would be very laborious without computers or tidy languages like R, but with the caret library most of the heavy lifting is done by R functions.
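
To make the voting step concrete, the sketch below fits a small forest directly with the randomForest package and keeps each tree's individual prediction. The seed, ntree, mtry, and the three hand-picked rows are illustrative assumptions rather than the course's settings.

```r
library(randomForest)

set.seed(42)  # arbitrary seed
# ntree and mtry are illustrative choices.
rf <- randomForest(Species ~ ., data = iris, ntree = 51, mtry = 2)

# Three hand-picked rows, one from each species.
new_obs <- iris[c(1, 75, 150), ]

# predict.all = TRUE keeps every tree's prediction alongside the aggregate.
votes <- predict(rf, new_obs, predict.all = TRUE)

dim(votes$individual)         # one row per observation, one column per tree
table(votes$individual[2, ])  # how the trees voted on the second observation
votes$aggregate               # the forest's majority-vote prediction
```

Rows near a class boundary usually show a split vote, while clear-cut rows tend to get the same answer from nearly every tree.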

Code Example

We will use the iris data set to build a random forest model that predicts the species of each flower.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

data(iris)
library(ggplot2)
names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

As you can see, there are only three possible classes in the data set (setosa, versicolor, and virginica). We will split the data into training and testing sets, and then proceed to train a model using Random Forest.

inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]
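
A quick sanity check on the split is to count the species in each part. The sketch below repeats the partition under an arbitrary seed; the object names (train_idx, train_set, test_set) are hypothetical and chosen so they do not overwrite the objects used in these notes.

```r
library(caret)

set.seed(123)  # arbitrary seed so the split is repeatable
train_idx <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# createDataPartition samples within each class, so the species stay balanced.
table(train_set$Species)
table(test_set$Species)
```

With 50 flowers per species and p = 0.7, each class should contribute 35 rows to training and 15 to testing, which agrees with the 105 samples reported in the model summary.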

modFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

modFit

## Random Forest 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9364902  0.9031117
##   3     0.9405967  0.9093574
##   4     0.9354307  0.9014021
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

The training call is simple. We tell the train function to build a model with Species as the outcome and all other variables as predictors, pass the training set as data, and select Random Forest with method = "rf". The prox = TRUE argument asks randomForest to keep the proximity matrix, which we will need shortly to compute the class centers. The resampled accuracy is about 0.94 for every mtry value tried, so we can expect the model to classify well. We can inspect a single tree from the fitted forest by using:

getTree(modFit$finalModel, k = 2)

##    left daughter right daughter split var split point status prediction
## 1              2              3         4        0.70      1          0
## 2              0              0         0        0.00     -1          1
## 3              4              5         3        4.95      1          0
## 4              6              7         4        1.65      1          0
## 5              8              9         4        1.70      1          0
## 6              0              0         0        0.00     -1          2
## 7             10             11         2        2.95      1          0
## 8             12             13         2        2.45      1          0
## 9              0              0         0        0.00     -1          3
## 10             0              0         0        0.00     -1          3
## 11             0              0         0        0.00     -1          2
## 12             0              0         0        0.00     -1          3
## 13            14             15         2        2.85      1          0
## 14             0              0         0        0.00     -1          2
## 15             0              0         0        0.00     -1          3
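
In the output above, split var is the position of the predictor (2 = Sepal.Width, 3 = Petal.Length, 4 = Petal.Width), a status of -1 marks a terminal node, and prediction is the class index at that node. As a sketch, getTree can also print names instead of indices through its labelVar argument; the forest below is fitted directly with randomForest under an arbitrary seed and tree count, so its trees will differ from the one shown.

```r
library(randomForest)

set.seed(42)  # arbitrary seed
rf <- randomForest(Species ~ ., data = iris, ntree = 25)  # illustrative size

# labelVar = TRUE returns a data frame with variable and class names.
tree2 <- getTree(rf, k = 2, labelVar = TRUE)
head(tree2)

# Terminal nodes carry the class prediction; internal nodes carry the split.
leaves <- tree2[tree2$status == -1, ]
table(leaves$prediction)
```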

Random Forest Centers

We can use class centers to show, on a plot, where the typical member of each class sits. This gives a quick visual cue about how observations will be allocated to classes and where the boundary between one class and the next might make predictions inaccurate.

irisP <- classCenter(training[, c(3,4)], training$Species, modFit$finalModel$proximity)
irisP <- as.data.frame(irisP)
irisP$Species <- rownames(irisP)

p <- qplot(Petal.Width, Petal.Length, col = Species, data = training)
p + geom_point(aes(x = Petal.Width, y = Petal.Length, col = Species), size = 5, shape = 4, data = irisP)
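
The class centers depend on the proximity matrix, which records how often two observations end up in the same terminal node across the forest. The sketch below looks at that matrix directly; it fits on the full iris data with an arbitrary seed and tree count, so the numbers are only indicative.

```r
library(randomForest)

set.seed(42)  # arbitrary seed
rf <- randomForest(Species ~ ., data = iris, ntree = 100, proximity = TRUE)

prox <- rf$proximity
dim(prox)  # one row and one column per observation

# Compare average proximity within a species and between species
# (the within-species average includes each flower paired with itself).
species      <- as.character(iris$Species)
same_species <- outer(species, species, "==")
mean(prox[same_species])
mean(prox[!same_species])

# classCenter uses the proximities to find a prototype for each class.
centers <- classCenter(iris[, c("Petal.Length", "Petal.Width")],
                       iris$Species, prox)
centers
```

Flowers of the same species should share terminal nodes far more often than flowers of different species, which is why the centers land inside each cluster.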

Making Predictions with Random Forest

The only reason to train a model is to make predictions with it! Let’s see how well our trained model does with the testing set data.

# Predicting new values with random forests models
pred <- predict(modFit, testing)
testing$predRight <- pred == testing$Species
table(pred, testing$Species)

##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13
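
The table can be reduced to a few summary numbers. The sketch below re-enters the counts shown above by hand, so it runs without the fitted model; with the model at hand, caret's confusionMatrix(pred, testing$Species) reports similar statistics.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the table above: rows are predictions, columns are truth.
conf <- matrix(c(15,  0,  0,
                  0, 14,  2,
                  0,  1, 13),
               nrow = 3, byrow = TRUE,
               dimnames = list(pred = classes, actual = classes))

errors   <- sum(conf) - sum(diag(conf))  # 3 misclassified flowers
accuracy <- sum(diag(conf)) / sum(conf)  # 42 / 45, about 0.933
recall   <- diag(conf) / colSums(conf)   # share of each species recovered

errors
accuracy
round(recall, 3)
```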

Not bad! There are only three misclassifications: two virginica flowers predicted as versicolor and one versicolor predicted as virginica. If you look at the prior plot, this is plausible, since the two classes share a boundary, which makes prediction there a bit fuzzy. We can check this easily with another plot:

qplot(Petal.Width, Petal.Length, colour = predRight, data = testing, main = "Predictions Based on Test Set")

As we suspected, the misclassified flowers are located right at the boundary between the two classes.

Conclusions

Random Forests are usually one of the two top-performing algorithms, along with Boosting, in prediction contests such as Kaggle. They are difficult to interpret given their intricate methodology, but their accuracy makes up for this shortcoming. The one word of caution is to guard against overfitting: challenge the model with cross-validation on the training set before moving to the test set.
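
As a closing sketch, caret can run that cross-validation for us through trainControl. The fold count, the mtry grid, the seed, and the object names below are illustrative assumptions, not settings from the course.

```r
library(caret)

set.seed(123)  # arbitrary seed
train_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[train_idx, ]

# 10-fold cross-validation instead of the default bootstrap resampling.
ctrl <- trainControl(method = "cv", number = 10)
grid <- data.frame(mtry = 2:4)  # candidate numbers of variables per split

cv_fit <- train(Species ~ ., data = train_set, method = "rf",
                trControl = ctrl, tuneGrid = grid)

cv_fit$results[, c("mtry", "Accuracy", "Kappa")]
cv_fit$bestTune  # the mtry value with the best cross-validated accuracy
```

Only after settling on a model this way should the held-out test set be used, and then only once.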