The code below comes directly from the Analytics Vidhya website:
Load data
data(iris)
library(ggplot2)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)
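Note that qplot() has been deprecated in recent ggplot2 releases; the same scatter plot can be drawn with the ggplot() interface, as in the sketch below.
# Equivalent plot with the ggplot() interface (qplot() is deprecated in recent ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()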
Load libraries
library(rpart) #For CART
library(caret) #For Random Forest & CART
library(rattle) #To plot decision trees
library(randomForest) #To build Random Forest model
library(randomForestSRC) #To build Random Forest model
Training and Test data
train.flag <- caret::createDataPartition(y=iris$Species,p=0.5,list=FALSE)
training <- iris[train.flag,]
Validation <- iris[-train.flag,]
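Because createDataPartition() draws a random stratified sample, the rows in training and Validation change from run to run. A minimal sketch of a reproducible split (the seed value is arbitrary):
set.seed(12321)  # fix the random seed so the 50/50 stratified split is reproducible
train.flag <- caret::createDataPartition(y = iris$Species, p = 0.5, list = FALSE)
training   <- iris[train.flag, ]
Validation <- iris[-train.flag, ]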
modfitCart <- caret::train(Species~ ., method="rpart", data=training) #CART model
rattle::fancyRpartPlot(modfitCart$finalModel, main="with CART model") #plot CART model
modfitCart
## CART
##
## 75 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 75, 75, 75, 75, 75, 75, ...
## Resampling results across tuning parameters:
##
##   cp    Accuracy   Kappa      Accuracy SD  Kappa SD
##   0.00  0.9136195  0.8684389  0.0433719    0.06530862
##   0.42  0.6768710  0.5369138  0.1383603    0.18573989
##   0.50  0.5174937  0.3190449  0.1552115    0.19207217
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
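By default train() evaluates cp over a small grid with 25 bootstrap resamples. If cross-validation and a finer grid are preferred, a possible sketch (trainControl() and tuneLength are standard caret arguments; modfitCart.cv is a name used here only for illustration):
ctrl <- caret::trainControl(method = "cv", number = 10)           # 10-fold cross-validation
modfitCart.cv <- caret::train(Species ~ ., method = "rpart", data = training,
                              trControl = ctrl, tuneLength = 10)  # try 10 cp values
modfitCart.cv$bestTune                                            # cp chosen by CV accuracy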
modfitRF <- caret::train(Species~ ., method="rf", data=training) #Random Forest
#It does not make sense to plot a single decision tree for a Random Forest model
modfitRF
## Random Forest
##
## 75 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 75, 75, 75, 75, 75, 75, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##   2     0.9129551  0.8680739  0.04834241   0.07324377
##   3     0.9113551  0.8657601  0.05142151   0.07749080
##   4     0.9129957  0.8683811  0.05208093   0.07807996
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
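A common follow-up for a Random Forest is to look at variable importance, which caret exposes through varImp(); for example:
caret::varImp(modfitRF)                                         # scaled importance of each predictor
plot(caret::varImp(modfitRF), main = "Random Forest variable importance")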
With training data
set.seed(12321)
#CART
train.cart<-predict(modfitCart, newdata=training)
table(train.cart, training$Species)
##
## train.cart   setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         24         3
##   virginica       0          1        22
# Misclassification rate = 4/75
#Random Forest
train.rf <- predict(modfitRF, newdata=training)
table(train.rf, training$Species)
##
## train.rf     setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         0
##   virginica       0          0        25
# Misclassification rate = 0/75 !!
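The misclassification rates above were counted from the off-diagonal cells by hand; they can also be computed directly from the predictions, for example:
mean(train.cart != training$Species)   # CART training error rate
mean(train.rf   != training$Species)   # Random Forest training error rate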
With test data (Validation data)
set.seed(12321)
#CART
pred.cart<-predict(modfitCart,newdata=Validation)
table(pred.cart,Validation$Species)
##
## pred.cart    setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         2
##   virginica       0          0        23
# Misclassification rate = 2/75
#Random Forest
pred.rf<-predict(modfitRF,newdata=Validation)
table(pred.rf,Validation$Species)
##
## pred.rf      setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         2
##   virginica       0          0        23
# Misclassification rate = 2/75
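caret::confusionMatrix() produces the same cross-tabulations together with accuracy, kappa and per-class statistics in a single call; for example:
caret::confusionMatrix(pred.cart, Validation$Species)   # CART on the validation set
caret::confusionMatrix(pred.rf,   Validation$Species)   # Random Forest on the validation set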
Every model has its own strengths. Random Forest, as seen in this case study, achieves very high accuracy on the training population because it combines many trees built on different subsets of the predictors; for the same reason, it can overfit the training data. The CART model, on the other hand, is a simpler model based on a small number of splitting criteria. This can be an oversimplification in some cases, but it works well in many business scenarios. The choice of model ultimately depends on the business requirements, and it is always good to compare the performance of several models before making this call.
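One way to compare the two models beyond a single train/validation split is caret's resamples() interface, which collects the resampling results of both fits; a minimal sketch (for a strictly paired comparison the two train() calls should share the same trainControl() resampling indices):
resamps <- caret::resamples(list(CART = modfitCart, RF = modfitRF))
summary(resamps)   # accuracy and kappa distributions of both models
bwplot(resamps)    # lattice box-and-whisker comparison (lattice is loaded with caret)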