The code below comes directly from the Analytics Vidhya website:
Load data
data(iris)
library(ggplot2)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)
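Note that qplot() has been deprecated in recent ggplot2 releases; the same scatter plot can be drawn with the ggplot() interface, as in the sketch below.
# Equivalent plot with the ggplot() interface (qplot() is deprecated in recent ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()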
Load libraries
library(rpart) #For CART
library(caret) #For Random Forest & CART
library(rattle) #To plot decision trees
library(randomForest) #To build Random Forest model
library(randomForestSRC) #To build Random Forest model
Training and Test data
train.flag <- caret::createDataPartition(y=iris$Species,p=0.5,list=FALSE)
training <- iris[train.flag,]
Validation <- iris[-train.flag,]
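Because createDataPartition() draws a random stratified sample, the rows in training and Validation change from run to run. A minimal sketch of a reproducible split (the seed value is arbitrary):
set.seed(12321)  # fix the random seed so the 50/50 stratified split is reproducible
train.flag <- caret::createDataPartition(y = iris$Species, p = 0.5, list = FALSE)
training   <- iris[train.flag, ]
Validation <- iris[-train.flag, ]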
modfitCart <- caret::train(Species~ ., method="rpart", data=training) #CART model
rattle::fancyRpartPlot(modfitCart$finalModel, main="with CART model") #plot CART model
modfitCart
## CART
##
## 75 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 75, 75, 75, 75, 75, 75, ...
## Resampling results across tuning parameters:
##
##   cp    Accuracy   Kappa      Accuracy SD  Kappa SD
##   0.00  0.9136195  0.8684389  0.0433719    0.06530862
##   0.42  0.6768710  0.5369138  0.1383603    0.18573989
##   0.50  0.5174937  0.3190449  0.1552115    0.19207217
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
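By default train() evaluates cp over a small grid with 25 bootstrap resamples. If cross-validation and a finer grid are preferred, a possible sketch (trainControl() and tuneLength are standard caret arguments; modfitCart.cv is a name used here only for illustration):
ctrl <- caret::trainControl(method = "cv", number = 10)           # 10-fold cross-validation
modfitCart.cv <- caret::train(Species ~ ., method = "rpart", data = training,
                              trControl = ctrl, tuneLength = 10)  # try 10 cp values
modfitCart.cv$bestTune                                            # cp chosen by CV accuracy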
modfitRF <- caret::train(Species~ ., method="rf", data=training) #Random Forest
#It does not make sense to plot a single decision tree for a Random Forest model
modfitRF
## Random Forest
##
## 75 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 75, 75, 75, 75, 75, 75, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##   2     0.9129551  0.8680739  0.04834241   0.07324377
##   3     0.9113551  0.8657601  0.05142151   0.07749080
##   4     0.9129957  0.8683811  0.05208093   0.07807996
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
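A common follow-up for a Random Forest is to look at variable importance, which caret exposes through varImp(); for example:
caret::varImp(modfitRF)                                         # scaled importance of each predictor
plot(caret::varImp(modfitRF), main = "Random Forest variable importance")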
With training data
set.seed(12321)
#CART
train.cart<-predict(modfitCart, newdata=training)
table(train.cart, training$Species)
##
## train.cart   setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         24         3
##   virginica       0          1        22
# Misclassification rate = 4/75
#Random Forest
train.rf <- predict(modfitRF, newdata=training)
table(train.rf, training$Species)
##
## train.rf     setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         0
##   virginica       0          0        25
# Misclassification rate = 0/75 !!
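The misclassification rates above were counted from the off-diagonal cells by hand; they can also be computed directly from the predictions, for example:
mean(train.cart != training$Species)   # CART training error rate
mean(train.rf   != training$Species)   # Random Forest training error rate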
With test data (Validation data)
set.seed(12321)
#CART
pred.cart<-predict(modfitCart,newdata=Validation)
table(pred.cart,Validation$Species)
##
## pred.cart    setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         2
##   virginica       0          0        23
# Misclassification rate = 2/75
#Random Forest
pred.rf<-predict(modfitRF,newdata=Validation)
table(pred.rf,Validation$Species)
##
## pred.rf      setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         2
##   virginica       0          0        23
# Misclassification rate = 2/75
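caret::confusionMatrix() produces the same cross-tabulations together with accuracy, kappa and per-class statistics in a single call; for example:
caret::confusionMatrix(pred.cart, Validation$Species)   # CART on the validation set
caret::confusionMatrix(pred.rf,   Validation$Species)   # Random Forest on the validation set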
Every model has its own strengths. Random Forest, as seen in this case study, achieves very high accuracy on the training population because it combines many trees built on different subsets of the predictors; for the same reason, it can overfit the training data. The CART model, on the other hand, is a simpler model based on a small number of splitting criteria. This can be an oversimplification in some cases, but it works well in many business scenarios. The choice of model ultimately depends on the business requirements, and it is always good to compare the performance of several models before making this call.
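One way to compare the two models beyond a single train/validation split is caret's resamples() interface, which collects the resampling results of both fits; a minimal sketch (for a strictly paired comparison the two train() calls should share the same trainControl() resampling indices):
resamps <- caret::resamples(list(CART = modfitCart, RF = modfitRF))
summary(resamps)   # accuracy and kappa distributions of both models
bwplot(resamps)    # lattice box-and-whisker comparison (lattice is loaded with caret)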