Source: Analytics Edge – Unit 4 Homework

Techniques involved: CART, Random Forest, Multiclass classification

One of the earliest applications of the predictive analytics methods we have studied so far in this class was to automatically recognize letters, which post office machines use to sort mail. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet – A, B, P, and R – to predict which letter a particular image corresponds to.

Note that this is a multiclass classification problem.

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_4_Trees")
letters <- read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame':    3116 obs. of  17 variables:
##  $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...

PREDICTING B OR NOT B

Create a factor variable, isB, that indicates whether the letter is a B.

letters$isB <- as.factor(letters$letter == "B")

Split the dataset into a training set and a test set, with 50% of the observations in each.

library(caTools)
set.seed(1000)
spl <- sample.split(letters$isB, SplitRatio = 0.5)
train <- subset(letters, spl == TRUE)
test <- subset(letters, spl == FALSE)

Consider a baseline method that always predicts the most frequent outcome. What is the accuracy of this baseline method on the test set?

table(test$isB)
## 
## FALSE  TRUE 
##  1175   383

The most frequent outcome is "not B" (FALSE), so the baseline accuracy is 1175/(1175+383) = 0.754172
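The baseline figure can also be computed instead of worked out by hand. This is a minimal sketch that hard-codes the class counts from the table above so it runs on its own. The names `counts`, `baseline_class`, and `baseline_accuracy` are illustrative and not part of the original homework. In a live session, `table(test$isB)` would supply the counts.

```r
# Class counts copied from the table(test$isB) output above.
counts <- c("FALSE" = 1175, "TRUE" = 383)

# The baseline always predicts the most frequent class ...
baseline_class <- names(counts)[which.max(counts)]

# ... so its accuracy is that class's share of the test set.
baseline_accuracy <- max(counts) / sum(counts)

baseline_class               # "FALSE", i.e. always predict "not B"
round(baseline_accuracy, 4)  # 0.7542
```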

Now build a classification tree to predict whether a letter is a B or not. Remember to exclude the variable “letter” from the model, since isB is derived directly from it.

library(rpart)
library(rpart.plot)
CARTb <- rpart(isB ~ . - letter, data = train, method = "class")
CARTb_predict <- predict(CARTb, newdata = test, type = "class")
table(test$isB, CARTb_predict)
##        CARTb_predict
##         FALSE TRUE
##   FALSE  1118   57
##   TRUE     43  340

Model accuracy is (1118+340)/nrow(test) = 0.9358151
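The accuracy calculation generalizes to any confusion matrix: correct predictions sit on the diagonal. The sketch below rebuilds the matrix from the printed output so it is self-contained. The name `cm` is illustrative, and in a live session `table(test$isB, CARTb_predict)` could be used in its place.

```r
# Confusion matrix copied from the output above
# (rows = actual isB, columns = predicted isB).
cm <- matrix(c(1118,  57,
                 43, 340),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

# Accuracy = correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# The two kinds of error, read from the off-diagonal cells.
false_positives <- cm["FALSE", "TRUE"]   # not B, predicted B
false_negatives <- cm["TRUE", "FALSE"]   # B, predicted not B

round(accuracy, 4)  # 0.9358
```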

Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set.

set.seed(1000)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
RFmodel <- randomForest(isB ~ . - letter, data = train)
RFmodel_predict <- predict(RFmodel, newdata = test)
table(test$isB, RFmodel_predict)
##        RFmodel_predict
##         FALSE TRUE
##   FALSE  1165   10
##   TRUE      9  374

Model accuracy is (1165+374)/nrow(test) = 0.9878049

PREDICTING THE LETTERS A, B, P, R

letters$letter <- as.factor(letters$letter)
set.seed(2000)
spl <- sample.split(letters$letter, SplitRatio = 0.5)
train <- subset(letters, spl == TRUE)
test <- subset(letters, spl == FALSE)

Create a baseline model that always predicts the most frequent letter in the test set.

table(test$letter)
## 
##   A   B   P   R 
## 395 383 401 379

The most frequent letter is P, so the baseline accuracy is 401/nrow(test) = 0.2573813

Now build a classification tree to predict “letter”, using all of the other variables as independent variables except “isB”, which is derived directly from the outcome we are trying to predict.

CARTLetter <- rpart(letter ~ . - isB, data = train, method = "class")
CARTLetter_predict <- predict(CARTLetter, newdata = test, type = "class")
table(test$letter, CARTLetter_predict)
##    CARTLetter_predict
##       A   B   P   R
##   A 348   4   0  43
##   B   8 318  12  45
##   P   2  21 363  15
##   R  10  24   5 340

Accuracy is (348+318+363+340)/nrow(test) = 0.8786906
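With four classes, the confusion matrix also shows where the tree goes wrong, not only how often. The sketch below rebuilds the matrix from the printed output and derives the overall accuracy, the per-letter recall, and the most common confusion. The names `letter_levels`, `cm`, `recall`, and `worst` are illustrative. In a live session `table(test$letter, CARTLetter_predict)` would provide the matrix.

```r
# Confusion matrix copied from the CART output above
# (rows = actual letter, columns = predicted letter).
letter_levels <- c("A", "B", "P", "R")
cm <- matrix(c(348,   4,   0,  43,
                 8, 318,  12,  45,
                 2,  21, 363,  15,
                10,  24,   5, 340),
             nrow = 4, byrow = TRUE,
             dimnames = list(actual = letter_levels, predicted = letter_levels))

# Overall accuracy: correctly classified images over all images.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-letter recall: share of each actual letter that was predicted correctly.
recall <- diag(cm) / rowSums(cm)

# Zero out the diagonal and find the largest remaining cell,
# i.e. the most frequent misclassification.
errors <- cm
diag(errors) <- 0
worst <- which(errors == max(errors), arr.ind = TRUE)

round(accuracy, 4)            # 0.8787
round(recall, 3)              # B has the lowest recall
letter_levels[worst[1, ]]     # "B" "R": actual B most often predicted as R
```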

Now build a random forest model on the same training set.

set.seed(1000)
letter_RF <- randomForest(letter ~ . - isB, data = train)
letter_RF_predict <- predict(letter_RF, newdata = test)
table(test$letter, letter_RF_predict)
##    letter_RF_predict
##       A   B   P   R
##   A 390   0   3   2
##   B   0 380   1   2
##   P   0   5 393   3
##   R   3  12   0 364

Accuracy is (390+380+393+364)/nrow(test) = 0.9801027

You should find this value rather striking, for two reasons. First, it is far higher than the value for CART (about 0.980 versus 0.879), highlighting the gain in accuracy that is possible from using random forest models. Second, the accuracy of CART fell noticeably as we moved from predicting B/not B (a relatively simple problem) to predicting the four letters (certainly a harder problem), from about 0.936 to 0.879, whereas the accuracy of the random forest model barely changed, from about 0.988 to 0.980.
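The comparison can be made concrete by tabulating the four test-set accuracies reported above and computing how far each model falls when moving to the four-letter problem. This is a small sketch using only those reported figures. The data frame `acc` and its column names are illustrative.

```r
# Test-set accuracies reported in the sections above.
acc <- data.frame(
  model      = c("CART", "Random forest"),
  b_vs_not_b = c(0.9358151, 0.9878049),
  four_class = c(0.8786906, 0.9801027)
)

# Drop in accuracy when moving from B/not B to the four-letter problem.
acc$drop <- acc$b_vs_not_b - acc$four_class

acc
# CART loses about 0.057 in accuracy; the random forest loses about 0.008.
```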