One of the earliest applications of the predictive analytics methods we have studied so far in this class was automatic letter recognition, which post office machines use to sort mail. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet (A, B, P, and R) to predict which letter a particular image corresponds to.
Note that this is a multiclass classification problem.
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_4_Trees")
letters <- read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame': 3116 obs. of 17 variables:
## $ letter : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
## $ xbox : int 4 1 5 5 3 8 2 3 8 6 ...
## $ ybox : int 2 1 9 9 6 10 6 7 14 10 ...
## $ width : int 5 3 5 7 4 8 4 5 7 8 ...
## $ height : int 4 2 7 7 4 6 4 5 8 8 ...
## $ onpix : int 4 1 6 10 2 6 3 3 4 7 ...
## $ xbar : int 8 8 6 9 4 7 6 12 5 8 ...
## $ ybar : int 7 2 11 8 14 7 7 2 10 5 ...
## $ x2bar : int 6 2 7 4 8 3 5 3 6 7 ...
## $ y2bar : int 6 2 3 4 1 5 5 2 3 5 ...
## $ xybar : int 7 8 7 6 11 8 6 10 12 7 ...
## $ x2ybar : int 6 2 3 8 6 4 5 2 5 6 ...
## $ xy2bar : int 6 8 9 6 3 8 7 9 4 6 ...
## $ xedge : int 2 1 2 6 0 6 3 2 4 3 ...
## $ xedgeycor: int 8 6 7 11 10 6 7 6 10 9 ...
## $ yedge : int 7 2 5 8 4 7 5 3 4 8 ...
## $ yedgexcor: int 10 7 11 7 8 7 8 8 8 9 ...
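Before modeling, it helps to see how the four classes are distributed. A quick sketch (output not shown here):

# How many images of each letter do we have?
table(letters$letter)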
PREDICTING B OR NOT B
Create a factor variable, isB, which is TRUE when the letter is a B and FALSE otherwise:
letters$isB <- as.factor(letters$letter == "B")
Split the dataset into a training set and a test set, putting 50% of the observations in each:
library(caTools)
set.seed(1000)
spl <- sample.split(letters$isB, SplitRatio = 0.5)
train <- subset(letters, spl == TRUE)
test <- subset(letters, spl == FALSE)
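sample.split stratifies on the outcome, so both halves should contain (almost) the same fraction of B's. A quick sanity check, sketched here with output omitted:

# Confirm the split preserves the class balance
prop.table(table(train$isB))
prop.table(table(test$isB))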
Consider a baseline method that always predicts the most frequent outcome. What is the accuracy of this baseline method on the test set?
table(test$isB)
##
## FALSE TRUE
## 1175 383
Since FALSE (not B) is the most frequent outcome, the baseline accuracy is 1175/(1175 + 383) = 0.754172.
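The same number can be computed directly from the table rather than by hand; a minimal sketch:

# Baseline accuracy: always predict the most frequent outcome
max(table(test$isB)) / nrow(test)
## [1] 0.754172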
Now build a classification tree to predict whether a letter is a B or not. Remember to exclude the variable "letter" from the model, since isB is derived from it and including it would leak the answer:
library(rpart)
library(rpart.plot)
CARTb <- rpart(isB ~ . - letter, data = train, method = "class")
CARTb_predict <- predict(CARTb, newdata = test, type = "class")
table(test$isB, CARTb_predict)
## CARTb_predict
## FALSE TRUE
## FALSE 1118 57
## TRUE 43 340
Model accuracy is (1118+340)/nrow(test) = 0.9358151
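Since rpart.plot is loaded above but not yet used, the fitted tree can be drawn with prp, and the accuracy can be read off the confusion matrix instead of being typed by hand; a minimal sketch (confB is just a name chosen here):

# Plot the fitted tree (requires rpart.plot, loaded above)
prp(CARTb)
# Accuracy = correct predictions (the diagonal) over all test cases
confB <- table(test$isB, CARTb_predict)
sum(diag(confB)) / nrow(test)
## [1] 0.9358151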
Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set.
set.seed(1000)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
RFmodel <- randomForest(isB ~ . - letter, data = train)
RFmodel_predict <- predict(RFmodel, newdata = test)
table(test$isB, RFmodel_predict)
## RFmodel_predict
## FALSE TRUE
## FALSE 1165 10
## TRUE 9 374
Model accuracy is (1165+374)/nrow(test) = 0.9878049
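A random forest has no single tree to inspect, but the randomForest package reports how much each predictor contributes; a sketch (output and plot not shown):

# Mean decrease in Gini impurity for each image statistic
importance(RFmodel)
varImpPlot(RFmodel)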
PREDICTING THE LETTERS A, B, P, R
Convert the outcome to a factor (a safeguard here, since read.csv already produced one) and re-split the data, this time stratifying on letter:
letters$letter <- as.factor(letters$letter)
set.seed(2000)
spl <- sample.split(letters$letter, SplitRatio = 0.5)
train <- subset(letters, spl == TRUE)
test <- subset(letters, spl == FALSE)
Create a baseline model that always predicts the most frequent outcome. Tabulate the letters in the test set to find it:
table(test$letter)
##
## A B P R
## 395 383 401 379
Since P is the most frequent letter in the test set, the baseline accuracy is 401/nrow(test) = 0.2573813.
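Again, this can be computed directly; a minimal sketch:

# Baseline: always predict the most frequent letter (P in the test set)
max(table(test$letter)) / nrow(test)
## [1] 0.2573813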
Now build a classification tree to predict "letter", using all of the other variables as independent variables except "isB", since it is derived from the outcome we are trying to predict:
CARTLetter <- rpart(letter ~ . - isB, data = train, method = "class")
CARTLetter_predict <- predict(CARTLetter, newdata = test, type = "class")
table(test$letter, CARTLetter_predict)
## CARTLetter_predict
## A B P R
## A 348 4 0 43
## B 8 318 12 45
## P 2 21 363 15
## R 10 24 5 340
Accuracy is (348+318+363+340)/nrow(test) = 0.8786906.
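The diagonal trick from the two-class case carries over to the 4x4 confusion matrix; a sketch (confLetter is just a name chosen here):

# Multiclass accuracy: correct predictions sit on the diagonal
confLetter <- table(test$letter, CARTLetter_predict)
sum(diag(confLetter)) / nrow(test)
## [1] 0.8786906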
Now build a random forest model to predict the letter:
set.seed(1000)
letter_RF <- randomForest(letter ~ . - isB, data = train)
letter_RF_predict <- predict(letter_RF, newdata = test)
table(test$letter, letter_RF_predict)
## letter_RF_predict
## A B P R
## A 390 0 3 2
## B 0 380 1 2
## P 0 5 393 3
## R 3 12 0 364
Accuracy is (390+380+393+364)/nrow(test) = 0.9801027.
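The confusion matrix also yields per-class accuracy, which shows that R is the letter the forest confuses most often; a sketch:

# Per-class accuracy: diagonal counts divided by row totals
confRF <- table(test$letter, letter_RF_predict)
diag(confRF) / rowSums(confRF)
##         A         B         P         R
## 0.9873418 0.9921671 0.9800499 0.9604222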
You should find this value rather striking, for two reasons. First, it is significantly higher than the value for CART, highlighting the gain in accuracy that is possible with random forest models. Second, while the accuracy of CART decreased significantly as we moved from predicting B/not B (a relatively simple problem, 0.936) to predicting the four letters (certainly a harder problem, 0.879), the accuracy of the random forest model barely moved (0.988 to 0.980).