Assignment for ‘The Analytics Edge’ (MITx)
The file letters_ABPR.csv contains 3116 observations, each corresponding to an image of one of the four letters A, B, P and R. The images came from 20 different fonts and were randomly distorted to produce the final images; each distorted image is represented as a collection of pixels, each of which is either “on” or “off”. For each distorted image, we have certain statistics of the image in terms of these pixels, as well as which of the four letters the image represents. This data comes from the UCI Machine Learning Repository.
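The code below uses sample.split() from the caTools package, rpart() and prp() from the rpart and rpart.plot packages, and randomForest(); load these first (installing any missing ones with install.packages()):

library(caTools)       # sample.split()
library(rpart)         # rpart()
library(rpart.plot)    # prp()
library(randomForest)  # randomForest()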
letters <- read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame': 3116 obs. of 17 variables:
## $ letter : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
## $ xbox : int 4 1 5 5 3 8 2 3 8 6 ...
## $ ybox : int 2 1 9 9 6 10 6 7 14 10 ...
## $ width : int 5 3 5 7 4 8 4 5 7 8 ...
## $ height : int 4 2 7 7 4 6 4 5 8 8 ...
## $ onpix : int 4 1 6 10 2 6 3 3 4 7 ...
## $ xbar : int 8 8 6 9 4 7 6 12 5 8 ...
## $ ybar : int 7 2 11 8 14 7 7 2 10 5 ...
## $ x2bar : int 6 2 7 4 8 3 5 3 6 7 ...
## $ y2bar : int 6 2 3 4 1 5 5 2 3 5 ...
## $ xybar : int 7 8 7 6 11 8 6 10 12 7 ...
## $ x2ybar : int 6 2 3 8 6 4 5 2 5 6 ...
## $ xy2bar : int 6 8 9 6 3 8 7 9 4 6 ...
## $ xedge : int 2 1 2 6 0 6 3 2 4 3 ...
## $ xedgeycor: int 8 6 7 11 10 6 7 6 10 9 ...
## $ yedge : int 7 2 5 8 4 7 5 3 4 8 ...
## $ yedgexcor: int 10 7 11 7 8 7 8 8 8 9 ...
This dataset contains the following 17 variables:
letter = the letter that the image corresponds to (A, B, P or R)
xbox = the horizontal position of where the smallest box covering the letter shape begins
ybox = the vertical position of where the smallest box covering the letter shape begins
width = the width of this smallest box
height = the height of this smallest box
onpix = the total number of "on" pixels in the character image
xbar = the mean horizontal position of all of the "on" pixels
ybar = the mean vertical position of all of the "on" pixels
x2bar = the mean squared horizontal position of all of the "on" pixels in the image
y2bar = the mean squared vertical position of all of the "on" pixels in the image
xybar = the mean of the product of the horizontal and vertical position of all of the "on" pixels in the image
x2ybar = the mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels
xy2bar = the mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels
xedge = the mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image
xedgeycor = the mean of the product of the number of horizontal edges at each vertical position and the vertical position
yedge = the mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image
yedgexcor = the mean of the product of the number of vertical edges at each horizontal position and the horizontal position
# isB is TRUE when the letter is B and FALSE otherwise
letters$isB <- as.factor(letters$letter == "B")

set.seed(1000)
# Split the data 50/50, preserving the B / not-B balance in both halves
split <- sample.split(letters$isB, SplitRatio = 0.5)
train <- letters[split == TRUE,]
test <- letters[split == FALSE,]
The ‘baseline’ model predicts the most frequent outcome, which here is “not B”. The proportion of “not B” observations in the training set, and hence the baseline accuracy, is 0.754172.
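For reference, this proportion can be computed directly from the training set:

# The proportion of FALSE ("not B") is the baseline accuracy
prop.table(table(train$isB))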
set.seed(1000)
# CART model predicting isB from all variables except letter
CARTb <- rpart(isB ~ . - letter, data = train, method = "class")
prp(CARTb)
predictB <- predict(CARTb, newdata = test, type = "class")
table(test$isB, predictB)
## predictB
## FALSE TRUE
## FALSE 1118 57
## TRUE 43 340
Rows are the actual values; columns are the predictions. The accuracy of this CART tree is (1118 + 340) / 1558 = 0.9358151.
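This computation recurs for every confusion matrix below, so it is convenient to wrap it in a small helper function (the name ‘accuracy’ is my own, not part of the assignment):

# Accuracy = correct predictions (the diagonal of the confusion
# matrix) divided by the total number of observations
accuracy <- function(actual, predicted) {
  conf <- table(actual, predicted)
  sum(diag(conf)) / sum(conf)
}
accuracy(test$isB, predictB)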
Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!).
set.seed(1000)
# Random forest with the same predictors as the CART model
bForest <- randomForest(isB ~ . - letter, data = train)
predictForest <- predict(bForest, newdata = test)
table(test$isB, predictForest)
## predictForest
## FALSE TRUE
## FALSE 1163 12
## TRUE 9 374
Rows are the actual values; columns are the predictions. The accuracy of this random forest model is (1163 + 374) / 1558 = 0.9865212.
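The helper defined earlier reproduces this figure. Note that randomForest() was called with its defaults here; for classification these are ntree = 500 trees and a minimum terminal node size (nodesize) of 1, both of which can be tuned.

# Same accuracy, computed with the helper
accuracy(test$isB, predictForest)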
Random forests tend to improve on CART in terms of predictive accuracy. Sometimes this improvement can be quite significant, as it is here.
We now move on to the original problem: predicting which of the four letters A, B, P or R an image corresponds to.
# Convert letter to a factor so rpart and randomForest treat it as categorical
letters$letter <- as.factor(letters$letter)

set.seed(2000)
# New 50/50 split, this time stratified on letter
split <- sample.split(letters$letter, SplitRatio = 0.5)
train <- letters[split == TRUE,]
test <- letters[split == FALSE,]
summary(letters)
## letter xbox ybox width
## A:789 Min. : 0.000 Min. : 0.000 Min. : 1.000
## B:766 1st Qu.: 3.000 1st Qu.: 5.000 1st Qu.: 4.000
## P:803 Median : 4.000 Median : 7.000 Median : 5.000
## R:758 Mean : 3.915 Mean : 7.051 Mean : 5.186
## 3rd Qu.: 5.000 3rd Qu.: 9.000 3rd Qu.: 6.000
## Max. :13.000 Max. :15.000 Max. :11.000
## height onpix xbar ybar
## Min. : 0.000 Min. : 0.000 Min. : 3.000 Min. : 0.000
## 1st Qu.: 4.000 1st Qu.: 2.000 1st Qu.: 6.000 1st Qu.: 6.000
## Median : 6.000 Median : 4.000 Median : 7.000 Median : 7.000
## Mean : 5.276 Mean : 3.869 Mean : 7.469 Mean : 7.197
## 3rd Qu.: 7.000 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 9.000
## Max. :12.000 Max. :12.000 Max. :14.000 Max. :15.000
## x2bar y2bar xybar x2ybar
## Min. : 0.000 Min. :0.000 Min. : 3.000 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 7.000 1st Qu.: 3.00
## Median : 4.000 Median :4.000 Median : 8.000 Median : 5.00
## Mean : 4.706 Mean :3.903 Mean : 8.491 Mean : 4.52
## 3rd Qu.: 6.000 3rd Qu.:5.000 3rd Qu.:10.000 3rd Qu.: 6.00
## Max. :11.000 Max. :8.000 Max. :14.000 Max. :10.00
## xy2bar xedge xedgeycor yedge
## Min. : 0.000 Min. : 0.000 Min. : 1.000 Min. : 0.0
## 1st Qu.: 6.000 1st Qu.: 2.000 1st Qu.: 7.000 1st Qu.: 3.0
## Median : 7.000 Median : 2.000 Median : 8.000 Median : 4.0
## Mean : 6.711 Mean : 2.913 Mean : 7.763 Mean : 4.6
## 3rd Qu.: 8.000 3rd Qu.: 4.000 3rd Qu.: 9.000 3rd Qu.: 6.0
## Max. :14.000 Max. :10.000 Max. :13.000 Max. :12.0
## yedgexcor isB
## Min. : 1.000 FALSE:2350
## 1st Qu.: 7.000 TRUE : 766
## Median : 8.000
## Mean : 8.418
## 3rd Qu.:10.000
## Max. :13.000
paste("Accuracty of 'baseline' model predicting the most common letter :", 803/nrow(letters))
## [1] "Accuracty of 'baseline' model predicting the most common letter : 0.257702182284981"
# CART model predicting letter; exclude isB, which was derived from letter
CARTletter <- rpart(letter ~ . - isB, data = train, method = "class")
prp(CARTletter)
predictletter <- predict(CARTletter, newdata = test, type = "class")
table(test$letter, predictletter)
## predictletter
## A B P R
## A 348 4 0 43
## B 8 318 12 45
## P 2 21 363 15
## R 10 24 5 340
Rows are the actual values; columns are the predictions. The accuracy of this CART tree is (348 + 318 + 363 + 340) / 1558 = 0.8786906.
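The helper generalizes unchanged to the four-class confusion matrix:

# Diagonal of the 4 x 4 confusion matrix over the total
accuracy(test$letter, predictletter)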
set.seed(1000)
# Random forest predicting letter, again excluding isB
letterForest <- randomForest(letter ~ . - isB, data = train)
predictForest <- predict(letterForest, newdata = test)
table(test$letter, predictForest)
## predictForest
## A B P R
## A 391 0 3 1
## B 0 380 1 2
## P 0 6 394 1
## R 3 14 0 362
Rows are the actual values; columns are the predictions. The accuracy of this random forest model is (391 + 380 + 394 + 362) / 1558 = 0.9801027.
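The helper confirms this figure as well:

# Four-class accuracy of the random forest
accuracy(test$letter, predictForest)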
Two observations stand out. First, the random forest accuracy is significantly higher than the CART accuracy, highlighting the gain in accuracy that is possible with random forest models. Second, while the accuracy of CART decreased significantly as we transitioned from predicting B/not B (a relatively simple problem) to predicting the four letters (certainly a harder problem), the accuracy of the random forest model decreased by only a tiny amount.