Assignment for ‘The Analytics Edge’ (MITx)

The data

The file letters_ABPR.csv contains 3116 observations, each corresponding to an image of one of the four letters A, B, P and R. The images came from 20 different fonts and were randomly distorted to produce the final images; each distorted image is represented as a collection of pixels, each of which is “on” or “off”. For each distorted image we have certain statistics computed from these pixels, as well as the letter the image represents. The data come from the UCI Machine Learning Repository.

letters <- read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame':    3116 obs. of  17 variables:
##  $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...

This dataset contains the following 17 variables:

letter = the letter that the image corresponds to (A, B, P or R)
xbox = the horizontal position of where the smallest box covering the letter shape begins
ybox = the vertical position of where the smallest box covering the letter shape begins
width = the width of this smallest box
height = the height of this smallest box
onpix = the total number of "on" pixels in the character image
xbar = the mean horizontal position of all of the "on" pixels
ybar = the mean vertical position of all of the "on" pixels
x2bar = the mean squared horizontal position of all of the "on" pixels in the image
y2bar = the mean squared vertical position of all of the "on" pixels in the image
xybar = the mean of the product of the horizontal and vertical position of all of the "on" pixels in the image
x2ybar = the mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels
xy2bar = the mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels
xedge = the mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image
xedgeycor = the mean of the product of the number of horizontal edges at each vertical position and the vertical position
yedge = the mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image
yedgexcor = the mean of the product of the number of vertical edges at each horizontal position and the horizontal position

Predicting B or not B

letters$isB <- as.factor(letters$letter == "B")

library(caTools)  # for sample.split()

set.seed(1000)
split <- sample.split(letters$isB, SplitRatio = 0.5)
train <- letters[split == TRUE, ]
test <- letters[split == FALSE, ]

The ‘baseline’ model predicts the most frequent outcome, which here is “not B”. The proportion of “not B” observations in the training set is 0.754172, so this is the accuracy to beat.
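
For reference, a minimal check (using the train data frame created by the split above) that computes this proportion directly:

prop.table(table(train$isB))  # the FALSE share is the baseline accuracy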

Classification and regression tree model

library(rpart)       # CART models
library(rpart.plot)  # prp() for plotting trees

set.seed(1000)

CARTb <- rpart(isB ~ . - letter, data = train, method = "class")
prp(CARTb)

predictB <- predict(CARTb, newdata = test, type = "class")
table(test$isB, predictB)
##        predictB
##         FALSE TRUE
##   FALSE  1118   57
##   TRUE     43  340

Rows are the actual values; columns are the predictions.

Accuracy of this CART tree is 0.9358151
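
This accuracy comes straight from the confusion matrix: the diagonal entries (correct predictions) divided by the total number of test observations.

(1118 + 340) / (1118 + 57 + 43 + 340)
## [1] 0.9358151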

Random forest model

Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!).

library(randomForest)

set.seed(1000)

bForest <- randomForest(isB ~ . - letter, data = train)

predictForest <- predict(bForest, newdata = test)
table(test$isB, predictForest)
##        predictForest
##         FALSE TRUE
##   FALSE  1163   12
##   TRUE      9  374

Rows are the actual values; columns are the predictions.

Accuracy of this random forest model is 0.9865212
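
The same calculation, written generically with the diagonal of the table (confMat is just an illustrative name):

confMat <- table(test$isB, predictForest)
sum(diag(confMat)) / sum(confMat)
## [1] 0.9865212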

Random forests tend to improve on CART in terms of predictive accuracy. Sometimes this improvement can be quite significant, as it is here.

Predicting the letters A, B, P, R

We now turn to the harder problem of predicting which of the four letters (A, B, P or R) an image corresponds to.

letters$letter <- as.factor(letters$letter)

set.seed(2000)
split <- sample.split(letters$letter, SplitRatio = 0.5)
train <- letters[split == TRUE,]
test <- letters[split == FALSE,]

summary(letters)
##  letter       xbox             ybox            width       
##  A:789   Min.   : 0.000   Min.   : 0.000   Min.   : 1.000  
##  B:766   1st Qu.: 3.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  P:803   Median : 4.000   Median : 7.000   Median : 5.000  
##  R:758   Mean   : 3.915   Mean   : 7.051   Mean   : 5.186  
##          3rd Qu.: 5.000   3rd Qu.: 9.000   3rd Qu.: 6.000  
##          Max.   :13.000   Max.   :15.000   Max.   :11.000  
##      height           onpix             xbar             ybar       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 3.000   Min.   : 0.000  
##  1st Qu.: 4.000   1st Qu.: 2.000   1st Qu.: 6.000   1st Qu.: 6.000  
##  Median : 6.000   Median : 4.000   Median : 7.000   Median : 7.000  
##  Mean   : 5.276   Mean   : 3.869   Mean   : 7.469   Mean   : 7.197  
##  3rd Qu.: 7.000   3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 9.000  
##  Max.   :12.000   Max.   :12.000   Max.   :14.000   Max.   :15.000  
##      x2bar            y2bar           xybar            x2ybar     
##  Min.   : 0.000   Min.   :0.000   Min.   : 3.000   Min.   : 0.00  
##  1st Qu.: 3.000   1st Qu.:2.000   1st Qu.: 7.000   1st Qu.: 3.00  
##  Median : 4.000   Median :4.000   Median : 8.000   Median : 5.00  
##  Mean   : 4.706   Mean   :3.903   Mean   : 8.491   Mean   : 4.52  
##  3rd Qu.: 6.000   3rd Qu.:5.000   3rd Qu.:10.000   3rd Qu.: 6.00  
##  Max.   :11.000   Max.   :8.000   Max.   :14.000   Max.   :10.00  
##      xy2bar           xedge          xedgeycor          yedge     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 0.0  
##  1st Qu.: 6.000   1st Qu.: 2.000   1st Qu.: 7.000   1st Qu.: 3.0  
##  Median : 7.000   Median : 2.000   Median : 8.000   Median : 4.0  
##  Mean   : 6.711   Mean   : 2.913   Mean   : 7.763   Mean   : 4.6  
##  3rd Qu.: 8.000   3rd Qu.: 4.000   3rd Qu.: 9.000   3rd Qu.: 6.0  
##  Max.   :14.000   Max.   :10.000   Max.   :13.000   Max.   :12.0  
##    yedgexcor         isB      
##  Min.   : 1.000   FALSE:2350  
##  1st Qu.: 7.000   TRUE : 766  
##  Median : 8.000               
##  Mean   : 8.418               
##  3rd Qu.:10.000               
##  Max.   :13.000
paste("Accuracty of 'baseline' model predicting the most common letter :", 803/nrow(letters))
## [1] "Accuracty of 'baseline' model predicting the most common letter : 0.257702182284981"

CART model

CARTletter <- rpart(letter ~ . - isB, data = train, method = "class")
prp(CARTletter)

predictletter <- predict(CARTletter, newdata = test, type = "class")
table(test$letter, predictletter)
##    predictletter
##       A   B   P   R
##   A 348   4   0  43
##   B   8 318  12  45
##   P   2  21 363  15
##   R  10  24   5 340

Rows are the actual values; columns are the predictions.

Accuracy of this CART tree is 0.8786906
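
As before, the accuracy is the sum of the diagonal of the confusion matrix over the number of test observations:

sum(diag(table(test$letter, predictletter))) / nrow(test)
## [1] 0.8786906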

Random forest model

set.seed(1000)

letterForest <- randomForest(letter ~ . - isB, data = train)

predictForest <- predict(letterForest, newdata = test)
table(test$letter, predictForest)
##    predictForest
##       A   B   P   R
##   A 391   0   3   1
##   B   0 380   1   2
##   P   0   6 394   1
##   R   3  14   0 362

Rows are the actual values; columns are the predictions.

Accuracy of this random forest model is 0.9801027
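
And the corresponding computation for the forest:

sum(diag(table(test$letter, predictForest))) / nrow(test)
## [1] 0.9801027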

There are two takeaways here. First, the random forest's accuracy is significantly higher than the CART value, highlighting the gain in accuracy that is possible with random forest models. Second, while the accuracy of CART decreased significantly as we transitioned from predicting B/not B (a relatively simple problem) to predicting the four letters (certainly a harder problem), the accuracy of the random forest model decreased only slightly.