Chapter 08 (page 332): 3, 8, 9

Libraries:

library(ISLR)
library(caret)
library(tree)
library(rattle)
library(randomForest)
library(gbm)

3) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. (Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.)

# set the axis range from 0 to 1
p <- seq(0, 1, 0.01)

# Calculate each measure. With 2 classes, p̂m2 = 1 − p̂m1
classification.error <- 1 - pmax(p, 1 - p)
gini.index <- 2 * p * (1 - p)
# Entropy is undefined (NaN) at p = 0 and p = 1; those two points are
# simply dropped from the plot
entropy <- - (p * log(p) + (1 - p) * log(1 - p))

# Build the plot
matplot(p, cbind(classification.error, gini.index, entropy),
        col = c("blue", "green", "red"), pch = 16, lty = 1,
        main = "Classification Tree Measures with 2 classes",
        xlab = "Proportion of training observations",
        ylab = "Measure Values", ylim = c(0, 1.1))
legend(x = 'top', pch = 16, title = "Measures",
       col = c("blue", "green", "red"),
       legend = c("Classification Error Rate", "Gini Index", "Entropy"),
       box.lty = 1)

Classification Error Rate: The fraction of the training observations in that region that do not belong to the most common class.

Gini Index: A measure of total variance across the K classes, also interpreted as a measure of node purity: it takes on a small value when all of the p̂mk's are close to zero or one, i.e. when a node contains predominantly observations from a single class.

Entropy: This is another measure of node purity and will take on a value near zero if the p̂mk’s are all near zero or near 1.

When building a classification tree, either the Gini index or the entropy are typically used to evaluate the quality of a particular split, since these two approaches are more sensitive to node purity than is the classification error rate. Any of these three approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
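
As a quick check of these formulas, the short helper below (node.impurity is a hypothetical name, not from ISLR) computes all three measures from a vector of class labels in the two-class case.

# Hypothetical helper: compute the three impurity measures for the set of
# class labels at a single node (two-class case)
node.impurity <- function(labels) {
  p <- mean(labels == levels(factor(labels))[1])  # proportion in the first class
  c(classification.error = 1 - max(p, 1 - p),
    gini.index = 2 * p * (1 - p),
    entropy = ifelse(p %in% c(0, 1), 0, -(p * log(p) + (1 - p) * log(1 - p))))
}

# Example: a fairly pure node vs. an evenly mixed node
node.impurity(c("CH", "CH", "CH", "CH", "MM"))
node.impurity(c("CH", "CH", "MM", "MM"))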

9) This problem involves the OJ data set which is part of the ISLR package.

attach(OJ)
str(OJ)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...
p = ncol(OJ)-1
n = nrow(OJ)
cat ('p = ', p , '; n = ', n)
## p =  17 ; n =  1070
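
Before splitting the data, it is worth checking the class balance of the response, since the stratified split below preserves these proportions (a small optional check, not part of the original exercise).

# Class balance of the response variable
round(prop.table(table(OJ$Purchase)), 3)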
9a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
# set the seed
set.seed(5)

# use the caret package to split the dataset. p = .746 was chosen (by trial)
# so that the stratified split yields exactly 800 training observations
oj.inTrain <- createDataPartition(OJ$Purchase, p = .746, list = FALSE, times = 1)
oj.train <- OJ[oj.inTrain,]
oj.test <- OJ[-oj.inTrain,]
cat('training observations count = ', dim(oj.train)[1], '\ntest observations count = ', dim(oj.test)[1])
## training observations count =  800 
## test observations count =  270
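
If a stratified split is not required, a plain random split of exactly 800 rows can also be drawn with base R's sample(). This is only an alternative sketch (oj.train2 and oj.test2 are hypothetical names); the caret split above is what the rest of the analysis uses.

# Alternative: plain (non-stratified) random split of 800 / 270 rows
set.seed(5)
train.idx <- sample(seq_len(nrow(OJ)), 800)
oj.train2 <- OJ[train.idx, ]
oj.test2 <- OJ[-train.idx, ]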
9b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
# set the seed
set.seed(6)

# Call the tree function
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7823 = 620.3 / 793 
## Misclassification error rate: 0.1675 = 134 / 800

The response variable Purchase is a categorical variable, so this is a classification problem. The misclassification (training) error rate is 16.75% and the tree has 7 terminal nodes. The variables actually used in tree construction are ‘LoyalCH’, ‘PriceDiff’ and ‘ListPriceDiff’.
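
The training error reported by summary() can be verified directly by re-predicting on the training data (a quick sanity check, not required by the exercise).

# Verify the training misclassification rate reported by summary()
train.preds <- predict(tree.oj, oj.train, type = "class")
mean(train.preds != oj.train$Purchase)  # should equal 0.1675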

9c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
# Print the tree fit
tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1070.00 CH ( 0.61000 0.39000 )  
##    2) LoyalCH < 0.469289 293  319.90 MM ( 0.23549 0.76451 )  
##      4) LoyalCH < 0.280875 170  127.10 MM ( 0.12353 0.87647 ) *
##      5) LoyalCH > 0.280875 123  164.50 MM ( 0.39024 0.60976 )  
##       10) PriceDiff < -0.24 13    0.00 MM ( 0.00000 1.00000 ) *
##       11) PriceDiff > -0.24 110  150.70 MM ( 0.43636 0.56364 ) *
##    3) LoyalCH > 0.469289 507  468.00 CH ( 0.82643 0.17357 )  
##      6) LoyalCH < 0.764572 242  302.70 CH ( 0.68182 0.31818 )  
##       12) PriceDiff < 0.145 91  124.80 MM ( 0.43956 0.56044 )  
##         24) ListPriceDiff < 0.235 63   78.74 MM ( 0.31746 0.68254 ) *
##         25) ListPriceDiff > 0.235 28   33.50 CH ( 0.71429 0.28571 ) *
##       13) PriceDiff > 0.145 151  138.70 CH ( 0.82781 0.17219 ) *
##      7) LoyalCH > 0.764572 265   91.54 CH ( 0.95849 0.04151 ) *

We pick terminal node 24, marked with an asterisk. Its split criterion is ListPriceDiff < 0.235, it contains 63 observations with a deviance of 78.74, and the overall prediction for this branch is MM (about 68% of its observations are MM). Its sibling node 25 (ListPriceDiff > 0.235) contains 28 observations and predicts CH (Citrus Hill). Both are terminal nodes, so neither is split any further in this fit.
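
The same per-node information can also be inspected as a data frame through the fitted object's frame component (part of the tree package's object structure).

# Per-node details: splitting variable ("<leaf>" marks terminal nodes),
# n, deviance, predicted class and class probabilities
tree.oj$frame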

9d) Create a plot of the tree, and interpret the results.
# Plot the tree
plot(tree.oj)
text(tree.oj, pretty = 0)

The tree plot above shows that ‘LoyalCH’, ‘PriceDiff’ and ‘ListPriceDiff’ are the only variables used in tree construction, and ‘LoyalCH’ appears to be the most important predictor since it defines the first split. In the left branch (LoyalCH < 0.469289), i.e. for customers with weak Citrus Hill loyalty, every terminal node predicts MM, so the model always predicts Minute Maid for these customers. In the rightmost branch, when LoyalCH > 0.764572 (very strong Citrus Hill loyalty), the model always predicts CH; for intermediate loyalty values the prediction depends on the price variables PriceDiff and ListPriceDiff.
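
Since rattle is already loaded, a more polished plot can be produced by refitting the tree with rpart, which is what rattle's fancyRpartPlot() expects. Note this refits the model with a different package, so the splits may differ slightly from tree.oj; a minimal sketch:

# Optional: refit with rpart so rattle's fancyRpartPlot() can draw it
library(rpart)
rpart.oj <- rpart(Purchase ~ ., data = oj.train, method = "class")
fancyRpartPlot(rpart.oj, main = "OJ Classification Tree (rpart)")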

9e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
# use predict function to get the response on the test data
tree.preds <- predict(tree.oj, oj.test, type = "class")

# Build a confusion matrix
table(tree.preds, oj.test$Purchase)
##           
## tree.preds  CH  MM
##         CH 135  17
##         MM  30  88
# Get the misclassification rate and accuracy rate
mse.oj = (17+30)/(dim(oj.test)[1])
acc.oj = 1-mse.oj
cat('Misclassification/test error rate = ', mse.oj, '\nAccuracy Rate = ',acc.oj)
## Misclassification/test error rate =  0.1740741 
## Accuracy Rate =  0.8259259

The test (misclassification) error rate is 0.1740741, i.e. about 17.4%, and the corresponding test accuracy is about 82.6%.
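
Since caret is loaded, the same comparison can also be summarised with confusionMatrix(), which reports accuracy, sensitivity and specificity in one call.

# Fuller summary of the test-set predictions using caret
confusionMatrix(data = tree.preds, reference = oj.test$Purchase)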

9f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
# Get the cross validation error rate
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 7 5 2 1
## 
## $dev
## [1] 168 168 174 312
## 
## $k
## [1]       -Inf   0.000000   7.666667 155.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.oj)

9g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
# With FUN = prune.misclass, cv.oj$dev is the number of cross-validated
# misclassifications at each tree size
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size",
     ylab = "CV misclassifications", main = 'Cross-Validation Error Plot')
points(5, cv.oj$dev[2], col = 'green', cex = 2, pch = 20)
points(7, cv.oj$dev[1], col = 'red', cex = 2, pch = 20)

9h) Which tree size corresponds to the lowest cross-validated classification error rate?

The plot above shows that tree sizes 5 and 7 achieve the same (lowest) cross-validated error. We prefer the smaller tree because it is simpler and less prone to overfitting, so a tree size of 5 corresponds to the lowest cross-validated classification error rate.
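
The optimal size can also be selected programmatically by taking the smallest size whose cross-validated error equals the minimum.

# Smallest tree size achieving the minimum cross-validated error
best.size <- min(cv.oj$size[cv.oj$dev == min(cv.oj$dev)])
best.size  # 5 with this seed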

9i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj <- prune.misclass(tree.oj, best = 5)
plot(prune.oj)
text(prune.oj, pretty = 0)

The pruned tree is simpler than the original fit. The entire left branch (LoyalCH < 0.469289) was collapsed into a single terminal node: before pruning it contained three terminal nodes that all predicted MM, so removing those two extra splits does not change any predicted class. This is exactly the subtree that the cross-validation results suggested snipping.

9j) Compare the training error rates between the pruned and un-pruned trees. Which is higher?

From 9b), we know that the training error rate of the un-pruned tree is 0.1675 (a misclassification rate of 16.75%). Let's find the training error rate of the pruned tree.

summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 2L)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.8332 = 662.4 / 795 
## Misclassification error rate: 0.1675 = 134 / 800

From the above, the training error rate of the pruned tree is also 0.1675, so the pruned and un-pruned trees have the same training error rate.

The key difference is that the pruned tree has fewer terminal nodes (5 instead of 7).
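
A quick check comparing the two training error rates directly, computed the same way for both trees:

# Training error of the un-pruned vs. the pruned tree
mean(predict(tree.oj, oj.train, type = "class") != oj.train$Purchase)
mean(predict(prune.oj, oj.train, type = "class") != oj.train$Purchase)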

9k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

We know from 9e) that the test error rate for the unpruned tree is 0.1740741. Let's find the test error rate of the pruned tree.

# Get the test error of the pruned tree using the predict function
prune.oj.preds = predict(prune.oj, newdata = oj.test, type = "class")
table(prune.oj.preds, oj.test$Purchase)
##               
## prune.oj.preds  CH  MM
##             CH 135  17
##             MM  30  88
mse.prune.oj = (17+30)/270
mse.prune.oj
## [1] 0.1740741

From the above, the test error rate of the pruned tree is 0.1740741, which is the same as the test error rate of the un-pruned tree; the confusion matrices are identical, so there is no difference in the misclassification rate between the two trees.