3) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. (Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.)
# Grid of p-hat_m1 values from 0 to 1
p <- seq(0, 1, 0.01)
# Calculate each measure. With two classes, p-hat_m2 = 1 - p-hat_m1
classification.error <- 1 - pmax(p, 1 - p)
gini.index <- 2 * p * (1 - p)
# Entropy (natural log); the endpoints p = 0 and p = 1 evaluate to NaN
# and are simply omitted from the plot
entropy <- -(p * log(p) + (1 - p) * log(1 - p))
# Build the plot
matplot(p, cbind(classification.error, gini.index, entropy),
        col = c("blue", "green", "red"), pch = 16, lty = 1,
        main = "Classification Tree Measures with 2 classes",
        xlab = "Proportion of training observations",
        ylab = "Measure Values", ylim = c(0, 1.1))
legend(x = "top", pch = 16, title = "Measures",
       col = c("blue", "green", "red"),
       legend = c("Classification Error Rate", "Gini Index", "Entropy"),
       box.lty = 1)

Classification Error Rate: The fraction of the training observations in that region that do not belong to the most common class.
Gini Index: A measure of total variance across the K classes, often read as a measure of node purity. It takes on a small value when all of the p̂mk's are close to zero or one, i.e. when a node contains predominantly observations from a single class.
Entropy: Another measure of node purity that, like the Gini index, takes on a value near zero when the p̂mk's are all near zero or near one.
When building a classification tree, either the Gini index or the entropy is typically used to evaluate the quality of a particular split, since these two measures are more sensitive to node purity than the classification error rate. Any of the three approaches may be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
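As a quick numerical check of that sensitivity claim, the sketch below evaluates the three measures (same formulas as the plot code above, with natural-log entropy) at a moderately impure node and at a nearly pure one; the helper node.measures() is introduced here purely for illustration. The Gini index and entropy fall off much more sharply than the classification error as the node becomes purer.
# Illustrative helper: evaluate all three measures at a given p-hat_m1
node.measures <- function(p) {
  c(class.error = 1 - pmax(p, 1 - p),
    gini        = 2 * p * (1 - p),
    entropy     = -(p * log(p) + (1 - p) * log(1 - p)))
}
round(node.measures(0.70), 3)  # error 0.300, gini 0.420, entropy 0.611
round(node.measures(0.95), 3)  # error 0.050, gini 0.095, entropy 0.199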
8) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Get p and n; the response (Sales) is one of the columns, so subtract 1 for the predictor count
p = ncol(Carseats) - 1
n = nrow(Carseats)
cat ('p = ', p , '; n = ', n)
## p = 10 ; n = 400
8a) Split the data set into a training set and a test set.
# set the seed
set.seed(1)
# Divide the data set into training (75%) and test (25%) sets, even though n is just 400
# using the caret package for practice
library(caret)  # provides createDataPartition()
inTrain <- createDataPartition(Carseats$Sales, p = 0.75, list = FALSE, times = 1)
carseats.train <- Carseats[inTrain,]
carseats.test <- Carseats[-inTrain,]
dim(carseats.train)
## [1] 301 11
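Note that createDataPartition() draws a split stratified on the response (Sales). For reference, an unstratified split could be drawn with base R's sample(); this is only a sketch (train.idx is an illustrative name), and the exact rows selected would differ from the caret partition used above.
# Alternative (not used above): simple random 75/25 split with base R
train.idx <- sample(seq_len(nrow(Carseats)), size = round(0.75 * nrow(Carseats)))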
8b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
# Fit a regression tree with the tree package
library(tree)
tree.carseats <- tree(Sales ~ ., data = carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "Age" "Advertising"
## [6] "CompPrice"
## Number of terminal nodes: 16
## Residual mean deviance: 2.295 = 654 / 285
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.5490 -1.0160 -0.1187 0.0000 0.9648 3.5840
The regression tree uses 6 of the 10 predictors in its construction (ShelveLoc, Price, Income, Age, Advertising, and CompPrice) and has 16 terminal nodes.
# Build a plot
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The tree plot indicates that the most important predictor of Sales is shelving location (ShelveLoc), a categorical variable.
# Get the MSE by using predict function
preds.carseats = predict(tree.carseats, newdata = carseats.test)
tree.mse = mean((preds.carseats - carseats.test$Sales)^2)
tree.mse
## [1] 4.935788
The test MSE from this unpruned tree is 4.935788.
8c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
# Use cross validation to determine the optimal level of tree complexity
set.seed(2)
cv.carseats <- cv.tree(tree.carseats)
cv.carseats
## $size
## [1] 16 15 14 13 12 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1096.549 1096.851 1089.865 1075.781 1099.359 1235.014 1317.378 1317.028
## [9] 1324.641 1389.604 1504.873 1530.520 1618.503 1813.690 2354.377
##
## $k
## [1] -Inf 24.57854 25.70560 27.05048 29.90736 43.69234 48.99582
## [8] 49.70574 54.03148 78.84306 112.38463 123.79024 152.23047 280.88027
## [15] 593.16181
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The sizes are listed in decreasing order. We select the size with the lowest cross-validation deviance, which in this case is size = 13.
# Gather the size for which the deviance is low
tree.min.cv = which.min(cv.carseats$dev)
tree.min.size = cv.carseats$size[tree.min.cv]
tree.min.size
## [1] 13
# Use the size value from above to highlight the size at which deviance is low
plot(cv.carseats$size, cv.carseats$dev, type = "b",xlab='Size',ylab='Deviance',main='Cross Validation Error Plot')
points(tree.min.size, cv.carseats$dev[tree.min.cv], col = "red", cex = 2, pch = 20)

As the plot indicates, tree size = 13 has the lowest CV error, i.e. it corresponds to the optimal level of tree complexity. Let's prune the tree to 13 terminal nodes.
# Prune the tree to the best size found by cross-validation
prune.carseats <- prune.tree(tree.carseats, best = tree.min.size)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

# Use predict function to get the predictions and find the MSE
prune.preds <- predict(prune.carseats, newdata = carseats.test)
prune.mse = mean((prune.preds - carseats.test$Sales)^2)
prune.mse
## [1] 5.021645
Pruning the tree actually increased the test MSE (from 4.94 to 5.02).
8d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Bagging is a special case of a random forest with m = p, so we use the randomForest function with the mtry argument set to the number of predictors, i.e. p = 10.
# Use the randomForest function for the bagging approach (mtry = p)
library(randomForest)
set.seed(3)
bag.carseats <- randomForest(Sales ~ ., data = carseats.train, mtry = p, importance = TRUE)
bag.carseats
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, mtry = p, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.389486
## % Var explained: 69.3
The bagged model explains 69.3% of the variability in Sales. Let's find the test MSE.
bag.preds <- predict(bag.carseats, newdata = carseats.test)
bag.mse = mean((bag.preds - carseats.test$Sales)^2)
bag.mse
## [1] 3.178911
The test MSE obtained from bagging is 3.178911, which is lower than the single regression tree's test MSE of 4.94.
Let's determine which variables are most important.
# Use importance function to determine the importance of variables
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 29.6775410 215.26019
## Income 7.4441919 98.62490
## Advertising 25.7177819 182.87717
## Population 0.2845831 73.86004
## Price 72.5086085 751.09396
## ShelveLoc 79.1673609 705.64459
## Age 20.0807016 184.05511
## Education 2.7658912 58.03403
## Urban -1.0251162 10.42711
## US 3.3490685 12.88167
The importance measures show that Price and ShelveLoc are by far the most important variables.
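The same information could also be summarized visually with the randomForest package's varImpPlot() function (not shown in the output above).
# Optional: dot plots of %IncMSE and IncNodePurity for each predictor
varImpPlot(bag.carseats)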
8e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
This time, use the randomForest function without specifying mtry, letting the function fall back to its regression default of m ≈ p/3.
# set the seed
set.seed(4)
# use randomForest function
rf.carseats <- randomForest(Sales ~ ., data = carseats.train, importance = TRUE)
rf.carseats
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 2.792944
## % Var explained: 64.11
The random forest tried 3 variables at each split and explains about 64.11% of the variability. Let's get the test MSE and the variable importance.
rf.preds <- predict(rf.carseats, newdata = carseats.test)
rf.mse = mean((rf.preds - carseats.test$Sales)^2)
rf.mse
## [1] 3.748591
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 14.0763435 198.72262
## Income 3.0894890 156.39168
## Advertising 20.3108832 202.91520
## Population 0.2388015 138.20203
## Price 46.8728499 578.50601
## ShelveLoc 50.1638302 542.98987
## Age 16.2776580 234.60743
## Education -0.1216019 103.10753
## Urban -0.1260788 19.19768
## US 6.7317642 33.33732
The importance measures again show that Price and ShelveLoc are the most important variables.
However, the test MSE obtained with the random forest (3.75) is worse than the test MSE obtained from bagging (3.18). Bagging uses m = p = 10 variables at each split, whereas the random forest used the regression default of m ≈ p/3, i.e. m = 3, at each split.
We can conclude that in this scenario, the bagging approach, which considers all 10 variables at each split, produced the better test error.
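To describe the effect of m more systematically, one could refit the model over a grid of mtry values and compare the resulting test MSEs. The sketch below (not run above) assumes the carseats.train and carseats.test objects defined earlier; mtry.values and mtry.mse are illustrative names.
# Sketch: test MSE as a function of m, the number of variables tried at each split
mtry.values <- 1:p
mtry.mse <- sapply(mtry.values, function(m) {
  set.seed(4)
  fit <- randomForest(Sales ~ ., data = carseats.train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = carseats.test) - carseats.test$Sales)^2)
})
plot(mtry.values, mtry.mse, type = "b",
     xlab = "m (variables tried at each split)", ylab = "Test MSE")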
9) This problem involves the OJ data set which is part of the ISLR package.
attach(OJ)
str(OJ)
## 'data.frame': 1070 obs. of 18 variables:
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: num 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : num 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : num 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : num 1 1 1 1 0 0 0 0 0 0 ...
p = ncol(OJ)-1
n = nrow(OJ)
cat ('p = ', p , '; n = ', n)
## p = 17 ; n = 1070
9a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
# set the seed
set.seed(5)
# Use the caret package to split the data set; with stratification on Purchase, p = 0.746 yields a training set of 800 observations
oj.inTrain <- createDataPartition(OJ$Purchase, p = .746, list = FALSE, times = 1)
oj.train <- OJ[oj.inTrain,]
oj.test <- OJ[-oj.inTrain,]
cat('training observations count = ', dim(oj.train)[1], '\ntest observations count = ', dim(oj.test)[1])
## training observations count = 800
## test observations count = 270
9b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
# set the seed
set.seed(6)
# Call the tree function
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7823 = 620.3 / 793
## Misclassification error rate: 0.1675 = 134 / 800
The response variable Purchase is a categorical variable, so this is a classification problem. The training (misclassification) error rate is 16.75%, and the tree has 7 terminal nodes. The variables actually used in tree construction are LoyalCH, PriceDiff and ListPriceDiff.
9c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
# Print the tree fit
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.00 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.469289 293 319.90 MM ( 0.23549 0.76451 )
## 4) LoyalCH < 0.280875 170 127.10 MM ( 0.12353 0.87647 ) *
## 5) LoyalCH > 0.280875 123 164.50 MM ( 0.39024 0.60976 )
## 10) PriceDiff < -0.24 13 0.00 MM ( 0.00000 1.00000 ) *
## 11) PriceDiff > -0.24 110 150.70 MM ( 0.43636 0.56364 ) *
## 3) LoyalCH > 0.469289 507 468.00 CH ( 0.82643 0.17357 )
## 6) LoyalCH < 0.764572 242 302.70 CH ( 0.68182 0.31818 )
## 12) PriceDiff < 0.145 91 124.80 MM ( 0.43956 0.56044 )
## 24) ListPriceDiff < 0.235 63 78.74 MM ( 0.31746 0.68254 ) *
## 25) ListPriceDiff > 0.235 28 33.50 CH ( 0.71429 0.28571 ) *
## 13) PriceDiff > 0.145 151 138.70 CH ( 0.82781 0.17219 ) *
## 7) LoyalCH > 0.764572 265 91.54 CH ( 0.95849 0.04151 ) *
We pick terminal node 24 (marked with an asterisk). Its split criterion is ListPriceDiff < 0.235; the branch contains 63 observations with a deviance of 78.74, and the overall prediction for the branch is MM (about 68% of the observations in this node are MM). Its sibling, node 25 (ListPriceDiff > 0.235), contains 28 observations and predicts CH (Citrus Hill). Neither node is split further in this fit.
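For reference, the same figures can be read programmatically from the tree object's frame component, whose row names are the node numbers; this lookup is purely illustrative.
# Illustrative: extract node 24's size, deviance and predicted class
tree.oj$frame[rownames(tree.oj$frame) == "24", c("var", "n", "dev", "yval")]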
9d) Create a plot of the tree, and interpret the results.
# Plot the tree
plot(tree.oj)
text(tree.oj, pretty = 0)

The above tree plot shows that LoyalCH, PriceDiff and ListPriceDiff are the only variables used in tree construction, and LoyalCH appears to be the most important predictor since it defines the first split. On the left branch (LoyalCH < 0.469289), i.e. customers with weak Citrus Hill loyalty, every terminal node predicts MM, so the model always predicts Minute Maid for these customers. On the far right branch (LoyalCH > 0.764572), i.e. customers with strong Citrus Hill loyalty, the model always predicts CH (Citrus Hill).
9e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
# use predict function to get the response on the test data
tree.preds <- predict(tree.oj, oj.test, type = "class")
# Build a confusion matrix
table(tree.preds, oj.test$Purchase)
##
## tree.preds CH MM
## CH 135 17
## MM 30 88
# Get the misclassification rate and accuracy rate
mse.oj = (17+30)/(dim(oj.test)[1])
acc.oj = 1-mse.oj
cat('Misclassification/test error rate = ', mse.oj, '\nAccuracy Rate = ',acc.oj)
## Misclassification/test error rate = 0.1740741
## Accuracy Rate = 0.8259259
The test error rate is 0.1740741, i.e. a misclassification rate of about 17.4%.
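Equivalently, the test error rate can be computed directly from the predictions rather than by hard-coding the off-diagonal counts of the confusion matrix.
# Same test error rate, computed without reading counts off the table
mean(tree.preds != oj.test$Purchase)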
9f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
# Get the cross validation error rate
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 7 5 2 1
##
## $dev
## [1] 168 168 174 312
##
## $k
## [1] -Inf 0.000000 7.666667 155.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.oj)

9g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size", ylab = "Deviance", main='Cross Validation Error Plot')
points(5,cv.oj$dev[2],col='green',cex=2,pch=20)
points(7,cv.oj$dev[1],col='red',cex=2,pch=20)

9h) Which tree size corresponds to the lowest cross-validated classification error rate?
The above plot shows that sizes 5 and 7 have the same (lowest) deviance. We prefer the smaller tree to reduce complexity and the risk of overfitting, so a tree size of 5 corresponds to the lowest cross-validated classification error rate.
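The same choice can be made programmatically by taking the smallest size that attains the minimum cross-validation error; this one-liner uses the cv.oj object from above.
# Smallest tree size achieving the minimum CV error (returns 5 here)
min(cv.oj$size[cv.oj$dev == min(cv.oj$dev)])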
9i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj <- prune.misclass(tree.oj, best = 5)
plot(prune.oj)
text(prune.oj, pretty = 0)

The pruned tree is simpler than the original. The subtree under LoyalCH < 0.469289 was collapsed into a single terminal node: before pruning it contained three terminal nodes that all predicted MM anyway, so removing the two extra nodes loses no information about the predicted class.
9j) Compare the training error rates between the pruned and un-pruned trees. Which is higher?
From 9b), we know that the training error rate of the un-pruned tree is 0.1675, i.e. a misclassification rate of 16.75%. Let's find the training error rate of the pruned tree.
summary(prune.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 2L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 5
## Residual mean deviance: 0.8332 = 662.4 / 795
## Misclassification error rate: 0.1675 = 134 / 800
From the above, the training error rate of the pruned tree is also 0.1675, so the pruned and un-pruned trees have the same training error rate.
The key difference is that the pruned tree has fewer terminal nodes (5 instead of 7).
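As a sanity check, both training error rates can also be computed directly from the fitted trees using the training data.
# Training error rates computed directly (both should equal 0.1675)
mean(predict(tree.oj, newdata = oj.train, type = "class") != oj.train$Purchase)
mean(predict(prune.oj, newdata = oj.train, type = "class") != oj.train$Purchase)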
9k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
We know from 9e) that the test error rate of the unpruned tree is 0.1740741. Let's find the test error rate of the pruned tree.
# Get the test error of the pruned tree using the predict function
prune.oj.preds = predict(prune.oj, newdata = oj.test, type = "class")
table(prune.oj.preds, oj.test$Purchase)
##
## prune.oj.preds CH MM
## CH 135 17
## MM 30 88
mse.prune.oj = (17+30)/270
mse.prune.oj
## [1] 0.1740741
From the above, the test error rate of the pruned tree is 0.1740741, the same as that of the un-pruned tree; pruning did not change the misclassification rate on the test set.