3

Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

Let’s talk briefly about these terms.

  • Gini Index: given by the equation \(G = \sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk})}\), this is a measure of the total variance across all K classes. The Gini index is smallest when each \(\hat{p}_{mk}\) is close to either 0 or 1, which is why it is also considered a measure of node purity: a small Gini index means the node contains mostly observations from a single class.

  • Classification Error: given by the equation \(E = 1 - \max_k(\hat{p}_{mk})\), this is the fraction of the training observations in a region that don’t belong to the most common class. This measure isn’t sensitive enough for tree-growing, so the Gini index or cross-entropy is typically used instead.

  • Cross-Entropy: given by the equation \(D = -\sum_{k=1}^{K}{\hat{p}_{mk}\log\hat{p}_{mk}}\), this behaves much like the Gini index: if each \(\hat{p}_{mk}\) is near 0 or 1, the cross-entropy is near 0, so a near-zero value again indicates a very pure node (see the quick numerical check below).
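As a quick numerical check of the formulas above (a sketch for the two-class setting): at \(\hat{p}_{m1} = 0.9\) the node is quite pure, and all three measures are correspondingly small.

# Two-class worked example at p-hat_m1 = 0.9 (so p-hat_m2 = 0.1)
p <- 0.9
gini  <- 2 * p * (1 - p)                       # 0.18
error <- 1 - max(p, 1 - p)                     # 0.10
xent  <- -(p * log(p) + (1 - p) * log(1 - p))  # ~0.325
c(gini = gini, class.error = error, cross.entropy = xent)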

Here’s the plot:

p = seq(0, 1, 0.01)
class.error = 1 - pmax(p, 1 - p)                  # classification error
gini.index = 2 * p * (1 - p)                      # two-class Gini index
Xentropy = -(p * log(p) + (1 - p) * log(1 - p))   # cross-entropy (NaN at p = 0 and 1; those points are simply not drawn)

matplot(p, cbind(class.error, gini.index, Xentropy), type = "l", lwd = 3,
        col = c("red", "blue", "green"), xlim = c(0, 1.2),
        xlab = "P-hat", ylab = "Resulting Value")
legend("topright", lty = 1:3, lwd = 3, title = "Quality Measure",
       col = c("red", "blue", "green"),
       legend = c("Classification Error", "Gini Index", "Cross-Entropy"),
       box.lty = 1)

8

8-A) Split the data set into a training set and a test set.

set.seed(2021)
carseats = Carseats

inTrain <- createDataPartition(carseats$Sales, p = 0.7, list = FALSE, times = 1)
carseats.train <- carseats[inTrain, ]
carseats.test <- carseats[-inTrain, ]

8-B) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

set.seed(2021)
tree <- tree(Sales ~ ., carseats.train)
summary(tree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price"     "Age"       "CompPrice" "Income"   
## Number of terminal nodes:  16 
## Residual mean deviance:  2.476 = 656.1 / 265 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.068000 -1.015000 -0.006471  0.000000  0.913500  4.546000
plot(tree); text(tree, pretty = 0)

The variables actually used in tree construction are ShelveLoc, Price, Age, CompPrice and Income. There are 16 terminal nodes, and the residual mean deviance is 2.48. ShelveLoc is the most important variable, based on its position at the top of the tree, followed by Price.

set.seed(2021)
tree.preds <- predict(tree, newdata = carseats.test)
tree.MSE <- mean((tree.preds - carseats.test$Sales)^2)
cat('The decision tree MSE is: ', tree.MSE)
## The decision tree MSE is:  5.119811

The decision tree MSE is approximately 5.12.

8-C)

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

set.seed(2021)
CVtree = cv.tree(tree)
CVtree
## $size
##  [1] 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
## 
## $dev
##  [1] 1198.868 1220.312 1211.394 1207.512 1198.210 1206.164 1237.239 1302.599
##  [9] 1422.524 1415.249 1438.069 1425.824 1525.997 1576.075 1817.700 2392.918
## 
## $k
##  [1]      -Inf  23.35243  26.76040  28.34406  28.51928  28.64549  38.54721
##  [8]  43.54840  57.44161  60.57303  69.37522 105.02149 134.19967 174.08867
## [15] 251.33441 588.26787
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(CVtree$size, CVtree$dev, type = "b")

min.dev <- which.min(CVtree$dev)
min.dev.val <- CVtree$dev[min.dev]
min.dev.size <- CVtree$size[min.dev]
cat('The model with the minimum dev value is', min.dev.val)
## The model with the minimum dev value is 1198.21
cat('\nThe model size with the minimum dev value is' ,min.dev.size)
## 
## The model size with the minimum dev value is 12

Pruning the tree for optimal size:

set.seed(2021)
prunetree = prune.tree(tree, best = min.dev.size)
plot(prunetree)
text(prunetree, pretty = 0)

CVtree.preds <- predict(prunetree, newdata = carseats.test)
CVtree.MSE <- mean((CVtree.preds - carseats.test$Sales)^2)
cat('The CV decision tree MSE is: ', CVtree.MSE)
## The CV decision tree MSE is:  5.231266

The test MSE for the CV-pruned tree is 5.23, which is higher than the 5.12 from the unpruned tree, so pruning does not improve the test MSE here.
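To double-check that conclusion, here is a short sketch (assuming the tree, CVtree and carseats.test objects from above) that computes the test MSE for every candidate subtree size from the CV run, making it easy to see whether any pruned size would have done better:

# Test MSE for each candidate subtree size (root-only trees skipped)
sizes <- sort(CVtree$size[CVtree$size >= 2])
pruned.mse <- sapply(sizes, function(s) {
  pt <- prune.tree(tree, best = s)
  mean((predict(pt, newdata = carseats.test) - carseats.test$Sales)^2)
})
data.frame(size = sizes, test.MSE = round(pruned.mse, 3))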

8-D)

Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

Bagging is just a random forest that is allowed to consider every predictor at each split, so we can use randomForest() with mtry = 10 (the full set of predictors in Carseats) and ntree = 1000.

set.seed(2021)
bagmodel <- randomForest(Sales ~ ., carseats.train, mtry = 10, importance = TRUE, ntree = 1000)
bagmodel
## 
## Call:
##  randomForest(formula = Sales ~ ., data = carseats.train, mtry = 10,      importance = TRUE, ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.318885
##                     % Var explained: 71.84
varImpPlot(bagmodel)

The bagged model explains 71.84% of the variance in Sales in the training data. Based on the importance plot, ShelveLoc and Price are the two most important variables.
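Since the question asks for the importance() function specifically, here is a small sketch that prints the numeric importance scores for the bagged model (it assumes the bagmodel object fit above), sorted by %IncMSE:

# Numeric variable importance for the bagged model, sorted by %IncMSE
imp <- importance(bagmodel)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]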

set.seed(2021)
bagmodel.preds <- predict(bagmodel, newdata = carseats.test)
bagmodel.MSE <- mean((bagmodel.preds - carseats.test$Sales)^2)
cat('The bagged model MSE is ',bagmodel.MSE)
## The bagged model MSE is  2.747382

The bagged model has an MSE of 2.75, which is about half of the MSE from the decision tree tried above. Heck yeah, bagged models! Whoo, random forests!

8-E)

Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed(2021)
RF1 <- randomForest(Sales ~ ., carseats.train, mtry=5, ntree = 1000, importance = TRUE)
RF1
## 
## Call:
##  randomForest(formula = Sales ~ ., data = carseats.train, mtry = 5,      ntree = 1000, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 2.434538
##                     % Var explained: 70.44
importance(RF1)
##                %IncMSE IncNodePurity
## CompPrice   35.2811766     214.84678
## Income      13.6836596     150.61959
## Advertising 21.5526008     164.64507
## Population   0.2415301     107.30226
## Price       79.8599871     621.45955
## ShelveLoc   81.5668189     617.46293
## Age         31.3461235     248.92422
## Education    1.1764927      72.31147
## Urban       -1.3948924      14.71352
## US           6.0714316      15.37581
RF1.preds <- predict(RF1, newdata = carseats.test)
RF1.MSE <- mean((RF1.preds - carseats.test$Sales)^2)
cat('Random Forest MSE is ',RF1.MSE)
## Random Forest MSE is  2.613064

With mtry left at its default (p/3, so 3 variables per split), the random forest explained about 66.3% of the variance in Sales and had a test MSE of 2.77 (run not shown). Setting mtry = 5, as above, raises the variance explained to 70.4% and drops the test MSE to 2.61, while the bagged model (mtry = 10) came in at 2.75, so there appears to be a sweet spot for m between too few variables and all of them. By node purity the top three variables are Price, ShelveLoc and Age; by %IncMSE, CompPrice edges out Age for third place.
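To see the effect of m more directly, here is a rough sketch (assuming the carseats.train/carseats.test split from above; 500 trees per forest to keep the runtime down, and exact numbers will vary with the seed) that refits the forest for each value of mtry and plots the resulting test MSE:

# Test MSE as a function of m (mtry), the number of variables tried at each split
set.seed(2021)
mtry.vals <- 1:10
mtry.mse <- sapply(mtry.vals, function(m) {
  rf <- randomForest(Sales ~ ., data = carseats.train, mtry = m, ntree = 500)
  mean((predict(rf, newdata = carseats.test) - carseats.test$Sales)^2)
})
plot(mtry.vals, mtry.mse, type = "b", xlab = "mtry (m)", ylab = "Test MSE")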

9

This problem involves the OJ data set which is part of the ISLR package.

9-A) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(2021)
oj <- OJ
inTrain <- createDataPartition(oj$Purchase, p=0.747, list=FALSE, times = 1)
oj.train <- oj[inTrain, ]
oj.test <- oj[-inTrain, ]
dim(oj.train)
## [1] 800  18
dim(oj.test)
## [1] 270  18

9-B) Fit a tree to the training data, with Purchase as the response and the other variables except for Buy as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

I’m sorry, Dave, but I can’t do that. Why? Because Buy doesn’t exist in the OJ dataset. Maybe it did in a previous edition, but not in this one. So you’re getting all of the variables as predictors. Deal with it.

oj.tree <- tree(Purchase ~ ., data = oj.train)
summary(oj.tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7709 = 611.3 / 793 
## Misclassification error rate: 0.1838 = 147 / 800

This tree uses two of the predictors, LoyalCH and PriceDiff, and has 7 terminal nodes. The training (misclassification) error rate is 18.4%.

9-C) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

oj.tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1070.000 CH ( 0.61000 0.39000 )  
##    2) LoyalCH < 0.50395 352  422.000 MM ( 0.28693 0.71307 )  
##      4) LoyalCH < 0.136698 101   45.520 MM ( 0.05941 0.94059 ) *
##      5) LoyalCH > 0.136698 251  333.000 MM ( 0.37849 0.62151 )  
##       10) PriceDiff < 0.05 99   87.580 MM ( 0.16162 0.83838 ) *
##       11) PriceDiff > 0.05 152  210.500 CH ( 0.51974 0.48026 ) *
##    3) LoyalCH > 0.50395 448  356.500 CH ( 0.86384 0.13616 )  
##      6) LoyalCH < 0.764572 182  212.000 CH ( 0.73077 0.26923 )  
##       12) PriceDiff < -0.065 45   60.570 MM ( 0.40000 0.60000 ) *
##       13) PriceDiff > -0.065 137  120.700 CH ( 0.83942 0.16058 )  
##         26) PriceDiff < 0.31 94   99.860 CH ( 0.77660 0.22340 ) *
##         27) PriceDiff > 0.31 43    9.499 CH ( 0.97674 0.02326 ) *
##      7) LoyalCH > 0.764572 266   97.820 CH ( 0.95489 0.04511 ) *

Picking terminal node 4. The splitting variable here is LoyalCH, and the split value is approximately 0.137. There are 101 training observations in this node, and their deviance is 45.52. The * marks it as a terminal node, and it predicts Purchase = MM: roughly 6% of the observations in the node purchased CH and the remaining 94% purchased MM.

9-D) Create a plot of the tree, and interpret the results.

plot(oj.tree); text(oj.tree, pretty = 0)

LoyalCH is the most important variable and makes up the first three splits; the second most important variable is PriceDiff. If LoyalCH < 0.504 and also < 0.14, the tree predicts MM. If LoyalCH is between 0.14 and 0.504, it becomes a contest of PriceDiff: below 0.05 the prediction is MM, otherwise CH. On the other side, if LoyalCH > 0.76 the tree predicts CH; otherwise it again comes down to PriceDiff.

9-E) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

set.seed(2021)
oj.preds <- predict(oj.tree, newdata = oj.test, type = 'class')
conf.matrix <- table(oj.preds, oj.test$Purchase)

#Test error rate
tree.acc <- sum(diag(conf.matrix))/sum(conf.matrix)
tree.err <- 1 - tree.acc
cat('Overall tree model accuracy is: ', tree.acc)
## Overall tree model accuracy is:  0.7740741
cat('\nOverall tree model test error rate is: ', tree.err)
## 
## Overall tree model test error rate is:  0.2259259

The overall tree model accuracy is 77.4%, and the overall test error rate is 22.6%.

9-F) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

set.seed(2021)
CV.oj.tree <- cv.tree(oj.tree, FUN = prune.misclass)
CV.oj.tree
## $size
## [1] 7 6 4 2 1
## 
## $dev
## [1] 173 171 174 170 312
## 
## $k
## [1]  -Inf   0.0   3.0   4.5 150.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

Based on the output above, a tree of size 2 has the lowest cross-validated deviance (170 misclassifications), with size 6 close behind (171).
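For reference, a one-liner (sketch) pulls the size with the minimum cross-validated deviance directly from the object; with the output above it returns 2:

# Tree size with the smallest cross-validated deviance
CV.oj.tree$size[which.min(CV.oj.tree$dev)]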

Plotting this we get:

plot(CV.oj.tree)

9-G) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

# cv.tree()$dev is the number of misclassified training observations,
# so divide by n to get an error rate
plot(CV.oj.tree$size, CV.oj.tree$dev / nrow(oj.train), type = "l", lwd = 2, col = "red",
     main = "CV Classification Error Rate vs. Tree Size", 
     xlab = "Tree Size", 
     ylab = "CV Classification Error Rate")

9-H) Which tree size corresponds to the lowest cross-validated classification error rate?

There are two candidates. A tree with 2 terminal nodes has a cross-validated misclassification count of 170 (an error rate of about 21.3%), while a tree with 6 terminal nodes has 171 (about 21.4%). I’ll try both in the next step to see which one really has the best performance.

9-I) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

pruned.oj.tree2 <- prune.misclass(oj.tree, best = 2)
summary(pruned.oj.tree2)
## 
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 2:3)
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes:  2 
## Residual mean deviance:  0.9756 = 778.5 / 798 
## Misclassification error rate: 0.2025 = 162 / 800

If I use 2 terminal nodes, the misclassification error rate is 20.25%.

pruned.oj.tree6 <- prune.misclass(oj.tree, best = 6)
summary(pruned.oj.tree6)
## 
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 13L)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.7843 = 622.7 / 794 
## Misclassification error rate: 0.1838 = 147 / 800

If I use 6 terminal nodes, the misclassification error rate is 18.4%, which is better than the 2-terminal-node solution. So let’s go with the 6-TN model and call it good.

9-J) Compare the training error rates between the pruned and unpruned trees. Which is higher?

The training error rate for the pruned 6-terminal-node tree is 18.38%, identical to the 18.38% of the unpruned tree, so neither is higher. That isn’t surprising: the unpruned tree has 7 terminal nodes, so pruning removed only a single split.
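A quick way to confirm this (a sketch using the oj.tree, pruned.oj.tree6 and oj.train objects from above) is to compute both training error rates directly:

# Training error rates for the unpruned and pruned (6-node) trees
train.err <- function(fit) {
  mean(predict(fit, newdata = oj.train, type = "class") != oj.train$Purchase)
}
c(unpruned = train.err(oj.tree), pruned6 = train.err(pruned.oj.tree6))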

9-K) Compare the test error rates between the pruned and unpruned trees. Which is higher?

Let’s unleash the pruned decision tree upon the test data and see what kind of performance it has.

pruned.oj.preds <- predict(pruned.oj.tree6, newdata = oj.test, type = 'class')
pruned.conf.matrix <- table(pruned.oj.preds, oj.test$Purchase)

pruned.tree.acc <- sum(diag(pruned.conf.matrix))/sum(pruned.conf.matrix)
pruned.tree.err <- 1 - pruned.tree.acc
cat('The unpruned tree model confusion matrix looks like this: \n')
## The unpruned tree model confusion matrix looks like this:
conf.matrix
##         
## oj.preds  CH  MM
##       CH 142  38
##       MM  23  67
cat('\nOverall unpruned tree model accuracy is: ', tree.acc)
## 
## Overall unpruned tree model accuracy is:  0.7740741
cat('\nOverall unpruned tree model test error rate is: ', tree.err)
## 
## Overall unpruned tree model test error rate is:  0.2259259
cat('The pruned tree model confusion matrix looks like this: \n') 
## The pruned tree model confusion matrix looks like this:
pruned.conf.matrix
##                
## pruned.oj.preds  CH  MM
##              CH 142  38
##              MM  23  67
cat('\nOverall pruned tree model accuracy is: ', pruned.tree.acc)
## 
## Overall pruned tree model accuracy is:  0.7740741
cat('\nOverall pruned tree model test error rate is: ', pruned.tree.err)
## 
## Overall pruned tree model test error rate is:  0.2259259

There is no difference in test performance between the unpruned and pruned trees: both are 77.4% accurate with a 22.6% test error rate. This is not surprising, since pruning only collapsed node 13, whose two children (nodes 26 and 27) both predicted CH, so the two trees make identical predictions on the test set.
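As a final sanity check (a sketch using the prediction vectors created above), the two trees can be compared observation by observation on the test set; a result of 0 means they never disagree:

# Fraction of test observations where the pruned and unpruned trees disagree
mean(as.character(pruned.oj.preds) != as.character(oj.preds))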