Conceptual

3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy.
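
With two classes, write \(p = \hat{p}_{m1}\). The three measures plotted below are then

\[
G = 2p(1 - p), \qquad
D = -p \log p - (1 - p)\log(1 - p), \qquad
E = 1 - \max(p,\, 1 - p),
\]

so all three are maximized at \(p = 0.5\) and equal zero at \(p = 0\) and \(p = 1\).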

library(tidyverse)
library(plotly)

Create Gini index data:

gini <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = p * (1 - p) * 2,
         measure = "Gini")


Create entropy data:

entr <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = -(p * log(p) + (1 - p) * log(1 - p)),
         # treat 0 * log(0) as 0 so the endpoints are not NaN
         value = ifelse(is.nan(value), 0, value),
         measure = "Entropy")


Create classification error data:

error <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = 1 - pmax(p, 1 - p),
         measure = "Classification Error")


df <- bind_rows(gini, entr, error) %>% 
  arrange(measure, p)
plt <- ggplot(df, aes(x = p, y = value, col = measure)) +
  geom_line() +
  scale_color_manual(values = c("#377eb8","#e41a1c","#4daf4a")) +
  labs(x = "p",
       y = "Value fot Split",
       title = "Max Value for Each Criterion Occurs at p = 0.50") +
  theme_minimal()
figure <- ggplotly(plt, width = 600, height = 300)
figure


Applied

9. This problem involves the OJ data set which is part of the ISLR2 package.

a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

library(ISLR2)
library(tree)

set.seed(23)
index <- sample(nrow(OJ), 800)
oj_train <- OJ[index, ]
oj_test <- OJ[-index, ]

b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

oj_tree <- tree(Purchase ~., data = oj_train)
summary(oj_tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "SpecialCH"
## Number of terminal nodes:  10 
## Residual mean deviance:  0.7116 = 562.2 / 790 
## Misclassification error rate: 0.145 = 116 / 800


Three variables are used in the construction of the tree: LoyalCH, PriceDiff, and SpecialCH. There are 10 terminal nodes, and the training error rate is 0.145.


c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

oj_tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##    2) LoyalCH < 0.48285 298  327.20 MM ( 0.23826 0.76174 )  
##      4) LoyalCH < 0.142213 101   55.92 MM ( 0.07921 0.92079 ) *
##      5) LoyalCH > 0.142213 197  246.90 MM ( 0.31980 0.68020 )  
##       10) PriceDiff < 0.31 154  169.80 MM ( 0.24026 0.75974 )  
##         20) SpecialCH < 0.5 137  136.00 MM ( 0.19708 0.80292 ) *
##         21) SpecialCH > 0.5 17   23.03 CH ( 0.58824 0.41176 ) *
##       11) PriceDiff > 0.31 43   57.71 CH ( 0.60465 0.39535 ) *
##    3) LoyalCH > 0.48285 502  437.00 CH ( 0.84263 0.15737 )  
##      6) LoyalCH < 0.705699 208  258.40 CH ( 0.68750 0.31250 )  
##       12) PriceDiff < -0.165 27   25.87 MM ( 0.18519 0.81481 ) *
##       13) PriceDiff > -0.165 181  198.50 CH ( 0.76243 0.23757 )  
##         26) PriceDiff < 0.265 97  125.70 CH ( 0.64948 0.35052 )  
##           52) LoyalCH < 0.6864 92  114.70 CH ( 0.68478 0.31522 ) *
##           53) LoyalCH > 0.6864 5    0.00 MM ( 0.00000 1.00000 ) *
##         27) PriceDiff > 0.265 84   57.20 CH ( 0.89286 0.10714 ) *
##      7) LoyalCH > 0.705699 294  112.60 CH ( 0.95238 0.04762 )  
##       14) PriceDiff < -0.39 14   19.12 CH ( 0.57143 0.42857 ) *
##       15) PriceDiff > -0.39 280   72.65 CH ( 0.97143 0.02857 ) *


Looking at node 20, we see that the splitting variable at this node is SpecialCH and the splitting value is 0.5. There are 137 observations in this branch of the tree, and their deviance is 136.00. The * at the end of the line denotes that this is a terminal node. The prediction at this node is Purchase = MM. According to the text output, about 19.71% of the observations in this node have CH as the value of Purchase, while roughly 80.29% have MM.
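
The same per-node statistics can also be pulled directly from the tree object (a quick sketch; the tree package stores them in the frame component, whose row names are the node numbers shown above):

# Inspect node 20 directly; frame rows are indexed by node number
oj_tree$frame["20", ]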


d) Create a plot of the tree, and interpret the results.

plot(oj_tree)
text(oj_tree, pretty = 0, cex = 0.55)


LoyalCH is clearly the most important variable in the tree, since the top three splits are on LoyalCH. If LoyalCH < 0.142213, the tree predicts MM for the value of Purchase. If LoyalCH > 0.705699, the tree predicts CH for the value of Purchase. For intermediate values of LoyalCH, the prediction also depends on the values of two additional variables: PriceDiff and SpecialCH.


e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

library(caret)

oj_pred <- predict(oj_tree, oj_test, type = "class")
confusionMatrix(oj_test$Purchase, oj_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 139  20
##         MM  37  74
##                                          
##                Accuracy : 0.7889         
##                  95% CI : (0.7353, 0.836)
##     No Information Rate : 0.6519         
##     P-Value [Acc > NIR] : 6.278e-07      
##                                          
##                   Kappa : 0.5537         
##                                          
##  Mcnemar's Test P-Value : 0.03407        
##                                          
##             Sensitivity : 0.7898         
##             Specificity : 0.7872         
##          Pos Pred Value : 0.8742         
##          Neg Pred Value : 0.6667         
##              Prevalence : 0.6519         
##          Detection Rate : 0.5148         
##    Detection Prevalence : 0.5889         
##       Balanced Accuracy : 0.7885         
##                                          
##        'Positive' Class : CH             
## 


The prediction accuracy is 78.89%, so the test error rate is 21.11%.
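
Equivalently, the test error rate can be computed directly from the predictions made above, without caret:

# Proportion of misclassified test observations for the unpruned tree
mean(oj_pred != oj_test$Purchase)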


f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

oj_cv <- cv.tree(oj_tree, FUN = prune.tree)


g) Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.

plot(oj_cv$size, oj_cv$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")

h) Which tree size corresponds to the lowest cross-validated classification error rate?


According to the plot, it appears that a tree size of 3 gives the lowest cross-validation error.
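
This can be confirmed programmatically from the components of the cv.tree() output:

# Tree size with the smallest cross-validated deviance
oj_cv$size[which.min(oj_cv$dev)]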


i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

oj_prune <- prune.tree(oj_tree, best = 3)
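
As a quick visual check, the pruned tree can be plotted with the same plot()/text() calls used in part (d):

plot(oj_prune)
text(oj_prune, pretty = 0)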


j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(oj_tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "SpecialCH"
## Number of terminal nodes:  10 
## Residual mean deviance:  0.7116 = 562.2 / 790 
## Misclassification error rate: 0.145 = 116 / 800
summary(oj_prune)
## 
## Classification tree:
## snip.tree(tree = oj_tree, nodes = c(7L, 2L, 6L))
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes:  3 
## Residual mean deviance:  0.876 = 698.2 / 797 
## Misclassification error rate: 0.1875 = 150 / 800


The unpruned tree has a slightly lower training error rate (0.145) than that of the pruned tree (0.1875).
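
The same comparison can be made by computing both training error rates explicitly from predictions on oj_train:

# Training error rate of the unpruned tree
mean(predict(oj_tree, oj_train, type = "class") != oj_train$Purchase)
# Training error rate of the pruned tree
mean(predict(oj_prune, oj_train, type = "class") != oj_train$Purchase)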


k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

pred_unpruned <- predict(oj_tree, oj_test, type = "class")
unpruned_error <- sum(oj_test$Purchase != pred_unpruned)
unpruned_error / length(pred_unpruned)
## [1] 0.2111111
pred_pruned <- predict(oj_prune, oj_test, type = "class")
pruned_error <- sum(oj_test$Purchase != pred_pruned)
pruned_error / length(pred_pruned)
## [1] 0.2


The pruned tree has a slightly lower test error rate (0.2) than that of the unpruned tree (0.2111).