3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

The classification error rate is:

\[E = 1 - \max_{k}(\hat{p}_{mk}).\]

Therefore, \[E = 1 - \max \{\hat{p}_{m1}, \hat{p}_{m2}\}.\]

The Gini index is:

\[G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \]

Therefore, \[G = \hat{p}_{m1} (1 - \hat{p}_{m1}) + \hat{p}_{m2} (1 - \hat{p}_{m2})\]

The Entropy is:

\[D = - \sum_{k=1}^{K} \hat{p}_{mk} \log(\hat{p}_{mk}) \]

Therefore, \[D = - \hat{p}_{m1} \log(\hat{p}_{m1}) - \hat{p}_{m2} \log(\hat{p}_{m2}) \]
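
As a quick numerical check of these expressions, at \(\hat{p}_{m1} = 0.5\) they evaluate to

\[E = 1 - 0.5 = 0.5, \qquad G = 2 \times 0.5 \times 0.5 = 0.5, \qquad D = -2 \times 0.5 \log(0.5) = \log 2 \approx 0.693,\]

while at \(\hat{p}_{m1} = 0.9\) we get \(E = 0.1\), \(G = 0.18\), and \(D \approx 0.325\): all three measures peak at \(\hat{p}_{m1} = 0.5\) and shrink toward 0 as the node becomes pure.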

Plotting these three quantities in R for the two-class setting gives the plot below.

library(ggplot2)   #plotting
library(tidyr)     #pivot_longer()

#Two classes
p1 <- seq(0, 1, 0.001)
p2 <- 1 - p1

#Classification error rate
classification.error.rate <- 1 - pmax(p1, p2)
#Gini index
gini.index <- p1 * (1 - p1) + p2 * (1 - p2)
#Entropy (natural log; -p*log(p) is NaN at p = 0 and p = 1)
entropy <- -p1 * log(p1) - p2 * log(p2)

#Data frame to hold the 2 classes with the quantities
df.classification.quantities <-  data.frame(p1, p2, classification.error.rate, gini.index, entropy) 
df.classification.quantities <- pivot_longer(df.classification.quantities, cols = c(classification.error.rate, gini.index, entropy), names_to = "metric")

##Plot classification quantities
ggplot(data=df.classification.quantities,
        aes(x = p1, y = value, col = factor(metric))) + 
        geom_line(size = 1) + 
        scale_y_continuous(breaks = seq(0, 1, 0.1), minor_breaks = NULL) + 
        scale_color_hue(labels = c("Classification Error", "Entropy", "Gini")) +
        labs(col = "Metric", 
        y = "Value", 
        x = "P")

9. This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

We create a training set of 800 randomly sampled observations; the remaining 270 observations form the test set.

library(ISLR)   #OJ data set
library(tree)   #fitting and pruning classification trees

set.seed(123)
train <- sample(1:nrow(OJ), 800)
oj.train <- OJ[train, ]
oj.test <- OJ[-train, ]
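
A quick sanity check on the split (my addition, not part of the original answer):

dim(oj.train)              #should be 800 rows
dim(oj.test)               #the remaining 270 rows
table(oj.train$Purchase)   #class balance of the response in the training set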

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

tree.oj.model <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj.model)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800

The classification tree has 8 terminal nodes and a training error rate of 16.5%.

Of the 17 available predictors, the tree actually uses only two in its splits:
(i) LoyalCH - Customer brand loyalty for CH
(ii) PriceDiff - Sale price of MM less sale price of CH
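
The used predictors can also be pulled out programmatically; a small sketch (my addition, assuming the used component returned by summary() for tree objects):

all.predictors  <- setdiff(names(oj.train), "Purchase")
used.predictors <- as.character(summary(tree.oj.model)$used)
length(all.predictors)                    #17 candidate predictors
setdiff(all.predictors, used.predictors)  #the 15 predictors the tree never splits on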

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree.oj.model
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1071.00 CH ( 0.60875 0.39125 )  
##    2) LoyalCH < 0.5036 350  415.10 MM ( 0.28000 0.72000 )  
##      4) LoyalCH < 0.276142 170  131.00 MM ( 0.12941 0.87059 )  
##        8) LoyalCH < 0.0356415 56   10.03 MM ( 0.01786 0.98214 ) *
##        9) LoyalCH > 0.0356415 114  108.90 MM ( 0.18421 0.81579 ) *
##      5) LoyalCH > 0.276142 180  245.20 MM ( 0.42222 0.57778 )  
##       10) PriceDiff < 0.05 74   74.61 MM ( 0.20270 0.79730 ) *
##       11) PriceDiff > 0.05 106  144.50 CH ( 0.57547 0.42453 ) *
##    3) LoyalCH > 0.5036 450  357.10 CH ( 0.86444 0.13556 )  
##      6) PriceDiff < -0.39 27   32.82 MM ( 0.29630 0.70370 ) *
##      7) PriceDiff > -0.39 423  273.70 CH ( 0.90071 0.09929 )  
##       14) LoyalCH < 0.705326 130  135.50 CH ( 0.78462 0.21538 )  
##         28) PriceDiff < 0.145 43   58.47 CH ( 0.58140 0.41860 ) *
##         29) PriceDiff > 0.145 87   62.07 CH ( 0.88506 0.11494 ) *
##       15) LoyalCH > 0.705326 293  112.50 CH ( 0.95222 0.04778 ) *

I pick terminal node 8 (the asterisk marks it as a terminal node). The split criterion leading to it is LoyalCH < 0.0356415, and the branch contains 56 observations with a deviance of 10.03. The overall prediction for this branch is MM: roughly 98% of the observations in it are MM and only about 2% are CH.
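
These numbers can also be read straight out of the fitted tree object; a sketch (my addition), relying on the frame component of a tree fit, whose rows are named by node number:

#Sketch: the frame row named "8" holds n, deviance, predicted class and
#class probabilities for terminal node 8
tree.oj.model$frame["8", ]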

(d) Create a plot of the tree, and interpret the results.

plot(tree.oj.model)
text(tree.oj.model, pretty=0, cex = 0.5)

From the tree plot, we can see that LoyalCH is the most important variable, followed by PriceDiff, for predicting Purchase. The first split separates customers by their brand loyalty to CH (LoyalCH); the root node and its left child both split on LoyalCH, while the right child splits on PriceDiff. There are 8 terminal nodes, divided equally between CH and MM predictions.

At the first split, customers who are less loyal to Citrus Hill (CH) go to the left branch and more loyal customers go to the right. On the more-loyal side, PriceDiff determines the prediction: if PriceDiff is less than -0.39 (i.e., MM is substantially cheaper than CH), even a fairly loyal customer is predicted to buy Minute Maid (MM).
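
To see this routing concretely, we can push a few test observations through the tree; a small sketch (my addition):

#Sketch: compare each observation's LoyalCH and PriceDiff with its predicted class
head(oj.test[, c("LoyalCH", "PriceDiff")], 3)
predict(tree.oj.model, newdata = oj.test[1:3, ], type = "class")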

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

The predictions on the test set and the resulting confusion matrix are:

tree.oj.pred <- predict(tree.oj.model, oj.test, type = "class")
table(tree.oj.pred, oj.test$Purchase)
##             
## tree.oj.pred  CH  MM
##           CH 150  34
##           MM  16  70

\[\text{Test error rate} = 1 - \frac{150 + 70}{270} \approx 0.1852\]

So the test error rate is approximately 18.5%.
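
The same figure can be computed directly from the confusion matrix; a small sketch (my addition):

#Sketch: test error rate = 1 - accuracy
conf.mat <- table(tree.oj.pred, oj.test$Purchase)
1 - sum(diag(conf.mat)) / sum(conf.mat)   #approximately 0.185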

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

We pass FUN = prune.misclass to cv.tree() so that cross-validation and pruning are guided by the classification error rate rather than the default deviance.

set.seed(2)

cv.tree.model <- cv.tree(tree.oj.model, K = 10, FUN = prune.misclass)
cv.tree.model
## $size
## [1] 8 5 3 2 1
## 
## $dev
## [1] 138 138 162 165 313
## 
## $k
## [1] -Inf    0    8   11  154
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

library(dplyr)   #mutate()

#Construct a data frame with tree size and CV error rate
#(with FUN = prune.misclass, $dev counts misclassified observations, so divide by n)
df.cv.oj <- data.frame(Tree.Size = cv.tree.model$size,
                        CV.Error = cv.tree.model$dev/nrow(oj.train))
#Flag the minimum error rate so it can be highlighted in the plot
df.cv.oj <- mutate(df.cv.oj, min.cv.error = as.numeric(min(CV.Error) == CV.Error))
##Use the ggplot
ggplot(df.cv.oj, aes(x = Tree.Size, y = CV.Error)) +
  geom_line(col = "black") +
  geom_point(size = 2, aes(col = factor(min.cv.error))) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "OJ Dataset - Cross Validation",
       x = "Tree size",
       y = "CV - Error")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

From the cross-validation plot in (g), trees with 5 and 8 terminal nodes have the same (lowest) cross-validated error rate. Since they tie, the natural choice is the simpler model with 5 terminal nodes.
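
The same choice can be made programmatically; a small sketch (my addition) that takes the smallest size among those attaining the minimum CV error:

#Sketch: sizes tied for the lowest cross-validated error, and the simplest of them
best.sizes <- cv.tree.model$size[cv.tree.model$dev == min(cv.tree.model$dev)]
best.sizes        #8 and 5
min(best.sizes)   #5 terminal nodes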

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

From the cross-validation results in (h), the optimal tree size is 5.

prune.oj <- prune.misclass(tree.oj.model, best = 5)
plot(prune.oj)
text(prune.oj, pretty=0, cex = 0.5)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(tree.oj.model)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800
summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj.model, nodes = c(4L, 7L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.826 = 656.6 / 795 
## Misclassification error rate: 0.165 = 132 / 800

The pruned (5-node) and unpruned (8-node) trees have identical training error rates of 16.5% (132/800), so neither is higher; only the residual mean deviance increases slightly after pruning (0.826 versus 0.7625).
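
The training error rates can also be computed directly with predict(); a small sketch (my addition), which should reproduce the 0.165 reported by summary():

#Sketch: training misclassification rates for the unpruned and pruned trees
mean(predict(tree.oj.model, newdata = oj.train, type = "class") != oj.train$Purchase)
mean(predict(prune.oj, newdata = oj.train, type = "class") != oj.train$Purchase)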

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

mean(predict(tree.oj.model, type = "class", newdata = oj.test) != oj.test$Purchase)
## [1] 0.1851852
mean(predict(prune.oj, type = "class", newdata = oj.test) != oj.test$Purchase)
## [1] 0.1851852

The pruned and unpruned trees also have identical test error rates of approximately 18.5%, consistent with their identical cross-validation errors in (h); neither is higher.