library(tidyverse)
library(plotly)
Create Gini index data:
gini <- as_tibble(list(p = seq(0, 1, 0.001))) %>%
  mutate(value = p * (1 - p) * 2,
         measure = "Gini")
Create entropy data:
entr <- as_tibble(list(p = seq(0, 1, 0.001))) %>%
  mutate(value = -(p * log(p) + (1 - p) * log(1 - p)),
         measure = "Entropy")
Create classification error data:
error <- as_tibble(list(p = seq(0, 1, 0.001))) %>%
  mutate(value = 1 - pmax(p, 1 - p),
         measure = "Classification Error")
df <- bind_rows(gini, entr, error) %>%
  arrange(measure, p)
plt <- ggplot(df, aes(x = p, y = value, col = measure)) +
  geom_line() +
  scale_color_manual(values = c("#377eb8", "#e41a1c", "#4daf4a")) +
  labs(x = "p",
       y = "Value for Split",
       title = "Max Value for Each Criterion Occurs at p = 0.50") +
  theme_minimal()
figure <- ggplotly(plt, width = 600, height = 300)
figure
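As a quick numeric check (a small sketch, not part of the plotting code above), each criterion can be evaluated directly at p = 0.5, where all three reach their maximum:
p <- 0.5
2 * p * (1 - p)                       # Gini index: 0.5
-(p * log(p) + (1 - p) * log(1 - p))  # entropy: log(2), about 0.693
1 - pmax(p, 1 - p)                    # classification error: 0.5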
This exercise uses the OJ data set, which is part of the ISLR2 package.
a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR2)

set.seed(23)
index <- sample(nrow(OJ), 800)
oj_train <- OJ[index, ]
oj_test <- OJ[-index, ]
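As a quick sanity check on the split (an optional sketch), the OJ data set has 1,070 observations, so the test set should contain the remaining 270:
dim(oj_train)                          # 800 rows
dim(oj_test)                           # 270 rows
prop.table(table(oj_train$Purchase))   # class balance in the training set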
b) Fit a tree to the training data, with
Purchase as the response and the other variables as
predictors. Use the summary() function to produce summary
statistics about the tree and describe the results obtained. What is the
training error rate? How many terminal nodes does the tree
have?
library(tree)

oj_tree <- tree(Purchase ~ ., data = oj_train)
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH"
## Number of terminal nodes: 10
## Residual mean deviance: 0.7116 = 562.2 / 790
## Misclassification error rate: 0.145 = 116 / 800
Three variables are used in the construction of the tree:
LoyalCH, PriceDiff, and
SpecialCH. There are 10 terminal nodes, and the training
error rate is 0.145.
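The training error rate reported by summary() can also be recovered directly from the fitted tree (a sketch, assuming the fit above):
train_pred <- predict(oj_tree, oj_train, type = "class")
mean(train_pred != oj_train$Purchase)  # should match 0.145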
c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
oj_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )
## 2) LoyalCH < 0.48285 298 327.20 MM ( 0.23826 0.76174 )
## 4) LoyalCH < 0.142213 101 55.92 MM ( 0.07921 0.92079 ) *
## 5) LoyalCH > 0.142213 197 246.90 MM ( 0.31980 0.68020 )
## 10) PriceDiff < 0.31 154 169.80 MM ( 0.24026 0.75974 )
## 20) SpecialCH < 0.5 137 136.00 MM ( 0.19708 0.80292 ) *
## 21) SpecialCH > 0.5 17 23.03 CH ( 0.58824 0.41176 ) *
## 11) PriceDiff > 0.31 43 57.71 CH ( 0.60465 0.39535 ) *
## 3) LoyalCH > 0.48285 502 437.00 CH ( 0.84263 0.15737 )
## 6) LoyalCH < 0.705699 208 258.40 CH ( 0.68750 0.31250 )
## 12) PriceDiff < -0.165 27 25.87 MM ( 0.18519 0.81481 ) *
## 13) PriceDiff > -0.165 181 198.50 CH ( 0.76243 0.23757 )
## 26) PriceDiff < 0.265 97 125.70 CH ( 0.64948 0.35052 )
## 52) LoyalCH < 0.6864 92 114.70 CH ( 0.68478 0.31522 ) *
## 53) LoyalCH > 0.6864 5 0.00 MM ( 0.00000 1.00000 ) *
## 27) PriceDiff > 0.265 84 57.20 CH ( 0.89286 0.10714 ) *
## 7) LoyalCH > 0.705699 294 112.60 CH ( 0.95238 0.04762 )
## 14) PriceDiff < -0.39 14 19.12 CH ( 0.57143 0.42857 ) *
## 15) PriceDiff > -0.39 280 72.65 CH ( 0.97143 0.02857 ) *
Looking at node 20, we see that it is defined by the split SpecialCH < 0.5, so the splitting variable is SpecialCH and the splitting value is 0.5. There are 137 observations in this node, and their total deviance is 136.00. The * at the end of the line denotes that this is a terminal node of the tree. The prediction at this node is Purchase = MM: according to the text output, about 19.71% of the observations in this node have CH as the value of Purchase, while roughly 80.29% have MM.
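The counts, deviance, predicted class, and class proportions for node 20 can also be pulled directly from the tree object's frame component, whose row names are node numbers (a hedged sketch of the tree package's internal representation):
oj_tree$frame["20", c("n", "dev", "yval")]  # n, deviance, predicted class at node 20
oj_tree$frame["20", "yprob"]                # class proportions (CH, MM) at node 20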
d) Create a plot of the tree, and interpret the results.
plot(oj_tree)
text(oj_tree, pretty = 0, cex = 0.55)
LoyalCH is clearly the most important variable in the tree, since the top three nodes all split on LoyalCH. If LoyalCH < 0.142213, the tree predicts MM for Purchase. If LoyalCH > 0.705699, the tree predicts CH for Purchase. For intermediate values of LoyalCH, the decision also depends on the values of two additional variables: PriceDiff and SpecialCH.
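To see these splits in action, a single observation can be traced through the tree with predict() (an illustrative check, not part of the exercise):
oj_train[1, c("LoyalCH", "PriceDiff", "SpecialCH")]  # the values the splits act on
predict(oj_tree, oj_train[1, ], type = "class")      # predicted Purchase for that observation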
e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
library(caret)
oj_pred <- predict(oj_tree, oj_test, type = "class")
confusionMatrix(oj_test$Purchase, oj_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 139 20
## MM 37 74
##
## Accuracy : 0.7889
## 95% CI : (0.7353, 0.836)
## No Information Rate : 0.6519
## P-Value [Acc > NIR] : 6.278e-07
##
## Kappa : 0.5537
##
## Mcnemar's Test P-Value : 0.03407
##
## Sensitivity : 0.7898
## Specificity : 0.7872
## Pos Pred Value : 0.8742
## Neg Pred Value : 0.6667
## Prevalence : 0.6519
## Detection Rate : 0.5148
## Detection Prevalence : 0.5889
## Balanced Accuracy : 0.7885
##
## 'Positive' Class : CH
##
The prediction accuracy is 78.89%, so the test error rate is 21.11%.
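The same test error rate can be computed without caret, using a base-R confusion table (a minimal sketch):
table(Predicted = oj_pred, Actual = oj_test$Purchase)
mean(oj_pred != oj_test$Purchase)  # test error rate, about 0.211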
f) Apply the cv.tree() function to the training
set in order to determine the optimal tree size.
oj_cv <- cv.tree(oj_tree, FUN = prune.tree)
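Since the exercise asks about the classification error rate, cv.tree() can alternatively be told to prune on misclassification error rather than deviance. The results below use the deviance-based call above, but a misclassification-based run would look like this (an optional alternative):
oj_cv_misclass <- cv.tree(oj_tree, FUN = prune.misclass)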
g) Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.
plot(oj_cv$size, oj_cv$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
h) Which tree size corresponds to the lowest cross-validated classification error rate?
According to the plot, it appears that a tree size of 3 gives the lowest cross-validation error.
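The choice can also be confirmed programmatically (a sketch; with ties, which.min() returns only the first minimum):
oj_cv$size[which.min(oj_cv$dev)]  # tree size with the lowest cross-validated deviance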
i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of pruned tree, then create a pruned tree with five terminal nodes.
oj_prune <- prune.tree(oj_tree, best = 3)
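Plotting the pruned tree (an optional check, mirroring part d) shows that only LoyalCH splits remain, consistent with the summary in part j below:
plot(oj_prune)
text(oj_prune, pretty = 0, cex = 0.55)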
j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH"
## Number of terminal nodes: 10
## Residual mean deviance: 0.7116 = 562.2 / 790
## Misclassification error rate: 0.145 = 116 / 800
summary(oj_prune)
##
## Classification tree:
## snip.tree(tree = oj_tree, nodes = c(7L, 2L, 6L))
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes: 3
## Residual mean deviance: 0.876 = 698.2 / 797
## Misclassification error rate: 0.1875 = 150 / 800
The unpruned tree has a slightly lower training error rate (0.145) than the pruned tree (0.1875).
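Both training error rates can also be computed directly from predictions (a sketch, assuming the fits above):
mean(predict(oj_tree, oj_train, type = "class") != oj_train$Purchase)   # unpruned: should match 0.145
mean(predict(oj_prune, oj_train, type = "class") != oj_train$Purchase)  # pruned: should match 0.1875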
k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
pred_unpruned <- predict(oj_tree, oj_test, type = "class")
unpruned_error <- sum(oj_test$Purchase != pred_unpruned)
unpruned_error / length(pred_unpruned)
## [1] 0.2111111
pred_pruned <- predict(oj_prune, oj_test, type = "class")
pruned_error <- sum(oj_test$Purchase != pred_pruned)
pruned_error / length(pred_pruned)
## [1] 0.2
The pruned tree has a slightly lower test error rate (0.200) than the unpruned tree (0.211).