Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
The classification error rate is:
\[E = 1 − \displaystyle \max_{k}(\hat{p}_{mk}).\]
Therefore, \[E = 1 - \max\{\hat{p}_{m1}, \hat{p}_{m2}\}.\]
The Gini index is:
\[G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \]
Therefore, \[G = \hat{p}_{m1} (1 - \hat{p}_{m1}) + \hat{p}_{m2} (1 - \hat{p}_{m2})\]
The Entropy is:
\[D = - \sum_{k=1}^{K} \hat{p}_{mk} \log(\hat{p}_{mk}) \]
Therefore, \[D = - \hat{p}_{m1} \log(\hat{p}_{m1}) - \hat{p}_{m2} \log(\hat{p}_{m2}) \]
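As a quick worked check of these formulas (the arithmetic below is my own, not part of the original text), evaluate all three quantities at \(\hat{p}_{m1} = \hat{p}_{m2} = 0.5\):
\[E = 1 - 0.5 = 0.5, \qquad G = 0.5(1 - 0.5) + 0.5(1 - 0.5) = 0.5, \qquad D = -0.5\log(0.5) - 0.5\log(0.5) = \log 2 \approx 0.693.\]
This is the point at which all three curves reach their maximum in the plot produced below.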
When we plot the above quantities for a simple two-class classification setting in R, we get the plot below.
library(tidyr)    # pivot_longer()
library(ggplot2)  # ggplot() and friends

# Two classes: p2 is determined by p1.
# Start just above 0 and stop just below 1 so that log() in the entropy is defined.
p1 <- seq(0.001, 0.999, 0.001)
p2 <- 1 - p1
# Classification error rate
classification.error.rate <- 1 - pmax(p1, p2)
# Gini index
gini.index <- p1 * (1 - p1) + p2 * (1 - p2)
# Entropy
entropy <- -p1 * log(p1) - p2 * log(p2)
# Data frame holding the two class proportions and the three quantities,
# reshaped to long format for plotting
df.classification.quantities <- data.frame(p1, p2, classification.error.rate, gini.index, entropy)
df.classification.quantities <- pivot_longer(df.classification.quantities,
                                             cols = c(classification.error.rate, gini.index, entropy),
                                             names_to = "metric")
# Plot the three classification quantities against p1
ggplot(data = df.classification.quantities,
       aes(x = p1, y = value, col = factor(metric))) +
  geom_line(size = 1) +
  scale_y_continuous(breaks = seq(0, 1, 0.1), minor_breaks = NULL) +
  scale_color_hue(labels = c("Classification Error", "Entropy", "Gini")) +
  labs(col = "Metric",
       y = "Value",
       x = "P")
We now work with the OJ data set, which is part of the ISLR package. We create a training set containing a random sample of 800 observations and a test set containing the remaining observations.
library(ISLR)  # OJ data set
library(tree)  # tree(), cv.tree(), prune.misclass()

set.seed(123)
train <- sample(1:nrow(OJ), 800)
oj.train <- OJ[train, ]
oj.test <- OJ[-train, ]
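As a quick sanity check (my own addition, not part of the original exercise), the split can be confirmed by looking at the dimensions of the two sets; the OJ data set has 1070 observations of 18 variables (Purchase plus 17 predictors).
dim(oj.train)  # should be 800  18
dim(oj.test)   # should be 270  18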
We now fit a classification tree to the training data, with Purchase as the response and the other variables as predictors, and use the summary() function to produce summary statistics about the tree. What is the training error rate? How many terminal nodes does the tree have?
tree.oj.model <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj.model)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7625 = 603.9 / 792
## Misclassification error rate: 0.165 = 132 / 800
The classification tree has 8 terminal nodes and a training error rate of 16.5%.
Of the 17 available predictors, the tree's splits use only 2:
(i) LoyalCH - Customer brand loyalty for CH
(ii) PriceDiff - Sale price of MM less sale price of CH
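As a quick sketch to double-check the reported training error rate (my own addition, using only objects defined above):
# Training error rate computed directly from the tree's predictions on the
# training set; should match the 0.165 reported by summary()
mean(predict(tree.oj.model, newdata = oj.train, type = "class") != oj.train$Purchase)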
Printing the fitted tree shows the detailed structure of every split:
tree.oj.model
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1071.00 CH ( 0.60875 0.39125 )
## 2) LoyalCH < 0.5036 350 415.10 MM ( 0.28000 0.72000 )
## 4) LoyalCH < 0.276142 170 131.00 MM ( 0.12941 0.87059 )
## 8) LoyalCH < 0.0356415 56 10.03 MM ( 0.01786 0.98214 ) *
## 9) LoyalCH > 0.0356415 114 108.90 MM ( 0.18421 0.81579 ) *
## 5) LoyalCH > 0.276142 180 245.20 MM ( 0.42222 0.57778 )
## 10) PriceDiff < 0.05 74 74.61 MM ( 0.20270 0.79730 ) *
## 11) PriceDiff > 0.05 106 144.50 CH ( 0.57547 0.42453 ) *
## 3) LoyalCH > 0.5036 450 357.10 CH ( 0.86444 0.13556 )
## 6) PriceDiff < -0.39 27 32.82 MM ( 0.29630 0.70370 ) *
## 7) PriceDiff > -0.39 423 273.70 CH ( 0.90071 0.09929 )
## 14) LoyalCH < 0.705326 130 135.50 CH ( 0.78462 0.21538 )
## 28) PriceDiff < 0.145 43 58.47 CH ( 0.58140 0.41860 ) *
## 29) PriceDiff > 0.145 87 62.07 CH ( 0.88506 0.11494 ) *
## 15) LoyalCH > 0.705326 293 112.50 CH ( 0.95222 0.04778 ) *
I am picking terminal node 8 (marked with an asterisk). The split criterion for that node is LoyalCH < 0.0356415, with 56 observations in the branch. The deviance of the branch is 10.03 and the overall prediction for the branch is MM. Within that branch, approximately 98% of the observations are MM and approximately 2% are CH.
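As an optional sketch (my addition), the printed deviance of 10.03 can be reproduced from the node's class counts, which I take to be 1 CH and 55 MM observations (56 × 0.01786 ≈ 1), using the node deviance formula \(-2 \sum_{k} n_{mk} \log(n_{mk}/n_m)\):
# Reproduce the deviance of terminal node 8 from its assumed class counts
n.ch <- 1
n.mm <- 55
n <- n.ch + n.mm
-2 * (n.ch * log(n.ch / n) + n.mm * log(n.mm / n))  # approximately 10.03, as printed above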
plot(tree.oj.model)
text(tree.oj.model, pretty=0, cex = 0.5)
From the tree plot, we can identify that LoyalCH is the most important variable, followed by PriceDiff, for predicting Purchase. The first split separates customers by the intensity of their brand loyalty for Citrus Hill (LoyalCH); the splits at the top of the tree are on LoyalCH, while PriceDiff appears in the splits further down. There are 8 terminal nodes, divided equally between CH and MM predictions.
At the first split, customers who are less loyal to Citrus Hill (CH) orange juice go to the left branch and those who are more loyal go to the right. On the more loyal side, PriceDiff drives the subsequent splits: for example, if PriceDiff is less than -0.39, the customer is predicted to purchase Minute Maid orange juice.
The predictions on the test set and the resulting confusion matrix are:
tree.oj.pred <- predict(tree.oj.model, oj.test, type = "class")
table(tree.oj.pred, oj.test$Purchase)
##
## tree.oj.pred CH MM
## CH 150 34
## MM 16 70
\[\text{Test error rate} = 1 - \frac{150 + 70}{270} = \frac{50}{270} \approx 0.1852\]
So the test error rate is approximately 18.5%.
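Equivalently (a one-line sketch I am adding, using the objects defined above), the test error rate can be computed directly:
# Proportion of test observations misclassified by the unpruned tree
mean(tree.oj.pred != oj.test$Purchase)  # approximately 0.185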
We now apply the cv.tree() function to the training set in order to determine the optimal tree size, passing FUN = prune.misclass so that the classification error rate (rather than the deviance) guides the cross-validation and pruning.
set.seed(2)
cv.tree.model <- cv.tree(tree.oj.model, K = 10, FUN = prune.misclass)
cv.tree.model
## $size
## [1] 8 5 3 2 1
##
## $dev
## [1] 138 138 162 165 313
##
## $k
## [1] -Inf 0 8 11 154
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
library(dplyr)  # mutate()

# Construct a data frame with tree size and the CV error rate
# (with prune.misclass, cv.tree() reports $dev as the number of CV misclassifications)
df.cv.oj <- data.frame(Tree.Size = cv.tree.model$size,
                       CV.Error = cv.tree.model$dev / nrow(oj.train))
# Flag the tree size(s) with the minimum CV error rate
df.cv.oj <- mutate(df.cv.oj, min.cv.error = as.numeric(min(CV.Error) == CV.Error))
# Plot the CV error rate against tree size, highlighting the minimum
ggplot(df.cv.oj, aes(x = Tree.Size, y = CV.Error)) +
  geom_line(col = "black") +
  geom_point(size = 2, aes(col = factor(min.cv.error))) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'),
        legend.position = "none") +
  labs(title = "OJ Dataset - Cross Validation",
       x = "Tree size",
       y = "CV - Error")
From the cross-validation plot above (part (g)), we see that trees with 5 and 8 terminal nodes achieve the same minimum cross-validation error rate. Since the simpler tree is preferable, we select the model with 5 terminal nodes.
So, for part (h), the optimal tree size is 5.
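This choice can also be made programmatically (a small sketch I am adding): pick the smallest tree size whose cross-validation error equals the minimum.
# Smallest tree size achieving the minimum cross-validation error
best.size <- min(cv.tree.model$size[cv.tree.model$dev == min(cv.tree.model$dev)])
best.size  # 5, given the $size and $dev values printed above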
prune.oj <- prune.misclass(tree.oj.model, best = 5)
plot(prune.oj)
text(prune.oj, pretty=0, cex = 0.5)
summary(tree.oj.model)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7625 = 603.9 / 792
## Misclassification error rate: 0.165 = 132 / 800
summary(prune.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj.model, nodes = c(4L, 7L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 5
## Residual mean deviance: 0.826 = 656.6 / 795
## Misclassification error rate: 0.165 = 132 / 800
Since the cross-validation error rate is the same for tree sizes 5 and 8 (as determined in (h)), the pruned and unpruned trees also have the same training error rate of 16.5%, although the pruned tree has a slightly higher residual mean deviance.
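As a sketch (my addition), the training error rates of the two trees can be compared directly; the test error rates are compared in the same way below.
# Training error rate of the unpruned tree; should be 0.165
mean(predict(tree.oj.model, type = "class", newdata = oj.train) != oj.train$Purchase)
# Training error rate of the pruned tree; should also be 0.165
mean(predict(prune.oj, type = "class", newdata = oj.train) != oj.train$Purchase)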
mean(predict(tree.oj.model, type = "class", newdata = oj.test) != oj.test$Purchase)
## [1] 0.1851852
mean(predict(prune.oj, type = "class", newdata = oj.test) != oj.test$Purchase)
## [1] 0.1851852
Since the cross-validation error rate is the same for tree sizes 5 and 8 (as determined in (h)), the pruned and unpruned trees also have the same test error rate of approximately 18.5%.