Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.
phat1 = seq(0,1,0.01)
phat2 = 1 - phat1
We have the following equation for Gini index:
\[ G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \]
We can plug in the hint to get the following:
gini = phat1 * phat2 + phat2 * phat1
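Writing out the two-class case makes the simplification explicit; this is exactly what the line above computes:
\[ G = \hat{p}_{m1}(1 - \hat{p}_{m1}) + \hat{p}_{m2}(1 - \hat{p}_{m2}) = 2\,\hat{p}_{m1}(1 - \hat{p}_{m1}), \]
which reaches its maximum of 0.5 at $\hat{p}_{m1} = 0.5$.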
We can then move on to the classification error below:
\[ E = 1 - \max_{k} (\hat{p}_{mk}). \]
Which we can translate into R:
e <- 1 - pmax(phat1, phat2)
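In the two-class case this becomes
\[ E = 1 - \max(\hat{p}_{m1},\, 1 - \hat{p}_{m1}), \]
which is 0 at the endpoints and peaks at 0.5 when $\hat{p}_{m1} = 0.5$; this is the piecewise-linear curve in the plot.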
We finally move on to entropy below:
\[ D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \]
Which is the following in R:
d <- -(phat1 * log(phat1) + phat2 * log(phat2))
Now we can plot everything together:
impurity_df <- data.frame(
phat1 = phat1,
Gini = gini,
Classification_Error = e,
Entropy = d
)
impurity_long <- pivot_longer(impurity_df,
cols = c(Gini, Classification_Error, Entropy),
names_to = "Measure",
values_to = "Value")
ggplot(impurity_long, aes(x = phat1, y = Value, color = Measure)) +
geom_line(linewidth = 1) +
labs(
x = expression(hat(p)[1]),
y = "Error",
color = "Measure"
) +
theme_minimal() +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
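The warning about removed rows comes from the entropy curve: at $\hat{p}_{m1} = 0$ and $\hat{p}_{m1} = 1$ the expression 0 * log(0) evaluates to NaN in R, so ggplot drops those two points. If we wanted the curve to extend to the endpoints, one option (a sketch, adopting the convention $0 \log 0 = 0$) is:
# define 0 * log(0) as 0 so the entropy is 0 at the endpoints
d <- ifelse(phat1 == 0 | phat1 == 1, 0,
            -(phat1 * log(phat1) + phat2 * log(phat2)))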
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
# 80 20 train test split
set.seed(123)
train_idx <- createDataPartition(Carseats$Sales, p = 0.8, list = FALSE)
train <- Carseats[train_idx,]
test <- Carseats[-train_idx,]
tree <- tree(Sales~., data=train)
plot(tree)
text(tree, pretty=0)
# summary(tree)
tree_preds <- predict(tree, newdata = test)
cat('MSE of regression trees: ', mean((test$Sales - tree_preds)^2))
## MSE of regression trees: 3.989909
We calculate a test MSE of 3.99 for the regression tree. From the plot, the most important split variable is shelving location (ShelveLoc), followed by Price.
cv_tree <- cv.tree(tree)
plot(cv_tree$size, cv_tree$dev, type='b')
We see the lowest cross-validated deviance at the largest size shown (20), but a tree that large may be overfitting; the error curve plateaus around size 10, so we prune the tree to 10 terminal nodes and compare.
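Rather than reading it off the plot, the size minimizing the cross-validated deviance can be pulled out of the cv.tree object directly (going by the plot, this would be the full size of 20); a quick sketch:
# tree size with the lowest cross-validated deviance
best_size <- cv_tree$size[which.min(cv_tree$dev)]
best_size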
prune_tree <- prune.tree(tree, best = 10)
plot(prune_tree)
text(prune_tree, pretty=0)
prune_preds <- predict(prune_tree, newdata=test)
cat('MSE of pruned regression trees: ', mean((test$Sales - prune_preds)^2))
## MSE of pruned regression trees: 4.17508
The pruned tree gives a test MSE of 4.18, slightly worse than the unpruned tree, so pruning does not improve test performance here.
bag <- randomForest(Sales~., data=train, mtry=10, importance=TRUE) # mtry = 10 uses all predictors, i.e. bagging
bag_preds <- predict(bag, newdata=test)
cat('MSE of bagging: ', mean((test$Sales - bag_preds)^2))
## MSE of bagging: 2.207355
Bagging performs much better, with a test MSE of 2.21. We can examine which variables are important next:
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 32.9658615 261.321866
## Income 8.3720174 119.142486
## Advertising 21.9712398 171.089174
## Population -0.8515952 78.469427
## Price 73.3770038 735.787501
## ShelveLoc 77.6120330 786.114217
## Age 23.7411566 245.395461
## Education 2.3243803 65.264136
## Urban 0.4501544 11.374272
## US 2.0875754 7.874132
From this table, ShelveLoc and Price are by far the most important variables, followed by CompPrice, Age, and Advertising. Each of these produces a large increase in node purity when used for splitting and a large increase in MSE when its values are permuted out of bag.
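The same information can also be shown graphically with varImpPlot() from the randomForest package (plot not included in the original output):
# dot charts of %IncMSE and IncNodePurity for the bagged model
varImpPlot(bag, main = "Variable importance (bagging)")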
rf <- randomForest(Sales~., data=train, mtry=3, importance=TRUE) # mtry = 3, roughly p/3, the usual choice for regression
rf_preds <- predict(rf, newdata=test)
cat('MSE of random forest: ', mean((test$Sales - rf_preds)^2))
## MSE of random forest: 2.687778
We calculate a test MSE for random forest at 2.69. We can check the importance once again:
importance(rf)
## %IncMSE IncNodePurity
## CompPrice 13.3317563 227.72227
## Income 6.5171317 196.31469
## Advertising 13.2852877 207.80164
## Population -2.2650990 145.37736
## Price 42.8014305 580.20990
## ShelveLoc 46.9570829 603.81542
## Age 15.2863156 282.51052
## Education 2.8357733 105.57985
## Urban 0.2410994 21.99124
## US 5.2215714 32.28122
Once again ShelveLoc and Price dominate, with Age, CompPrice, and Advertising next; this agrees with the results we got from bagging.
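As a side check, one could vary mtry and watch how the test MSE responds; a rough sketch using the same train/test split (results will depend on the random seed):
# sweep mtry from 1 (fully decorrelated trees) to 10 (bagging)
mtry_vals <- 1:10
rf_mse <- sapply(mtry_vals, function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m)
  mean((test$Sales - predict(fit, newdata = test))^2)
})
plot(mtry_vals, rf_mse, type = "b", xlab = "mtry", ylab = "Test MSE")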
xtrain <- train[,2:11] # predictors (Sales is column 1)
xtest <- test[,2:11]
ytrain <- train[,1] # response: Sales
ytest <- test[,1]
bart <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 321, 14, 79
## y1,yn: 3.754922, -1.525078
## x1,x[n*p]: 111.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.198224,7.46508
## *****sigma: 1.008771
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 4s
## trcnt,tecnt: 1000,1000
bart_preds <- bart$yhat.train.mean # note: these are in-sample fits; the test predictions are in bart$yhat.test.mean
cat('MSE of BART: ', mean((test$Sales - bart_preds)^2))
## MSE of BART: 13.31927
The reported MSE of 13.32 is misleading: yhat.train.mean holds the fitted values for the 321 training observations, so the comparison above pairs them (with recycling) against the 79 test responses rather than giving a genuine test error.
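A corrected evaluation uses the posterior mean predictions for the test set, which gbart() returns when x.test is supplied; a minimal sketch (the resulting value is not shown since it was not part of the original run):
# test MSE for BART using the test-set posterior means
bart_test_preds <- bart$yhat.test.mean
mean((test$Sales - bart_test_preds)^2)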
This problem involves the OJ data set which is part of the ISLR2 package.
str(OJ)
## 'data.frame': 1070 obs. of 18 variables:
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: num 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : num 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : num 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : num 1 1 1 1 0 0 0 0 0 0 ...
library(tree)
set.seed(123)
train_idx <- sample(1:length(OJ$Purchase), size = 800) # train indices
train <- OJ[train_idx,]
test <- OJ[-train_idx,]
tree <- tree(Purchase ~., data=train)
summary(tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7625 = 603.9 / 792
## Misclassification error rate: 0.165 = 132 / 800
The training error here is the misclassification error rate, which is 16.5% for this tree. The tree has 8 terminal nodes and a residual mean deviance of 0.7625.
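The residual mean deviance reported by summary() is the tree deviance divided by $n - |T_0|$, the number of observations minus the number of terminal nodes:
\[ \frac{-2 \sum_{m} \sum_{k} n_{mk} \log \hat{p}_{mk}}{n - |T_0|} = \frac{603.9}{800 - 8} \approx 0.7625. \]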
tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1071.00 CH ( 0.60875 0.39125 )
## 2) LoyalCH < 0.5036 350 415.10 MM ( 0.28000 0.72000 )
## 4) LoyalCH < 0.276142 170 131.00 MM ( 0.12941 0.87059 )
## 8) LoyalCH < 0.0356415 56 10.03 MM ( 0.01786 0.98214 ) *
## 9) LoyalCH > 0.0356415 114 108.90 MM ( 0.18421 0.81579 ) *
## 5) LoyalCH > 0.276142 180 245.20 MM ( 0.42222 0.57778 )
## 10) PriceDiff < 0.05 74 74.61 MM ( 0.20270 0.79730 ) *
## 11) PriceDiff > 0.05 106 144.50 CH ( 0.57547 0.42453 ) *
## 3) LoyalCH > 0.5036 450 357.10 CH ( 0.86444 0.13556 )
## 6) PriceDiff < -0.39 27 32.82 MM ( 0.29630 0.70370 ) *
## 7) PriceDiff > -0.39 423 273.70 CH ( 0.90071 0.09929 )
## 14) LoyalCH < 0.705326 130 135.50 CH ( 0.78462 0.21538 )
## 28) PriceDiff < 0.145 43 58.47 CH ( 0.58140 0.41860 ) *
## 29) PriceDiff > 0.145 87 62.07 CH ( 0.88506 0.11494 ) *
## 15) LoyalCH > 0.705326 293 112.50 CH ( 0.95222 0.04778 ) *
Let us examine node 8, which is a terminal node. To reach it, the first split is LoyalCH < 0.5036, which leaves 350 observations. The next split is LoyalCH < 0.276, narrowing this to 170 observations. The final split is LoyalCH < 0.0356, leaving 56 observations, all classified as MM: 98.2% of the training observations in this node purchased MM, meaning only about one person in this group purchased CH. This outcome makes sense: people with very low loyalty to CH are very unlikely to buy CH.
plot(tree)
text(tree, pretty = 0)
In this tree diagram, the left branch at each split corresponds to the condition being true and the right branch to it being false. We can see visually the same path as in the previous part: observations with LoyalCH < 0.5036, then LoyalCH < 0.276, then LoyalCH < 0.0356 are classified as MM.
The full tree shows an overall trend: customers with LoyalCH < 0.5 are unlikely to purchase CH, while those with higher loyalty are likely to purchase it. The exceptions involve PriceDiff (the sale price of MM minus the sale price of CH): on the low-loyalty side, customers still tend to buy CH when MM is sufficiently more expensive (PriceDiff > 0.05), while on the high-loyalty side customers switch to MM only when MM is much cheaper (PriceDiff < -0.39).
Interestingly, some splits produce two terminal nodes with the same predicted class (such as nodes 8 and 9 from the previous part). Such splits are still made because they increase node purity: the estimated probability of MM is much higher in node 8 (0.982) than in node 9 (0.816).
tree_preds <- predict(tree, newdata = test)
tree_preds <- as.factor(ifelse(tree_preds[,1] >= 0.5, "CH","MM")) # classify as CH or MM
confusionMatrix(as.factor(test$Purchase), tree_preds)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 150 16
## MM 34 70
##
## Accuracy : 0.8148
## 95% CI : (0.7633, 0.8593)
## No Information Rate : 0.6815
## P-Value [Acc > NIR] : 5.99e-07
##
## Kappa : 0.596
##
## Mcnemar's Test P-Value : 0.01621
##
## Sensitivity : 0.8152
## Specificity : 0.8140
## Pos Pred Value : 0.9036
## Neg Pred Value : 0.6731
## Prevalence : 0.6815
## Detection Rate : 0.5556
## Detection Prevalence : 0.6148
## Balanced Accuracy : 0.8146
##
## 'Positive' Class : CH
##
cat('misclassification error: ', mean(tree_preds != as.factor(test$Purchase)))
## misclassification error: 0.1851852
The test misclassification error is 18.5%, slightly higher than the 16.5% training error.
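As an aside, predict() for a tree classifier can return class labels directly via type = "class", which should match the manual thresholding above (up to how ties at 0.5 are broken):
# equivalent test error using predicted class labels
tree_class <- predict(tree, newdata = test, type = "class")
mean(tree_class != test$Purchase)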
cv_tree <- cv.tree(tree) # uses prune.tree with cross validation
print(cv_tree)
## $size
## [1] 8 7 6 5 4 3 2 1
##
## $dev
## [1] 690.2346 692.8691 683.0423 717.4446 717.4446 743.3604 793.1266
## [8] 1073.0916
##
## $k
## [1] -Inf 12.03823 14.92474 25.76707 26.02613 38.91686 50.61655
## [8] 298.68751
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv_tree)
We see the lowest cross-validated deviance, 683, at size 6, meaning the optimal pruned tree has 6 terminal nodes.
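By default cv.tree() guides pruning by deviance; for a classification tree it can also cross-validate on the misclassification rate, a sketch:
# cross-validate using misclassification error as the pruning criterion
cv_misclass <- cv.tree(tree, FUN = prune.misclass)
plot(cv_misclass$size, cv_misclass$dev, type = "b") # dev here counts CV misclassifications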
tree_pruned <- prune.tree(tree, best = 6)
plot(tree_pruned)
text(tree_pruned, pretty = 0)
summary(tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7625 = 603.9 / 792
## Misclassification error rate: 0.165 = 132 / 800
summary(tree_pruned)
##
## Classification tree:
## snip.tree(tree = tree, nodes = c(4L, 14L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 6
## Residual mean deviance: 0.7945 = 630.9 / 794
## Misclassification error rate: 0.165 = 132 / 800
The training misclassification error is 16.5% for both trees. The two splits that were pruned (at nodes 4 and 14) each produced children with the same predicted class, so removing them does not change any predicted labels; only the residual mean deviance increases slightly, from 0.7625 to 0.7945.
prune_preds <- predict(tree_pruned, newdata = test)
prune_preds <- as.factor(ifelse(prune_preds[,1] >= 0.5, "CH","MM")) # classify as CH or MM
cat('misclassification error for full tree: ', mean(tree_preds != as.factor(test$Purchase)), '\n')
## misclassification error for full tree: 0.1851852
cat('misclassification error for pruned tree: ', mean(prune_preds != as.factor(test$Purchase)))
## misclassification error for pruned tree: 0.1851852
We see the same test error after pruning, for the same reason: the pruned splits only improved node purity, not the predicted class, so the test classifications, and hence the misclassification rate, are unchanged.