3.) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
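In the two-class setting, writing $\hat{p} = \hat{p}_{m1}$ for the proportion of training observations in region $m$ that belong to class 1, the three measures reduce to the Gini index $G = 2\hat{p}(1 - \hat{p})$, the cross-entropy $D = -\hat{p}\log\hat{p} - (1 - \hat{p})\log(1 - \hat{p})$, and the classification error $E = 1 - \max(\hat{p},\, 1 - \hat{p})$; these are the quantities computed below.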
# Proportion of class-1 observations in a node, from 0 to 1
p = seq(0, 1, 0.01)
gini = 2 * p * (1 - p)
entropy = -(p * log(p) + (1 - p) * log(1 - p))  # NaN at p = 0 and p = 1
class.err = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class.err), type = "l", lty = 1,
        col = c("orange", "blue", "green"),
        xlab = expression(hat(p)[m1]), ylab = "Value")
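A legend helps distinguish the three curves; the labels below follow the column order passed to matplot:
legend("topright", legend = c("Gini index", "Entropy", "Classification error"),
       col = c("orange", "blue", "green"), lty = 1)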
8.) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
attach(Carseats)
set.seed(96)
# Hold out one third of the observations as a test set and train on the remaining two thirds
train = sample(dim(Carseats)[1], dim(Carseats)[1]/3)
Carseats.train = Carseats[-train, ]
Carseats.test = Carseats[train, ]
library(tree)
## Warning: package 'tree' was built under R version 4.1.3
tree.carseats = tree(Sales ~ ., data = Carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Income" "CompPrice"
## [6] "Education" "Advertising"
## Number of terminal nodes: 19
## Residual mean deviance: 2.408 = 597.1 / 248
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.87000 -1.10000 0.06133 0.00000 1.05900 3.93500
plot(tree.carseats)
text(tree.carseats, pretty = 0)
pred.carseats = predict(tree.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.carseats)^2)
## [1] 4.106416
The test MSE of the unpruned regression tree is roughly 4.11.
cv.carseats = cv.tree(tree.carseats, FUN = prune.tree)
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")
A tree with 13 terminal nodes gives the lowest cross-validation deviance.
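The best size can also be extracted directly from the components returned by cv.tree, rather than read off the plot:
cv.carseats$size[which.min(cv.carseats$dev)]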
pruned.carseats = prune.tree(tree.carseats, best = 13)
par(mfrow = c(1, 1))
plot(pruned.carseats)
text(pruned.carseats, pretty = 0)
pred.pruned = predict(pruned.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.pruned)^2)
## [1] 4.316175
Pruning the tree to 13 terminal nodes slightly increases the test MSE, to roughly 4.32.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
# Bagging: consider all 10 predictors at each split (mtry = p)
bag.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 10, ntree = 500,
                            importance = TRUE)
bag.pred = predict(bag.carseats, Carseats.test)
mean((Carseats.test$Sales - bag.pred)^2)
## [1] 2.387076
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 27.2327728 254.929288
## Income 8.1227939 111.716806
## Advertising 19.0661236 157.406322
## Population -0.8751688 65.173074
## Price 57.8265312 519.423753
## ShelveLoc 73.3318600 756.953261
## Age 17.8904175 190.879169
## Education 5.5626592 76.472595
## Urban -0.1915783 11.364312
## US 0.9419029 7.527342
Bagging reduces the test MSE to roughly 2.39. By both importance measures, ShelveLoc and Price are by far the most important variables, followed by CompPrice, Advertising, and Age.
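The same importance measures can also be displayed graphically with randomForest's varImpPlot:
varImpPlot(bag.carseats)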
# Random forest: consider only 5 of the 10 predictors at each split to decorrelate the trees
rf.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 5, ntree = 500,
                           importance = TRUE)
rf.pred = predict(rf.carseats, Carseats.test)
mean((Carseats.test$Sales - rf.pred)^2)
## [1] 2.292494
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 23.4940193 228.09041
## Income 5.2477228 129.95889
## Advertising 17.9395611 196.25554
## Population -2.0898898 93.32153
## Price 49.7200070 506.39863
## ShelveLoc 59.9138565 624.47180
## Age 16.7206769 219.21207
## Education 2.4953601 83.05996
## Urban -1.6208138 14.78184
## US 0.8603896 17.17162
The random forest lowers the test MSE slightly further, to roughly 2.29. ShelveLoc and Price remain the most important variables, followed by CompPrice, Advertising, and Age.
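To examine how the number of variables considered at each split affects the test error, the fit can be repeated over a range of mtry values. This is a sketch; the loop and object names are illustrative and not part of the original analysis:
test.mse = rep(NA, 10)
for (m in 1:10) {
  fit = randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  test.mse[m] = mean((Carseats.test$Sales - predict(fit, Carseats.test))^2)
}
plot(1:10, test.mse, type = "b", xlab = "mtry", ylab = "Test MSE")
Typically the test error falls sharply for small values of mtry and then levels off as mtry approaches the number of predictors.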
9.) This problem involves the OJ data set which is part of the ISLR2 package.
library(ISLR)
attach(OJ)
set.seed(12)
train = sample(dim(OJ)[1], 802)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
library(tree)
oj.tree = tree(Purchase ~ ., data = OJ.train)
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.765 = 608.2 / 795
## Misclassification error rate: 0.1584 = 127 / 802
The tree uses four variables: LoyalCH, SalePriceMM, PriceDiff, and ListPriceDiff. It has 7 terminal nodes, and the training (misclassification) error rate is 0.1584.
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 802 1072.00 CH ( 0.61097 0.38903 )
## 2) LoyalCH < 0.482935 305 331.00 MM ( 0.23279 0.76721 )
## 4) LoyalCH < 0.0589885 69 18.11 MM ( 0.02899 0.97101 ) *
## 5) LoyalCH > 0.0589885 236 285.20 MM ( 0.29237 0.70763 )
## 10) SalePriceMM < 2.04 125 125.10 MM ( 0.20000 0.80000 ) *
## 11) SalePriceMM > 2.04 111 149.10 MM ( 0.39640 0.60360 ) *
## 3) LoyalCH > 0.482935 497 432.00 CH ( 0.84306 0.15694 )
## 6) LoyalCH < 0.764572 222 270.20 CH ( 0.70270 0.29730 )
## 12) PriceDiff < 0.085 73 99.54 MM ( 0.42466 0.57534 )
## 24) ListPriceDiff < 0.235 48 56.07 MM ( 0.27083 0.72917 ) *
## 25) ListPriceDiff > 0.235 25 29.65 CH ( 0.72000 0.28000 ) *
## 13) PriceDiff > 0.085 149 131.60 CH ( 0.83893 0.16107 ) *
## 7) LoyalCH > 0.764572 275 98.63 CH ( 0.95636 0.04364 ) *
Let’s pick the terminal node labeled “4)”. The split leading to this node is LoyalCH < 0.0589885, and the node contains 69 observations with a deviance of 18.11. The * indicates that it is a terminal node. The prediction at this node is Purchase = MM: about 2.9% of the observations in this node have Purchase = CH, and the remaining 97.1% have Purchase = MM.
plot(oj.tree)
text(oj.tree, pretty = 0)
LoyalCH is the most important variable in the tree; in fact the top three splits all use LoyalCH. If LoyalCH < 0.48, the tree predicts MM; if LoyalCH > 0.76, it predicts CH. For intermediate values of LoyalCH, the prediction also depends on PriceDiff and ListPriceDiff.
oj.pred = predict(oj.tree, OJ.test, type = "class")
table(OJ.test$Purchase, oj.pred)
##     oj.pred
##       CH  MM
##   CH 127  36
##   MM  15  90
1 - ((127 + 90) / (127 + 36 + 15 + 90))
## [1] 0.1902985
The test error rate is roughly 19%.
cv.oj = cv.tree(oj.tree, FUN = prune.tree)
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
The lowest cross-validation deviance occurs at the full tree size of 7, so requesting best = 8 below exceeds the size of the fitted tree and simply returns the unpruned tree (hence the warning).
oj.pruned = prune.tree(oj.tree, best = 8)
## Warning in prune.tree(oj.tree, best = 8): best is bigger than tree size
summary(oj.pruned)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.765 = 608.2 / 795
## Misclassification error rate: 0.1584 = 127 / 802
Because the “pruned” tree is identical to the original tree, its training misclassification error rate is exactly the same: 0.1584.
pred.unpruned = predict(oj.tree, OJ.test, type = "class")
misclass.unpruned = sum(OJ.test$Purchase != pred.unpruned)
misclass.unpruned/length(pred.unpruned)
## [1] 0.1902985
pred.pruned = predict(oj.pruned, OJ.test, type = "class")
misclass.pruned = sum(OJ.test$Purchase != pred.pruned)
misclass.pruned/length(pred.pruned)
## [1] 0.1902985
The pruned and unpruned trees therefore have the same test error rate of roughly 0.190.
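Since cross-validation did not select a genuinely smaller tree, one way to see the cost of pruning is to force a smaller tree, e.g. five terminal nodes, and compare its test error. This is a sketch beyond the original analysis:
oj.pruned5 = prune.misclass(oj.tree, best = 5)
pred.pruned5 = predict(oj.pruned5, OJ.test, type = "class")
mean(OJ.test$Purchase != pred.pruned5)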