Problem 1. A.
library(ISLR2)
#splits data into test and train sets
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2) #split into train/test 50/50
test <- Carseats[-train, ]
carseats.test <- Carseats[-train, "Sales"]
library(tree)
tree.carseats <- tree(Sales ~ ., data = Carseats, subset = train)
Warning: NAs introduced by coercion
summary(tree.carseats)
Regression tree:
tree(formula = Sales ~ ., data = Carseats, subset = train)
Variables actually used in tree construction:
[1] "Price" "Population" "CompPrice" "Income"
[5] "Advertising" "Age" "Education"
Number of terminal nodes: 18
Residual mean deviance: 2.726 = 496.2 / 182
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.361000 -1.092000 0.006545 0.000000 1.125000 4.989000
plot(tree.carseats)
text(tree.carseats, pretty = 0)
yhat <- predict(tree.carseats, newdata = test)
Warning: NAs introduced by coercion
mean((yhat - carseats.test)^2)
[1] 7.871697
The plotted tree is hard to read because the node labels overlap; the test MSE for the unpruned tree is 7.87.
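One possible way to make the plot more legible (just a sketch, shrinking the node labels via the cex argument passed through text()):
plot(tree.carseats)
text(tree.carseats, pretty = 0, cex = 0.6) # smaller labels overlap less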
set.seed(1)
cv.carseats = cv.tree(tree.carseats)
Warning: NAs introduced by coercion (warning repeated for each cross-validation fit)
cv.carseats
$size
[1] 18 17 16 14 13 12 11 10 7 6 5 4 2 1
$dev
[1] 1322.684 1354.387 1340.895 1293.154 1323.226 1322.193
[7] 1270.718 1320.904 1327.335 1337.650 1323.785 1433.244
[13] 1473.722 1590.085
$k
[1] -Inf 15.99975 18.35626 20.65982 20.95202
[6] 22.72688 29.47418 38.15698 40.27422 41.01065
[11] 44.44327 93.39728 160.12831 269.57228
$method
[1] "deviance"
attr(,"class")
[1] "prune" "tree.sequence"
plot(cv.carseats$size, cv.carseats$dev, type = "b")
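Rather than reading the best size off the plot, it can be taken from the cross-validation object directly (a small sketch); with the output above the minimum deviance, 1270.718, occurs at size 11, which is the value used in prune.tree() below.
best.size <- cv.carseats$size[which.min(cv.carseats$dev)]
best.size # 11 for the results shown above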
prune.carseats <- prune.tree(tree.carseats, best = 11) # prune.tree() for regression trees; prune.misclass() for classification trees
plot(prune.carseats)
text(prune.carseats, pretty = 0)
yhat <- predict(prune.carseats, newdata = test)
Warning: NAs introduced by coercion
mean((yhat - carseats.test)^2) # test MSE for the tree pruned to 11 terminal nodes
[1] 7.60365
Pruning slightly decreases the test MSE to 7.60, compared with 7.87 for the unpruned tree.
library(randomForest)
set.seed(1)
bag.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 10, importance = TRUE)
bag.carseats
Call:
randomForest(formula = Sales ~ ., data = Carseats, mtry = 10, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 3.195026
% Var explained: 59.37
set.seed(1)
yhat.bag <- predict(bag.carseats, newdata = test)
mean((yhat.bag - carseats.test)^2) #better test error with bagging
[1] 3.113233
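As a cross-check (a small sketch; randomForest stores the out-of-bag error in the mse component of the fit), the "Mean of squared residuals" printed above can be pulled straight from the object and compared with the test MSE:
tail(bag.carseats$mse, 1) # OOB MSE after all 500 trees, about 3.20 here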
importance(bag.carseats)
%IncMSE IncNodePurity
CompPrice 24.264977 214.88031
Income 6.330426 109.78464
Advertising 8.366619 99.70299
Population 2.407187 85.28203
Price 51.679914 526.59906
ShelveLoc 26.392677 223.56530
Age 12.968903 181.57228
Education 1.509590 55.68405
Urban 2.003649 10.43385
US 3.559154 16.29080
varImpPlot(bag.carseats)
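To make the ranking explicit, the %IncMSE column of the importance matrix can be sorted (a quick sketch); with the numbers above, Price and ShelveLoc come out well ahead of the rest:
sort(importance(bag.carseats)[, "%IncMSE"], decreasing = TRUE)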
set.seed(1)
rf.carseats <- randomForest(Sales ~ ., data = Carseats,
subset = train, mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.carseats, newdata = test)
mean((yhat.rf - carseats.test)^2)
[1] 2.995462
Using m = 6 variables at each split, the random forest gives a test MSE of 2.99, the best so far.
importance(rf.carseats)
%IncMSE IncNodePurity
CompPrice 21.6494396 192.25341
Income 4.5460176 122.37487
Advertising 9.7957496 102.45053
Population -0.5380075 86.51329
Price 47.2200892 493.91143
ShelveLoc 28.3374685 245.16604
Age 13.6683556 177.28292
Education -1.4358224 62.86011
Urban -0.7109767 11.97869
US 4.1017980 24.78830
# Test MSE for every value of m (mtry) from 1 to 10
mse.list <- numeric(10)
for (x in 1:10) {
  set.seed(1)
  rf.carseats <- randomForest(Sales ~ ., data = Carseats,
                              subset = train, mtry = x, importance = TRUE)
  yhat.rf <- predict(rf.carseats, newdata = test)
  mse.list[x] <- mean((yhat.rf - carseats.test)^2)
}
min.mse <- min(mse.list)
plot(mse.list, ylab = "Test MSE", xlab = "m")
points(which.min(mse.list), min.mse, col = "red")
As the plot shows, the test MSE drops sharply as m increases from 1 to about 4; beyond that it changes only slightly, with the minimum occurring at m = 6.
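A quick numeric check of this claim, using the mse.list vector built in the loop above:
which.min(mse.list)           # 6: the value of m with the lowest test MSE
mse.list[which.min(mse.list)] # about 2.99, matching the mtry = 6 fit earlier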
Problem 2.
set.seed(1)
train = sample(1:nrow(OJ), 800)
test = OJ[-train, ]
OJ.test <- OJ[-train, "Purchase"]
set.seed(1)
tree.OJ <- tree(Purchase ~ ., OJ, subset = train)
summary(tree.OJ)
Classification tree:
tree(formula = Purchase ~ ., data = OJ, subset = train)
Variables actually used in tree construction:
[1] "LoyalCH" "PriceDiff" "SpecialCH"
[4] "ListPriceDiff" "PctDiscMM"
Number of terminal nodes: 9
Residual mean deviance: 0.7432 = 587.8 / 791
Misclassification error rate: 0.1588 = 127 / 800
Only 5 variables are used in constructing the tree. The training misclassification error rate is 0.1588, and the tree has 9 terminal nodes.
tree.OJ
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 800 1073.00 CH ( 0.60625 0.39375 )
2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Terminal node 7: this node corresponds to the split LoyalCH > 0.764572. It contains 261 observations with a deviance of 91.20 and predicts CH. About 95.8% of the observations in this node are CH and the remaining 4.2% are MM.
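As a sanity check, the deviance printed for a node is -2 * sum(n_k * log(p_k)) over the classes. For node 7 (261 observations, so roughly 250 CH and 11 MM) this reproduces the 91.20 shown above:
-2 * (250 * log(0.95785) + 11 * log(0.04215)) # approximately 91.2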
plot(tree.OJ)
text(tree.OJ, pretty = 0)
The tree is somewhat messy since the text overlaps in places, but it can be interpreted with some effort. LoyalCH appears to be the most important variable, as it is used for the first (root) split.
tree.pred = predict(tree.OJ, test,
type = "class")
table(tree.pred, OJ.test)
OJ.test
tree.pred CH MM
CH 160 38
MM 8 64
100 - ((160 + 64) / 270) * 100
[1] 17.03704
The test error rate is 17.04%.
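Equivalently, the test error rate can be computed directly from the predictions instead of hand-counting cells of the table:
mean(tree.pred != OJ.test) * 100 # about 17.04 with the confusion matrix above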
set.seed(1)
cv.OJ = cv.tree(tree.OJ, FUN = prune.misclass)
cv.OJ
$size
[1] 9 8 7 4 2 1
$dev
[1] 145 145 146 146 167 315
$k
[1] -Inf 0.000000 3.000000 4.333333 10.500000
[6] 151.000000
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
min.dev <- min(cv.OJ$dev)
plot(cv.OJ$size, cv.OJ$dev, type = "b")
points(cv.OJ$size[match(min.dev, cv.OJ$dev)],
       min.dev, col = "red")
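The sizes achieving the minimum can also be listed directly from the CV object (a quick check):
cv.OJ$size[cv.OJ$dev == min(cv.OJ$dev)] # 9 and 8 with the output above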
Tree sizes of 9 and 8 terminal nodes tie for the smallest cross-validated error (145 misclassifications), so cross-validation does not favor a substantially smaller tree; a tree with five terminal nodes is pruned below for comparison.
prune.OJ <- prune.misclass(tree.OJ, best = 5) # prune.misclass() for classification trees; prune.tree() for regression trees
plot(prune.OJ)
text(prune.OJ, pretty = 0)
tree.pred <- predict(prune.OJ, newdata = test,
type = "class")
table(tree.pred, OJ.test)
OJ.test
tree.pred CH MM
CH 160 36
MM 8 66
100 - (((160 + 66) / 270) * 100)
[1] 16.2963
The test error rate for the pruned tree is 16.30%, slightly better than the 17.04% error rate of the unpruned tree. The pruned tree is also easier to interpret.