We construct a partition of the \((X_1, X_2)\) feature space into six regions \(R_1,\ldots,R_6\) by recursive binary splitting, using cutpoints \(t_1=40\) on \(X_1\); then \(t_2=60\) on \(X_2\) (within \(X_1<40\)); \(t_3=70\) on \(X_1\) (within \(X_1\ge40\)); \(t_4=30\) on \(X_2\) (within \(40\le X_1<70\)); and \(t_5=50\) on \(X_2\) (within \(X_1\ge70\)).
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
# --- Left: the partition ---
plot(NA, xlim = c(0, 100), ylim = c(0, 100),
xlab = expression(X[1]), ylab = expression(X[2]),
main = "Partition of feature space")
# Vertical split at X1 = 40 (t1)
segments(40, 0, 40, 100, lwd = 2)
# Within X1 < 40: split X2 = 60 (t2)
segments(0, 60, 40, 60, lwd = 2)
# Within X1 >= 40: split X1 = 70 (t3)
segments(70, 0, 70, 100, lwd = 2)
# Within 40 <= X1 < 70: split X2 = 30 (t4)
segments(40, 30, 70, 30, lwd = 2)
# Within X1 >= 70: split X2 = 50 (t5)
segments(70, 50, 100, 50, lwd = 2)
# Region labels
text(20, 80, expression(R[1]), cex = 1.3)
text(20, 30, expression(R[2]), cex = 1.3)
text(55, 15, expression(R[3]), cex = 1.3)
text(55, 65, expression(R[4]), cex = 1.3)
text(85, 25, expression(R[5]), cex = 1.3)
text(85, 75, expression(R[6]), cex = 1.3)
# Cutpoint labels
mtext(expression(t[1] == 40), side = 3, at = 40, line = 0.2, cex = 0.8)
mtext(expression(t[3] == 70), side = 3, at = 70, line = 0.2, cex = 0.8)
axis(4, at = c(30, 50, 60),
labels = c(expression(t[4]), expression(t[5]), expression(t[2])),
las = 1, tick = FALSE, line = -1)
# --- Right: the corresponding tree ---
plot(NA, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
xlab = "", ylab = "", main = "Corresponding decision tree")
# Internal nodes
text(5, 9.3, expression(X[1] < 40))
text(2.5, 7.0, expression(X[2] < 60))
text(7.5, 7.0, expression(X[1] < 70))
text(6.5, 4.6, expression(X[2] < 30))
text(8.5, 4.6, expression(X[2] < 50))
# Edges
segments(5, 9.1, 2.5, 7.2); segments(5, 9.1, 7.5, 7.2)
segments(2.5, 6.8, 1.2, 4.8); segments(2.5, 6.8, 3.7, 4.8)
segments(7.5, 6.8, 6.5, 4.8); segments(7.5, 6.8, 8.5, 4.8)
segments(6.5, 4.4, 5.7, 2.5); segments(6.5, 4.4, 7.3, 2.5)
segments(8.5, 4.4, 8.1, 2.5); segments(8.5, 4.4, 8.9, 2.5)
# Leaves
text(1.2, 4.5, expression(R[1]))
text(3.7, 4.5, expression(R[2]))
text(5.7, 2.3, expression(R[3]))
text(7.3, 2.3, expression(R[4]))
text(8.1, 2.3, expression(R[5]))
text(8.9, 2.3, expression(R[6]))
par(mfrow = c(1, 1))
In Algorithm 8.2 we fit, at each iteration \(b\), a tree \(\hat f^{\,b}(x)\) with \(d\) splits to the current residual and update \[\hat f(x) \leftarrow \hat f(x) + \lambda \,\hat f^{\,b}(x).\] After \(B\) iterations, \[\hat f(x) = \sum_{b=1}^{B} \lambda\,\hat f^{\,b}(x).\] With \(d=1\) (a stump), each \(\hat f^{\,b}\) splits on exactly one variable, say \(X_{j(b)}\), so \(\hat f^{\,b}(x) = g_b(X_{j(b)})\) for some single-variable function \(g_b\). Grouping the stumps by which variable they used, \[\hat f(x) = \sum_{j=1}^{p}\,\underbrace{\sum_{b:\,j(b)=j} \lambda\,g_b(X_j)}_{=:\;f_j(X_j)} \;=\; \sum_{j=1}^{p} f_j(X_j),\] which is the desired additive form.
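To see the grouping argument in action, here is a minimal hand-rolled sketch (ours, not the text's Algorithm 8.2 verbatim): boosting depth-one stumps on simulated additive data, then tallying the fitted stumps by split variable. The data-generating model, \(\lambda=0.1\), and \(B=100\) are illustrative choices.
set.seed(1)
n <- 200; p <- 3
X <- matrix(runif(n * p, -2, 2), n, p)
y <- sin(X[, 1]) + X[, 2]^2 + rnorm(n, sd = 0.3) # additive truth; X3 is pure noise
lambda <- 0.1; B <- 100
fit_stump <- function(x, r) {
# best single split on this variable, minimising RSS with leaf means
s <- sort(unique(x))
cuts <- (head(s, -1) + tail(s, -1)) / 2
rss <- vapply(cuts, function(c0) {
sum((r[x < c0] - mean(r[x < c0]))^2) + sum((r[x >= c0] - mean(r[x >= c0]))^2)
}, numeric(1))
k <- which.min(rss)
list(cut = cuts[k], left = mean(r[x < cuts[k]]),
right = mean(r[x >= cuts[k]]), rss = rss[k])
}
resid <- y
split.var <- integer(B)
for (b in 1:B) {
cand <- lapply(1:p, function(j) fit_stump(X[, j], resid))
j <- which.min(vapply(cand, `[[`, numeric(1), "rss"))
st <- cand[[j]]; split.var[b] <- j
resid <- resid - lambda * ifelse(X[, j] < st$cut, st$left, st$right)
}
# Each stump is a function of a single X_j; grouping by j gives the f_j terms.
table(split.var) # splits should concentrate on X1 and X2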
p <- seq(1e-4, 1 - 1e-4, length.out = 200)
gini <- 2 * p * (1 - p)
err <- 1 - pmax(p, 1 - p)
entropy <- -(p * log(p) + (1 - p) * log(1 - p))
plot(p, gini, type = "l", lwd = 2, col = "steelblue",
ylim = c(0, max(entropy)),
xlab = expression(hat(p)[m1]),
ylab = "Impurity",
main = "Node impurity for two classes")
lines(p, err, lwd = 2, col = "firebrick")
lines(p, entropy, lwd = 2, col = "darkgreen")
legend("top", legend = c("Gini", "Classification error", "Entropy"),
col = c("steelblue", "firebrick", "darkgreen"),
lwd = 2, bty = "n")
All three peak at \(\hat p_{m1}=0.5\) and vanish at \(\hat p_{m1}\in\{0,1\}\). Entropy (computed here with the natural logarithm, so peaking at \(\log 2\approx0.69\)) takes the largest values and classification error the smallest. Gini and entropy are differentiable and more sensitive to changes in the node class probabilities than classification error, which is why they are preferred for growing trees; classification error is often preferred for pruning, when accuracy of the final tree is the goal.
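A standard numerical example (adapted from the wider literature, not from this text) makes the point concrete: two candidate splits of a parent node with 400 observations of each class can have identical misclassification rates while Gini clearly prefers the one that creates a pure child node.
gini <- function(p) 2 * p * (1 - p)
# Parent: (400, 400). Split A -> (300,100) and (100,300); Split B -> (200,400) and (200,0).
errA <- (400/800) * (100/400) + (400/800) * (100/400)
errB <- (600/800) * (200/600) + (200/800) * 0
giniA <- (400/800) * gini(300/400) + (400/800) * gini(100/400)
giniB <- (600/800) * gini(200/600) + (200/800) * gini(1)
c(errA = errA, errB = errB) # both 0.25: error cannot tell the splits apart
c(giniA = giniA, giniB = giniB) # 0.375 vs 0.333: Gini favours the pure-node split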
(a) Tree corresponding to the left-hand partition:
plot(NA, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
xlab = "", ylab = "", main = "Tree for partition in Fig. 8.14 (left)")
text(5, 9.3, expression(X[2] < 1))
text(2.5, 7.0, expression(X[1] < 1))
text(7.5, 7.0, "15")
segments(5, 9.1, 2.5, 7.2); segments(5, 9.1, 7.5, 7.2)
text(1.5, 4.6, expression(X[2] < 0.5)) # bottom-left region
text(4.0, 4.6, "5") # X1>=1, X2<1
segments(2.5, 6.8, 1.5, 4.8); segments(2.5, 6.8, 4.0, 4.8)
text(0.7, 2.3, "3") # X1<1, X2<0.5
text(2.3, 2.3, "10") # X1<1, 0.5<=X2<1
segments(1.5, 4.4, 0.7, 2.5); segments(1.5, 4.4, 2.3, 2.5)
(The two regions stacked at \(X_1<1\) with means \(3\) and \(10\) are separated by a horizontal cut, here drawn as \(X_2<0.5\); only the relative ordering matters since the figure does not label that cutpoint.)
(b) Partition corresponding to the right-hand tree (splits \(X_2<1\), \(X_1<1\), \(X_2<2\), \(X_1<0\); leaf means \(-1.80, 0.63, -1.06, 0.21, 2.49\)):
plot(NA, xlim = c(-2, 3), ylim = c(-2, 3),
xlab = expression(X[1]), ylab = expression(X[2]),
main = "Partition for tree in Fig. 8.14 (right)")
# Horizontal split at X2 = 1 (root)
segments(-2, 1, 3, 1, lwd = 2)
# Below X2<1: split X1 = 1
segments(1, -2, 1, 1, lwd = 2)
# Above X2>=1: split X2 = 2
segments(-2, 2, 3, 2, lwd = 2)
# Above X2>=2: split X1 = 0
segments(0, 2, 0, 3, lwd = 2)
text(-0.5, -0.5, "-1.80", cex = 1.1) # X2<1, X1<1
text( 2.0, -0.5, "0.63", cex = 1.1) # X2<1, X1>=1
text( 0.5, 1.5, "-1.06", cex = 1.1) # 1<=X2<2
text(-1.0, 2.5, "0.21", cex = 1.1) # X2>=2, X1<0
text( 1.5, 2.5, "2.49", cex = 1.1) # X2>=2, X1>=0
Ten bootstrapped trees produce the following estimates of \(P(\text{Class is Red}\mid X)\):
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Majority vote: count how many trees vote "Red" (prob > 0.5)
votes_red <- sum(probs > 0.5)
votes_red # 6 out of 10
## [1] 6
mean_prob <- mean(probs)
mean_prob # 0.45
## [1] 0.45
The two approaches disagree: majority vote classifies the observation as Red (6 of the 10 estimates exceed 0.5), while the average-probability approach classifies it as Green, since the mean probability 0.45 falls below 0.5.
We fit a regression tree in two main stages: grow a large tree by recursive binary splitting, then prune it back via cost-complexity pruning with the penalty chosen by cross-validation; prediction then reads off the leaf means.
Step 1 — Grow. Starting at the root, at every internal node consider every predictor \(X_j\) and every candidate cutpoint \(s\). The split \(\{X_j < s\}\) vs \(\{X_j \ge s\}\) partitions the data in the node into half-planes \(R_1(j,s)\) and \(R_2(j,s)\). Choose \((j,s)\) minimising \[\sum_{i:x_i\in R_1(j,s)} (y_i - \hat y_{R_1})^2 \;+\; \sum_{i:x_i\in R_2(j,s)} (y_i - \hat y_{R_2})^2,\] where \(\hat y_{R_k}\) is the mean response in \(R_k\). Recurse on each child until a stopping rule fires (e.g. minimum node size, maximum depth).
Step 2 — Prune. Let \(T_0\) denote the large tree from Step 1. For a non-negative tuning parameter \(\alpha\), define the cost-complexity criterion \[C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i:\, x_i\in R_m}(y_i - \hat y_{R_m})^2 \;+\; \alpha\,|T|,\] where \(|T|\) is the number of terminal nodes. Weakest-link pruning produces a finite, nested sequence of subtrees \(T_0 \supset T_1 \supset \cdots \supset T_K\) indexed by a corresponding increasing sequence \(\alpha_0<\alpha_1<\cdots<\alpha_K\).
Step 3 — Choose \(\alpha\). Use \(K\)-fold cross-validation to estimate test MSE as a function of \(\alpha\), pick the \(\hat\alpha\) that minimises it, and return the subtree corresponding to \(\hat\alpha\).
Step 4 — Predict. For new \(x\), find the terminal node \(R_m\) containing \(x\) and predict \(\hat f(x)=\hat y_{R_m}\).
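The four steps map directly onto the tree package used throughout this document; a minimal sketch on the Boston data (the control settings, e.g. mindev = 0 to grow a deliberately large tree, are illustrative choices, not part of the algorithm):
library(tree)
data(Boston, package = "MASS")
big <- tree(medv ~ ., data = Boston, # Step 1: grow a large tree
control = tree.control(nobs = nrow(Boston), mindev = 0))
cv <- cv.tree(big) # Steps 2-3: weakest-link sequence, alpha chosen by CV
best <- cv$size[which.min(cv$dev)]
pruned <- prune.tree(big, best = best) # subtree for the chosen alpha
predict(pruned, Boston[1:3, ]) # Step 4: predictions are the leaf means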
data(Boston, package = "MASS")
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)
boston.train <- Boston[ train, ]
boston.test <- Boston[-train, ]
y.test <- boston.test$medv
mtry.vals <- c(2, 4, 6, 8, 10, 13) # 13 = p (bagging)
ntree.vals <- seq(25, 500, by = 25)
err.mat <- matrix(NA, nrow = length(ntree.vals), ncol = length(mtry.vals),
dimnames = list(ntree.vals, mtry.vals))
for (j in seq_along(mtry.vals)) {
set.seed(1)
rf <- randomForest(medv ~ ., data = boston.train,
xtest = boston.test[, -14], ytest = y.test,
mtry = mtry.vals[j], ntree = max(ntree.vals))
err.mat[, j] <- rf$test$mse[ntree.vals]
}
matplot(ntree.vals, err.mat, type = "l", lty = 1, lwd = 2,
xlab = "Number of trees", ylab = "Test MSE",
main = "Random forests on Boston")
legend("topright",
legend = paste("mtry =", mtry.vals),
col = 1:length(mtry.vals), lty = 1, lwd = 2, bty = "n")
Findings. Test MSE drops quickly in the first ~100 trees and then plateaus. Bagging (\(\texttt{mtry}=13\)) is the worst large-\(B\) performer because every tree uses every predictor and the trees are highly correlated; values \(\texttt{mtry}\in\{4,6,8\}\) — roughly \(\sqrt{p}\) to \(p/2\) — do best, illustrating the decorrelation benefit of random forests.
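The decorrelation benefit can be quantified: the average of \(B\) identically distributed trees with variance \(\sigma^2\) and pairwise correlation \(\rho\) has variance \(\rho\sigma^2 + (1-\rho)\sigma^2/B\) (the standard random-forest variance decomposition), so \(\rho\) sets a floor that adding trees cannot remove. A quick illustration with assumed values of \(\rho\):
avg_var <- function(rho, B, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
B <- 500
c(bagging_like = avg_var(rho = 0.9, B), # highly correlated trees: floor near 0.9
rf_like = avg_var(rho = 0.3, B)) # decorrelated trees: floor near 0.3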
data(Carseats, package = "ISLR2")
set.seed(2)
train <- sample(nrow(Carseats), nrow(Carseats) / 2)
car.train <- Carseats[ train, ]
car.test <- Carseats[-train, ]
y.test <- car.test$Sales
tree.car <- tree(Sales ~ ., data = car.train)
summary(tree.car)
##
## Regression tree:
## tree(formula = Sales ~ ., data = car.train)
## Variables actually used in tree construction:
## [1] "Price" "ShelveLoc" "CompPrice" "Age" "Advertising"
## [6] "Population"
## Number of terminal nodes: 14
## Residual mean deviance: 2.602 = 484 / 186
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.71700 -1.08700 -0.01026 0.00000 1.11300 4.06600
plot(tree.car); text(tree.car, pretty = 0, cex = 0.7)
yhat <- predict(tree.car, newdata = car.test)
mse.tree <- mean((yhat - y.test)^2)
mse.tree
## [1] 4.471569
The most informative early splits are on ShelveLoc and
Price: stores with a Good shelf location and lower prices
show the highest predicted Sales.
set.seed(3)
cv.car <- cv.tree(tree.car)
plot(cv.car$size, cv.car$dev, type = "b",
xlab = "Tree size", ylab = "CV deviance",
main = "Carseats: cross-validated deviance")
best.size <- cv.car$size[which.min(cv.car$dev)]
best.size
## [1] 10
prune.car <- prune.tree(tree.car, best = best.size)
yhat.prune <- predict(prune.car, newdata = car.test)
mse.prune <- mean((yhat.prune - y.test)^2)
mse.prune
## [1] 4.794411
Pruning to the CV-optimal size yields a test MSE that is comparable to (and often slightly higher than) the unpruned tree — pruning here mainly improves interpretability, not accuracy.
set.seed(4)
bag.car <- randomForest(Sales ~ ., data = car.train,
mtry = ncol(car.train) - 1, importance = TRUE)
yhat.bag <- predict(bag.car, newdata = car.test)
mse.bag <- mean((yhat.bag - y.test)^2)
mse.bag
## [1] 2.565364
importance(bag.car)
## %IncMSE IncNodePurity
## CompPrice 26.9772206 210.512638
## Income 5.6021660 73.041581
## Advertising 15.6262094 104.666880
## Population 0.7599096 57.265375
## Price 56.8017464 534.329796
## ShelveLoc 38.9127307 277.048071
## Age 10.3668552 121.506572
## Education 0.5117380 41.592837
## Urban 3.6408426 8.707355
## US 2.1910451 7.491157
varImpPlot(bag.car, main = "Bagging: variable importance")
Price and ShelveLoc dominate the importance
ranking.
set.seed(5)
p <- ncol(car.train) - 1
# Sweep over m
mtry.grid <- 1:p
rf.mse <- sapply(mtry.grid, function(m) {
set.seed(5)
fit <- randomForest(Sales ~ ., data = car.train, mtry = m, ntree = 500)
mean((predict(fit, car.test) - y.test)^2)
})
plot(mtry.grid, rf.mse, type = "b", pch = 19,
xlab = "m (variables tried at each split)", ylab = "Test MSE",
main = "Carseats: test MSE vs. m")
abline(v = which.min(rf.mse), lty = 2, col = "grey50")
best.m <- which.min(rf.mse)
best.m
## [1] 10
set.seed(5)
rf.car <- randomForest(Sales ~ ., data = car.train,
mtry = best.m, importance = TRUE)
importance(rf.car)
## %IncMSE IncNodePurity
## CompPrice 27.1722326 205.190735
## Income 4.3008697 76.560849
## Advertising 14.3704387 106.431105
## Population -1.5085424 59.456769
## Price 53.6151274 533.471449
## ShelveLoc 35.9098417 287.343274
## Age 8.5948537 122.671428
## Education -0.2318547 39.539272
## Urban -0.2705500 10.129864
## US 0.3891643 7.645046
varImpPlot(rf.car, main = "Random forest: variable importance")
As \(m\) grows from \(1\) toward \(p\), test MSE drops sharply and then flattens; beyond roughly \(m\approx p/2\) the differences are small. For this particular split the minimum happens to fall at \(m = 10 = p\) (i.e. bagging), although with other seeds or splits a somewhat smaller \(m\) often wins because of inter-tree decorrelation. ShelveLoc and Price are again the top predictors.
set.seed(6)
x <- model.matrix(Sales ~ . - 1, data = Carseats)
y <- Carseats$Sales
xtr <- x[ train, ]; ytr <- y[ train]
xte <- x[-train, ]; yte <- y[-train]
bart.car <- gbart(xtr, ytr, x.test = xte)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 12, 200
## y1,yn: 0.154000, -2.406000
## x1,x[n*p]: 140.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.212816,7.346
## *****sigma: 1.045244
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,12,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
yhat.bart <- bart.car$yhat.test.mean
mse.bart <- mean((yhat.bart - yte)^2)
mse.bart
## [1] 1.368754
Summary of test MSEs for Carseats:
# best.m = p on this split, so the tuned random forest coincides with bagging;
# we therefore report the bagging MSE for both rows.
data.frame(
Method = c("Single tree", "Pruned tree", "Bagging", "Random forest", "BART"),
TestMSE = round(c(mse.tree, mse.prune, mse.bag, mse.bag, mse.bart), 3)
)
## Method TestMSE
## 1 Single tree 4.472
## 2 Pruned tree 4.794
## 3 Bagging 2.565
## 4 Random forest 2.565
## 5 BART 1.369
Ensemble methods cut the test MSE of a single tree roughly in half, and BART does better still, roughly halving it again relative to bagging and the random forest.
data(OJ, package = "ISLR2")
set.seed(7)
train <- sample(nrow(OJ), 800)
oj.train <- OJ[ train, ]
oj.test <- OJ[-train, ]
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "WeekofPurchase" "ListPriceDiff"
## [5] "DiscMM" "STORE"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7329 = 579.7 / 791
## Misclassification error rate: 0.1588 = 127 / 800
As reported in the summary above, the tree has 9 terminal nodes and a training misclassification rate of 0.1588 (127/800).
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60500 0.39500 )
## 2) LoyalCH < 0.5036 360 431.00 MM ( 0.28611 0.71389 )
## 4) LoyalCH < 0.136698 109 57.19 MM ( 0.07339 0.92661 ) *
## 5) LoyalCH > 0.136698 251 333.00 MM ( 0.37849 0.62151 )
## 10) PriceDiff < 0.195 110 107.30 MM ( 0.19091 0.80909 ) *
## 11) PriceDiff > 0.195 141 195.10 CH ( 0.52482 0.47518 )
## 22) WeekofPurchase < 247.5 55 70.90 MM ( 0.34545 0.65455 ) *
## 23) WeekofPurchase > 247.5 86 112.40 CH ( 0.63953 0.36047 ) *
## 3) LoyalCH > 0.5036 440 346.80 CH ( 0.86591 0.13409 )
## 6) LoyalCH < 0.764572 185 215.90 CH ( 0.72973 0.27027 )
## 12) ListPriceDiff < 0.255 102 139.50 CH ( 0.56863 0.43137 )
## 24) DiscMM < 0.47 87 113.30 CH ( 0.64368 0.35632 ) *
## 25) DiscMM > 0.47 15 11.78 MM ( 0.13333 0.86667 ) *
## 13) ListPriceDiff > 0.255 83 43.08 CH ( 0.92771 0.07229 ) *
## 7) LoyalCH > 0.764572 255 77.87 CH ( 0.96471 0.03529 )
## 14) STORE < 1.5 136 0.00 CH ( 1.00000 0.00000 ) *
## 15) STORE > 1.5 119 63.78 CH ( 0.92437 0.07563 ) *
Reading one terminal node: the line
4) LoyalCH < 0.136698 109 57.19 MM ( 0.07339 0.92661 ) *
means there are 109 training observations in this leaf, the deviance is
\(\approx 57.19\), the predicted class
is MM, and within this leaf the empirical class
proportions are \((\hat p_{CH},\hat p_{MM}) =
(0.073, 0.927)\) — a nearly pure MM leaf reached when
LoyalCH is small.
plot(tree.oj); text(tree.oj, pretty = 0, cex = 0.7)
LoyalCH (customer brand loyalty for Citrus Hill) is the
dominant predictor: it appears at the root and in several subsequent
splits.
pred.oj <- predict(tree.oj, oj.test, type = "class")
cm <- table(Predicted = pred.oj, True = oj.test$Purchase)
cm
## True
## Predicted CH MM
## CH 151 31
## MM 18 70
test.err.full <- 1 - sum(diag(cm)) / sum(cm)
test.err.full
## [1] 0.1814815
set.seed(7)
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 9 8 5 2 1
##
## $dev
## [1] 151 154 154 165 316
##
## $k
## [1] -Inf 0.000000 3.666667 8.000000 154.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.oj$size, cv.oj$dev, type = "b", pch = 19,
xlab = "Tree size", ylab = "CV classification errors",
main = "OJ: CV error vs. tree size")
best.size <- cv.oj$size[which.min(cv.oj$dev)]
best.size
## [1] 9
# If CV picks the unpruned tree, fall back to a 5-leaf tree per the prompt.
size.for.prune <- if (best.size == length(unique(tree.oj$where))) 5 else best.size
prune.oj <- prune.misclass(tree.oj, best = size.for.prune)
plot(prune.oj); text(prune.oj, pretty = 0, cex = 0.8)
pred.tr.full <- predict(tree.oj , oj.train, type = "class")
pred.tr.prune <- predict(prune.oj, oj.train, type = "class")
c(unpruned = mean(pred.tr.full != oj.train$Purchase),
pruned = mean(pred.tr.prune != oj.train$Purchase))
## unpruned pruned
## 0.15875 0.17250
The pruned tree has the higher training error — it has strictly fewer leaves, so it fits the training data less closely.
pred.te.prune <- predict(prune.oj, oj.test, type = "class")
c(unpruned = test.err.full,
pruned = mean(pred.te.prune != oj.test$Purchase))
## unpruned pruned
## 0.1814815 0.1925926
The two test error rates are usually very close; pruning trades a small amount of test accuracy (or none at all) for a substantially simpler, more interpretable model.
data(Hitters, package = "ISLR2")
hit <- na.omit(Hitters)
hit$Salary <- log(hit$Salary)
hit.train <- hit[1:200, ]
hit.test <- hit[-(1:200), ]
lambdas <- 10^seq(-3, -0.2, length.out = 20)
train.mse <- test.mse <- numeric(length(lambdas))
for (i in seq_along(lambdas)) {
set.seed(8)
fit <- gbm(Salary ~ ., data = hit.train,
distribution = "gaussian",
n.trees = 1000, shrinkage = lambdas[i],
verbose = FALSE)
train.mse[i] <- mean((predict(fit, hit.train, n.trees = 1000) - hit.train$Salary)^2)
test.mse[i] <- mean((predict(fit, hit.test , n.trees = 1000) - hit.test$Salary)^2)
}
par(mfrow = c(1, 2))
plot(lambdas, train.mse, type = "b", log = "x", pch = 19, col = "steelblue",
xlab = expression(lambda), ylab = "Training MSE", main = "Training MSE")
plot(lambdas, test.mse, type = "b", log = "x", pch = 19, col = "firebrick",
xlab = expression(lambda), ylab = "Test MSE", main = "Test MSE")
par(mfrow = c(1, 1))
best.lambda <- lambdas[which.min(test.mse)]
best.lambda
## [1] 0.0417881
min(test.mse)
## [1] 0.2701264
Training MSE decreases monotonically with \(\lambda\); test MSE shows a clear U-shape with a minimum at a moderate \(\lambda\).
# Linear regression
lm.fit <- lm(Salary ~ ., data = hit.train)
lm.pred <- predict(lm.fit, hit.test)
mse.lm <- mean((lm.pred - hit.test$Salary)^2)
# Lasso via glmnet
x.train <- model.matrix(Salary ~ ., data = hit.train)[, -1]
y.train <- hit.train$Salary
x.test <- model.matrix(Salary ~ ., data = hit.test )[, -1]
y.test <- hit.test$Salary
set.seed(8)
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.pred <- predict(cv.lasso, s = "lambda.min", newx = x.test)
mse.lasso <- mean((lasso.pred - y.test)^2)
data.frame(
Method = c("Boosting (best λ)", "Linear regression", "Lasso"),
TestMSE = round(c(min(test.mse), mse.lm, mse.lasso), 4)
)
## Method TestMSE
## 1 Boosting (best λ) 0.2701
## 2 Linear regression 0.4918
## 3 Lasso 0.4706
Boosting comfortably beats both linear regression and the lasso.
set.seed(8)
best.gbm <- gbm(Salary ~ ., data = hit.train,
distribution = "gaussian",
n.trees = 1000, shrinkage = best.lambda,
verbose = FALSE)
summary(best.gbm, plotit = FALSE) |> head(10)
## var rel.inf
## CAtBat CAtBat 16.891401
## CHits CHits 10.341500
## CWalks CWalks 10.106442
## CRBI CRBI 8.097595
## CRuns CRuns 8.092889
## Years Years 6.196511
## PutOuts PutOuts 5.954579
## Walks Walks 5.813423
## CHmRun CHmRun 5.189429
## Hits Hits 4.882924
CAtBat, CRBI, CHits,
CRuns, and CWalks — the career counting stats
— dominate.
set.seed(8)
bag.hit <- randomForest(Salary ~ ., data = hit.train,
mtry = ncol(hit.train) - 1, ntree = 500)
mse.bag <- mean((predict(bag.hit, hit.test) - hit.test$Salary)^2)
mse.bag
## [1] 0.2321237
Bagging is competitive with boosting on these data; here its test MSE (0.232) is in fact slightly lower than the best boosted model's (0.270).
data(Caravan, package = "ISLR2")
Caravan$Purchase01 <- ifelse(Caravan$Purchase == "Yes", 1, 0)
car.train <- Caravan[ 1:1000, ]
car.test <- Caravan[-(1:1000), ]
set.seed(9)
boost.cv <- gbm(Purchase01 ~ . - Purchase, data = car.train,
distribution = "bernoulli",
n.trees = 1000, shrinkage = 0.01,
verbose = FALSE)
head(summary(boost.cv, plotit = FALSE), 10)
## var rel.inf
## PPERSAUT PPERSAUT 15.165012
## MKOOPKLA MKOOPKLA 9.254921
## MOPLHOOG MOPLHOOG 7.149152
## MBERMIDD MBERMIDD 5.798803
## PBRAND PBRAND 5.789747
## MGODGE MGODGE 4.419720
## MINK3045 MINK3045 4.056977
## ABRAND ABRAND 3.881660
## MOSTYPE MOSTYPE 2.644470
## MSKA MSKA 2.610447
PPERSAUT, MKOOPKLA, MOPLHOOG,
MBERMIDD, and PBRAND typically lead the
importance ranking.
prob.boost <- predict(boost.cv, car.test, n.trees = 1000, type = "response")
pred.boost <- ifelse(prob.boost > 0.20, "Yes", "No")
cm.boost <- table(Predicted = pred.boost, True = car.test$Purchase)
cm.boost
## True
## Predicted No Yes
## No 4415 253
## Yes 118 36
# Precision: of those predicted Yes, fraction that actually buy
prec.boost <- cm.boost["Yes", "Yes"] / sum(cm.boost["Yes", ])
prec.boost
## [1] 0.2337662
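For context (a check of ours, not part of the original output), the base rate of buyers in the test set:
mean(car.test$Purchase == "Yes") # roughly 0.06, so precision 0.23 is about 4x the base rate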
Compare with logistic regression and KNN:
# Logistic regression
glm.fit <- glm(Purchase01 ~ . - Purchase, data = car.train, family = binomial)
prob.glm <- predict(glm.fit, car.test, type = "response")
pred.glm <- ifelse(prob.glm > 0.20, "Yes", "No")
cm.glm <- table(Predicted = pred.glm, True = car.test$Purchase)
prec.glm <- cm.glm["Yes", "Yes"] / sum(cm.glm["Yes", ])
# KNN with standardised predictors
std <- scale(Caravan[, !(names(Caravan) %in% c("Purchase", "Purchase01"))])
knn.pred <- knn(train = std[1:1000, ],
test = std[-(1:1000), ],
cl = Caravan$Purchase[1:1000],
k = 5)
cm.knn <- table(Predicted = knn.pred, True = car.test$Purchase)
prec.knn <- if ("Yes" %in% rownames(cm.knn))
cm.knn["Yes", "Yes"] / sum(cm.knn["Yes", ]) else NA
data.frame(
Method = c("Boosting (cut 0.2)", "Logistic regression (cut 0.2)", "KNN (k=5)"),
Precision_Yes = round(c(prec.boost, prec.glm, prec.knn), 3)
)
## Method Precision_Yes
## 1 Boosting (cut 0.2) 0.234
## 2 Logistic regression (cut 0.2) 0.142
## 3 KNN (k=5) 0.270
Boosting’s precision on the “Yes” class — about 0.23 here — is roughly four times the base rate of buyers (~6%) and well above logistic regression at the same 0.20 cutoff; KNN with \(k=5\) happens to edge it out on this split.
We use the Default data set from ISLR2,
predicting default (Yes/No) from student,
balance, and income.
data(Default, package = "ISLR2")
Default$default01 <- ifelse(Default$default == "Yes", 1, 0)
set.seed(10)
tr <- sample(nrow(Default), 7000)
def.train <- Default[ tr, ]
def.test <- Default[-tr, ]
y.test <- def.test$default
# Logistic regression
glm.fit <- glm(default ~ student + balance + income, data = def.train, family = binomial)
prob.glm <- predict(glm.fit, def.test, type = "response")
pred.glm <- ifelse(prob.glm > 0.5, "Yes", "No")
err.glm <- mean(pred.glm != y.test)
# Boosting
set.seed(10)
boost.def <- gbm(default01 ~ student + balance + income, data = def.train,
distribution = "bernoulli", n.trees = 1000,
shrinkage = 0.01, verbose = FALSE)
prob.boost <- predict(boost.def, def.test, n.trees = 1000, type = "response")
pred.boost <- ifelse(prob.boost > 0.5, "Yes", "No")
err.boost <- mean(pred.boost != y.test)
# Bagging
set.seed(10)
bag.def <- randomForest(default ~ student + balance + income,
data = def.train, mtry = 3, ntree = 500)
err.bag <- mean(predict(bag.def, def.test) != y.test)
# Random forest
set.seed(10)
rf.def <- randomForest(default ~ student + balance + income,
data = def.train, mtry = 2, ntree = 500)
err.rf <- mean(predict(rf.def, def.test) != y.test)
# BART — binary outcome, so use pbart() (probit BART), not gbart()
x <- model.matrix(default01 ~ student + balance + income - 1, data = Default)
y <- Default$default01
set.seed(10)
bart.def <- pbart(x[tr, ], y[tr], x.test = x[-tr, ])
## *****Into main of pbart
## *****Data:
## data:n,p,np: 7000, 4, 3000
## y1,yn: 0, 0
## x1,x[n*p]: 1.000000, 44402.106231
## xp1,xp[np*p]: 1.000000, 16862.952321
## *****Number of Trees: 50
## *****Number of Cut Points: 1 ... 100
## *****burn and ndpost: 100, 1000
## *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
## *****binaryOffset: -1.830717
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,4,0
## *****nkeeptrain,nkeeptest,nkeeptreedraws: 1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skiptreedraws: 1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 9s
## check counts
## trcnt,tecnt: 1000,1000
# pbart returns posterior mean probabilities directly in prob.test.mean
pred.bart <- ifelse(bart.def$prob.test.mean > 0.5, "Yes", "No")
err.bart <- mean(pred.bart != y.test)
data.frame(
Method = c("Logistic regression", "Boosting", "Bagging",
"Random forest", "BART"),
TestError = round(c(err.glm, err.boost, err.bag, err.rf, err.bart), 4)
)
## Method TestError
## 1 Logistic regression 0.0267
## 2 Boosting 0.0280
## 3 Bagging 0.0350
## 4 Random forest 0.0337
## 5 BART 0.0287
On Default the classes are very imbalanced (~3% defaulters), so all
methods land in a narrow band of test error, roughly 2.7–3.5%. Here
logistic regression actually comes out best, with boosting and BART
close behind and bagging and the random forest trailing slightly. The
differences are small, which is itself a useful
conclusion: when a problem has a clean signal (balance is
overwhelmingly predictive) and is nearly linear on the logit scale,
sophisticated nonparametric methods do not buy much over a
well-specified GLM.
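As a final sanity check (ours, not part of the exercise), the no-information rule that always predicts “No” already lands near the bottom of that band:
mean(y.test != "No") # the test-set default rate, roughly 0.03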
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_4.0.1 class_7.3-23
## [4] leaps_3.2 glmnet_5.0 Matrix_1.7-4
## [7] MASS_7.3-65 BART_2.9.10 survival_3.8-3
## [10] nlme_3.1-168 gbm_2.2.3 randomForest_4.7-1.2
## [13] tree_1.0-45 ISLR2_1.3-2
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 shape_1.4.6.1 lattice_0.22-7
## [5] digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
## [9] RColorBrewer_1.1-3 iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
## [13] jsonlite_2.0.0 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
## [17] cli_3.6.5 rlang_1.1.7 splines_4.5.2 withr_3.0.2
## [21] cachem_1.1.0 yaml_2.3.12 tools_4.5.2 parallel_4.5.2
## [25] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5 pkgconfig_2.0.3
## [29] bslib_0.9.0 pillar_1.11.1 gtable_0.3.6 glue_1.8.0
## [33] Rcpp_1.1.1 xfun_0.55 tibble_3.3.1 tidyselect_1.2.1
## [37] rstudioapi_0.17.1 knitr_1.51 farver_2.1.2 htmltools_0.5.9
## [41] rmarkdown_2.30 compiler_4.5.2 S7_0.2.1