We construct a partition of the \((X_1, X_2)\) feature space into six regions \(R_1,\ldots,R_6\) by recursive binary splitting, using cutpoints \(t_1=40\) on \(X_1\); then \(t_2=60\) on \(X_2\) (within \(X_1<40\)); \(t_3=70\) on \(X_1\) (within \(X_1\ge40\)); \(t_4=30\) on \(X_2\) (within \(40\le X_1<70\)); and \(t_5=50\) on \(X_2\) (within \(X_1\ge70\)).
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
# --- Left: the partition ---
plot(NA, xlim = c(0, 100), ylim = c(0, 100),
xlab = expression(X[1]), ylab = expression(X[2]),
main = "Partition of feature space")
# Vertical split at X1 = 40 (t1)
segments(40, 0, 40, 100, lwd = 2)
# Within X1 < 40: split X2 = 60 (t2)
segments(0, 60, 40, 60, lwd = 2)
# Within X1 >= 40: split X1 = 70 (t3)
segments(70, 0, 70, 100, lwd = 2)
# Within 40 <= X1 < 70: split X2 = 30 (t4)
segments(40, 30, 70, 30, lwd = 2)
# Within X1 >= 70: split X2 = 50 (t5)
segments(70, 50, 100, 50, lwd = 2)
# Region labels
text(20, 80, expression(R[1]), cex = 1.3)
text(20, 30, expression(R[2]), cex = 1.3)
text(55, 15, expression(R[3]), cex = 1.3)
text(55, 65, expression(R[4]), cex = 1.3)
text(85, 25, expression(R[5]), cex = 1.3)
text(85, 75, expression(R[6]), cex = 1.3)
# Cutpoint labels
mtext(expression(t[1] == 40), side = 3, at = 40, line = 0.2, cex = 0.8)
mtext(expression(t[3] == 70), side = 3, at = 70, line = 0.2, cex = 0.8)
axis(4, at = c(30, 50, 60),
labels = c(expression(t[4]), expression(t[5]), expression(t[2])),
las = 1, tick = FALSE, line = -1)
# --- Right: the corresponding tree ---
plot(NA, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
xlab = "", ylab = "", main = "Corresponding decision tree")
# Internal nodes
text(5, 9.3, expression(X[1] < 40))
text(2.5, 7.0, expression(X[2] < 60))
text(7.5, 7.0, expression(X[1] < 70))
text(6.5, 4.6, expression(X[2] < 30))
text(8.5, 4.6, expression(X[2] < 50))
# Edges
segments(5, 9.1, 2.5, 7.2); segments(5, 9.1, 7.5, 7.2)
segments(2.5, 6.8, 1.2, 4.8); segments(2.5, 6.8, 3.7, 4.8)
segments(7.5, 6.8, 6.5, 4.8); segments(7.5, 6.8, 8.5, 4.8)
segments(6.5, 4.4, 5.7, 2.5); segments(6.5, 4.4, 7.3, 2.5)
segments(8.5, 4.4, 8.1, 2.5); segments(8.5, 4.4, 8.9, 2.5)
# Leaves
text(1.2, 4.5, expression(R[1]))
text(3.7, 4.5, expression(R[2]))
text(5.7, 2.3, expression(R[3]))
text(7.3, 2.3, expression(R[4]))
text(8.1, 2.3, expression(R[5]))
text(8.9, 2.3, expression(R[6]))
par(mfrow = c(1, 1))
In Algorithm 8.2 we fit, at each iteration \(b\), a tree \(\hat f^{\,b}(x)\) with \(d\) splits to the current residual and update \[\hat f(x) \leftarrow \hat f(x) + \lambda \,\hat f^{\,b}(x).\] After \(B\) iterations, \[\hat f(x) = \sum_{b=1}^{B} \lambda\,\hat f^{\,b}(x).\] With \(d=1\) (a stump), each \(\hat f^{\,b}\) splits on exactly one variable, say \(X_{j(b)}\), so \(\hat f^{\,b}(x) = g_b(X_{j(b)})\) for some single-variable function \(g_b\). Grouping the stumps by which variable they used, \[\hat f(x) = \sum_{j=1}^{p}\,\underbrace{\sum_{b:\,j(b)=j} \lambda\,g_b(X_j)}_{=:\;f_j(X_j)} \;=\; \sum_{j=1}^{p} f_j(X_j),\] which is the desired additive form.
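To see the grouping argument in action, here is a minimal hand-rolled sketch (ours, not the text's Algorithm 8.2 verbatim): boosting depth-one stumps on simulated additive data, then tallying the fitted stumps by split variable. The data-generating model, \(\lambda=0.1\), and \(B=100\) are illustrative choices.
set.seed(1)
n <- 200; p <- 3
X <- matrix(runif(n * p, -2, 2), n, p)
y <- sin(X[, 1]) + X[, 2]^2 + rnorm(n, sd = 0.3) # additive truth; X3 is pure noise
lambda <- 0.1; B <- 100
fit_stump <- function(x, r) {
# best single split on this variable, minimising RSS with leaf means
s <- sort(unique(x))
cuts <- (head(s, -1) + tail(s, -1)) / 2
rss <- vapply(cuts, function(c0) {
sum((r[x < c0] - mean(r[x < c0]))^2) + sum((r[x >= c0] - mean(r[x >= c0]))^2)
}, numeric(1))
k <- which.min(rss)
list(cut = cuts[k], left = mean(r[x < cuts[k]]),
right = mean(r[x >= cuts[k]]), rss = rss[k])
}
resid <- y
split.var <- integer(B)
for (b in 1:B) {
cand <- lapply(1:p, function(j) fit_stump(X[, j], resid))
j <- which.min(vapply(cand, `[[`, numeric(1), "rss"))
st <- cand[[j]]; split.var[b] <- j
resid <- resid - lambda * ifelse(X[, j] < st$cut, st$left, st$right)
}
# Each stump is a function of a single X_j; grouping by j gives the f_j terms.
table(split.var) # splits should concentrate on X1 and X2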
p <- seq(1e-4, 1 - 1e-4, length.out = 200)
gini <- 2 * p * (1 - p)
err <- 1 - pmax(p, 1 - p)
entropy <- -(p * log(p) + (1 - p) * log(1 - p))
plot(p, gini, type = "l", lwd = 2, col = "steelblue",
ylim = c(0, max(entropy)),
xlab = expression(hat(p)[m1]),
ylab = "Impurity",
main = "Node impurity for two classes")
lines(p, err, lwd = 2, col = "firebrick")
lines(p, entropy, lwd = 2, col = "darkgreen")
legend("top", legend = c("Gini", "Classification error", "Entropy"),
col = c("steelblue", "firebrick", "darkgreen"),
lwd = 2, bty = "n")
All three peak at \(\hat p_{m1}=0.5\) and vanish at \(\hat p_{m1}\in\{0,1\}\). Entropy (computed here with the natural logarithm, so peaking at \(\log 2\approx0.69\)) takes the largest values and classification error the smallest. Gini and entropy are differentiable and more sensitive to changes in the node class probabilities than classification error, which is why they are preferred for growing trees; classification error is often preferred for pruning, when accuracy of the final tree is the goal.
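A standard numerical example (adapted from the wider literature, not from this text) makes the point concrete: two candidate splits of a parent node with 400 observations of each class can have identical misclassification rates while Gini clearly prefers the one that creates a pure child node.
gini <- function(p) 2 * p * (1 - p)
# Parent: (400, 400). Split A -> (300,100) and (100,300); Split B -> (200,400) and (200,0).
errA <- (400/800) * (100/400) + (400/800) * (100/400)
errB <- (600/800) * (200/600) + (200/800) * 0
giniA <- (400/800) * gini(300/400) + (400/800) * gini(100/400)
giniB <- (600/800) * gini(200/600) + (200/800) * gini(1)
c(errA = errA, errB = errB) # both 0.25: error cannot tell the splits apart
c(giniA = giniA, giniB = giniB) # 0.375 vs 0.333: Gini favours the pure-node split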
(a) Tree corresponding to the left-hand partition:
plot(NA, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
xlab = "", ylab = "", main = "Tree for partition in Fig. 8.14 (left)")
text(5, 9.3, expression(X[2] < 1))
text(2.5, 7.0, expression(X[1] < 1))
text(7.5, 7.0, "15")
segments(5, 9.1, 2.5, 7.2); segments(5, 9.1, 7.5, 7.2)
text(1.5, 4.6, expression(X[2] < 0.5)) # bottom-left region
text(4.0, 4.6, "5") # X1>=1, X2<1
segments(2.5, 6.8, 1.5, 4.8); segments(2.5, 6.8, 4.0, 4.8)
text(0.7, 2.3, "3") # X1<1, X2<0.5
text(2.3, 2.3, "10") # X1<1, 0.5<=X2<1
segments(1.5, 4.4, 0.7, 2.5); segments(1.5, 4.4, 2.3, 2.5)
(The two regions stacked at \(X_1<1\) with means \(3\) and \(10\) are separated by a horizontal cut, here drawn as \(X_2<0.5\); only the relative ordering matters since the figure does not label that cutpoint.)
(b) Partition corresponding to the right-hand tree (splits \(X_2<1\), \(X_1<1\), \(X_2<2\), \(X_1<0\); leaf means \(-1.80, 0.63, -1.06, 0.21, 2.49\)):
plot(NA, xlim = c(-2, 3), ylim = c(-2, 3),
xlab = expression(X[1]), ylab = expression(X[2]),
main = "Partition for tree in Fig. 8.14 (right)")
# Horizontal split at X2 = 1 (root)
segments(-2, 1, 3, 1, lwd = 2)
# Below X2<1: split X1 = 1
segments(1, -2, 1, 1, lwd = 2)
# Above X2>=1: split X2 = 2
segments(-2, 2, 3, 2, lwd = 2)
# Above X2>=2: split X1 = 0
segments(0, 2, 0, 3, lwd = 2)
text(-0.5, -0.5, "-1.80", cex = 1.1) # X2<1, X1<1
text( 2.0, -0.5, "0.63", cex = 1.1) # X2<1, X1>=1
text( 0.5, 1.5, "-1.06", cex = 1.1) # 1<=X2<2
text(-1.0, 2.5, "0.21", cex = 1.1) # X2>=2, X1<0
text( 1.5, 2.5, "2.49", cex = 1.1) # X2>=2, X1>=0
Ten bootstrapped trees produce the following estimates of \(P(\text{Class is Red}\mid X)\):
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Majority vote: count how many trees vote "Red" (prob > 0.5)
votes_red <- sum(probs > 0.5)
votes_red # 6 out of 10
## [1] 6
mean_prob <- mean(probs)
mean_prob # 0.45
## [1] 0.45
The two approaches disagree: majority vote classifies the observation as Red (6 of the 10 estimates exceed 0.5), while the average-probability approach classifies it as Green, since the mean probability 0.45 falls below 0.5.
We fit a regression tree in two main stages: grow a large tree by recursive binary splitting, then prune it back via cost-complexity pruning with the penalty chosen by cross-validation; prediction then reads off the leaf means.
Step 1 — Grow. Starting at the root, at every internal node consider every predictor \(X_j\) and every candidate cutpoint \(s\). The split \(\{X_j < s\}\) vs \(\{X_j \ge s\}\) partitions the data in the node into half-planes \(R_1(j,s)\) and \(R_2(j,s)\). Choose \((j,s)\) minimising \[\sum_{i:x_i\in R_1(j,s)} (y_i - \hat y_{R_1})^2 \;+\; \sum_{i:x_i\in R_2(j,s)} (y_i - \hat y_{R_2})^2,\] where \(\hat y_{R_k}\) is the mean response in \(R_k\). Recurse on each child until a stopping rule fires (e.g. minimum node size, maximum depth).
Step 2 — Prune. Let \(T_0\) denote the large tree from Step 1. For a non-negative tuning parameter \(\alpha\), define the cost-complexity criterion \[C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i:\, x_i\in R_m}(y_i - \hat y_{R_m})^2 \;+\; \alpha\,|T|,\] where \(|T|\) is the number of terminal nodes. Weakest-link pruning produces a finite, nested sequence of subtrees \(T_0 \supset T_1 \supset \cdots \supset T_K\) indexed by a corresponding increasing sequence \(\alpha_0<\alpha_1<\cdots<\alpha_K\).
Step 3 — Choose \(\alpha\). Use \(K\)-fold cross-validation to estimate test MSE as a function of \(\alpha\), pick the \(\hat\alpha\) that minimises it, and return the subtree corresponding to \(\hat\alpha\).
Step 4 — Predict. For new \(x\), find the terminal node \(R_m\) containing \(x\) and predict \(\hat f(x)=\hat y_{R_m}\).
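The four steps map directly onto the tree package used throughout this document; a minimal sketch on the Boston data (the control settings, e.g. mindev = 0 to grow a deliberately large tree, are illustrative choices, not part of the algorithm):
library(tree)
data(Boston, package = "MASS")
big <- tree(medv ~ ., data = Boston, # Step 1: grow a large tree
control = tree.control(nobs = nrow(Boston), mindev = 0))
cv <- cv.tree(big) # Steps 2-3: weakest-link sequence, alpha chosen by CV
best <- cv$size[which.min(cv$dev)]
pruned <- prune.tree(big, best = best) # subtree for the chosen alpha
predict(pruned, Boston[1:3, ]) # Step 4: predictions are the leaf means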
data(Boston, package = "MASS")
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)
boston.train <- Boston[ train, ]
boston.test <- Boston[-train, ]
y.test <- boston.test$medv
mtry.vals <- c(2, 4, 6, 8, 10, 13) # 13 = p (bagging)
ntree.vals <- seq(25, 500, by = 25)
err.mat <- matrix(NA, nrow = length(ntree.vals), ncol = length(mtry.vals),
dimnames = list(ntree.vals, mtry.vals))
for (j in seq_along(mtry.vals)) {
set.seed(1)
rf <- randomForest(medv ~ ., data = boston.train,
xtest = boston.test[, -14], ytest = y.test,
mtry = mtry.vals[j], ntree = max(ntree.vals))
err.mat[, j] <- rf$test$mse[ntree.vals]
}
matplot(ntree.vals, err.mat, type = "l", lty = 1, lwd = 2,
xlab = "Number of trees", ylab = "Test MSE",
main = "Random forests on Boston")
legend("topright",
legend = paste("mtry =", mtry.vals),
col = 1:length(mtry.vals), lty = 1, lwd = 2, bty = "n")
Findings. Test MSE drops quickly in the first ~100 trees and then plateaus. Bagging (\(\texttt{mtry}=13\)) is the worst large-\(B\) performer because every tree uses every predictor and the trees are highly correlated; values \(\texttt{mtry}\in\{4,6,8\}\) — roughly \(\sqrt{p}\) to \(p/2\) — do best, illustrating the decorrelation benefit of random forests.
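The decorrelation benefit can be quantified: the average of \(B\) identically distributed trees with variance \(\sigma^2\) and pairwise correlation \(\rho\) has variance \(\rho\sigma^2 + (1-\rho)\sigma^2/B\) (the standard random-forest variance decomposition), so \(\rho\) sets a floor that adding trees cannot remove. A quick illustration with assumed values of \(\rho\):
avg_var <- function(rho, B, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
B <- 500
c(bagging_like = avg_var(rho = 0.9, B), # highly correlated trees: floor near 0.9
rf_like = avg_var(rho = 0.3, B)) # decorrelated trees: floor near 0.3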
data(Carseats, package = "ISLR2")
set.seed(2)
train <- sample(nrow(Carseats), nrow(Carseats) / 2)
car.train <- Carseats[ train, ]
car.test <- Carseats[-train, ]
y.test <- car.test$Sales
tree.car <- tree(Sales ~ ., data = car.train)
summary(tree.car)
##
## Regression tree:
## tree(formula = Sales ~ ., data = car.train)
## Variables actually used in tree construction:
## [1] "Price" "ShelveLoc" "CompPrice" "Age" "Advertising"
## [6] "Population"
## Number of terminal nodes: 14
## Residual mean deviance: 2.602 = 484 / 186
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.71700 -1.08700 -0.01026 0.00000 1.11300 4.06600
plot(tree.car); text(tree.car, pretty = 0, cex = 0.7)
yhat <- predict(tree.car, newdata = car.test)
mse.tree <- mean((yhat - y.test)^2)
mse.tree
## [1] 4.471569
The most informative early splits are on ShelveLoc and
Price: stores with a Good shelf location and lower prices
show the highest predicted Sales.
set.seed(3)
cv.car <- cv.tree(tree.car)
plot(cv.car$size, cv.car$dev, type = "b",
xlab = "Tree size", ylab = "CV deviance",
main = "Carseats: cross-validated deviance")
best.size <- cv.car$size[which.min(cv.car$dev)]
best.size
## [1] 10
prune.car <- prune.tree(tree.car, best = best.size)
yhat.prune <- predict(prune.car, newdata = car.test)
mse.prune <- mean((yhat.prune - y.test)^2)
mse.prune
## [1] 4.794411
Pruning to the CV-optimal size yields a test MSE that is comparable to (and often slightly higher than) the unpruned tree — pruning here mainly improves interpretability, not accuracy.
set.seed(4)
bag.car <- randomForest(Sales ~ ., data = car.train,
mtry = ncol(car.train) - 1, importance = TRUE)
yhat.bag <- predict(bag.car, newdata = car.test)
mse.bag <- mean((yhat.bag - y.test)^2)
mse.bag
## [1] 2.565364
importance(bag.car)
## %IncMSE IncNodePurity
## CompPrice 26.9772206 210.512638
## Income 5.6021660 73.041581
## Advertising 15.6262094 104.666880
## Population 0.7599096 57.265375
## Price 56.8017464 534.329796
## ShelveLoc 38.9127307 277.048071
## Age 10.3668552 121.506572
## Education 0.5117380 41.592837
## Urban 3.6408426 8.707355
## US 2.1910451 7.491157
varImpPlot(bag.car, main = "Bagging: variable importance")
Price and ShelveLoc dominate the importance
ranking.
set.seed(5)
p <- ncol(car.train) - 1
# Sweep over m
mtry.grid <- 1:p
rf.mse <- sapply(mtry.grid, function(m) {
set.seed(5)
fit <- randomForest(Sales ~ ., data = car.train, mtry = m, ntree = 500)
mean((predict(fit, car.test) - y.test)^2)
})
plot(mtry.grid, rf.mse, type = "b", pch = 19,
xlab = "m (variables tried at each split)", ylab = "Test MSE",
main = "Carseats: test MSE vs. m")
abline(v = which.min(rf.mse), lty = 2, col = "grey50")
best.m <- which.min(rf.mse)
best.m
## [1] 10
set.seed(5)
rf.car <- randomForest(Sales ~ ., data = car.train,
mtry = best.m, importance = TRUE)
importance(rf.car)
## %IncMSE IncNodePurity
## CompPrice 27.1722326 205.190735
## Income 4.3008697 76.560849
## Advertising 14.3704387 106.431105
## Population -1.5085424 59.456769
## Price 53.6151274 533.471449
## ShelveLoc 35.9098417 287.343274
## Age 8.5948537 122.671428
## Education -0.2318547 39.539272
## Urban -0.2705500 10.129864
## US 0.3891643 7.645046
varImpPlot(rf.car, main = "Random forest: variable importance")
As \(m\) grows from \(1\) toward \(p\), test MSE drops sharply and then flattens; beyond roughly \(m\approx p/2\) the differences are small. For this particular split the minimum happens to fall at \(m = 10 = p\) (i.e. bagging), although with other seeds or splits a somewhat smaller \(m\) often wins because of inter-tree decorrelation. ShelveLoc and Price are again the top predictors.
set.seed(6)
x <- model.matrix(Sales ~ . - 1, data = Carseats)
y <- Carseats$Sales
xtr <- x[ train, ]; ytr <- y[ train]
xte <- x[-train, ]; yte <- y[-train]
bart.car <- gbart(xtr, ytr, x.test = xte)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 12, 200
## y1,yn: 0.154000, -2.406000
## x1,x[n*p]: 140.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.212816,7.346
## *****sigma: 1.045244
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,12,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
yhat.bart <- bart.car$yhat.test.mean
mse.bart <- mean((yhat.bart - yte)^2)
mse.bart
## [1] 1.368754
Summary of test MSEs for Carseats:
# best.m = p on this split, so the tuned random forest coincides with bagging;
# we therefore report the bagging MSE for both rows.
data.frame(
Method = c("Single tree", "Pruned tree", "Bagging", "Random forest", "BART"),
TestMSE = round(c(mse.tree, mse.prune, mse.bag, mse.bag, mse.bart), 3)
)
## Method TestMSE
## 1 Single tree 4.472
## 2 Pruned tree 4.794
## 3 Bagging 2.565
## 4 Random forest 2.565
## 5 BART 1.369
Ensemble methods cut the test MSE of a single tree roughly in half, and BART does better still, roughly halving it again relative to bagging and the random forest.
data(OJ, package = "ISLR2")
set.seed(7)
train <- sample(nrow(OJ), 800)
oj.train <- OJ[ train, ]
oj.test <- OJ[-train, ]
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "WeekofPurchase" "ListPriceDiff"
## [5] "DiscMM" "STORE"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7329 = 579.7 / 791
## Misclassification error rate: 0.1588 = 127 / 800
As reported in the summary above, the tree has 9 terminal nodes and a training misclassification rate of 0.1588 (127/800).
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60500 0.39500 )
## 2) LoyalCH < 0.5036 360 431.00 MM ( 0.28611 0.71389 )
## 4) LoyalCH < 0.136698 109 57.19 MM ( 0.07339 0.92661 ) *
## 5) LoyalCH > 0.136698 251 333.00 MM ( 0.37849 0.62151 )
## 10) PriceDiff < 0.195 110 107.30 MM ( 0.19091 0.80909 ) *
## 11) PriceDiff > 0.195 141 195.10 CH ( 0.52482 0.47518 )
## 22) WeekofPurchase < 247.5 55 70.90 MM ( 0.34545 0.65455 ) *
## 23) WeekofPurchase > 247.5 86 112.40 CH ( 0.63953 0.36047 ) *
## 3) LoyalCH > 0.5036 440 346.80 CH ( 0.86591 0.13409 )
## 6) LoyalCH < 0.764572 185 215.90 CH ( 0.72973 0.27027 )
## 12) ListPriceDiff < 0.255 102 139.50 CH ( 0.56863 0.43137 )
## 24) DiscMM < 0.47 87 113.30 CH ( 0.64368 0.35632 ) *
## 25) DiscMM > 0.47 15 11.78 MM ( 0.13333 0.86667 ) *
## 13) ListPriceDiff > 0.255 83 43.08 CH ( 0.92771 0.07229 ) *
## 7) LoyalCH > 0.764572 255 77.87 CH ( 0.96471 0.03529 )
## 14) STORE < 1.5 136 0.00 CH ( 1.00000 0.00000 ) *
## 15) STORE > 1.5 119 63.78 CH ( 0.92437 0.07563 ) *
Reading one terminal node: the line
4) LoyalCH < 0.136698 109 57.19 MM ( 0.07339 0.92661 ) *
means there are 109 training observations in this leaf, the deviance is
\(\approx 57.19\), the predicted class
is MM, and within this leaf the empirical class
proportions are \((\hat p_{CH},\hat p_{MM}) =
(0.073, 0.927)\) — a nearly pure MM leaf reached when
LoyalCH is small.
plot(tree.oj); text(tree.oj, pretty = 0, cex = 0.7)
LoyalCH (customer brand loyalty for Citrus Hill) is the
dominant predictor: it appears at the root and in several subsequent
splits.
pred.oj <- predict(tree.oj, oj.test, type = "class")
cm <- table(Predicted = pred.oj, True = oj.test$Purchase)
cm
## True
## Predicted CH MM
## CH 151 31
## MM 18 70
test.err.full <- 1 - sum(diag(cm)) / sum(cm)
test.err.full
## [1] 0.1814815
set.seed(7)
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 9 8 5 2 1
##
## $dev
## [1] 151 154 154 165 316
##
## $k
## [1] -Inf 0.000000 3.666667 8.000000 154.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.oj$size, cv.oj$dev, type = "b", pch = 19,
xlab = "Tree size", ylab = "CV classification errors",
main = "OJ: CV error vs. tree size")
best.size <- cv.oj$size[which.min(cv.oj$dev)]
best.size
## [1] 9
# If CV picks the unpruned tree, fall back to a 5-leaf tree per the prompt.
size.for.prune <- if (best.size == length(unique(tree.oj$where))) 5 else best.size
prune.oj <- prune.misclass(tree.oj, best = size.for.prune)
plot(prune.oj); text(prune.oj, pretty = 0, cex = 0.8)
pred.tr.full <- predict(tree.oj , oj.train, type = "class")
pred.tr.prune <- predict(prune.oj, oj.train, type = "class")
c(unpruned = mean(pred.tr.full != oj.train$Purchase),
pruned = mean(pred.tr.prune != oj.train$Purchase))
## unpruned pruned
## 0.15875 0.17250
The pruned tree has the higher training error — it has strictly fewer leaves, so it fits the training data less closely.
pred.te.prune <- predict(prune.oj, oj.test, type = "class")
c(unpruned = test.err.full,
pruned = mean(pred.te.prune != oj.test$Purchase))
## unpruned pruned
## 0.1814815 0.1925926
The two test error rates are usually very close; pruning trades a small amount of test accuracy (or none at all) for a substantially simpler, more interpretable model.
data(Hitters, package = "ISLR2")
hit <- na.omit(Hitters)
hit$Salary <- log(hit$Salary)
hit.train <- hit[1:200, ]
hit.test <- hit[-(1:200), ]
lambdas <- 10^seq(-3, -0.2, length.out = 20)
train.mse <- test.mse <- numeric(length(lambdas))
for (i in seq_along(lambdas)) {
set.seed(8)
fit <- gbm(Salary ~ ., data = hit.train,
distribution = "gaussian",
n.trees = 1000, shrinkage = lambdas[i],
verbose = FALSE)
train.mse[i] <- mean((predict(fit, hit.train, n.trees = 1000) - hit.train$Salary)^2)
test.mse[i] <- mean((predict(fit, hit.test , n.trees = 1000) - hit.test$Salary)^2)
}
par(mfrow = c(1, 2))
plot(lambdas, train.mse, type = "b", log = "x", pch = 19, col = "steelblue",
xlab = expression(lambda), ylab = "Training MSE", main = "Training MSE")
plot(lambdas, test.mse, type = "b", log = "x", pch = 19, col = "firebrick",
xlab = expression(lambda), ylab = "Test MSE", main = "Test MSE")
par(mfrow = c(1, 1))
best.lambda <- lambdas[which.min(test.mse)]
best.lambda
## [1] 0.0417881
min(test.mse)
## [1] 0.2701264
Training MSE decreases monotonically with \(\lambda\); test MSE shows a clear U-shape with a minimum at a moderate \(\lambda\).
# Linear regression
lm.fit <- lm(Salary ~ ., data = hit.train)
lm.pred <- predict(lm.fit, hit.test)
mse.lm <- mean((lm.pred - hit.test$Salary)^2)
# Lasso via glmnet
x.train <- model.matrix(Salary ~ ., data = hit.train)[, -1]
y.train <- hit.train$Salary
x.test <- model.matrix(Salary ~ ., data = hit.test )[, -1]
y.test <- hit.test$Salary
set.seed(8)
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.pred <- predict(cv.lasso, s = "lambda.min", newx = x.test)
mse.lasso <- mean((lasso.pred - y.test)^2)
data.frame(
Method = c("Boosting (best λ)", "Linear regression", "Lasso"),
TestMSE = round(c(min(test.mse), mse.lm, mse.lasso), 4)
)
## Method TestMSE
## 1 Boosting (best λ) 0.2701
## 2 Linear regression 0.4918
## 3 Lasso 0.4706
Boosting comfortably beats both linear regression and the lasso.
set.seed(8)
best.gbm <- gbm(Salary ~ ., data = hit.train,
distribution = "gaussian",
n.trees = 1000, shrinkage = best.lambda,
verbose = FALSE)
summary(best.gbm, plotit = FALSE) |> head(10)
## var rel.inf
## CAtBat CAtBat 16.891401
## CHits CHits 10.341500
## CWalks CWalks 10.106442
## CRBI CRBI 8.097595
## CRuns CRuns 8.092889
## Years Years 6.196511
## PutOuts PutOuts 5.954579
## Walks Walks 5.813423
## CHmRun CHmRun 5.189429
## Hits Hits 4.882924
CAtBat, CRBI, CHits,
CRuns, and CWalks — the career counting stats
— dominate.
set.seed(8)
bag.hit <- randomForest(Salary ~ ., data = hit.train,
mtry = ncol(hit.train) - 1, ntree = 500)
mse.bag <- mean((predict(bag.hit, hit.test) - hit.test$Salary)^2)
mse.bag
## [1] 0.2321237
Bagging is competitive with boosting on these data; here its test MSE (0.232) is in fact slightly lower than the best boosted model's (0.270).
data(Caravan, package = "ISLR2")
Caravan$Purchase01 <- ifelse(Caravan$Purchase == "Yes", 1, 0)
car.train <- Caravan[ 1:1000, ]
car.test <- Caravan[-(1:1000), ]
set.seed(9)
boost.cv <- gbm(Purchase01 ~ . - Purchase, data = car.train,
distribution = "bernoulli",
n.trees = 1000, shrinkage = 0.01,
verbose = FALSE)
head(summary(boost.cv, plotit = FALSE), 10)
## var rel.inf
## PPERSAUT PPERSAUT 15.165012
## MKOOPKLA MKOOPKLA 9.254921
## MOPLHOOG MOPLHOOG 7.149152
## MBERMIDD MBERMIDD 5.798803
## PBRAND PBRAND 5.789747
## MGODGE MGODGE 4.419720
## MINK3045 MINK3045 4.056977
## ABRAND ABRAND 3.881660
## MOSTYPE MOSTYPE 2.644470
## MSKA MSKA 2.610447
PPERSAUT, MKOOPKLA, MOPLHOOG,
MBERMIDD, and PBRAND typically lead the
importance ranking.
prob.boost <- predict(boost.cv, car.test, n.trees = 1000, type = "response")
pred.boost <- ifelse(prob.boost > 0.20, "Yes", "No")
cm.boost <- table(Predicted = pred.boost, True = car.test$Purchase)
cm.boost
## True
## Predicted No Yes
## No 4415 253
## Yes 118 36
# Precision: of those predicted Yes, fraction that actually buy
prec.boost <- cm.boost["Yes", "Yes"] / sum(cm.boost["Yes", ])
prec.boost
## [1] 0.2337662
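For context (a check of ours, not part of the original output), the base rate of buyers in the test set:
mean(car.test$Purchase == "Yes") # roughly 0.06, so precision 0.23 is about 4x the base rate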
Compare with logistic regression and KNN:
# Logistic regression
glm.fit <- glm(Purchase01 ~ . - Purchase, data = car.train, family = binomial)
prob.glm <- predict(glm.fit, car.test, type = "response")
pred.glm <- ifelse(prob.glm > 0.20, "Yes", "No")
cm.glm <- table(Predicted = pred.glm, True = car.test$Purchase)
prec.glm <- cm.glm["Yes", "Yes"] / sum(cm.glm["Yes", ])
# KNN with standardised predictors
std <- scale(Caravan[, !(names(Caravan) %in% c("Purchase", "Purchase01"))])
knn.pred <- knn(train = std[1:1000, ],
test = std[-(1:1000), ],
cl = Caravan$Purchase[1:1000],
k = 5)
cm.knn <- table(Predicted = knn.pred, True = car.test$Purchase)
prec.knn <- if ("Yes" %in% rownames(cm.knn))
cm.knn["Yes", "Yes"] / sum(cm.knn["Yes", ]) else NA
data.frame(
Method = c("Boosting (cut 0.2)", "Logistic regression (cut 0.2)", "KNN (k=5)"),
Precision_Yes = round(c(prec.boost, prec.glm, prec.knn), 3)
)
## Method Precision_Yes
## 1 Boosting (cut 0.2) 0.234
## 2 Logistic regression (cut 0.2) 0.142
## 3 KNN (k=5) 0.270
Boosting’s precision on the “Yes” class — about 0.23 here — is roughly four times the base rate of buyers (~6%) and well above logistic regression at the same 0.20 cutoff; KNN with \(k=5\) happens to edge it out on this split.
We use the Default data set from ISLR2,
predicting default (Yes/No) from student,
balance, and income.
data(Default, package = "ISLR2")
Default$default01 <- ifelse(Default$default == "Yes", 1, 0)
set.seed(10)
tr <- sample(nrow(Default), 7000)
def.train <- Default[ tr, ]
def.test <- Default[-tr, ]
y.test <- def.test$default
# Logistic regression
glm.fit <- glm(default ~ student + balance + income, data = def.train, family = binomial)
prob.glm <- predict(glm.fit, def.test, type = "response")
pred.glm <- ifelse(prob.glm > 0.5, "Yes", "No")
err.glm <- mean(pred.glm != y.test)
# Boosting
set.seed(10)
boost.def <- gbm(default01 ~ student + balance + income, data = def.train,
distribution = "bernoulli", n.trees = 1000,
shrinkage = 0.01, verbose = FALSE)
prob.boost <- predict(boost.def, def.test, n.trees = 1000, type = "response")
pred.boost <- ifelse(prob.boost > 0.5, "Yes", "No")
err.boost <- mean(pred.boost != y.test)
# Bagging
set.seed(10)
bag.def <- randomForest(default ~ student + balance + income,
data = def.train, mtry = 3, ntree = 500)
err.bag <- mean(predict(bag.def, def.test) != y.test)
# Random forest
set.seed(10)
rf.def <- randomForest(default ~ student + balance + income,
data = def.train, mtry = 2, ntree = 500)
err.rf <- mean(predict(rf.def, def.test) != y.test)
# BART — binary outcome, so use pbart() (probit BART), not gbart()
x <- model.matrix(default01 ~ student + balance + income - 1, data = Default)
y <- Default$default01
set.seed(10)
bart.def <- pbart(x[tr, ], y[tr], x.test = x[-tr, ])
## *****Into main of pbart
## *****Data:
## data:n,p,np: 7000, 4, 3000
## y1,yn: 0, 0
## x1,x[n*p]: 1.000000, 44402.106231
## xp1,xp[np*p]: 1.000000, 16862.952321
## *****Number of Trees: 50
## *****Number of Cut Points: 1 ... 100
## *****burn and ndpost: 100, 1000
## *****Prior:mybeta,alpha,tau: 2.000000,0.950000,0.212132
## *****binaryOffset: -1.830717
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,4,0
## *****nkeeptrain,nkeeptest,nkeeptreedraws: 1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skiptreedraws: 1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 9s
## check counts
## trcnt,tecnt: 1000,1000
# pbart returns posterior mean probabilities directly in prob.test.mean
pred.bart <- ifelse(bart.def$prob.test.mean > 0.5, "Yes", "No")
err.bart <- mean(pred.bart != y.test)
data.frame(
Method = c("Logistic regression", "Boosting", "Bagging",
"Random forest", "BART"),
TestError = round(c(err.glm, err.boost, err.bag, err.rf, err.bart), 4)
)
## Method TestError
## 1 Logistic regression 0.0267
## 2 Boosting 0.0280
## 3 Bagging 0.0350
## 4 Random forest 0.0337
## 5 BART 0.0287
On Default the classes are very imbalanced (~3% defaulters), so all
methods land in a narrow band of test error, roughly 2.7–3.5%. Here
logistic regression actually comes out best, with boosting and BART
close behind and bagging and the random forest trailing slightly. The
differences are small, which is itself a useful
conclusion: when a problem has a clean signal (balance is
overwhelmingly predictive) and is nearly linear on the logit scale,
sophisticated nonparametric methods do not buy much over a
well-specified GLM.
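As a final sanity check (ours, not part of the exercise), the no-information rule that always predicts “No” already lands near the bottom of that band:
mean(y.test != "No") # the test-set default rate, roughly 0.03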
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.0
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_4.0.1 class_7.3-23
## [4] leaps_3.2 glmnet_5.0 Matrix_1.7-4
## [7] MASS_7.3-65 BART_2.9.10 survival_3.8-3
## [10] nlme_3.1-168 gbm_2.2.3 randomForest_4.7-1.2
## [13] tree_1.0-45 ISLR2_1.3-2
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 shape_1.4.6.1 lattice_0.22-7
## [5] digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
## [9] RColorBrewer_1.1-3 iterators_1.0.14 fastmap_1.2.0 foreach_1.5.2
## [13] jsonlite_2.0.0 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
## [17] cli_3.6.5 rlang_1.1.7 splines_4.5.2 withr_3.0.2
## [21] cachem_1.1.0 yaml_2.3.12 tools_4.5.2 parallel_4.5.2
## [25] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5 pkgconfig_2.0.3
## [29] bslib_0.9.0 pillar_1.11.1 gtable_0.3.6 glue_1.8.0
## [33] Rcpp_1.1.1 xfun_0.55 tibble_3.3.1 tidyselect_1.2.1
## [37] rstudioapi_0.17.1 knitr_1.51 farver_2.1.2 htmltools_0.5.9
## [41] rmarkdown_2.30 compiler_4.5.2 S7_0.2.1