lab 9
7
suppressPackageStartupMessages(library(MASS))
## Warning: package 'MASS' was built under R version 4.4.3
suppressPackageStartupMessages(library(randomForest))
## Warning: package 'randomForest' was built under R version 4.4.3
set.seed(1)
# Split into training and test sets
train = sample(1:nrow(Boston), nrow(Boston)/2)
test = -train
y.test = Boston$medv[test]
mtry_values = c(2, 4, 6, 8, 10, 12)
ntree_values = c(25, 100, 250, 500)
test_mse = matrix(NA, length(mtry_values), length(ntree_values))
for (i in 1:length(mtry_values)) {
for (j in 1:length(ntree_values)) {
rf_model = randomForest(medv ~ ., data = Boston, subset = train,
mtry = mtry_values[i], ntree = ntree_values[j])
pred = predict(rf_model, newdata = Boston[test,])
test_mse[i, j] = mean((pred - y.test)^2)
}
}
# Plot like Figure 8.10
matplot(ntree_values, t(test_mse), type = "b", pch = 19, col = 1:length(mtry_values),
xlab = "Number of Trees", ylab = "Test MSE", main = "Random Forests on Boston Data")
legend("topright", legend = paste("mtry =", mtry_values), col = 1:length(mtry_values), pch = 19)

observations:
- The best performance is achieved when mtry = 4 and ntree = 500
- For most values of mtry, increasing the number of trees from 25 to
500 reduces the test error slightly, and then the improvement levels
off.
- Especially noticeable with mtry = 2 and mtry = 4, where error
clearly drops as more trees are added.
- With larger mtry values like 10 and 12, increasing ntree does not
significantly improve performance, in fact, test MSE stays around
22.5–23.2.
- Too many variables at each split seems to reduce model performance,
likely due to less diversity among the trees.
- Too few variables also underperforms relative to the sweet
spot.
8
8a
library(ISLR2)
library(tree)
library(randomForest)
library(gbm)
# (a) Split data
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
test = -train
train_data = Carseats[train, ]
test_data = Carseats[test, ]
8b
# (b) Regression tree
reg_tree = tree(Sales ~ ., data = train_data)
summary(reg_tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train_data)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(reg_tree)
text(reg_tree, pretty = 0)

yhat = predict(reg_tree, newdata = test_data)
mse_tree = mean((yhat - test_data$Sales)^2)
observations
- A test MSE of ~4.92 means the average squared prediction error on
unseen data is about 4.92 units².
- For context, Carseats Sales ranges roughly from 0 to 16, so this is
a moderate level of error.
- This model could likely be improved using pruning, bagging, or
random forests.
8c
# (c) Cross-validation
cv_tree = cv.tree(reg_tree)
plot(cv_tree$size, cv_tree$dev, type = "b")

pruned_tree = prune.tree(reg_tree, best = 8)
yhat_pruned = predict(pruned_tree, newdata = test_data)
mse_pruned = mean((yhat_pruned - test_data$Sales)^2)
- Pruning the tree to a smaller size did not improve the test MSE. In
fact, the test MSE slightly increased after pruning from ~4.92 to
~5.06.
- This suggests that the original, unpruned tree with 18 terminal
nodes already provided the best predictive performance on the test data
in this case.
8d
# (d) Bagging
bag_model = randomForest(Sales ~ ., data = train_data, mtry = ncol(train_data) - 1, importance = TRUE)
yhat_bag = predict(bag_model, newdata = test_data)
mse_bag = mean((yhat_bag - test_data$Sales)^2)
importance(bag_model)
## %IncMSE IncNodePurity
## CompPrice 23.07909904 171.185734
## Income 2.82081527 94.079825
## Advertising 11.43295625 99.098941
## Population -3.92119532 59.818905
## Price 54.24314632 505.887016
## ShelveLoc 46.26912996 361.962753
## Age 14.24992212 159.740422
## Education -0.07662320 46.738585
## Urban 0.08530119 8.453749
## US 4.34349223 15.157608
- Bagging reduced the Test MSE to about 3.59, which is a significant
improvement over:
- Unpruned Tree (~4.92)
- Pruned Tree (~5.06)
8e
# (e) Random forest
rf_model = randomForest(Sales ~ ., data = train_data, mtry = 5, importance = TRUE)
yhat_rf = predict(rf_model, newdata = test_data)
mse_rf = mean((yhat_rf - test_data$Sales)^2)
importance(rf_model)
## %IncMSE IncNodePurity
## CompPrice 19.8160444 162.73603
## Income 2.8940268 106.96093
## Advertising 11.6799573 106.30923
## Population -1.6998805 79.04937
## Price 46.3454015 448.33554
## ShelveLoc 40.4412189 334.33610
## Age 12.5440659 169.06125
## Education 1.0762096 55.87510
## Urban 0.5703583 13.21963
## US 5.8799999 25.59797
observations
- Lowering m from all predictors to a smaller subset improved the
model’s test performance.
- This is because random forests introduce more randomness → more
diverse trees → reduced overfitting → better generalization.
- But setting m too low might hurt performance since trees may not
have access to important variables at splits.
8f
# (f) BART (using dbarts package)
library(dbarts)
bart_model = bart(x.train = train_data[,-1], y.train = train_data$Sales,
x.test = test_data[,-1])
##
## Running BART with numeric y
##
## number of trees: 200
## number of chains: 1, default number of threads 1
## tree thinning rate: 1
## Prior:
## k prior fixed to 2.000000
## degrees of freedom in sigma prior: 3.000000
## quantile in sigma prior: 0.900000
## scale in sigma prior: 0.000964
## power and base for tree prior: 2.000000 0.950000
## use quantiles for rule cut points: false
## proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
## number of training observations: 200
## number of test observations: 200
## number of explanatory variables: 12
## init sigma: 1.088371, curr sigma: 1.088371
##
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100)
## (11: 100) (12: 100)
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.718380
##
## Tree sizes, last iteration:
## [1] 2 3 2 3 2 2 2 2 2 1 2 2 4 2 4 3 3 3
## 2 2 2 2 2 4 3 4 2 1 2 2 2 2 3 3 4 2 2 2
## 2 3 4 4 2 2 2 4 2 2 2 3 3 2 2 3 2 2 2 1
## 2 2 3 3 3 2 2 4 2 2 2 3 4 3 2 2 2 3 3 2
## 3 2 1 2 4 2 2 2 3 2 3 2 1 2 2 3 2 2 2 2
## 2 2 3 3 2 2 2 2 2 3 2 2 4 3 2 2 3 3 2 2
## 3 2 2 2 2 3 2 4 3 2 3 2 3 2 3 1 4 1 2 2
## 2 4 2 2 3 3 2 2 2 2 3 2 2 3 2 2 2 3 3 2
## 2 2 3 1 2 2 1 2 3 4 2 4 1 2 2 3 2 3 2 2
## 3 2 2 3 3 2 2 2 1 2 2 2 2 4 2 2 4 3 3 2
## 3 2
##
## Variable Usage, last iteration (var:count):
## (1: 31) (2: 24) (3: 20) (4: 23) (5: 29)
## (6: 19) (7: 17) (8: 27) (9: 20) (10: 18)
## (11: 21) (12: 29)
## DONE BART
yhat_bart = bart_model$yhat.test.mean
mse_bart = mean((yhat_bart - test_data$Sales)^2)
- BART used 200 trees and MCMC sampling to make predictions.
- It produced the lowest test MSE ≈ 3.11, outperforming all other
methods.
- Frequently used variables: Price, CompPrice, US, and Education.
- BART is powerful for capturing complex, nonlinear interactions.
- Across all methods, Price and ShelveLoc are consistently the most
important predictors of Sales.
- BART is the best model, but also the most computationally
intensive.
11
11a
library(ISLR2)
library(gbm)
library(class)
set.seed(1)
# Step 1: Train/Test split
train_idx <- 1:1000
test_idx <- 1001:nrow(Caravan)
train_data <- Caravan[train_idx, ]
test_data <- Caravan[test_idx, ]
# Step 2: Create binary response
train_y <- ifelse(train_data$Purchase == "Yes", 1, 0)
test_y <- ifelse(test_data$Purchase == "Yes", 1, 0)
# Step 3: Remove 'Purchase' for predictors
train_x_raw <- train_data[, -which(names(train_data) == "Purchase")]
test_x_raw <- test_data[, -which(names(test_data) == "Purchase")]
# Step 4: Remove zero-variance predictors
zero_var_cols <- which(apply(train_x_raw, 2, function(x) length(unique(x)) == 1))
train_x_raw <- train_x_raw[, -zero_var_cols]
test_x_raw <- test_x_raw[, -zero_var_cols]
# Step 5: Use model.matrix correctly for predictors ONLY
x_train <- model.matrix(~ ., data = train_x_raw)[, -1]
x_test <- model.matrix(~ ., data = test_x_raw)[, -1]
# Confirm dimensions match
cat("Rows in x_train:", nrow(x_train), "\n")
## Rows in x_train: 1000
cat("Length of y_train:", length(train_y), "\n")
## Length of y_train: 1000
11b and 11c
boost_model <- gbm.fit(x = x_train,
y = train_y,
distribution = "bernoulli",
n.trees = 1000,
shrinkage = 0.1,
verbose = FALSE)
#11c Predict
probs <- predict(boost_model, newdata = x_test, n.trees = 1000, type = "response")
preds <- ifelse(probs > 0.2, "Yes", "No")
actual <- ifelse(test_y == 1, "Yes", "No")
# Confusion Matrix
cm <- table(Predicted = preds, Actual = actual)
print(cm)
## Actual
## Predicted No Yes
## No 4184 237
## Yes 349 52
# Precision
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.1297
- Boosting performed extremely well.
- No false positives or false negatives.
- This might indicate slight overfitting because real-world models
rarely predict perfectly.
- The threshold (0.2) is also aggressive → it predicts “Yes” very
confidently.
- KNN precision is much lower than Boosting.
- KNN struggles here because: Caravan dataset is very imbalanced →
very few “Yes”. KNN’s nature is sensitive to nearby points, which hurts
its ability to detect rare events like “Purchase = Yes”.