lab 9

7

suppressPackageStartupMessages(library(MASS))
## Warning: package 'MASS' was built under R version 4.4.3
suppressPackageStartupMessages(library(randomForest))
## Warning: package 'randomForest' was built under R version 4.4.3
set.seed(1)

# Split into training and test sets
train = sample(1:nrow(Boston), nrow(Boston)/2)
test = -train
y.test = Boston$medv[test]

mtry_values = c(2, 4, 6, 8, 10, 12)
ntree_values = c(25, 100, 250, 500)
test_mse = matrix(NA, length(mtry_values), length(ntree_values))

for (i in 1:length(mtry_values)) {
  for (j in 1:length(ntree_values)) {
    rf_model = randomForest(medv ~ ., data = Boston, subset = train,
                            mtry = mtry_values[i], ntree = ntree_values[j])
    pred = predict(rf_model, newdata = Boston[test,])
    test_mse[i, j] = mean((pred - y.test)^2)
  }
}

# Plot like Figure 8.10
matplot(ntree_values, t(test_mse), type = "b", pch = 19, col = 1:length(mtry_values),
        xlab = "Number of Trees", ylab = "Test MSE", main = "Random Forests on Boston Data")
legend("topright", legend = paste("mtry =", mtry_values), col = 1:length(mtry_values), pch = 19)

observations:

  • The best performance is achieved with mtry = 4 and ntree = 500 (see the sketch after this list).
  • For most values of mtry, increasing the number of trees from 25 to 500 reduces the test error slightly, and then the improvement levels off.
  • This is especially noticeable with mtry = 2 and mtry = 4, where the error clearly drops as more trees are added.
  • With larger mtry values such as 10 and 12, increasing ntree does not meaningfully improve performance; the test MSE stays around 22.5–23.2.
  • Allowing too many variables at each split seems to hurt performance, likely because the trees become less diverse (more correlated).
  • Too few variables at each split also underperforms relative to the sweet spot.
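
A quick way to confirm the first bullet directly from the results (a minimal sketch; it assumes the test_mse matrix from the loop above is still in memory):

# Locate the (mtry, ntree) combination with the lowest test MSE
best = which(test_mse == min(test_mse), arr.ind = TRUE)[1, ]
mtry_values[best[1]]     # row index  -> best mtry
ntree_values[best[2]]    # column index -> best ntree
round(min(test_mse), 2)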

8

8a

library(ISLR2)
library(tree)
library(randomForest)
library(gbm)

# (a) Split data
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
test = -train
train_data = Carseats[train, ]
test_data = Carseats[test, ]

8b

# (b) Regression tree
reg_tree = tree(Sales ~ ., data = train_data)
summary(reg_tree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train_data)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
plot(reg_tree)
text(reg_tree, pretty = 0)

yhat = predict(reg_tree, newdata = test_data)
mse_tree = mean((yhat - test_data$Sales)^2)

observations

  • A test MSE of ~4.92 means the average squared prediction error on unseen data is about 4.92 squared units of Sales.
  • For context, Sales in the Carseats data ranges roughly from 0 to 16, so this is a moderate level of error (see the quick check after this list).
  • This model could likely be improved by pruning, bagging, or random forests.
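
A quick check that puts the ~4.92 figure in context (a small sketch using objects already defined above):

# Typical prediction error in the original units of Sales
sqrt(mse_tree)           # roughly 2.2 (square root of the ~4.92 test MSE)
range(Carseats$Sales)    # Sales runs from roughly 0 to 16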

8c

# (c) Cross-validation
cv_tree = cv.tree(reg_tree)
plot(cv_tree$size, cv_tree$dev, type = "b")

pruned_tree = prune.tree(reg_tree, best = 8)
yhat_pruned = predict(pruned_tree, newdata = test_data)
mse_pruned = mean((yhat_pruned - test_data$Sales)^2)

observations

  • Pruning the tree to a smaller size did not improve the test MSE; in fact, it increased slightly after pruning, from ~4.92 to ~5.06.
  • This suggests that the original, unpruned tree with 18 terminal nodes already provided the best predictive performance on the test data in this case (a way to pick the pruning size from the cross-validation results is sketched below).
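
Rather than hard-coding best = 8, the size that minimizes the cross-validated deviance can be read directly off the cv.tree object (a minimal sketch using cv_tree from above):

# Pick the tree size with the lowest cross-validated deviance and re-prune
best_size = cv_tree$size[which.min(cv_tree$dev)]
pruned_cv = prune.tree(reg_tree, best = best_size)
mean((predict(pruned_cv, newdata = test_data) - test_data$Sales)^2)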

8d

# (d) Bagging
bag_model = randomForest(Sales ~ ., data = train_data, mtry = ncol(train_data) - 1, importance = TRUE)
yhat_bag = predict(bag_model, newdata = test_data)
mse_bag = mean((yhat_bag - test_data$Sales)^2)
importance(bag_model)
##                 %IncMSE IncNodePurity
## CompPrice   23.07909904    171.185734
## Income       2.82081527     94.079825
## Advertising 11.43295625     99.098941
## Population  -3.92119532     59.818905
## Price       54.24314632    505.887016
## ShelveLoc   46.26912996    361.962753
## Age         14.24992212    159.740422
## Education   -0.07662320     46.738585
## Urban        0.08530119      8.453749
## US           4.34349223     15.157608

observations

  • Bagging reduced the test MSE to about 3.59, a clear improvement over both the unpruned tree (~4.92) and the pruned tree (~5.06).
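
The importance table above can also be viewed graphically with randomForest's built-in plot (a short sketch on the fitted bagging model):

# Dotcharts of %IncMSE and IncNodePurity; Price and ShelveLoc dominate both
varImpPlot(bag_model, main = "Variable Importance (Bagging)")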

8e

# (e) Random forest
rf_model = randomForest(Sales ~ ., data = train_data, mtry = 5, importance = TRUE)
yhat_rf = predict(rf_model, newdata = test_data)
mse_rf = mean((yhat_rf - test_data$Sales)^2)
importance(rf_model)
##                %IncMSE IncNodePurity
## CompPrice   19.8160444     162.73603
## Income       2.8940268     106.96093
## Advertising 11.6799573     106.30923
## Population  -1.6998805      79.04937
## Price       46.3454015     448.33554
## ShelveLoc   40.4412189     334.33610
## Age         12.5440659     169.06125
## Education    1.0762096      55.87510
## Urban        0.5703583      13.21963
## US           5.8799999      25.59797

observations

  • Lowering m from all predictors (bagging) to a smaller subset improved the model’s test performance.
  • This is because random forests inject extra randomness at each split, which decorrelates the trees, reduces variance, and improves generalization.
  • Setting m too low can hurt performance, however, since trees may not have access to the important variables at many splits (see the sweep sketched below).
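
To see the effect of m directly, the test MSE can be traced over a grid of mtry values (a minimal sketch; it refits the forest once per value, so it takes a moment):

# Test MSE on the Carseats split as mtry varies from 1 to all 10 predictors
mtry_grid = 1:10
rf_mse = sapply(mtry_grid, function(m) {
  fit = randomForest(Sales ~ ., data = train_data, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test_data) - test_data$Sales)^2)
})
plot(mtry_grid, rf_mse, type = "b", xlab = "mtry", ylab = "Test MSE")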

8f

# (f) BART (using dbarts package)
library(dbarts)
bart_model = bart(x.train = train_data[,-1], y.train = train_data$Sales,
                  x.test = test_data[,-1])
## 
## Running BART with numeric y
## 
## number of trees: 200
## number of chains: 1, default number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  degrees of freedom in sigma prior: 3.000000
##  quantile in sigma prior: 0.900000
##  scale in sigma prior: 0.000964
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 200
##  number of test observations: 200
##  number of explanatory variables: 12
##  init sigma: 1.088371, curr sigma: 1.088371
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100) 
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100) 
## (11: 100) (12: 100) 
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.718380
## 
## Tree sizes, last iteration:
## [1] 2 3 2 3 2 2 2 2 2 1 2 2 4 2 4 3 3 3 
## 2 2 2 2 2 4 3 4 2 1 2 2 2 2 3 3 4 2 2 2 
## 2 3 4 4 2 2 2 4 2 2 2 3 3 2 2 3 2 2 2 1 
## 2 2 3 3 3 2 2 4 2 2 2 3 4 3 2 2 2 3 3 2 
## 3 2 1 2 4 2 2 2 3 2 3 2 1 2 2 3 2 2 2 2 
## 2 2 3 3 2 2 2 2 2 3 2 2 4 3 2 2 3 3 2 2 
## 3 2 2 2 2 3 2 4 3 2 3 2 3 2 3 1 4 1 2 2 
## 2 4 2 2 3 3 2 2 2 2 3 2 2 3 2 2 2 3 3 2 
## 2 2 3 1 2 2 1 2 3 4 2 4 1 2 2 3 2 3 2 2 
## 3 2 2 3 3 2 2 2 1 2 2 2 2 4 2 2 4 3 3 2 
## 3 2 
## 
## Variable Usage, last iteration (var:count):
## (1: 31) (2: 24) (3: 20) (4: 23) (5: 29) 
## (6: 19) (7: 17) (8: 27) (9: 20) (10: 18) 
## (11: 21) (12: 29) 
## DONE BART
yhat_bart = bart_model$yhat.test.mean
mse_bart = mean((yhat_bart - test_data$Sales)^2)

observations

  • BART used 200 trees and MCMC posterior sampling to make its predictions.
  • It produced the lowest test MSE, about 3.11, outperforming all the other methods (a side-by-side comparison is sketched below).
  • The variables used most frequently in the posterior draws include Price, CompPrice, US, and Education.
  • BART is well suited to capturing complex, nonlinear interactions.
  • Across all methods, Price and ShelveLoc are consistently the most important predictors of Sales.
  • BART is the best model here, but also the most computationally intensive.
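
Since the test MSE for each method was stored above, a side-by-side comparison is straightforward; the second line is a sketch that assumes the dbarts fit stores per-draw variable-use counts in its varcount component:

# Collect the test MSEs computed in parts (b)-(f)
round(c(Tree = mse_tree, Pruned = mse_pruned, Bagging = mse_bag,
        RandomForest = mse_rf, BART = mse_bart), 2)

# Average number of splits on each predictor across the posterior draws
# (varcount is assumed to be the per-draw usage matrix returned by dbarts)
sort(colMeans(bart_model$varcount), decreasing = TRUE)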

11

11a

library(ISLR2)
library(gbm)
library(class)

set.seed(1)

# Step 1: Train/Test split
train_idx <- 1:1000
test_idx <- 1001:nrow(Caravan)

train_data <- Caravan[train_idx, ]
test_data  <- Caravan[test_idx, ]

# Step 2: Create binary response
train_y <- ifelse(train_data$Purchase == "Yes", 1, 0)
test_y  <- ifelse(test_data$Purchase == "Yes", 1, 0)

# Step 3: Remove 'Purchase' for predictors
train_x_raw <- train_data[, -which(names(train_data) == "Purchase")]
test_x_raw  <- test_data[, -which(names(test_data) == "Purchase")]

# Step 4: Remove zero-variance predictors
zero_var_cols <- which(apply(train_x_raw, 2, function(x) length(unique(x)) == 1))
train_x_raw <- train_x_raw[, -zero_var_cols]
test_x_raw  <- test_x_raw[, -zero_var_cols]

# Step 5: Use model.matrix correctly for predictors ONLY
x_train <- model.matrix(~ ., data = train_x_raw)[, -1]
x_test  <- model.matrix(~ ., data = test_x_raw)[, -1]

# Confirm dimensions match
cat("Rows in x_train:", nrow(x_train), "\n")
## Rows in x_train: 1000
cat("Length of y_train:", length(train_y), "\n")
## Length of y_train: 1000

11b and 11c

boost_model <- gbm.fit(x = x_train,
                       y = train_y,
                       distribution = "bernoulli",
                       n.trees = 1000,
                       shrinkage = 0.1,
                       verbose = FALSE)
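
Part (b) of the exercise also asks which predictors appear most important; gbm reports this as relative influence (a minimal sketch on the fitted model above, with plotit = FALSE to suppress the bar chart):

# Top predictors ranked by relative influence
rel_inf <- summary(boost_model, n.trees = 1000, plotit = FALSE)
head(rel_inf, 10)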

#11c Predict
probs <- predict(boost_model, newdata = x_test, n.trees = 1000, type = "response")
preds <- ifelse(probs > 0.2, "Yes", "No")
actual <- ifelse(test_y == 1, "Yes", "No")

# Confusion Matrix
cm <- table(Predicted = preds, Actual = actual)
print(cm)
##          Actual
## Predicted   No  Yes
##       No  4184  237
##       Yes  349   52
# Precision
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.1297

observations

  • With a 0.2 probability threshold, boosting flags 401 test customers as likely purchasers, and 52 of them actually purchase, giving a precision of about 13%.
  • That is roughly double the base purchase rate in the test set (289 of 4,822, about 6%), so the model is useful for targeting likely buyers even though the precision looks low in absolute terms.
  • Using 0.2 instead of the usual 0.5 cutoff makes the model predict “Yes” more liberally, trading some precision for more true positives in this highly imbalanced problem.
  • KNN precision is much lower than boosting’s (a sketch of that comparison follows this list).
  • KNN struggles here because the Caravan data are very imbalanced (very few “Yes” responses), and its reliance on nearby points makes it hard to detect rare events like Purchase = “Yes”.
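
The KNN comparison the last two bullets refer to is not shown above; a minimal sketch of how it could be run (assuming k = 5, predictors standardized with the training means and SDs, and at least one test case predicted as a purchase):

# Standardize, fit KNN with class::knn, and compute precision among predicted buyers
x_train_std <- scale(x_train)
x_test_std  <- scale(x_test,
                     center = attr(x_train_std, "scaled:center"),
                     scale  = attr(x_train_std, "scaled:scale"))
knn_pred <- knn(x_train_std, x_test_std, cl = factor(train_y), k = 5)  # k = 5 is an assumed choice
knn_cm   <- table(Predicted = knn_pred, Actual = test_y)
knn_cm["1", "1"] / sum(knn_cm["1", ])   # precision among those predicted to buy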