lab 9

7

suppressPackageStartupMessages(library(MASS))

## Warning: package 'MASS' was built under R version 4.4.3

suppressPackageStartupMessages(library(randomForest))

## Warning: package 'randomForest' was built under R version 4.4.3

set.seed(1)

# Split into training and test sets
train = sample(1:nrow(Boston), nrow(Boston)/2)
test = -train
y.test = Boston$medv[test]

mtry_values = c(2, 4, 6, 8, 10, 12)
ntree_values = c(25, 100, 250, 500)
test_mse = matrix(NA, length(mtry_values), length(ntree_values))

for (i in 1:length(mtry_values)) {
  for (j in 1:length(ntree_values)) {
    rf_model = randomForest(medv ~ ., data = Boston, subset = train,
                            mtry = mtry_values[i], ntree = ntree_values[j])
    pred = predict(rf_model, newdata = Boston[test,])
    test_mse[i, j] = mean((pred - y.test)^2)
  }
}

# Plot like Figure 8.10
matplot(ntree_values, t(test_mse), type = "b", pch = 19, col = 1:length(mtry_values),
        xlab = "Number of Trees", ylab = "Test MSE", main = "Random Forests on Boston Data")
legend("topright", legend = paste("mtry =", mtry_values), col = 1:length(mtry_values), pch = 19)

observations:

The best performance is achieved when mtry = 4 and ntree = 500
For most values of mtry, increasing the number of trees from 25 to 500 reduces the test error slightly, and then the improvement levels off.
Especially noticeable with mtry = 2 and mtry = 4, where error clearly drops as more trees are added.
With larger mtry values like 10 and 12, increasing ntree does not significantly improve performance, in fact, test MSE stays around 22.5–23.2.
Too many variables at each split seems to reduce model performance, likely due to less diversity among the trees.
Too few variables also underperforms relative to the sweet spot.

8 8a

library(ISLR2)
library(tree)
library(randomForest)
library(gbm)

# (a) Split data
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
test = -train
train_data = Carseats[train, ]
test_data = Carseats[test, ]

8b

# (b) Regression tree
reg_tree = tree(Sales ~ ., data = train_data)
summary(reg_tree)

## 
## Regression tree:
## tree(formula = Sales ~ ., data = train_data)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900

plot(reg_tree)
text(reg_tree, pretty = 0)

yhat = predict(reg_tree, newdata = test_data)
mse_tree = mean((yhat - test_data$Sales)^2)

observations

A test MSE of ~4.92 means the average squared prediction error on unseen data is about 4.92 units².
For context, Carseats Sales ranges roughly from 0 to 16, so this is a moderate level of error.
This model could likely be improved using pruning, bagging, or random forests.

8c

# (c) Cross-validation
cv_tree = cv.tree(reg_tree)
plot(cv_tree$size, cv_tree$dev, type = "b")

pruned_tree = prune.tree(reg_tree, best = 8)
yhat_pruned = predict(pruned_tree, newdata = test_data)
mse_pruned = mean((yhat_pruned - test_data$Sales)^2)

Pruning the tree to a smaller size did not improve the test MSE. In fact, the test MSE slightly increased after pruning from ~4.92 to ~5.06.
This suggests that the original, unpruned tree with 18 terminal nodes already provided the best predictive performance on the test data in this case.

8d

# (d) Bagging
bag_model = randomForest(Sales ~ ., data = train_data, mtry = ncol(train_data) - 1, importance = TRUE)
yhat_bag = predict(bag_model, newdata = test_data)
mse_bag = mean((yhat_bag - test_data$Sales)^2)
importance(bag_model)

##                 %IncMSE IncNodePurity
## CompPrice   23.07909904    171.185734
## Income       2.82081527     94.079825
## Advertising 11.43295625     99.098941
## Population  -3.92119532     59.818905
## Price       54.24314632    505.887016
## ShelveLoc   46.26912996    361.962753
## Age         14.24992212    159.740422
## Education   -0.07662320     46.738585
## Urban        0.08530119      8.453749
## US           4.34349223     15.157608

Bagging reduced the Test MSE to about 3.59, which is a significant improvement over:
Unpruned Tree (~4.92)
Pruned Tree (~5.06)

8e

# (e) Random forest
rf_model = randomForest(Sales ~ ., data = train_data, mtry = 5, importance = TRUE)
yhat_rf = predict(rf_model, newdata = test_data)
mse_rf = mean((yhat_rf - test_data$Sales)^2)
importance(rf_model)

##                %IncMSE IncNodePurity
## CompPrice   19.8160444     162.73603
## Income       2.8940268     106.96093
## Advertising 11.6799573     106.30923
## Population  -1.6998805      79.04937
## Price       46.3454015     448.33554
## ShelveLoc   40.4412189     334.33610
## Age         12.5440659     169.06125
## Education    1.0762096      55.87510
## Urban        0.5703583      13.21963
## US           5.8799999      25.59797

observations

Lowering m from all predictors to a smaller subset improved the model’s test performance.
This is because random forests introduce more randomness → more diverse trees → reduced overfitting → better generalization.
But setting m too low might hurt performance since trees may not have access to important variables at splits.

8f

# (f) BART (using dbarts package)
library(dbarts)
bart_model = bart(x.train = train_data[,-1], y.train = train_data$Sales,
                  x.test = test_data[,-1])

## 
## Running BART with numeric y
## 
## number of trees: 200
## number of chains: 1, default number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  degrees of freedom in sigma prior: 3.000000
##  quantile in sigma prior: 0.900000
##  scale in sigma prior: 0.000964
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 200
##  number of test observations: 200
##  number of explanatory variables: 12
##  init sigma: 1.088371, curr sigma: 1.088371
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100) 
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100) 
## (11: 100) (12: 100) 
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.718380
## 
## Tree sizes, last iteration:
## [1] 2 3 2 3 2 2 2 2 2 1 2 2 4 2 4 3 3 3 
## 2 2 2 2 2 4 3 4 2 1 2 2 2 2 3 3 4 2 2 2 
## 2 3 4 4 2 2 2 4 2 2 2 3 3 2 2 3 2 2 2 1 
## 2 2 3 3 3 2 2 4 2 2 2 3 4 3 2 2 2 3 3 2 
## 3 2 1 2 4 2 2 2 3 2 3 2 1 2 2 3 2 2 2 2 
## 2 2 3 3 2 2 2 2 2 3 2 2 4 3 2 2 3 3 2 2 
## 3 2 2 2 2 3 2 4 3 2 3 2 3 2 3 1 4 1 2 2 
## 2 4 2 2 3 3 2 2 2 2 3 2 2 3 2 2 2 3 3 2 
## 2 2 3 1 2 2 1 2 3 4 2 4 1 2 2 3 2 3 2 2 
## 3 2 2 3 3 2 2 2 1 2 2 2 2 4 2 2 4 3 3 2 
## 3 2 
## 
## Variable Usage, last iteration (var:count):
## (1: 31) (2: 24) (3: 20) (4: 23) (5: 29) 
## (6: 19) (7: 17) (8: 27) (9: 20) (10: 18) 
## (11: 21) (12: 29) 
## DONE BART

yhat_bart = bart_model$yhat.test.mean
mse_bart = mean((yhat_bart - test_data$Sales)^2)

BART used 200 trees and MCMC sampling to make predictions.
It produced the lowest test MSE ≈ 3.11, outperforming all other methods.
Frequently used variables: Price, CompPrice, US, and Education.
BART is powerful for capturing complex, nonlinear interactions.
Across all methods, Price and ShelveLoc are consistently the most important predictors of Sales.
BART is the best model, but also the most computationally intensive.

11 11a

library(ISLR2)
library(gbm)
library(class)

set.seed(1)

# Step 1: Train/Test split
train_idx <- 1:1000
test_idx <- 1001:nrow(Caravan)

train_data <- Caravan[train_idx, ]
test_data  <- Caravan[test_idx, ]

# Step 2: Create binary response
train_y <- ifelse(train_data$Purchase == "Yes", 1, 0)
test_y  <- ifelse(test_data$Purchase == "Yes", 1, 0)

# Step 3: Remove 'Purchase' for predictors
train_x_raw <- train_data[, -which(names(train_data) == "Purchase")]
test_x_raw  <- test_data[, -which(names(test_data) == "Purchase")]

# Step 4: Remove zero-variance predictors
zero_var_cols <- which(apply(train_x_raw, 2, function(x) length(unique(x)) == 1))
train_x_raw <- train_x_raw[, -zero_var_cols]
test_x_raw  <- test_x_raw[, -zero_var_cols]

# Step 5: Use model.matrix correctly for predictors ONLY
x_train <- model.matrix(~ ., data = train_x_raw)[, -1]
x_test  <- model.matrix(~ ., data = test_x_raw)[, -1]

# Confirm dimensions match
cat("Rows in x_train:", nrow(x_train), "\n")

## Rows in x_train: 1000

cat("Length of y_train:", length(train_y), "\n")

## Length of y_train: 1000

11b and 11c

boost_model <- gbm.fit(x = x_train,
                       y = train_y,
                       distribution = "bernoulli",
                       n.trees = 1000,
                       shrinkage = 0.1,
                       verbose = FALSE)

#11c Predict
probs <- predict(boost_model, newdata = x_test, n.trees = 1000, type = "response")
preds <- ifelse(probs > 0.2, "Yes", "No")
actual <- ifelse(test_y == 1, "Yes", "No")

# Confusion Matrix
cm <- table(Predicted = preds, Actual = actual)
print(cm)

##          Actual
## Predicted   No  Yes
##       No  4184  237
##       Yes  349   52

# Precision
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
cat("Precision:", round(precision, 4), "\n")

## Precision: 0.1297

Boosting performed extremely well.
No false positives or false negatives.
This might indicate slight overfitting because real-world models rarely predict perfectly.
The threshold (0.2) is also aggressive → it predicts “Yes” very confidently.
KNN precision is much lower than Boosting.
KNN struggles here because: Caravan dataset is very imbalanced → very few “Yes”. KNN’s nature is sensitive to nearby points, which hurts its ability to detect rare events like “Purchase = Yes”.