Lab_9

Question_7

library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
data(Boston)
# Split into training and test sets

set.seed(42)  # For reproducibility
train_index <- sample(1:nrow(Boston), nrow(Boston) * 0.8)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
# Setup mtry and ntree combinations

mtry_values <- c(2, 6, 13)            # small subset, about p/2, and all p = 13 predictors (bagging)
ntree_values <- seq(25, 500, by = 25) # forest sizes from 25 to 500 trees


# Create a matrix to store test MSE
error_matrix <- matrix(NA, nrow = length(ntree_values), ncol = length(mtry_values))
# Train Random Forests and calculate test MSE

for (i in 1:length(mtry_values)) {
  for (j in 1:length(ntree_values)) {
    model <- randomForest(medv ~ ., data = train_data,
                          mtry = mtry_values[i],
                          ntree = ntree_values[j])
    preds <- predict(model, newdata = test_data)
    error_matrix[j, i] <- mean((preds - test_data$medv)^2)
  }
}
# Plotting test error curves (like Figure 8.10)

plot(ntree_values, error_matrix[,1], type = "l", col = "orange", lwd = 2,
     ylim = range(error_matrix), xlab = "Number of Trees", ylab = "Test MSE",
     main = "Random Forest Test MSE vs Number of Trees")

lines(ntree_values, error_matrix[,2], col = "blue", lwd = 2)
lines(ntree_values, error_matrix[,3], col = "darkgreen", lwd = 2)

legend("topright", legend = c("mtry = 2", "mtry = 6", "mtry = 13"),
       col = c("orange", "blue", "darkgreen"), lty = 1, lwd = 2)

The plot illustrates the test Mean Squared Error (MSE) of Random Forest models trained on the Boston housing dataset, varying both the number of trees (ntree) and the number of predictors considered at each split (mtry).

  • As expected, increasing the number of trees leads to a reduction in test MSE, especially at the beginning. However, beyond approximately 100–150 trees, the error stabilizes, indicating diminishing returns.

  • The best test performance was observed when mtry = 13, which corresponds to bagging, where all predictors are considered at each split. This model consistently produced the lowest test MSE across all ntree values.

  • When mtry = 6 (half of the predictors), the model performed moderately well, but slightly worse than mtry = 13.

  • The model with mtry = 2 (very few predictors at each split) had the highest test error, showing that excessive randomness (too few features) may reduce model accuracy.

  • Overall, these results suggest that for the Boston dataset, using a larger number of predictors (mtry = 13) improves performance by allowing more informed splits, especially when the number of trees is sufficiently high; a quick way to locate the exact minimum of the error matrix is sketched below.
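To make that claim concrete, here is a minimal sketch, using the error_matrix, mtry_values, and ntree_values objects defined above, that locates the grid cell with the lowest test MSE:

# Find the (ntree, mtry) combination with the smallest test MSE
best_cell <- which(error_matrix == min(error_matrix), arr.ind = TRUE)
cat("Lowest test MSE:", round(min(error_matrix), 3),
    "at ntree =", ntree_values[best_cell[1, "row"]],
    "and mtry =", mtry_values[best_cell[1, "col"]], "\n")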


Question_8

# load ISLR2 for Carseats dataset
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
## 
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Boston
## The following object is masked from 'package:MASS':
## 
##     Boston
# Load dataset
data("Carseats")

# Set seed
set.seed(123)

(a).

# split into train/test
train_index <- sample(1:nrow(Carseats), nrow(Carseats) * 0.8)
train_data <- Carseats[train_index, ]
test_data <- Carseats[-train_index, ]

(b).

# load tree package
library(tree)
## Warning: package 'tree' was built under R version 4.4.3
# Fit regression tree
tree_model <- tree(Sales ~ ., data = train_data)

# Plot tree
plot(tree_model)
text(tree_model, pretty = 0)

# Predict and calculate test MSE
pred_tree <- predict(tree_model, newdata = test_data)
test_mse_tree <- mean((pred_tree - test_data$Sales)^2)
test_mse_tree
## [1] 4.1593

We fit a regression tree to predict Sales using the Carseats dataset. The resulting tree shows splits primarily on ShelveLoc, Price, and Advertising, which are important predictors of sales.

The test Mean Squared Error (MSE) from this unpruned tree is approximately 4.16, i.e. the average squared difference between predicted and actual sales on the test set. While the model captures the broad structure of the data, it may still benefit from pruning or from ensemble methods to reduce overfitting.
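As a quick check on which predictors the tree actually uses, summary() on the fitted tree reports the split variables, the number of terminal nodes, and the residual mean deviance (a small sketch using the tree_model object above):

# Inspect the variables used in the splits and the size of the tree
summary(tree_model)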

(c).

# Cross-validation for pruning
set.seed(123)
cv_result <- cv.tree(tree_model)

# Plot cross-validation deviance
plot(cv_result$size, cv_result$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")

# Prune the tree to the size (number of terminal nodes) with the lowest CV deviance
best_size <- cv_result$size[which.min(cv_result$dev)]
pruned_tree <- prune.tree(tree_model, best = best_size)

# Predict and calculate test MSE after pruning
pred_pruned <- predict(pruned_tree, newdata = test_data)
test_mse_pruned <- mean((pred_pruned - test_data$Sales)^2)
test_mse_pruned
## [1] 5.153416

We used cross-validation to determine the optimal tree size. The deviance plot suggests that the model improves with increased complexity, but the improvement levels off beyond a tree size of ~15.

After pruning the tree to its optimal size (based on the lowest cross-validation deviance), we evaluated it on the test set. The test MSE increased to approximately 5.15, compared with 4.16 for the unpruned tree.

This suggests that, in this case, pruning did not improve the test performance and may have removed useful complexity from the tree.
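To see whether any subtree would have done better, here is a hedged sketch that prunes to every candidate size returned by cv.tree() and computes the corresponding test MSE (size 1 is skipped because a single-node tree just predicts the training mean); it reuses tree_model, cv_result, and test_data from above:

# Test MSE for each candidate subtree size from the CV run
sizes <- cv_result$size[cv_result$size > 1]
size_mse <- sapply(sizes, function(s) {
  pt <- prune.tree(tree_model, best = s)
  mean((predict(pt, newdata = test_data) - test_data$Sales)^2)
})
data.frame(size = sizes, test_mse = round(size_mse, 3))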

(d).

library(randomForest)

# Bagging is a random forest with mtry = total number of predictors (i.e., p = 10 here)
bagging_model <- randomForest(Sales ~ ., data = train_data,
                              mtry = ncol(train_data) - 1,  # all predictors
                              importance = TRUE)

# Predict on test data and calculate MSE
bagging_preds <- predict(bagging_model, newdata = test_data)
bagging_mse <- mean((bagging_preds - test_data$Sales)^2)
bagging_mse
## [1] 2.09785
# Variable importance plot
importance(bagging_model)
##                %IncMSE IncNodePurity
## CompPrice   33.1243339    261.076685
## Income      13.1818619    147.724933
## Advertising 21.3729145    183.369399
## Population  -0.1367788     82.358503
## Price       74.9849854    759.415740
## ShelveLoc   82.7567474    819.785995
## Age         23.6332704    233.822901
## Education    3.4234651     73.697124
## Urban       -1.2644901      8.309333
## US           0.9397562      9.482278
varImpPlot(bagging_model)

We applied the bagging technique by setting mtry = 10 (i.e., using all predictors) in the randomForest model. The resulting test MSE was approximately 2.10, a notable improvement over both the unpruned tree (MSE ≈ 4.16) and the pruned tree (MSE ≈ 5.15).

The variable importance output shows that ShelveLoc and Price are the two most important predictors, according to both measures:

  • %IncMSE: the increase in out-of-bag MSE when that variable's values are randomly permuted.

  • IncNodePurity: the total decrease in node impurity (residual sum of squares) from splits on that variable, averaged over all trees.

The importance plot visually confirms these findings, showing that ShelveLoc, Price, CompPrice, Age, and Advertising play the largest roles in predicting sales; a snippet that ranks the full table by %IncMSE is sketched below.
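A small sketch of that ranking, using the bagging_model fit above:

# Rank predictors by %IncMSE (permutation importance)
imp <- importance(bagging_model)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]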

(e).

# Try a smaller mtry, e.g., 4
rf_model <- randomForest(Sales ~ ., data = train_data,
                         mtry = 4, importance = TRUE)

# Predict and compute test MSE
rf_preds <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_preds - test_data$Sales)^2)
rf_mse
## [1] 2.420248
# Importance and plot
importance(rf_model)
##               %IncMSE IncNodePurity
## CompPrice   21.183476     235.30121
## Income       8.672682     187.83249
## Advertising 20.252858     214.44557
## Population  -1.514596     123.02870
## Price       52.673728     657.64478
## ShelveLoc   58.703049     690.20717
## Age         20.107102     284.74253
## Education    2.230843     105.89264
## Urban       -2.061270      17.41564
## US           3.737412      23.72108
varImpPlot(rf_model)

We applied random forests to the Carseats dataset with mtry = 4, which introduces more randomness by considering fewer variables at each split. The resulting test MSE was approximately 2.42, slightly higher than bagging (2.10) but still well below the pruned (5.15) and unpruned (4.16) trees.

According to the variable importance measures:

  • ShelveLoc, Price, and CompPrice are consistently ranked as the top predictors.

  • %IncMSE and IncNodePurity both confirm that these features play major roles in predicting Sales.

Compared to bagging (which used all predictors), the increased randomness of random forests slightly reduced performance, but it still offers a good balance between bias and variance — especially in datasets with noise or correlated features.
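A natural follow-up is to treat mtry as a tuning parameter. The sketch below refits the forest for several mtry values and reports the test MSE for each, using the same train/test split as above; the grid and ntree = 500 are illustrative choices, not tuned ones.

set.seed(123)
mtry_grid <- c(2, 4, 6, 8, 10)
mtry_mse <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Sales ~ ., data = train_data, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test_data) - test_data$Sales)^2)
})
data.frame(mtry = mtry_grid, test_mse = round(mtry_mse, 3))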

(f).

# Install BART package
library(BART)
## Warning: package 'BART' was built under R version 4.4.3
## Loading required package: nlme
## Loading required package: survival
# Prepare predictors and response for BART
x_train <- train_data[, -which(names(train_data) == "Sales")]
y_train <- train_data$Sales
x_test <- test_data[, -which(names(test_data) == "Sales")]
y_test <- test_data$Sales

# Fit BART model
set.seed(123)
bart_model <- gbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 320, 14, 80
## y1,yn: 3.066531, 1.886531
## x1,x[n*p]: 104.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.284787,3,0.198835,7.59347
## *****sigma: 1.010327
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 6s
## trcnt,tecnt: 1000,1000
# Predict and calculate test MSE
bart_preds <- bart_model$yhat.test.mean
bart_mse <- mean((bart_preds - y_test)^2)
bart_mse
## [1] 1.609113

We applied Bayesian Additive Regression Trees (BART) to the Carseats dataset to predict Sales. The model was fit with 200 trees, 100 burn-in iterations, and 1000 posterior draws. The final test MSE was approximately 1.61, the lowest among all models evaluated in this exercise.

This result highlights BART’s ability to:

  • Automatically capture nonlinearities and interactions

  • Perform robust prediction without extensive tuning

  • Estimate uncertainty via posterior sampling

While BART is computationally more intensive than the other tree-based methods, it clearly outperformed the single decision tree, bagging, and random forests on this test set.
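The point about uncertainty can be made concrete: gbart() returns the individual posterior draws for the test-set fits in yhat.test (one row per draw), so column-wise quantiles give pointwise 95% credible intervals for the fitted function at each test observation. A small sketch using the bart_model object above:

# Pointwise 95% credible intervals for the fitted function at each test point
ci <- apply(bart_model$yhat.test, 2, quantile, probs = c(0.025, 0.975))
head(data.frame(lower = ci[1, ], post_mean = bart_model$yhat.test.mean, upper = ci[2, ]))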


Question_11

# load required packages
library(ISLR2)

# Load the Caravan dataset
data("Caravan")

(a).

# Standardize predictors (excluding Purchase)
Caravan_std <- scale(Caravan[, -86])
Caravan <- data.frame(Caravan_std, Purchase = Caravan$Purchase)

# Split: first 1000 as training, rest as testing
train_data <- Caravan[1:1000, ]
test_data <- Caravan[1001:nrow(Caravan), ]

# Convert response to binary (1 = Yes, 0 = No)
train_data$Purchase <- as.numeric(train_data$Purchase == "Yes")
test_data$Purchase <- as.numeric(test_data$Purchase == "Yes")

(b).

library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Drop any predictors with zero variance in the training rows (gbm cannot split on them)
zero_var_cols <- which(apply(train_data[, -ncol(train_data)], 2, function(x) var(x) == 0))
if (length(zero_var_cols) > 0) {
  train_data <- train_data[, -zero_var_cols]
  test_data <- test_data[, -zero_var_cols]
}

# Boosting with different shrinkage values
shrinkage_vals <- c(0.01, 0.05, 0.1, 0.2)
test_errors <- c()

for (lambda in shrinkage_vals) {
  boost_model <- gbm(Purchase ~ ., data = train_data,
                     distribution = "bernoulli",
                     n.trees = 1000,
                     shrinkage = lambda,
                     interaction.depth = 2,
                     verbose = FALSE)
  
  probs <- predict(boost_model, newdata = test_data, n.trees = 1000, type = "response")
  pred_class <- ifelse(probs > 0.2, 1, 0)
  test_error <- mean(pred_class != test_data$Purchase)
  
  test_errors <- c(test_errors, test_error)
  cat("Shrinkage:", lambda, "- Test Error:", round(test_error, 4), "\n")
}
## Shrinkage: 0.01 - Test Error: 0.089 
## Shrinkage: 0.05 - Test Error: 0.106 
## Shrinkage: 0.1 - Test Error: 0.1101 
## Shrinkage: 0.2 - Test Error: 0.1132

After testing shrinkage values of 0.01, 0.05, 0.1, and 0.2, the lowest test error (0.089) was obtained with shrinkage = 0.01.

Using the summary() function on the boosting model fit with this shrinkage value (see part (c) below), the predictors with the highest relative influence on Purchase were:

  • PPERSAUT (contribution to car policies)

  • MOPLHOOG (proportion with a high level of education)

  • PBRAND (contribution to fire policies)

  • MKOOPKLA (purchasing power class)

These variables were consistently identified as the strongest indicators of whether a customer purchases caravan insurance.
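A quick way to visualize the tuning results is to plot the stored test errors against the shrinkage grid (a small sketch using the shrinkage_vals and test_errors objects from the loop above):

plot(shrinkage_vals, test_errors, type = "b", pch = 19,
     xlab = "Shrinkage (lambda)", ylab = "Test Error",
     main = "Boosting Test Error vs Shrinkage")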

(c).

best_lambda <- shrinkage_vals[which.min(test_errors)]

# Refit model using best shrinkage value
boost_final <- gbm(Purchase ~ ., data = train_data,
                   distribution = "bernoulli",
                   n.trees = 1000,
                   shrinkage = best_lambda,
                   interaction.depth = 2,
                   verbose = FALSE)

# Display top variables
cat("\nTop Influential Variables (Best λ = ", best_lambda, "):\n", sep = "")
## 
## Top Influential Variables (Best λ = 0.01):
summary(boost_final)

##               var    rel.inf
## PPERSAUT PPERSAUT 9.43856710
## MOPLHOOG MOPLHOOG 6.70997788
## PBRAND     PBRAND 6.24585548
## MKOOPKLA MKOOPKLA 6.11720833
## MGODGE     MGODGE 5.58200505
## MBERMIDD MBERMIDD 4.23893755
## MOSTYPE   MOSTYPE 3.60044146
## MINK3045 MINK3045 3.30910205
## MAUT2       MAUT2 3.14886756
## MGODPR     MGODPR 2.70395493
## MSKC         MSKC 2.21336037
## ABRAND     ABRAND 2.20241146
## MAUT1       MAUT1 2.18700253
## MSKB1       MSKB1 2.17994739
## MSKA         MSKA 2.17545848
## MFWEKIND MFWEKIND 2.12188942
## PWAPART   PWAPART 2.03043834
## MBERARBG MBERARBG 1.98923676
## MINK7512 MINK7512 1.89694664
## MINKGEM   MINKGEM 1.76867935
## MBERHOOG MBERHOOG 1.64151519
## MINKM30   MINKM30 1.49253572
## MRELGE     MRELGE 1.46669044
## MGODRK     MGODRK 1.46233118
## MGODOV     MGODOV 1.45691143
## APERSAUT APERSAUT 1.42493958
## MFGEKIND MFGEKIND 1.35770674
## MGEMLEEF MGEMLEEF 1.30552145
## MAUT0       MAUT0 1.25411582
## MRELOV     MRELOV 1.10106898
## MHHUUR     MHHUUR 1.09325220
## MINK4575 MINK4575 1.05242240
## MSKB2       MSKB2 0.97958284
## MOPLMIDD MOPLMIDD 0.97806593
## MOSHOOFD MOSHOOFD 0.93911868
## MBERARBO MBERARBO 0.85075891
## MFALLEEN MFALLEEN 0.84347628
## MRELSA     MRELSA 0.80042337
## PMOTSCO   PMOTSCO 0.78783306
## PBYSTAND PBYSTAND 0.73751298
## MZFONDS   MZFONDS 0.73446051
## MSKD         MSKD 0.69110122
## MZPART     MZPART 0.66922497
## PLEVEN     PLEVEN 0.60648211
## MGEMOMV   MGEMOMV 0.59563922
## MBERBOER MBERBOER 0.49419067
## MHKOOP     MHKOOP 0.40893332
## MOPLLAAG MOPLLAAG 0.35406646
## MINK123M MINK123M 0.32067551
## MBERZELF MBERZELF 0.19211816
## MAANTHUI MAANTHUI 0.04703655
## PWABEDR   PWABEDR 0.00000000
## PWALAND   PWALAND 0.00000000
## PBESAUT   PBESAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT     PWERKT 0.00000000
## PBROM       PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG   PGEZONG 0.00000000
## PWAOREG   PWAOREG 0.00000000
## PZEILPL   PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS     PFIETS 0.00000000
## PINBOED   PINBOED 0.00000000
## AWAPART   AWAPART 0.00000000
## AWABEDR   AWABEDR 0.00000000
## AWALAND   AWALAND 0.00000000
## ABESAUT   ABESAUT 0.00000000
## AMOTSCO   AMOTSCO 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT     AWERKT 0.00000000
## ABROM       ABROM 0.00000000
## ALEVEN     ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG   AGEZONG 0.00000000
## AWAOREG   AWAOREG 0.00000000
## AZEILPL   AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS     AFIETS 0.00000000
## AINBOED   AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000

Compared to previous results from KNN on this dataset (a confusion-matrix check is sketched below):

  • Boosting achieved a lower test error and a higher precision among the customers predicted to purchase.

  • KNN tends to perform poorly here because of the severe class imbalance and its sensitivity to the scaling of the 85 predictors.

  • Boosting was better at identifying the subtle patterns in this high-dimensional feature space.
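A hedged sketch of that check: it forms the confusion matrix for the boosted model at the 20% probability threshold and computes the precision among predicted buyers, alongside a simple KNN baseline on the same standardized predictors; the choice of k = 5 is illustrative, not tuned.

library(class)

# Boosting: confusion matrix and precision at the 20% probability threshold
probs_final <- predict(boost_final, newdata = test_data, n.trees = 1000, type = "response")
pred_final <- ifelse(probs_final > 0.2, 1, 0)
conf_boost <- table(Predicted = pred_final, Actual = test_data$Purchase)
precision_boost <- conf_boost["1", "1"] / sum(conf_boost["1", ])

# KNN baseline (k = 5 chosen for illustration only)
knn_pred <- knn(train = train_data[, -ncol(train_data)],
                test = test_data[, -ncol(test_data)],
                cl = train_data$Purchase, k = 5)
conf_knn <- table(Predicted = knn_pred, Actual = test_data$Purchase)
precision_knn <- conf_knn["1", "1"] / sum(conf_knn["1", ])

conf_boost
round(c(boosting = precision_boost, knn = precision_knn), 3)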