library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
data(Boston)
# Split into training and test sets
set.seed(42) # For reproducibility
train_index <- sample(1:nrow(Boston), nrow(Boston) * 0.8)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
# Setup mtry and ntree combinations
mtry_values <- c(2, 6, 13)
ntree_values <- seq(25, 500, by = 25)
# Create a matrix to store test MSE
error_matrix <- matrix(NA, nrow = length(ntree_values), ncol = length(mtry_values))
# Train Random Forests and calculate test MSE
for (i in seq_along(mtry_values)) {
  for (j in seq_along(ntree_values)) {
    model <- randomForest(medv ~ ., data = train_data,
                          mtry = mtry_values[i],
                          ntree = ntree_values[j])
    preds <- predict(model, newdata = test_data)
    error_matrix[j, i] <- mean((preds - test_data$medv)^2)
  }
}
# Plotting test error curves (like Figure 8.10)
plot(ntree_values, error_matrix[, 1], type = "l", col = "orange", lwd = 2,
     ylim = range(error_matrix), xlab = "Number of Trees", ylab = "Test MSE",
     main = "Random Forest Test MSE vs Number of Trees")
lines(ntree_values, error_matrix[, 2], col = "blue", lwd = 2)
lines(ntree_values, error_matrix[, 3], col = "darkgreen", lwd = 2)
legend("topright", legend = c("mtry = 2", "mtry = 6", "mtry = 13"),
       col = c("orange", "blue", "darkgreen"), lty = 1, lwd = 2)
The plot illustrates the test Mean Squared Error (MSE) of Random Forest models trained on the Boston housing dataset, varying both the number of trees (ntree) and the number of predictors considered at each split (mtry).
As expected, increasing the number of trees leads to a reduction in test MSE, especially at the beginning. However, beyond approximately 100–150 trees, the error stabilizes, indicating diminishing returns.
The best test performance was observed with mtry = 13, which corresponds to bagging, where all predictors are considered at each split. This model consistently produced the lowest test MSE across all ntree values.

With mtry = 6 (roughly half of the predictors), the model performed moderately well, but slightly worse than mtry = 13.

The model with mtry = 2 (very few predictors at each split) had the highest test error, showing that excessive randomness (too few candidate features per split) can reduce model accuracy.

Overall, these results suggest that for the Boston dataset, using a larger number of predictors (mtry = 13) improves performance by allowing more informed splits, especially when the number of trees is sufficiently high.
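To make the visual comparison concrete, the error matrix can also be summarized numerically. This is a small sketch using the objects defined above; it reports the minimum test MSE achieved by each mtry value and the number of trees at which that minimum occurred.

# Minimum test MSE per mtry, and the ntree value at which it occurs
best_idx <- apply(error_matrix, 2, which.min)
data.frame(mtry = mtry_values,
           min_test_mse = round(apply(error_matrix, 2, min), 3),
           ntree_at_min = ntree_values[best_idx])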
# load ISLR2 for Carseats dataset
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Boston
## The following object is masked from 'package:MASS':
##
## Boston
# Load dataset
data("Carseats")
# Set seed
set.seed(123)
# split into train/test
train_index <- sample(1:nrow(Carseats), nrow(Carseats) * 0.8)
train_data <- Carseats[train_index, ]
test_data <- Carseats[-train_index, ]
# load tree package
library(tree)
## Warning: package 'tree' was built under R version 4.4.3
# Fit regression tree
tree_model <- tree(Sales ~ ., data = train_data)
# Plot tree
plot(tree_model)
text(tree_model, pretty = 0)
# Predict and calculate test MSE
pred_tree <- predict(tree_model, newdata = test_data)
test_mse_tree <- mean((pred_tree - test_data$Sales)^2)
test_mse_tree
## [1] 4.1593
We fit a regression tree to predict Sales using the Carseats dataset. The resulting tree splits primarily on ShelveLoc, Price, and Advertising, which are important predictors of sales.

The test Mean Squared Error (MSE) from this unpruned tree is approximately 4.16. This is the average squared difference between predicted and actual sales on the test set. While the model captures the structure of the data, it may still benefit from pruning or ensemble methods to reduce overfitting.
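To verify which predictors the tree actually uses and how complex it is, summary() on the fitted tree object reports the split variables, the number of terminal nodes, and the residual mean deviance (a quick check; output not shown here):

# Split variables, terminal node count, and residual mean deviance
summary(tree_model)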
# Cross-validation for pruning
set.seed(123)
cv_result <- cv.tree(tree_model)
# Plot cross-validation deviance
plot(cv_result$size, cv_result$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
# Prune the tree to optimal size
# Note: which.min() returns an index into cv_result$dev, not a tree size,
# so map it back to the corresponding entry of cv_result$size
pruned_tree <- prune.tree(tree_model, best = cv_result$size[which.min(cv_result$dev)])
# Predict and calculate test MSE after pruning
pred_pruned <- predict(pruned_tree, newdata = test_data)
test_mse_pruned <- mean((pred_pruned - test_data$Sales)^2)
test_mse_pruned
## [1] 5.153416
We used cross-validation to determine the optimal tree size. The deviance plot suggests that the model improves with increased complexity, but the improvement levels off beyond a tree size of about 15.

After pruning the tree to the size with the lowest cross-validation deviance, we evaluated it on the test set. The test MSE increased to approximately 5.15, compared to 4.16 for the unpruned tree.

This suggests that, in this case, pruning did not improve test performance and may have removed useful complexity from the tree.
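Rather than committing to a single pruned size, a quick sweep over the candidate sizes from the cross-validation sequence (a sketch using the objects defined above) shows how test MSE varies with tree complexity:

# Test MSE for each candidate subtree size; size 1 is skipped,
# since a root-only tree is not useful here
sizes <- sort(unique(cv_result$size[cv_result$size >= 2]))
mse_by_size <- sapply(sizes, function(s) {
  pruned <- prune.tree(tree_model, best = s)
  mean((predict(pruned, newdata = test_data) - test_data$Sales)^2)
})
data.frame(size = sizes, test_mse = round(mse_by_size, 3))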
library(randomForest)
# Bagging is a random forest with mtry = total number of predictors (i.e., p = 10 here)
bagging_model <- randomForest(Sales ~ ., data = train_data,
                              mtry = ncol(train_data) - 1, # all predictors
                              importance = TRUE)
# Predict on test data and calculate MSE
bagging_preds <- predict(bagging_model, newdata = test_data)
bagging_mse <- mean((bagging_preds - test_data$Sales)^2)
bagging_mse
## [1] 2.09785
# Variable importance plot
importance(bagging_model)
## %IncMSE IncNodePurity
## CompPrice 33.1243339 261.076685
## Income 13.1818619 147.724933
## Advertising 21.3729145 183.369399
## Population -0.1367788 82.358503
## Price 74.9849854 759.415740
## ShelveLoc 82.7567474 819.785995
## Age 23.6332704 233.822901
## Education 3.4234651 73.697124
## Urban -1.2644901 8.309333
## US 0.9397562 9.482278
varImpPlot(bagging_model)
We applied bagging by setting mtry = 10 (i.e., using all predictors) in the randomForest call. The resulting test MSE was approximately 2.10, a notable improvement over both the unpruned tree (MSE ≈ 4.16) and the pruned tree (MSE ≈ 5.15).

The variable importance output reports two measures:

%IncMSE: the increase in out-of-bag MSE when the values of that variable are randomly permuted, a measure of predictive importance.

IncNodePurity: the total decrease in node impurity (residual sum of squares) from splits on that variable, summed over all trees.

The importance plot confirms these rankings, showing that ShelveLoc and Price dominate, followed by CompPrice, Age, and Advertising, in predicting sales.
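For a ranked view rather than the unordered table, the importance matrix can be sorted by %IncMSE (a small sketch using the fitted bagging_model):

# Rank predictors by permutation importance, highest first
imp <- importance(bagging_model)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]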
# Try a smaller mtry, e.g., 4
rf_model <- randomForest(Sales ~ ., data = train_data,
                         mtry = 4, importance = TRUE)
# Predict and compute test MSE
rf_preds <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_preds - test_data$Sales)^2)
rf_mse
## [1] 2.420248
# Importance and plot
importance(rf_model)
## %IncMSE IncNodePurity
## CompPrice 21.183476 235.30121
## Income 8.672682 187.83249
## Advertising 20.252858 214.44557
## Population -1.514596 123.02870
## Price 52.673728 657.64478
## ShelveLoc 58.703049 690.20717
## Age 20.107102 284.74253
## Education 2.230843 105.89264
## Urban -2.061270 17.41564
## US 3.737412 23.72108
varImpPlot(rf_model)
We then applied a random forest to the Carseats dataset with mtry = 4, which introduces more randomness by considering fewer variables at each split. The resulting test MSE was approximately 2.42, slightly higher than bagging (2.10) but still much lower than the pruned (5.15) and unpruned (4.16) trees.
According to the variable importance measures:
ShelveLoc, Price, and CompPrice are consistently ranked as top predictors.
%IncMSE and IncNodePurity both confirm that these features play major roles in predicting Sales.
Compared to bagging (which considers all predictors at every split), the added randomness of the random forest slightly reduced performance here, but random forests generally offer a good balance between bias and variance, especially in datasets with noisy or correlated features.
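The choice mtry = 4 is somewhat arbitrary; a sketch sweeping mtry from 1 to p makes the trade-off visible (this refits ten forests, so it takes a moment):

# Test MSE as a function of mtry on the Carseats split
p <- ncol(train_data) - 1 # number of predictors
set.seed(123)
mse_by_mtry <- sapply(1:p, function(m) {
  fit <- randomForest(Sales ~ ., data = train_data, mtry = m)
  mean((predict(fit, newdata = test_data) - test_data$Sales)^2)
})
plot(1:p, mse_by_mtry, type = "b", xlab = "mtry", ylab = "Test MSE")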
# Load the BART package
library(BART)
## Warning: package 'BART' was built under R version 4.4.3
## Loading required package: nlme
## Loading required package: survival
# Prepare predictors and response for BART
x_train <- train_data[, -which(names(train_data) == "Sales")]
y_train <- train_data$Sales
x_test <- test_data[, -which(names(test_data) == "Sales")]
y_test <- test_data$Sales
# Fit BART model
set.seed(123)
bart_model <- gbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 320, 14, 80
## y1,yn: 3.066531, 1.886531
## x1,x[n*p]: 104.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.284787,3,0.198835,7.59347
## *****sigma: 1.010327
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 6s
## trcnt,tecnt: 1000,1000
# Predict and calculate test MSE
bart_preds <- bart_model$yhat.test.mean
bart_mse <- mean((bart_preds - y_test)^2)
bart_mse
## [1] 1.609113
We applied Bayesian Additive Regression Trees (BART) to the Carseats dataset to predict Sales. The model was fit with 200 trees, 100 burn-in draws, and 1,000 posterior draws (gbart expands the three factor predictors into indicator columns, which is why the output reports p = 14 rather than 10). The final test MSE was approximately 1.61, the lowest among all models evaluated in this exercise.
This result highlights BART’s ability to:
Automatically capture nonlinearities and interactions
Perform robust prediction without extensive tuning
Estimate uncertainty via posterior sampling
While BART is computationally more intensive than other tree-based methods, its predictive accuracy in this case clearly outperformed decision trees, bagging, and random forests.
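The uncertainty point can be made concrete: gbart stores the full set of posterior draws in yhat.test (one row per MCMC draw, one column per test observation), so pointwise credible intervals are straightforward. A sketch:

# 95% posterior credible intervals for f(x) at each test observation
# (intervals for the mean function, not prediction intervals)
ci <- apply(bart_model$yhat.test, 2, quantile, probs = c(0.025, 0.975))
# Fraction of test responses falling inside their interval
mean(y_test >= ci[1, ] & y_test <= ci[2, ])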
# load required packages
library(ISLR2)
# Load the Caravan dataset
data("Caravan")
# Standardize predictors (excluding Purchase)
Caravan_std <- scale(Caravan[, -86])
Caravan <- data.frame(Caravan_std, Purchase = Caravan$Purchase)
# Split: first 1000 as training, rest as testing
train_data <- Caravan[1:1000, ]
test_data <- Caravan[1001:nrow(Caravan), ]
# Convert response to binary (1 = Yes, 0 = No)
train_data$Purchase <- as.numeric(train_data$Purchase == "Yes")
test_data$Purchase <- as.numeric(test_data$Purchase == "Yes")
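The response is heavily imbalanced: only a small fraction of customers purchase caravan insurance (roughly 6% in this dataset), which is why the classification threshold below is set at 0.20 rather than 0.50. A quick check:

# Class balance in the training set; purchasers are a small minority
table(train_data$Purchase)
prop.table(table(train_data$Purchase))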
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Drop predictors with zero variance in the training split
# (they carry no information for gbm)
zero_var_cols <- which(apply(train_data[, -ncol(train_data)], 2, function(x) var(x) == 0))
if (length(zero_var_cols) > 0) {
  train_data <- train_data[, -zero_var_cols]
  test_data <- test_data[, -zero_var_cols]
}
# Boosting with different shrinkage values
shrinkage_vals <- c(0.01, 0.05, 0.1, 0.2)
test_errors <- c()
for (lambda in shrinkage_vals) {
  boost_model <- gbm(Purchase ~ ., data = train_data,
                     distribution = "bernoulli",
                     n.trees = 1000,
                     shrinkage = lambda,
                     interaction.depth = 2,
                     verbose = FALSE)
  probs <- predict(boost_model, newdata = test_data, n.trees = 1000, type = "response")
  pred_class <- ifelse(probs > 0.2, 1, 0)
  test_error <- mean(pred_class != test_data$Purchase)
  test_errors <- c(test_errors, test_error)
  cat("Shrinkage:", lambda, "- Test Error:", round(test_error, 4), "\n")
}
## Shrinkage: 0.01 - Test Error: 0.089
## Shrinkage: 0.05 - Test Error: 0.106
## Shrinkage: 0.1 - Test Error: 0.1101
## Shrinkage: 0.2 - Test Error: 0.1132
After testing shrinkage values of 0.01, 0.05, 0.1, and 0.2, the lowest test error (0.089) was obtained with shrinkage = 0.01.
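A quick plot of the recorded errors (a sketch using the vectors from the loop above) shows the pattern: with 1,000 trees fixed, smaller shrinkage generalizes better on this split.

# Test error as a function of shrinkage, log-scaled x axis
plot(shrinkage_vals, test_errors, type = "b", log = "x",
     xlab = "Shrinkage (lambda)", ylab = "Test error")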
Using the summary() function on the final boosting model, we found that the most important predictors of Purchase were:

PPERSAUT (contribution to car policies)
MOPLHOOG (high education level)
PBRAND (contribution to fire policies)
MKOOPKLA (purchasing power class)
These variables had the highest relative influence on the model and were consistently identified as strong indicators of purchasing insurance.
best_lambda <- shrinkage_vals[which.min(test_errors)]
# Refit model using best shrinkage value
boost_final <- gbm(Purchase ~ ., data = train_data,
                   distribution = "bernoulli",
                   n.trees = 1000,
                   shrinkage = best_lambda,
                   interaction.depth = 2,
                   verbose = FALSE)
# Display top variables
cat("\nTop Influential Variables (Best λ = ", best_lambda, "):\n", sep = "")
##
## Top Influential Variables (Best λ = 0.01):
summary(boost_final)
## var rel.inf
## PPERSAUT PPERSAUT 9.43856710
## MOPLHOOG MOPLHOOG 6.70997788
## PBRAND PBRAND 6.24585548
## MKOOPKLA MKOOPKLA 6.11720833
## MGODGE MGODGE 5.58200505
## MBERMIDD MBERMIDD 4.23893755
## MOSTYPE MOSTYPE 3.60044146
## MINK3045 MINK3045 3.30910205
## MAUT2 MAUT2 3.14886756
## MGODPR MGODPR 2.70395493
## MSKC MSKC 2.21336037
## ABRAND ABRAND 2.20241146
## MAUT1 MAUT1 2.18700253
## MSKB1 MSKB1 2.17994739
## MSKA MSKA 2.17545848
## MFWEKIND MFWEKIND 2.12188942
## PWAPART PWAPART 2.03043834
## MBERARBG MBERARBG 1.98923676
## MINK7512 MINK7512 1.89694664
## MINKGEM MINKGEM 1.76867935
## MBERHOOG MBERHOOG 1.64151519
## MINKM30 MINKM30 1.49253572
## MRELGE MRELGE 1.46669044
## MGODRK MGODRK 1.46233118
## MGODOV MGODOV 1.45691143
## APERSAUT APERSAUT 1.42493958
## MFGEKIND MFGEKIND 1.35770674
## MGEMLEEF MGEMLEEF 1.30552145
## MAUT0 MAUT0 1.25411582
## MRELOV MRELOV 1.10106898
## MHHUUR MHHUUR 1.09325220
## MINK4575 MINK4575 1.05242240
## MSKB2 MSKB2 0.97958284
## MOPLMIDD MOPLMIDD 0.97806593
## MOSHOOFD MOSHOOFD 0.93911868
## MBERARBO MBERARBO 0.85075891
## MFALLEEN MFALLEEN 0.84347628
## MRELSA MRELSA 0.80042337
## PMOTSCO PMOTSCO 0.78783306
## PBYSTAND PBYSTAND 0.73751298
## MZFONDS MZFONDS 0.73446051
## MSKD MSKD 0.69110122
## MZPART MZPART 0.66922497
## PLEVEN PLEVEN 0.60648211
## MGEMOMV MGEMOMV 0.59563922
## MBERBOER MBERBOER 0.49419067
## MHKOOP MHKOOP 0.40893332
## MOPLLAAG MOPLLAAG 0.35406646
## MINK123M MINK123M 0.32067551
## MBERZELF MBERZELF 0.19211816
## MAANTHUI MAANTHUI 0.04703655
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG PGEZONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS PFIETS 0.00000000
## PINBOED PINBOED 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## ALEVEN ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS AFIETS 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
Compared to earlier results from KNN models on this dataset (a reference sketch follows below):

Boosting achieved a lower test error and higher precision among customers predicted to purchase.

KNN tends to perform poorly here: the classes are heavily imbalanced, and distances in this high-dimensional predictor space are not very informative even after standardization.

Boosting was better at identifying the subtle patterns in the high-dimensional space.
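For comparison, here is a minimal KNN sketch on the same standardized split (using class::knn; K = 5 is an arbitrary, untuned choice), together with the boosting model's precision at the 0.20 threshold:

library(class)
# KNN on the standardized predictors (response column excluded)
knn_pred <- knn(train = train_data[, -ncol(train_data)],
                test = test_data[, -ncol(test_data)],
                cl = factor(train_data$Purchase), k = 5)
mean(knn_pred != test_data$Purchase) # KNN test error

# Precision of the boosting model: among customers predicted to
# purchase, what fraction actually did?
probs <- predict(boost_final, newdata = test_data, n.trees = 1000, type = "response")
pred_class <- ifelse(probs > 0.2, 1, 0)
sum(pred_class == 1 & test_data$Purchase == 1) / sum(pred_class == 1)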