library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
# Optional: define char2seed() if required by your course setup
char2seed <- function(char) sum(utf8ToInt(char))
set.seed(char2seed("Spencer"))
# Train/test split
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
trainDat <- Boston[train, ]
testDat <- Boston[-train, ]
# Values of mtry
mList <- 3:12
# Run random forests for each mtry, store full model
ranFList <- lapply(mList, function(m) {
randomForest(x = trainDat[, -14], y = trainDat$medv,
xtest = testDat[, -14], ytest = testDat$medv,
ntree = 700, mtry = m)
})
# Extract test MSEs across all trees (700) for each mtry value
testErrVals <- do.call("c", lapply(ranFList, function(model) model$test$mse))
# Prepare data for ggplot
MSEDat <- data.frame(
TestErr = testErrVals,
mVal = factor(rep(mList, each = 700)),
nTree = rep(1:700, length(mList))
)
# Plot
ggplot(MSEDat, aes(x = nTree, y = TestErr, colour = mVal)) +
geom_line() +
scale_colour_discrete(name = "mtry Value") +
ylab("Test Mean Squared Error") +
xlab("Number of Trees") +
ggtitle("Test Error vs Trees for Random Forest (Boston Dataset)")
This plot shows how the test mean squared error (MSE) changes with the
number of trees (ntree) in a random forest, for different values of the
mtry parameter (the number of predictors randomly considered at each
split). Intermediate mtry values, roughly 6 to 8, perform best,
balancing bias and variance. The test MSE flattens out after about 200
trees, so additional trees offer only marginal gains. The extremes are
less effective: very small mtry values restrict each split too much,
while the largest values approach bagging and give up much of the
decorrelation benefit of random forests.
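The flattening after about 200 trees and the ranking of mtry values can also be read off numerically rather than from the plot. A minimal sketch, reusing ranFList and mList from above (finalMSE is a new name introduced here):
# Test MSE at the final tree (ntree = 700) for each mtry value
finalMSE <- sapply(ranFList, function(model) tail(model$test$mse, 1))
names(finalMSE) <- paste("mtry =", mList)
round(finalMSE, 3)
# mtry value with the lowest final test MSE
mList[which.min(finalMSE)]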
# Load necessary libraries
library(ISLR2) # for Carseats dataset
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
char2seed <- function(char) sum(utf8ToInt(char)) # re-defined so this chunk runs standalone
set.seed(char2seed("Spencer")) # for reproducibility
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
trainDat <- Carseats[train,]
testDat <- Carseats[-train,]
cat("Training set size:", nrow(trainDat), "\n")
## Training set size: 200
cat("Test set size:", nrow(testDat), "\n")
## Test set size: 200
We split the Carseats dataset into equal training and test sets (200 each) using a fixed seed for reproducibility. This allows us to train models on one half and evaluate performance on unseen data using the other.
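As a quick sanity check on the split (a minimal sketch using only base R and the objects just created), the response distribution should look broadly similar in the two halves:
# Sales should be distributed similarly in the training and test halves
summary(trainDat$Sales)
summary(testDat$Sales)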
library(tree)
library(ISLR2) # for Carseats
# Fit regression tree to training data
regTree <- tree(Sales ~ ., data = trainDat)
# Plot the tree
plot(regTree)
text(regTree, pretty = 0)
# Summary of the tree
summary(regTree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = trainDat)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "Age" "Advertising"
## [6] "Education" "CompPrice"
## Number of terminal nodes: 17
## Residual mean deviance: 2.433 = 445.3 / 183
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.43900 -1.14900 0.08696 0.00000 1.06400 3.85100
# Predict on test set
preds <- predict(regTree, newdata = testDat)
# Compute test MSE
testMSE <- mean((preds - testDat$Sales)^2)
cat("Test MSE:", testMSE, "\n")
## Test MSE: 4.584845
- Main splitting variable: ShelveLoc (whether the shelf location is Bad, Medium, or Good). This is the first split, meaning it is the most important factor in predicting Sales.
- Number of terminal nodes: 17. The model partitions the data into 17 distinct groups, each predicting a constant Sales value (shown at the bottom of each leaf).
- Residual mean deviance (training error): 2.433. This is the residual sum of squares divided by the residual degrees of freedom (445.3 / 183, where 183 = 200 training observations minus 17 terminal nodes).
- Test mean squared error (MSE): 4.58. This is the average squared error when predicting Sales on the test set, the key indicator of how well the model generalizes to new data.
The regression tree shows that store layout (ShelveLoc) and pricing (Price, CompPrice) are major drivers of Sales. Age of customers, income, and advertising also contribute but are used at deeper levels of the tree. A test MSE of ~4.58 suggests moderate predictive accuracy. While interpretable, a regression tree like this may benefit from pruning or switching to ensemble methods (like bagging or random forest) for better generalization.
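Because the test MSE is on the squared scale of Sales (recorded in thousands of units), the root MSE is often easier to interpret. A small sketch using testMSE from above; with a test MSE of about 4.58 the RMSE is roughly 2.14, i.e. predictions are typically off by about 2,140 units:
# RMSE puts the error back on the original scale of Sales (thousands of units)
testRMSE <- sqrt(testMSE)
cat("Test RMSE:", round(testRMSE, 3), "\n")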
#c
# Cross-validation to find optimal tree size
set.seed(char2seed("Spencer")) # or set.seed(123) if char2seed not defined
cvTree <- cv.tree(regTree)
# Plot CV deviance vs. tree size
plot(cvTree$size, cvTree$dev, type = "b",
xlab = "Tree Size", ylab = "Deviance",
main = "Cross-Validation for Tree Pruning")
# Manually choose tree size (e.g., 5) based on plot
prunedTree <- prune.tree(regTree, best = 5)
# Plot the pruned tree
plot(prunedTree)
text(prunedTree, pretty = 0, cex = 0.6)
# Predict on test set using pruned tree
prunedPreds <- predict(prunedTree, newdata = testDat)
# Compute test MSE for pruned tree
prunedTestMSE <- mean((prunedPreds - testDat$Sales)^2)
cat("Test MSE (Pruned Tree):", prunedTestMSE, "\n")
## Test MSE (Pruned Tree): 4.537613
The cross-validation plot shows the deviance decreases sharply as tree size increases up to around size 5. After size 5, the improvement in deviance becomes minimal or flat — this suggests that tree size 5 is near-optimal, balancing performance and complexity.
Pruned tree structure: the pruned tree (size = 5) uses only a few key variables, ShelveLoc (the first split) and Price (used twice). It creates simple, interpretable rules to predict Sales, with each leaf node showing a predicted sales value.
Test MSE comparison: the pruned tree's test MSE of 4.54 is slightly lower than the unpruned tree's 4.58 from part (b), so pruning slightly improved test performance while substantially simplifying the model.
In summary, pruning the regression tree to size 5 using cross-validation reduced model complexity, improved generalization slightly, and made the tree easier to interpret.
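Instead of reading the best size off the plot by eye, the size that minimizes the cross-validated deviance can be selected programmatically. A minimal sketch reusing cvTree and regTree from above (bestSize and bestPruned are new names, and the CV-optimal size may differ from the manually chosen 5):
# Tree size with the smallest cross-validated deviance
bestSize <- cvTree$size[which.min(cvTree$dev)]
bestSize
# Prune to that size and recompute the test MSE
bestPruned <- prune.tree(regTree, best = bestSize)
mean((predict(bestPruned, newdata = testDat) - testDat$Sales)^2)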
#d
library(randomForest)
# Bagging = random forest with mtry set to total number of predictors
set.seed(char2seed("Spencer"))
bagModel <- randomForest(Sales ~ ., data = trainDat,
mtry = ncol(trainDat) - 1,
importance = TRUE)
# Predict on test data
bagPreds <- predict(bagModel, newdata = testDat)
# Compute test MSE
bagTestMSE <- mean((bagPreds - testDat$Sales)^2)
cat("Test MSE (Bagging):", bagTestMSE, "\n")
## Test MSE (Bagging): 2.403783
# Variable importance
importance(bagModel)
## %IncMSE IncNodePurity
## CompPrice 22.80859631 177.227711
## Income 7.10411334 102.184054
## Advertising 18.27143123 120.841963
## Population 0.04754453 65.383896
## Price 54.45010395 418.830187
## ShelveLoc 55.83315397 569.026862
## Age 13.87805894 124.411267
## Education 0.90167375 48.636363
## Urban 1.44036924 9.203865
## US 0.97499504 6.532915
varImpPlot(bagModel)
The test MSE using bagging was 2.40, which is much lower than that of
the unpruned regression tree (4.58) and the pruned regression tree (4.54).
Bagging substantially improves predictive performance due to variance
reduction from averaging multiple trees.
Variable importance, from the importance() and varImpPlot() output:
- Top predictors by %IncMSE: ShelveLoc (55.8), Price (54.5), CompPrice (22.8), Advertising (18.3), and Age (13.9). These variables contribute most to accurate predictions; permuting any of them would substantially increase the test error.
- Top predictors by IncNodePurity: ShelveLoc (569) and Price (419), with Age, Advertising, and CompPrice also showing strong influence.
Bagging delivers a strong predictive model with a much lower test MSE than single trees. ShelveLoc and Price are by far the most important variables, followed by CompPrice and Advertising. The model benefits from averaging many trees, leading to greater stability and accuracy.
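Bagging also provides an out-of-bag (OOB) error estimate without a separate validation set, which is worth comparing against the held-out test MSE. A small sketch using bagModel and bagTestMSE from above (for a regression forest, the fitted object's mse component holds the OOB MSE after each tree):
# OOB MSE after the final tree vs. the held-out test MSE
oobMSE <- tail(bagModel$mse, 1)
cat("OOB MSE:", round(oobMSE, 3), " Test MSE:", round(bagTestMSE, 3), "\n")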
#e
library(randomForest)
# 1. Fit random forest with mtry = 4 (near the p/3 regression default; with p = 10 predictors the package default is 3)
set.seed(char2seed("Spencer")) # Or use set.seed(123) if char2seed not defined
rfModel <- randomForest(Sales ~ ., data = trainDat,
mtry = 4, importance = TRUE)
# 2. Predict on test data
rfPreds <- predict(rfModel, newdata = testDat)
# 3. Compute test MSE
rfTestMSE <- mean((rfPreds - testDat$Sales)^2)
cat("Test MSE (Random Forest):", rfTestMSE, "\n")
## Test MSE (Random Forest): 2.644819
# 4. Variable Importance
importance(rfModel)
## %IncMSE IncNodePurity
## CompPrice 13.5577756 165.13423
## Income 5.8914299 130.62355
## Advertising 12.7461607 143.38861
## Population 0.1550998 95.78001
## Price 34.9723978 379.93603
## ShelveLoc 44.8330249 432.67317
## Age 9.6736447 157.34831
## Education 0.3068673 77.06310
## Urban 0.8491092 15.06488
## US 2.8530557 15.31951
varImpPlot(rfModel)
# 5. Effect of mtry on test MSE
mVals <- 2:10
mseResults <- sapply(mVals, function(m) {
model <- randomForest(Sales ~ ., data = trainDat, mtry = m)
pred <- predict(model, newdata = testDat)
mean((pred - testDat$Sales)^2)
})
names(mseResults) <- paste("m =", mVals)
print(mseResults)
## m = 2 m = 3 m = 4 m = 5 m = 6 m = 7 m = 8 m = 9
## 3.301694 2.860756 2.694653 2.530898 2.547160 2.455293 2.382269 2.382343
## m = 10
## 2.335752
# Optional: Plot the effect of mtry
plot(mVals, mseResults, type = "b",
xlab = "mtry (Number of Variables at Each Split)",
ylab = "Test MSE",
main = "Effect of mtry on Test Error")
The random forest test MSE was 2.64, an improvement over both the
pruned and unpruned trees. The most important variables were again
ShelveLoc, Price, and CompPrice. As the second plot shows, increasing
mtry from 2 to 10 generally reduced the test error (with only a small
bump between mtry = 5 and 6), and the best performance here was at
mtry = 10, which uses all predictors and is therefore equivalent to
bagging. Conclusion: random forests improve accuracy, and on this data
set a larger mtry further reduces the error by allowing more variables
to be considered at each split.
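Note that the sweep above does not reset the seed before each fit, so the exact numbers will vary slightly from run to run. A small sketch that extracts the best mtry from mseResults and repeats the sweep with a fixed seed per fit (mseResultsSeeded is a new name introduced here):
# Best mtry from the sweep above
mVals[which.min(mseResults)]
# Reproducible version: reset the seed before each fit
mseResultsSeeded <- sapply(mVals, function(m) {
  set.seed(char2seed("Spencer"))
  model <- randomForest(Sales ~ ., data = trainDat, mtry = m)
  mean((predict(model, newdata = testDat) - testDat$Sales)^2)
})
round(mseResultsSeeded, 3)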
#f
library(BART)
## Loading required package: nlme
## Loading required package: survival
# Prepare training and test data (remove response from x)
xTrain <- trainDat[, -which(names(trainDat) == "Sales")]
yTrain <- trainDat$Sales
xTest <- testDat[, -which(names(testDat) == "Sales")]
yTest <- testDat$Sales
# Fit BART model
set.seed(char2seed("Spencer")) # Or set.seed(123)
bartModel <- wbart(x.train = xTrain, y.train = yTrain,
x.test = xTest, ndpost = 1000, nskip = 100)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: -1.832450, 2.047550
## x1,x[n*p]: 104.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 61 ... 1
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.284787,3.000000,0.223728
## *****sigma: 1.071706
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 1s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
# Get predictions
bartPreds <- apply(bartModel$yhat.test, 2, mean)
# Compute test MSE
bartTestMSE <- mean((yTest - bartPreds)^2)
cat("Test MSE (BART):", bartTestMSE, "\n")
## Test MSE (BART): 1.349031
Test MSE (BART): 1.35. This is the lowest test error among all models tried:
- Unpruned tree: 4.58
- Pruned tree: 4.54
- Bagging: 2.40
- Random forest: 2.64
- BART: 1.35
BART clearly outperforms the other models here, thanks to its Bayesian sum-of-trees framework, which flexibly captures nonlinearity and interactions. Its priors also act as built-in regularization, reducing overfitting without the need to tune parameters such as mtry or a pruning level. Among the models in this analysis, BART offers the best predictive performance.
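To make the comparison explicit, the test MSEs computed above can be gathered in one table. A small sketch using the objects already in the workspace (mseSummary is a new name):
# Collect the Carseats test MSEs from parts (b)-(f) into one table
mseSummary <- data.frame(
  Model = c("Unpruned tree", "Pruned tree (size 5)", "Bagging",
            "Random forest (mtry = 4)", "BART"),
  TestMSE = round(c(testMSE, prunedTestMSE, bagTestMSE, rfTestMSE, bartTestMSE), 3)
)
mseSummary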
library(ISLR2)
## Check class distribution in the response variable
summary(Caravan$Purchase)
## No Yes
## 5474 348
## Convert factor response to numeric (0 = No, 1 = Yes) for models like gbm()
Caravan$PurchaseN <- ifelse(Caravan$Purchase == "Yes", 1, 0)
## Create training set: first 1000 observations
trainDat <- Caravan[1:1000, ]
## Create test set: remaining observations
testDat <- Caravan[1001:nrow(Caravan), ]
Of the 5,822 customers, 5,474 did not purchase caravan insurance (No) and only 348 did (Yes), so the response is highly imbalanced.
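The base purchase rate is worth computing explicitly, since it is the natural benchmark for any classifier's precision later on. A minimal sketch using the full Caravan data and the training half just created:
# Overall purchase rate (348 / 5822, roughly 6%) and the rate in the training half
mean(Caravan$PurchaseN)
mean(trainDat$PurchaseN)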
#b
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Fit boosting model with 1000 trees, shrinkage (learning rate) = 0.01
set.seed(char2seed("Spencer")) # Or set.seed(123)
boostModel <- gbm(PurchaseN ~ . - Purchase, # Exclude original factor variable
data = trainDat,
distribution = "bernoulli", # For binary classification
n.trees = 1000,
shrinkage = 0.01,
interaction.depth = 1,
verbose = FALSE)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 71: AVRAAUT has no variation.
# View variable importance
summary(boostModel) # Shows relative influence of predictors
## var rel.inf
## PPERSAUT PPERSAUT 15.52185107
## MKOOPKLA MKOOPKLA 10.98708993
## MOPLHOOG MOPLHOOG 6.98671025
## MBERMIDD MBERMIDD 5.01709823
## PBRAND PBRAND 4.98919600
## ABRAND ABRAND 4.25874120
## MINK3045 MINK3045 4.12837892
## MGODGE MGODGE 3.87114474
## MOSTYPE MOSTYPE 2.92194790
## MSKC MSKC 2.64638425
## MAUT1 MAUT1 2.33743720
## PWAPART PWAPART 2.30135082
## MBERARBG MBERARBG 2.06465396
## MAUT2 MAUT2 2.03892771
## MSKA MSKA 1.99576358
## MGODPR MGODPR 1.97107080
## PBYSTAND PBYSTAND 1.96797877
## MSKB1 MSKB1 1.68941716
## MFWEKIND MFWEKIND 1.58304967
## MINKGEM MINKGEM 1.50449427
## MHKOOP MHKOOP 1.38294421
## MGODRK MGODRK 1.26527926
## MGODOV MGODOV 1.25630492
## APERSAUT APERSAUT 1.14016074
## MINK7512 MINK7512 1.13785164
## MBERHOOG MBERHOOG 1.06780561
## MRELGE MRELGE 1.04933963
## MINK4575 MINK4575 0.78293484
## MGEMLEEF MGEMLEEF 0.74702587
## MFGEKIND MFGEKIND 0.70634821
## MBERBOER MBERBOER 0.68840879
## MAUT0 MAUT0 0.68435527
## MBERARBO MBERARBO 0.65295421
## MINKM30 MINKM30 0.64271356
## MHHUUR MHHUUR 0.59717170
## MOSHOOFD MOSHOOFD 0.59016532
## MZPART MZPART 0.55270174
## MGEMOMV MGEMOMV 0.45881936
## MRELSA MRELSA 0.45705246
## MRELOV MRELOV 0.44396703
## PLEVEN PLEVEN 0.41187656
## MOPLMIDD MOPLMIDD 0.37220325
## MFALLEEN MFALLEEN 0.36066461
## MZFONDS MZFONDS 0.34613896
## MSKD MSKD 0.29941416
## MSKB2 MSKB2 0.29356023
## MINK123M MINK123M 0.28935279
## PMOTSCO PMOTSCO 0.28412878
## MOPLLAAG MOPLLAAG 0.14543912
## MBERZELF MBERZELF 0.05827500
## MAANTHUI MAANTHUI 0.05195574
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PVRAAUT PVRAAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG PGEZONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS PFIETS 0.00000000
## PINBOED PINBOED 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AVRAAUT AVRAAUT 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## ALEVEN ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS AFIETS 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
The variable importance plot confirms the steep drop-off after the top few predictors. PPERSAUT and MKOOPKLA are by far the most influential in predicting insurance purchases.
The boosting model identifies contributions to car policies (PPERSAUT), purchasing power class (MKOOPKLA), and higher education (MOPLHOOG) as the most influential predictors of purchase behavior. These variables are markers of socioeconomic status, which plausibly correlates with the likelihood of buying caravan insurance.
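The full relative-influence table is long; the top of it can be pulled out directly without re-drawing the plot. A small sketch using boostModel from above (relInf is a new name):
# Top 10 predictors by relative influence
relInf <- summary(boostModel, plotit = FALSE)
head(relInf, 10)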
#c
library(caret) # for confusionMatrix (optional but useful)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Step 1: Predict probabilities on the test set
boostProbs <- predict(boostModel, newdata = testDat, n.trees = 1000, type = "response")
# Step 2: Apply 20% threshold
boostPredClass <- ifelse(boostProbs > 0.2, 1, 0)
# Step 3: Actual values
actualClass <- testDat$PurchaseN
# Step 4: Confusion matrix
confMat <- table(Predicted = boostPredClass, Actual = actualClass)
print(confMat)
## Actual
## Predicted 0 1
## 0 4414 255
## 1 119 34
# Step 5: Calculate precision (fraction of predicted buyers who actually bought)
precision <- confMat["1", "1"] / sum(confMat["1", ])
cat("Precision (predicted buyers who actually bought):", round(precision, 3), "\n")
## Precision (predicted buyers who actually bought): 0.222
From the confusion matrix:
- True positives (TP) = 34 (predicted = 1, actual = 1)
- False positives (FP) = 119 (predicted = 1, actual = 0)
- Precision = TP / (TP + FP) = 34 / (34 + 119) ≈ 0.222
About 22.2% of the people predicted to buy actually made a purchase. That is a large improvement over the roughly 6% base purchase rate, which is what selecting customers at random would achieve in this imbalanced data set. Lowering the classification threshold to 20% catches more potential buyers than the usual 50% cutoff would. At this threshold, boosting offers a useful balance of sensitivity and precision, and it typically compares favorably with simpler classifiers such as logistic regression or KNN at the same cutoff (see the sketch below).
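As a point of comparison for that last claim, a logistic regression fit on the same training data and thresholded at 20% can be evaluated the same way. A hedged sketch (logitModel and the related names are introduced here; glm will likely warn about the two zero-variance predictors noted earlier and about fitted probabilities near 0 or 1, and its precision will generally differ from the boosting model's):
# Logistic regression baseline evaluated with the same 20% threshold (comparison sketch, not part of the original analysis)
logitModel <- glm(PurchaseN ~ . - Purchase, data = trainDat, family = binomial)
logitProbs <- predict(logitModel, newdata = testDat, type = "response")
logitPredClass <- ifelse(logitProbs > 0.2, 1, 0)
logitConf <- table(Predicted = logitPredClass, Actual = testDat$PurchaseN)
logitConf
# Precision: fraction of predicted buyers who actually bought
logitConf["1", "1"] / sum(logitConf["1", ])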