library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
# Optional: define char2seed() if required by your course setup
char2seed <- function(char) sum(utf8ToInt(char))
set.seed(char2seed("Spencer"))
# Train/test split
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
trainDat <- Boston[train, ]
testDat <- Boston[-train, ]
# Values of mtry
mList <- 3:12
# Run random forests for each mtry, store full model
ranFList <- lapply(mList, function(m) {
randomForest(x = trainDat[, -14], y = trainDat$medv,
xtest = testDat[, -14], ytest = testDat$medv,
ntree = 700, mtry = m)
})
# Extract test MSEs across all trees (700) for each mtry value
testErrVals <- do.call("c", lapply(ranFList, function(model) model$test$mse))
# Prepare data for ggplot
MSEDat <- data.frame(
TestErr = testErrVals,
mVal = factor(rep(mList, each = 700)),
nTree = rep(1:700, length(mList))
)
# Plot
ggplot(MSEDat, aes(x = nTree, y = TestErr, colour = mVal)) +
geom_line() +
scale_colour_discrete(name = "mtry Value") +
ylab("Test Mean Squared Error") +
xlab("Number of Trees") +
ggtitle("Test Error vs Trees for Random Forest (Boston Dataset)")
This plot shows how the test mean squared error (MSE) changes with the
number of trees (ntree) in a random forest, for different values of the
mtry parameter (the number of predictors randomly considered at each
split). Intermediate mtry values, roughly 6 to 8, perform best,
balancing bias and variance. The test MSE flattens out after about 200
trees, so additional trees offer only marginal gains. The extremes are
less effective: very small mtry values restrict each split too much,
while the largest values approach bagging and give up much of the
decorrelation benefit of random forests.
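The flattening after about 200 trees and the ranking of mtry values can also be read off numerically rather than from the plot. A minimal sketch, reusing ranFList and mList from above (finalMSE is a new name introduced here):
# Test MSE at the final tree (ntree = 700) for each mtry value
finalMSE <- sapply(ranFList, function(model) tail(model$test$mse, 1))
names(finalMSE) <- paste("mtry =", mList)
round(finalMSE, 3)
# mtry value with the lowest final test MSE
mList[which.min(finalMSE)]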
# Load necessary libraries
library(ISLR2) # for Carseats dataset
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
char2seed <- function(char) sum(utf8ToInt(char)) # re-defined so this chunk runs standalone
set.seed(char2seed("Spencer")) # for reproducibility
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
trainDat <- Carseats[train,]
testDat <- Carseats[-train,]
cat("Training set size:", nrow(trainDat), "\n")
## Training set size: 200
cat("Test set size:", nrow(testDat), "\n")
## Test set size: 200
We split the Carseats dataset into equal training and test sets (200 each) using a fixed seed for reproducibility. This allows us to train models on one half and evaluate performance on unseen data using the other.
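As a quick sanity check on the split (a minimal sketch using only base R and the objects just created), the response distribution should look broadly similar in the two halves:
# Sales should be distributed similarly in the training and test halves
summary(trainDat$Sales)
summary(testDat$Sales)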
library(tree)
library(ISLR2) # for Carseats
# Fit regression tree to training data
regTree <- tree(Sales ~ ., data = trainDat)
# Plot the tree
plot(regTree)
text(regTree, pretty = 0)
# Summary of the tree
summary(regTree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = trainDat)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "Age" "Advertising"
## [6] "Education" "CompPrice"
## Number of terminal nodes: 17
## Residual mean deviance: 2.433 = 445.3 / 183
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.43900 -1.14900 0.08696 0.00000 1.06400 3.85100
# Predict on test set
preds <- predict(regTree, newdata = testDat)
# Compute test MSE
testMSE <- mean((preds - testDat$Sales)^2)
cat("Test MSE:", testMSE, "\n")
## Test MSE: 4.584845
- Main splitting variable: ShelveLoc (whether the shelf location is Bad, Medium, or Good). This is the first split, meaning it is the most important factor in predicting Sales.
- Number of terminal nodes: 17. The model partitions the data into 17 distinct groups, each predicting a constant Sales value (shown at the bottom of each leaf).
- Residual mean deviance (training error): 2.433. This is the residual sum of squares divided by the residual degrees of freedom (445.3 / 183, where 183 = 200 training observations minus 17 terminal nodes).
- Test mean squared error (MSE): 4.58. This is the average squared error when predicting Sales on the test set, the key indicator of how well the model generalizes to new data.
The regression tree shows that store layout (ShelveLoc) and pricing (Price, CompPrice) are major drivers of Sales. Age of customers, income, and advertising also contribute but are used at deeper levels of the tree. A test MSE of ~4.58 suggests moderate predictive accuracy. While interpretable, a regression tree like this may benefit from pruning or switching to ensemble methods (like bagging or random forest) for better generalization.
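Because the test MSE is on the squared scale of Sales (recorded in thousands of units), the root MSE is often easier to interpret. A small sketch using testMSE from above; with a test MSE of about 4.58 the RMSE is roughly 2.14, i.e. predictions are typically off by about 2,140 units:
# RMSE puts the error back on the original scale of Sales (thousands of units)
testRMSE <- sqrt(testMSE)
cat("Test RMSE:", round(testRMSE, 3), "\n")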
#c
# Cross-validation to find optimal tree size
set.seed(char2seed("Spencer")) # or set.seed(123) if char2seed not defined
cvTree <- cv.tree(regTree)
# Plot CV deviance vs. tree size
plot(cvTree$size, cvTree$dev, type = "b",
xlab = "Tree Size", ylab = "Deviance",
main = "Cross-Validation for Tree Pruning")
# Manually choose tree size (e.g., 5) based on plot
prunedTree <- prune.tree(regTree, best = 5)
# Plot the pruned tree
plot(prunedTree)
text(prunedTree, pretty = 0, cex = 0.6)
# Predict on test set using pruned tree
prunedPreds <- predict(prunedTree, newdata = testDat)
# Compute test MSE for pruned tree
prunedTestMSE <- mean((prunedPreds - testDat$Sales)^2)
cat("Test MSE (Pruned Tree):", prunedTestMSE, "\n")
## Test MSE (Pruned Tree): 4.537613
The cross-validation plot shows the deviance decreases sharply as tree size increases up to around size 5. After size 5, the improvement in deviance becomes minimal or flat — this suggests that tree size 5 is near-optimal, balancing performance and complexity.
Pruned tree structure: the pruned tree (size = 5) uses only a few key variables, ShelveLoc (the first split) and Price (used twice). It creates simple, interpretable rules to predict Sales, with each leaf node showing a predicted sales value.
Test MSE comparison: the pruned tree's test MSE of 4.54 is slightly lower than the unpruned tree's 4.58 from part (b), so pruning slightly improved test performance while substantially simplifying the model.
In summary, pruning the regression tree to size 5 using cross-validation reduced model complexity, improved generalization slightly, and made the tree easier to interpret.
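Instead of reading the best size off the plot by eye, the size that minimizes the cross-validated deviance can be selected programmatically. A minimal sketch reusing cvTree and regTree from above (bestSize and bestPruned are new names, and the CV-optimal size may differ from the manually chosen 5):
# Tree size with the smallest cross-validated deviance
bestSize <- cvTree$size[which.min(cvTree$dev)]
bestSize
# Prune to that size and recompute the test MSE
bestPruned <- prune.tree(regTree, best = bestSize)
mean((predict(bestPruned, newdata = testDat) - testDat$Sales)^2)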
#d
library(randomForest)
# Bagging = random forest with mtry set to total number of predictors
set.seed(char2seed("Spencer"))
bagModel <- randomForest(Sales ~ ., data = trainDat,
mtry = ncol(trainDat) - 1,
importance = TRUE)
# Predict on test data
bagPreds <- predict(bagModel, newdata = testDat)
# Compute test MSE
bagTestMSE <- mean((bagPreds - testDat$Sales)^2)
cat("Test MSE (Bagging):", bagTestMSE, "\n")
## Test MSE (Bagging): 2.403783
# Variable importance
importance(bagModel)
## %IncMSE IncNodePurity
## CompPrice 22.80859631 177.227711
## Income 7.10411334 102.184054
## Advertising 18.27143123 120.841963
## Population 0.04754453 65.383896
## Price 54.45010395 418.830187
## ShelveLoc 55.83315397 569.026862
## Age 13.87805894 124.411267
## Education 0.90167375 48.636363
## Urban 1.44036924 9.203865
## US 0.97499504 6.532915
varImpPlot(bagModel)
The test MSE using bagging was 2.40, which is much lower than that of
the unpruned regression tree (4.58) and the pruned regression tree (4.54).
Bagging substantially improves predictive performance due to variance
reduction from averaging multiple trees.
Variable importance, from the importance() and varImpPlot() output:
- Top predictors by %IncMSE: ShelveLoc (55.8), Price (54.5), CompPrice (22.8), Advertising (18.3), and Age (13.9). These variables contribute most to accurate predictions; permuting any of them would substantially increase the test error.
- Top predictors by IncNodePurity: ShelveLoc (569) and Price (419), with Age, Advertising, and CompPrice also showing strong influence.
Bagging delivers a strong predictive model with a much lower test MSE than single trees. ShelveLoc and Price are by far the most important variables, followed by CompPrice and Advertising. The model benefits from averaging many trees, leading to greater stability and accuracy.
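Bagging also provides an out-of-bag (OOB) error estimate without a separate validation set, which is worth comparing against the held-out test MSE. A small sketch using bagModel and bagTestMSE from above (for a regression forest, the fitted object's mse component holds the OOB MSE after each tree):
# OOB MSE after the final tree vs. the held-out test MSE
oobMSE <- tail(bagModel$mse, 1)
cat("OOB MSE:", round(oobMSE, 3), " Test MSE:", round(bagTestMSE, 3), "\n")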
#e
library(randomForest)
# 1. Fit random forest with mtry = 4 (near the p/3 regression default; with p = 10 predictors the package default is 3)
set.seed(char2seed("Spencer")) # Or use set.seed(123) if char2seed not defined
rfModel <- randomForest(Sales ~ ., data = trainDat,
mtry = 4, importance = TRUE)
# 2. Predict on test data
rfPreds <- predict(rfModel, newdata = testDat)
# 3. Compute test MSE
rfTestMSE <- mean((rfPreds - testDat$Sales)^2)
cat("Test MSE (Random Forest):", rfTestMSE, "\n")
## Test MSE (Random Forest): 2.644819
# 4. Variable Importance
importance(rfModel)
## %IncMSE IncNodePurity
## CompPrice 13.5577756 165.13423
## Income 5.8914299 130.62355
## Advertising 12.7461607 143.38861
## Population 0.1550998 95.78001
## Price 34.9723978 379.93603
## ShelveLoc 44.8330249 432.67317
## Age 9.6736447 157.34831
## Education 0.3068673 77.06310
## Urban 0.8491092 15.06488
## US 2.8530557 15.31951
varImpPlot(rfModel)
# 5. Effect of mtry on test MSE
mVals <- 2:10
mseResults <- sapply(mVals, function(m) {
model <- randomForest(Sales ~ ., data = trainDat, mtry = m)
pred <- predict(model, newdata = testDat)
mean((pred - testDat$Sales)^2)
})
names(mseResults) <- paste("m =", mVals)
print(mseResults)
## m = 2 m = 3 m = 4 m = 5 m = 6 m = 7 m = 8 m = 9
## 3.301694 2.860756 2.694653 2.530898 2.547160 2.455293 2.382269 2.382343
## m = 10
## 2.335752
# Optional: Plot the effect of mtry
plot(mVals, mseResults, type = "b",
xlab = "mtry (Number of Variables at Each Split)",
ylab = "Test MSE",
main = "Effect of mtry on Test Error")
The random forest test MSE was 2.64, an improvement over both the
pruned and unpruned trees. The most important variables were again
ShelveLoc, Price, and CompPrice. As the second plot shows, increasing
mtry from 2 to 10 generally reduced the test error (with only a small
bump between mtry = 5 and 6), and the best performance here was at
mtry = 10, which uses all predictors and is therefore equivalent to
bagging. Conclusion: random forests improve accuracy, and on this data
set a larger mtry further reduces the error by allowing more variables
to be considered at each split.
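Note that the sweep above does not reset the seed before each fit, so the exact numbers will vary slightly from run to run. A small sketch that extracts the best mtry from mseResults and repeats the sweep with a fixed seed per fit (mseResultsSeeded is a new name introduced here):
# Best mtry from the sweep above
mVals[which.min(mseResults)]
# Reproducible version: reset the seed before each fit
mseResultsSeeded <- sapply(mVals, function(m) {
  set.seed(char2seed("Spencer"))
  model <- randomForest(Sales ~ ., data = trainDat, mtry = m)
  mean((predict(model, newdata = testDat) - testDat$Sales)^2)
})
round(mseResultsSeeded, 3)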
#f
library(BART)
## Loading required package: nlme
## Loading required package: survival
# Prepare training and test data (remove response from x)
xTrain <- trainDat[, -which(names(trainDat) == "Sales")]
yTrain <- trainDat$Sales
xTest <- testDat[, -which(names(testDat) == "Sales")]
yTest <- testDat$Sales
# Fit BART model
set.seed(char2seed("Spencer")) # Or set.seed(123)
bartModel <- wbart(x.train = xTrain, y.train = yTrain,
x.test = xTest, ndpost = 1000, nskip = 100)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: -1.832450, 2.047550
## x1,x[n*p]: 104.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 61 ... 1
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.284787,3.000000,0.223728
## *****sigma: 1.071706
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 1s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
# Get predictions
bartPreds <- apply(bartModel$yhat.test, 2, mean)
# Compute test MSE
bartTestMSE <- mean((yTest - bartPreds)^2)
cat("Test MSE (BART):", bartTestMSE, "\n")
## Test MSE (BART): 1.349031
Test MSE (BART): 1.35. This is the lowest test error among all models tried:
- Unpruned tree: 4.58
- Pruned tree: 4.54
- Bagging: 2.40
- Random forest: 2.64
- BART: 1.35
BART clearly outperforms the other models here, thanks to its Bayesian sum-of-trees framework, which flexibly captures nonlinearity and interactions. Its priors also act as built-in regularization, reducing overfitting without the need to tune parameters such as mtry or a pruning level. Among the models in this analysis, BART offers the best predictive performance.
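To make the comparison explicit, the test MSEs computed above can be gathered in one table. A small sketch using the objects already in the workspace (mseSummary is a new name):
# Collect the Carseats test MSEs from parts (b)-(f) into one table
mseSummary <- data.frame(
  Model = c("Unpruned tree", "Pruned tree (size 5)", "Bagging",
            "Random forest (mtry = 4)", "BART"),
  TestMSE = round(c(testMSE, prunedTestMSE, bagTestMSE, rfTestMSE, bartTestMSE), 3)
)
mseSummary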
library(ISLR2)
## Check class distribution in the response variable
summary(Caravan$Purchase)
## No Yes
## 5474 348
## Convert factor response to numeric (0 = No, 1 = Yes) for models like gbm()
Caravan$PurchaseN <- ifelse(Caravan$Purchase == "Yes", 1, 0)
## Create training set: first 1000 observations
trainDat <- Caravan[1:1000, ]
## Create test set: remaining observations
testDat <- Caravan[1001:nrow(Caravan), ]
Of the 5,822 customers, 5,474 did not purchase caravan insurance (No) and only 348 did (Yes), so the response is highly imbalanced.
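The base purchase rate is worth computing explicitly, since it is the natural benchmark for any classifier's precision later on. A minimal sketch using the full Caravan data and the training half just created:
# Overall purchase rate (348 / 5822, roughly 6%) and the rate in the training half
mean(Caravan$PurchaseN)
mean(trainDat$PurchaseN)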
#b
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Fit boosting model with 1000 trees, shrinkage (learning rate) = 0.01
set.seed(char2seed("Spencer")) # Or set.seed(123)
boostModel <- gbm(PurchaseN ~ . - Purchase, # Exclude original factor variable
data = trainDat,
distribution = "bernoulli", # For binary classification
n.trees = 1000,
shrinkage = 0.01,
interaction.depth = 1,
verbose = FALSE)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 71: AVRAAUT has no variation.
# View variable importance
summary(boostModel) # Shows relative influence of predictors
## var rel.inf
## PPERSAUT PPERSAUT 15.52185107
## MKOOPKLA MKOOPKLA 10.98708993
## MOPLHOOG MOPLHOOG 6.98671025
## MBERMIDD MBERMIDD 5.01709823
## PBRAND PBRAND 4.98919600
## ABRAND ABRAND 4.25874120
## MINK3045 MINK3045 4.12837892
## MGODGE MGODGE 3.87114474
## MOSTYPE MOSTYPE 2.92194790
## MSKC MSKC 2.64638425
## MAUT1 MAUT1 2.33743720
## PWAPART PWAPART 2.30135082
## MBERARBG MBERARBG 2.06465396
## MAUT2 MAUT2 2.03892771
## MSKA MSKA 1.99576358
## MGODPR MGODPR 1.97107080
## PBYSTAND PBYSTAND 1.96797877
## MSKB1 MSKB1 1.68941716
## MFWEKIND MFWEKIND 1.58304967
## MINKGEM MINKGEM 1.50449427
## MHKOOP MHKOOP 1.38294421
## MGODRK MGODRK 1.26527926
## MGODOV MGODOV 1.25630492
## APERSAUT APERSAUT 1.14016074
## MINK7512 MINK7512 1.13785164
## MBERHOOG MBERHOOG 1.06780561
## MRELGE MRELGE 1.04933963
## MINK4575 MINK4575 0.78293484
## MGEMLEEF MGEMLEEF 0.74702587
## MFGEKIND MFGEKIND 0.70634821
## MBERBOER MBERBOER 0.68840879
## MAUT0 MAUT0 0.68435527
## MBERARBO MBERARBO 0.65295421
## MINKM30 MINKM30 0.64271356
## MHHUUR MHHUUR 0.59717170
## MOSHOOFD MOSHOOFD 0.59016532
## MZPART MZPART 0.55270174
## MGEMOMV MGEMOMV 0.45881936
## MRELSA MRELSA 0.45705246
## MRELOV MRELOV 0.44396703
## PLEVEN PLEVEN 0.41187656
## MOPLMIDD MOPLMIDD 0.37220325
## MFALLEEN MFALLEEN 0.36066461
## MZFONDS MZFONDS 0.34613896
## MSKD MSKD 0.29941416
## MSKB2 MSKB2 0.29356023
## MINK123M MINK123M 0.28935279
## PMOTSCO PMOTSCO 0.28412878
## MOPLLAAG MOPLLAAG 0.14543912
## MBERZELF MBERZELF 0.05827500
## MAANTHUI MAANTHUI 0.05195574
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PVRAAUT PVRAAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PTRACTOR PTRACTOR 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PGEZONG PGEZONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## PPLEZIER PPLEZIER 0.00000000
## PFIETS PFIETS 0.00000000
## PINBOED PINBOED 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AVRAAUT AVRAAUT 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## ALEVEN ALEVEN 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AFIETS AFIETS 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
The variable importance plot confirms the steep drop-off after the top few predictors. PPERSAUT and MKOOPKLA are by far the most influential in predicting insurance purchases.
The boosting model identifies contributions to car policies (PPERSAUT), purchasing power class (MKOOPKLA), and higher education (MOPLHOOG) as the most influential predictors of purchase behavior. These variables are markers of socioeconomic status, which plausibly correlates with the likelihood of buying caravan insurance.
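The full relative-influence table is long; the top of it can be pulled out directly without re-drawing the plot. A small sketch using boostModel from above (relInf is a new name):
# Top 10 predictors by relative influence
relInf <- summary(boostModel, plotit = FALSE)
head(relInf, 10)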
#c
library(caret) # for confusionMatrix (optional but useful)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Step 1: Predict probabilities on the test set
boostProbs <- predict(boostModel, newdata = testDat, n.trees = 1000, type = "response")
# Step 2: Apply 20% threshold
boostPredClass <- ifelse(boostProbs > 0.2, 1, 0)
# Step 3: Actual values
actualClass <- testDat$PurchaseN
# Step 4: Confusion matrix
confMat <- table(Predicted = boostPredClass, Actual = actualClass)
print(confMat)
## Actual
## Predicted 0 1
## 0 4414 255
## 1 119 34
# Step 5: Calculate precision (fraction of predicted buyers who actually bought)
precision <- confMat["1", "1"] / sum(confMat["1", ])
cat("Precision (predicted buyers who actually bought):", round(precision, 3), "\n")
## Precision (predicted buyers who actually bought): 0.222
From the confusion matrix:
- True positives (TP) = 34 (predicted = 1, actual = 1)
- False positives (FP) = 119 (predicted = 1, actual = 0)
- Precision = TP / (TP + FP) = 34 / (34 + 119) ≈ 0.222
About 22.2% of the people predicted to buy actually made a purchase. That is a large improvement over the roughly 6% base purchase rate, which is what selecting customers at random would achieve in this imbalanced data set. Lowering the classification threshold to 20% catches more potential buyers than the usual 50% cutoff would. At this threshold, boosting offers a useful balance of sensitivity and precision, and it typically compares favorably with simpler classifiers such as logistic regression or KNN at the same cutoff (see the sketch below).
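As a point of comparison for that last claim, a logistic regression fit on the same training data and thresholded at 20% can be evaluated the same way. A hedged sketch (logitModel and the related names are introduced here; glm will likely warn about the two zero-variance predictors noted earlier and about fitted probabilities near 0 or 1, and its precision will generally differ from the boosting model's):
# Logistic regression baseline evaluated with the same 20% threshold (comparison sketch, not part of the original analysis)
logitModel <- glm(PurchaseN ~ . - Purchase, data = trainDat, family = binomial)
logitProbs <- predict(logitModel, newdata = testDat, type = "response")
logitPredClass <- ifelse(logitProbs > 0.2, 1, 0)
logitConf <- table(Predicted = logitPredClass, Actual = testDat$PurchaseN)
logitConf
# Precision: fraction of predicted buyers who actually bought
logitConf["1", "1"] / sum(logitConf["1", ])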