7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

library(MASS)           # Boston data; medv (the response) is column 14
library(randomForest)

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]   # predictors only
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]         # response: medv
Y.test <- Boston[-train, 14]

p <- ncol(Boston) - 1    # number of predictors
m1 <- p                  # all predictors at each split (bagging)
m2 <- floor(p / 2)
m3 <- floor(sqrt(p))

rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m3, ntree = 500)

mse1 <- rf.boston1$test$mse
mse2 <- rf.boston2$test$mse
mse3 <- rf.boston3$test$mse

ymin <- min(c(mse1, mse2, mse3))
ymax <- max(c(mse1, mse2, mse3))

plot(1:500, mse1, col = "green", type = "l", lwd = 2,
     xlab = "Number of Trees", ylab = "Test MSE", ylim = c(ymin, ymax))
lines(1:500, mse2, col = "red", lwd = 2)
lines(1:500, mse3, col = "blue", lwd = 2)
legend("topright", legend = c("m = p", "m = p/2", "m = sqrt(p)"),
       col = c("green", "red", "blue"), lty = 1, lwd = 2)

The plot displays test Mean Squared Error (MSE) as a function of the number of trees (ntree) in a random forest, for three values of mtry:

Green line: mtry = p

Red line: mtry = p/2

Blue line: mtry = sqrt(p)

Key Observations:

All three curves decrease rapidly as the first trees are added and stabilize after roughly 150 trees: This is expected, since averaging over more trees reduces the variance of the predictions, with the largest gains coming early on.

Blue line (m = sqrt(p)) yields the lowest test MSE across most of the ntree range: This confirms that random forests benefit from considering only a small subset of predictors at each split. A lower mtry decorrelates the trees, which improves the performance of the ensemble (a fuller sweep over mtry is sketched after these observations).

Red line (m = p/2) performs better than green (m = p), but worse than blue: A moderate mtry balances variance reduction and predictive power of individual trees.

Green line (m = p) has the highest test MSE: This is equivalent to bagging, where all predictors are considered at each split. Bagging produces more correlated trees, reducing the benefit of ensembling.
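
The question asks for a more comprehensive range of mtry and ntree values. As a sketch of how the three-curve plot above could be extended, the loop below (assuming the same train/test objects and the randomForest package already loaded) sweeps every mtry value from 1 to p and stores the full test-MSE curve for each:

# Sketch: test-MSE curve for every mtry value from 1 to p, each grown to 500 trees.
# Assumes Boston.train, Boston.test, Y.train, Y.test, and p are defined as above.
mse.all <- matrix(NA, nrow = 500, ncol = p)
for (m in 1:p) {
  rf.fit <- randomForest(Boston.train, y = Y.train,
                         xtest = Boston.test, ytest = Y.test,
                         mtry = m, ntree = 500)
  mse.all[, m] <- rf.fit$test$mse   # test MSE after 1, 2, ..., 500 trees
}
matplot(1:500, mse.all, type = "l", lty = 1, col = 1:p,
        xlab = "Number of Trees", ylab = "Test MSE")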

11. This question uses the Caravan data set.

For number 11, instead of only using 0.01 as the shrinkage in (11), use 0.01, 0.05, 0.1, and 0.2, and choose the best λ value by estimating the test error. Do not fit the logistic regression model. In order to have a trustworthy model with this many covariates, you would have to pursue a very sophisticated modeling process, which is beyond the scope of this class. I do not want you to report a model which does not converge, so only consider boosting and KNN here. Please do consider different values of k, the number of neighbors used in smoothing the KNN algorithm.

library(ISLR2)
library(gbm)
library(class)

a). Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

standardized.X <- scale(Caravan[, -86])   # standardize all 85 predictors (column 86 is Purchase)
Purchase <- Caravan$Purchase

test <- 1:1000                            # the first 1,000 observations are held out as the test set
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]

b). Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

# gbm with distribution = "bernoulli" requires a numeric 0/1 response
Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")

set.seed(1)
boost.caravan <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                     distribution = "bernoulli",
                     n.trees = 1000,
                     shrinkage = 0.01)

summary(boost.caravan)

##               var     rel.inf
## PPERSAUT PPERSAUT 25.77462891
## PPLEZIER PPLEZIER 12.87452732
## PBRAND     PBRAND 10.19292748
## MOPLLAAG MOPLLAAG  4.83547107
## MINKGEM   MINKGEM  4.32190380
## ALEVEN     ALEVEN  4.06709588
## PBYSTAND PBYSTAND  3.98749823
## APERSAUT APERSAUT  3.19156662
## MBERMIDD MBERMIDD  2.30144421
## MKOOPKLA MKOOPKLA  2.19115382
## MAUT1       MAUT1  1.88051502
## MBERHOOG MBERHOOG  1.82774814
## MOSHOOFD MOSHOOFD  1.79902350
## PWAPART   PWAPART  1.78092009
## PFIETS     PFIETS  1.37983185
## MBERARBG MBERARBG  1.36486472
## MINKM30   MINKM30  1.35865381
## MINK7512 MINK7512  1.34146746
## MGODOV     MGODOV  1.30363305
## MOSTYPE   MOSTYPE  1.16144252
## MOPLMIDD MOPLMIDD  1.15188377
## MRELGE     MRELGE  0.88320737
## MSKC         MSKC  0.87297327
## PGEZONG   PGEZONG  0.72327251
## MOPLHOOG MOPLHOOG  0.71047893
## AFIETS     AFIETS  0.69948203
## MHKOOP     MHKOOP  0.59634528
## MINK3045 MINK3045  0.58368256
## MGODPR     MGODPR  0.53409514
## MGODGE     MGODGE  0.46356510
## PLEVEN     PLEVEN  0.42662816
## MGEMLEEF MGEMLEEF  0.34057976
## MSKA         MSKA  0.30338080
## MGODRK     MGODRK  0.28577801
## MHHUUR     MHHUUR  0.24976259
## MINK4575 MINK4575  0.22508839
## MRELSA     MRELSA  0.21744841
## MAUT0       MAUT0  0.20791079
## MBERZELF MBERZELF  0.20197341
## MZPART     MZPART  0.18163500
## MINK123M MINK123M  0.15035536
## MAUT2       MAUT2  0.15035164
## MFGEKIND MFGEKIND  0.13073541
## MSKB1       MSKB1  0.13000199
## MBERBOER MBERBOER  0.12139096
## MRELOV     MRELOV  0.11669982
## PINBOED   PINBOED  0.10839605
## MSKD         MSKD  0.08046866
## PTRACTOR PTRACTOR  0.07780436
## MBERARBO MBERARBO  0.05569114
## MZFONDS   MZFONDS  0.04920348
## MFWEKIND MFWEKIND  0.03341234
## MAANTHUI MAANTHUI  0.00000000
## MGEMOMV   MGEMOMV  0.00000000
## MFALLEEN MFALLEEN  0.00000000
## MSKB2       MSKB2  0.00000000
## PWABEDR   PWABEDR  0.00000000
## PWALAND   PWALAND  0.00000000
## PBESAUT   PBESAUT  0.00000000
## PMOTSCO   PMOTSCO  0.00000000
## PVRAAUT   PVRAAUT  0.00000000
## PAANHANG PAANHANG  0.00000000
## PWERKT     PWERKT  0.00000000
## PBROM       PBROM  0.00000000
## PPERSONG PPERSONG  0.00000000
## PWAOREG   PWAOREG  0.00000000
## PZEILPL   PZEILPL  0.00000000
## AWAPART   AWAPART  0.00000000
## AWABEDR   AWABEDR  0.00000000
## AWALAND   AWALAND  0.00000000
## ABESAUT   ABESAUT  0.00000000
## AMOTSCO   AMOTSCO  0.00000000
## AVRAAUT   AVRAAUT  0.00000000
## AAANHANG AAANHANG  0.00000000
## ATRACTOR ATRACTOR  0.00000000
## AWERKT     AWERKT  0.00000000
## ABROM       ABROM  0.00000000
## APERSONG APERSONG  0.00000000
## AGEZONG   AGEZONG  0.00000000
## AWAOREG   AWAOREG  0.00000000
## ABRAND     ABRAND  0.00000000
## AZEILPL   AZEILPL  0.00000000
## APLEZIER APLEZIER  0.00000000
## AINBOED   AINBOED  0.00000000
## ABYSTAND ABYSTAND  0.00000000

Interpretation:

PPERSAUT (contribution to car insurance policies) has by far the strongest influence on purchase behavior, suggesting that customers who already spend on car insurance are the most likely to buy a caravan policy.

PPLEZIER (contribution to boat policies) and PBRAND (contribution to fire policies) also play significant roles.

Many of the top predictors are existing policy contributions or socio-economic indicators, which aligns with expectations for a consumer insurance product like caravan insurance.
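
To visualize just the leading predictors instead of the full 85-variable table, a minimal sketch that reuses the rel.inf values returned by summary() on the fitted boost.caravan object (no new model is fit):

# Sketch: bar plot of the ten most influential predictors from the fitted boosting model.
imp <- summary(boost.caravan, plotit = FALSE)   # data frame with columns var and rel.inf
top10 <- head(imp[order(imp$rel.inf, decreasing = TRUE), ], 10)
barplot(rev(top10$rel.inf), names.arg = rev(as.character(top10$var)),
        horiz = TRUE, las = 1, xlab = "Relative influence")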

c). Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?

boost.probs <- predict(boost.caravan, newdata = Caravan[test, ], n.trees = 1000, type = "response")

boost.pred <- ifelse(boost.probs > 0.2, "Yes", "No")

boost.table <- table(boost.pred, test.Y)
print(boost.table)
##           test.Y
## boost.pred  No Yes
##        No  917  46
##        Yes  24  13
success.rate <- boost.table["Yes", "Yes"] / sum(boost.table["Yes", ])
print(paste("Success rate (boosting, >20% cutoff):", round(success.rate, 3)))
## [1] "Success rate (boosting, >20% cutoff): 0.351"

Using the boosting model with a 0.2 cutoff, the test set confusion matrix shows that 13 of the 37 predicted buyers actually made a purchase. This yields a success rate of 35.1% among predicted purchasers, a substantial improvement over the baseline purchase rate of roughly 6% (checked below). Boosting is thus effective at identifying a smaller, more targeted group of likely purchasers.
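
For reference, the baseline rate can be confirmed directly from the test labels already defined above:

# Baseline: fraction of test-set customers who actually make a purchase (roughly 6%)
mean(test.Y == "Yes")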

set.seed(1)
knn.pred1 <- knn(train.X, test.X, train.Y, k = 1)
knn.pred3 <- knn(train.X, test.X, train.Y, k = 3)
knn.pred5 <- knn(train.X, test.X, train.Y, k = 5)

tab1 <- table(knn.pred1, test.Y)
tab3 <- table(knn.pred3, test.Y)
tab5 <- table(knn.pred5, test.Y)

prec1 <- tab1["Yes", "Yes"] / sum(tab1["Yes", ])
prec3 <- tab3["Yes", "Yes"] / sum(tab3["Yes", ])
prec5 <- tab5["Yes", "Yes"] / sum(tab5["Yes", ])

cat("KNN Success Rate (K = 1):", round(prec1, 3), "\n")
## KNN Success Rate (K = 1): 0.117
cat("KNN Success Rate (K = 3):", round(prec3, 3), "\n")
## KNN Success Rate (K = 3): 0.192
cat("KNN Success Rate (K = 5):", round(prec5, 3), "\n")
## KNN Success Rate (K = 5): 0.267

The boosting model with 1,000 trees and shrinkage λ = 0.01 predicted that 37 people would make a purchase, of whom 13 actually did, yielding a success rate of 35.1%. This is considerably higher than any of the KNN models: K = 1 yielded 11.7%, K = 3 yielded 19.2%, and K = 5 reached 26.7%. Thus, boosting provided the most precise targeting of potential buyers; a broader sweep over k is sketched below.
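
To consider more values of k, as the assignment asks, a sketch (assuming the standardized train.X, test.X, train.Y, and test.Y objects above) that computes the same success rate over a wider range of k:

# Sketch: success rate among predicted purchasers for a range of k values.
ks <- c(1, 3, 5, 7, 9, 11, 15)
knn.success <- sapply(ks, function(k) {
  set.seed(1)
  pred <- knn(train.X, test.X, train.Y, k = k)
  n.yes <- sum(pred == "Yes")
  if (n.yes == 0) return(NA)   # no purchases predicted at this k
  sum(pred == "Yes" & test.Y == "Yes") / n.yes
})
data.frame(k = ks, Success_Rate = round(knn.success, 3))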

Using multiple shrinkage values. The first loop below compares the overall test classification error across λ; because only about 6% of customers purchase, we then also compare the success rate among predicted purchasers, which is the more informative metric for this imbalanced problem.

shrinkages <- c(0.01, 0.05, 0.1, 0.2)
test.errors <- numeric(length(shrinkages))

for (i in 1:length(shrinkages)) {
  set.seed(1)
  boost.temp <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                    distribution = "bernoulli",
                    n.trees = 1000, shrinkage = shrinkages[i])
  pred.probs <- predict(boost.temp, newdata = Caravan[test, ], n.trees = 1000, type = "response")
  pred.class <- ifelse(pred.probs > 0.2, "Yes", "No")
  test.errors[i] <- mean(pred.class != test.Y)
}

plot(shrinkages, test.errors, type = "b", col = "blue", pch = 19,
     xlab = "Shrinkage (λ)", ylab = "Test Classification Error Rate",
     main = "Boosting Error Rate vs. Shrinkage")
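
The numbers behind the plot can be listed alongside each λ using the test.errors vector just computed:

# Test classification error rate for each shrinkage value
data.frame(Lambda = shrinkages, Test_Error = round(test.errors, 4))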

Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")

test <- 1:1000
train.data <- Caravan.boost[-test, ]
test.data <- Caravan.boost[test, ]
true.labels <- Caravan$Purchase[test]

shrinkages <- c(0.01, 0.05, 0.1, 0.2)

results <- data.frame(
  Lambda = shrinkages,
  Predicted_Yes = integer(length(shrinkages)),
  Actual_Yes = integer(length(shrinkages)),
  Success_Rate = numeric(length(shrinkages))
)

for (i in seq_along(shrinkages)) {
  set.seed(1)
  model <- gbm(Purchase ~ ., data = train.data,
               distribution = "bernoulli",
               n.trees = 1000,
               shrinkage = shrinkages[i],
               verbose = FALSE)
  
  probs <- predict(model, newdata = test.data, n.trees = 1000, type = "response")
  pred <- ifelse(probs > 0.2, "Yes", "No")
  
  cm <- table(pred, true.labels)
  
  predicted_yes <- sum(pred == "Yes")
  actual_yes <- if ("Yes" %in% rownames(cm) && "Yes" %in% colnames(cm)) cm["Yes", "Yes"] else 0
  success_rate <- if (predicted_yes > 0) actual_yes / predicted_yes else 0
  
  results[i, "Predicted_Yes"] <- predicted_yes
  results[i, "Actual_Yes"] <- actual_yes
  results[i, "Success_Rate"] <- round(success_rate, 3)
}

print(results)
##   Lambda Predicted_Yes Actual_Yes Success_Rate
## 1   0.01            37         13        0.351
## 2   0.05            67         16        0.239
## 3   0.10            78         16        0.205
## 4   0.20            78         12        0.154
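
Choosing the best λ by the estimated test performance above amounts to picking the row of the results data frame with the highest success rate:

# Best shrinkage value by success rate among predicted purchasers
results[which.max(results$Success_Rate), ]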

Final Summary: In this analysis of the Caravan insurance data set, we aimed to predict whether a customer would purchase a caravan insurance policy using two classification techniques: boosting and k-nearest neighbors (KNN). The first 1,000 observations were held out as a test set and the models were trained on the remaining observations. Boosting was implemented using the gbm() function with 1,000 trees and several shrinkage values (λ), while KNN was applied with k = 1, 3, and 5. Because only about 6% of customers make a purchase, we evaluated performance in this imbalanced classification setting by the success rate among predicted purchasers rather than by overall accuracy.

Boosting with a shrinkage value of λ = 0.01 achieved the best overall results. It correctly identified 13 purchasers out of 37 predicted buyers, a success rate of 35.1%. This far exceeded the success rates from the KNN models, which ranged from 11.7% (k = 1) to 26.7% (k = 5). As the shrinkage value increased to 0.05, 0.1, and 0.2, the number of predicted purchasers grew but precision declined, suggesting that larger learning rates combined with a fixed 1,000 trees overfit the training data. The most important predictors identified by the boosting model were contributions to existing car, boat, and fire insurance policies, which intuitively align with consumer behavior in the insurance market. Overall, boosting not only delivered the highest predictive precision but also offered useful insight into the drivers of purchasing decisions, making it the superior method in this case.