7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

library(MASS)           # Boston data; medv (the response) is column 14
library(randomForest)

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]   # predictors only
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]         # response: medv
Y.test <- Boston[-train, 14]

p <- ncol(Boston) - 1    # number of predictors
m1 <- p                  # all predictors at each split (bagging)
m2 <- floor(p / 2)
m3 <- floor(sqrt(p))

rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, 
                           mtry = m3, ntree = 500)

mse1 <- rf.boston1$test$mse
mse2 <- rf.boston2$test$mse
mse3 <- rf.boston3$test$mse

ymin <- min(c(mse1, mse2, mse3))
ymax <- max(c(mse1, mse2, mse3))

plot(1:500, mse1, col = "green", type = "l", lwd = 2,
     xlab = "Number of Trees", ylab = "Test MSE", ylim = c(ymin, ymax))
lines(1:500, mse2, col = "red", lwd = 2)
lines(1:500, mse3, col = "blue", lwd = 2)
legend("topright", legend = c("m = p", "m = p/2", "m = sqrt(p)"),
       col = c("green", "red", "blue"), lty = 1, lwd = 2)

The plot displays test Mean Squared Error (MSE) as a function of the number of trees (ntree) in a random forest, for three values of mtry:

Green line: mtry = p

Red line: mtry = p/2

Blue line: mtry = sqrt(p)

Key Observations:

All three curves decrease rapidly as the first trees are added and stabilize after roughly 150 trees: This is expected, since averaging over more trees reduces the variance of the predictions, with the largest gains coming early on.

Blue line (m = sqrt(p)) yields the lowest test MSE across most of the ntree range: This confirms that random forests benefit from considering only a small subset of predictors at each split. A lower mtry decorrelates the trees, which improves the performance of the ensemble (a fuller sweep over mtry is sketched after these observations).

Red line (m = p/2) performs better than green (m = p), but worse than blue: A moderate mtry balances variance reduction and predictive power of individual trees.

Green line (m = p) has the highest test MSE: This is equivalent to bagging, where all predictors are considered at each split. Bagging produces more correlated trees, reducing the benefit of ensembling.
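
The question asks for a more comprehensive range of mtry and ntree values. As a sketch of how the three-curve plot above could be extended, the loop below (assuming the same train/test objects and the randomForest package already loaded) sweeps every mtry value from 1 to p and stores the full test-MSE curve for each:

# Sketch: test-MSE curve for every mtry value from 1 to p, each grown to 500 trees.
# Assumes Boston.train, Boston.test, Y.train, Y.test, and p are defined as above.
mse.all <- matrix(NA, nrow = 500, ncol = p)
for (m in 1:p) {
  rf.fit <- randomForest(Boston.train, y = Y.train,
                         xtest = Boston.test, ytest = Y.test,
                         mtry = m, ntree = 500)
  mse.all[, m] <- rf.fit$test$mse   # test MSE after 1, 2, ..., 500 trees
}
matplot(1:500, mse.all, type = "l", lty = 1, col = 1:p,
        xlab = "Number of Trees", ylab = "Test MSE")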

11. This question uses the Caravan data set.

For number 11, instead of only using 0.01 as the shrinkage in (11), use 0.01, 0.05, 0.1, and 0.2, and choose the best λ value by estimating the test error. Do not fit the logistic regression model. In order to have a trustworthy model with this many covariates, you would have to pursue a very sophisticated modeling process, which is beyond the scope of this class. I do not want you to report a model which does not converge, so only consider boosting and KNN here. Please do consider different values of k, the number of neighbors used in smoothing the KNN algorithm.

library(ISLR2)
library(gbm)
library(class)

a). Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

standardized.X <- scale(Caravan[, -86])   # standardize all 85 predictors (column 86 is Purchase)
Purchase <- Caravan$Purchase

test <- 1:1000                            # the first 1,000 observations are held out as the test set
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]

b). Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

# gbm with distribution = "bernoulli" requires a numeric 0/1 response
Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")

set.seed(1)
boost.caravan <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                     distribution = "bernoulli",
                     n.trees = 1000,
                     shrinkage = 0.01)

summary(boost.caravan)

##               var     rel.inf
## PPERSAUT PPERSAUT 25.77462891
## PPLEZIER PPLEZIER 12.87452732
## PBRAND     PBRAND 10.19292748
## MOPLLAAG MOPLLAAG  4.83547107
## MINKGEM   MINKGEM  4.32190380
## ALEVEN     ALEVEN  4.06709588
## PBYSTAND PBYSTAND  3.98749823
## APERSAUT APERSAUT  3.19156662
## MBERMIDD MBERMIDD  2.30144421
## MKOOPKLA MKOOPKLA  2.19115382
## MAUT1       MAUT1  1.88051502
## MBERHOOG MBERHOOG  1.82774814
## MOSHOOFD MOSHOOFD  1.79902350
## PWAPART   PWAPART  1.78092009
## PFIETS     PFIETS  1.37983185
## MBERARBG MBERARBG  1.36486472
## MINKM30   MINKM30  1.35865381
## MINK7512 MINK7512  1.34146746
## MGODOV     MGODOV  1.30363305
## MOSTYPE   MOSTYPE  1.16144252
## MOPLMIDD MOPLMIDD  1.15188377
## MRELGE     MRELGE  0.88320737
## MSKC         MSKC  0.87297327
## PGEZONG   PGEZONG  0.72327251
## MOPLHOOG MOPLHOOG  0.71047893
## AFIETS     AFIETS  0.69948203
## MHKOOP     MHKOOP  0.59634528
## MINK3045 MINK3045  0.58368256
## MGODPR     MGODPR  0.53409514
## MGODGE     MGODGE  0.46356510
## PLEVEN     PLEVEN  0.42662816
## MGEMLEEF MGEMLEEF  0.34057976
## MSKA         MSKA  0.30338080
## MGODRK     MGODRK  0.28577801
## MHHUUR     MHHUUR  0.24976259
## MINK4575 MINK4575  0.22508839
## MRELSA     MRELSA  0.21744841
## MAUT0       MAUT0  0.20791079
## MBERZELF MBERZELF  0.20197341
## MZPART     MZPART  0.18163500
## MINK123M MINK123M  0.15035536
## MAUT2       MAUT2  0.15035164
## MFGEKIND MFGEKIND  0.13073541
## MSKB1       MSKB1  0.13000199
## MBERBOER MBERBOER  0.12139096
## MRELOV     MRELOV  0.11669982
## PINBOED   PINBOED  0.10839605
## MSKD         MSKD  0.08046866
## PTRACTOR PTRACTOR  0.07780436
## MBERARBO MBERARBO  0.05569114
## MZFONDS   MZFONDS  0.04920348
## MFWEKIND MFWEKIND  0.03341234
## MAANTHUI MAANTHUI  0.00000000
## MGEMOMV   MGEMOMV  0.00000000
## MFALLEEN MFALLEEN  0.00000000
## MSKB2       MSKB2  0.00000000
## PWABEDR   PWABEDR  0.00000000
## PWALAND   PWALAND  0.00000000
## PBESAUT   PBESAUT  0.00000000
## PMOTSCO   PMOTSCO  0.00000000
## PVRAAUT   PVRAAUT  0.00000000
## PAANHANG PAANHANG  0.00000000
## PWERKT     PWERKT  0.00000000
## PBROM       PBROM  0.00000000
## PPERSONG PPERSONG  0.00000000
## PWAOREG   PWAOREG  0.00000000
## PZEILPL   PZEILPL  0.00000000
## AWAPART   AWAPART  0.00000000
## AWABEDR   AWABEDR  0.00000000
## AWALAND   AWALAND  0.00000000
## ABESAUT   ABESAUT  0.00000000
## AMOTSCO   AMOTSCO  0.00000000
## AVRAAUT   AVRAAUT  0.00000000
## AAANHANG AAANHANG  0.00000000
## ATRACTOR ATRACTOR  0.00000000
## AWERKT     AWERKT  0.00000000
## ABROM       ABROM  0.00000000
## APERSONG APERSONG  0.00000000
## AGEZONG   AGEZONG  0.00000000
## AWAOREG   AWAOREG  0.00000000
## ABRAND     ABRAND  0.00000000
## AZEILPL   AZEILPL  0.00000000
## APLEZIER APLEZIER  0.00000000
## AINBOED   AINBOED  0.00000000
## ABYSTAND ABYSTAND  0.00000000

Interpretation:

PPERSAUT (contribution to car insurance policies) has by far the strongest influence on purchase behavior, suggesting that customers who already spend on car insurance are the most likely to buy a caravan policy.

PPLEZIER (contribution to boat policies) and PBRAND (contribution to fire policies) also play significant roles.

Many of the top predictors are existing policy contributions or socio-economic indicators, which aligns with expectations for a consumer insurance product like caravan insurance.
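
To visualize just the leading predictors instead of the full 85-variable table, a minimal sketch that reuses the rel.inf values returned by summary() on the fitted boost.caravan object (no new model is fit):

# Sketch: bar plot of the ten most influential predictors from the fitted boosting model.
imp <- summary(boost.caravan, plotit = FALSE)   # data frame with columns var and rel.inf
top10 <- head(imp[order(imp$rel.inf, decreasing = TRUE), ], 10)
barplot(rev(top10$rel.inf), names.arg = rev(as.character(top10$var)),
        horiz = TRUE, las = 1, xlab = "Relative influence")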

c). Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?

boost.probs <- predict(boost.caravan, newdata = Caravan[test, ], n.trees = 1000, type = "response")

boost.pred <- ifelse(boost.probs > 0.2, "Yes", "No")

boost.table <- table(boost.pred, test.Y)
print(boost.table)
##           test.Y
## boost.pred  No Yes
##        No  917  46
##        Yes  24  13
success.rate <- boost.table["Yes", "Yes"] / sum(boost.table["Yes", ])
print(paste("Success rate (boosting, >20% cutoff):", round(success.rate, 3)))
## [1] "Success rate (boosting, >20% cutoff): 0.351"

Using the boosting model with a 0.2 cutoff, the test set confusion matrix shows that 13 of the 37 predicted buyers actually made a purchase. This yields a success rate of 35.1% among predicted purchasers, a substantial improvement over the baseline purchase rate of roughly 6% (checked below). Boosting is thus effective at identifying a smaller, more targeted group of likely purchasers.
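
For reference, the baseline rate can be confirmed directly from the test labels already defined above:

# Baseline: fraction of test-set customers who actually make a purchase (roughly 6%)
mean(test.Y == "Yes")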

set.seed(1)
knn.pred1 <- knn(train.X, test.X, train.Y, k = 1)
knn.pred3 <- knn(train.X, test.X, train.Y, k = 3)
knn.pred5 <- knn(train.X, test.X, train.Y, k = 5)

tab1 <- table(knn.pred1, test.Y)
tab3 <- table(knn.pred3, test.Y)
tab5 <- table(knn.pred5, test.Y)

prec1 <- tab1["Yes", "Yes"] / sum(tab1["Yes", ])
prec3 <- tab3["Yes", "Yes"] / sum(tab3["Yes", ])
prec5 <- tab5["Yes", "Yes"] / sum(tab5["Yes", ])

cat("KNN Success Rate (K = 1):", round(prec1, 3), "\n")
## KNN Success Rate (K = 1): 0.117
cat("KNN Success Rate (K = 3):", round(prec3, 3), "\n")
## KNN Success Rate (K = 3): 0.192
cat("KNN Success Rate (K = 5):", round(prec5, 3), "\n")
## KNN Success Rate (K = 5): 0.267

The boosting model with 1,000 trees and shrinkage λ = 0.01 predicted that 37 people would make a purchase, of whom 13 actually did, yielding a success rate of 35.1%. This is considerably higher than any of the KNN models: K = 1 yielded 11.7%, K = 3 yielded 19.2%, and K = 5 reached 26.7%. Thus, boosting provided the most precise targeting of potential buyers; a broader sweep over k is sketched below.
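
To consider more values of k, as the assignment asks, a sketch (assuming the standardized train.X, test.X, train.Y, and test.Y objects above) that computes the same success rate over a wider range of k:

# Sketch: success rate among predicted purchasers for a range of k values.
ks <- c(1, 3, 5, 7, 9, 11, 15)
knn.success <- sapply(ks, function(k) {
  set.seed(1)
  pred <- knn(train.X, test.X, train.Y, k = k)
  n.yes <- sum(pred == "Yes")
  if (n.yes == 0) return(NA)   # no purchases predicted at this k
  sum(pred == "Yes" & test.Y == "Yes") / n.yes
})
data.frame(k = ks, Success_Rate = round(knn.success, 3))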

Using multiple shrinkage values. The first loop below compares the overall test classification error across λ; because only about 6% of customers purchase, we then also compare the success rate among predicted purchasers, which is the more informative metric for this imbalanced problem.

shrinkages <- c(0.01, 0.05, 0.1, 0.2)
test.errors <- numeric(length(shrinkages))

for (i in 1:length(shrinkages)) {
  set.seed(1)
  boost.temp <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                    distribution = "bernoulli",
                    n.trees = 1000, shrinkage = shrinkages[i])
  pred.probs <- predict(boost.temp, newdata = Caravan[test, ], n.trees = 1000, type = "response")
  pred.class <- ifelse(pred.probs > 0.2, "Yes", "No")
  test.errors[i] <- mean(pred.class != test.Y)
}

plot(shrinkages, test.errors, type = "b", col = "blue", pch = 19,
     xlab = "Shrinkage (λ)", ylab = "Test Classification Error Rate",
     main = "Boosting Error Rate vs. Shrinkage")
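
The numbers behind the plot can be listed alongside each λ using the test.errors vector just computed:

# Test classification error rate for each shrinkage value
data.frame(Lambda = shrinkages, Test_Error = round(test.errors, 4))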

Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")

test <- 1:1000
train.data <- Caravan.boost[-test, ]
test.data <- Caravan.boost[test, ]
true.labels <- Caravan$Purchase[test]

shrinkages <- c(0.01, 0.05, 0.1, 0.2)

results <- data.frame(
  Lambda = shrinkages,
  Predicted_Yes = integer(length(shrinkages)),
  Actual_Yes = integer(length(shrinkages)),
  Success_Rate = numeric(length(shrinkages))
)

for (i in seq_along(shrinkages)) {
  set.seed(1)
  model <- gbm(Purchase ~ ., data = train.data,
               distribution = "bernoulli",
               n.trees = 1000,
               shrinkage = shrinkages[i],
               verbose = FALSE)
  
  probs <- predict(model, newdata = test.data, n.trees = 1000, type = "response")
  pred <- ifelse(probs > 0.2, "Yes", "No")
  
  cm <- table(pred, true.labels)
  
  predicted_yes <- sum(pred == "Yes")
  actual_yes <- if ("Yes" %in% rownames(cm) && "Yes" %in% colnames(cm)) cm["Yes", "Yes"] else 0
  success_rate <- if (predicted_yes > 0) actual_yes / predicted_yes else 0
  
  results[i, "Predicted_Yes"] <- predicted_yes
  results[i, "Actual_Yes"] <- actual_yes
  results[i, "Success_Rate"] <- round(success_rate, 3)
}

print(results)
##   Lambda Predicted_Yes Actual_Yes Success_Rate
## 1   0.01            37         13        0.351
## 2   0.05            67         16        0.239
## 3   0.10            78         16        0.205
## 4   0.20            78         12        0.154
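
Choosing the best λ by the estimated test performance above amounts to picking the row of the results data frame with the highest success rate:

# Best shrinkage value by success rate among predicted purchasers
results[which.max(results$Success_Rate), ]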

Final Summary: In this analysis of the Caravan insurance data set, we aimed to predict whether a customer would purchase a caravan insurance policy using two classification techniques: boosting and k-nearest neighbors (KNN). The first 1,000 observations were held out as a test set and the models were trained on the remaining observations. Boosting was implemented using the gbm() function with 1,000 trees and several shrinkage values (λ), while KNN was applied with k = 1, 3, and 5. Because only about 6% of customers make a purchase, we evaluated performance in this imbalanced classification setting by the success rate among predicted purchasers rather than by overall accuracy.

Boosting with a shrinkage value of λ = 0.01 achieved the best overall results. It correctly identified 13 purchasers out of 37 predicted buyers, a success rate of 35.1%. This far exceeded the success rates from the KNN models, which ranged from 11.7% (k = 1) to 26.7% (k = 5). As the shrinkage value increased to 0.05, 0.1, and 0.2, the number of predicted purchasers grew but precision declined, suggesting that larger learning rates combined with a fixed 1,000 trees overfit the training data. The most important predictors identified by the boosting model were contributions to existing car, boat, and fire insurance policies, which intuitively align with consumer behavior in the insurance market. Overall, boosting not only delivered the highest predictive precision but also offered useful insight into the drivers of purchasing decisions, making it the superior method in this case.