library(MASS)            # Boston housing data (medv is column 14)
library(randomForest)

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]

# Three choices of mtry: all p predictors (bagging), p/2, and sqrt(p)
p <- ncol(Boston) - 1
m1 <- p
m2 <- floor(p / 2)
m3 <- floor(sqrt(p))

rf.boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test,
                           mtry = m1, ntree = 500)
rf.boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test,
                           mtry = m2, ntree = 500)
rf.boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test,
                           mtry = m3, ntree = 500)
mse1 <- rf.boston1$test$mse
mse2 <- rf.boston2$test$mse
mse3 <- rf.boston3$test$mse
ymin <- min(c(mse1, mse2, mse3))
ymax <- max(c(mse1, mse2, mse3))
plot(1:500, mse1, col = "green", type = "l", lwd = 2,
     xlab = "Number of Trees", ylab = "Test MSE", ylim = c(ymin, ymax))
lines(1:500, mse2, col = "red", lwd = 2)
lines(1:500, mse3, col = "blue", lwd = 2)
legend("topright", legend = c("m = p", "m = p/2", "m = sqrt(p)"),
       col = c("green", "red", "blue"), lty = 1, lwd = 2)
The plot displays test Mean Squared Error (MSE) as a function of the number of trees (ntree) in a random forest, for three values of mtry:
Green line: mtry = p
Red line: mtry = p/2
Blue line: mtry = sqrt(p)
Key Observations:
All curves decrease rapidly at first as more trees are added: The test error stabilizes after about 150 trees. This is expected since adding more trees reduces variance in predictions, especially early on.
Blue line (m = sqrt(p)) yields the lowest test MSE across most of the ntree range: This confirms that random forests benefit from choosing a small number of predictors at each split. Lower mtry increases tree diversity, which improves ensemble performance.
Red line (m = p/2) performs better than green (m = p), but worse than blue: A moderate mtry balances variance reduction and predictive power of individual trees.
Green line (m = p) has the highest test MSE: This is equivalent to bagging, where all predictors are considered at each split. Bagging produces more correlated trees, reducing the benefit of ensembling.
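To make this comparison concrete, the final test MSE at 500 trees for each choice of mtry can be pulled directly from the fitted objects; a minimal sketch using the three models fit above:
# Final test MSE (at ntree = 500) for each choice of mtry
final.mse <- c("m = p"       = tail(rf.boston1$test$mse, 1),
               "m = p/2"     = tail(rf.boston2$test$mse, 1),
               "m = sqrt(p)" = tail(rf.boston3$test$mse, 1))
round(final.mse, 2)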
library(ISLR2)
library(gbm)
library(class)
# Standardize the predictors (column 86 is the response, Purchase)
standardized.X <- scale(Caravan[, -86])
Purchase <- Caravan$Purchase

# The first 1,000 observations form the test set; the rest are used for training
test <- 1:1000
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]

# gbm with distribution = "bernoulli" expects a 0/1 response
Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")

set.seed(1)
boost.caravan <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                     distribution = "bernoulli",
                     n.trees = 1000,
                     shrinkage = 0.01)
summary(boost.caravan)
## var rel.inf
## PPERSAUT PPERSAUT 25.77462891
## PPLEZIER PPLEZIER 12.87452732
## PBRAND PBRAND 10.19292748
## MOPLLAAG MOPLLAAG 4.83547107
## MINKGEM MINKGEM 4.32190380
## ALEVEN ALEVEN 4.06709588
## PBYSTAND PBYSTAND 3.98749823
## APERSAUT APERSAUT 3.19156662
## MBERMIDD MBERMIDD 2.30144421
## MKOOPKLA MKOOPKLA 2.19115382
## MAUT1 MAUT1 1.88051502
## MBERHOOG MBERHOOG 1.82774814
## MOSHOOFD MOSHOOFD 1.79902350
## PWAPART PWAPART 1.78092009
## PFIETS PFIETS 1.37983185
## MBERARBG MBERARBG 1.36486472
## MINKM30 MINKM30 1.35865381
## MINK7512 MINK7512 1.34146746
## MGODOV MGODOV 1.30363305
## MOSTYPE MOSTYPE 1.16144252
## MOPLMIDD MOPLMIDD 1.15188377
## MRELGE MRELGE 0.88320737
## MSKC MSKC 0.87297327
## PGEZONG PGEZONG 0.72327251
## MOPLHOOG MOPLHOOG 0.71047893
## AFIETS AFIETS 0.69948203
## MHKOOP MHKOOP 0.59634528
## MINK3045 MINK3045 0.58368256
## MGODPR MGODPR 0.53409514
## MGODGE MGODGE 0.46356510
## PLEVEN PLEVEN 0.42662816
## MGEMLEEF MGEMLEEF 0.34057976
## MSKA MSKA 0.30338080
## MGODRK MGODRK 0.28577801
## MHHUUR MHHUUR 0.24976259
## MINK4575 MINK4575 0.22508839
## MRELSA MRELSA 0.21744841
## MAUT0 MAUT0 0.20791079
## MBERZELF MBERZELF 0.20197341
## MZPART MZPART 0.18163500
## MINK123M MINK123M 0.15035536
## MAUT2 MAUT2 0.15035164
## MFGEKIND MFGEKIND 0.13073541
## MSKB1 MSKB1 0.13000199
## MBERBOER MBERBOER 0.12139096
## MRELOV MRELOV 0.11669982
## PINBOED PINBOED 0.10839605
## MSKD MSKD 0.08046866
## PTRACTOR PTRACTOR 0.07780436
## MBERARBO MBERARBO 0.05569114
## MZFONDS MZFONDS 0.04920348
## MFWEKIND MFWEKIND 0.03341234
## MAANTHUI MAANTHUI 0.00000000
## MGEMOMV MGEMOMV 0.00000000
## MFALLEEN MFALLEEN 0.00000000
## MSKB2 MSKB2 0.00000000
## PWABEDR PWABEDR 0.00000000
## PWALAND PWALAND 0.00000000
## PBESAUT PBESAUT 0.00000000
## PMOTSCO PMOTSCO 0.00000000
## PVRAAUT PVRAAUT 0.00000000
## PAANHANG PAANHANG 0.00000000
## PWERKT PWERKT 0.00000000
## PBROM PBROM 0.00000000
## PPERSONG PPERSONG 0.00000000
## PWAOREG PWAOREG 0.00000000
## PZEILPL PZEILPL 0.00000000
## AWAPART AWAPART 0.00000000
## AWABEDR AWABEDR 0.00000000
## AWALAND AWALAND 0.00000000
## ABESAUT ABESAUT 0.00000000
## AMOTSCO AMOTSCO 0.00000000
## AVRAAUT AVRAAUT 0.00000000
## AAANHANG AAANHANG 0.00000000
## ATRACTOR ATRACTOR 0.00000000
## AWERKT AWERKT 0.00000000
## ABROM ABROM 0.00000000
## APERSONG APERSONG 0.00000000
## AGEZONG AGEZONG 0.00000000
## AWAOREG AWAOREG 0.00000000
## ABRAND ABRAND 0.00000000
## AZEILPL AZEILPL 0.00000000
## APLEZIER APLEZIER 0.00000000
## AINBOED AINBOED 0.00000000
## ABYSTAND ABYSTAND 0.00000000
Interpretation:
PPERSAUT (contribution to car insurance policies) has the largest relative influence, suggesting that spending on car policies is highly predictive of caravan policy purchases.
PPLEZIER (contribution to boat policies) and PBRAND (contribution to fire policies) also contribute substantially.
Many of the top predictors are existing-policy or socio-economic indicators, which aligns with expectations for a cross-sold insurance product like the caravan policy.
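If only the leading predictors are of interest, gbm's summary method can return the relative-influence table without drawing the bar chart (via its plotit argument); a small sketch:
# Top 10 predictors by relative influence, without the accompanying bar plot
rel.inf <- summary(boost.caravan, plotit = FALSE)
head(rel.inf, 10)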
boost.probs <- predict(boost.caravan, newdata = Caravan[test, ], n.trees = 1000, type = "response")
boost.pred <- ifelse(boost.probs > 0.2, "Yes", "No")
boost.table <- table(boost.pred, test.Y)
print(boost.table)
## test.Y
## boost.pred No Yes
## No 917 46
## Yes 24 13
success.rate <- boost.table["Yes", "Yes"] / sum(boost.table["Yes", ])
print(paste("Success rate (boosting, >20% cutoff):", round(success.rate, 3)))
## [1] "Success rate (boosting, >20% cutoff): 0.351"
Using the boosting model with a 0.2 probability cutoff, the test-set confusion matrix shows that 13 of the 37 predicted buyers actually made a purchase, a success rate of 35.1% among predicted purchasers. This is a substantial improvement over the baseline purchase rate of roughly 6%, so boosting is effective at identifying a smaller, more targeted group of likely purchasers.
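The roughly 6% baseline quoted above is simply the purchase rate in the held-out test set, which can be checked directly from the objects defined earlier:
# Purchase rate among the 1,000 test observations (the no-model baseline)
round(mean(test.Y == "Yes"), 3)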
set.seed(1)
# KNN on the standardized predictors, for k = 1, 3, and 5
knn.pred1 <- knn(train.X, test.X, train.Y, k = 1)
knn.pred3 <- knn(train.X, test.X, train.Y, k = 3)
knn.pred5 <- knn(train.X, test.X, train.Y, k = 5)
tab1 <- table(knn.pred1, test.Y)
tab3 <- table(knn.pred3, test.Y)
tab5 <- table(knn.pred5, test.Y)
prec1 <- tab1["Yes", "Yes"] / sum(tab1["Yes", ])
prec3 <- tab3["Yes", "Yes"] / sum(tab3["Yes", ])
prec5 <- tab5["Yes", "Yes"] / sum(tab5["Yes", ])
cat("KNN Success Rate (K = 1):", round(prec1, 3), "\n")
## KNN Success Rate (K = 1): 0.117
cat("KNN Success Rate (K = 3):", round(prec3, 3), "\n")
## KNN Success Rate (K = 3): 0.192
cat("KNN Success Rate (K = 5):", round(prec5, 3), "\n")
## KNN Success Rate (K = 5): 0.267
The boosting model with 1,000 trees and shrinkage λ = 0.01 predicted that 37 people would make a purchase, of whom 13 actually did, a success rate of 35.1%. This is substantially higher than any of the KNN models: k = 1 yielded 11.7%, k = 3 yielded 19.2%, and k = 5 reached 26.7%. Boosting therefore provided the most precise targeting of potential buyers; the four rates are collected into a single table below.
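For reference, the comparison can be assembled from the quantities already computed (success.rate, prec1, prec3, prec5):
# Precision among predicted purchasers, by model
comparison <- data.frame(Model = c("Boosting (lambda = 0.01, cutoff 0.2)",
                                   "KNN (k = 1)", "KNN (k = 3)", "KNN (k = 5)"),
                         Success_Rate = round(c(success.rate, prec1, prec3, prec5), 3))
print(comparison)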
# Refit the boosting model for several shrinkage (learning-rate) values,
# keeping n.trees = 1000 and the 0.2 cutoff fixed, and record the test error rate
shrinkages <- c(0.01, 0.05, 0.1, 0.2)
test.errors <- numeric(length(shrinkages))
for (i in 1:length(shrinkages)) {
  set.seed(1)
  boost.temp <- gbm(Purchase ~ ., data = Caravan.boost[-test, ],
                    distribution = "bernoulli",
                    n.trees = 1000, shrinkage = shrinkages[i])
  pred.probs <- predict(boost.temp, newdata = Caravan[test, ], n.trees = 1000, type = "response")
  pred.class <- ifelse(pred.probs > 0.2, "Yes", "No")
  test.errors[i] <- mean(pred.class != test.Y)
}

plot(shrinkages, test.errors, type = "b", col = "blue", pch = 19,
     xlab = "Shrinkage (λ)", ylab = "Test Classification Error Rate",
     main = "Boosting Error Rate vs. Shrinkage")
# Repeat the shrinkage comparison, but evaluate the success rate among
# predicted purchasers (precision) rather than the overall error rate
Caravan.boost <- Caravan
Caravan.boost$Purchase <- as.numeric(Caravan.boost$Purchase == "Yes")
test <- 1:1000
train.data <- Caravan.boost[-test, ]
test.data <- Caravan.boost[test, ]
true.labels <- Caravan$Purchase[test]

shrinkages <- c(0.01, 0.05, 0.1, 0.2)
results <- data.frame(Lambda = shrinkages,
                      Predicted_Yes = integer(length(shrinkages)),
                      Actual_Yes = integer(length(shrinkages)),
                      Success_Rate = numeric(length(shrinkages)))

for (i in seq_along(shrinkages)) {
  set.seed(1)
  model <- gbm(Purchase ~ ., data = train.data,
               distribution = "bernoulli",
               n.trees = 1000,
               shrinkage = shrinkages[i],
               verbose = FALSE)
  probs <- predict(model, newdata = test.data, n.trees = 1000, type = "response")
  pred <- ifelse(probs > 0.2, "Yes", "No")
  cm <- table(pred, true.labels)

  predicted_yes <- sum(pred == "Yes")
  actual_yes <- if ("Yes" %in% rownames(cm) && "Yes" %in% colnames(cm)) cm["Yes", "Yes"] else 0
  success_rate <- if (predicted_yes > 0) actual_yes / predicted_yes else 0

  results[i, "Predicted_Yes"] <- predicted_yes
  results[i, "Actual_Yes"] <- actual_yes
  results[i, "Success_Rate"] <- round(success_rate, 3)
}
print(results)
## Lambda Predicted_Yes Actual_Yes Success_Rate
## 1 0.01 37 13 0.351
## 2 0.05 67 16 0.239
## 3 0.10 78 16 0.205
## 4 0.20 78 12 0.154
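The same results can be shown graphically, mirroring the earlier error-rate plot; a minimal sketch using the results data frame above:
# Success rate among predicted purchasers as a function of shrinkage
plot(results$Lambda, results$Success_Rate, type = "b", col = "darkgreen", pch = 19,
     xlab = "Shrinkage (λ)", ylab = "Success Rate Among Predicted Purchasers",
     main = "Boosting Precision vs. Shrinkage")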
Final Summary: In this analysis of the Caravan insurance dataset, we aimed to predict whether a customer would purchase a caravan insurance policy using two classification techniques: boosting and k-nearest neighbors (KNN). We held out the first 1,000 observations as a test set and trained the models on the remaining observations. Boosting was implemented with the gbm() function using 1,000 trees and several shrinkage values (λ), while KNN was applied with k = 1, 3, and 5. Because only about 6% of customers make a purchase, we evaluated the models by the success rate among predicted purchasers (precision) rather than by overall accuracy.
Boosting with a shrinkage value of λ = 0.01 achieved the best overall results. It correctly identified 13 purchasers out of 37 predicted buyers, a success rate of 35.1%. This far exceeded the success rates from the KNN models, which ranged from 11.7% (k = 1) to 26.7% (k = 5). As the shrinkage value increased to 0.05, 0.1, and 0.2, the number of predicted purchasers grew but precision declined, consistent with overfitting when the learning rate is too large for a fixed number of trees. The most important predictors identified by the boosting model were contributions to car, boat, and fire insurance policies, which align intuitively with consumer behavior in the insurance market. Overall, boosting delivered both the highest predictive precision and useful insight into the drivers of purchasing decisions, making it the superior method in this case.