R Markdown

Question 7

library(ISLR2)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()  masks randomForest::combine()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ ggplot2::margin() masks randomForest::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)

# Split data
train_idx <- sample(1:nrow(Boston), nrow(Boston) / 2)
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]

# Function to compute test MSE
rf_test_mse <- function(mtry, ntree) {
  model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
  preds <- predict(model, newdata = test_data)
  mean((preds - test_data$medv)^2)
}

# Vary mtry and ntree
mtry_vals <- c(1, 2, 4, 6, 8, 10, 12)
ntree_vals <- c(25, 100, 250, 500)

# Store results
results <- expand.grid(mtry = mtry_vals, ntree = ntree_vals)
results$mse <- mapply(rf_test_mse, results$mtry, results$ntree)

# Plot results
ggplot(results, aes(x = ntree, y = mse, color = as.factor(mtry), group = mtry)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Test MSE vs Number of Trees for Different mtry Values",
    x = "Number of Trees (ntree)",
    y = "Test MSE",
    color = "mtry"
  ) +
  theme_minimal()

As the number of trees (ntree) increases, the test error generally decreases or stabilizes, demonstrating the benefit of using more trees up to a point. However, the value of mtry plays a more critical role: setting it too low (e.g., mtry = 1) causes underfitting and high error, while intermediate values like mtry = 6 provide the best performance. These results suggest that choosing an optimal mtry is crucial for building a well-performing Random Forest model.
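As a quick check (a sketch using the results grid computed above), the single best-performing (mtry, ntree) combination can be pulled out directly:

# Sketch: extract the (mtry, ntree) pair with the lowest test MSE from the
# results grid computed above
results[which.min(results$mse), ]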

Question 8

library(ISLR2)
library(tree)
library(randomForest)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(BART)  
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## Loading required package: survival
set.seed(42)

a)

# Split the data into a training set and a test set
train_idx <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
train <- Carseats[train_idx, ]
test <- Carseats[-train_idx, ]

b)

tree_model <- tree(Sales ~ ., data = train)
summary(tree_model)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Income"      "Advertising"
## [6] "CompPrice"   "Population"  "Urban"      
## Number of terminal nodes:  18 
## Residual mean deviance:  2.266 = 412.4 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.46100 -1.01400 -0.09829  0.00000  1.06900  3.68600
plot(tree_model)
text(tree_model, pretty = 0)

# Test MSE
pred_tree <- predict(tree_model, newdata = test)
mse_tree <- mean((pred_tree - test$Sales)^2)
mse_tree
## [1] 5.686401

Left Subtree (Bad or Medium Shelf Location)

  • Then splits on Price < 124.5

  • Lower price leads to higher sales (intuitive).

  • Further splits on:

    • Age: Younger customers likely buying more.

    • Income, Advertising, CompPrice (competitor price): All affect purchasing behavior.

  • Leaf nodes show predicted Sales for observations in each path.

Right Subtree (Good Shelf Location)

  • Again splits on Price < 113

  • Lower prices still important, even with good shelf location.

  • Further splits use Urban, CompPrice, and Population:

    • Urban = No may indicate rural stores perform differently.

    • CompPrice and Population help refine local market conditions.

The regression tree begins by splitting on ShelveLoc, indicating that shelf location has the largest impact on predicted sales. Stores with “Good” shelf locations follow a different prediction path than those with “Bad” or “Medium” locations. For the latter, lower Price values and higher Advertising budgets were associated with increased sales. On the right side of the tree, for “Good” shelf locations, both Price and CompPrice played significant roles in determining predicted sales values. The tree illustrates how multiple interacting features contribute to sales performance in a retail setting.
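The split structure described above can also be verified in text form (a sketch using the fitted tree_model from part b):

# Sketch: print the fitted tree as text and list the variables used in splits,
# to confirm the structure described above
print(tree_model)
setdiff(unique(as.character(tree_model$frame$var)), "<leaf>")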

c)

cv_tree <- cv.tree(tree_model)
plot(cv_tree$size, cv_tree$dev, type = "b")

# Prune the tree
pruned_tree <- prune.tree(tree_model, best = cv_tree$size[which.min(cv_tree$dev)])
plot(pruned_tree)
text(pruned_tree, pretty = 0)

# MSE after pruning
pred_pruned <- predict(pruned_tree, newdata = test)
mse_pruned <- mean((pred_pruned - test$Sales)^2)
mse_pruned
## [1] 5.393592

Cross-validation shows that the deviance decreases sharply up to a tree size of about 6 and then levels off. Pruning to the size selected by cross-validation reduces overfitting and improves generalization, lowering the test MSE slightly from 5.69 to 5.39.

d)

bag_model <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE)
pred_bag <- predict(bag_model, newdata = test)
mse_bag <- mean((pred_bag - test$Sales)^2)
mse_bag
## [1] 2.360426
# Variable importance
importance(bag_model)
##                %IncMSE IncNodePurity
## CompPrice   30.4417501    197.998343
## Income       9.2101020    104.302860
## Advertising 12.7498522    104.098442
## Population   2.4997582     62.703767
## Price       54.8337819    432.196870
## ShelveLoc   56.8325702    429.294512
## Age         10.4795981    133.921947
## Education   -0.5607019     42.616731
## Urban        1.8401309     10.026923
## US           0.8045295      5.682508
varImpPlot(bag_model)

ShelveLoc and Price are by far the most important predictors, producing the largest increases in MSE when permuted and the largest gains in node purity. Variables such as US, Urban, and Education contribute very little to the predictive power of the bagged model.
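To make this ordering explicit, the importance matrix returned above can be sorted by %IncMSE (a sketch):

# Sketch: rank predictors in the bagged model by %IncMSE (permutation importance)
imp_bag <- importance(bag_model)
imp_bag[order(imp_bag[, "%IncMSE"], decreasing = TRUE), ]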

e)

rf_model <- randomForest(Sales ~ ., data = train, mtry = 4, importance = TRUE)
pred_rf <- predict(rf_model, newdata = test)
mse_rf <- mean((pred_rf - test$Sales)^2)
mse_rf
## [1] 2.720674
# Importance
importance(rf_model)
##                %IncMSE IncNodePurity
## CompPrice   18.2653561     162.97391
## Income       6.0659730     125.82369
## Advertising 12.1784658     128.66873
## Population   2.0057333      96.55879
## Price       35.1859261     362.08648
## ShelveLoc   42.0588679     356.48942
## Age         10.7853978     164.29555
## Education    0.2917640      75.11868
## Urban        1.3735268      15.12849
## US           0.5462559      12.78130
varImpPlot(rf_model)

In the random forest model, Price and ShelveLoc remain the most influential variables for predicting Sales. Variables such as Education, Urban, and US have minimal impact on reducing error or improving node purity.
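Although not computed above, the effect of mtry on the Carseats test MSE could be explored with a sketch like the following, reusing the train/test split from part a); the object names here are illustrative:

# Sketch: test MSE as a function of mtry (Carseats has 10 predictors), reusing
# the train/test split from part a)
carseats_mtry_mse <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
plot(1:10, carseats_mtry_mse, type = "b", xlab = "mtry", ylab = "Test MSE")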

f)

x_train <- train[, -which(names(train) == "Sales")]
y_train <- train$Sales
x_test <- test[, -which(names(test) == "Sales")]

bart_model <- gbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: -3.732150, -7.112150
## x1,x[n*p]: 116.000000, 1.000000
## xp1,xp[np*p]: 115.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 62 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.284787,3,0.219563,7.64215
## *****sigma: 1.061683
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
pred_bart <- bart_model$yhat.test.mean
mse_bart <- mean((pred_bart - test$Sales)^2)
mse_bart
## [1] 1.720143
cat("Regression Tree Test MSE:", mse_tree, "\n")
## Regression Tree Test MSE: 5.686401
cat("Pruned Tree Test MSE:", mse_pruned, "\n")
## Pruned Tree Test MSE: 5.393592
cat("Bagging Test MSE:", mse_bag, "\n")
## Bagging Test MSE: 2.360426
cat("Random Forest Test MSE:", mse_rf, "\n")
## Random Forest Test MSE: 2.720674
cat("BART Test MSE:", mse_bart, "\n")
## BART Test MSE: 1.720143

From the results above, the simple regression tree had the highest test MSE (5.69), and pruning offered only a slight improvement (5.39). The ensemble methods reduced the error substantially (bagging 2.36, random forest 2.72), while BART achieved the lowest test MSE (1.72), indicating the best performance on this split.
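For an at-a-glance comparison, the five test MSEs computed above can be collected into one table (a sketch using objects already in the workspace):

# Sketch: gather the test MSEs from parts b) through f) into a single table
mse_summary <- data.frame(
  Method = c("Regression tree", "Pruned tree", "Bagging", "Random forest", "BART"),
  TestMSE = c(mse_tree, mse_pruned, mse_bag, mse_rf, mse_bart)
)
mse_summary[order(mse_summary$TestMSE), ]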

Question 11

a)

# Load required libraries
library(ISLR2)
library(gbm)
library(class)
library(dplyr)

set.seed(42)

# Load and prepare data
data(Caravan)

# Step 1: Create binary response variable (0 = No, 1 = Yes)
Caravan$PurchaseBinary <- ifelse(Caravan$Purchase == "Yes", 1, 0)

# Step 2: Split into training and testing sets
train_idx <- 1:1000
test_idx <- 1001:nrow(Caravan)

train <- Caravan[train_idx, ]
test <- Caravan[test_idx, ]

b)

# Step 3: Fit GBM model
boost_model <- gbm(PurchaseBinary ~ . -Purchase, 
                   data = train,
                   distribution = "bernoulli",
                   n.trees = 1000,
                   shrinkage = 0.01,
                   interaction.depth = 1)

# Step 4: Variable Importance
summary(boost_model)

##               var     rel.inf
## PPERSAUT PPERSAUT 15.24304059
## MKOOPKLA MKOOPKLA 10.22049754
## MOPLHOOG MOPLHOOG  7.58473391
## MBERMIDD MBERMIDD  5.98365038
## PBRAND     PBRAND  4.55749118
## ABRAND     ABRAND  4.07601736
## MINK3045 MINK3045  4.03149141
## MGODGE     MGODGE  3.50618597
## MOSTYPE   MOSTYPE  2.82332650
## MAUT2       MAUT2  2.61711991
## MSKC         MSKC  2.21439111
## MAUT1       MAUT1  2.13764619
## MBERARBG MBERARBG  2.01645301
## MSKA         MSKA  1.98039700
## MGODPR     MGODPR  1.94608284
## PWAPART   PWAPART  1.90065796
## MGODOV     MGODOV  1.81046263
## MRELGE     MRELGE  1.65327955
## MINK7512 MINK7512  1.54952273
## MBERHOOG MBERHOOG  1.54562792
## MINKGEM   MINKGEM  1.51129086
## PBYSTAND PBYSTAND  1.46885445
## MSKB1       MSKB1  1.46349386
## MFWEKIND MFWEKIND  1.24890126
## MINKM30   MINKM30  1.01877067
## APERSAUT APERSAUT  1.00947264
## MSKD         MSKD  0.94342296
## MOSHOOFD MOSHOOFD  0.92805596
## MFGEKIND MFGEKIND  0.92012209
## MAUT0       MAUT0  0.91661495
## MGODRK     MGODRK  0.90097295
## MOPLMIDD MOPLMIDD  0.80067001
## MRELOV     MRELOV  0.79866885
## MHHUUR     MHHUUR  0.66251044
## MBERBOER MBERBOER  0.61490907
## MBERARBO MBERARBO  0.59493791
## PMOTSCO   PMOTSCO  0.59140712
## MSKB2       MSKB2  0.49738895
## PLEVEN     PLEVEN  0.47656908
## MINK4575 MINK4575  0.44932585
## MZPART     MZPART  0.43764686
## MHKOOP     MHKOOP  0.41454005
## MGEMOMV   MGEMOMV  0.41336506
## MBERZELF MBERZELF  0.38071586
## MZFONDS   MZFONDS  0.31613260
## MINK123M MINK123M  0.30173680
## MGEMLEEF MGEMLEEF  0.26253060
## MOPLLAAG MOPLLAAG  0.16165917
## MFALLEEN MFALLEEN  0.09723737
## MAANTHUI MAANTHUI  0.00000000
## MRELSA     MRELSA  0.00000000
## PWABEDR   PWABEDR  0.00000000
## PWALAND   PWALAND  0.00000000
## PBESAUT   PBESAUT  0.00000000
## PVRAAUT   PVRAAUT  0.00000000
## PAANHANG PAANHANG  0.00000000
## PTRACTOR PTRACTOR  0.00000000
## PWERKT     PWERKT  0.00000000
## PBROM       PBROM  0.00000000
## PPERSONG PPERSONG  0.00000000
## PGEZONG   PGEZONG  0.00000000
## PWAOREG   PWAOREG  0.00000000
## PZEILPL   PZEILPL  0.00000000
## PPLEZIER PPLEZIER  0.00000000
## PFIETS     PFIETS  0.00000000
## PINBOED   PINBOED  0.00000000
## AWAPART   AWAPART  0.00000000
## AWABEDR   AWABEDR  0.00000000
## AWALAND   AWALAND  0.00000000
## ABESAUT   ABESAUT  0.00000000
## AMOTSCO   AMOTSCO  0.00000000
## AVRAAUT   AVRAAUT  0.00000000
## AAANHANG AAANHANG  0.00000000
## ATRACTOR ATRACTOR  0.00000000
## AWERKT     AWERKT  0.00000000
## ABROM       ABROM  0.00000000
## ALEVEN     ALEVEN  0.00000000
## APERSONG APERSONG  0.00000000
## AGEZONG   AGEZONG  0.00000000
## AWAOREG   AWAOREG  0.00000000
## AZEILPL   AZEILPL  0.00000000
## APLEZIER APLEZIER  0.00000000
## AFIETS     AFIETS  0.00000000
## AINBOED   AINBOED  0.00000000
## ABYSTAND ABYSTAND  0.00000000
# Step 5: Predict probabilities on test set
boost_probs <- predict(boost_model, newdata = test, n.trees = 1000, type = "response")

# Step 6: Convert to binary prediction
boost_preds <- ifelse(boost_probs > 0.2, "Yes", "No")
boost_preds <- factor(boost_preds, levels = c("No", "Yes"))

# Step 7: Confusion Matrix & Precision
conf_matrix <- table(Predicted = boost_preds, Actual = test$Purchase)
print(conf_matrix)
##          Actual
## Predicted   No  Yes
##       No  4415  257
##       Yes  118   32
precision <- conf_matrix["Yes", "Yes"] / sum(conf_matrix["Yes", ])
cat("Boosting Precision (Yes|Predicted Yes):", round(precision, 4), "\n")
## Boosting Precision (Yes|Predicted Yes): 0.2133
# Step 8: Tune shrinkage
shrink_vals <- c(0.01, 0.05, 0.1, 0.2)
results <- data.frame(Shrinkage = shrink_vals, TestError = NA, Precision = NA)

for (i in 1:length(shrink_vals)) {
  model <- gbm(PurchaseBinary ~ . -Purchase,
               data = train,
               distribution = "bernoulli",
               n.trees = 1000,
               shrinkage = shrink_vals[i],
               interaction.depth = 1)
  
  probs <- predict(model, newdata = test, n.trees = 1000, type = "response")
  preds <- ifelse(probs > 0.2, "Yes", "No")
  preds <- factor(preds, levels = c("No", "Yes"))
  cm <- table(preds, test$Purchase)
  err <- mean(preds != test$Purchase)
  prec <- cm["Yes", "Yes"] / sum(cm["Yes", ])
  results[i, 2:3] <- c(err, prec)
}

print(results)
##   Shrinkage  TestError Precision
## 1      0.01 0.07818333 0.2215190
## 2      0.05 0.10535048 0.1501597
## 3      0.10 0.11613438 0.1424802
## 4      0.20 0.12878474 0.1343612
# Step 9: Normalize predictors (excluding Purchase and PurchaseBinary)
Caravan_scaled <- scale(Caravan[, -c(86, 87)])

train_X <- Caravan_scaled[train_idx, ]
test_X <- Caravan_scaled[test_idx, ]
train_Y <- Caravan$Purchase[train_idx]
test_Y <- Caravan$Purchase[test_idx]

# Step 10: KNN with various k
k_vals <- c(1, 3, 5, 10)
knn_results <- data.frame(k = k_vals, TestError = NA, Precision = NA)

for (i in 1:length(k_vals)) {
  pred_knn <- knn(train_X, test_X, train_Y, k = k_vals[i])
  cm <- table(pred_knn, test_Y)
  err <- mean(pred_knn != test_Y)
  prec <- cm["Yes", "Yes"] / sum(cm["Yes", ])
  knn_results[i, 2:3] <- c(err, prec)
}

print(knn_results)
##    k  TestError Precision
## 1  1 0.11115720 0.1152648
## 2  3 0.07465782 0.2066116
## 3  5 0.06345915 0.2702703
## 4 10 0.05993364       NaN

With 1,000 trees and shrinkage = 0.01, the boosting model identified PPERSAUT, MKOOPKLA, and MOPLHOOG as the most important predictors. Using a 20% probability threshold, boosting achieved a precision of 21.33% (the fraction of people predicted to purchase who actually did), and among the shrinkage values tried, 0.01 gave both the lowest test error and the highest precision. For KNN, k = 10 gave the lowest test error (5.99%) but predicted no purchases at all, so its precision is undefined (NaN); k = 5 offered the best balance, with a test error of 6.35% and the highest precision (27.0%), exceeding that of boosting.
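As a rough side-by-side view, the boosting and KNN tuning tables built above can be combined and sorted (a sketch using the results and knn_results data frames):

# Sketch: combine the boosting (by shrinkage) and KNN (by k) results into one
# table for comparison, using the data frames created above
boost_summary <- transform(results, Method = paste0("Boosting (shrinkage = ", Shrinkage, ")"))
knn_summary <- transform(knn_results, Method = paste0("KNN (k = ", k, ")"))
comparison <- rbind(boost_summary[, c("Method", "TestError", "Precision")],
                    knn_summary[, c("Method", "TestError", "Precision")])
comparison[order(comparison$TestError), ]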