7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
library(randomForest)
library(MASS)
library(tidyverse)
library(ISLR2)   # ISLR2's Boston masks MASS's Boston; the ISLR2 version is used below
library(tree)
library(BART)
library(gbm)
library(caret)
library(class)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- setdiff(1:nrow(Boston), train)
# mtry can range from 1 to p, the number of predictors
p <- ncol(Boston) - 1
mtry_vals <- 1:p
ntree_vals <- c(25, 100, 250, 500)
test_error <- matrix(NA, nrow = length(mtry_vals), ncol = length(ntree_vals),
                     dimnames = list(paste0("mtry=", mtry_vals), paste0("ntree=", ntree_vals)))
for (i in seq_along(mtry_vals)) {
  for (j in seq_along(ntree_vals)) {
    rf <- randomForest(medv ~ ., data = Boston, subset = train,
                       mtry = mtry_vals[i], ntree = ntree_vals[j])
    pred <- predict(rf, newdata = Boston[test, ])
    test_error[i, j] <- mean((pred - Boston$medv[test])^2)
  }
}
matplot(mtry_vals, test_error, type = "b", pch = 19, col = 1:length(ntree_vals), lty = 1,
        xlab = "mtry", ylab = "Test MSE", main = "Random Forest Test Error")
legend("topright", legend = paste("ntree =", ntree_vals),
       col = 1:length(ntree_vals), lty = 1, pch = 19)
Explanation: This code evaluates the test MSE of random forests on the Boston data set over a grid of mtry and ntree values and plots the results, in the style of Figure 8.10. In a typical run the test MSE drops sharply as mtry increases from 1, is lowest at an intermediate mtry, and varies little between 250 and 500 trees, while the 25-tree forests are visibly noisier.
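To pin down the "describe the results" part, the best cell can be read straight off the error grid; a minimal sketch using the test_error matrix computed above:
# Locate the (mtry, ntree) combination with the smallest test MSE
best <- which(test_error == min(test_error), arr.ind = TRUE)
rownames(test_error)[best[1, "row"]]
colnames(test_error)[best[1, "col"]]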
8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
test <- setdiff(1:nrow(Carseats), train)
Explanation: Splits the Carseats dataset into equal-sized training and test sets.
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree.carseats <- tree(Sales ~ ., data = Carseats, subset = train)
plot(tree.carseats); text(tree.carseats, pretty = 0)
yhat <- predict(tree.carseats, newdata = Carseats[test, ])
mse_tree <- mean((yhat - Carseats$Sales[test])^2)
mse_tree
## [1] 4.922039
Explanation: Fits a regression tree to predict Sales, visualizes it, and computes test MSE.
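For the interpretation part of the question, summary() on the fitted tree reports the variables actually used in the splits, the number of terminal nodes, and the residual mean deviance:
summary(tree.carseats)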
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")
pruned <- prune.tree(tree.carseats, best = cv.carseats$size[which.min(cv.carseats$dev)])
yhat_pruned <- predict(pruned, newdata = Carseats[test, ])
mse_pruned <- mean((yhat_pruned - Carseats$Sales[test])^2)
mse_pruned
## [1] 4.922039
Explanation: Performs cross-validation to choose the optimal tree size and evaluates the pruned tree's test MSE. Here CV selects the full tree, so pruning returns the unpruned tree and the test MSE is unchanged at 4.92; pruning does not improve the test MSE in this case.
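Since CV selects the full tree, forcing a smaller tree shows what pruning would cost; the size of 5 terminal nodes below is an illustrative choice, not CV-selected:
pruned5 <- prune.tree(tree.carseats, best = 5)
yhat5 <- predict(pruned5, newdata = Carseats[test, ])
mean((yhat5 - Carseats$Sales[test])^2)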
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
bag.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = ncol(Carseats) - 1, importance = TRUE)
yhat_bag <- predict(bag.carseats, newdata = Carseats[test, ])
mean((yhat_bag - Carseats$Sales[test])^2)
## [1] 2.657296
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 23.07909904 171.185734
## Income 2.82081527 94.079825
## Advertising 11.43295625 99.098941
## Population -3.92119532 59.818905
## Price 54.24314632 505.887016
## ShelveLoc 46.26912996 361.962753
## Age 14.24992212 159.740422
## Education -0.07662320 46.738585
## Urban 0.08530119 8.453749
## US 4.34349223 15.157608
Explanation: Applies bagging (a random forest with m = p = 10), computes the test MSE (2.66, a large improvement over the single tree's 4.92), and shows variable importance: Price and ShelveLoc are by far the most important predictors.
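The same importance information can be displayed graphically with randomForest's built-in plot:
varImpPlot(bag.carseats)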
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
rf.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train, importance = TRUE)
yhat_rf <- predict(rf.carseats, newdata = Carseats[test, ])
mean((yhat_rf - Carseats$Sales[test])^2)
## [1] 3.049406
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 12.9489323 158.48521
## Income 2.2754686 129.59400
## Advertising 8.9977589 111.94374
## Population -2.2513981 102.84599
## Price 33.4226950 391.60804
## ShelveLoc 34.0233545 290.56502
## Age 12.2185108 171.83302
## Education 0.2592124 71.65413
## Urban 1.1382113 14.76798
## US 4.1925335 33.75554
Explanation: Fits a random forest with the default m = p/3 and computes the test MSE (3.05), slightly worse than bagging on this split; Price and ShelveLoc again dominate the importance rankings. The sketch below examines the effect of m directly.
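A minimal sketch (an addition, not part of the output above) that sweeps m from 1 to p and records the test MSE:
p <- ncol(Carseats) - 1
rf_mse <- sapply(1:p, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = Carseats[test, ]) - Carseats$Sales[test])^2)
})
plot(1:p, rf_mse, type = "b", xlab = "mtry (m)", ylab = "Test MSE")
Consistent with the results above, where bagging (m = p) beat the default m = p/3, larger values of m tend to do better on this data set.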
(f) Now analyze the data using BART, and report your results.
# Sales is the first column of Carseats; drop it to form the predictor frames
xtrain <- Carseats[train, -1]
xtest <- Carseats[test, -1]
ytrain <- Carseats$Sales[train]
bart_model <- gbart(x.train = xtrain, y.train = ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: 2.781850, 1.091850
## x1,x[n*p]: 107.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 63 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.273474,3,0.23074,7.57815
## *****sigma: 1.088371
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100) ... done 1000 (out of 1100)
## time: 5s
## trcnt,tecnt: 1000,1000
mean((bart_model$yhat.test.mean - Carseats$Sales[test])^2)
## [1] 1.438471
Explanation: Fits Bayesian additive regression trees (BART) and reports the test MSE (1.44), the lowest of all the Carseats methods tried here.
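gbart also records how often each variable is used across the ensemble; its varcount.mean component gives a rough importance ranking:
sort(bart_model$varcount.mean, decreasing = TRUE)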
11. This question uses the Caravan data set.
(a) Create a training set consisting of the first 1,000 observations,
and a test set consisting of the remaining observations.
train <- 1:1000
test <- -train
Explanation: The first 1,000 observations form the training set; the remaining 4,822 form the test set.
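Before fitting anything, note the heavy class imbalance in the response; it motivates the 20% probability threshold used in part (c):
mean(Caravan$Purchase == "Yes")  # about 0.06: only ~6% of customers purchase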
(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
# Set training and test indices
train <- 1:1000
test <- -train
# gbm's bernoulli family needs a 0/1 response
caravan_train <- Caravan[train, ]
caravan_train$Purchase <- as.numeric(Caravan$Purchase[train] == "Yes")  # 1 if "Yes", else 0
shrink_vals <- c(0.01, 0.05, 0.1, 0.2)
test_errors <- numeric(length(shrink_vals))
names(test_errors) <- paste0("λ = ", shrink_vals)
for (i in seq_along(shrink_vals)) {
  boost_model <- gbm(Purchase ~ ., data = caravan_train, distribution = "bernoulli",
                     n.trees = 1000, shrinkage = shrink_vals[i], verbose = FALSE)
  # Predicted purchase probabilities on the test set
  probs <- predict(boost_model, newdata = Caravan[test, ], n.trees = 1000, type = "response")
  # Predict "Yes" when the estimated probability exceeds 20%
  pred_labels <- ifelse(probs > 0.2, "Yes", "No")
  actual <- Caravan$Purchase[test]
  # Precision: of the people predicted to purchase, the fraction who do
  correct_yes <- mean(pred_labels == "Yes" & actual == "Yes")
  predicted_yes <- mean(pred_labels == "Yes")
  precision <- ifelse(predicted_yes == 0, 0, correct_yes / predicted_yes)
  test_errors[i] <- 1 - mean(pred_labels == actual)
  cat(sprintf("λ = %.2f | Test Error = %.3f | Precision for 'Yes' = %.3f\n",
              shrink_vals[i], test_errors[i], precision))
}
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 71: AVRAAUT has no variation.
## λ = 0.01 | Test Error = 0.078 | Precision for 'Yes' = 0.203
## λ = 0.05 | Test Error = 0.108 | Precision for 'Yes' = 0.139
## λ = 0.10 | Test Error = 0.114 | Precision for 'Yes' = 0.145
## λ = 0.20 | Test Error = 0.127 | Precision for 'Yes' = 0.132
Explanation: Fits boosted models across several shrinkage values and reports the test error and the precision for "Yes" at the 20% threshold; the smallest shrinkage, λ = 0.01, gives both the lowest test error (0.078) and the highest precision (0.203). The snippet below addresses the variable-importance part of the question.
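The question also asks which predictors appear most important. summary() on a gbm fit returns the relative-influence table; since after the loop boost_model holds the λ = 0.2 fit, refit at λ = 0.01 to rank predictors for the model of interest:
boost01 <- gbm(Purchase ~ ., data = caravan_train, distribution = "bernoulli",
               n.trees = 1000, shrinkage = 0.01, verbose = FALSE)
head(summary(boost01, plotit = FALSE), 10)  # top 10 predictors by relative influence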
(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
# Split train and test indices
train <- 1:1000
test <- -train
# Separate predictors and response (Purchase is column 86)
X <- Caravan[, -86]
Y <- Caravan$Purchase
# scale() turns zero-variance columns into NaN; drop any such columns afterwards
X_scaled <- scale(X)
X_scaled <- X_scaled[, colSums(is.na(X_scaled)) == 0]
# Final train/test matrices
train.X <- X_scaled[train, ]
test.X <- X_scaled[test, ]
train.Y <- Y[train]
test.Y <- Y[test]
# Run KNN for several values of k and record test-set accuracy
k_vals <- c(1, 3, 5, 10)
acc_knn <- sapply(k_vals, function(k) {
  pred <- knn(train.X, test.X, train.Y, k = k)
  mean(pred == test.Y)
})
names(acc_knn) <- paste0("k=", k_vals)
print(acc_knn)
## k=1 k=3 k=5 k=10
## 0.8890502 0.9253422 0.9367482 0.9400664
Explanation: Runs KNN for several values of k and computes overall test-set accuracy. Accuracy alone is misleading here because only about 6% of customers purchase, so always predicting "No" would score about 0.94, close to the best KNN accuracy above. The comparison the question asks for is the precision for "Yes": boosting with λ = 0.01 achieved about 0.20, i.e. roughly one in five people it flags actually purchases, more than triple the base rate. The sketch below forms the requested confusion matrices.
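Part (c) also asks for a confusion matrix from the boosting model and a comparison with logistic regression; a minimal sketch, reusing the boost01 fit from above and the same 20% threshold:
probs01 <- predict(boost01, newdata = Caravan[test, ], n.trees = 1000, type = "response")
pred01 <- ifelse(probs01 > 0.2, "Yes", "No")
table(Predicted = pred01, Actual = Caravan$Purchase[test])
# Logistic regression with the same threshold; expect a rank-deficiency
# warning from the zero-variance training columns noted earlier
glm_fit <- glm(Purchase ~ ., data = Caravan, family = binomial, subset = train)
glm_probs <- predict(glm_fit, newdata = Caravan[test, ], type = "response")
glm_pred <- ifelse(glm_probs > 0.2, "Yes", "No")
table(Predicted = glm_pred, Actual = Caravan$Purchase[test])
The fraction of predicted purchasers who actually purchase is the ("Yes", "Yes") cell divided by the "Yes" row total; for λ = 0.01 this matches the precision of about 0.20 printed above.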