7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
library(randomForest)
library(MASS)
library(tidyverse)
library(ISLR2)   # ISLR2's Boston masks MASS's Boston; the ISLR2 version is used below
library(tree)
library(BART)
library(gbm)
library(caret)
library(class)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- setdiff(1:nrow(Boston), train)
# mtry can range from 1 to p, the number of predictors
p <- ncol(Boston) - 1
mtry_vals <- 1:p
ntree_vals <- c(25, 100, 250, 500)
test_error <- matrix(NA, nrow = length(mtry_vals), ncol = length(ntree_vals),
                     dimnames = list(paste0("mtry=", mtry_vals), paste0("ntree=", ntree_vals)))
for (i in seq_along(mtry_vals)) {
  for (j in seq_along(ntree_vals)) {
    rf <- randomForest(medv ~ ., data = Boston, subset = train,
                       mtry = mtry_vals[i], ntree = ntree_vals[j])
    pred <- predict(rf, newdata = Boston[test, ])
    test_error[i, j] <- mean((pred - Boston$medv[test])^2)
  }
}
matplot(mtry_vals, test_error, type = "b", pch = 19, col = 1:length(ntree_vals), lty = 1,
        xlab = "mtry", ylab = "Test MSE", main = "Random Forest Test Error")
legend("topright", legend = paste("ntree =", ntree_vals),
       col = 1:length(ntree_vals), lty = 1, pch = 19)
Explanation: This code evaluates the test MSE of random forests on the Boston data set over a grid of mtry and ntree values and plots the results, in the style of Figure 8.10. In a typical run the test MSE drops sharply as mtry increases from 1, is lowest at an intermediate mtry, and varies little between 250 and 500 trees, while the 25-tree forests are visibly noisier.
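To pin down the "describe the results" part, the best cell can be read straight off the error grid; a minimal sketch using the test_error matrix computed above:
# Locate the (mtry, ntree) combination with the smallest test MSE
best <- which(test_error == min(test_error), arr.ind = TRUE)
rownames(test_error)[best[1, "row"]]
colnames(test_error)[best[1, "col"]]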
8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
set.seed(1)
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
test <- setdiff(1:nrow(Carseats), train)
Explanation: Splits the Carseats dataset into equal-sized training and test sets.
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree.carseats <- tree(Sales ~ ., data = Carseats, subset = train)
plot(tree.carseats); text(tree.carseats, pretty = 0)
yhat <- predict(tree.carseats, newdata = Carseats[test, ])
mse_tree <- mean((yhat - Carseats$Sales[test])^2)
mse_tree
## [1] 4.922039
Explanation: Fits a regression tree to predict Sales, visualizes it, and computes test MSE.
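For the interpretation part of the question, summary() on the fitted tree reports the variables actually used in the splits, the number of terminal nodes, and the residual mean deviance:
summary(tree.carseats)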
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")
pruned <- prune.tree(tree.carseats, best = cv.carseats$size[which.min(cv.carseats$dev)])
yhat_pruned <- predict(pruned, newdata = Carseats[test, ])
mse_pruned <- mean((yhat_pruned - Carseats$Sales[test])^2)
mse_pruned
## [1] 4.922039
Explanation: Performs cross-validation to choose the optimal tree size and evaluates the pruned tree's test MSE. Here CV selects the full tree, so pruning returns the unpruned tree and the test MSE is unchanged at 4.92; pruning does not improve the test MSE in this case.
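Since CV selects the full tree, forcing a smaller tree shows what pruning would cost; the size of 5 terminal nodes below is an illustrative choice, not CV-selected:
pruned5 <- prune.tree(tree.carseats, best = 5)
yhat5 <- predict(pruned5, newdata = Carseats[test, ])
mean((yhat5 - Carseats$Sales[test])^2)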
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
bag.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = ncol(Carseats) - 1, importance = TRUE)
yhat_bag <- predict(bag.carseats, newdata = Carseats[test, ])
mean((yhat_bag - Carseats$Sales[test])^2)
## [1] 2.657296
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 23.07909904 171.185734
## Income 2.82081527 94.079825
## Advertising 11.43295625 99.098941
## Population -3.92119532 59.818905
## Price 54.24314632 505.887016
## ShelveLoc 46.26912996 361.962753
## Age 14.24992212 159.740422
## Education -0.07662320 46.738585
## Urban 0.08530119 8.453749
## US 4.34349223 15.157608
Explanation: Applies bagging (a random forest with m = p = 10), computes the test MSE (2.66, a large improvement over the single tree's 4.92), and shows variable importance: Price and ShelveLoc are by far the most important predictors.
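The same importance information can be displayed graphically with randomForest's built-in plot:
varImpPlot(bag.carseats)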
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
rf.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train, importance = TRUE)
yhat_rf <- predict(rf.carseats, newdata = Carseats[test, ])
mean((yhat_rf - Carseats$Sales[test])^2)
## [1] 3.049406
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 12.9489323 158.48521
## Income 2.2754686 129.59400
## Advertising 8.9977589 111.94374
## Population -2.2513981 102.84599
## Price 33.4226950 391.60804
## ShelveLoc 34.0233545 290.56502
## Age 12.2185108 171.83302
## Education 0.2592124 71.65413
## Urban 1.1382113 14.76798
## US 4.1925335 33.75554
Explanation: Fits a random forest with the default m = p/3 and computes the test MSE (3.05), slightly worse than bagging on this split; Price and ShelveLoc again dominate the importance rankings. The sketch below examines the effect of m directly.
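A minimal sketch (an addition, not part of the output above) that sweeps m from 1 to p and records the test MSE:
p <- ncol(Carseats) - 1
rf_mse <- sapply(1:p, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = Carseats[test, ]) - Carseats$Sales[test])^2)
})
plot(1:p, rf_mse, type = "b", xlab = "mtry (m)", ylab = "Test MSE")
Consistent with the results above, where bagging (m = p) beat the default m = p/3, larger values of m tend to do better on this data set.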
(f) Now analyze the data using BART, and report your results.
# Sales is the first column of Carseats; drop it to form the predictor frames
xtrain <- Carseats[train, -1]
xtest <- Carseats[test, -1]
ytrain <- Carseats$Sales[train]
bart_model <- gbart(x.train = xtrain, y.train = ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 14, 200
## y1,yn: 2.781850, 1.091850
## x1,x[n*p]: 107.000000, 1.000000
## xp1,xp[np*p]: 111.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 63 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.273474,3,0.23074,7.57815
## *****sigma: 1.088371
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100) ... done 1000 (out of 1100)
## time: 5s
## trcnt,tecnt: 1000,1000
mean((bart_model$yhat.test.mean - Carseats$Sales[test])^2)
## [1] 1.438471
Explanation: Fits Bayesian additive regression trees (BART) and reports the test MSE (1.44), the lowest of all the Carseats methods tried here.
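gbart also records how often each variable is used across the ensemble; its varcount.mean component gives a rough importance ranking:
sort(bart_model$varcount.mean, decreasing = TRUE)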
11. This question uses the Caravan data set.
(a) Create a training set consisting of the first 1,000 observations,
and a test set consisting of the remaining observations.
train <- 1:1000
test <- -train
Explanation: The first 1,000 observations form the training set; the remaining 4,822 form the test set.
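Before fitting anything, note the heavy class imbalance in the response; it motivates the 20% probability threshold used in part (c):
mean(Caravan$Purchase == "Yes")  # about 0.06: only ~6% of customers purchase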
(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
# Set training and test indices
train <- 1:1000
test <- -train
# gbm's bernoulli family needs a 0/1 response
caravan_train <- Caravan[train, ]
caravan_train$Purchase <- as.numeric(Caravan$Purchase[train] == "Yes")  # 1 if "Yes", else 0
shrink_vals <- c(0.01, 0.05, 0.1, 0.2)
test_errors <- numeric(length(shrink_vals))
names(test_errors) <- paste0("λ = ", shrink_vals)
for (i in seq_along(shrink_vals)) {
  boost_model <- gbm(Purchase ~ ., data = caravan_train, distribution = "bernoulli",
                     n.trees = 1000, shrinkage = shrink_vals[i], verbose = FALSE)
  # Predicted purchase probabilities on the test set
  probs <- predict(boost_model, newdata = Caravan[test, ], n.trees = 1000, type = "response")
  # Predict "Yes" when the estimated probability exceeds 20%
  pred_labels <- ifelse(probs > 0.2, "Yes", "No")
  actual <- Caravan$Purchase[test]
  # Precision: of the people predicted to purchase, the fraction who do
  correct_yes <- mean(pred_labels == "Yes" & actual == "Yes")
  predicted_yes <- mean(pred_labels == "Yes")
  precision <- ifelse(predicted_yes == 0, 0, correct_yes / predicted_yes)
  test_errors[i] <- 1 - mean(pred_labels == actual)
  cat(sprintf("λ = %.2f | Test Error = %.3f | Precision for 'Yes' = %.3f\n",
              shrink_vals[i], test_errors[i], precision))
}
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
## variable 71: AVRAAUT has no variation.
## λ = 0.01 | Test Error = 0.078 | Precision for 'Yes' = 0.203
## λ = 0.05 | Test Error = 0.108 | Precision for 'Yes' = 0.139
## λ = 0.10 | Test Error = 0.114 | Precision for 'Yes' = 0.145
## λ = 0.20 | Test Error = 0.127 | Precision for 'Yes' = 0.132
Explanation: Fits boosted models across several shrinkage values and reports the test error and the precision for "Yes" at the 20% threshold; the smallest shrinkage, λ = 0.01, gives both the lowest test error (0.078) and the highest precision (0.203). The snippet below addresses the variable-importance part of the question.
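The question also asks which predictors appear most important. summary() on a gbm fit returns the relative-influence table; since after the loop boost_model holds the λ = 0.2 fit, refit at λ = 0.01 to rank predictors for the model of interest:
boost01 <- gbm(Purchase ~ ., data = caravan_train, distribution = "bernoulli",
               n.trees = 1000, shrinkage = 0.01, verbose = FALSE)
head(summary(boost01, plotit = FALSE), 10)  # top 10 predictors by relative influence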
(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
# Split train and test indices
train <- 1:1000
test <- -train
# Separate predictors and response (Purchase is column 86)
X <- Caravan[, -86]
Y <- Caravan$Purchase
# scale() turns zero-variance columns into NaN; drop any such columns afterwards
X_scaled <- scale(X)
X_scaled <- X_scaled[, colSums(is.na(X_scaled)) == 0]
# Final train/test matrices
train.X <- X_scaled[train, ]
test.X <- X_scaled[test, ]
train.Y <- Y[train]
test.Y <- Y[test]
# Run KNN for several values of k and record test-set accuracy
k_vals <- c(1, 3, 5, 10)
acc_knn <- sapply(k_vals, function(k) {
  pred <- knn(train.X, test.X, train.Y, k = k)
  mean(pred == test.Y)
})
names(acc_knn) <- paste0("k=", k_vals)
print(acc_knn)
## k=1 k=3 k=5 k=10
## 0.8890502 0.9253422 0.9367482 0.9400664
Explanation: Runs KNN for several values of k and computes overall test-set accuracy. Accuracy alone is misleading here because only about 6% of customers purchase, so always predicting "No" would score about 0.94, close to the best KNN accuracy above. The comparison the question asks for is the precision for "Yes": boosting with λ = 0.01 achieved about 0.20, i.e. roughly one in five people it flags actually purchases, more than triple the base rate. The sketch below forms the requested confusion matrices.
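Part (c) also asks for a confusion matrix from the boosting model and a comparison with logistic regression; a minimal sketch, reusing the boost01 fit from above and the same 20% threshold:
probs01 <- predict(boost01, newdata = Caravan[test, ], n.trees = 1000, type = "response")
pred01 <- ifelse(probs01 > 0.2, "Yes", "No")
table(Predicted = pred01, Actual = Caravan$Purchase[test])
# Logistic regression with the same threshold; expect a rank-deficiency
# warning from the zero-variance training columns noted earlier
glm_fit <- glm(Purchase ~ ., data = Caravan, family = binomial, subset = train)
glm_probs <- predict(glm_fit, newdata = Caravan[test, ], type = "response")
glm_pred <- ifelse(glm_probs > 0.2, "Yes", "No")
table(Predicted = glm_pred, Actual = Caravan$Purchase[test])
The fraction of predicted purchasers who actually purchase is the ("Yes", "Yes") cell divided by the "Yes" row total; for λ = 0.01 this matches the precision of about 0.20 printed above.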