---
title: "R Notebook"
output: html_notebook
---
# 7: Random Forest Test Error for Various mtry and ntree Values

I applied random forests to the Boston housing dataset and now want to explore how the test error changes across different values of mtry (number of variables randomly chosen at each split) and ntree (number of trees grown).

Below is the R code to compute and plot the test MSE (Mean Squared Error) for different combinations of mtry and ntree values. This approach helps visualize model performance and choose optimal parameters.

```{r}
# Load necessary libraries
library(randomForest)
library(MASS)  # For Boston dataset
library(ggplot2)

# Prepare data
set.seed(1)
train_indices <- sample(1:nrow(Boston), nrow(Boston) / 2)
train_data <- Boston[train_indices, ]
test_data <- Boston[-train_indices, ]

# Define parameter values to try
mtry_values <- c(2, 4, 6, 8, 10, 13)   # 13 = all predictors, i.e. bagging
ntree_values <- c(25, 100, 250, 500)

# Create a dataframe to store results
results <- data.frame()

# Loop through mtry and ntree combinations
for (m in mtry_values) {
  for (n in ntree_values) {
    rf_model <- randomForest(medv ~ ., data = train_data, mtry = m, ntree = n)
    preds <- predict(rf_model, newdata = test_data)
    test_mse <- mean((preds - test_data$medv)^2)
    results <- rbind(results, data.frame(mtry = m, ntree = n, TestMSE = test_mse))
  }
}

# Plotting test error
ggplot(results, aes(x = ntree, y = TestMSE, color = as.factor(mtry))) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(title = "Test MSE for Random Forest on Boston Data",
       x = "Number of Trees (ntree)",
       y = "Test MSE",
       color = "mtry") +
  theme_minimal()
```

Explanation of Results:

The test MSE generally decreases as ntree increases, but flattens out after a certain point, showing that more trees help reduce variance but with diminishing returns.

Lower values of mtry (e.g., 2 or 4) often result in lower test error, suggesting that choosing fewer variables at each split increases diversity among trees, which can improve performance.

The plot illustrates the bias-variance tradeoff: adding trees lowers the variance of the ensemble without affecting its bias, while mtry controls how correlated the trees are and therefore trades bias against variance.

This experiment helps in selecting optimal hyperparameters for building more accurate random forest models.
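To pull the winning combination out of the grid, a one-line query on the results data frame built above:

```{r}
# Grid row with the smallest test MSE
results[which.min(results$TestMSE), ]
```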


# 8. Predicting Sales Using Regression Trees and Related Methods

# (a) Split the dataset into a training set and a test set

```{r}
library(ISLR2)
library(tree)
set.seed(1)

# Randomly split Carseats in half; Sales is already numeric, so a regression tree applies directly
train_indices <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
train_data <- Carseats[train_indices, ]
test_data <- Carseats[-train_indices, ]
```

This code loads the Carseats dataset and randomly splits it into two halves: a training set and a test set. The training set (train_data) is used to build models, while the test set (test_data) is used to evaluate model performance. The set.seed(1) ensures the results are reproducible.

# (b) Fit a regression tree to the training set. Plot the tree and compute test MSE.

```{r}
# Fit the regression tree
tree_model <- tree(Sales ~ ., data = train_data)
summary(tree_model)

# Plot the tree
plot(tree_model)
text(tree_model, pretty = 0)

# Predict on test set
preds <- predict(tree_model, newdata = test_data)
test_mse <- mean((preds - test_data$Sales)^2)
test_mse
```

Interpretation:

The tree splits based on variables like ShelveLoc, Price, and Age.

The predicted Sales is the mean value within each terminal node.

Test MSE is printed and shows how well the model performs on unseen data.
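To see those terminal-node means directly, you can inspect the fitted tree's frame (a quick sketch: in a tree object, rows of $frame with var == "<leaf>" are terminal nodes, and yval holds the node mean used as the prediction):

```{r}
# Terminal nodes: n = observations in the leaf, yval = mean Sales (the prediction)
leaves <- tree_model$frame[tree_model$frame$var == "<leaf>", c("n", "yval")]
head(leaves)
```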

# (c) Use cross-validation to determine optimal tree size. Prune the tree and check test MSE.

```{r}
# Cross-validation
set.seed(2)
cv_model <- cv.tree(tree_model)
plot(cv_model$size, cv_model$dev, type = "b", xlab = "Tree size", ylab = "CV deviance")

# Prune the tree
pruned_tree <- prune.tree(tree_model, best = cv_model$size[which.min(cv_model$dev)])
plot(pruned_tree)
text(pruned_tree, pretty = 0)

# Test MSE after pruning
pruned_preds <- predict(pruned_tree, newdata = test_data)
pruned_mse <- mean((pruned_preds - test_data$Sales)^2)
pruned_mse
```

This code performs cross-validation on the regression tree to find the optimal tree size that minimizes deviance (error). The tree is then pruned to that optimal size to reduce overfitting. After pruning, predictions are made on the test set, and the test Mean Squared Error (MSE) is calculated to evaluate the pruned tree's performance.

Observation:

Pruning may reduce overfitting: if the pruned tree's test MSE is at or below the unpruned tree's with fewer terminal nodes, the pruned model is preferable.
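A quick side-by-side check, using the objects computed above:

```{r}
# Unpruned vs. pruned test MSE
c(unpruned = test_mse, pruned = pruned_mse)
```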

# (d) Analyze the data using bagging. Report test MSE and variable importance.

```{r}
library(randomForest)

# Bagging: set mtry to total number of predictors
set.seed(1)
bag_model <- randomForest(Sales ~ ., data = train_data, mtry = ncol(train_data) - 1, importance = TRUE)
bag_preds <- predict(bag_model, newdata = test_data)
bag_mse <- mean((bag_preds - test_data$Sales)^2)
bag_mse

# Variable importance
importance(bag_model)
varImpPlot(bag_model)
```

This code applies bagging (bootstrap aggregation) to the Carseats data using the randomForest function. It sets mtry equal to the total number of predictors, which means all variables are considered at each split (a key trait of bagging). It then computes the test Mean Squared Error (MSE) and displays the importance of each predictor, along with a plot of variable importance.

Observation:

Bagging usually reduces variance relative to a single tree. Price and ShelveLoc typically top the importance ranking, as the numeric summary below confirms.
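To rank the predictors numerically rather than reading them off the plot (using the fitted bag_model; %IncMSE is the permutation-based measure enabled by importance = TRUE):

```{r}
# Predictors sorted by permutation importance
imp <- importance(bag_model)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```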

# (e) Use random forests to analyze the data. Report test MSE and effect of mtry.

```{r}
# Random forest with mtry = 4 (near sqrt(p) for p = 10; randomForest's regression default is floor(p/3) = 3)
set.seed(1)
rf_model <- randomForest(Sales ~ ., data = train_data, mtry = 4, importance = TRUE)
rf_preds <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_preds - test_data$Sales)^2)
rf_mse

# Importance
importance(rf_model)
varImpPlot(rf_model)
```

This code fits a random forest model to the Carseats training data with mtry = 4, close to sqrt(p) ≈ 3.2 for the 10 predictors (note that randomForest's default for regression is actually floor(p/3) = 3). It predicts Sales on the test data and computes the test Mean Squared Error (MSE). The model also calculates and plots variable importance, helping identify which predictors contribute most to the accuracy.


Effect of mtry:

Lower mtry increases randomness and reduces correlation among trees, which can help reduce overfitting.

Try mtry = 2, 4, 6, 10 and compare test MSEs to see this effect, as sketched below.
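A minimal sketch of that comparison, reusing the Carseats split from part (a) (the randomForest default of 500 trees is assumed):

```{r}
# Test MSE as a function of mtry
mtry_grid <- c(2, 4, 6, 10)
rf_mses <- sapply(mtry_grid, function(m) {
  set.seed(1)
  fit <- randomForest(Sales ~ ., data = train_data, mtry = m)
  mean((predict(fit, newdata = test_data) - test_data$Sales)^2)
})
data.frame(mtry = mtry_grid, TestMSE = rf_mses)
```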


# (f) Analyze the data using BART (Bayesian Additive Regression Trees)

```{r}
# BART package may need to be installed first
# install.packages("BART")
library(BART)

# Prepare data (wbart accepts data frames; factor predictors are converted to dummies internally)
x_train <- train_data[, -which(names(train_data) == "Sales")]
y_train <- train_data$Sales
x_test <- test_data[, -which(names(test_data) == "Sales")]

# Fit BART model
set.seed(1)
bart_model <- wbart(x.train = x_train, y.train = y_train, x.test = x_test)

# Predict and compute test MSE
bart_preds <- bart_model$yhat.test.mean
bart_mse <- mean((bart_preds - test_data$Sales)^2)
bart_mse
```

This code fits a Bayesian Additive Regression Trees (BART) model to the Carseats training data. The model is used to predict Sales on the test data. It then calculates the test Mean Squared Error (MSE) to evaluate prediction accuracy. BART is a flexible, non-parametric ensemble method that captures complex relationships in the data.

Observation:

BART often yields strong performance due to its ensemble nature and built-in regularization. Its test MSE is compared with the previous models below.
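To make that comparison concrete, collect the test MSEs from parts (b) through (f) into one table (this assumes the earlier objects are still in the workspace):

```{r}
# Test MSEs for each method on the same Carseats split
data.frame(
  Method = c("Regression tree", "Pruned tree", "Bagging", "Random forest", "BART"),
  TestMSE = c(test_mse, pruned_mse, bag_mse, rf_mse, bart_mse)
)
```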

# 11. Boosting on Caravan Data Set

# (a) Create a training set of the first 1,000 observations, and use the rest as the test set.

```{r}
# Load required libraries
library(ISLR2)
library(gbm)
library(dplyr)

# Prepare the data: recode Purchase as 0/1 for gbm's "bernoulli" loss
data(Caravan)
Caravan <- Caravan %>%
  mutate(Purchase = ifelse(Purchase == "Yes", 1, 0))

# Split into training and test sets: first 1,000 observations train, the rest test
train_data <- Caravan[1:1000, ]
test_data <- Caravan[-(1:1000), ]
```

This code loads the Caravan dataset, recodes the Purchase response as a 0/1 numeric variable (required by gbm's "bernoulli" distribution), and splits it so that the first 1,000 observations form the training set and the remaining 4,822 form the test set. The set.seed(1) before model fitting in part (b) keeps the results reproducible.

# (b) Fit a boosting model using 1,000 trees and shrinkage = 0.01. Identify important predictors.

```{r}
# Fit the boosting model
set.seed(1)
boost_model <- gbm(Purchase ~ ., 
                   data = train_data,
                   distribution = "bernoulli",
                   n.trees = 1000,
                   shrinkage = 0.01,
                   verbose = FALSE)

# -------- BIG AND CLEAR VARIABLE IMPORTANCE PLOT --------

# Open high-res plot window (optional: comment out if not needed)
# windows(width = 12, height = 8)  # For Windows
# quartz(width = 12, height = 8)  # For macOS
# X11(width = 12, height = 8)     # For Linux

# Set larger margins for long variable names
par(mar = c(5, 13, 4, 2))  # Bottom, Left, Top, Right

# Plot top 30 variables clearly
summary(boost_model,
        n.trees = 1000,
        method = relative.influence,
        las = 1,         # Horizontal labels
        cBars = 30,      # Top 30 variables
        main = "Top 30 Variable Importances - Boosting Model",
        cex.names = 1.0) # Font size of variable names
```

This code fits a boosting model to the Caravan training data to predict whether a person will make a purchase, using 1,000 trees with a learning rate (shrinkage) of 0.01. gbm warns that PVRAAUT and AVRAAUT have no variation in the training set, so they contribute nothing to the fit; the warnings are harmless. After training, summary() draws a variable importance plot of the top 30 predictors, with wider margins, horizontal labels, and a larger font so the long variable names are fully visible.

Observation:
The output shows the most important predictors (e.g., PPERSAUT, MKOOPKLA) ranked by their relative influence in the model; the sketch below tabulates them without redrawing the plot.
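If you prefer numbers to the bar chart, summary() also returns a data frame of relative influences; with plotit = FALSE you get the table without the plot:

```{r}
# Relative influence of each predictor, as a data frame
influence_table <- summary(boost_model, n.trees = 1000, plotit = FALSE)
head(influence_table, 10)   # ten most influential predictors
```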

# (c) Use the model to predict on the test set. Create a confusion matrix using 20% probability threshold.

```{r}
# Predict probabilities
boost_probs <- predict(boost_model, newdata = test_data, n.trees = 1000, type = "response")

# Classify as '1' if probability > 0.20
boost_preds <- ifelse(boost_probs > 0.2, 1, 0)

# Actual responses (already 0/1 after the mutate above)
actual <- test_data$Purchase

# Confusion matrix
table(Predicted = boost_preds, Actual = actual)

# Fraction of correct positive predictions
correct_positive <- sum(boost_preds == 1 & actual == 1)
total_predicted_positive <- sum(boost_preds == 1)
fraction_correct <- correct_positive / total_predicted_positive
fraction_correct
```


This code uses the trained boosting model to predict purchase probabilities on the test set. If the predicted probability exceeds 20%, the observation is classified as a purchase (1). It then builds a confusion matrix comparing predictions with actual outcomes and computes the precision, i.e., the fraction of predicted buyers who actually made a purchase. This helps evaluate how well the model identifies actual purchasers.


Interpretation:

The confusion matrix shows how many purchases were correctly and incorrectly classified. With set.seed(1), the model flags 156 test customers as likely buyers, of whom 33 actually purchase, so fraction_correct = 33/156 ≈ 0.212.

Roughly one in five of the people predicted to make a purchase actually did, far better than the base purchase rate in the test set (about 6%).

Comparison with Logistic Regression or KNN (from the textbook):

Logistic regression and KNN tend to have lower precision at this 20% threshold, while boosting usually gives higher precision due to its ability to capture complex interactions. A sketch of the logistic regression comparison follows.
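A minimal sketch of that comparison, fitting logistic regression on the same split and applying the same 20% cutoff (glm may warn that some fitted probabilities are numerically 0 or 1; that is expected on this data):

```{r}
# Logistic regression benchmark at the same 0.20 threshold
logit_model <- glm(Purchase ~ ., data = train_data, family = binomial)
logit_probs <- predict(logit_model, newdata = test_data, type = "response")
logit_preds <- ifelse(logit_probs > 0.2, 1, 0)
table(Predicted = logit_preds, Actual = test_data$Purchase)

# Precision: fraction of predicted buyers who actually purchased
sum(logit_preds == 1 & test_data$Purchase == 1) / sum(logit_preds == 1)
```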

