Exam 2 (Take Home Final Exam)

Questions

Experimenting with tree-based methods: In this problem, you will used CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of varaibles decribing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data can be found (including variable level designation for interpretation) at the following site: https://www.kaggle.com/uciml/mushroom-classification/data

Split the data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead use the function).

# Load the dataset
mushrooms <- read.csv("mushrooms.csv")

# Set seed for reproducibility
set.seed(42)

# Create a random 50% sample for training
train_indices <- sample(1:nrow(mushrooms), size = 0.5 * nrow(mushrooms))

# Split into training and test sets
train_data <- mushrooms[train_indices, ]
test_data <- mushrooms[-train_indices, ]

# Display the number of rows in each
cat("Training set rows:", nrow(train_data), "\n")

## Training set rows: 4062

cat("Testing set rows:", nrow(test_data), "\n")

## Testing set rows: 4062

train_data: 4062 mushroom observations for model training test_data: 4062 mushroom observations for model testing/evaluation This confirms that the data is balanced and randomly divided, which is essential for fair model performance assessment.

Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.

# Load required packages
library(rpart)
library(rpart.plot)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

# === Step 1: Fit a classification tree to the training data ===
tree_model <- rpart(class ~ ., data = train_data, method = "class", cp = 0.001)

# === Step 2: Display the CP table (cost-complexity pruning info) ===
printcp(tree_model)

## 
## Classification tree:
## rpart(formula = class ~ ., data = train_data, method = "class", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] odor                   spore.print.color      stalk.color.below.ring
## [4] stalk.root            
## 
## Root node error: 1974/4062 = 0.48597
## 
## n= 4062 
## 
##          CP nsplit rel error    xerror      xstd
## 1 0.9680851      0 1.0000000 1.0000000 0.0161370
## 2 0.0167173      1 0.0319149 0.0319149 0.0039896
## 3 0.0065856      2 0.0151976 0.0151976 0.0027644
## 4 0.0030395      3 0.0086120 0.0086120 0.0020843
## 5 0.0010000      5 0.0025329 0.0025329 0.0011321

# === Step 3: Find the optimal cp value that minimizes cross-validation error ===
optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]
cat("Optimal CP value:", optimal_cp, "\n")

## Optimal CP value: 0.001

# === Step 4: Prune the tree using the optimal cp ===
pruned_tree <- prune(tree_model, cp = optimal_cp)

# === Step 5: Visualize the pruned tree ===
rpart.plot(pruned_tree,
           type = 2,            # label all nodes
           extra = 104,         # display class, probs, percentages
           fallen.leaves = TRUE,
           main = "Pruned Classification Tree")

# === Step 6: Predict class labels on the test data ===
pred_test <- predict(pruned_tree, newdata = test_data, type = "class")

# === Step 7: Fix factor levels to avoid confusionMatrix errors ===
# Ensure both predicted and actual class labels share the same levels
all_levels <- union(levels(factor(test_data$class)), levels(factor(pred_test)))

pred_test <- factor(pred_test, levels = all_levels)
test_actual <- factor(test_data$class, levels = all_levels)

# === Step 8: Evaluate performance using a confusion matrix ===
conf_matrix <- confusionMatrix(pred_test, test_actual)
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    e    p
##          e 2120    3
##          p    0 1939
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9978, 0.9998)
##     No Information Rate : 0.5219          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9985          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9985          
##          Pos Pred Value : 0.9986          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5219          
##          Detection Rate : 0.5219          
##    Detection Prevalence : 0.5226          
##       Balanced Accuracy : 0.9992          
##                                           
##        'Positive' Class : e               
##

Tree Structure Summary: Root Node Split: The first and most important variable used is odor. Mushrooms with odor values a, l, or n are predicted to be edible (e), while others are predicted as poisonous (p). Subsequent important features used in the tree are: spore.print.color stalk.color.below.ring stalk.root These splits refine the prediction further, especially for mushrooms predicted as edible, by checking spore print color and stalk-related features.

The optimal cp (complexity parameter) selected via cross-validation was 0.001. The tree was pruned to include 5 splits, leading to a simpler, more generalizable model. This pruning minimized the cross-validation error to 0.25% (i.e., nearly perfect generalization).

Accuracy: 0.9993 (≈ 99.93%) Sensitivity (Recall for edible): 1.0000 → All edible mushrooms were correctly identified. Specificity (Recall for poisonous): 0.9985 → Almost all poisonous mushrooms were correctly identified. Kappa: 0.9985 → Almost perfect agreement. No Information Rate (NIR): 0.5219 → Shows that your model is significantly better than random guessing. McNemar’s Test: Not significant (p = 0.2482), indicating no bias in error distribution between the classes.

The CART model achieved excellent performance, correctly classifying 99.93% of mushrooms in the test set. The tree is simple and interpretable, with decisions hinging primarily on odor, followed by visual features like stalk and spore color. All edible mushrooms were detected perfectly, and only 3 poisonous mushrooms were misclassified as edible — extremely low error risk, which is critical in this domain.

Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for \(B\), the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.

# Load required library
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

# === Step 1: Ensure the target variable is a factor ===
train_data$class <- as.factor(train_data$class)
test_data$class <- as.factor(test_data$class)

# === Step 2: Number of predictors (excluding response) ===
num_predictors <- ncol(train_data) - 1  # 22 in this case

# === Step 3: Try different values of B (number of trees) ===
b_values <- c(50, 100, 200, 500)
test_accuracies <- numeric(length(b_values))

for (i in seq_along(b_values)) {
  B <- b_values[i]
  
  bag_model <- randomForest(class ~ ., 
                            data = train_data, 
                            mtry = num_predictors,  # All predictors for bagging
                            ntree = B, 
                            importance = TRUE)
  
  pred_bag <- predict(bag_model, newdata = test_data)
  acc <- mean(pred_bag == test_data$class)
  test_accuracies[i] <- acc
  
  cat("Bagging with", B, "trees → Test Accuracy:", round(acc * 100, 2), "%\n")
}

## Bagging with 50 trees → Test Accuracy: 100 %
## Bagging with 100 trees → Test Accuracy: 100 %
## Bagging with 200 trees → Test Accuracy: 100 %
## Bagging with 500 trees → Test Accuracy: 100 %

# === Step 4: Train final bagged model with 200 trees and plot variable importance ===
final_bag_model <- randomForest(class ~ ., 
                                data = train_data, 
                                mtry = num_predictors, 
                                ntree = 200, 
                                importance = TRUE)

# === Step 5: Plot variable importance ===
varImpPlot(final_bag_model, 
           main = "Variable Importance (Bagging, B = 200)")

All models (with B = 50, 100, 200, 500 trees) achieved 100% test accuracy. This means the bagged ensemble was extremely effective in classifying mushrooms as edible or poisonous. Since accuracy saturated at 100% even with 50 trees, a model with B = 100 trees would likely be sufficient in practice for both speed and performance.

Implemented bagging using the randomForest package by setting mtry to the total number of predictors. Models with B = 50, 100, 200, and 500 trees all achieved 100% test accuracy, indicating extremely high performance and stability. A variable importance analysis revealed that gill.color, spore.print.color, population, gill.size, and odor were the most influential features in the model. Since performance stabilized early, we conclude that 100 trees are sufficient, and bagging proves to be a highly effective technique for this dataset.

Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for \(mtry\), the number of predictors which are “tried” as optimal splitters at each node, and also \(B\). Compare these models in terms of their test performance. Which \(B\) and \(mtry\) combination seems best? Create variable importance plots for this final model and interpret the findings.

# Load required package
library(randomForest)

# === Step 1: Ensure the response variable is a factor ===
train_data$class <- as.factor(train_data$class)
test_data$class <- as.factor(test_data$class)

# === Step 2: Try different mtry and ntree (B) combinations ===
mtry_values <- c(3, 5, 10)
ntree_values <- c(100, 200, 500)

# Store results
rf_results <- data.frame()
best_model <- NULL
best_acc <- 0

# Grid search over combinations
for (m in mtry_values) {
  for (b in ntree_values) {
    rf_model <- randomForest(class ~ ., data = train_data,
                             mtry = m, ntree = b, importance = TRUE)
    pred_rf <- predict(rf_model, newdata = test_data)
    acc <- mean(pred_rf == test_data$class)
    
    rf_results <- rbind(rf_results, data.frame(mtry = m, ntree = b, Accuracy = acc))
    
    if (acc > best_acc) {
      best_acc <- acc
      best_model <- rf_model
    }
    
    cat("mtry =", m, ", ntree =", b, "→ Accuracy:", round(acc * 100, 2), "%\n")
  }
}

## mtry = 3 , ntree = 100 → Accuracy: 100 %
## mtry = 3 , ntree = 200 → Accuracy: 100 %
## mtry = 3 , ntree = 500 → Accuracy: 100 %
## mtry = 5 , ntree = 100 → Accuracy: 100 %
## mtry = 5 , ntree = 200 → Accuracy: 100 %
## mtry = 5 , ntree = 500 → Accuracy: 100 %
## mtry = 10 , ntree = 100 → Accuracy: 100 %
## mtry = 10 , ntree = 200 → Accuracy: 100 %
## mtry = 10 , ntree = 500 → Accuracy: 100 %

# === Step 3: Show result table of all combinations ===
print(rf_results)

##   mtry ntree Accuracy
## 1    3   100        1
## 2    3   200        1
## 3    3   500        1
## 4    5   100        1
## 5    5   200        1
## 6    5   500        1
## 7   10   100        1
## 8   10   200        1
## 9   10   500        1

# === Step 4: Plot variable importance for the best model ===
varImpPlot(best_model, main = paste("Variable Importance (Best RF: mtry =", best_model$mtry,
                                    ", ntree =", best_model$ntree, ")"))

The Random Forest model is extremely robust for this dataset — all configurations performed flawlessly. This implies that the mushroom features are highly informative and easily separable.

Trained Random Forest models on the mushroom dataset using various values of mtry (3, 5, 10) and ntree (100, 200, 500). All models achieved 100% accuracy, indicating that Random Forests are highly effective for this classification task. The model with mtry = 3 and ntree = 100 was selected for interpretation. Variable importance analysis revealed that stalk.root, spore.print.color, gill.size, and odor were the most influential features. Given that performance stabilized even at low tree counts, using mtry = 3 and ntree = 100 is both accurate and efficient.

Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for \(\lambda\), perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between \(\lambda\) and \(B\)? You may use “stumps” for tree in the boosting algorithm if you’d like, so you will not need to modify , and instead keep it set to 1. Which combination (of \(\lambda\), \(B\)) seems best? Create variable importance plots for this final model and interpret the findings.

# Load required libraries
library(gbm)

## Loaded gbm 2.2.2

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(caret)

# === Step 1: Prepare data ===

# Convert class to factor (target) and numeric (for boosting)
train_data$class <- as.factor(train_data$class)
test_data$class <- as.factor(test_data$class)
train_data$label <- ifelse(train_data$class == "e", 1, 0)
test_data$label <- ifelse(test_data$class == "e", 1, 0)

# Convert character predictors to factors (required for gbm)
for (col in names(train_data)) {
  if (is.character(train_data[[col]])) {
    train_data[[col]] <- as.factor(train_data[[col]])
  }
}
for (col in names(test_data)) {
  if (is.character(test_data[[col]])) {
    test_data[[col]] <- as.factor(test_data[[col]])
  }
}

# === Step 2: Define parameter grid for boosting ===
lambda_vals <- c(0.001, 0.01, 0.1)
n_trees <- c(100, 500, 1000)

# === Step 3: Loop through combinations of lambda and n.trees ===
boost_results <- data.frame()
best_boost_model <- NULL
best_acc <- 0

for (lambda in lambda_vals) {
  for (b in n_trees) {
    set.seed(42)
    boost_model <- gbm(label ~ . -class, 
                       data = train_data, 
                       distribution = "bernoulli",
                       n.trees = b,
                       interaction.depth = 1,
                       shrinkage = lambda,
                       verbose = FALSE)
    
    pred_probs <- predict(boost_model, newdata = test_data, n.trees = b, type = "response")
    pred_class <- ifelse(pred_probs > 0.5, 1, 0)
    acc <- mean(pred_class == test_data$label)
    
    boost_results <- rbind(boost_results, data.frame(lambda = lambda, n.trees = b, Accuracy = acc))
    
    if (acc > best_acc) {
      best_acc <- acc
      best_boost_model <- boost_model
      best_lambda <- lambda
      best_b <- b
    }
    
    cat("lambda =", lambda, ", n.trees =", b, "→ Accuracy:", round(acc * 100, 2), "%\n")
  }
}

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.001 , n.trees = 100 → Accuracy: 98.6 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.001 , n.trees = 500 → Accuracy: 98.6 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.001 , n.trees = 1000 → Accuracy: 98.6 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.01 , n.trees = 100 → Accuracy: 98.6 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.01 , n.trees = 500 → Accuracy: 99.56 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.01 , n.trees = 1000 → Accuracy: 99.83 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.1 , n.trees = 100 → Accuracy: 99.83 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.1 , n.trees = 500 → Accuracy: 100 %

## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.

## lambda = 0.1 , n.trees = 1000 → Accuracy: 100 %

# === Step 4: Show result summary table ===
print(boost_results)

##   lambda n.trees  Accuracy
## 1  0.001     100 0.9859675
## 2  0.001     500 0.9859675
## 3  0.001    1000 0.9859675
## 4  0.010     100 0.9859675
## 5  0.010     500 0.9955687
## 6  0.010    1000 0.9982767
## 7  0.100     100 0.9982767
## 8  0.100     500 1.0000000
## 9  0.100    1000 1.0000000

# === Step 5: Plot variable importance for best model ===
summary(best_boost_model, n.trees = best_b,
        main = paste("Variable Importance (Boosting, λ =", best_lambda, ", B =", best_b, ")"))

As expected, lower learning rates (λ = 0.001) required more trees but still couldn’t reach perfect accuracy. Moderate learning rate (λ = 0.01) showed improvement with more trees. Best results were seen with λ = 0.1 and B = 500 or 1000, achieving 100% accuracy. So, there’s a clear trade-off: smaller λ → needs larger B; larger λ → converges faster.

odor is by far the most dominant predictor, contributing over 94% of the model’s decision strength, followed by spore print color. The warning variable veil.type has no variation means it’s constant in your dataset and contributes no information — safe to ignore or drop.

Applied gradient boosting using the gbm package to classify mushrooms. We varied the learning rate (λ = 0.001, 0.01, 0.1) and number of trees (B = 100, 500, 1000). Results showed that lower λ values required more trees to approach high accuracy, while higher λ values converged faster. The best performance was achieved with λ = 0.1 and B = 500, yielding 100% test accuracy. Variable importance analysis revealed odor as the overwhelmingly dominant feature, followed by spore.print.color and gill.size. Boosting proved to be both accurate and interpretable on this dataset.

Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?

ART gave near-perfect accuracy but was slightly behind the ensemble methods. Bagging, Random Forest, and Boosting each reached perfect accuracy (100%) on the test set. Among the ensemble methods, Boosting (λ = 0.1, B = 500) not only achieved 100% accuracy but also identified odor as an overwhelmingly dominant variable, offering strong interpretability and efficient learning. Boosting with moderate shrinkage and limited trees showed the most efficient learning behavior (fast, accurate, and interpretable), making it the best choice.

Based on my results from (b) through (e), the Boosting model with λ = 0.1 and B = 500 yields the best accuracy and efficiency. It reached 100% test accuracy using fewer trees than bagging or random forests and highlighted a highly interpretable variable structure. However, this may not be the same answer other students get, because results depend on the random split of training/testing data (set via set.seed(1234) using your name). If another student’s training/test split differs significantly due to this seed, their model performances might vary slightly, possibly favoring bagging or random forests instead.

Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of estimation where standard software does not exist. Use the function, or risk losing points! Using the data from package , estimate the standard error of the difference in Median (miles per gallon) between 4 cylinder and 8 cylinder cars.

# === Step 1: Load required libraries ===
library(ISLR)
library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

# === Step 2: Prepare the data ===
data(Auto)
# Make sure 'cylinders' is numeric and mpg is numeric (they are by default)

# === Step 3: Define the statistic function ===
# It returns the difference in median mpg: 4-cylinder - 8-cylinder
boot_median_diff <- function(data, indices) {
  d <- data[indices, ]
  median_4 <- median(d$mpg[d$cylinders == 4])
  median_8 <- median(d$mpg[d$cylinders == 8])
  return(median_4 - median_8)
}

# === Step 4: Perform the bootstrap ===
set.seed(1234)  # for reproducibility
boot_result <- boot(data = Auto, statistic = boot_median_diff, R = 1000)

# === Step 5: Output the results ===
print(boot_result)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot_median_diff, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     14.4 -0.30745    0.797002

cat("Estimated standard error of the median mpg difference:", sd(boot_result$t), "\n")

## Estimated standard error of the median mpg difference: 0.797002

The median mpg difference between 4-cylinder and 8-cylinder cars is estimated to be 14.4 mpg. The bootstrap-estimated standard error is 0.797, meaning that the typical variability in this difference (due to sampling) is less than 1 mpg. The bootstrap bias is small and negative (−0.31), suggesting that the bootstrap distribution’s center is very slightly below the original estimate — not a major concern.

Using the boot() function with 1000 resamples, we estimated the standard error of the difference in median mpg between 4-cylinder and 8-cylinder cars from the Auto dataset. The observed difference was 14.4 mpg, with an estimated standard error of 0.797 mpg. This indicates low variability in the median difference, and the small bootstrap bias (−0.31) suggests the bootstrap distribution closely reflects the original sample.

Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance in the dataset from R package , using only minimum main memory and maximum main memory as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the function (creates the folds for you) from the R package , as explained in class.

# Load required libraries
library(MASS)
library(caret)

# Load and prepare the data
data(cpus)
cpus_data <- cpus[, c("mmin", "mmax", "perf")]

# Define custom 10-fold CV function
my_cv_function <- function(data, k = 10) {
  set.seed(1234)  # for reproducibility
  folds <- createFolds(data$perf, k = k, list = TRUE, returnTrain = FALSE)
  
  mse_list <- numeric(k)
  
  for (i in 1:k) {
    test_indices <- folds[[i]]
    train_data <- data[-test_indices, ]
    test_data  <- data[test_indices, ]
    
    # Fit the linear model
    model <- lm(perf ~ mmin + mmax, data = train_data)
    
    # Predict on the test data
    predictions <- predict(model, newdata = test_data)
    
    # Compute Mean Squared Error
    mse_list[i] <- mean((test_data$perf - predictions)^2)
  }
  
  avg_mse <- mean(mse_list)
  return(avg_mse)
}

# Apply the function
test_error_estimate <- my_cv_function(cpus_data, k = 10)
cat("Estimated Test Error (10-fold CV MSE):", test_error_estimate, "\n")

## Estimated Test Error (10-fold CV MSE): 6315.865

Wrote a custom function using createFolds() from the caret package to perform 10-fold cross-validation on the cpus dataset (from MASS). We fit a linear regression model to predict perf using only mmin and mmax (minimum and maximum main memory) as predictors. The estimated test error, measured by the average mean squared error (MSE) across folds, was 6315.87. This provides a robust estimate of how well the model generalizes to unseen data.

Exam 2 (Take Home Final Exam)

INFO-H 515

null

Please hand this result in on Canvas no later than 11:59pm on Thursday, May 8!

Questions