Please hand this result in on Canvas no later than 11:59pm on Thursday, May 8!

Questions

  1. Experimenting with tree-based methods: In this problem, you will used CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of varaibles decribing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data can be found (including variable level designation for interpretation) at the following site: https://www.kaggle.com/uciml/mushroom-classification/data
  1. Split the data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead use the function).
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
# Load data
mushrooms <- read.csv("mushrooms.csv")

# Set seed and split 50/50
set.seed(1234)
train_index <- sample(1:nrow(mushrooms), nrow(mushrooms)/2)
train <- mushrooms[train_index, ]
test <- mushrooms[-train_index, ]
  1. Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.
library(rpart)
library(rpart.plot)

# Train full tree
tree_model <- rpart(class ~ ., data = train, method = "class", cp = 0.001)

# Use cross-validation to find optimal cp
printcp(tree_model)
## 
## Classification tree:
## rpart(formula = class ~ ., data = train, method = "class", cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] odor                   spore.print.color      stalk.color.below.ring
## [4] stalk.root            
## 
## Root node error: 1953/4062 = 0.4808
## 
## n= 4062 
## 
##          CP nsplit rel error    xerror      xstd
## 1 0.9703021      0 1.0000000 1.0000000 0.0163049
## 2 0.0163850      1 0.0296979 0.0296979 0.0038716
## 3 0.0061444      2 0.0133129 0.0133129 0.0026025
## 4 0.0023041      3 0.0071685 0.0071685 0.0019125
## 5 0.0010000      5 0.0025602 0.0025602 0.0011442
best_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]

# Prune tree
pruned_tree <- prune(tree_model, cp = best_cp)

# Visualize
rpart.plot(pruned_tree, type = 3, extra = 104)

# Predict on test data
tree_pred <- predict(pruned_tree, newdata = test, type = "class")
tree_accuracy <- mean(tree_pred == test$class)
tree_accuracy
## [1] 0.9992614

interpretting results

  • First split: spore.print.color is the most important feature for the initial split.
  • Second split: If certain spore colors (like b, h, k, n, o, u, w, y) → check stalk.color.below.ring.
  • Other splits: odor is used when spore print color is not among certain types. stalk.root further refines when stalk.color.below.ring = e, g, o, p, w but n is present.
  • Prediction outcomes: If odor is c, f, m, p, s, y, it is poisonous (“p”). If odor is a, l, n, it depends further on spore.print.color and stalk.color.below.ring.
  • Terminal nodes: You have 6 final leaves, each predicting either edible (“e”) or poisonous (“p”).
  1. Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for \(B\), the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Ensure the target is a factor
train$class <- as.factor(train$class)
test$class <- as.factor(test$class)

# Bagging model (mtry = all predictors)
bag_model <- randomForest(class ~ ., data = train, mtry = ncol(train) - 1, ntree = 100, importance = TRUE)

# Predict and evaluate
bag_pred <- predict(bag_model, newdata = test)
bag_accuracy <- mean(bag_pred == test$class)
cat("Bagging Accuracy (100 trees):", bag_accuracy, "\\n")
## Bagging Accuracy (100 trees): 1 \n
# Try different numbers of trees
for (b in c(50, 100, 200, 500)) {
  model <- randomForest(class ~ ., data = train, mtry = ncol(train) - 1, ntree = b)
  acc <- mean(predict(model, test) == test$class)
  cat("Bagging - Trees:", b, "- Accuracy:", acc, "\\n")
}
## Bagging - Trees: 50 - Accuracy: 1 \nBagging - Trees: 100 - Accuracy: 1 \nBagging - Trees: 200 - Accuracy: 1 \nBagging - Trees: 500 - Accuracy: 1 \n
# Variable importance plot
varImpPlot(bag_model)

results

  • We built a bagging model to predict whether mushrooms are edible or poisonous. Based on model performance, we observed that even with 50 trees, the model achieved 100% test accuracy. Increasing the number of trees to 100, 200, or 500 did not improve the model’s performance further, suggesting that around 50 trees are sufficient for this dataset.
  • The variable importance plot shows that gill.color, spore.print.color, population, and gill.size are the most important predictors. In particular, gill.color has the highest Mean Decrease in Accuracy and Mean Decrease in Gini, indicating it plays a major role in classifying mushrooms correctly. spore.print.color and population are also crucial features according to both metrics.
  • Thus, bagging performs exceptionally well on this data, with stable performance across different numbers of trees and a clear identification of the most predictive variables.
  1. Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for \(mtry\), the number of predictors which are “tried” as optimal splitters at each node, and also \(B\). Compare these models in terms of their test performance. Which \(B\) and \(mtry\) combination seems best? Create variable importance plots for this final model and interpret the findings.
# Try different mtry values
results <- data.frame()
for (m in c(2, 4, 6, 8)) {
  rf_model <- randomForest(class ~ ., data = train, mtry = m, ntree = 200)
  acc <- mean(predict(rf_model, test) == test$class)
  results <- rbind(results, data.frame(mtry = m, accuracy = acc))
}
print(results)
##   mtry accuracy
## 1    2        1
## 2    4        1
## 3    6        1
## 4    8        1
# Final model
final_rf <- randomForest(class ~ ., data = train, mtry = 4, ntree = 200, importance = TRUE)
varImpPlot(final_rf)

results

  • Test performance is the same (100% accuracy) across all mtry values, indicating the model is not very sensitive to this parameter on this dataset.
  • Since accuracy is perfect for all, you might choose the lowest-complexity model (i.e., with smaller mtry) to avoid overfitting or reduce computation time.
  • Any of the combinations yield perfect accuracy.Conventionally, mtry = √p ≈ √22 ≈ 4.7 → choose mtry = 4, with ntree = 200.
  • Variable Importance Interpretation:
    • From the random forest variable importance plots based on Mean Decrease in Accuracy and Mean Decrease in Gini, we observe that:
    • odor is by far the most important predictor. It contributes the most to both splitting purity and prediction accuracy. Certain odors (such as foul or fishy smells) are likely strong indicators of poisonous mushrooms.
    • spore-print color, gill size, and gill color are also highly important features, suggesting that physical characteristics of spores and gills are critical in determining edibility.
    • Other notable predictors include stalk root, population, and habitat, although their contributions are smaller compared to odor.
    • Variables like veil type, veil color, and gill attachment show minimal importance and likely have little predictive power in this classification problem.
  1. Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for \(\lambda\), perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between \(\lambda\) and \(B\)? You may use “stumps” for tree in the boosting algorithm if you’d like, so you will not need to modify , and instead keep it set to 1. Which combination (of \(\lambda\), \(B\)) seems best? Create variable importance plots for this final model and interpret the findings.
# Load libraries
library(tidyverse)
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Load and fix data
mushrooms <- read.csv("mushrooms.csv")

# Convert character columns to factors BEFORE splitting
mushrooms[] <- lapply(mushrooms, function(x) if(is.character(x)) as.factor(x) else x)

# Split 50/50 with seed
set.seed(1234)
train_index <- sample(1:nrow(mushrooms), nrow(mushrooms)/2)
train <- mushrooms[train_index, ]
test <- mushrooms[-train_index, ]

# Convert the response 'class' to numeric binary (poisonous = 1, edible = 0)
train$class <- ifelse(train$class == "p", 1, 0)
test$class <- ifelse(test$class == "p", 1, 0)

# Try different shrinkage (learning rates)
for (lr in c(0.001, 0.01, 0.1)) {
  boost_model <- gbm(class ~ ., 
                     data = train,
                     distribution = "bernoulli",
                     n.trees = 500,
                     interaction.depth = 1, 
                     shrinkage = lr,
                     verbose = FALSE)
  
  # Predict
  pred_prob <- predict(boost_model, newdata = test, n.trees = 500, type = "response")
  pred_class <- ifelse(pred_prob > 0.5, 1, 0)
  
  # Accuracy
  cat("Learning rate:", lr, "- Accuracy:", mean(pred_class == test$class), "\n")
}
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
## Learning rate: 0.001 - Accuracy: 0.9847366
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
## Learning rate: 0.01 - Accuracy: 0.9945839
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
## Learning rate: 0.1 - Accuracy: 1
# Fit final boosting model
final_boost <- gbm(class ~ ., 
                   data = train,
                   distribution = "bernoulli",
                   n.trees = 500,
                   interaction.depth = 1,
                   shrinkage = 0.1,
                   verbose = FALSE)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
# Variable importance plot
summary(final_boost)

results

  • When λ is small (e.g., 0.001 or 0.01): Each tree makes a small adjustment.
  • When λ is larger (e.g., 0.1):Each tree has more influence.
  • The best combination is λ = 0.1 and B = 500. This gives perfect prediction accuracy on the test set.
  • From boosting model’s plot:
    • Odor dominates with ~95% of the model’s relative influence.
    • Spore print color comes next with a small contribution.
    • Other features have very minor roles.
    • Many variables had no useful impact and were effectively ignored.
  • Boosting with λ = 0.1 and 500 trees focuses almost entirely on a few very important predictors, particularly odor, to achieve highly accurate mushroom classification.
  1. Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?
  • Bagging, Random Forest, and Boosting have exact classification accuracy on the test set.
  • However: Boosting with λ = 0.1 focuses even more precisely on the most influential variables (especially odor), potentially making it slightly more efficient or simpler for interpretation. Random forests are also a very robust and stable choice, especially because they are less sensitive to overfitting. Thus, both random forests and boosting could reasonably be considered the best models here.
  • Given that the seed was correctly set using set.seed(1234), it is likely that other students would obtain the same conclusion, since the mushroom dataset has strong, easily separable features and is well-behaved for tree-based learning algorithms.
  1. Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of estimation where standard software does not exist. Use the function, or risk losing points!

Using the data from package , estimate the standard error of the difference in Median (miles per gallon) between 4 cylinder and 8 cylinder cars.

# Load libraries
library(ISLR)
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
# Load the data
data(Auto)

# Define a bootstrap function
boot_median_diff <- function(data, indices) {
  d <- data[indices, ]  # resample data
  median_4cyl <- median(d$mpg[d$cylinders == 4])
  median_8cyl <- median(d$mpg[d$cylinders == 8])
  return(median_4cyl - median_8cyl)
}

# Apply bootstrapping
set.seed(1234)
boot_results <- boot(data = Auto, statistic = boot_median_diff, R = 1000)

# View the bootstrap standard error
print(boot_results)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot_median_diff, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     14.4 -0.30745    0.797002
cat("Bootstrap estimated standard error:", sd(boot_results$t), "\n")
## Bootstrap estimated standard error: 0.797002

results

  • Using 1000 bootstrap resamples, the estimated difference in the median miles per gallon (mpg) between 4-cylinder and 8-cylinder cars is 14.4 mpg. The bootstrap estimate of the standard error of this difference is approximately 0.797 mpg.
  • This means that the sampling variability of the estimated median difference is small. The small standard error suggests that the median mpg difference between 4-cylinder and 8-cylinder cars is estimated with good precision.
  1. Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance in the dataset from R package , using only minimum main memory and maximum main memory as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the function (creates the folds for you) from the R package , as explained in class.
# Load libraries
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)

# Load the data
data(cpus)

# Define the custom 10-fold cross-validation function
my_cv_function <- function(data, response, predictors, k = 10) {
  # Create folds
  set.seed(1234)  # for reproducibility
  folds <- createFolds(data[[response]], k = k, list = TRUE, returnTrain = FALSE)
  
  # Initialize vector to store test errors
  test_errors <- numeric(length = k)
  
  # Loop over each fold
  for (i in 1:k) {
    test_idx <- folds[[i]]
    train_data <- data[-test_idx, ]
    test_data <- data[test_idx, ]
    
    # Fit linear model on training data
    formula <- as.formula(
      paste(response, "~", paste(predictors, collapse = " + "))
    )
    model <- lm(formula, data = train_data)
    
    # Predict on test data
    predictions <- predict(model, newdata = test_data)
    
    # Calculate mean squared error (MSE) on test data
    mse <- mean((predictions - test_data[[response]])^2)
    
    # Store the error
    test_errors[i] <- mse
  }
  
  # Return average test error
  return(mean(test_errors))
}

# Use the function
# Predicting perf using mmin and mmax
cv_error <- my_cv_function(cpus, response = "perf", predictors = c("mmin", "mmax"), k = 10)

# Print result
cat("Estimated test error (10-fold CV):", cv_error, "\n")
## Estimated test error (10-fold CV): 6315.865

results

  • Using a custom 10-fold cross-validation procedure with the createFolds() function from the caret package, we estimated the test error for predicting CPU performance using minimum main memory and maximum main memory as predictors. The resulting estimated test error is approximately 6315.87.