Please hand this result in on Canvas no later than 11:59pm on Thursday, May 8!

  • Do not work in groups!

  • Do not discuss the exam with other students, or share any ideas or thoughts regarding the exam. This is academic misconduct, and will not be tolerated. Your presented solution must be your own work!

  • Must submit compiled RMarkdown file in PDF/HTML format.

  • If you cannot compile the RMarkdown file, please send an email to me for help before Thursday, May 8!

Question 1

Experimenting with tree-based methods: In this problem, you will used CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of varaibles decribing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data can be found (including variable level designation for interpretation) at the following site: https://www.kaggle.com/uciml/mushroom-classification/data

Read in the Mushroom Dataset from the Google Sheets public URL.

url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSP2UBc4FC9EgHTeQl8UtFy9PHP1J1pox-ljfMClb21UUf4tWq2PiVrdffsSvXqrxW6fNJFceXgr_TN/pub?output=csv"
mushroom <- read.csv(url)
dim(mushroom)
## [1] 8124   23
head(mushroom)
##   class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1     p         x           s         n       t    p               f
## 2     e         x           s         y       t    a               f
## 3     e         b           s         w       t    l               f
## 4     p         x           y         w       t    p               f
## 5     e         x           s         g       f    n               f
## 6     e         x           y         y       t    a               f
##   gill.spacing gill.size gill.color stalk.shape stalk.root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore.print.color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g

0.0.1 Question 1: Experimenting with tree-based methods

This analysis uses CART, bagging, random forests, and boosting to predict whether a mushroom is edible or poisonous based on its features.

0.0.1.1 a) Train-test Split

Split the data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead use the function).

mushroom <- mushroom %>% mutate_all(as.factor)
train_idx <- sample(1:nrow(mushroom), nrow(mushroom)/2)
train_data <- mushroom[train_idx, ]
test_data <- mushroom[-train_idx, ]

Explanation: The dataset is randomly split into 50% training and 50% testing to avoid biased sampling. All variables are treated as factors to match classification assumptions.


0.0.1.2 b) CART with Cost Complexity Pruning

Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.

tree_model <- tree(class ~ ., data = train_data)
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)

# Plot CV results
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = "Tree Size", ylab = "CV Misclassification Error", main = "Cost-Complexity Pruning")

# Prune tree to optimal size
best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.misclass(tree_model, best = best_size)

# Visualize the pruned tree
plot(pruned_tree)
text(pruned_tree, pretty = 0)

# Predict and evaluate
tree_pred <- predict(pruned_tree, test_data, type = "class")
confusionMatrix(tree_pred, test_data$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    e    p
##          e 2095    5
##          p    0 1962
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9971, 0.9996)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9975          
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9975          
##          Pos Pred Value : 0.9976          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5158          
##          Detection Rate : 0.5158          
##    Detection Prevalence : 0.5170          
##       Balanced Accuracy : 0.9987          
##                                           
##        'Positive' Class : e               
## 

Explanation: A Classification and Regression Tree (CART) was trained on the training data and pruned using cost-complexity pruning with cross-validation to avoid overfitting. The cross-validation plot indicated an optimal number of terminal nodes, after which pruning was applied to yield the most efficient tree. Visual inspection of the tree revealed that odor was the primary and most influential splitter, consistent with prior domain knowledge that smell is a strong indicator of mushroom edibility. The pruned tree achieved an impressive test accuracy of 99.88%, with sensitivity = 1.00 and specificity = 0.9975, as shown in the confusion matrix. This suggests that the tree is highly effective at distinguishing edible and poisonous mushrooms with minimal false positives or negatives. The model is interpretable and accurate, though potentially less flexible compared to ensemble methods.


0.0.1.3 c) Bagging

Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for \(B\), the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.

tree_nums <- c(10, 50, 100, 200)
bagging_accuracies <- c()

for (b in tree_nums) {
  bag_model <- randomForest(class ~ ., data = train_data, mtry = ncol(train_data)-1, ntree = b)
  pred <- predict(bag_model, test_data)
  acc <- mean(pred == test_data$class)
  bagging_accuracies <- c(bagging_accuracies, acc)
}

# Plotting accuracies
qplot(tree_nums, bagging_accuracies, geom = "line") +
  labs(title = "Bagging: Accuracy vs. Number of Trees", x = "Number of Trees", y = "Accuracy") +
  ylim(0.99, 1)

# Final bagging model
final_bag_model <- randomForest(class ~ ., data = train_data, mtry = ncol(train_data)-1, ntree = 100)
varImpPlot(final_bag_model)

Explanation: Bagging was implemented using the Random Forest algorithm with mtry set to the total number of predictors, which mimics full bagging by allowing all features to be considered at each split. The model was tested with varying numbers of trees: 10, 50, 100, and 200. Accuracy improved and stabilized around 99.9% once the number of trees exceeded 50, indicating that performance becomes robust with relatively few trees. The variable importance plot for the final model (100 trees) revealed odor, spore.print.color, and stalk.color.below.ring as the top three features contributing to classification. This confirms the model’s ability to reduce variance and maintain stability across runs. Bagging proved to be both powerful and reliable in this binary classification task.


0.0.1.4 d) Random Forests

Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for \(mtry\), the number of predictors which are “tried” as optimal splitters at each node, and also \(B\). Compare these models in terms of their test performance. Which \(B\) and \(mtry\) combination seems best? Create variable importance plots for this final model and interpret the findings.

mtry_values <- c(2, 4, 6)
rf_acc <- c()

for (m in mtry_values) {
  rf_model <- randomForest(class ~ ., data = train_data, mtry = m, ntree = 100)
  pred <- predict(rf_model, test_data)
  rf_acc <- c(rf_acc, mean(pred == test_data$class))
}

# Accuracy comparison
data.frame(mtry = mtry_values, accuracy = rf_acc)
##   mtry accuracy
## 1    2        1
## 2    4        1
## 3    6        1
# Final random forest
final_rf <- randomForest(class ~ ., data = train_data, mtry = 4, ntree = 100)
varImpPlot(final_rf)

Explanation: Random Forests extend bagging by adding randomness in feature selection. Different values of mtry (2, 4, and 6) were tested, and each configuration yielded perfect accuracy (100%) on the test set. This indicates that the model is extremely effective at generalizing, even when the number of features tried at each split is limited. The best performing model (chosen with mtry = 4) again highlighted odor as the most important predictor, followed by spore.print.color and several stalk and gill features. This reinforces the consistency of feature importance across models and suggests that the underlying structure in the data is highly separable. Random Forests provide a balance of flexibility, accuracy, and robustness, making them a standout option among tree-based methods.


0.0.1.5 e) Boosting

Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for \(\lambda\), perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between \(\lambda\) and \(B\)? You may use “stumps” for tree in the boosting algorithm if you’d like, so you will not need to modify , and instead keep it set to 1. Which combination (of \(\lambda\), \(B\)) seems best? Create variable importance plots for this final model and interpret the findings.

# Convert class to numeric for gbm
train_data$class_bin <- as.numeric(train_data$class) - 1
test_data$class_bin <- as.numeric(test_data$class) - 1

lambda_vals <- c(0.001, 0.01, 0.1)
boost_acc <- c()

for (l in lambda_vals) {
  boost_model <- gbm(class_bin ~ . - class, data = train_data, distribution = "bernoulli",
                     n.trees = 100, interaction.depth = 1, shrinkage = l, verbose = FALSE)
  pred <- predict(boost_model, test_data, n.trees = 100, type = "response")
  pred_class <- ifelse(pred > 0.5, 1, 0)
  acc <- mean(pred_class == test_data$class_bin)
  boost_acc <- c(boost_acc, acc)
}

# Accuracy vs. learning rate
data.frame(lambda = lambda_vals, accuracy = boost_acc)
##   lambda  accuracy
## 1  0.001 0.9862137
## 2  0.010 0.9862137
## 3  0.100 0.9970458
# Final model and variable importance
final_boost <- gbm(class_bin ~ . - class, data = train_data, distribution = "bernoulli",
                   n.trees = 100, interaction.depth = 1, shrinkage = 0.1, verbose = FALSE)
summary(final_boost)

##                                               var      rel.inf
## odor                                         odor 94.585764830
## spore.print.color               spore.print.color  4.761496972
## stalk.surface.below.ring stalk.surface.below.ring  0.263379743
## stalk.color.below.ring     stalk.color.below.ring  0.250393033
## gill.size                               gill.size  0.131524823
## stalk.surface.above.ring stalk.surface.above.ring  0.007440599
## cap.shape                               cap.shape  0.000000000
## cap.surface                           cap.surface  0.000000000
## cap.color                               cap.color  0.000000000
## bruises                                   bruises  0.000000000
## gill.attachment                   gill.attachment  0.000000000
## gill.spacing                         gill.spacing  0.000000000
## gill.color                             gill.color  0.000000000
## stalk.shape                           stalk.shape  0.000000000
## stalk.root                             stalk.root  0.000000000
## stalk.color.above.ring     stalk.color.above.ring  0.000000000
## veil.type                               veil.type  0.000000000
## veil.color                             veil.color  0.000000000
## ring.number                           ring.number  0.000000000
## ring.type                               ring.type  0.000000000
## population                             population  0.000000000
## habitat                                   habitat  0.000000000

Explanation: Boosting was performed using the gbm package with shallow trees (stumps) and learning rates (λ) of 0.001, 0.01, and 0.1. The best test performance was achieved with shrinkage = 0.1, producing an accuracy of approximately 99.7%. While this was slightly below the accuracy of bagging and random forests, it is still highly competitive. The model converged more quickly and effectively at higher learning rates. The variable importance plot showed that odor had an overwhelming influence, accounting for over 94% of the model’s predictive power, while all other variables had relatively minor contributions. This suggests that boosting, while sensitive to hyperparameter settings, is still capable of capturing the dominant structure in the data.


0.0.1.6 f) Best Model Selection

Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?

# Confusion matrix heatmap (example with random forest)
rf_pred <- predict(final_rf, test_data)
cm <- confusionMatrix(rf_pred, test_data$class)

# Plotting confusion matrix
conf_matrix <- as.data.frame(cm$table)
ggplot(conf_matrix, aes(Prediction, Reference, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix: Random Forest")

Explanation: After comparing the results of all four methods—CART, Bagging, Random Forests, and Boosting—it is clear that Random Forests and Bagging outperform the others in terms of both accuracy and robustness. Random Forests slightly edge out bagging by introducing decorrelation between trees while maintaining perfect test set accuracy. Boosting performs very well when appropriately tuned but requires more careful calibration of parameters. CART, while interpretable, has slightly lower accuracy and may not generalize as well. Given the outstanding performance and consistency of Random Forests across different mtry values and its clear visualization of feature importance, it stands out as the best-performing model. Due to the simplicity of the dataset and the dominant influence of the odor variable, it is highly likely that other students using the same seed and methodology will arrive at a similar conclusion.


0.0.2 Question 2: Bootstrapping Standard Errors

Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of estimation where standard software does not exist. Use the function, or risk losing points!

Using the data from package , estimate the standard error of the difference in Median (miles per gallon) between 4 cylinder and 8 cylinder cars.

library(ISLR)
library(boot)
library(dplyr)

data(Auto)
Auto <- Auto %>%
  mutate(cyl4 = ifelse(cylinders == 4, 1, 0),
         cyl8 = ifelse(cylinders == 8, 1, 0))

boot_func <- function(data, index) {
  d <- data[index, ]
  med4 <- median(d$mpg[d$cyl4 == 1])
  med8 <- median(d$mpg[d$cyl8 == 1])
  return(med4 - med8)
}

set.seed(146)
boot_result <- boot(data = Auto, statistic = boot_func, R = 1000)
boot_result
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot_func, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     14.4 -0.30035   0.8256265

Explanation: In this question, we aimed to estimate the standard error of the difference in median miles per gallon (mpg) between 4-cylinder and 8-cylinder cars using the Auto dataset from the ISLR package. Since standard software tools do not directly provide the standard error for the difference in medians, we used bootstrapping with the boot() function, running 1,000 resampling iterations. The function calculated the difference in medians (median_4cyl - median_8cyl) for each bootstrap sample.

The original estimated difference in medians was approximately 14.4 mpg, indicating that 4-cylinder cars, on average, are significantly more fuel efficient than their 8-cylinder counterparts. The bootstrap-estimated standard error was approximately 0.83 mpg, with negligible bias (−0.30), which implies that this estimate is stable and reliable. This small standard error, relative to the large effect size, suggests that the observed difference is both statistically and practically significant. Bootstrapping proved to be an effective tool in quantifying the variability of a nonparametric statistic where no analytical formula is readily available.


0.0.3 Question 3: 10-Fold Cross-Validation with cpus Data

Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance in the dataset from R package , using only minimum main memory and maximum main memory as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the function (creates the folds for you) from the R package , as explained in class.

library(MASS)
library(caret)

data(cpus)
set.seed(146)

flds <- createFolds(cpus$perf, k = 10)
cv_errors <- numeric(10)

for (i in 1:10) {
  train_idx <- unlist(flds[-i])
  test_idx <- flds[[i]]
  
  model <- lm(perf ~ mmin + mmax, data = cpus[train_idx, ])
  preds <- predict(model, newdata = cpus[test_idx, ])
  
  cv_errors[i] <- mean((cpus$perf[test_idx] - preds)^2)
}

mean(cv_errors)
## [1] 6322.073

Explanation: In this task, we assessed the out-of-sample prediction error of a linear regression model using 10-fold cross-validation. The goal was to predict CPU performance (perf) based on two predictors: minimum main memory (mmin) and maximum main memory (mmax) from the cpus dataset in the MASS package. To implement cross-validation, we used the createFolds() function from the caret package to split the data into 10 equal parts, ensuring each fold served as a test set once.

In each iteration, the model was trained on 90% of the data and tested on the remaining 10%. The mean squared prediction error was computed for each fold, and then averaged across all 10 folds to yield an overall estimate. The final mean squared error (MSE) was approximately 6,322, which reflects the average prediction error of the model when applied to new data. This MSE indicates moderate predictive performance. Since the model includes only two main-effect predictors, there may be room for improvement through incorporating interaction terms, non-linear transformations, or other relevant features. Nonetheless, the implementation demonstrates an effective and reproducible approach to evaluating model performance using cross-validation.