Questions 1) Experimenting with tree-based methods: In this problem, you will use CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of variables describing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data (including variable level designations for interpretation) can be found at the following site: https://www.kaggle.com/uciml/mushroom-classification/data

  a) Split the mushroom data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead use the sample() function).
library(tree)
mushrooms <- read.csv("/Users/ahlad/Downloads/mushrooms.csv")

# Convert all character columns to factors so the tree-based methods treat them as categorical
mushrooms[] <- lapply(mushrooms, function(x) if (is.character(x)) factor(x) else x)
mushrooms$class <- as.factor(mushrooms$class)

set.seed(1234)
n <- nrow(mushrooms)
train_indices <- sample(1:n, size = n / 2)
train_data <- mushrooms[train_indices, ]
test_data <- mushrooms[-train_indices, ]

The class target variable was converted to a factor so that the models treat this as a classification problem.

  b) Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.
# Numeric 0/1 label (1 = edible) created for the boosting fit in part e);
# it is excluded from the tree-based fits below so it does not leak the outcome
train_data$label <- ifelse(train_data$class == "e", 1, 0)
test_data$label <- ifelse(test_data$class == "e", 1, 0)

tree_model <- tree(class ~ . -label, data = train_data,
                   control = tree.control(nobs = nrow(train_data), mindev = 0.001))

cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.misclass(tree_model, best = best_size)

plot(pruned_tree)
text(pruned_tree, pretty = 0)

pred_tree <- predict(pruned_tree, newdata = test_data, type = "class")
tree_acc <- mean(pred_tree == test_data$class)
tree_acc
## [1] 1

We fit a classification tree using the tree() function and employed cost-complexity pruning via cross-validation to determine the optimal size. The resulting pruned tree splits first on odor, the most informative feature: mushrooms whose odor is almond, anise, or none fall on the predominantly edible side of the tree, while foul and other unpleasant odors lead to poisonous classifications. The pruned tree achieves 100% accuracy on the test data, which confirms that mushrooms in this dataset are highly separable by a few strong categorical predictors. This model is both accurate and interpretable, making it a strong baseline before applying ensemble methods.
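
To double-check the size chosen by cross-validation, a minimal sketch (using the cv_tree and best_size objects created above) is to plot the CV misclassification count against tree size:

# CV misclassifications versus number of terminal nodes
plot(cv_tree$size, cv_tree$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "CV misclassifications")
abline(v = best_size, lty = 2)  # size selected by which.min(cv_tree$dev)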

  c) Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for B, the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
bag_model <- randomForest(class ~ . -label, data = train_data, 
                          mtry = ncol(train_data) - 2,  # exclude class + label
                          ntree = 500)
bag_preds <- predict(bag_model, newdata = test_data)
bag_acc <- mean(bag_preds == test_data$class)
bag_acc
## [1] 0.9995076
varImpPlot(bag_model)

We excluded label from the predictor set because it is a direct encoding of the outcome and would leak the answer into the model. Using bagging with 500 trees, the model achieved 99.95% accuracy on the test set, indicating excellent generalization. Testing different values of B showed that accuracy stabilized by around 100–200 trees, so 500 is more than sufficient (a sketch of this check is given below). From the variable importance plot, odor is by far the most influential feature, followed by spore.print.color and stalk.color.below.ring. These results confirm that the mushroom data is highly separable and that ensemble methods like bagging are extremely effective for this classification task.
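
As a sketch of the sensitivity check described above (the specific values of B are illustrative choices, not the only ones worth trying), the bagged model can be refit over a grid of tree counts and the test accuracies compared:

# Refit the bagged model for several values of B and record test accuracy
B_grid <- c(25, 50, 100, 200, 500)
bag_accs <- sapply(B_grid, function(B) {
  fit <- randomForest(class ~ . - label, data = train_data,
                      mtry = ncol(train_data) - 2, ntree = B)
  mean(predict(fit, newdata = test_data) == test_data$class)
})
data.frame(B = B_grid, accuracy = bag_accs)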

  d) Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for mtry, the number of predictors which are “tried” as optimal splitters at each node, and also B. Compare these models in terms of their test performance. Which B and mtry combination seems best? Create variable importance plots for this final model and interpret the findings.
# label is again excluded so the forests only see the legitimate predictors
rf2 <- randomForest(class ~ . - label, data = train_data, mtry = 2, ntree = 300)
rf4 <- randomForest(class ~ . - label, data = train_data, mtry = 4, ntree = 300)
rf6 <- randomForest(class ~ . - label, data = train_data, mtry = 6, ntree = 300)

acc2 <- mean(predict(rf2, test_data) == test_data$class)
acc4 <- mean(predict(rf4, test_data) == test_data$class)
acc6 <- mean(predict(rf6, test_data) == test_data$class)

c(acc2, acc4, acc6)
## [1] 1 1 1

We trained random forest models with mtry = 2, 4, and 6, each using 300 trees. All models achieved perfect accuracy on the test set, showing that the ensemble is highly effective regardless of mtry. Lower mtry injects more randomness and reduces the correlation between trees, while higher mtry makes each tree stronger but more similar to the others; here every configuration performs equally well because the dataset is so cleanly separable, so any of these (B, mtry) combinations is effectively best. A fuller grid comparison and the importance plot for the selected model are sketched below.
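
The grid values for B and mtry below are illustrative assumptions, and the importance plot is drawn for the model ultimately selected (mtry = 2, B = 300):

# Compare a small grid of (B, mtry) combinations on the test set
grid <- expand.grid(ntree = c(100, 300, 500), mtry = c(2, 4, 6))
grid$accuracy <- apply(grid, 1, function(g) {
  fit <- randomForest(class ~ . - label, data = train_data,
                      mtry = g["mtry"], ntree = g["ntree"])
  mean(predict(fit, newdata = test_data) == test_data$class)
})
grid

# Variable importance for the selected random forest
rf_best <- randomForest(class ~ . - label, data = train_data, mtry = 2, ntree = 300)
varImpPlot(rf_best)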

  e) Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for λ, perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between λ and B? You may use “stumps” for trees in the boosting algorithm if you’d like, so you will not need to modify interaction.depth, and instead keep it set to 1. Which combination (of λ, B) seems best? Create variable importance plots for this final model and interpret the findings.
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
train_data$label <- ifelse(train_data$class == "e", 1, 0)
test_data$label <- ifelse(test_data$class == "e", 1, 0)

boost_01 <- gbm(label ~ . - class, data = train_data, distribution = "bernoulli",
                n.trees = 1000, shrinkage = 0.01, interaction.depth = 1)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
preds_01 <- ifelse(predict(boost_01, test_data, n.trees = 1000, type = "response") > 0.5, 1, 0)
acc_01 <- mean(preds_01 == test_data$label)
acc_01
## [1] 0.9975382

We applied boosting using 1,000 trees and a shrinkage rate of 0.01, treating label as the binary outcome. The model achieved 99.75% accuracy, performing slightly below random forest and bagging. Boosting performed best with small shrinkage values such as 0.01, provided enough trees were used; this reflects the typical trade-off that a smaller λ requires a larger B to reach the same performance. The warning above simply notes that veil.type has only one level in this data, so gbm ignores it as a candidate splitter.
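
A sketch of the λ/B comparison is below; the particular pairings (a very small λ with many trees versus a larger λ with few trees) are illustrative assumptions, and summary() gives the relative-influence variable importance for the chosen boosted model:

# Smaller shrinkage needs more trees; larger shrinkage needs fewer
boost_001 <- gbm(label ~ . - class, data = train_data, distribution = "bernoulli",
                 n.trees = 5000, shrinkage = 0.001, interaction.depth = 1)
boost_10  <- gbm(label ~ . - class, data = train_data, distribution = "bernoulli",
                 n.trees = 200, shrinkage = 0.1, interaction.depth = 1)

acc_001 <- mean(ifelse(predict(boost_001, test_data, n.trees = 5000,
                               type = "response") > 0.5, 1, 0) == test_data$label)
acc_10  <- mean(ifelse(predict(boost_10, test_data, n.trees = 200,
                               type = "response") > 0.5, 1, 0) == test_data$label)
c(lambda_0.001 = acc_001, lambda_0.1 = acc_10)

# Relative influence (variable importance) for the selected model
summary(boost_01)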

Overall, boosting is highly effective, but slightly more sensitive to tuning than bagging or random forests, which reached near-perfect or perfect accuracy with essentially no tuning.

  f) Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?
model_accuracies <- c(
  Bagging = bag_acc,
  RF_mtry2 = acc2,
  RF_mtry4 = acc4,
  RF_mtry6 = acc6,
  CART = tree_acc,
  Boosting = acc_01
)


best_model_name <- names(which.max(model_accuracies))
best_accuracy <- max(model_accuracies)

cat("Model Accuracies:\n")
## Model Accuracies:
print(round(model_accuracies, 5))
##  Bagging RF_mtry2 RF_mtry4 RF_mtry6     CART Boosting 
##  0.99951  1.00000  1.00000  1.00000  1.00000  0.99754
cat("\nBest Model:", best_model_name, "with accuracy:", best_accuracy, "\n")
## 
## Best Model: RF_mtry2 with accuracy: 1

The test accuracy was perfect (1.0) for CART and all random forest models, while bagging (≈99.95%) and boosting (≈99.75%) were marginally lower.

Among the top performers, we selected random forest with mtry = 2 as the best model. It achieved perfect accuracy and offers stronger generalization than a single tree and less tuning complexity than boosting.

It is highly likely that other students will get the same answer if they use set.seed(1234) as instructed and perform all analyses correctly. The Mushroom dataset is extremely well-separated (especially by the odor variable), so ensemble methods often reach perfect classification regardless of minor parameter differences.
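
As an informal check of that claim (not required by the question), the split can be repeated with an arbitrary different seed and the chosen model refit; the seed below is a made-up example:

# Repeat the 50/50 split with a different seed and refit the selected random forest
set.seed(5678)
idx2 <- sample(1:n, size = n / 2)
rf_check <- randomForest(class ~ ., data = mushrooms[idx2, ], mtry = 2, ntree = 300)
mean(predict(rf_check, mushrooms[-idx2, ]) == mushrooms$class[-idx2])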

2) Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of estimation where standard software does not exist. Use the boot() function, or risk losing points! Using the Auto data from package ISLR, estimate the standard error of the difference in median mpg (miles per gallon) between 4 cylinder and 8 cylinder cars.
library(ISLR)
library(boot)

# Filters only 4 and 8 cylinder cars
Auto_48 <- subset(Auto, cylinders %in% c(4, 8))

diff_median_fn <- function(data, indices) {
  sample_data <- data[indices, ]
  med4 <- median(sample_data$mpg[sample_data$cylinders == 4])
  med8 <- median(sample_data$mpg[sample_data$cylinders == 8])
  return(med4 - med8)
}

set.seed(1234)
boot_out <- boot(data = Auto_48, statistic = diff_median_fn, R = 1000)
boot_out
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto_48, statistic = diff_median_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     14.4 -0.28655   0.7817146
bootstrap_se <- sd(boot_out$t)
bootstrap_se
## [1] 0.7817146

We used the boot() function to estimate the standard error of the difference in median mpg between 4- and 8-cylinder cars using 1,000 bootstrap resamples. The observed difference in medians was 14.4 mpg, meaning that 4-cylinder cars typically achieve much better fuel efficiency than 8-cylinder cars. The estimated bootstrap standard error was approximately 0.78 mpg, and the bias was small (−0.29), indicating the estimator is stable under resampling. Because no simple closed-form standard error exists for a difference in medians, bootstrapping is a practical and statistically valid way to estimate its variability.
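
As a usage note, the same boot object can also be converted into an interval estimate; a percentile bootstrap confidence interval is one reasonable choice (the interval type is my assumption, not specified in the question):

# 95% percentile bootstrap CI for the difference in median mpg
boot.ci(boot_out, type = "perc")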

3) Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance in the cpus dataset from R package MASS, using only minimum main memory and maximum main memory as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the createFolds() function (creates the folds for you) from the R package caret, as explained in class.
library(MASS)   
library(caret)  
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
data(cpus)
cpus_clean <- na.omit(cpus)
names(cpus_clean)  
## [1] "name"    "syct"    "mmin"    "mmax"    "cach"    "chmin"   "chmax"  
## [8] "perf"    "estperf"
cv_mse_linear <- function(df, response, predictors, k = 10) {
  
  set.seed(1234)  
  folds <- createFolds(df[[response]], k = k, list = TRUE)

  mse_list <- numeric(k)   # storage for the per-fold MSE values

  for (i in seq_along(folds)) {
    test_idx <- folds[[i]]
    train_data <- df[-test_idx, ]
    test_data  <- df[test_idx, ]

    formula_str <- paste(response, "~", paste(predictors, collapse = " + "))
    model <- lm(as.formula(formula_str), data = train_data)

    pred <- predict(model, newdata = test_data)
    actual <- test_data[[response]]

    mse <- mean((pred - actual)^2)
    mse_list[i] <- mse
  }
  return(mean(mse_list))
}
cv_result <- cv_mse_linear(
  df = cpus_clean,
  response = "perf",
  predictors = c("mmin", "mmax"),
  k = 10
)
cat("Estimated 10-fold CV MSE:", round(cv_result, 3), "\n")
## Estimated 10-fold CV MSE: 6315.865

We created a custom 10-fold cross-validation function using the createFolds() function from the caret package. Using mmin and mmax as predictors of CPU performance (perf), the model was evaluated on held-out data across 10 folds. The estimated test error is the average mean squared error (MSE) across the folds, which gives a realistic measure of how well the model generalizes.

The estimated 10-fold CV test MSE is approximately 6315.87 (a root MSE of roughly 79 performance units), which suggests moderate prediction error when using just the two memory-related predictors mmin and mmax.
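
As an optional sanity check on the hand-rolled function (not part of the required answer), a similar 10-fold CV error can be obtained from boot::cv.glm(); the fold assignments differ, so the estimate will not match exactly:

# Cross-check: 10-fold CV MSE from cv.glm() on the same linear model
set.seed(1234)
glm_fit <- glm(perf ~ mmin + mmax, data = cpus_clean)
cv.glm(cpus_clean, glm_fit, K = 10)$delta[1]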