Please hand this result in on Canvas no later than 11:59pm on Thursday, May 8!

Questions

  1. Experimenting with tree-based methods: In this problem, you will use CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of variables describing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data (including variable level designations for interpretation) can be found at the following site: https://www.kaggle.com/uciml/mushroom-classification/data
  1. Split the data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e., do not take the first half as your test set, etc., but instead use the sample() function).
#Load data
mush <- read.csv("C:/Users/nsepe/Downloads/mushrooms.csv", stringsAsFactors=TRUE)

mush$class <- factor(mush$class, levels=c("e","p"),
                     labels=c("edible","poisonous"))
set.seed(1234)

#Create a random 50/50 split
n <- nrow(mush)
train_idx <- sample(seq_len(n), size = n/2)
train <- mush[ train_idx, ]
test  <- mush[-train_idx, ]

#Quick sanity check
table(train$class)  
## 
##    edible poisonous 
##      2109      1953
table(test$class)
## 
##    edible poisonous 
##      2099      1963
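
One incidental feature of this dataset worth flagging before modelling: one predictor, veil.type, has only a single observed level, so it carries no information for any tree-based split (gbm warns about this in part e)). A minimal optional check for such columns (the constant.cols name here is just illustrative, not part of the required answer):

#Optional check: list columns with a single observed level (constant predictors)
constant.cols <- names(train)[sapply(train, function(x) length(unique(x)) < 2)]
constant.cols
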
  1. Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.
library(tree)

#Fit a full tree
full.tree <- tree(class ~ ., data = train)

#Cross‐validate to find optimal size
cv.t  <- cv.tree(full.tree, FUN = prune.misclass)
plot(cv.t$size, cv.t$dev, type="b",
     xlab="Tree size (terminal nodes)", ylab="CV misclassification errors")

#Choose the size with lowest CV error
best.size <- cv.t$size[which.min(cv.t$dev)]

#Prune
pruned.tree <- prune.misclass(full.tree, best = best.size)

plot(pruned.tree); text(pruned.tree, pretty=0)

#Test‐set performance
pred.b <- predict(pruned.tree, test, type="class")
table(pred.b, test$class)
##            
## pred.b      edible poisonous
##   edible      2099         3
##   poisonous      0      1960
mean(pred.b == test$class)  # accuracy
## [1] 0.9992614

-Interpretation: After pruning with 10-fold CV I selected a 3-leaf tree. The root split on odor immediately isolates virtually all poisonous mushrooms (those with a foul smell). Non-foul odors are then split on spore print color, separating edible mushrooms from the few remaining poisonous cases. On the independent test set this simple model achieved 99.93% accuracy, misclassifying only 3 of 4062 mushrooms. This indicates that odor alone carries almost all the predictive power, and that a very small decision tree suffices for near-perfect classification.
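
As a quick check on this reading of the tree, the split rules can also be printed in text form; this just re-uses the pruned.tree object fitted above:

#Text view of the pruned tree: splits, node counts, and predicted classes
print(pruned.tree)
#Summary lists the variables actually used and the misclassification rate
summary(pruned.tree)
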

  1. Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for \(B\), the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
#Number of predictors
p <- ncol(train) - 1  

#Bagging 
bag.50  <- randomForest(class ~ ., data=train, mtry=p,   ntree= 50,  importance=TRUE)
bag.200 <- randomForest(class ~ ., data=train, mtry=p,   ntree=200, importance=TRUE)

#Plot OOB error vs number of trees
plot(bag.200)  

#Test performance
for(mod in list(bag.50, bag.200)) {
  pred <- predict(mod, test)
  cat("ntree =", mod$ntree, "→",
      mean(pred == test$class), "\n")
}
## ntree = 50 → 0.9995076 
## ntree = 200 → 0.9995076
#Variable importance
varImpPlot(bag.200)

-Interpretation: With only ~50 bagged trees the OOB error is already near zero and the test accuracy is 99.95%; increasing B to 200 leaves the test accuracy unchanged, so roughly 50 trees appear sufficient. The variable-importance plots reveal that odor is the dominant predictor, followed by spore print color and stalk color below the ring, while all other attributes have minimal impact.
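
To probe sensitivity to B a little more finely than the two fits above, one option is to loop over a grid of tree counts. The sketch below is illustrative (not part of the graded output); it re-uses train, test, and p from earlier, and exact accuracies will wiggle slightly across runs:

#Sketch: test accuracy of bagging over a grid of B values
B.grid  <- c(10, 25, 50, 100, 200, 400)
bag.acc <- sapply(B.grid, function(B) {
  fit <- randomForest(class ~ ., data = train, mtry = p, ntree = B)
  mean(predict(fit, test) == test$class)
})
plot(B.grid, bag.acc, type = "b",
     xlab = "Number of bagged trees (B)", ylab = "Test accuracy")
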

  1. Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for \(mtry\), the number of predictors which are “tried” as optimal splitters at each node, and also \(B\). Compare these models in terms of their test performance. Which \(B\) and \(mtry\) combination seems best? Create variable importance plots for this final model and interpret the findings.
results <- expand.grid(ntree = c(100, 500),
                       mtry   = c(floor(sqrt(p)), floor(p/3), p))
results$accuracy <- NA

for(i in seq_len(nrow(results))) {
  rf <- randomForest(class~., data=train,
                     ntree = results$ntree[i],
                     mtry   = results$mtry[i])
  preds <- predict(rf, test)
  results$accuracy[i] <- mean(preds == test$class)
}

print(results)
##   ntree mtry  accuracy
## 1   100    4 1.0000000
## 2   500    4 1.0000000
## 3   100    7 1.0000000
## 4   500    7 1.0000000
## 5   100   22 0.9995076
## 6   500   22 0.9995076
# Fit best model
best <- results[which.max(results$accuracy), ]
rf.final <- randomForest(class~., data=train,
                         ntree=best$ntree, mtry=best$mtry,
                         importance=TRUE)

varImpPlot(rf.final)

-Interpretation: The random forest with 100 trees and mtry = 4 (≈ √22) achieved perfect 100% accuracy on the held-out test set; the mtry = 7 and 500-tree forests matched it, while allowing all 22 predictors at each split (i.e., bagging) slightly dropped accuracy to 99.95%. I chose the 100-tree, mtry = 4 model as the simplest of the tied configurations. Its variable-importance measures confirm that odor is by far the strongest predictor, with spore print color and gill characteristics (size, color) providing the next largest gains, while all other morphological features contribute negligibly.
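
Because the test set was used to choose ntree and mtry, an optional complementary check is the forest's own out-of-bag error, which never touches the test data. A short sketch using the rf.final object fitted above:

#OOB classification error as trees are added; if the curve flattens early,
#even fewer trees would be sufficient
oob.err <- rf.final$err.rate[, "OOB"]
plot(seq_along(oob.err), oob.err, type = "l",
     xlab = "Number of trees", ylab = "OOB classification error")
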

  1. Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for \(\lambda\), perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between \(\lambda\) and \(B\)? You may use “stumps” for the trees in the boosting algorithm if you’d like, so you will not need to modify interaction.depth, and can instead keep it set to 1. Which combination (of \(\lambda\), \(B\)) seems best? Create variable importance plots for this final model and interpret the findings.
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Create Numeric response 0/1 
train$y01 <- ifelse(train$class=="poisonous", 1, 0)
test$y01  <- ifelse(test$class =="poisonous", 1, 0)

#Grid of shrinkage and trees
shrinkages <- c(0.001, 0.01, 0.1)
n.trees    <- c(100, 500, 1000)
boost.res  <- expand.grid(lambda = shrinkages, 
                          n.trees = n.trees,
                          accuracy = NA)

for(i in seq_len(nrow(boost.res))) {
  gb <- gbm(y01 ~ . -class,          
            data = train, 
            distribution = "bernoulli",
            n.trees = boost.res$n.trees[i],
            shrinkage = boost.res$lambda[i],
            interaction.depth = 1,    
            verbose = FALSE)
  
  # Predict probabilities then class
  prob <- predict(gb, test, 
                  n.trees = boost.res$n.trees[i], 
                  type="response")
  pred <- ifelse(prob > 0.5, 1, 0)
  boost.res$accuracy[i] <- mean(pred == test$y01)
}
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
(gbm issues this warning for each of the nine fits in the grid, because veil.type takes only a single value in this dataset.)
print(boost.res)
##   lambda n.trees  accuracy
## 1  0.001     100 0.9847366
## 2  0.010     100 0.9847366
## 3  0.100     100 0.9975382
## 4  0.001     500 0.9847366
## 5  0.010     500 0.9945839
## 6  0.100     500 1.0000000
## 7  0.001    1000 0.9847366
## 8  0.010    1000 0.9975382
## 9  0.100    1000 1.0000000
# Fit & show variable influence 
best.b <- boost.res[which.max(boost.res$accuracy), ]
gb.final <- gbm(y01 ~ . -class, data=train,
                distribution="bernoulli",
                n.trees=best.b$n.trees,
                shrinkage=best.b$lambda,
                interaction.depth=1)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
summary(gb.final)  

-Interpretation: Boosting hits 100% test accuracy only at λ = 0.1 with 500 or more trees; smaller values of λ would need many more trees to catch up, and at the values of B tried here they reach lower accuracy, illustrating the trade-off between the shrinkage λ and the number of trees B. The relative-influence plot shows odor drives ~95% of the fit, spore.print.color ~4%, and all other features each under 1%.
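
One way to see the λ–B trade-off directly is to score a single boosted fit at many intermediate tree counts: predict() for gbm objects accepts a vector of n.trees values. The sketch below is illustrative and re-uses the shrinkages grid and the y01 coding from above; the smaller-λ curves should approach the λ = 0.1 curve only as B grows.

#Sketch: test accuracy vs number of trees for each shrinkage value
tree.seq <- seq(50, 1000, by = 50)
acc.mat  <- sapply(shrinkages, function(lam) {
  gb <- gbm(y01 ~ . - class, data = train, distribution = "bernoulli",
            n.trees = max(tree.seq), shrinkage = lam,
            interaction.depth = 1, verbose = FALSE)
  probs <- predict(gb, test, n.trees = tree.seq, type = "response")
  colMeans((probs > 0.5) == test$y01)   # accuracy at each value of B
})
matplot(tree.seq, acc.mat, type = "l", lty = 1, col = 1:length(shrinkages),
        xlab = "Number of trees (B)", ylab = "Test accuracy")
legend("bottomright", legend = paste("lambda =", shrinkages),
       lty = 1, col = 1:length(shrinkages))
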

  1. Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?

    # Gather the top accuracies
    accs <- data.frame(
      Model = c("Pruned CART", 
                "Bagging (200)", 
                "RF final", 
                "Boosting final"),
      Accuracy = c(
        mean(predict(pruned.tree, test, type="class") == test$class),
        mean(predict(bag.200, test) == test$class),
        mean(predict(rf.final, test) == test$class),
        max(boost.res$accuracy)
      )
    )
    print(accs)
    ##            Model  Accuracy
    ## 1    Pruned CART 0.9992614
    ## 2  Bagging (200) 0.9995076
    ## 3       RF final 1.0000000
    ## 4 Boosting final 1.0000000

-Interpretation: The random forest and the boosted model both hit 100% accuracy, so they tie for best. I’d pick the final random forest (100 trees, mtry ≈ √p) as my chosen model for its simplicity.

  1. Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of an estimate for which standard software does not exist. Use the boot() function, or risk losing points!

Using the Auto data from package ISLR, estimate the standard error of the difference in median mpg (miles per gallon) between 4-cylinder and 8-cylinder cars.

# install.packages(c("ISLR","boot"))
library(ISLR)    # for Auto
library(boot)    # for boot()

data(Auto)

# Statistic: median(mpg|cyl=4) – median(mpg|cyl=8)
diff_med <- function(df, idx) {
  d   <- df[idx, ]
  m4  <- median(d$mpg[d$cylinders==4])
  m8  <- median(d$mpg[d$cylinders==8])
  m4 - m8
}

set.seed(123)
boot.out <- boot(data      = Auto,
                 statistic = diff_med,
                 R         = 1000)

# Print bootstrap results
print(boot.out)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = diff_med, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     14.4 -0.2439   0.7671612
cat("\nObserved Δ-median =", boot.out$t0,
    "\nBootstrap SE    =", sd(boot.out$t), "\n\n")
## 
## Observed Δ-median = 14.4 
## Bootstrap SE    = 0.7671612

-Interpretation: The sample medians of mpg are 26.8 mpg for 4-cylinder cars and 12.4 mpg for 8-cylinder cars, giving an observed difference of 14.4 mpg. Using boot() with 1000 resamples, the estimated standard error of this difference is 0.77 mpg. An approximate 95% confidence interval is [12.9, 15.9] mpg, which lies well above zero, indicating that 4-cylinder cars have substantially higher median fuel economy than 8-cylinder cars.
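
The interval quoted above is the normal approximation, observed difference ± 1.96 × SE. It can also be pulled directly from the bootstrap output with boot.ci(), together with a percentile version, as a quick check:

#95% confidence intervals for the median difference from the same bootstrap output
boot.ci(boot.out, conf = 0.95, type = c("norm", "perc"))
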

  1. Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance (perf) in the cpus dataset from R package MASS, using only minimum main memory (mmin) and maximum main memory (mmax) as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the createFolds() function (which creates the folds for you) from the R package caret, as explained in class.
library(MASS)   
library(caret)   
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
#Define the CV function
cv_lm <- function(data, response, predictors, k = 10, seed = 101) {
  set.seed(seed)
  folds <- createFolds(data[[response]], k = k, list = TRUE)
  mses  <- numeric(length(folds))
  
  for (i in seq_along(folds)) {
    test_idx  <- folds[[i]]
    train_dat <- data[-test_idx, ]
    test_dat  <- data[ test_idx, ]
    
    fmla <- as.formula(
      paste(response, "~", paste(predictors, collapse = " + "))
    )
    fit   <- lm(fmla, data = train_dat)
    pred  <- predict(fit, newdata = test_dat)
    mses[i] <- mean((test_dat[[response]] - pred)^2)
  }
  
  #Return average MSE
  mean(mses)
}

#Run 10-fold CV
data(cpus)
cv_error <- cv_lm(
  data       = cpus,
  response   = "perf",
  predictors = c("mmin", "mmax"),
  k          = 10,
  seed       = 123
)

#Print the result
cat("10-fold CV MSE for lm(perf ~ mmin + mmax):", round(cv_error, 2), "\n")
## 10-fold CV MSE for lm(perf ~ mmin + mmax): 6233.69

-Interpretation: The 10-fold CV MSE is 6233.69, which corresponds to an RMSE of about 79 performance units. For context (summary statistics below), perf has a standard deviation of about 161, so the two-predictor model reduces the typical prediction error to roughly half the raw spread of the response.

summary(cpus$perf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    27.0    50.0   105.6   113.0  1150.0
sd(cpus$perf)
## [1] 160.8306
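
As an optional cross-check on the hand-written function (not a substitute for it, since the question requires createFolds() from caret), cv.glm() from the already-loaded boot package gives a comparable 10-fold estimate for the same model; the value will differ somewhat because the folds are drawn differently:

#Cross-check: 10-fold CV MSE from boot::cv.glm for the same two-predictor model;
#delta[1] is the raw cross-validation estimate of the MSE
set.seed(123)
glm.fit <- glm(perf ~ mmin + mmax, data = cpus)
cv.glm(cpus, glm.fit, K = 10)$delta[1]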