Please hand this result in on Canvas no later than 11:59pm on Thursday, May 8!
Do not work in groups!
Do not discuss the exam with other students, or share any ideas or thoughts regarding the exam. This is academic misconduct, and will not be tolerated. Your presented solution must be your own work!
You must submit your compiled RMarkdown file in PDF/HTML format.
If you cannot compile the RMarkdown file, please send an email to me for help before Thursday, May 8!
Question 1
Experimenting with tree-based methods: In this problem, you will use CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of variables describing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam (mushrooms.csv) on Canvas. More information regarding this data (including variable level designations for interpretation) can be found at the following site: https://www.kaggle.com/uciml/mushroom-classification/data
Read in the Mushroom Dataset from the Google Sheets public URL.
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSP2UBc4FC9EgHTeQl8UtFy9PHP1J1pox-ljfMClb21UUf4tWq2PiVrdffsSvXqrxW6fNJFceXgr_TN/pub?output=csv"
mushroom <- read.csv(url)
dim(mushroom)
## [1] 8124 23
head(mushroom)
## class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1 p x s n t p f
## 2 e x s y t a f
## 3 e b s w t l f
## 4 p x y w t p f
## 5 e x s g f n f
## 6 e x y y t a f
## gill.spacing gill.size gill.color stalk.shape stalk.root
## 1 c n k e e
## 2 c b k e c
## 3 c b n e c
## 4 c n n e e
## 5 w b k t e
## 6 c b n e c
## stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1 s s w
## 2 s s w
## 3 s s w
## 4 s s w
## 5 s s w
## 6 s s w
## stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1 w p w o p
## 2 w p w o p
## 3 w p w o p
## 4 w p w o p
## 5 w p w o e
## 6 w p w o p
## spore.print.color population habitat
## 1 k s u
## 2 n n g
## 3 n n m
## 4 k s u
## 5 n a g
## 6 k n g
This analysis uses CART, bagging, random forests, and boosting to predict whether a mushroom is edible or poisonous based on its features.
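Before splitting, it is worth confirming the outcome's balance; a minimal check (not part of the original analysis, using base R):
# Class balance: counts and proportions of edible (e) vs. poisonous (p)
table(mushroom$class)
prop.table(table(mushroom$class))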
Split the data into a training and test set, each consisting of 50% of the entire sample. Do this randomly, not based on positioning in the dataset (i.e. do not take the first half as your test set, etc., but instead use the sample() function).
library(dplyr)  # for the pipe and mutate_all()
# Treat every column as a factor: all variables in this dataset are categorical
mushroom <- mushroom %>% mutate_all(as.factor)
train_idx <- sample(1:nrow(mushroom), nrow(mushroom)/2)
train_data <- mushroom[train_idx, ]
test_data <- mushroom[-train_idx, ]
Explanation: The dataset is randomly split into 50% training and 50% testing so that the split does not depend on row ordering. All variables are converted to factors because every column in this dataset is categorical, which is what the tree-based classifiers below expect.
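Note that the split above is unseeded, so it changes on each knit. A reproducible variant, assuming the set.seed(1234) call mentioned in the final part of this question, would be:
# Fix the RNG state so the train/test split is reproducible
set.seed(1234)
train_idx <- sample(seq_len(nrow(mushroom)), nrow(mushroom) / 2)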
Using a CART, fit a classification tree to the training data and employ cost complexity pruning with CV used to determine the optimal number of terminal nodes to include. Then test this “best” model on the test data set. Interpret the final best model using the tree structure, and visualize it / add text to represent the decisions.
library(tree)   # CART fitting and cost-complexity pruning
library(caret)  # confusionMatrix()
tree_model <- tree(class ~ ., data = train_data)
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
# Plot CV results
plot(cv_tree$size, cv_tree$dev, type = "b",
     xlab = "Tree Size", ylab = "CV Misclassification Error",
     main = "Cost-Complexity Pruning")
# Prune tree to optimal size
best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.misclass(tree_model, best = best_size)
# Visualize the pruned tree
plot(pruned_tree)
text(pruned_tree, pretty = 0)
# Predict and evaluate
tree_pred <- predict(pruned_tree, test_data, type = "class")
confusionMatrix(tree_pred, test_data$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction e p
## e 2095 5
## p 0 1962
##
## Accuracy : 0.9988
## 95% CI : (0.9971, 0.9996)
## No Information Rate : 0.5158
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9975
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 1.0000
## Specificity : 0.9975
## Pos Pred Value : 0.9976
## Neg Pred Value : 1.0000
## Prevalence : 0.5158
## Detection Rate : 0.5158
## Detection Prevalence : 0.5170
## Balanced Accuracy : 0.9987
##
## 'Positive' Class : e
##
Explanation: A classification tree (CART) was fit to the training data and pruned via cost-complexity pruning, with cross-validation used to choose the number of terminal nodes and thereby avoid overfitting. The cross-validation plot indicated the optimal tree size, and the tree was pruned to the size minimizing the CV misclassification error. Visual inspection of the pruned tree showed that odor was the primary and most influential splitter, consistent with the domain knowledge that smell is a strong indicator of mushroom edibility. The pruned tree achieved a test accuracy of 99.88%, with sensitivity = 1.0000 and specificity = 0.9975, as shown in the confusion matrix: only 5 of the 4,062 test mushrooms were misclassified. The model is interpretable and accurate, though potentially less flexible than the ensemble methods that follow.
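As a quick numeric check on the pruned tree (a small sketch reusing the objects created above):
# CV-selected number of terminal nodes and the resulting test error rate
best_size                           # size chosen via cv.tree()
mean(tree_pred != test_data$class)  # test misclassification rate (about 0.0012)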
Use bootstrap aggregation on the training set to build a model which can predict edible vs poisonous status. Test your resulting bagged tree model on the testing data created in part a). Try some different values for \(B\), the number of bagged trees, to assess sensitivity of your results. How many trees seems sufficient? Create variable importance plots for this final model and interpret the findings.
library(randomForest)  # bagging = random forest with mtry = number of predictors
tree_nums <- c(10, 50, 100, 200)
bagging_accuracies <- c()
for (b in tree_nums) {
  bag_model <- randomForest(class ~ ., data = train_data,
                            mtry = ncol(train_data) - 1, ntree = b)
  pred <- predict(bag_model, test_data)
  acc <- mean(pred == test_data$class)
  bagging_accuracies <- c(bagging_accuracies, acc)
}
# Plotting accuracies
library(ggplot2)
qplot(tree_nums, bagging_accuracies, geom = "line") +
  labs(title = "Bagging: Accuracy vs. Number of Trees",
       x = "Number of Trees", y = "Accuracy") +
  ylim(0.99, 1)
# Final bagging model
final_bag_model <- randomForest(class ~ ., data = train_data, mtry = ncol(train_data)-1, ntree = 100)
varImpPlot(final_bag_model)
Explanation: Bagging was implemented via randomForest() with mtry set to the total number of predictors, which reproduces bagging exactly because every feature is considered at each split. The model was tested with 10, 50, 100, and 200 trees. Accuracy improved and then stabilized around 99.9% once the number of trees exceeded 50, so a relatively small ensemble is already sufficient; 100 trees was used for the final model. The variable importance plot for that model identified odor, spore.print.color, and stalk.color.below.ring as the top three features contributing to classification. The stability of the accuracy across values of B reflects bagging's variance reduction, and bagging proved both powerful and reliable for this binary classification task.
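Because randomForest() records the out-of-bag error as trees are added, a one-line plot of the fitted final_bag_model (a sketch, not part of the graded output) shows how quickly the error stabilizes:
# OOB error rate as a function of the number of trees in the bagged model
plot(final_bag_model$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")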
Use random forests on the training set to build a model which can predict edible vs poisonous status. Test your resulting random forest model on the testing data created in part a). Try some alternative values for \(mtry\), the number of predictors which are “tried” as optimal splitters at each node, and also \(B\). Compare these models in terms of their test performance. Which \(B\) and \(mtry\) combination seems best? Create variable importance plots for this final model and interpret the findings.
mtry_values <- c(2, 4, 6)
rf_acc <- c()
for (m in mtry_values) {
  rf_model <- randomForest(class ~ ., data = train_data, mtry = m, ntree = 100)
  pred <- predict(rf_model, test_data)
  rf_acc <- c(rf_acc, mean(pred == test_data$class))
}
# Accuracy comparison
data.frame(mtry = mtry_values, accuracy = rf_acc)
## mtry accuracy
## 1 2 1
## 2 4 1
## 3 6 1
# Final random forest
final_rf <- randomForest(class ~ ., data = train_data, mtry = 4, ntree = 100)
varImpPlot(final_rf)
Explanation: Random forests extend bagging by also randomizing which features are considered at each split. Values of mtry of 2, 4, and 6 were tested with B = 100 trees, and every configuration yielded perfect accuracy (100%) on the test set, indicating that the model generalizes extremely well even when the number of candidate splitters per node is small. The final model (mtry = 4, B = 100) again highlighted odor as the most important predictor, followed by spore.print.color and several stalk and gill features. This consistency of feature importance across models suggests that the classes in this dataset are highly separable. Random forests offer a balance of flexibility, accuracy, and robustness, making them a standout among the tree-based methods.
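The numeric scores behind varImpPlot() can also be printed directly; a short sketch using importance() from randomForest:
# Mean decrease in Gini impurity, sorted from most to least important
imp <- importance(final_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]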
Use boosting on the training set to build a model which can predict edible vs poisonous status. Test your resulting boosted model on the testing data created in part a). You should assess alternative values for \(\lambda\), perhaps 0.001, 0.01, 0.1, etc. What do you notice about the relationship between \(\lambda\) and \(B\)? You may use "stumps" for trees in the boosting algorithm if you'd like, so you will not need to modify interaction.depth, and instead keep it set to 1. Which combination (of \(\lambda\), \(B\)) seems best? Create variable importance plots for this final model and interpret the findings.
library(gbm)  # gradient boosting
# Convert class to a 0/1 numeric outcome for gbm's bernoulli distribution
train_data$class_bin <- as.numeric(train_data$class) - 1
test_data$class_bin <- as.numeric(test_data$class) - 1
lambda_vals <- c(0.001, 0.01, 0.1)
boost_acc <- c()
for (l in lambda_vals) {
  boost_model <- gbm(class_bin ~ . - class, data = train_data,
                     distribution = "bernoulli", n.trees = 100,
                     interaction.depth = 1, shrinkage = l, verbose = FALSE)
  pred <- predict(boost_model, test_data, n.trees = 100, type = "response")
  pred_class <- ifelse(pred > 0.5, 1, 0)
  acc <- mean(pred_class == test_data$class_bin)
  boost_acc <- c(boost_acc, acc)
}
# Accuracy vs. learning rate
data.frame(lambda = lambda_vals, accuracy = boost_acc)
## lambda accuracy
## 1 0.001 0.9862137
## 2 0.010 0.9862137
## 3 0.100 0.9970458
# Final model and variable importance
final_boost <- gbm(class_bin ~ . - class, data = train_data,
                   distribution = "bernoulli", n.trees = 100,
                   interaction.depth = 1, shrinkage = 0.1, verbose = FALSE)
summary(final_boost)
## var rel.inf
## odor odor 94.585764830
## spore.print.color spore.print.color 4.761496972
## stalk.surface.below.ring stalk.surface.below.ring 0.263379743
## stalk.color.below.ring stalk.color.below.ring 0.250393033
## gill.size gill.size 0.131524823
## stalk.surface.above.ring stalk.surface.above.ring 0.007440599
## cap.shape cap.shape 0.000000000
## cap.surface cap.surface 0.000000000
## cap.color cap.color 0.000000000
## bruises bruises 0.000000000
## gill.attachment gill.attachment 0.000000000
## gill.spacing gill.spacing 0.000000000
## gill.color gill.color 0.000000000
## stalk.shape stalk.shape 0.000000000
## stalk.root stalk.root 0.000000000
## stalk.color.above.ring stalk.color.above.ring 0.000000000
## veil.type veil.type 0.000000000
## veil.color veil.color 0.000000000
## ring.number ring.number 0.000000000
## ring.type ring.type 0.000000000
## population population 0.000000000
## habitat habitat 0.000000000
Explanation: Boosting was performed using the gbm package with stumps (interaction.depth = 1) and learning rates (\(\lambda\)) of 0.001, 0.01, and 0.1, each with B = 100 trees. The best test performance came from shrinkage = 0.1, with an accuracy of approximately 99.7%; slightly below bagging and random forests, but still highly competitive. The model converged more quickly at higher learning rates, which illustrates the usual trade-off: for a smaller \(\lambda\), a larger \(B\) is needed to reach comparable accuracy, since each tree corrects the fit by only a small step. The relative influence summary showed that odor had an overwhelming influence, accounting for over 94% of the model's predictive power, while all other variables contributed little. This suggests that boosting, while sensitive to hyperparameter settings, readily captures the dominant structure in the data.
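To make the trade-off between \(\lambda\) and \(B\) explicit, one could rerun the grid with a larger \(B\) for the smaller learning rates. A hedged sketch (not run for the results above, which fixed B = 100):
# Smaller shrinkage generally needs more trees to reach the same accuracy
for (l in c(0.001, 0.01, 0.1)) {
  for (B in c(100, 1000)) {
    m <- gbm(class_bin ~ . - class, data = train_data,
             distribution = "bernoulli", n.trees = B,
             interaction.depth = 1, shrinkage = l, verbose = FALSE)
    p <- predict(m, test_data, n.trees = B, type = "response")
    cat(sprintf("lambda = %.3f, B = %4d: accuracy = %.4f\n",
                l, B, mean(ifelse(p > 0.5, 1, 0) == test_data$class_bin)))
  }
}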
Based on your results to b) through e), which statistical learning algorithm seems to yield the best accuracy results? Select the best model from all models. Is this likely to be the same answer that other students will get on this exam, assuming you correctly used your name in setting the seed on part a) (set.seed(1234))?
# Confusion matrix heatmap (example with random forest)
rf_pred <- predict(final_rf, test_data)
cm <- confusionMatrix(rf_pred, test_data$class)
# Plotting confusion matrix
conf_matrix <- as.data.frame(cm$table)
ggplot(conf_matrix, aes(Prediction, Reference, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix: Random Forest")
Explanation: After comparing the results of all four methods (CART, bagging, random forests, and boosting), it is clear that random forests and bagging outperform the others in both accuracy and robustness. Random forests slightly edge out bagging by decorrelating the trees while maintaining perfect test-set accuracy. Boosting performs very well when appropriately tuned but requires more careful calibration of \(\lambda\) and \(B\). CART, while the most interpretable, has slightly lower accuracy and may not generalize quite as well. Given its perfect, consistent performance across different mtry values and its clear feature-importance output, the random forest stands out as the best-performing model. Because the dataset is so easily separable and odor is such a dominant predictor, it is highly likely that other students using the same seed and methodology will arrive at a similar conclusion.
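For a compact side-by-side comparison, the test accuracies of the four final models can be tabulated from objects already defined above (a sketch; the boosting entry reuses boost_acc):
# Test-set accuracy of each final model
data.frame(method = c("CART (pruned)", "Bagging (B = 100)",
                      "Random forest (mtry = 4)", "Boosting (lambda = 0.1)"),
           accuracy = c(mean(tree_pred == test_data$class),
                        mean(predict(final_bag_model, test_data) == test_data$class),
                        mean(rf_pred == test_data$class),
                        boost_acc[3]))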
Question 2
Using the Bootstrap to estimate standard errors: In this problem, we will employ bootstrapping to estimate the variability of estimators for which standard software does not exist. Use the boot() function, or risk losing points!
Using the Auto data from package ISLR, estimate the standard error of the difference in median mpg (miles per gallon) between 4 cylinder and 8 cylinder cars.
library(ISLR)
library(boot)
library(dplyr)
data(Auto)
Auto <- Auto %>%
  mutate(cyl4 = ifelse(cylinders == 4, 1, 0),
         cyl8 = ifelse(cylinders == 8, 1, 0))
boot_func <- function(data, index) {
  d <- data[index, ]
  med4 <- median(d$mpg[d$cyl4 == 1])
  med8 <- median(d$mpg[d$cyl8 == 1])
  return(med4 - med8)
}
set.seed(146)
boot_result <- boot(data = Auto, statistic = boot_func, R = 1000)
boot_result
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot_func, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 14.4 -0.30035 0.8256265
Explanation: In this question, we estimated the standard error of the difference in median miles per gallon (mpg) between 4-cylinder and 8-cylinder cars using the Auto dataset from the ISLR package. Since standard software does not directly provide a standard error for a difference in medians, we used bootstrapping via the boot() function with R = 1,000 resampling iterations; the statistic function computes the difference in medians (median_4cyl - median_8cyl) for each bootstrap sample.
The original estimated difference in medians was 14.4 mpg, indicating that the median 4-cylinder car is far more fuel efficient than the median 8-cylinder car. The bootstrap-estimated standard error was approximately 0.83 mpg, with a small bias (−0.30), which implies that the estimate is stable. Given the large effect size relative to this standard error, the observed difference appears both statistically and practically significant. Bootstrapping proved to be an effective tool for quantifying the variability of a nonparametric statistic for which no analytical formula is readily available.
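The same boot object also yields confidence intervals; for example, a 95% percentile interval via boot.ci() from the boot package:
# 95% percentile bootstrap CI for the difference in medians
boot.ci(boot_result, type = "perc")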
cpus Data
Write your own function for conducting 10-fold cross validation to assess the test error in a linear regression model predicting estimated performance in the cpus dataset from R package MASS, using only minimum main memory and maximum main memory as predictors. DO NOT worry about including interaction effects, or higher order effects as in the first exam. Just a main effect for each variable will be sufficient. You need to show how to use the createFolds() function (it creates the folds for you) from the R package caret, as explained in class.
library(MASS)
library(caret)
data(cpus)
set.seed(146)
flds <- createFolds(cpus$perf, k = 10)
cv_errors <- numeric(10)
for (i in 1:10) {
  train_idx <- unlist(flds[-i])
  test_idx <- flds[[i]]
  model <- lm(perf ~ mmin + mmax, data = cpus[train_idx, ])
  preds <- predict(model, newdata = cpus[test_idx, ])
  cv_errors[i] <- mean((cpus$perf[test_idx] - preds)^2)
}
mean(cv_errors)
## [1] 6322.073
Explanation: In this task, we assessed the out-of-sample prediction error of a linear regression model using 10-fold cross-validation. The goal was to predict CPU performance (perf) from two predictors, minimum main memory (mmin) and maximum main memory (mmax), in the cpus dataset from the MASS package. To implement cross-validation, we used the createFolds() function from the caret package to split the data into 10 roughly equal parts, ensuring each fold served as the test set exactly once.
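A quick sanity check (an addition beyond the original code) confirms that the folds partition the data:
# Every row should appear in exactly one fold
length(unique(unlist(flds))) == nrow(cpus)  # TRUE if the folds partition the rows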
In each iteration, the model was trained on 90% of the data and tested on the remaining 10%. The mean squared prediction error was computed for each fold and then averaged across all 10 folds to yield an overall estimate. The final cross-validated MSE was approximately 6,322, which reflects the average squared prediction error of the model on new data and indicates moderate predictive performance. Since the model includes only two main-effect predictors, there may be room for improvement through interaction terms, non-linear transformations, or additional features. Nonetheless, the implementation demonstrates an effective and reproducible approach to evaluating model performance with cross-validation.
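To read this error on the response scale, a small follow-up computation on the cv_errors vector:
# Root mean squared error: typical prediction error in performance units
sqrt(mean(cv_errors))  # about 79.5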