Questions 1) Experimenting with tree-based methods: In this problem, you will use CARTs, bagging, random forests, and boosting in an attempt to accurately predict whether mushrooms are edible or poisonous, based on a large collection of variables describing the mushrooms (cap shape, cap surface, odor, gill spacing, gill size, stalk color, etc.). The data can be found under the Files tab in folder Exam 2/Final Exam(mushrooms.csv) on Canvas. More information regarding this data (including variable level designations for interpretation) can be found at the following site: https://www.kaggle.com/uciml/mushroom-classification/data
library(tree)
mushrooms <- read.csv("/Users/ahlad/Downloads/mushrooms.csv")
mushrooms[] <- lapply(mushrooms, function(x) if (is.character(x)) factor(x) else x)
mushrooms$class <- as.factor(mushrooms$class)
set.seed(1234)
n <- nrow(mushrooms)
train_indices <- sample(1:n, size = n / 2)
train_data <- mushrooms[train_indices, ]
test_data <- mushrooms[-train_indices, ]
We treat the class target variable as a factor for classification.
# Numeric 0/1 label for use with gbm later (bernoulli requires a numeric response)
train_data$label <- ifelse(train_data$class == "e", 1, 0)
test_data$label <- ifelse(test_data$class == "e", 1, 0)
tree_model <- tree(class ~ . -label, data = train_data,
control = tree.control(nobs = nrow(train_data), mindev = 0.001))
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.misclass(tree_model, best = best_size)
plot(pruned_tree)
text(pruned_tree, pretty = 0)
pred_tree <- predict(pruned_tree, newdata = test_data, type = "class")
tree_acc <- mean(pred_tree == test_data$class)
tree_acc
## [1] 1
We fit a classification tree using the tree() function and employed cost-complexity pruning via cross-validation to determine the optimal size. The resulting pruned tree split first on odor, the most informative feature: mushrooms with an almond, anise, or no odor were routed toward the edible side (with subsequent splits resolving the remaining cases), while any other odor indicated a poisonous mushroom. The model achieved 100% accuracy on the test data, which confirms that mushrooms in this dataset are highly separable by a few strong categorical predictors. This model is both accurate and interpretable, making it a strong baseline before applying ensemble methods.
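As a sanity check on the perfect test accuracy, one can tabulate predictions against the true classes; a minimal sketch using the objects defined above (all off-diagonal counts should be zero):
# Confusion matrix for the pruned tree on the test set
table(Predicted = pred_tree, Actual = test_data$class)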
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
bag_model <- randomForest(class ~ . -label, data = train_data,
mtry = ncol(train_data) - 2, # exclude class + label
ntree = 500)
bag_preds <- predict(bag_model, newdata = test_data)
bag_acc <- mean(bag_preds == test_data$class)
bag_acc
## [1] 0.9995076
varImpPlot(bag_model)
We excluded label from the predictors because it is a direct encoding of the response and would leak the answer. We used bagging with 500 trees; the model achieved 99.95% accuracy on the test set, indicating excellent generalization. Testing different values of B (the number of trees) showed that accuracy stabilized around 100–200 trees, as sketched below. The variable importance plot shows that odor is by far the most influential feature, followed by spore.print.color and stalk.color.below.ring. These results confirm that the mushroom data is highly separable and that ensemble methods like bagging are extremely effective for this classification task.
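The stabilization claim can be checked directly by refitting the bagged model over a grid of B values; a sketch (the specific grid is our choice, for illustration only):
# Sketch: sensitivity of bagging accuracy to the number of trees B
for (B in c(25, 50, 100, 200, 500)) {
  m <- randomForest(class ~ . -label, data = train_data,
                    mtry = ncol(train_data) - 2, ntree = B)
  cat("B =", B, "accuracy =",
      round(mean(predict(m, newdata = test_data) == test_data$class), 5), "\n")
}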
rf2 <- randomForest(class ~ . -label, data = train_data, mtry = 2, ntree = 300) # exclude label to avoid leakage
rf4 <- randomForest(class ~ . -label, data = train_data, mtry = 4, ntree = 300)
rf6 <- randomForest(class ~ . -label, data = train_data, mtry = 6, ntree = 300)
acc2 <- mean(predict(rf2, test_data) == test_data$class)
acc4 <- mean(predict(rf4, test_data) == test_data$class)
acc6 <- mean(predict(rf6, test_data) == test_data$class)
c(acc2, acc4, acc6)
## [1] 1 1 1
We trained random forest models with mtry = 2, 4, and 6, each using 300 trees. All models achieved perfect accuracy (1) on the test set, showing that the ensemble is highly effective regardless of mtry. Lower mtry increases randomness between trees, decorrelating them and reducing the variance of the ensemble, while higher mtry makes each individual tree stronger; in this case all configurations performed equally well. The dataset is clearly separable, so random forest delivers consistent performance across different mtry values.
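One way to visualize this is to overlay the out-of-bag error curves stored in each fitted randomForest object; a sketch (colors and layout are our choices):
# Sketch: OOB error as trees accumulate, for each mtry setting
plot(rf2$err.rate[, "OOB"], type = "l", col = "black",
     xlab = "Number of trees", ylab = "OOB error")
lines(rf4$err.rate[, "OOB"], col = "red")
lines(rf6$err.rate[, "OOB"], col = "blue")
legend("topright", legend = c("mtry = 2", "mtry = 4", "mtry = 6"),
       col = c("black", "red", "blue"), lty = 1)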
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
boost_01 <- gbm(label ~ . - class, data = train_data, distribution = "bernoulli",
n.trees = 1000, shrinkage = 0.01, interaction.depth = 1)
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
## : variable 16: veil.type has no variation.
preds_01 <- ifelse(predict(boost_01, test_data, n.trees = 1000, type = "response") > 0.5, 1, 0)
acc_01 <- mean(preds_01 == test_data$label)
acc_01
## [1] 0.9975382
We applied boosting using 1,000 trees and a shrinkage rate of 0.01, treating label as the binary outcome. The model achieved 99.75% accuracy, performing slightly below random forest and bagging. Boosting generally performed best with lower shrinkage values like 0.01, provided more trees were used; this reflects the typical trade-off that a smaller shrinkage (lambda) requires a larger number of trees (B) for optimal results. gbm warned that veil.type has no variation; since a variable with no variance can never produce a useful split, it is effectively ignored.
Overall, boosting is highly effective, but slightly more sensitive to tuning than bagging or random forest, which gave perfect accuracy more easily.
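To illustrate the lambda/B trade-off concretely, one could refit the booster over a small grid of shrinkage values; a sketch (the grid is illustrative, and n.trees is held at 1,000 as above):
# Sketch: test accuracy across shrinkage values at a fixed number of trees
for (lam in c(0.001, 0.01, 0.1)) {
  b <- gbm(label ~ . - class, data = train_data, distribution = "bernoulli",
           n.trees = 1000, shrinkage = lam, interaction.depth = 1)
  p <- ifelse(predict(b, test_data, n.trees = 1000, type = "response") > 0.5, 1, 0)
  cat("shrinkage =", lam, "accuracy =", round(mean(p == test_data$label), 5), "\n")
}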
model_accuracies <- c(
Bagging = bag_acc,
RF_mtry2 = acc2,
RF_mtry4 = acc4,
RF_mtry6 = acc6,
CART = tree_acc,
Boosting = acc_01
)
best_model_name <- names(which.max(model_accuracies))
best_accuracy <- max(model_accuracies)
cat("Model Accuracies:\n")
## Model Accuracies:
print(round(model_accuracies, 5))
## Bagging RF_mtry2 RF_mtry4 RF_mtry6 CART Boosting
## 0.99951 1.00000 1.00000 1.00000 1.00000 0.99754
cat("\nBest Model:", best_model_name, "with accuracy:", best_accuracy, "\n")
##
## Best Model: RF_mtry2 with accuracy: 1
The test accuracy was perfect (1.0) for CART and all random forest models, while bagging (99.95%) and boosting (~99.75%) came in slightly lower.
Among the top performers, we selected the random forest with mtry = 2 as the best model: it achieved perfect accuracy, generalizes better than a single tree, and requires less tuning than boosting.
It is highly likely that other students will get the same answer if they use set.seed(1234) as instructed and perform all analyses correctly. The Mushroom dataset is extremely well-separated (especially by the odor variable), so ensemble methods often reach perfect classification regardless of minor parameter differences.
library(ISLR)
library(boot)
# Keep only 4- and 8-cylinder cars
Auto_48 <- subset(Auto, cylinders %in% c(4, 8))
# Statistic for boot(): difference in median mpg (4-cyl minus 8-cyl) on each resample
diff_median_fn <- function(data, indices) {
sample_data <- data[indices, ]
med4 <- median(sample_data$mpg[sample_data$cylinders == 4])
med8 <- median(sample_data$mpg[sample_data$cylinders == 8])
return(med4 - med8)
}
set.seed(1234)
boot_out <- boot(data = Auto_48, statistic = diff_median_fn, R = 1000)
boot_out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto_48, statistic = diff_median_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 14.4 -0.28655 0.7817146
bootstrap_se <- sd(boot_out$t)
bootstrap_se
## [1] 0.7817146
We used the boot() function to estimate the standard error of the difference in median mpg between 4- and 8-cylinder cars using 1,000 bootstrap resamples. The observed difference in medians was 14.4 mpg, meaning that 4-cylinder cars typically achieve much better fuel efficiency than 8-cylinder cars. The estimated bootstrap standard error was approximately 0.78 mpg, and the bias was small (−0.29), indicating the estimator is stable under resampling. Because no formula-based standard error exists for this difference in medians, bootstrapping is a practical and statistically valid approach to estimating its variability.
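Beyond the standard error, the same boot object supports interval estimates; a sketch of a percentile bootstrap confidence interval for the median difference:
# Percentile bootstrap CI from the resamples already stored in boot_out
boot.ci(boot_out, type = "perc")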
library(MASS)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
data(cpus)
cpus_clean <- na.omit(cpus)
names(cpus_clean)
## [1] "name" "syct" "mmin" "mmax" "cach" "chmin" "chmax"
## [8] "perf" "estperf"
# Custom k-fold CV for a linear model: returns the average test MSE across folds
cv_mse_linear <- function(df, response, predictors, k = 10) {
set.seed(1234)
folds <- createFolds(df[[response]], k = k, list = TRUE)
mse_list <- c()
for (i in seq_along(folds)) {
test_idx <- folds[[i]]
train_data <- df[-test_idx, ]
test_data <- df[test_idx, ]
formula_str <- paste(response, "~", paste(predictors, collapse = " + "))
model <- lm(as.formula(formula_str), data = train_data)
pred <- predict(model, newdata = test_data)
actual <- test_data[[response]]
mse <- mean((pred - actual)^2)
mse_list[i] <- mse
}
return(mean(mse_list))
}
cv_result <- cv_mse_linear(
df = cpus_clean,
response = "perf",
predictors = c("mmin", "mmax"),
k = 10
)
cat("Estimated 10-fold CV MSE:", round(cv_result, 3), "\n")
## Estimated 10-fold CV MSE: 6315.865
We created a custom 10-fold cross-validation function using createFolds() from the caret package (MASS supplies the cpus data). Using mmin and mmax as predictors for CPU performance (perf), the model was evaluated on unseen data across 10 splits; the estimated test error is the average mean squared error (MSE) across folds, which gives a realistic measure of how well the model generalizes.
The estimated 10-fold CV test MSE is approximately 6315.87 (an RMSE of roughly 79.5 performance units), which suggests moderate prediction error using just the two memory-related predictors mmin and mmax.
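As a cross-check on the custom function, boot::cv.glm() computes the same kind of 10-fold CV estimate for a Gaussian GLM; a sketch (fold assignments differ, so the estimate will not match exactly):
# Sketch: 10-fold CV MSE via cv.glm as an independent check
set.seed(1234)
glm_fit <- glm(perf ~ mmin + mmax, data = cpus_clean)
cv.glm(cpus_clean, glm_fit, K = 10)$delta[1]  # raw cross-validation estimate of the MSE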