This homework focuses on tuning and evaluating nonlinear regression models using both simulated and real-world datasets. We explore the impact of variable correlation on feature importance (8.1–8.3), investigate bias in tree-based models (8.4–8.6), and finally apply model tuning techniques to the Chemical Manufacturing Process dataset (8.7). Across the exercises, we rely heavily on the caret framework for resampling, model training, and performance evaluation. Tree-based techniques, including single trees (CART), bagging and random forests, and boosting (GBM), are compared using RMSE and R² to identify the most effective approach. Along the way, we also evaluate how model interpretation changes when predictors are duplicated or correlated.
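The same caret workflow recurs in every exercise below: define a resampling scheme with trainControl(), fit and tune a model with train(), and score held-out predictions with postResample() or confusionMatrix(). The following minimal sketch illustrates that pattern; the simulated data frame and the linear model are placeholders for illustration only and are not part of any exercise.
library(caret)
# Simulate a tiny regression data set (illustrative only)
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 * df$x1 - df$x2 + rnorm(100)
# Hold out 20% of the rows, tune with 5-fold CV, then score the hold-out
idx <- createDataPartition(df$y, p = 0.8, list = FALSE)
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = df[idx, ], method = "lm", trControl = ctrl)
postResample(predict(fit, df[-idx, ]), df[-idx, "y"])   # RMSE, Rsquared, MAE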
In this exercise, we evaluate how repeated cross-validation improves the stability of performance estimates for a classification dataset. Since the cells dataset was not available, we substitute the Sonar dataset (from the mlbench package) for this exercise.
library(caret)
library(mlbench)   # provides the Sonar data
data(Sonar)
names(Sonar)[ncol(Sonar)] <- "Class"   # the last column already holds the class labels
cells <- Sonar                         # stand-in for the unavailable cells data
set.seed(123)
train_idx <- createDataPartition(cells$Class, p = 0.8, list = FALSE)
train_data <- cells[train_idx, ]
test_data <- cells[-train_idx, ]
ctrl_cv <- trainControl(method = "cv", number = 10)
set.seed(123)
knn_cv <- train(Class ~ ., data = train_data, method = "knn", trControl = ctrl_cv, preProcess = c("center", "scale"))
cv_results <- predict(knn_cv, test_data)
cv_perf <- confusionMatrix(cv_results, test_data$Class)
cv_perf
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  M  R
##          M 19  5
##          R  3 14
##
## Accuracy : 0.8049
## 95% CI : (0.6513, 0.9118)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 0.0003284
##
## Kappa : 0.6048
##
## Mcnemar's Test P-Value : 0.7236736
##
## Sensitivity : 0.8636
## Specificity : 0.7368
## Pos Pred Value : 0.7917
## Neg Pred Value : 0.8235
## Prevalence : 0.5366
## Detection Rate : 0.4634
## Detection Prevalence : 0.5854
## Balanced Accuracy : 0.8002
##
## 'Positive' Class : M
##
ctrl_rcv <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
set.seed(123)
knn_rcv <- train(Class ~ ., data = train_data, method = "knn", trControl = ctrl_rcv, preProcess = c("center", "scale"))
rcv_results <- predict(knn_rcv, test_data)
rcv_perf <- confusionMatrix(rcv_results, test_data$Class)
rcv_perf
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  M  R
##          M 19  5
##          R  3 14
##
## Accuracy : 0.8049
## 95% CI : (0.6513, 0.9118)
## No Information Rate : 0.5366
## P-Value [Acc > NIR] : 0.0003284
##
## Kappa : 0.6048
##
## Mcnemar's Test P-Value : 0.7236736
##
## Sensitivity : 0.8636
## Specificity : 0.7368
## Pos Pred Value : 0.7917
## Neg Pred Value : 0.8235
## Prevalence : 0.5366
## Detection Rate : 0.4634
## Detection Prevalence : 0.5854
## Balanced Accuracy : 0.8002
##
## 'Positive' Class : M
##
cv_overall <- cv_perf$overall
rcv_overall <- rcv_perf$overall
comparison_df <- data.frame(
  Method = c("10-Fold CV", "10x10 Repeated CV"),
  Accuracy = c(cv_overall["Accuracy"], rcv_overall["Accuracy"]),
  Kappa = c(cv_overall["Kappa"], rcv_overall["Kappa"])
)
comparison_df
##              Method Accuracy     Kappa
## 1        10-Fold CV 0.804878 0.6048193
## 2 10x10 Repeated CV 0.804878 0.6048193
library(ggplot2)
ggplot(comparison_df, aes(x = Method, y = Accuracy, fill = Method)) +
  geom_bar(stat = "identity", width = 0.5) +
  ylim(0, 1) +
  labs(title = "Accuracy Comparison: 10-fold CV vs 10x10 Repeated CV", y = "Accuracy", x = "Cross-Validation Method") +
  theme_minimal() +
  theme(legend.position = "none")
On the held-out test set, both resampling schemes produced identical confusion matrices, so accuracy (0.805) and kappa (0.605) are unchanged. The value of repeated cross-validation lies in the resampling estimates themselves: averaging over 100 resamples rather than 10 gives a more stable, less fold-dependent picture of model performance. For smaller datasets like this one, repeated CV is the safer, more robust validation choice.
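The stability point can be checked directly from the resampling results each caret fit stores; the following is a minimal sketch, assuming knn_cv and knn_rcv from above are still in memory. The repeated-CV estimate averages over ten times as many resamples, which makes the reported accuracy less sensitive to any single fold split.
# Per-resample accuracy for the selected k is stored in each fit's $resample
stability <- data.frame(
  Method = c("10-Fold CV", "10x10 Repeated CV"),
  Resamples = c(nrow(knn_cv$resample), nrow(knn_rcv$resample)),
  MeanAccuracy = c(mean(knn_cv$resample$Accuracy), mean(knn_rcv$resample$Accuracy)),
  SD_Accuracy = c(sd(knn_cv$resample$Accuracy), sd(knn_rcv$resample$Accuracy))
)
stability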
Continuing the variable-importance analysis, we next fit conditional inference forests (cforest, from the party package) on the simulated data, first without and then with correlated copies of V1.
library(party)
# sim_df is the simulated data (predictors V1-V10, response y) built earlier
# in this exercise; remove any duplicated predictors before refitting.
# (A) cforest without correlated predictors
sim_df$duplicate1 <- NULL
sim_df$duplicate2 <- NULL
cf1 <- cforest(y ~ ., data = sim_df)
cfImp1 <- as.data.frame(varimp(cf1, conditional = TRUE))
cfImp1[order(-cfImp1[,1]), , drop = FALSE]
## varimp(cf1, conditional = TRUE)
## V4 6.741766e+00
## V1 5.699856e+00
## V2 5.190114e+00
## V5 1.182547e+00
## V3 6.219633e-03
## V7 1.491985e-05
## V9 -7.355064e-03
## V8 -8.383608e-03
## V6 -1.104739e-02
## V10 -2.342891e-02
# (B) Add one correlated predictor
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1
cf2 <- cforest(y ~ ., data = sim_df)
cfImp2 <- as.data.frame(varimp(cf2, conditional = TRUE))
cfImp2[order(-cfImp2[,1]), , drop = FALSE]
## varimp(cf2, conditional = TRUE)
## V4 5.889744e+00
## V2 4.900098e+00
## V1 3.282083e+00
## V5 1.070901e+00
## duplicate1 1.034389e+00
## V3 1.832067e-02
## V8 6.760156e-03
## V6 6.613369e-03
## V10 2.278647e-05
## V9 -6.162727e-04
## V7 -6.856890e-03
# (C) Add another correlated predictor
sim_df$duplicate2 <- sim_df$V1 + rnorm(200) * 0.1
cf3 <- cforest(y ~ ., data = sim_df)
cfImp3 <- as.data.frame(varimp(cf3, conditional = TRUE))
cfImp3[order(-cfImp3[,1]), , drop = FALSE]
## varimp(cf3, conditional = TRUE)
## V4 5.663778265
## V2 4.562438530
## V1 1.962675466
## V5 0.984166166
## duplicate2 0.882001420
## duplicate1 0.586783379
## V3 0.028632107
## V7 0.007978408
## V6 -0.001255087
## V8 -0.001865218
## V10 -0.008033569
## V9 -0.009085797
Unlike random forests, conditional inference forests (cforest) kept V1 among the top-ranked predictors even after duplicate1 and duplicate2 were added, although its score still declined as the duplicates absorbed part of the shared signal. Conditional importance adjusts for correlation among predictors, so the contribution of V1 was not split up as severely as with the standard random forest importance. Key takeaway: if correlated predictors threaten to distort variable importance, use a conditional importance method such as varimp(..., conditional = TRUE) on a cforest model.
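To see the adjustment at work, both flavours of importance can be computed on the same fitted forest. This is a minimal sketch assuming cf3 from above is still available; note that conditional importance is considerably slower, since it permutes each predictor within strata defined by the predictors correlated with it.
set.seed(123)
imp_uncond <- varimp(cf3, conditional = FALSE)   # standard permutation importance
set.seed(123)
imp_cond <- varimp(cf3, conditional = TRUE)      # adjusts for correlated predictors
ord <- names(sort(imp_uncond, decreasing = TRUE))
round(data.frame(unconditional = imp_uncond[ord], conditional = imp_cond[ord]), 3)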
library(AppliedPredictiveModeling)
set.seed(0)
data(ChemicalManufacturingProcess)
processPredictors <- ChemicalManufacturingProcess[, 2:58]   # 57 biological/process predictors
yield <- ChemicalManufacturingProcess[, 1]                   # response: product yield
# Impute missing values with each predictor's median
replacements <- sapply(processPredictors, median, na.rm = TRUE)
for (ci in 1:ncol(processPredictors)) {
  na_index <- is.na(processPredictors[, ci])
  processPredictors[na_index, ci] <- replacements[ci]
}
# Drop near-zero-variance predictors (guard against an empty index)
zero_cols <- nearZeroVar(processPredictors)
if (length(zero_cols) > 0) {
  processPredictors <- processPredictors[, -zero_cols]
}
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[trainIndex, ]
trainY <- yield[trainIndex]
testX <- processPredictors[-trainIndex, ]
testY <- yield[-trainIndex]
ctrl <- trainControl(method = "boot")   # bootstrap resampling, passed to each model below
preProc <- c("center", "scale")         # center and scale predictors before fitting
# CART (rpart)
set.seed(0)
rpartModel <- train(x = trainX, y = trainY, method = "rpart", trControl = ctrl, preProc = preProc, tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Random Forest
set.seed(0)
rfModel <- train(x = trainX, y = trainY, method = "rf", trControl = ctrl, preProc = preProc, tuneLength = 10)
# Gradient Boosting Machine (GBM)
gbmGrid <- expand.grid(
  n.trees = seq(100, 1000, by = 100),
  interaction.depth = seq(1, 7, by = 2),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = 10
)
set.seed(0)
gbmModel <- train(x = trainX, y = trainY, method = "gbm", trControl = ctrl, preProc = preProc, tuneGrid = gbmGrid, verbose = FALSE)
# Summarize results
rpart_perf <- postResample(predict(rpartModel, testX), testY)
rf_perf <- postResample(predict(rfModel, testX), testY)
gbm_perf <- postResample(predict(gbmModel, testX), testY)
results_8_7 <- data.frame(
  Model = c("CART", "Random Forest", "GBM"),
  RMSE = c(rpart_perf["RMSE"], rf_perf["RMSE"], gbm_perf["RMSE"]),
  Rsquared = c(rpart_perf["Rsquared"], rf_perf["Rsquared"], gbm_perf["Rsquared"])
)
results_8_7
##           Model      RMSE  Rsquared
## 1          CART 1.5044822 0.2696417
## 2 Random Forest 0.8778014 0.7535474
## 3           GBM 0.9484934 0.7096870
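Since the random forest gives the lowest test RMSE here, a natural follow-up is to inspect which predictors drive its fit. The sketch below uses caret's varImp on the rfModel object from above; whether manufacturing process variables or biological material measurements dominate the top of the list is the usual question of interest for this dataset.
rfImp <- varImp(rfModel, scale = TRUE)   # scaled importance from the tuned random forest
plot(rfImp, top = 15)                    # dot plot of the 15 most important predictors
head(rfImp$importance[order(-rfImp$importance$Overall), , drop = FALSE], 10)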
This set of exercises showed how model performance and feature importance can be sensitive to collinearity and tuning decisions. Repeated cross-validation stabilized performance estimates (8.1), and conditional importance measures proved more resilient to predictor redundancy than standard random forest importance (8.1–8.3). In 8.7, tuning CART, random forests, and GBM on the Chemical Manufacturing Process data left the random forest with the lowest test RMSE (0.88) and highest R² (0.75), and underscored how preprocessing, feature selection, and tuning choices shape predictive performance. No model is one-size-fits-all: a thoughtful balance between interpretability, flexibility, and raw predictive power is always needed.