0.1 Introduction

This homework focuses on tuning and evaluating nonlinear regression and classification models using both simulated and real-world datasets. We examine how repeated cross-validation affects the stability of performance estimates (8.1), explore the impact of correlated predictors on variable importance in random forests and conditional inference forests (8.2–8.3), and apply model tuning techniques to the Chemical Manufacturing Process dataset (8.7). Across the exercises, we rely heavily on the caret framework for resampling, model training, and performance evaluation. Tree-based techniques, including CART, random forests, and gradient boosting, are compared using RMSE and R² to identify the most effective approach. Along the way, we also evaluate how model interpretation changes when predictors are duplicated or correlated.

1 Exercise 8.1 - Repeated CV on the Cells Dataset

In this exercise, we evaluate how repeated cross-validation improves the stability of performance estimates for a classification dataset. Since the cells dataset was not available, we substitute the Sonar dataset from the mlbench package.

library(caret)    # createDataPartition, train, confusionMatrix
library(mlbench)  # Sonar dataset

data(Sonar)
names(Sonar)[ncol(Sonar)] <- "Class"  # response column (already named Class in Sonar)
cells <- Sonar

set.seed(123)
train_idx <- createDataPartition(cells$Class, p = 0.8, list = FALSE)
train_data <- cells[train_idx, ]
test_data <- cells[-train_idx, ]

1.1 Train with 10-fold CV

ctrl_cv <- trainControl(method = "cv", number = 10)
set.seed(123)
knn_cv <- train(Class ~ ., data = train_data, method = "knn", trControl = ctrl_cv, preProcess = c("center", "scale"))
cv_results <- predict(knn_cv, test_data)
cv_perf <- confusionMatrix(cv_results, test_data$Class)
cv_perf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 19  5
##          R  3 14
##                                           
##                Accuracy : 0.8049          
##                  95% CI : (0.6513, 0.9118)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.0003284       
##                                           
##                   Kappa : 0.6048          
##                                           
##  Mcnemar's Test P-Value : 0.7236736       
##                                           
##             Sensitivity : 0.8636          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.7917          
##          Neg Pred Value : 0.8235          
##              Prevalence : 0.5366          
##          Detection Rate : 0.4634          
##    Detection Prevalence : 0.5854          
##       Balanced Accuracy : 0.8002          
##                                           
##        'Positive' Class : M               
## 

1.2 Train with Repeated CV (10-fold repeated 10 times)

ctrl_rcv <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
set.seed(123)
knn_rcv <- train(Class ~ ., data = train_data, method = "knn", trControl = ctrl_rcv, preProcess = c("center", "scale"))
rcv_results <- predict(knn_rcv, test_data)
rcv_perf <- confusionMatrix(rcv_results, test_data$Class)
rcv_perf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 19  5
##          R  3 14
##                                           
##                Accuracy : 0.8049          
##                  95% CI : (0.6513, 0.9118)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.0003284       
##                                           
##                   Kappa : 0.6048          
##                                           
##  Mcnemar's Test P-Value : 0.7236736       
##                                           
##             Sensitivity : 0.8636          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.7917          
##          Neg Pred Value : 0.8235          
##              Prevalence : 0.5366          
##          Detection Rate : 0.4634          
##    Detection Prevalence : 0.5854          
##       Balanced Accuracy : 0.8002          
##                                           
##        'Positive' Class : M               
## 

1.3 Test-Set Performance Comparison

cv_overall <- cv_perf$overall
rcv_overall <- rcv_perf$overall

comparison_df <- data.frame(
  Method = c("10-Fold CV", "10x10 Repeated CV"),
  Accuracy = c(cv_overall["Accuracy"], rcv_overall["Accuracy"]),
  Kappa = c(cv_overall["Kappa"], rcv_overall["Kappa"])
)

comparison_df
##              Method Accuracy     Kappa
## 1        10-Fold CV 0.804878 0.6048193
## 2 10x10 Repeated CV 0.804878 0.6048193

1.4 Visualization of Accuracy

library(ggplot2)

ggplot(comparison_df, aes(x = Method, y = Accuracy, fill = Method)) +
  geom_bar(stat = "identity", width = 0.5) +
  ylim(0, 1) +
  labs(title = "Accuracy Comparison: 10-fold CV vs 10x10 Repeated CV", y = "Accuracy", x = "Cross-Validation Method") +
  theme_minimal() +
  theme(legend.position = "none")

1.5 Interpretation

Both resampling schemes selected the same final k-NN model, so the test-set confusion matrices and accuracies are identical. The benefit of repeated cross-validation is not a better held-out score but a more stable resampling estimate: averaging accuracy over 100 resamples instead of 10 reduces the variance of the estimate used to choose the tuning parameter. In smaller datasets, repeated CV is therefore a safer, more robust validation choice.
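To make the stability claim concrete, the per-resample accuracies stored in each train object can be compared directly. The following is a minimal sketch, assuming knn_cv and knn_rcv from above are still in the workspace:

# Per-resample accuracy for the selected model: 10 values for plain CV,
# 100 (10 folds x 10 repeats) for repeated CV. More resamples means a
# less noisy average, which is where the added stability comes from.
cv_acc  <- knn_cv$resample$Accuracy
rcv_acc <- knn_rcv$resample$Accuracy

data.frame(
  Method    = c("10-Fold CV", "10x10 Repeated CV"),
  Resamples = c(length(cv_acc), length(rcv_acc)),
  MeanAcc   = c(mean(cv_acc), mean(rcv_acc)),
  SD        = c(sd(cv_acc), sd(rcv_acc)),
  SEofMean  = c(sd(cv_acc) / sqrt(length(cv_acc)),
                sd(rcv_acc) / sqrt(length(rcv_acc)))
)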


2 Exercise 8.2 - Friedman1 Simulated Data with Correlated Predictors


library(randomForest)  # randomForest() and permutation-based importance

set.seed(200)
sim_data <- mlbench.friedman1(200, sd = 1)
sim_df <- cbind(sim_data$x, y = sim_data$y)
sim_df <- as.data.frame(sim_df)

# (A) Random Forest without correlated predictors
rf1 <- randomForest(y ~ ., data = sim_df, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(rf1, scale = FALSE)
rfImp1[order(-rfImp1$Overall), , drop = FALSE]
##          Overall
## V1   8.732235404
## V4   7.615118809
## V2   6.415369387
## V5   2.023524577
## V3   0.763591825
## V6   0.165111172
## V7  -0.005961659
## V10 -0.074944788
## V9  -0.095292651
## V8  -0.166362581
# (B) Add one correlated predictor
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1
rf2 <- randomForest(y ~ ., data = sim_df, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(rf2, scale = FALSE)
rfImp2[order(-rfImp2$Overall), , drop = FALSE]
##                Overall
## V4          7.04752238
## V2          6.06896061
## V1          5.69119973
## duplicate1  4.28331581
## V5          1.87238438
## V3          0.62970218
## V6          0.13569065
## V10         0.02894814
## V9          0.00840438
## V7         -0.01345645
## V8         -0.04370565
# (C) Add another correlated predictor
sim_df$duplicate2 <- sim_df$V1 + rnorm(200) * 0.1
rf3 <- randomForest(y ~ ., data = sim_df, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(rf3, scale = FALSE)
rfImp3[order(-rfImp3$Overall), , drop = FALSE]
##                Overall
## V4          7.04870917
## V2          6.52816504
## V1          4.91687329
## duplicate1  3.80068234
## V5          2.03115561
## duplicate2  1.87721959
## V3          0.58711552
## V6          0.14213148
## V7          0.10991985
## V10         0.09230576
## V9         -0.01075028
## V8         -0.08405687

2.1 Interpretation

When we added duplicate1 and duplicate2, the variable importance for the original V1 dropped noticeably. The model split importance among V1, duplicate1, and duplicate2, diluting the original signal. Random forests struggle with correlated features, making it harder to trust "importance" rankings when collinearity exists.
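The dilution is easy to verify by checking how strongly the added predictors track V1. A quick sketch, using the sim_df that still contains both duplicates:

# Near-perfect collinearity between V1 and its two noisy copies explains
# why the forest spreads V1's importance across all three columns.
round(cor(sim_df[, c("V1", "duplicate1", "duplicate2")]), 3)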

3 Exercise 8.3 - Friedman1 Extended Variable Importance


library(party)

# (A) Cforest without correlated predictors
sim_df$duplicate1 <- NULL
sim_df$duplicate2 <- NULL
cf1 <- cforest(y ~ ., data = sim_df)
cfImp1 <- as.data.frame(varimp(cf1, conditional = TRUE))
cfImp1[order(-cfImp1[,1]), , drop = FALSE]
##     varimp(cf1, conditional = TRUE)
## V4                     6.741766e+00
## V1                     5.699856e+00
## V2                     5.190114e+00
## V5                     1.182547e+00
## V3                     6.219633e-03
## V7                     1.491985e-05
## V9                    -7.355064e-03
## V8                    -8.383608e-03
## V6                    -1.104739e-02
## V10                   -2.342891e-02
# (B) Add one correlated predictor
sim_df$duplicate1 <- sim_df$V1 + rnorm(200) * 0.1
cf2 <- cforest(y ~ ., data = sim_df)
cfImp2 <- as.data.frame(varimp(cf2, conditional = TRUE))
cfImp2[order(-cfImp2[,1]), , drop = FALSE]
##            varimp(cf2, conditional = TRUE)
## V4                            5.889744e+00
## V2                            4.900098e+00
## V1                            3.282083e+00
## V5                            1.070901e+00
## duplicate1                    1.034389e+00
## V3                            1.832067e-02
## V8                            6.760156e-03
## V6                            6.613369e-03
## V10                           2.278647e-05
## V9                           -6.162727e-04
## V7                           -6.856890e-03
# (C) Add another correlated predictor
sim_df$duplicate2 <- sim_df$V1 + rnorm(200) * 0.1
cf3 <- cforest(y ~ ., data = sim_df)
cfImp3 <- as.data.frame(varimp(cf3, conditional = TRUE))
cfImp3[order(-cfImp3[,1]), , drop = FALSE]
##            varimp(cf3, conditional = TRUE)
## V4                             5.663778265
## V2                             4.562438530
## V1                             1.962675466
## V5                             0.984166166
## duplicate2                     0.882001420
## duplicate1                     0.586783379
## V3                             0.028632107
## V7                             0.007978408
## V6                            -0.001255087
## V8                            -0.001865218
## V10                           -0.008033569
## V9                            -0.009085797

3.1 Interpretation

Unlike random forests, conditional inference forests (cforest) kept V1 ranked among the top predictors even after adding duplicate1 and duplicate2. Conditional importance adjusts for correlation among predictors, so the signal from V1 was not artificially split up. Key takeaway: if correlated predictors are distorting importance rankings, use a conditional importance measure.
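To see how much the conditional adjustment matters, the unconditional permutation importance can be placed next to the conditional values already computed. A small sketch reusing the cf3 fit and cfImp3 from above:

# Unconditional importance on the same cforest fit behaves more like the
# random forest rankings in 8.2, sharing V1's signal with the duplicates.
uncond <- sort(varimp(cf3, conditional = FALSE), decreasing = TRUE)
cond   <- cfImp3[names(uncond), 1]
round(data.frame(Unconditional = uncond, Conditional = cond), 3)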

4 Exercise 8.7 - Chemical Manufacturing Process - Tree-Based Models

library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data

set.seed(0)
data(ChemicalManufacturingProcess)

processPredictors <- ChemicalManufacturingProcess[,2:58]
yield <- ChemicalManufacturingProcess[,1]

# Impute missing values with each predictor's median
replacements <- sapply(processPredictors, median, na.rm = TRUE)
for (ci in 1:ncol(processPredictors)) {
  na_index <- is.na(processPredictors[, ci])
  processPredictors[na_index, ci] <- replacements[ci]
}

# Drop near-zero-variance predictors; the length check guards against an
# empty index, which would otherwise drop every column
zero_cols <- nearZeroVar(processPredictors)
if (length(zero_cols) > 0) processPredictors <- processPredictors[, -zero_cols]

trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[trainIndex, ]
trainY <- yield[trainIndex]
testX <- processPredictors[-trainIndex, ]
testY <- yield[-trainIndex]

ctrl <- trainControl(method = "boot")
preProc <- c("center", "scale")

# CART (rpart)
set.seed(0)
rpartModel <- train(x = trainX, y = trainY, method = "rpart", trControl = ctrl, preProc = preProc, tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Random Forest
set.seed(0)
rfModel <- train(x = trainX, y = trainY, method = "rf", trControl = ctrl, preProc = preProc, tuneLength = 10)

# Gradient Boosting Machine (GBM)
gbmGrid <- expand.grid(
  n.trees = seq(100, 1000, by = 100),
  interaction.depth = seq(1, 7, by = 2),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = 10
)

set.seed(0)
gbmModel <- train(x = trainX, y = trainY, method = "gbm", trControl = ctrl, preProc = preProc, tuneGrid = gbmGrid, verbose = FALSE)

# Summarize results
rpart_perf <- postResample(predict(rpartModel, testX), testY)
rf_perf <- postResample(predict(rfModel, testX), testY)
gbm_perf <- postResample(predict(gbmModel, testX), testY)

results_8_7 <- data.frame(
  Model = c("CART", "Random Forest", "GBM"),
  RMSE = c(rpart_perf["RMSE"], rf_perf["RMSE"], gbm_perf["RMSE"]),
  Rsquared = c(rpart_perf["Rsquared"], rf_perf["Rsquared"], gbm_perf["Rsquared"])
)

results_8_7
##           Model      RMSE  Rsquared
## 1          CART 1.5044822 0.2696417
## 2 Random Forest 0.8778014 0.7535474
## 3           GBM 0.9484934 0.7096870
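Because the held-out test set contains only about 35 observations, a single split is a noisy basis for comparison. The bootstrap resampling profiles stored in each train object give another view; a sketch using caret's resamples(), assuming all three models were fit with the same 25 bootstrap resamples:

# Pool the resampling results from the three fitted models and compare
# their RMSE and R-squared distributions.
resamps <- resamples(list(CART = rpartModel, RF = rfModel, GBM = gbmModel))
summary(resamps)
bwplot(resamps, metric = "RMSE")  # lattice plot; lattice is loaded with caret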

4.1 Reflection

This set of exercises showed how model performance and feature importance can be sensitive to collinearity and to resampling and tuning decisions. Repeated cross-validation stabilized performance estimates (8.1), and conditional inference forests proved more resilient to predictor redundancy than standard random forests (8.2–8.3). In 8.7, tuning CART, random forest, and GBM models on real data emphasized how preprocessing, feature selection, and tuning affect predictive performance, with the random forest achieving the lowest test RMSE. No model is one-size-fits-all: a thoughtful balance between interpretability, flexibility, and raw predictive power is always needed.