8.1. Recreate the simulated data from Exercise 7.2

library(mlbench)
library(randomForest)
library(caret)
library(tidyverse)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(rJava)
library(RWeka)
library(rpart.plot)
library(doParallel)

cores <- detectCores()
cl <- makeCluster(2)
registerDoParallel(cl)
set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000,
  mtry = 3
)

print(model1)
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000, mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.754258
##                     % Var explained: 72.3
rfImp1 <- varImp(model1, scale = FALSE)
# rownames_to_column() keeps the predictor names, which arrange() would otherwise drop
rfImp1 |> 
  rownames_to_column("Predictor") |> 
  arrange(desc(Overall))

As we can see, the random forest model did not make significant use of the noise predictors V6-V10: their importance scores are very close to zero. Since the noise variables have no real relationship with the outcome, shuffling them during the permutation step has only a tiny effect on the model's performance. However, because a random forest builds many deep trees on bootstrap samples with random subsets of predictors, a noise variable occasionally ends up being used in a split by chance. When permutation importance is then calculated, these small random effects produce small fluctuations in the error, which is why the noise variables end up with small, non-zero importance values.

Also, even though V3 is an informative variable in the Friedman formula, the random forest struggles to identify its symmetric quadratic effect with only 200 observations. The model spreads V3 across many small splits in different trees, none of which is individually strong. As a result, the permutation step does not detect a large drop in accuracy when V3 is shuffled, so its importance appears low in this model.
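To make the permutation idea above concrete, here is a minimal sketch (the helper perm_imp and the loop over predictors are my own, and it shuffles the training data rather than the out-of-bag samples that randomForest uses internally):

# Rough permutation-importance sketch: shuffle one predictor at a time and
# record how much the training MSE increases relative to the unshuffled fit.
perm_imp <- function(fit, data, response = "y") {
  base_mse <- mean((data[[response]] - predict(fit, data))^2)
  sapply(setdiff(names(data), response), function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])   # break the link between predictor and outcome
    mean((data[[response]] - predict(fit, shuffled))^2) - base_mse
  })
}

set.seed(200)
sort(perm_imp(model1, simulated), decreasing = TRUE)
# The informative predictors should show clear MSE increases, while V6-V10
# should stay near zero, mirroring the varImp() pattern above.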

(b) Now add an additional predictor that is highly correlated with one of the informative predictors.

set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025

As we can see, the correlation between V1 and simulated$duplicate1 is about 0.95, so the two variables carry almost the same information.

Fit another random forest model to these data.

set.seed(200)
model2 <- randomForest(y ~.,
                       data = simulated,
                       ntree = 1000, importance = TRUE, 
                       mtry = 3)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |> 
  rownames_to_column("Predictor") |> 
  arrange(desc(Overall))

Did the importance score for V1 change?

V1's importance dropped from 8.73 in the previous model to 6.05, while duplicate1 received an importance of 4.20. This happened because the random forest splits importance between correlated predictors, since they are essentially interchangeable from the model's point of view.

What happens when you add another predictor that is also highly correlated with V1?

set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$V1, simulated$duplicate2)
## [1] 0.9497025
set.seed(200)
model3 <- randomForest(y~.,
                data = simulated,
                ntree= 1000,
                importance = TRUE,
                mtry = 3
)

rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |> 
  rownames_to_column("Predictor") |> 
  arrange(desc(Overall))

Adding the first duplicate already caused a large drop in \(V1\)'s importance. Adding duplicate2 mostly split the duplicated signal between the two duplicates, while \(V1\) lost a little more of its importance. At this point \(V4\) has become the most dominant predictor.

In the original model \(V1\) had the highest importance, but its importance dropped once similar variables were added. When a predictor highly correlated with an informative variable (like \(V1\)) is added to the model, the importance score of the original variable decreases because the random forest spreads the importance across the interchangeable predictors. The duplicates are created by adding random noise to the original signal, as in this line:

simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

Because of that, the original clean version \(V1\) usually retains a higher importance value than its noisier clones. However, as more noisy duplicates are added, the total importance of the underlying signal continues to be split across the group, which can create the illusion that the individual scores of critical variables are lower than those of less informative but unique predictors (like \(V4\) or \(V2\)).

As we practiced with linear and non-linear models, reducing the number of correlated predictors is often useful for improving predictive accuracy. Random forest models, however, are fairly robust to correlated variables, and it can be beneficial to keep them in the model if the goal is better accuracy. The downsides of keeping correlated variables are that interpretation of the important predictors becomes less reliable and that computational cost increases. So whether to remove correlated predictors depends on the goal of the model: better predictive power or better interpretation.
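If interpretation is the priority, one option (a sketch, not something the exercise asks for; the 0.9 cutoff is an arbitrary choice) is to filter the highly correlated columns with caret::findCorrelation before refitting:

# Sketch: drop predictors that are highly correlated with another column,
# then refit the random forest on the reduced set.
pred_only <- simulated[, setdiff(names(simulated), "y")]
too_high  <- findCorrelation(cor(pred_only), cutoff = 0.9)
names(pred_only)[too_high]   # should flag members of the V1/duplicate1/duplicate2 cluster

simulated_filtered <- cbind(pred_only[, -too_high], y = simulated$y)
set.seed(200)
model_filtered <- randomForest(y ~ ., data = simulated_filtered,
                               importance = TRUE, ntree = 1000, mtry = 3)
varImp(model_filtered, scale = FALSE)
# Whichever member of the correlated group survives the filter should recover
# most of the importance that was previously split across the clones.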

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

set.seed(200)
model_cforest <- cforest(y ~ ., data = simulated, 
                         control = cforest_unbiased(ntree = 1000, mtry = 3))

imp_trad <- varimp(model_cforest, conditional = FALSE)

imp_cond <- varimp(model_cforest, conditional = TRUE)

comparisons <- data.frame(
  Standard_RF = rfImp3$Overall,
  #CF_Unconditional = imp_trad,
  CF_Conditional = imp_cond
)
rownames(comparisons) <- rownames(rfImp3)
comparisons[order(-comparisons$CF_Conditional), ]

The Conditional Inference Forest (cforest) does not follow the same pattern as the traditional Random Forest model when dealing with highly correlated predictors.

In the standard Random Forest model, we saw an effect where the importance of the \(V1\) signal was split across its duplicates, allowing predictors \(V4\) and \(V2\) to falsely appear more dominant. However, by using the conditional importance measure, the model partials out these correlations. It strictly evaluates the unique contribution of each variable.

As a result, in the CF_Conditional column the importance scores for \(V1\) and its duplicates all dropped significantly. This is because the algorithm recognizes that \(V1\), duplicate1, and duplicate2 carry very similar information, so once the model has one of them, the others provide very little new information. While this leads to lower absolute importance values for \(V1\) and its duplicates as a group, it gives a much better assessment of each variable's unique predictive power than the traditional measure does.
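For completeness, the unconditional cforest scores can be placed next to the other two measures (a small sketch; the column labels are my own):

# Sketch: traditional RF, unconditional cforest, and conditional cforest
# importances side by side, aligned by predictor name.
all_imp <- data.frame(
  Standard_RF      = rfImp3$Overall,
  CF_Unconditional = imp_trad[rownames(rfImp3)],
  CF_Conditional   = imp_cond[rownames(rfImp3)],
  row.names        = rownames(rfImp3)
)
all_imp[order(-all_imp$CF_Conditional), ]
# The unconditional cforest scores typically fall between the two extremes:
# less inflated for the V1/duplicate group than the standard RF, but not as
# strongly shrunk as the conditional scores.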

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

ctrl <- trainControl(method = "cv", number = 10)

gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 500, by = 25),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10) 

set.seed(200)
gbmTune <- train(y ~ ., data = simulated,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = ctrl,
                 verbose = FALSE)

gbm_varImp <- varImp(gbmTune, scale = TRUE)
gbm_varImp
## gbm variable importance
## 
##            Overall
## V4         100.000
## V2          75.082
## V1          68.799
## V5          41.947
## V3          32.800
## duplicate1  23.194
## V6           5.230
## V10          3.619
## V8           3.467
## V7           2.872
## V9           2.794
## duplicate2   0.000
plot(gbm_varImp)

cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
                          neighbors = c(0, 1, 3, 6, 9))

set.seed(200)
cubistTune <- train(y ~ ., 
                    data = simulated,
                    method = "cubist",
                    tuneGrid = cubistGrid,
                    trControl = ctrl) 

cubistTune$bestTune
feature_names <- colnames(simulated)[-ncol(simulated)]
final_comparison <- data.frame(Variable = feature_names, stringsAsFactors = FALSE)

get_imp <- function(mod_imp) {
  imp_df <- as.data.frame(mod_imp$importance)
  if("Overall" %in% names(imp_df)) return(imp_df[feature_names, "Overall"])
  return(imp_df[feature_names, 1])
}

imp_rf_scaled     <- varImp(model3, scale = TRUE)
imp_gbm_scaled    <- varImp(gbmTune, scale = TRUE)
imp_cubist_scaled <- varImp(cubistTune, scale = TRUE)


#final_comparison$RF <- get_imp(imp_rf_scaled)  # left commented out: varImp() on a randomForest object returns a plain data frame, so get_imp() would need adjusting first
final_comparison$GBM    <- get_imp(imp_gbm_scaled)
final_comparison$Cubist <- get_imp(imp_cubist_scaled)

final_comparison[is.na(final_comparison)] <- 0
final_comparison %>% arrange(desc(Cubist))

The results confirm that while Random Forest model3 splits importance between correlated variables, GBM and Cubist are much more efficient at isolating the actual signal.

  • Random Forest shares importance across \(V1\) and its clones. This makes \(V1\) look weaker than it really is (dropping below \(V4\) and \(V2\)), creating a misleading ranking of importance.

  • GBM ranks \(V1\) substantially higher than duplicate1. Because it is a sequential learner, it captures \(V1\)'s signal early, leaving very little residual error for the duplicate to explain.

  • The Cubist model shows the pattern closest to the original data: it gives \(V1\) the highest importance within the correlated group, and the duplicates receive essentially no importance. Since it uses linear models inside its rules, it treats the redundancy as collinearity and simply drops the highly correlated clones (see the attribute-usage sketch below).
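One way to see this directly is to look at the attribute-usage table that, as far as I know, the fitted Cubist object stores in its usage component; it reports how often each variable appears in the rule conditions versus in the linear models:

# Sketch: attribute usage from the final Cubist model; the duplicates should
# barely appear in either the conditions or the model columns.
cubistTune$finalModel$usage |>
  arrange(desc(Model))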

8.2. Use a simulation to show tree bias with different granularities.

For this exercise I simulate a dataset from a latent signal z and then express that signal at seven different granularities:

  1. x_cont : continuous (200+ unique values)

  2. x_20 : 20‑level ordinal

  3. x_10 : 10‑level ordinal

  4. x_5 : 5‑level ordinal

  5. x_3 : 3‑level ordinal

  6. x_bin : binary

  7. x_noise : pure noise (no relationship to z)

All except x_noise contain the same true information, just expressed at different granularities.

set.seed(200)
n <- 200

z <- rnorm(n)

x_cont <- z

x_20 <- cut(z, breaks = 20, labels = FALSE)
x_10 <- cut(z, breaks = 10, labels = FALSE)
x_5  <- cut(z, breaks = 5,  labels = FALSE)
x_3  <- cut(z, breaks = 3,  labels = FALSE)
x_bin <- ifelse(z > 0, 1, 0)

x_20 <- x_20 + rnorm(n, 0, 0.001)
x_10 <- x_10 + rnorm(n, 0, 0.001)
x_5  <- x_5  + rnorm(n, 0, 0.001)
x_3  <- x_3  + rnorm(n, 0, 0.001)
x_bin <- x_bin + rnorm(n, 0, 0.001)

x_noise <- rnorm(n)

y <- 4*z + rnorm(n, sd = 4)


sim_data <- data.frame(
  y,
  x_cont,
  x_20,
  x_10,
  x_5,
  x_3,
  x_bin,
  x_noise
)

head(sim_data, 15)
# str(sim_data)
sum(is.na(sim_data$y))
## [1] 0
nearZeroVar(sim_data, saveMetrics = TRUE)

I will use CART (rpart2), CTree (ctree), and Random Forest (rf) models to compare how their variable-importance biases show up on these data.

fit_rpart2 <- train(
  y ~ .,
  data = sim_data,
  method = "rpart2",
  tuneLength = 10
)

fit_ctree <- train(
  y ~ .,
  data = sim_data,
  method = "ctree"
)

fit_rf <- train(
  y ~ .,
  data = sim_data,
  method = "rf",
  importance = TRUE
)
imp_rpart2 <- varImp(fit_rpart2)$importance
imp_ctree  <- varImp(fit_ctree)$importance
imp_rf     <- varImp(fit_rf)$importance
comparison <- data.frame(
  Variable = rownames(imp_rpart2),
  CART_rpart2 = imp_rpart2$Overall,
  CTree = imp_ctree$Overall,
  RandomForest = imp_rf$Overall
)
comparison_long <- comparison |> 
  pivot_longer(
    cols = -Variable,
    names_to = "Model",
    values_to = "Importance"
  ) |> 
  group_by(Variable)|> 
  mutate(mean_imp = mean(Importance)) |> 
  ungroup() |> 
  arrange(desc(Variable)) |> 
  mutate(Variable = factor(Variable, levels = unique(Variable)))

ggplot(comparison_long, aes(x = Variable, y = Importance, fill = Model)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.75) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Variable Importance Comparison Across Tree Models",
    y = "Importance",
    x = "Predictor"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 18),
    panel.grid.minor =element_line()
  )

comparison |> 
  arrange(desc(CART_rpart2))

Across the three tree models, the variable‑importance patterns show how differently each model reacts to predictors that all carry the same underlying signal.

In all models the noise variable's importance is essentially zero, which confirms the simulation is behaving correctly. Even though the remaining predictors all carry the same informative signal, the models weight them differently because of how their splitting methods work: CART's exhaustive split search tends to favor predictors with more distinct values (finer granularity), which is exactly the bias this exercise is meant to show, while conditional inference trees are designed to reduce that selection bias.
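To make the granularity bias itself visible (a small check of my own, not required by the exercise), one can repeatedly refit a shallow CART tree on bootstrap samples and record which predictor wins the root split; predictors with more distinct values tend to be chosen more often even though they all encode the same latent z:

# Sketch: which predictor does a depth-1 CART tree pick for its root split
# across bootstrap resamples of the simulated data?
set.seed(200)
root_splits <- replicate(200, {
  boot <- sim_data[sample(nrow(sim_data), replace = TRUE), ]
  fit  <- rpart(y ~ ., data = boot, control = rpart.control(maxdepth = 1))
  as.character(fit$frame$var[1])   # variable used at the root ("<leaf>" if no split)
})
sort(table(root_splits), decreasing = TRUE)
# The finer-grained versions (x_cont, x_20, x_10) tend to dominate the root
# split counts relative to x_5, x_3, and x_bin.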

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure below provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

library(AppliedPredictiveModeling)
data(solubility)
ls(pattern = '^solT')
## [1] "solTestX"       "solTestXtrans"  "solTestY"       "solTrainX"     
## [5] "solTrainXtrans" "solTrainY"
trainingData <- solTrainXtrans
trainingData$Solubility <- solTrainY
dim(trainingData)
## [1] 951 229
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5)
set.seed(200)
gbm_0101 <- train(
  Solubility~.,
  data = trainingData,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE,
  bag.fraction = 0.1,
  tuneGrid = data.frame(
    n.trees = 1000,
    interaction.depth = 2,
    shrinkage = 0.1,
    n.minobsinnode = 10
  )
)

gbm_0909 <- train(
  Solubility~.,
  data = trainingData,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE,
  bag.fraction = 0.9,
  tuneGrid = data.frame(
    n.trees = 1000,
    interaction.depth = 2,
    shrinkage = 0.9,
    n.minobsinnode = 10
  )
)
imp_0101 <- varImp(gbm_0101)$importance |>
  rownames_to_column("Variable") |>
  rename(Importance = Overall) |>
  arrange(desc(Importance)) |>
  slice(1:20) |>
  mutate(Model = "shrinkage = 0.1, bag = 0.1")

imp_0909 <- varImp(gbm_0909)$importance |>
  rownames_to_column("Variable") |>
  rename(Importance = Overall) |>
  arrange(desc(Importance)) |>
  slice(1:20) |>
  mutate(Model = "shrinkage = 0.9, bag = 0.9")

imp_both <- bind_rows(imp_0101, imp_0909)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.5.3
ggplot(imp_both,
       aes(x = reorder_within(Variable, Importance, Model),
           y = Importance)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ Model, scales = "free_y") +
  scale_x_reordered() +
  labs(
    title = "Top 20 Important Variables for Each GBM Model",
    x = "Variable",
    y = "Importance"
  ) +
  theme_minimal()

GBM (shrinkage = 0.9, bag = 0.9)

In the right-hand model the bagging fraction is 0.9, so each tree sees 90% of the data. Because of that, the strongest variable (NumCarbon) has a high chance of showing a strong signal in almost every tree, so the model keeps picking it again and again.
Combined with the high learning rate (0.9), this model also takes very aggressive steps toward correcting the residuals. As a result, most of the error is explained by the first few trees, leaving little signal for later trees. This causes the model to rely heavily on the variables used early, concentrating the variable importance on just a few predictors.

GBM (shrinkage = 0.1, bag = 0.1)

With the low learning rate (0.1), the model on the left makes very small updates to the predicted values at each iteration. It only partially corrects the residuals, so a lot of error is still left to be explained after each tree.
At the same time, the bagging fraction is 0.1, so each tree sees only a small random part (10%) of the data. Because of that, the strongest variable has a much smaller chance of looking dominant in every tree, and different variables can appear important depending on the sample.
Because learning is slow and each tree's sample is noisier than in the right-hand model, the left model keeps using different variables across many trees. As a result, residuals are reduced more gradually and importance is spread over many predictors.
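The mechanics behind both effects can be written out in a few lines. This is a toy sketch of stochastic gradient boosting for squared-error loss (my own simplification, not the gbm implementation); it assumes a data frame with a numeric response column y, such as the simulated data from 8.1:

# Toy sketch: each iteration fits a small tree to the residuals of a random
# subsample (bag_fraction), then moves the prediction only a fraction
# (shrinkage) of the way toward that tree's fit.
boost_sketch <- function(data, n_trees = 50, shrinkage = 0.1, bag_fraction = 0.5) {
  pred <- rep(mean(data$y), nrow(data))
  for (i in seq_len(n_trees)) {
    in_bag <- sample(nrow(data), size = floor(bag_fraction * nrow(data)))
    resid  <- data$y - pred
    fit <- rpart(r ~ . - y,
                 data = cbind(data[in_bag, ], r = resid[in_bag]),
                 control = rpart.control(maxdepth = 2, cp = 0))
    pred <- pred + shrinkage * predict(fit, data)   # small step toward the residuals
  }
  pred
}
# e.g. preds <- boost_sketch(simulated, shrinkage = 0.9, bag_fraction = 0.9)
# With 0.9 / 0.9 the first few trees (and the first few strong predictors)
# explain most of the error; with 0.1 / 0.1 the fit is built up slowly across
# many trees and many predictors.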

(b) Which model do you think would be more predictive of other samples?

The left model (shrinkage = 0.1, bag = 0.1) has a better chance of producing accurate predictions on new data.
This is because the low learning rate forces the model to learn slowly and not overfit too quickly. It makes smaller updates, so the prediction is built up gradually and captures the more general patterns in the data.
Also, the low bagging fraction introduces randomness, since each tree only sees a small subset of the data. This helps reduce variance and makes the model more robust.

On the other hand, the right model (shrinkage = 0.9, bag = 0.9) learns very aggressively and uses almost all of the data at each step. It quickly fits the strongest predictors and reduces error fast, but this usually leads to overfitting the early splits and relying too heavily on a few variables.

(c) How would increasing interaction depth affect the slope of predictor importance for either model?

Increasing interaction depth makes trees more complex and allows them to capture interactions between variables.

Since the (shrinkage = 0.9, bag = 0.9) model already relies heavily on a few strong predictors early on, increasing the depth will make those variables even more dominant. The model builds more complex splits and combinations around them, so importance becomes even more concentrated and the slope gets steeper.

For the (shrinkage = 0.1, bag = 0.1) model, increasing the depth will cause more variables to appear in each tree and contribute to the predictions. Since this model already spreads its attention across different predictors because of the small learning rate and low bagging fraction, a greater depth will spread importance even further and make the slope flatter.

Increasing the interaction depth can also help the model fit the training data better, because it can capture more complex relationships. At the same time it can start chasing noise rather than real patterns, so it can overfit and perform worse on unseen data.

So it really depends on the data: if there are real interactions, increasing the depth is useful; if not, it can push the model toward overfitting.
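A quick way to check this on the solubility data (a sketch of my own; depths 1 and 7 and the single fixed tuning point are arbitrary choices) is to refit the aggressive GBM at two depths and compare how much of the total importance the top five predictors absorb:

# Sketch: share of total importance held by the top 5 predictors for a
# shallow vs. a deep GBM with the aggressive 0.9 / 0.9 settings.
top5_share <- function(depth) {
  set.seed(200)
  fit <- train(Solubility ~ ., data = trainingData, method = "gbm",
               trControl = trainControl(method = "none"), verbose = FALSE,
               bag.fraction = 0.9,
               tuneGrid = data.frame(n.trees = 1000, interaction.depth = depth,
                                     shrinkage = 0.9, n.minobsinnode = 10))
  imp <- varImp(fit)$importance$Overall
  sum(sort(imp, decreasing = TRUE)[1:5]) / sum(imp)
}
c(depth_1 = top5_share(1), depth_7 = top5_share(7))
# A larger top-5 share at depth 7 would indicate an even steeper importance
# slope for the aggressive model.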

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(caret)
library(glmnet)
library(pls)
library(dplyr)
library(ggplot2)
library(lattice)
library(RANN)
library(purrr)
library(e1071)
library(patchwork)
library(corrplot)
library(writexl)
options(scipen = 999)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  allowParallel = TRUE
)
library(ggplot2)
library(patchwork)

plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
  
  diag_df <- data.frame(
    observed = as.numeric(obs),
    predicted = as.numeric(pred),
    residuals = as.numeric(obs) - as.numeric(pred)
  )
  
  p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = 2) +
    labs(
      title = paste("Predicted vs Observed:", model_name),
      x = "Predicted",
      y = "Observed"
    )
  
  p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = 2) +
    labs(
      title = paste("Residuals vs Predicted:", model_name),
      x = "Predicted",
      y = "Residuals"
    )
  
  p1 + p2
}
get_cv_results <- function(model, model_name) {
  best <- model$bestTune
  
  res <- model$results
  
  for (param in names(best)) {
    res <- res[res[[param]] == best[[param]], ]
  }
  
  data.frame(
    Model = model_name,
    CV_RMSE = res$RMSE,
    CV_R2 = res$Rsquared,
    CV_MAE = res$MAE
  )
}

get_test_results <- function(obs, pred, model_name) {
  res <- postResample(pred = pred, obs = obs)
  
  data.frame(
    Model = model_name,
    Test_RMSE = res["RMSE"],
    Test_R2 = res["Rsquared"],
    Test_MAE = res["MAE"]
  )
}
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176  58
yield <- ChemicalManufacturingProcess[, 1]

processPredictors <- ChemicalManufacturingProcess[, -1]

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8)

sum(is.na(processPredictors))
## [1] 106

The first step was to check the predictor matrix for missing values. There were 106 missing cells across all measurements, which is a relatively small amount given the size of the dataset (57 predictors and 176 observations). To impute the NAs I chose bagged-tree imputation ("bagImpute") because it can better preserve nonlinear relationships between variables.

set.seed(200)
imputation <- preProcess(
  processPredictors,
  method = 'bagImpute'
)

processPredictors_imputed <- predict(imputation, processPredictors)
sum(is.na(processPredictors_imputed))
## [1] 0
nzv_vars <- nearZeroVar(processPredictors_imputed)
nzv_vars
## [1] 7
processPredictors_imputed <- processPredictors_imputed[, -nzv_vars]

Near-zero variance filtering removed one predictor that contained almost no useful variation.

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I split data into training and test sets using a 75/25 split.

set.seed(200)

train_index <- createDataPartition(yield, p= 0.75, list = FALSE)

train_procPred <- processPredictors_imputed[train_index,]
test_procPred <- processPredictors_imputed[-train_index,]

train_yield <- yield[train_index]
test_yield <- yield[-train_index]
skewness(train_yield)
## [1] 0.2865166
hist(train_yield, breaks = 20)

Before fitting the model, I checked the distribution of the response variable. The yield variable only showed mild right skew, with skewness equal to 0.287, and the histogram looked close to symmetric. Because of this, I kept the response on its original scale to make the final predictions easier to interpret.
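As an extra check (a small sketch; not required for the exercise), caret's BoxCoxTrans can estimate a transformation parameter for the response; an estimate near 1 would support leaving the yield untransformed:

# Sketch: Box-Cox lambda estimate for the training-set yield.
BoxCoxTrans(train_yield)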

cor_matrix <- cor(train_procPred, use = "pairwise.complete.obs")
corrplot(
  cor_matrix,
  order = 'hclust',
  tl.cex = 0.2
)

The correlation matrix shows several strong clusters among the predictors. Although tree-based models handle correlated predictors reasonably well, I keep the same pre-processing as in the earlier exercises and filter out the most highly correlated columns before fitting the models.

high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)

train_procPred_filtered <- train_procPred[, -high_corr] 
test_procPred_filtered <- test_procPred[, -high_corr]

Using a 0.9 cutoff, 10 highly correlated predictors were flagged as redundant and removed from both the training and test sets.

ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  allowParallel = TRUE
)

Tree-based models

Single regression tree: rpart

rpart_grid <- data.frame(
  cp = seq(0.001, 0.1, length = 30)
)

start_time <- Sys.time()

set.seed(200)
tree_model <- train(
  x = train_procPred_filtered,
  y = train_yield,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = rpart_grid
)

end_time <- Sys.time()

rpart_train_time <- end_time - start_time
rpart_train_time
## Time difference of 3.791131 secs

Random forest: rf

For this model I tune mtry, which controls how many randomly selected predictors the model considers at each split.

rf_grid <- data.frame(
  mtry = c(2, 5, 10, 15, 20, 30, 40)
)

start_time <- Sys.time()

set.seed(200)
rf_model <- train(
  x = train_procPred_filtered,
  y = train_yield,
  method = "rf",
  trControl = ctrl,
  tuneGrid = rf_grid,
  ntree = 500,
  importance = TRUE
)

end_time <- Sys.time()
rf_train_time <- end_time - start_time
rf_train_time
## Time difference of 38.81446 secs

GBM

Here the model tunes the number of trees, tree depth, learning rate, and minimum node size.

gbm_grid <- expand.grid(
  n.trees = c(100, 500, 1000),
  interaction.depth = c(1, 3, 5),
  shrinkage = c(0.01, 0.05, 0.1),
  n.minobsinnode = c(5, 10, 25)
)

start_time <- Sys.time()

set.seed(200)
gbm_model <- train(
  x = train_procPred_filtered,
  y = train_yield,
  method = "gbm",
  trControl = ctrl,
  tuneGrid = gbm_grid,
  verbose = FALSE
)

end_time <- Sys.time()
gbm_train_time <- end_time - start_time
gbm_train_time
## Time difference of 1.567774 mins

Cubist

Cubist tunes two parameters: committees and neighbors.
committees controls how many models Cubist builds one after another. More committees means the model keeps correcting the previous models' mistakes, so it becomes more complex and fits the data better, but it can also overfit.

neighbors controls how many nearby training points the model uses to adjust its predictions. If neighbors is 0, the prediction comes only from the rule-based model; if it is larger, the prediction is adjusted toward the average of similar observations.

cubist_grid <- expand.grid(
  committees = c(1, 5, 10, 25, 50),
  neighbors = c(0, 1, 3, 5, 9)
)

start_time <- Sys.time()

set.seed(200)
cubist_model <- train(
  x = train_procPred_filtered,
  y = train_yield,
  method = "cubist",
  trControl = ctrl,
  tuneGrid = cubist_grid
)

end_time <- Sys.time()
cubist_train_time <- end_time - start_time
cubist_train_time
## Time difference of 16.20378 secs

M5 model tree

caret's M5 model exposes only limited tuning through train(), so this step is usually kept simple.
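For reference, caret's "M5" method does expose a small grid; to the best of my knowledge its tuning parameters are pruned, smoothed, and rules, each taking "Yes"/"No". A sketch of spelling that grid out explicitly:

# Sketch: explicit tuning grid for caret's M5 method (all eight combinations).
m5_grid <- expand.grid(pruned   = c("Yes", "No"),
                       smoothed = c("Yes", "No"),
                       rules    = c("Yes", "No"))
# Passing tuneGrid = m5_grid to the train() call below would enumerate these
# combinations instead of relying on caret's default search.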

set.seed(200)

start_time <- Sys.time()

m5_model <- train(
  x = train_procPred_filtered,
  y = train_yield,
  method = "M5",
  trControl = ctrl
)

end_time <- Sys.time()
M5_train_time <- end_time - start_time
M5_train_time
## Time difference of 11.79469 secs
set.seed(200)
cv_results <- bind_rows(
  get_cv_results(tree_model, "Single Tree"),
  get_cv_results(rf_model, "Random Forest"),
  get_cv_results(gbm_model, "GBM"),
  get_cv_results(cubist_model, "Cubist"),
  get_cv_results(m5_model, "M5")
)

#cv_results |> arrange(CV_RMSE)
set.seed(200)
pred_tree   <- predict(tree_model, test_procPred_filtered)
pred_rf     <- predict(rf_model, test_procPred_filtered)
pred_gbm    <- predict(gbm_model, test_procPred_filtered)
pred_cubist <- predict(cubist_model, test_procPred_filtered)
pred_m5     <- predict(m5_model, test_procPred_filtered)

test_results <- bind_rows(
  get_test_results(test_yield, pred_tree, "Single Tree"),
  get_test_results(test_yield, pred_rf, "Random Forest"),
  get_test_results(test_yield, pred_gbm, "GBM"),
  get_test_results(test_yield, pred_cubist, "Cubist"),
  get_test_results(test_yield, pred_m5, "M5")
)

#test_results |> arrange(Test_RMSE)
model_train_times <- data.frame(
  Model = c("Cubist", "Random Forest", "GBM", "Single Tree", "M5"),
  Train_Time = c(cubist_train_time, 
                 rf_train_time, 
                 gbm_train_time, 
                 rpart_train_time, 
                 M5_train_time)
)

final_results <- test_results |>
  left_join(model_train_times, by = "Model")

#final_results

(a) Which tree-based regression model gives the optimal resampling and test set performance?

cv_results |> arrange(CV_RMSE)
final_results |> arrange(Test_RMSE)

Based on cross-validation, GBM and Cubist have almost identical resampled RMSE (about 1.07), while Cubist has a slightly better \(R^2\).

But the test-set results show that Cubist clearly outperformed the other models, with the lowest RMSE and highest \(R^2\).

So the optimal tree-based model for this data is Cubist, not GBM.

Cubist model test results:

  • RMSE = 0.8708

  • \(R^2\) = 0.79

  • Training time was about 16 seconds, which is much faster than Random Forest (about 39 seconds) and GBM (about 1.6 minutes).

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

cubist_imp_vars <- varImp(cubist_model)
cubist_imp_vars
## cubist variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32  81.720
## ManufacturingProcess09  40.860
## ManufacturingProcess39  35.484
## ManufacturingProcess33  31.183
## BiologicalMaterial11    30.108
## ManufacturingProcess13  30.108
## ManufacturingProcess28  23.656
## ManufacturingProcess04  23.656
## ManufacturingProcess06  22.581
## ManufacturingProcess19  20.430
## BiologicalMaterial06    18.280
## ManufacturingProcess10  17.204
## ManufacturingProcess14  17.204
## ManufacturingProcess15  16.129
## ManufacturingProcess37  12.903
## BiologicalMaterial05    11.828
## BiologicalMaterial03    10.753
## ManufacturingProcess11   8.602
## ManufacturingProcess03   7.527
cubist_top15 <- cubist_imp_vars$importance |>
  tibble::rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice(1:20)

ggplot(cubist_top15, aes(x = reorder(Predictor, Overall), y = Overall)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Important Predictors (Cubist)") +
  theme_minimal()

MARS model's important predictors:

Predictor                 Overall
ManufacturingProcess32  100.00000
ManufacturingProcess09   46.53772
ManufacturingProcess13    0.00000

ENET model's important predictors:

Predictor                 Overall
ManufacturingProcess32  100.00000
ManufacturingProcess17   80.31063
BiologicalMaterial06     75.08969
ManufacturingProcess13   74.46564
ManufacturingProcess36   68.75818
BiologicalMaterial03     67.10754
ManufacturingProcess06   66.37746
ManufacturingProcess09   65.28374
ManufacturingProcess33   46.90589
BiologicalMaterial08     44.97856


The most important predictors in the Cubist tree-based model are mostly manufacturing process variables. The top predictors include ManufacturingProcess17, ManufacturingProcess32, ManufacturingProcess09, and several others. Only a few biological variables appear in the top list, such as BiologicalMaterial11 and BiologicalMaterial06.

Manufacturing process variables clearly dominate the Cubist importance list. Most of the top predictors are process measurements, which suggests that yield is affected more strongly by the manufacturing process than by the biological starting material.

Compared with the models from the previous exercises, there are clear similarities. ManufacturingProcess32 is the top predictor in both the MARS and ENET models and sits near the top (second) of the Cubist list. ManufacturingProcess09 and ManufacturingProcess13 also appear across multiple models, showing that they are consistently important variables.

But there are also some differences. The Cubist model uses a larger range of predictors and splits importance across many variables, while MARS focuses on only a couple of key predictors. ENET includes a balanced mix of both process and biological variables, showing that linear models can pick up different relationships than tree-based models.

Cubist captures nonlinear relationships and interactions in these data, so it uses more predictors than MARS or ENET. This is why its importance is more spread out and includes more process variables.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart.plot(
  tree_model$finalModel,
  type = 1,
  extra = 101,
  fallen.leaves = TRUE
)
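
The exercise also asks about the distribution of yield in the terminal nodes. A small sketch of one way to show it (using rpart's where component, which records the terminal node for each training row):

# Sketch: boxplots of the training-set yield within each terminal node of the
# single tree (node labels follow rpart's internal numbering).
node_id <- factor(tree_model$finalModel$where)
boxplot(train_yield ~ node_id,
        xlab = "Terminal node", ylab = "Yield",
        main = "Yield distribution by terminal node")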

This single tree shows how the predicted yield changes depending on the splits and gives a simple way to understand the data.
The first split is on ManufacturingProcess32, which means this variable is the most important one: when ManufacturingProcess32 is high (>= 160), the predicted yield values are 40 or higher.

Just like in the Cubist model, the single tree's splits are based mostly on manufacturing process variables, such as ManufacturingProcess17 and ManufacturingProcess06. This shows that the manufacturing variables have the stronger effect on yield. Biological variables appear deeper in the tree, so their effect is more conditional and depends on the process settings.

This shows that yield depends on a combination of variables, not just one, and that the variables interact with each other, since the splits are applied one after another.

Overall, the tree gives a good, interpretable summary and shows the dominance of the manufacturing variables, but it is a simple model and cannot capture all of the complex relationships, which is why Cubist performed better.