Use a simulation to show tree bias with different granularities.
Here we construct predictors that differ in granularity (number of unique values) but have the same relationship to the outcome. We then examine how often each predictor is used for the first split in a simple regression tree.
simulate_tree_bias <- function(n = 500, n_sims = 200) {
  results <- tibble(Simulation = integer(), RootVariable = character())

  for (i in 1:n_sims) {
    # Two informative predictors with equal signal:
    #   x_cont: continuous, many unique values
    #   x_cat:  categorical with fewer levels, similar predictive power
    x_cont <- runif(n, 0, 1)
    x_cat  <- factor(sample(letters[1:4], n, replace = TRUE))

    # Outcome depends on both in the same functional way,
    # with noise added so they are roughly equally predictive
    y <- ifelse(x_cont > 0.5, 1, 0) +
      ifelse(x_cat %in% c("c", "d"), 1, 0) +
      rnorm(n, sd = 0.5)

    dat <- tibble(
      x_cont = x_cont,
      x_cat  = x_cat,
      y      = y
    )

    tree_fit <- rpart(
      y ~ .,
      data = dat,
      method = "anova",
      control = rpart.control(cp = 0.001, minsplit = 20)
    )

    # The first row of the splits matrix is the primary split of the root node
    root_var <- if (!is.null(tree_fit$splits)) {
      rownames(tree_fit$splits)[1]
    } else {
      NA_character_
    }

    results <- add_row(results,
                       Simulation = i,
                       RootVariable = root_var)
  }

  results %>%
    filter(!is.na(RootVariable))
}
set.seed(2025)
tree_bias_results <- simulate_tree_bias(n = 500, n_sims = 200)
head(tree_bias_results)
## # A tibble: 6 × 2
## Simulation RootVariable
## <int> <chr>
## 1 1 x_cat
## 2 2 x_cat
## 3 3 x_cat
## 4 4 x_cat
## 5 5 x_cat
## 6 6 x_cont
root_counts <- tree_bias_results %>%
count(RootVariable)
root_counts
## # A tibble: 2 × 2
## RootVariable n
## <chr> <int>
## 1 x_cat 86
## 2 x_cont 114
root_counts %>%
ggplot(aes(x = RootVariable, y = n)) +
geom_col() +
labs(
x = "Root Split Variable",
y = "Count Across Simulations",
title = "Root Split Selection Under Different Predictor Granularities"
)
Exercise 8.2: Frequency with which each predictor is chosen for the root split.
Interpretation (8.2)
Both x_cont (continuous) and x_cat
(categorical with four levels) are constructed to be equally informative
about y. If regression trees were completely unbiased with
respect to predictor granularity, the two predictors would be selected
as the root split roughly equally often.
In this simulation, the continuous predictor x_cont is chosen for the root split more often than the four-level factor (114 versus 86 of the 200 runs), even though the two predictors were constructed to be equally informative. This illustrates the split-selection bias of standard CART-style trees: because the exhaustive search evaluates every candidate cut point, predictors that offer more potential split points (here, the continuous variable) have more opportunities to produce an apparently better split by chance. Depending on how the outcome is specified, designs with high-cardinality factors can show the analogous bias in favor of those factors. Either way, the tree responds not only to predictive strength but also to how many candidate cut points a predictor offers.
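As a rough illustration of why the number of candidate cut points matters (this sketch is not part of the original solution), the code below counts the candidate splits each predictor offers in one data set simulated under the same design; for method = "anova", rpart orders factor levels by their mean response, so a k-level factor contributes only about k - 1 candidate binary splits.
# Illustrative sketch: candidate cut points for each predictor type
set.seed(2025)
n <- 500
x_cont <- runif(n, 0, 1)
x_cat  <- factor(sample(letters[1:4], n, replace = TRUE))
# Continuous predictor: one candidate cut point between adjacent unique values
length(unique(x_cont)) - 1
# Four-level factor: after rpart's mean-response ordering, roughly 3 candidates
nlevels(x_cat) - 1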
This exercise is based on Figure 8.24, which compares variable importance for boosting models with different bagging fractions and learning rates.
In stochastic gradient boosting the bagging fraction and learning rate govern tree construction. Figure 8.24 shows variable importance plots for solubility data with (bagging fraction, learning rate) = (0.1, 0.1) and (0.9, 0.9).
Why does the model on the right (0.9, 0.9) focus its importance on just the first few predictors, whereas the model on the left (0.1, 0.1) spreads importance across more predictors?
Answer (8.3a)
With a high bagging fraction and large learning rate, each tree in the ensemble is fit to almost the entire training set and makes relatively large updates to the prediction function. The algorithm quickly focuses on a small set of strongly predictive variables and repeatedly reuses them in subsequent trees. As a result, the variable-importance distribution becomes steep, with a few variables dominating.
With a low bagging fraction and small learning rate, each tree is built on a smaller subsample and contributes a smaller update. The ensemble explores more alternative splits and tends to use a broader set of predictors, spreading the importance across many variables instead of concentrating it on just a few.
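To make this concrete, the sketch below (not part of the original answer) refits the two settings from Figure 8.24 directly with gbm on the solubility training data and compares the relative-influence rankings. The number of trees and the interaction depth are illustrative choices rather than the values used to produce the book's figure.
# Hedged sketch: compare importance concentration for the two settings
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY

sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(2025)
gbm_left <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(2025)
gbm_right <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 7,
                 shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the (0.9, 0.9) fit is expected to concentrate
# importance on far fewer predictors than the (0.1, 0.1) fit
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)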
Which model do you think would be more predictive of other samples?
Answer (8.3b)
In general, the model with moderate or lower shrinkage (learning rate) and more regularization (e.g., smaller bagging fraction, more trees) is more likely to generalize well. The model that spreads importance across multiple predictors (left-hand plot) often behaves more conservatively and reduces the risk of overfitting to idiosyncratic patterns in the training data.
The high-learning-rate, high-bagging-fraction model (right-hand plot) can fit the training data very aggressively, risking overemphasis on a few predictors and overfitting. Unless cross-validation strongly supports those extreme tuning values, the more regularized model (with broader importance) is usually expected to be more predictive on new samples.
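Continuing the sketch above, one hedged way to check this expectation is to score both fits on the solubility test set (solTestXtrans, solTestY); this is an illustration only, not the book's analysis.
# Sketch: held-out RMSE for the two settings (reuses gbm_left and gbm_right)
pred_left  <- predict(gbm_left,  solTestXtrans, n.trees = 100)
pred_right <- predict(gbm_right, solTestXtrans, n.trees = 100)
caret::RMSE(pred_left,  solTestY)
caret::RMSE(pred_right, solTestY)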
How would increasing interaction depth affect the slope of predictor importance for either model in Figure 8.24?
Answer (8.3c)
Increasing the interaction depth allows each tree to capture more complex multi-variable interactions, which typically steepens the importance slope, though to different degrees for the two models:
For the more regularized model (left-hand plot), increasing interaction depth with an appropriately small learning rate may still keep importance relatively spread out, but the top predictors would likely gain additional influence. For the aggressive model (right-hand plot), deeper interactions would likely sharpen the importance slope even further, making the top few predictors dominate more strongly.
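As a rough check (again reusing sol_df from the sketch above, with illustrative depth values), one could refit the aggressive (0.9, 0.9) setting at two interaction depths and measure how much of the total relative influence the top five predictors capture.
# Sketch: effect of interaction depth on how concentrated importance becomes
top5_share <- function(depth) {
  set.seed(2025)
  fit <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
             n.trees = 100, interaction.depth = depth,
             shrinkage = 0.9, bag.fraction = 0.9)
  ri <- summary(fit, plotit = FALSE)
  sum(head(ri$rel.inf, 5)) / sum(ri$rel.inf)
}
# Share of relative influence held by the top 5 predictors at depths 1 and 10
sapply(c(1, 10), top5_share)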
This exercise revisits the chemical manufacturing data from Exercises
6.3 and 7.5. We work with the ChemicalManufacturingProcess
data from the AppliedPredictiveModeling package.
data("ChemicalManufacturingProcess")
chem_dat <- ChemicalManufacturingProcess
# Outcome is Yield
summary(chem_dat$Yield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.18 41.48 46.34
sum(is.na(chem_dat$Yield))
## [1] 0
We follow the same general strategy as in earlier exercises:
set.seed(2025)
train_index <- createDataPartition(chem_dat$Yield, p = 0.8, list = FALSE)
chem_train <- chem_dat[train_index, ]
chem_test <- chem_dat[-train_index, ]
# Preprocess: center, scale, and knn-impute missing values
pp_chem <- preProcess(
chem_train,
method = c("center", "scale", "knnImpute")
)
chem_train_pp <- predict(pp_chem, chem_train)
chem_test_pp <- predict(pp_chem, chem_test)
# Check for remaining missing values
sum(is.na(chem_train_pp))
## [1] 0
sum(is.na(chem_test_pp))
## [1] 0
set.seed(2025)
ctrl_chem <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5
)
Train several tree-based regression models. Which gives the optimal resampling and test set performance?
We consider:
- A single CART regression tree (rpart)
- Random forests (rf)
- Stochastic gradient boosting (gbm)
- Cubist (cubist)
set.seed(2025)
chem_rpart <- train(
Yield ~ .,
data = chem_train_pp,
method = "rpart",
trControl = ctrl_chem,
tuneLength = 10
)
chem_rpart
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01127554 0.8341856 0.3984872 0.6484306
## 0.01368111 0.8301402 0.4003657 0.6462145
## 0.02801216 0.8379098 0.3816499 0.6546597
## 0.02959764 0.8383309 0.3763290 0.6589632
## 0.03420670 0.8470113 0.3627820 0.6710388
## 0.03813039 0.8552697 0.3518925 0.6802027
## 0.04354638 0.8530202 0.3556501 0.6789527
## 0.07672524 0.8461837 0.3489352 0.6737562
## 0.07794709 0.8487569 0.3457890 0.6773511
## 0.38849362 0.9490328 0.2259248 0.7678095
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01368111.
set.seed(2025)
chem_rf <- train(
Yield ~ .,
data = chem_train_pp,
method = "rf",
trControl = ctrl_chem,
tuneLength = 5,
importance = TRUE
)
chem_rf
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6662118 0.6415371 0.5353556
## 15 0.6187853 0.6604179 0.4879570
## 29 0.6244776 0.6433476 0.4875823
## 43 0.6284689 0.6349744 0.4906896
## 57 0.6366873 0.6197229 0.4948959
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
set.seed(2025)
chem_gbm <- train(
Yield ~ .,
data = chem_train_pp,
method = "gbm",
trControl = ctrl_chem,
tuneLength = 5,
verbose = FALSE
)
chem_gbm
## Stochastic Gradient Boosting
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.6701383 0.5813309 0.5202090
## 1 100 0.6623044 0.5923863 0.5172324
## 1 150 0.6591551 0.5942849 0.5154872
## 1 200 0.6592040 0.5936370 0.5168390
## 1 250 0.6549753 0.6006155 0.5135923
## 2 50 0.6523957 0.6043344 0.5170576
## 2 100 0.6387623 0.6195785 0.5080581
## 2 150 0.6305558 0.6314990 0.5003676
## 2 200 0.6239983 0.6387315 0.4935470
## 2 250 0.6200105 0.6426514 0.4900592
## 3 50 0.6619733 0.5932824 0.5181675
## 3 100 0.6451119 0.6146906 0.5033858
## 3 150 0.6320234 0.6293495 0.4936154
## 3 200 0.6259622 0.6352831 0.4878285
## 3 250 0.6236193 0.6385338 0.4852492
## 4 50 0.6417579 0.6145137 0.4999330
## 4 100 0.6345025 0.6253701 0.4952744
## 4 150 0.6285961 0.6349796 0.4938852
## 4 200 0.6257294 0.6388996 0.4927038
## 4 250 0.6241977 0.6407729 0.4920170
## 5 50 0.6477863 0.6092152 0.5069508
## 5 100 0.6302526 0.6310390 0.4956012
## 5 150 0.6223333 0.6396696 0.4886746
## 5 200 0.6193377 0.6436058 0.4867325
## 5 250 0.6174479 0.6460945 0.4853530
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
set.seed(2025)
chem_cubist <- train(
Yield ~ .,
data = chem_train_pp,
method = "cubist",
trControl = ctrl_chem,
tuneLength = 5
)
chem_cubist
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.7567668 0.5418193 0.5755837
## 1 5 0.6745727 0.6324067 0.5002176
## 1 9 0.7125548 0.5880973 0.5310236
## 10 0 0.6020615 0.6518361 0.4841094
## 10 5 0.5438174 0.7153911 0.4279727
## 10 9 0.5722537 0.6826212 0.4519424
## 20 0 0.5843262 0.6675262 0.4690900
## 20 5 0.5246383 0.7318910 0.4157802
## 20 9 0.5542437 0.7000298 0.4391397
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
chem_resamps <- resamples(
list(
Tree = chem_rpart,
RF = chem_rf,
GBM = chem_gbm,
Cubist = chem_cubist
)
)
summary(chem_resamps)
##
## Call:
## summary.resamples(object = chem_resamps)
##
## Models: Tree, RF, GBM, Cubist
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3186895 0.5446789 0.6300540 0.6462145 0.7443212 1.1127208 0
## RF 0.2486917 0.4236201 0.4722985 0.4879570 0.5581950 0.7091079 0
## GBM 0.2648320 0.3927176 0.4806968 0.4853530 0.5575806 0.7421254 0
## Cubist 0.2394256 0.3579441 0.3858465 0.4157802 0.4737111 0.6848573 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3710557 0.7074982 0.8040287 0.8301402 0.9449259 1.3705496 0
## RF 0.2974471 0.5269086 0.6097822 0.6187853 0.7259485 0.9585881 0
## GBM 0.3688582 0.4995809 0.6216397 0.6174479 0.6912743 0.9629947 0
## Cubist 0.3004727 0.4445891 0.5072237 0.5246383 0.5950507 0.9032936 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.007364676 0.2846470 0.3796015 0.4003657 0.5035961 0.8244362 0
## RF 0.227850834 0.5487260 0.6830511 0.6604179 0.7837325 0.8919338 0
## GBM 0.219644541 0.5653635 0.6784806 0.6460945 0.7558495 0.9084821 0
## Cubist 0.215433544 0.6736650 0.7556208 0.7318910 0.8384317 0.9175856 0
bwplot(chem_resamps, metric = "RMSE")
Exercise 8.7(a): Comparison of resampling RMSE across tree-based models.
models_list <- list(
Tree = chem_rpart,
RF = chem_rf,
GBM = chem_gbm,
Cubist = chem_cubist
)
test_perf <- purrr::map_df(names(models_list), function(mname) {
mod <- models_list[[mname]]
preds <- predict(mod, chem_test_pp)
tibble(
Model = mname,
RMSE = RMSE(preds, chem_test_pp$Yield),
Rsq = R2(preds, chem_test_pp$Yield)
)
})
test_perf
## # A tibble: 4 × 3
## Model RMSE Rsq
## <chr> <dbl> <dbl>
## 1 Tree 0.616 0.478
## 2 RF 0.500 0.707
## 3 GBM 0.450 0.728
## 4 Cubist 0.411 0.784
Interpretation (8.7a)
From the cross-validation results and the test-set performance table, the single CART tree performs worst (resampled RMSE ≈ 0.83, test RMSE ≈ 0.62), random forest and GBM are close to each other (resampled RMSE ≈ 0.62; test RMSE ≈ 0.50 and 0.45), and Cubist gives the best results on both the resamples (RMSE ≈ 0.52, R² ≈ 0.73) and the test set (RMSE ≈ 0.41, R² ≈ 0.78).
Therefore, the Cubist model is the preferred tree-based regression model for predicting yield in this chemical manufacturing process.
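As a hedged follow-up (not part of the original solution), caret's diff() on the resamples object gives paired comparisons of the resampled RMSE values, which can confirm whether the Cubist advantage is consistent across the 50 resamples.
# Sketch: paired differences in resampled RMSE between the four models
chem_diffs <- diff(chem_resamps, metric = "RMSE")
summary(chem_diffs)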
For the optimal tree-based model, which predictors are most important? Do biological or process variables dominate? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
(Here we treat the Cubist model as the optimal tree-based model.)
chem_cubist_imp <- varImp(chem_cubist, scale = FALSE)
chem_cubist_imp
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 47.0
## BiologicalMaterial06 37.5
## BiologicalMaterial02 26.0
## ManufacturingProcess17 23.5
## ManufacturingProcess13 18.5
## BiologicalMaterial03 17.0
## ManufacturingProcess09 15.0
## ManufacturingProcess33 13.5
## ManufacturingProcess39 13.5
## ManufacturingProcess29 13.0
## ManufacturingProcess31 11.0
## ManufacturingProcess04 11.0
## ManufacturingProcess27 10.0
## ManufacturingProcess20 8.5
## ManufacturingProcess18 8.5
## ManufacturingProcess25 8.0
## ManufacturingProcess42 8.0
## ManufacturingProcess01 7.0
## ManufacturingProcess28 5.5
## BiologicalMaterial09 5.5
plot(chem_cubist_imp, top = 20)
Exercise 8.7(b): Cubist variable importance for the chemical manufacturing process.
top10_cubist <- chem_cubist_imp$importance %>%
as_tibble(rownames = "Predictor") %>%
arrange(desc(Overall)) %>%
slice(1:10)
top10_cubist
## # A tibble: 10 × 2
## Predictor Overall
## <chr> <dbl>
## 1 ManufacturingProcess32 47
## 2 BiologicalMaterial06 37.5
## 3 BiologicalMaterial02 26
## 4 ManufacturingProcess17 23.5
## 5 ManufacturingProcess13 18.5
## 6 BiologicalMaterial03 17
## 7 ManufacturingProcess09 15
## 8 ManufacturingProcess39 13.5
## 9 ManufacturingProcess33 13.5
## 10 ManufacturingProcess29 13
Interpretation (8.7b)
The Cubist importance plot shows a small group of predictors with much larger importance scores than the rest. Process variables dominate: seven of the top 10 predictors are ManufacturingProcess measurements (led by ManufacturingProcess32), while three biological material measurements (BiologicalMaterial06, BiologicalMaterial02, and BiologicalMaterial03) also rank highly.
This is consistent with the context of the manufacturing process: yield is strongly driven by how the process is controlled over time, so process variables tend to dominate. Compared with earlier optimal linear and nonlinear models (e.g., penalized regression or boosted models), there is substantial overlap in the top 10 predictors. Some models may rank a few biological measurements slightly higher or lower, but the core set of influential variables is stable across modeling approaches, which gives additional confidence in their relevance.
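A quick, hedged tally of the biological/process balance in the Cubist top 10, based on the top10_cubist table above; the same tabulation could be repeated on varImp() results from the earlier linear and nonlinear fits to check the claimed overlap.
# Sketch: count biological vs. process predictors among the Cubist top 10
top10_cubist %>%
  mutate(Type = ifelse(grepl("^Biological", Predictor),
                       "Biological", "Process")) %>%
  count(Type)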
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Even though the Cubist model is the most accurate overall, a
single regression tree is much more interpretable. We
plot the tuned rpart model and examine the leaf node
distributions.
rpart.plot(
chem_rpart$finalModel,
main = "Regression Tree for Chemical Manufacturing Yield",
type = 2,
extra = 1,
under = TRUE,
faclen = 0
)
Exercise 8.7(c): Tuned regression tree for yield in the chemical manufacturing process.
# Identify terminal node for each training observation
train_nodes <- chem_rpart$finalModel$where
leaf_df <- chem_train_pp %>%
mutate(TerminalNode = factor(train_nodes)) %>%
select(Yield, TerminalNode)
leaf_df %>%
ggplot(aes(x = TerminalNode, y = Yield)) +
geom_boxplot() +
labs(
x = "Terminal Node",
y = "Yield",
title = "Yield Distribution Across Terminal Nodes of the Tree"
)
Exercise 8.7(c): Distribution of yield by terminal node of the regression tree.
Interpretation (8.7c)
The tree plot shows a sequence of splits on a small number of predictors (mostly process variables) that partition the observations into regions with different mean yields. The boxplots of yield by terminal node make these differences visible, showing which combinations of process conditions are associated with higher or lower yields.
Although the single tree is not as predictive as ensemble methods, it provides clear, rule-like descriptions of how certain process variables interact to influence yield. This structure can be valuable for process engineers and domain experts when they interpret and optimize the manufacturing process.
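To attach numbers to the terminal-node view, a short sketch (not part of the original text) summarises the mean preprocessed yield and node size from the leaf_df object created above.
# Sketch: mean (centered and scaled) yield and observation count per terminal node
leaf_df %>%
  group_by(TerminalNode) %>%
  summarise(MeanYield = mean(Yield), n = n()) %>%
  arrange(desc(MeanYield))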