Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(dplyr)
library(tidyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Part A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

# sort variable importance scores from largest to smallest
rfImp1[order(-rfImp1$Overall), , drop = FALSE]
##         Overall
## V1   8.62743275
## V4   7.50258584
## V2   6.27437240
## V5   2.13575650
## V3   0.72305459
## V6   0.12395003
## V10  0.04312556
## V7   0.02927888
## V9  -0.10344797
## V8  -0.11724317

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The top predictors are V1–V5, while the importance scores for V6–V10 are near zero or negative, meaning these predictors did not meaningfully reduce model error.
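
As a quick check (a sketch using the rfImp1 object above), the average importance of the informative predictors can be compared with that of the noise predictors:

# compare mean importance of informative (V1-V5) vs. noise (V6-V10) predictors
informative <- paste0("V", 1:5)
noise       <- paste0("V", 6:10)
c(informative = mean(rfImp1[informative, "Overall"]),
  noise       = mean(rfImp1[noise, "Overall"]))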

Part B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

Fit another random forest model to these data. Did the importance score for V1 change?

# fit another RF model to the new data
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2[order(-rfImp2$Overall), , drop = FALSE]
##                 Overall
## V4          7.135941576
## V1          6.774034589
## V2          6.426340527
## duplicate1  3.084990840
## V5          2.135242904
## V3          0.613805379
## V6          0.171933358
## V7          0.142238552
## V10        -0.009701234
## V8         -0.073192083
## V9         -0.098719872

The importance score for V1 dropped from about 8.63 in the original model to about 6.77 in the new model, while the correlated predictor duplicate1 received an importance of roughly 3.08.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
# fit another RF model to the new data
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3[order(-rfImp3$Overall), , drop = FALSE]
##                 Overall
## V4          7.373782389
## V2          6.586726939
## V1          5.908641677
## duplicate1  2.351543736
## duplicate2  2.305339113
## V5          1.987341138
## V3          0.559845667
## V6          0.162417814
## V7          0.038423138
## V8          0.007497423
## V10         0.004023755
## V9         -0.001806331

When another highly correlated predictor is added, V1's importance decreases even further, to about 5.91, while the two correlated duplicates receive importance scores of roughly 2.35 and 2.31. The signal originally attributed to V1 is now split across three correlated predictors.
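
A small sketch makes the dilution explicit: summing the importance of V1 and its correlated copies in each model (using the rfImp1/rfImp2/rfImp3 objects above) shows how the same underlying signal is shared among them.

# total importance attributed to the "V1 signal" (V1 plus its correlated copies)
c(model1 = rfImp1["V1", "Overall"],
  model2 = sum(rfImp2[c("V1", "duplicate1"), "Overall"]),
  model3 = sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"]))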

Part C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
library(party)

# fit cforest model
cf_mod <- cforest(
  y ~ ., 
  data = simulated2,
  controls = cforest_unbiased(mtry = 5, ntree = 1000)
)

# Traditional (unconditional) importance
imp_uncond <- varimp(cf_mod, conditional = FALSE)

# Unconditional Importance - Original
print("Unconditional Importance (Original):")
## [1] "Unconditional Importance (Original):"
imp_uncond
##          V1          V2          V3          V4          V5          V6 
##  4.79816268  5.31220145  0.23469285 14.87122745  3.01596249 -0.02014226 
##          V7          V8          V9         V10 
## -0.01153973 -0.02808856  0.23185703  0.01654741
# Adding Duplicates
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1

# fit a new cforest model to the data with duplicates
cf_mod2 <- cforest(
  y ~ ., 
  data = simulated2,
  controls = cforest_unbiased(mtry = 5, ntree = 1000)
)

# Traditional (unconditional) importance
imp_uncond2 <- varimp(cf_mod2, conditional = FALSE)

# Unconditional Importance - Duplicates
print("Unconditional Importance (Duplicates):")
## [1] "Unconditional Importance (Duplicates):"
imp_uncond2
##          V1          V2          V3          V4          V5          V6 
##  3.34031918  5.06077596  0.32552911 13.33550183  2.99227521 -0.02043198 
##          V7          V8          V9         V10  duplicate1  duplicate2 
##  0.02163715 -0.01906310  0.16696560  0.03264320  1.01145180  1.63068678

The unconditional cforest importances show the same pattern as the traditional random forest: the informative predictors (especially V1, V2, V4, and V5) receive the highest scores, and when the correlated duplicates are added, V1's importance drops while the duplicates receive substantial importance.

# Conditional importance (Original)
imp_cond <- varimp(cf_mod, conditional = TRUE)

print("Conditional Importance (Original):")
## [1] "Conditional Importance (Original):"
imp_cond
##            V1            V2            V3            V4            V5 
##  4.4189687486  3.5500305539  0.1391359530 11.6277720699  2.4776594601 
##            V6            V7            V8            V9           V10 
## -0.0156419262 -0.0106707027 -0.0057039272  0.0820425334  0.0003078762
# Conditional importance (Duplicates)
imp_cond2 <- varimp(cf_mod2, conditional = TRUE)

print("Conditional Importance (Duplicates):")
## [1] "Conditional Importance (Duplicates):"
imp_cond2
##           V1           V2           V3           V4           V5           V6 
##  1.938375656  3.582064309  0.169370885 10.900787113  2.268786809 -0.015398998 
##           V7           V8           V9          V10   duplicate1   duplicate2 
## -0.014952334 -0.003480553  0.043583029  0.001668314  0.368692590  0.685935961

The conditional cforest importance scores do not follow this pattern as strongly. V1 and its duplicates all receive much lower values because the conditional measure accounts for correlations among predictors and attempts to isolate each variable's unique contribution.
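
To make the contrast easier to see, the two measures for the model with duplicates can be placed side by side (a sketch using the objects computed above):

# side-by-side comparison of unconditional vs. conditional cforest importance
round(data.frame(
  unconditional = imp_uncond2,
  conditional   = imp_cond2[names(imp_uncond2)]
), 3)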

Part D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)     # boosted trees
library(Cubist)  # Cubist models

# Original predictors only (V1–V10)
sim_orig <- mlbench.friedman1(200, sd = 1)
sim_orig <- cbind(sim_orig$x, sim_orig$y)
sim_orig <- as.data.frame(sim_orig)
colnames(sim_orig)[ncol(sim_orig)] <- "y"

# Add first duplicate of V1
sim_dup1 <- sim_orig
sim_dup1$duplicate1 <- sim_dup1$V1 + rnorm(200) * .1
cor(sim_dup1$duplicate1, sim_dup1$V1)
## [1] 0.9507778
# Add first and second duplicate of V1
sim_dup2 <- sim_dup1
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1

## Common train control
ctrl <- trainControl(method = "cv", number = 10)

GBM

### ORIGINAL PREDICTORS
set.seed(123)
gbm_orig <- train(
  y ~ .,
  data = sim_orig,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

gbmImp_orig <- varImp(gbm_orig, scale = FALSE)$importance
gbmImp_orig[order(-gbmImp_orig$Overall), , drop = FALSE]
##       Overall
## V4  5845.7941
## V1  3121.9431
## V5  3032.8781
## V2  2849.6088
## V3  1884.7755
## V10  353.8001
## V7   293.4523
## V9   238.4252
## V8   234.6394
## V6   221.0765
### WITH ONE DUPLICATE
set.seed(123)
gbm_dup1 <- train(
  y ~ .,
  data = sim_dup1,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

gbmImp_dup1 <- varImp(gbm_dup1, scale = FALSE)$importance
gbmImp_dup1[order(-gbmImp_dup1$Overall), , drop = FALSE]
##              Overall
## V4         5626.6393
## V2         2722.1442
## V5         2676.0570
## V3         1935.5097
## duplicate1 1710.9739
## V1         1666.0219
## V10         587.0154
## V7          390.0232
## V8          333.5729
## V6          277.7722
## V9          239.3611
### WITH TWO DUPLICATES
set.seed(123)
gbm_dup2 <- train(
  y ~ .,
  data = sim_dup2,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

gbmImp_dup2 <- varImp(gbm_dup2, scale = FALSE)$importance
gbmImp_dup2[order(-gbmImp_dup2$Overall), , drop = FALSE]
##              Overall
## V4         5650.5317
## V2         2714.3759
## V5         2666.2679
## V3         1939.0151
## duplicate1 1684.8457
## V1         1607.9299
## V10         521.5275
## V7          373.7660
## V8          324.3161
## V9          266.7001
## V6          254.2224
## duplicate2  164.1921

When duplicates are introduced, the boosted trees behave similarly to the traditional random forest. V1's importance decreases after the first duplicate is added, and duplicate1 receives a sizable importance score. When the second duplicate is added, V1's importance decreases slightly further, but duplicate2 receives only a small score because the V1 signal is already captured by V1 and duplicate1.
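
A short sketch using the gbmImp_dup2 object above shows how the V1 signal is shared in the two-duplicate GBM, expressed as a share of the total importance:

# share of total GBM importance captured by V1 and its duplicates
v1_block <- c("V1", "duplicate1", "duplicate2")
round(gbmImp_dup2[v1_block, "Overall"] / sum(gbmImp_dup2$Overall), 3)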

Cubist

### ORIGINAL PREDICTORS
X_orig <- sim_orig %>% select(-y)
cubist_orig <- cubist(X_orig, sim_orig$y, committees = 100)

cubistImp_orig <- varImp(cubist_orig)
cubistImp_orig[order(-cubistImp_orig$Overall), , drop = FALSE]
##     Overall
## V1     63.5
## V2     63.0
## V3     58.5
## V4     39.0
## V5     26.0
## V8     16.5
## V7      5.5
## V9      5.5
## V6      4.0
## V10     3.5
### WITH ONE DUPLICATE
X_dup1 <- sim_dup1 %>% select(-y)
cubist_dup1 <- cubist(X_dup1, sim_dup1$y, committees = 100)

cubistImp_dup1 <- varImp(cubist_dup1)
cubistImp_dup1[order(-cubistImp_dup1$Overall), , drop = FALSE]
##            Overall
## V3            67.0
## V2            61.0
## V1            52.0
## V4            40.5
## V5            25.5
## duplicate1    25.5
## V8             8.5
## V7             4.5
## V9             3.0
## V10            2.5
## V6             1.0
### WITH TWO DUPLICATES
X_dup2 <- sim_dup2 %>% select(-y)
cubist_dup2 <- cubist(X_dup2, sim_dup2$y, committees = 100)

cubistImp_dup2 <- varImp(cubist_dup2)
cubistImp_dup2[order(-cubistImp_dup2$Overall), , drop = FALSE]
##            Overall
## V2            66.5
## V3            57.5
## V4            48.0
## V1            45.5
## V5            39.0
## duplicate1    23.0
## V8            10.5
## duplicate2     6.0
## V7             5.0
## V10            2.0
## V6             1.5
## V9             0.5

Cubist shows a pattern similar to boosted trees and the traditional random forest: when predictors highly correlated with V1 were added, the importance of V1 decreased and the duplicates received non-zero importance. However, the effect is noticeably weaker, with the duplicates receiving only modest scores relative to V1.
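
A sketch pulling the V1-related scores out of the three Cubist importance tables above summarizes how the signal is shared as duplicates are added:

# importance of V1 and its duplicates across the three Cubist fits
list(
  original       = cubistImp_orig["V1", "Overall"],
  one_duplicate  = cubistImp_dup1[c("V1", "duplicate1"), "Overall"],
  two_duplicates = cubistImp_dup2[c("V1", "duplicate1", "duplicate2"), "Overall"]
)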

Exercise 8.2

Use a simulation to show tree bias with different granularities.

library(rpart)

# Function to simulate one dataset and record which variable is used at the ROOT split of the tree
one_sim <- function(n = 200) {
  # Predictors with increasing numbers of levels
  x2  <- factor(sample(letters[1:2],  n, replace = TRUE))    # 2 levels
  x4  <- factor(sample(letters[1:4],  n, replace = TRUE))    # 4 levels
  x10 <- factor(sample(letters[1:10], n, replace = TRUE))    # 10 levels
  x20 <- factor(sample(letters[1:20], n, replace = TRUE))    # 20 levels
  
  # Outcome completely unrelated to predictors
  y <- factor(sample(c("Class1", "Class2"), n, replace = TRUE))
  
  dat <- data.frame(y, x2, x4, x10, x20)
  
  fit <- rpart(
    y ~ .,
    data   = dat,
    method = "class",
    control = rpart.control(cp = 0, xval = 0, minsplit = 10)
  )
  
  # Variable used in the first (root) split
  root_var <- as.character(fit$frame$var[1])
  root_var
}

# Repeat the experiment many times
set.seed(2025)
n_sims <- 1000
root_vars <- replicate(n_sims, one_sim())

# Summarize how often each predictor is chosen at the root
tab_root <- table(root_vars)
tab_root
## root_vars
## x10  x2 x20  x4 
## 121   2 871   6
prop.table(tab_root)
## root_vars
##   x10    x2   x20    x4 
## 0.121 0.002 0.871 0.006
# Bar Plot
barplot(
  prop.table(tab_root),
  main = "Frequency of Being Chosen as Root Split",
  ylab = "Proportion of simulations",
  xlab = "Predictor"
)

This plot demonstrates tree bias: all four predictors are pure noise, yet the tree overwhelmingly chooses the factors with more levels for the root split (x20 in about 87% of simulations, x10 in about 12%). CART is biased toward variables with more potential split points, i.e., higher granularity.
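
As an optional extra check, a chi-squared goodness-of-fit test against an equal 25% chance for each predictor confirms that this preference is far from what random selection would produce (a sketch; it assumes all four predictors appear in tab_root, as in the output above):

# test whether the root-split frequencies are consistent with uniform selection
chisq.test(tab_root, p = rep(1/4, 4))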

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Part A

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right concentrates its importance on just a few predictors because both the learning rate and the bagging fraction are very high. Each tree sees most of the data and contributes strongly to the ensemble, so the trees quickly latch onto the predictors that explain the most variation. With a large learning rate, those predictors receive heavy weight early on, and subsequent trees keep reusing the same few predictors, so the variable importance becomes concentrated at the top.

In contrast, the model on the left uses a small learning rate and a small bagging fraction, which produces weaker trees. Weak trees learn slowly and each one sees a different small subset of the data, so more predictors get used across the boosting iterations and the importance is spread more evenly.
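
A rough sketch of this effect can be reproduced on the simulated Friedman data from Exercise 8.1 (not the solubility data in Fig. 8.24). The two fits below mirror the extreme settings; n.minobsinnode is lowered because a 0.1 bagging fraction leaves only about 20 samples per tree here.

set.seed(100)
# weak learners: small learning rate and small bagging fraction
gbm_low <- gbm(y ~ ., data = sim_orig, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
# strong learners: large learning rate and large bagging fraction
gbm_high <- gbm(y ~ ., data = sim_orig, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)

summary(gbm_low, plotit = FALSE)   # expect importance spread across more predictors
summary(gbm_high, plotit = FALSE)  # expect importance concentrated on a few predictors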

Part B

Which model do you think would be more predictive of other samples?

The model on the left would be more predictive of other samples because it learns slowly and carefully, looking at more predictors. The model on the right learns too quickly and only locks onto a few predictors, possibly resulting in overfitting.

Part C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows each tree to make more splits, so predictors beyond the strongest few get used within each tree. This spreads importance across more predictors and would flatten the slope of the predictor importance plot for both models, although the right-hand model would likely remain the more concentrated of the two.
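
Continuing the sketch above (same simulated data, illustrative settings), refitting the high-shrinkage model with much deeper trees is one way to see this spreading effect:

set.seed(100)
# same aggressive settings as gbm_high, but with much deeper trees
gbm_deep <- gbm(y ~ ., data = sim_orig, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_deep, plotit = FALSE)  # expect importance spread over more predictors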

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library("AppliedPredictiveModeling")
data(ChemicalManufacturingProcess)

# separate response and predictors
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]

ncol(predictors)
## [1] 57
# Check NAs
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
# Impute using KNN
preProc <- preProcess(
  ChemicalManufacturingProcess, 
  method = "knnImpute")

chemical_imp <- predict(preProc, ChemicalManufacturingProcess)

# check for missing values after imputation
sum(is.na(chemical_imp))
## [1] 0
predictors_imp <- chemical_imp %>% select(-Yield)

# split data into X & y
X <- predictors_imp
y <- chemical_imp$Yield

# find near-zero variance predictors
nzv_pred <- nearZeroVar(X)

# remove the near-zero variance predictors
# (note: the models below are trained on the full predictor set X,
#  which matches the 57-predictor count reported in their output)
filtered_pred <- X[ , -nzv_pred]

# Indexing for train/test split
split <- createDataPartition(y, p = 0.8, list = FALSE)

# train and test sets
train_x <- X[ split, , drop = FALSE]
test_x  <- X[-split, , drop = FALSE]
train_y <- y[ split]
test_y  <- y[-split]

Part A

Which tree-based regression model gives the optimal resampling and test set performance?

# SET UP TRAIN CONTROL
ctrl <- trainControl(
  method = "cv",
  number = 10
)

CART

model_cart <- train(
  x = train_x,
  y = train_y,
  method = "rpart",
  trControl = ctrl
)

model_cart
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 129, 128, 129, 128, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.07137402  0.8309802  0.4110673  0.6446151
##   0.08124992  0.8353065  0.3873825  0.6635985
##   0.40865960  0.9398351  0.3397546  0.7661296
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07137402.

Bagged Trees

model_bagged <- train(
  x = train_x,
  y = train_y,
  method = "treebag",
  trControl = ctrl
)

model_bagged
## Bagged CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 132, 130, 129, 128, 131, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.6711089  0.5820764  0.5291135

Random Forest

model_rf <- train(
  x = train_x,
  y = train_y,
  method = "rf",
  trControl = ctrl,
  importance = TRUE
)

model_rf
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 131, 128, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6980237  0.6141307  0.5569789
##   29    0.6381884  0.6308702  0.4981354
##   57    0.6521590  0.6073879  0.5080742
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.

Boosted Trees

model_gbm <- train(
  x = train_x,
  y = train_y,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

model_gbm
## Stochastic Gradient Boosting 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 128, 130, 129, 130, 132, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   1                   50      0.6713659  0.5942663  0.5288096
##   1                  100      0.6646585  0.6070959  0.5198700
##   1                  150      0.6631846  0.6127577  0.5276784
##   2                   50      0.6496497  0.6300110  0.5167411
##   2                  100      0.6476721  0.6339703  0.5189780
##   2                  150      0.6322667  0.6484494  0.5015802
##   3                   50      0.6575407  0.6149365  0.5184601
##   3                  100      0.6316775  0.6466327  0.4998399
##   3                  150      0.6281451  0.6497704  0.4968216
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

Cubist

model_cubist <- train(
  x = train_x,
  y = train_y,
  method = "cubist",
  trControl = ctrl
)

model_cubist
## Cubist 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 130, 129, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          0          0.7912088  0.5150002  0.6227799
##    1          5          0.7196493  0.6105686  0.5577757
##    1          9          0.7757404  0.5332135  0.5933102
##   10          0          0.6731150  0.5865357  0.5284302
##   10          5          0.6156789  0.6597313  0.4898853
##   10          9          0.6488282  0.6132050  0.5141006
##   20          0          0.6334982  0.6229024  0.5048943
##   20          5          0.5773318  0.6933302  0.4609861
##   20          9          0.6148721  0.6451698  0.4891114
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

Comparing Resampling

resamps <- resamples(
  list(
    CART = model_cart,
    BaggedTrees = model_bagged,
    RF = model_rf,
    GBM = model_gbm,
    Cubist = model_cubist
  )
)

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: CART, BaggedTrees, RF, GBM, Cubist 
## Number of resamples: 10 
## 
## MAE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART        0.5186308 0.5874275 0.6191775 0.6446151 0.7198639 0.8014350    0
## BaggedTrees 0.4159340 0.4410103 0.4773120 0.5291135 0.5991577 0.7530325    0
## RF          0.3023788 0.3627797 0.5268906 0.4981354 0.6258511 0.6629070    0
## GBM         0.3279942 0.4660057 0.4866950 0.4968216 0.5564915 0.6629425    0
## Cubist      0.3647397 0.4108998 0.4582644 0.4609861 0.5134204 0.5967377    0
## 
## RMSE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART        0.6395920 0.7767670 0.7961089 0.8309802 0.9189822 0.9844680    0
## BaggedTrees 0.4983984 0.5774575 0.6655919 0.6711089 0.7459837 0.8950134    0
## RF          0.3624095 0.5020652 0.6382949 0.6381884 0.7801838 0.9313015    0
## GBM         0.4063015 0.5806081 0.6145188 0.6281451 0.7047843 0.8403914    0
## Cubist      0.4223299 0.4940569 0.5917600 0.5773318 0.6259345 0.8314379    0
## 
## Rsquared 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART        0.1519906 0.3688339 0.4057057 0.4110673 0.4972725 0.5663949    0
## BaggedTrees 0.3200892 0.4508161 0.6522859 0.5820764 0.6736345 0.8139778    0
## RF          0.3651921 0.5345762 0.6514439 0.6308702 0.7441165 0.8869715    0
## GBM         0.4855539 0.6175037 0.6686994 0.6497704 0.6934549 0.7901644    0
## Cubist      0.4290911 0.6291202 0.7340465 0.6933302 0.7860551 0.8341960    0

Among the tree-based models, Cubist shows the best resampling performance: the lowest mean RMSE (0.577), the highest mean \(R^2\) (about 69.3%), and the lowest mean MAE (0.461). Boosted trees (GBM) came in second and random forest third, while the CART model had the poorest performance overall.
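
The ranking can also be pulled out programmatically from the resamples object (a sketch; resamps$values holds the per-fold metrics):

# mean cross-validated RMSE per model, sorted from best to worst
rmse_cols <- grep("RMSE", names(resamps$values), value = TRUE)
sort(colMeans(resamps$values[, rmse_cols]))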

Test-Set Performance

pred_cart   <- predict(model_cart, test_x)
pred_bagged     <- predict(model_bagged, test_x)
pred_rf     <- predict(model_rf, test_x)
pred_gbm    <- predict(model_gbm, test_x)
pred_cubist <- predict(model_cubist, test_x)

print("CART:")
## [1] "CART:"
postResample(pred_cart,   test_y)
##      RMSE  Rsquared       MAE 
## 0.7858794 0.2563955 0.5957554
print("Bagged Trees:")
## [1] "Bagged Trees:"
postResample(pred_bagged,     test_y)
##      RMSE  Rsquared       MAE 
## 0.5357896 0.6158428 0.4200970
print("Random Forest:")
## [1] "Random Forest:"
postResample(pred_rf,     test_y)
##      RMSE  Rsquared       MAE 
## 0.4895522 0.6886866 0.4070290
print("Boosted Trees (GBM):")
## [1] "Boosted Trees (GBM):"
postResample(pred_gbm,    test_y)
##      RMSE  Rsquared       MAE 
## 0.5185347 0.6539269 0.4033885
print("Cubist:")
## [1] "Cubist:"
postResample(pred_cubist, test_y)
##      RMSE  Rsquared       MAE 
## 0.3946852 0.7917949 0.3295958

Cubist once again provided the best performance among all the tree-based models. It achieved the lowest test-set RMSE (0.3947) and MAE (0.3296), and had the highest \(R^2\) of 79.2%. The CART model performed the worst.
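
For easier comparison, the test-set metrics can be collected into a single table (a sketch using the predictions above):

# assemble test-set performance for all five models
rbind(
  CART        = postResample(pred_cart,   test_y),
  BaggedTrees = postResample(pred_bagged, test_y),
  RF          = postResample(pred_rf,     test_y),
  GBM         = postResample(pred_gbm,    test_y),
  Cubist      = postResample(pred_cubist, test_y)
)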

Part B

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(model_cubist)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  42.857
## ManufacturingProcess17  42.857
## BiologicalMaterial02    39.286
## ManufacturingProcess13  26.786
## ManufacturingProcess33  21.429
## ManufacturingProcess04  20.536
## ManufacturingProcess29  20.536
## BiologicalMaterial10    20.536
## ManufacturingProcess45  18.750
## BiologicalMaterial03    17.857
## ManufacturingProcess25  17.857
## BiologicalMaterial06    14.286
## ManufacturingProcess15  14.286
## BiologicalMaterial12    14.286
## ManufacturingProcess14  11.607
## BiologicalMaterial05     8.929
## ManufacturingProcess28   8.036
## ManufacturingProcess24   8.036
## ManufacturingProcess19   8.036

The top predictor is ManufacturingProcess32, which was also the top predictor in the linear and nonlinear models. As in those models, both manufacturing process and biological material predictors appear in the top 10, but the manufacturing process variables dominate (8 of the top 10).
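
A quick sketch tallies the split between process and biological variables among the top 10 Cubist predictors:

# count manufacturing vs. biological predictors in the Cubist top 10
cubist_imp <- varImp(model_cubist)$importance
top10 <- head(rownames(cubist_imp)[order(-cubist_imp$Overall)], 10)
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))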

Part C

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)

# Plot the CART tree
rpart.plot(
  model_cart$finalModel,
  type = 2,        # split labels
  extra = 101,     # nobs, mean response (Yield)
  under = TRUE,
  faclen = 0,
  cex = 0.7
)

The CART tree shows that ManufacturingProcess32 is the first and most important split for this model, meaning it is the strongest single predictor of yield. No biological predictors appear in the tree, so for CART, the main signals driving yield come from the manufacturing process variables. ManufacturingProcess31 appears as the next split, which suggests that it helps refine the prediction after Process32.
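
The rpart.plot above reports node means; to see the full distribution of yield within each terminal node, one option is a boxplot of the (standardized) yield by terminal-node membership, sketched below using the rpart object's where component:

# assign each training sample to its CART terminal node and plot yield by node
node_id <- model_cart$finalModel$where
boxplot(train_y ~ factor(node_id),
        xlab = "Terminal node",
        ylab = "Yield (standardized)",
        main = "Distribution of yield within CART terminal nodes")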