8.1. Recreate the simulated data from Exercise 7.2

8.1a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

library(mlbench)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)

colnames(simulated)[ncol(simulated)] <- "y"


model1 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.86329776
## V2   6.72851763
## V3   0.84145353
## V4   7.60284159
## V5   2.26864193
## V6   0.11268425
## V7   0.07374772
## V8  -0.07210708
## V9  -0.06913906
## V10 -0.10577619

No, the random forest did not significantly use the uninformative predictors. Variables V6–V10 all have importance values near zero (or negative), showing they didn’t contribute meaningfully to the model.

8.1b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(200)

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
model2 <- randomForest(
  y ~ ., 
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.82920546
## V2          6.19827153
## V3          0.64690600
## V4          7.18846269
## V5          2.18817746
## V6          0.14271410
## V7          0.08390675
## V8         -0.14854947
## V9         -0.09698485
## V10        -0.04244151
## duplicate1  4.08026296

Yes, the importance score for V1 decreased (from about 8.86 to 5.83) after adding the correlated variable, with duplicate1 picking up an importance of about 4.08. The credit for V1's signal is now shared between V1 and its duplicate, and adding further correlated copies dilutes it even more, as the sketch below illustrates.
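To check that claim directly, here is a minimal sketch (not part of the original output) that adds a second correlated copy of V1 on a fresh copy of the data, so the later parts of the exercise are unaffected:

sim2 <- simulated
set.seed(200)
sim2$duplicate2 <- sim2$V1 + rnorm(200) * 0.1   # second predictor correlated with V1

model3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)   # expect V1 and duplicate1 to drop further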

8.1c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
set.seed(200)

cf_model <- cforest(
  y ~ ., 
  data = simulated,
  controls = cforest_unbiased(ntree = 1000)
)

# Conditional importance
cf_imp_cond <- varimp(cf_model, conditional = TRUE)

# Traditional importance
cf_imp_trad <- varimp(cf_model, conditional = FALSE)

cf_imp_cond
##          V1          V2          V3          V4          V5          V6 
##  3.33759413  4.82582001  0.01146137  6.22339905  1.25721990  0.01194947 
##          V7          V8          V9         V10  duplicate1 
##  0.01685248 -0.01632860  0.00389434  0.00277754  1.39182243
cf_imp_trad
##           V1           V2           V3           V4           V5           V6 
##  6.738928518  6.080946911  0.006032679  7.766461332  1.899235131  0.005251184 
##           V7           V8           V9          V10   duplicate1 
##  0.041476323 -0.034157422 -0.035006239 -0.019734869  2.769462305

Yes, the general pattern is similar: the informative predictors (especially V1, V2, and V4) remain the most important under both measures, and V6–V10 stay near zero. The conditional importance, however, shrinks the scores of the correlated pair (V1 and duplicate1) relative to the traditional measure, giving a more balanced view and reducing the bias caused by correlated variables.
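A quick side-by-side view of the two measures (using the objects computed above) makes the contrast easier to see:

# Traditional vs. conditional importance for each predictor
round(cbind(traditional = cf_imp_trad, conditional = cf_imp_cond), 3)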

8.1d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(caret)
library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)

ctrl <- trainControl(method = "cv", number = 10)

# Boosted Trees
boost_model <- train(
  y ~ ., 
  data = simulated,
  method = "gbm",
  trControl = ctrl,
  verbose = FALSE
)

varImp(boost_model)
## gbm variable importance
## 
##              Overall
## V4         1.000e+02
## V2         7.198e+01
## V1         7.129e+01
## V5         3.922e+01
## V3         2.658e+01
## duplicate1 1.089e+01
## V6         3.126e+00
## V7         2.547e+00
## V9         1.181e+00
## V8         1.342e-03
## V10        0.000e+00
# Cubist
cubist_model <- train(
  y ~ ., 
  data = simulated,
  method = "cubist",
  trControl = ctrl
)

varImp(cubist_model)
## cubist variable importance
## 
##            Overall
## V1          100.00
## V2           79.17
## V4           68.06
## V3           58.33
## V5           52.08
## duplicate1    0.00
## V8            0.00
## V7            0.00
## V6            0.00
## V10           0.00
## V9            0.00

Broadly, yes: the informative predictors (V1–V5) still dominate and the uninformative predictors (V6–V10) remain at or near zero. The two methods treat the correlated predictor differently, however. The boosted tree splits some importance between V1 and duplicate1, while Cubist assigns essentially all of the credit to V1 and none to duplicate1.

8.2. Use a simulation to show tree bias with different granularities.

library(rpart)

set.seed(123)

n <- 200

# Simulation 
x_cont <- runif(n)  # continuous predictor
x_cat  <- factor(sample(c("A", "B"), n, replace = TRUE))  # categorical predictor
y <- rnorm(n)  # noise response

data_sim <- data.frame(x_cont, x_cat, y)

# Fit tree 
tree_model <- rpart(y ~ ., data = data_sim)

# Show tree (first split)
tree_model
## n= 200 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 200 197.379900  0.04212110  
##     2) x_cont< 0.1111159 14   9.881818 -0.51053480 *
##     3) x_cont>=0.1111159 186 182.900200  0.08371886  
##       6) x_cont>=0.1471042 177 170.627500  0.04529377  
##        12) x_cont< 0.2742751 28  20.481330 -0.34331380  
##          24) x_cont>=0.2199433 15   5.405710 -0.61773550 *
##          25) x_cont< 0.2199433 13  12.642620 -0.02667349 *
##        13) x_cont>=0.2742751 149 145.123100  0.11832070  
##          26) x_cont>=0.320928 139 130.866500  0.06812094  
##            52) x_cont< 0.9556685 132 125.002400  0.03054460  
##             104) x_cont>=0.9077436 9   6.687146 -0.70377380 *
##             105) x_cont< 0.9077436 123 113.107200  0.08427522  
##               210) x_cont< 0.3745884 11  11.700110 -0.48129310 *
##               211) x_cont>=0.3745884 112  97.542950  0.13982210 *
##            53) x_cont>=0.9556685 7   2.163031  0.77670330 *
##          27) x_cont< 0.320928 10   9.037462  0.81609730 *
##       7) x_cont< 0.1471042 9   6.871697  0.83941230 *
# Repeat simulation to demonstrate bias
results <- replicate(100, {
  x_cont <- runif(n)
  x_cat  <- factor(sample(c("A", "B"), n, replace = TRUE))
  y <- rnorm(n)
  
  df <- data.frame(x_cont, x_cat, y)
  model <- rpart(y ~ ., data = df)
  
  as.character(model$frame$var[1])  # first split variable
})

# how often each variable is chosen
table(results)
## results
##  x_cat x_cont 
##      7     93

The tree almost always splits on x_cont instead of x_cat, even though both predictors are just noise. Across the simulations, x_cont was chosen 93 times vs. only 7 times for x_cat, showing that trees are biased toward variables with more possible split points.
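To make the "different granularities" point more explicit, a small extension of the same simulation (hypothetical, not in the original) compares a 2-level factor, a 10-level factor, and a continuous predictor, all pure noise:

set.seed(321)
first_split <- replicate(100, {
  x2  <- factor(sample(letters[1:2],  n, replace = TRUE))   # 2 levels  (coarse)
  x10 <- factor(sample(letters[1:10], n, replace = TRUE))   # 10 levels (finer)
  xc  <- runif(n)                                           # continuous (finest)
  y   <- rnorm(n)                                           # noise response
  fit <- rpart(y ~ x2 + x10 + xc, data = data.frame(x2, x10, xc, y))
  as.character(fit$frame$var[1])                            # first split ("<leaf>" if none)
})
table(first_split)   # expect xc and x10 to win far more often than x2

The first split should land on the higher-granularity predictors far more often, even though none of the three carries any signal.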

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

8.3a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses a high learning rate and high bagging fraction (both 0.9), which causes the boosting algorithm to learn very quickly. Because of this, the model places a large amount of importance on the strongest predictors early in the process. As a result, only a few predictors dominate, leading to a steep importance slope.

In contrast, the model on the left uses a lower learning rate and smaller bagging fraction (both 0.1). This slows down the learning process and allows the model to incorporate more predictors gradually. Consequently, the importance is distributed more evenly across predictors, resulting in a flatter importance slope.
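For a rough illustration (the actual figure uses the solubility data, which is not loaded here), the same contrast can be sketched on the simulated data from Exercise 8.1 by fitting gbm directly with the two extreme settings; n.minobsinnode is lowered because the simulated set has only 200 rows:

set.seed(100)
gbm_left <- gbm(
  y ~ ., data = simulated, distribution = "gaussian",
  n.trees = 100, interaction.depth = 3,
  shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5
)
set.seed(100)
gbm_right <- gbm(
  y ~ ., data = simulated, distribution = "gaussian",
  n.trees = 100, interaction.depth = 3,
  shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5
)

summary(gbm_left,  plotit = FALSE)   # importance spread over more predictors
summary(gbm_right, plotit = FALSE)   # importance concentrated in the top few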

8.3b) Which model do you think would be more predictive of other samples?

The model on the left is likely to be more predictive for new samples. This is because the lower learning rate reduces the risk of overfitting by making smaller, incremental updates to the model. Additionally, the smaller bagging fraction introduces more randomness, which helps improve generalization.

In contrast, the model on the right learns too aggressively and is more likely to overfit the training data, reducing its ability to perform well on new data.
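As a rough check on this (assuming the gbm_left and gbm_right sketches above), an independent test set drawn from the same Friedman simulation can be scored with both fits:

set.seed(201)
test_sim <- mlbench.friedman1(200, sd = 1)
test_df  <- as.data.frame(test_sim$x)                  # columns V1 ... V10
test_df$duplicate1 <- test_df$V1 + rnorm(200) * 0.1    # mirror the training construction

rmse_left  <- sqrt(mean((predict(gbm_left,  test_df, n.trees = 100) - test_sim$y)^2))
rmse_right <- sqrt(mean((predict(gbm_right, test_df, n.trees = 100) - test_sim$y)^2))
c(left = rmse_left, right = rmse_right)   # the slower learner is expected to generalize at least as well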

8.3c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

library(caret)
library(gbm)
library(mlbench)

set.seed(123)

# Generate data 
sim <- mlbench.friedman1(200, sd = 1)
df <- data.frame(sim$x)
df$y <- sim$y

# Train Control 
ctrl <- trainControl(method = "cv", number = 5)

#  Model with low interaction depth 
set.seed(123)
gbm_low <- train(
  y ~ ., 
  data = df,
  method = "gbm",
  trControl = ctrl,
  tuneGrid = expand.grid(
    interaction.depth = 1,
    n.trees = 100,
    shrinkage = 0.1,
    n.minobsinnode = 10
  ),
  verbose = FALSE
)

# Model with high interaction depth 
set.seed(123)
gbm_high <- train(
  y ~ ., 
  data = df,
  method = "gbm",
  trControl = ctrl,
  tuneGrid = expand.grid(
    interaction.depth = 5,
    n.trees = 100,
    shrinkage = 0.1,
    n.minobsinnode = 10
  ),
  verbose = FALSE
)

# Variable Importance 
imp_low  <- varImp(gbm_low)
imp_high <- varImp(gbm_high)

imp_low
## gbm variable importance
## 
##      Overall
## X4  100.0000
## X2   69.6510
## X1   31.5082
## X5   26.9567
## X3   18.0759
## X6    1.8695
## X8    0.4042
## X7    0.0000
## X9    0.0000
## X10   0.0000
imp_high
## gbm variable importance
## 
##     Overall
## X4  100.000
## X2   62.731
## X5   32.578
## X1   30.524
## X3   18.896
## X8    3.629
## X6    3.486
## X7    1.291
## X10   1.119
## X9    0.000

Increasing the interaction depth allows the model to capture interactions between predictors, which spreads importance across more variables. As a result, the importance slope becomes less steep, since more predictors contribute instead of only a few dominating.

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

8.7a) Which tree-based regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
library(caret)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
data(ChemicalManufacturingProcess)

y <- ChemicalManufacturingProcess$Yield
x <- ChemicalManufacturingProcess[, -which(names(ChemicalManufacturingProcess) == "Yield")]

preProc  <- preProcess(x, method = c("knnImpute", "center", "scale"))
x_imputed <- predict(preProc, x)

df <- data.frame(x_imputed, Yield = y)

set.seed(123)
trainIndex <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test  <- df[-trainIndex, ]


ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)
cart_model <- train(Yield ~ ., data = train, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)
rf_model <- train(Yield ~ ., data = train, method = "rf", trControl = ctrl)

set.seed(123)
gbm_model <- train(Yield ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)

# CV performance
cart_model$results
##           cp     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1 0.06890550 1.459821 0.4227058 1.153820 0.3806593  0.1742005 0.2741902
## 2 0.08728537 1.473217 0.4166955 1.183735 0.3918565  0.1824338 0.2734764
## 3 0.42292490 1.711277 0.2983001 1.372367 0.2471270  0.1679511 0.1771943
rf_model$results
##   mtry     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1    2 1.225690 0.6500114 0.9888028 0.3026576  0.1335207 0.2044473
## 2   29 1.111872 0.6733679 0.8712198 0.3464082  0.1513388 0.2408945
## 3   57 1.129295 0.6507304 0.8643180 0.3656140  0.1529143 0.2437373
gbm_model$results
##   shrinkage interaction.depth n.minobsinnode n.trees     RMSE  Rsquared
## 1       0.1                 1             10      50 1.177715 0.6282387
## 4       0.1                 2             10      50 1.197713 0.6203947
## 7       0.1                 3             10      50 1.128815 0.6561905
## 2       0.1                 1             10     100 1.156282 0.6465120
## 5       0.1                 2             10     100 1.142907 0.6490741
## 8       0.1                 3             10     100 1.111797 0.6544727
## 3       0.1                 1             10     150 1.147562 0.6486036
## 6       0.1                 2             10     150 1.140255 0.6521379
## 9       0.1                 3             10     150 1.096569 0.6609543
##         MAE    RMSESD RsquaredSD     MAESD
## 1 0.9462110 0.3259866  0.1384237 0.2499189
## 4 0.9262079 0.3691944  0.1726710 0.2632443
## 7 0.8802184 0.3582641  0.1485189 0.2558811
## 2 0.9167662 0.3360332  0.1407597 0.2522941
## 5 0.9001427 0.3992805  0.1837742 0.2883595
## 8 0.8580807 0.3706501  0.1701075 0.2565727
## 3 0.9162734 0.3710523  0.1631749 0.2898520
## 6 0.8969534 0.4064194  0.1846293 0.2935339
## 9 0.8501470 0.3725611  0.1714498 0.2662285
# Test performance
postResample(predict(cart_model, test), test$Yield)
##      RMSE  Rsquared       MAE 
## 1.7003910 0.2248231 1.3158401
postResample(predict(rf_model,   test), test$Yield)
##      RMSE  Rsquared       MAE 
## 1.2815608 0.5215459 0.9636237
postResample(predict(gbm_model,  test), test$Yield)
##      RMSE  Rsquared       MAE 
## 1.0309183 0.6996226 0.7771521

The GBM (boosted tree) model gives the best overall performance. Its cross-validated RMSE (~1.10) edges out the random forest (~1.11) and clearly beats CART (~1.46), and on the test set it is the strongest of the three (RMSE ≈ 1.03, R² ≈ 0.70, versus 1.28/0.52 for the random forest and 1.70/0.22 for CART).
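Because all three models were tuned with 10-fold CV under the same seed, caret's resamples() gives a compact side-by-side summary of the resampling performance:

res <- resamples(list(CART = cart_model, RF = rf_model, GBM = gbm_model))
summary(res)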

8.7b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(gbm_model)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31  36.109
## BiologicalMaterial03    23.986
## BiologicalMaterial12    22.959
## ManufacturingProcess17  22.309
## ManufacturingProcess15  18.340
## ManufacturingProcess13  18.054
## BiologicalMaterial09    14.943
## ManufacturingProcess09  11.880
## BiologicalMaterial04    10.978
## ManufacturingProcess06  10.774
## ManufacturingProcess37   9.551
## ManufacturingProcess43   8.199
## ManufacturingProcess39   7.440
## ManufacturingProcess04   7.271
## BiologicalMaterial10     6.680
## BiologicalMaterial11     6.479
## BiologicalMaterial02     6.317
## BiologicalMaterial05     6.168
## ManufacturingProcess26   5.389

The most important predictors are mainly ManufacturingProcess variables: six of the top ten (led by ManufacturingProcess32 by a wide margin), with four BiologicalMaterial variables filling out the rest, so the process variables dominate the list. The top 10 overlaps substantially with the top predictors from the optimal linear and nonlinear models, though the tree-based model can rank some variables differently because it captures nonlinear relationships and interactions.
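A quick tally confirms the split between process and biological variables in the top 10 (a small sketch using the varImp object above):

imp   <- varImp(gbm_model)$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
top10
sum(grepl("^ManufacturingProcess", top10))   # process variables in the top 10
sum(grepl("^BiologicalMaterial",   top10))   # biological variables in the top 10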

8.7c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart_tree <- cart_model$finalModel
rpart_party <- as.party(rpart_tree)
plot(rpart_party)

Yes, this view provides additional insight. It shows clear threshold effects, where predictors like ManufacturingProcess32 and ManufacturingProcess17 split the data into groups with different yield levels, highlighting how specific process conditions impact yield.