Use a simulation to show tree bias with different granularities.
Here we construct predictors that differ in granularity (number of unique values) but have the same relationship to the outcome. We then examine how often each predictor is used for the first split in a simple regression tree.
simulate_tree_bias <- function(n = 500, n_sims = 200) {
  results <- tibble(Simulation = integer(), RootVariable = character())

  for (i in 1:n_sims) {
    # Two informative predictors with equal signal:
    #   x_cont: continuous, many unique values
    #   x_cat:  categorical with fewer levels, similar predictive power
    x_cont <- runif(n, 0, 1)
    x_cat  <- factor(sample(letters[1:4], n, replace = TRUE))

    # Outcome depends on both in the same functional way,
    # with noise added so they are roughly equally predictive
    y <- ifelse(x_cont > 0.5, 1, 0) +
      ifelse(x_cat %in% c("c", "d"), 1, 0) +
      rnorm(n, sd = 0.5)

    dat <- tibble(
      x_cont = x_cont,
      x_cat  = x_cat,
      y      = y
    )

    tree_fit <- rpart(
      y ~ .,
      data = dat,
      method = "anova",
      control = rpart.control(cp = 0.001, minsplit = 20)
    )

    # The first row of the splits matrix is the primary split of the root node
    root_var <- if (!is.null(tree_fit$splits)) {
      rownames(tree_fit$splits)[1]
    } else {
      NA_character_
    }

    results <- add_row(results,
                       Simulation = i,
                       RootVariable = root_var)
  }

  results %>%
    filter(!is.na(RootVariable))
}
set.seed(2025)
tree_bias_results <- simulate_tree_bias(n = 500, n_sims = 200)
head(tree_bias_results)
## # A tibble: 6 × 2
## Simulation RootVariable
## <int> <chr>
## 1 1 x_cat
## 2 2 x_cat
## 3 3 x_cat
## 4 4 x_cat
## 5 5 x_cat
## 6 6 x_cont
root_counts <- tree_bias_results %>%
count(RootVariable)
root_counts
## # A tibble: 2 × 2
## RootVariable n
## <chr> <int>
## 1 x_cat 86
## 2 x_cont 114
root_counts %>%
ggplot(aes(x = RootVariable, y = n)) +
geom_col() +
labs(
x = "Root Split Variable",
y = "Count Across Simulations",
title = "Root Split Selection Under Different Predictor Granularities"
)
Exercise 8.2: Frequency with which each predictor is chosen for the root split.
Interpretation (8.2)
Both x_cont (continuous) and x_cat
(categorical with four levels) are constructed to be equally informative
about y. If regression trees were completely unbiased with
respect to predictor granularity, the two predictors would be selected
as the root split roughly equally often.
In this simulation, the continuous predictor x_cont is chosen for the root split more often than the four-level factor (114 versus 86 of the 200 runs), even though the two predictors were constructed to be equally informative. This illustrates the split-selection bias of standard CART-style trees: because the exhaustive search evaluates every candidate cut point, predictors that offer more potential split points (here, the continuous variable) have more opportunities to produce an apparently better split by chance. Depending on how the outcome is specified, designs with high-cardinality factors can show the analogous bias in favor of those factors. Either way, the tree responds not only to predictive strength but also to how many candidate cut points a predictor offers.
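As a rough illustration of why the number of candidate cut points matters (this sketch is not part of the original solution), the code below counts the candidate splits each predictor offers in one data set simulated under the same design; for method = "anova", rpart orders factor levels by their mean response, so a k-level factor contributes only about k - 1 candidate binary splits.
# Illustrative sketch: candidate cut points for each predictor type
set.seed(2025)
n <- 500
x_cont <- runif(n, 0, 1)
x_cat  <- factor(sample(letters[1:4], n, replace = TRUE))
# Continuous predictor: one candidate cut point between adjacent unique values
length(unique(x_cont)) - 1
# Four-level factor: after rpart's mean-response ordering, roughly 3 candidates
nlevels(x_cat) - 1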
This exercise is based on Figure 8.24, which compares variable importance for boosting models with different bagging fractions and learning rates.
In stochastic gradient boosting the bagging fraction and learning rate govern tree construction. Figure 8.24 shows variable importance plots for solubility data with (bagging fraction, learning rate) = (0.1, 0.1) and (0.9, 0.9).
Why does the model on the right (0.9, 0.9) focus its importance on just the first few predictors, whereas the model on the left (0.1, 0.1) spreads importance across more predictors?
Answer (8.3a)
With a high bagging fraction and large learning rate, each tree in the ensemble is fit to almost the entire training set and makes relatively large updates to the prediction function. The algorithm quickly focuses on a small set of strongly predictive variables and repeatedly reuses them in subsequent trees. As a result, the variable-importance distribution becomes steep, with a few variables dominating.
With a low bagging fraction and small learning rate, each tree is built on a smaller subsample and contributes a smaller update. The ensemble explores more alternative splits and tends to use a broader set of predictors, spreading the importance across many variables instead of concentrating it on just a few.
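To make this concrete, the sketch below (not part of the original answer) refits the two settings from Figure 8.24 directly with gbm on the solubility training data and compares the relative-influence rankings. The number of trees and the interaction depth are illustrative choices rather than the values used to produce the book's figure.
# Hedged sketch: compare importance concentration for the two settings
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY

sol_df <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(2025)
gbm_left <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(2025)
gbm_right <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 7,
                 shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the (0.9, 0.9) fit is expected to concentrate
# importance on far fewer predictors than the (0.1, 0.1) fit
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)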
Which model do you think would be more predictive of other samples?
Answer (8.3b)
In general, the model with moderate or lower shrinkage (learning rate) and more regularization (e.g., smaller bagging fraction, more trees) is more likely to generalize well. The model that spreads importance across multiple predictors (left-hand plot) often behaves more conservatively and reduces the risk of overfitting to idiosyncratic patterns in the training data.
The high-learning-rate, high-bagging-fraction model (right-hand plot) can fit the training data very aggressively, risking overemphasis on a few predictors and overfitting. Unless cross-validation strongly supports those extreme tuning values, the more regularized model (with broader importance) is usually expected to be more predictive on new samples.
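Continuing the sketch above, one hedged way to check this expectation is to score both fits on the solubility test set (solTestXtrans, solTestY); this is an illustration only, not the book's analysis.
# Sketch: held-out RMSE for the two settings (reuses gbm_left and gbm_right)
pred_left  <- predict(gbm_left,  solTestXtrans, n.trees = 100)
pred_right <- predict(gbm_right, solTestXtrans, n.trees = 100)
caret::RMSE(pred_left,  solTestY)
caret::RMSE(pred_right, solTestY)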
How would increasing interaction depth affect the slope of predictor importance for either model in Figure 8.24?
Answer (8.3c)
Increasing the interaction depth allows each tree to capture more complex multi-variable interactions, which typically steepens the importance slope, though to different degrees for the two models:
For the more regularized model (left-hand plot), increasing interaction depth with an appropriately small learning rate may still keep importance relatively spread out, but the top predictors would likely gain additional influence. For the aggressive model (right-hand plot), deeper interactions would likely sharpen the importance slope even further, making the top few predictors dominate more strongly.
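As a rough check (again reusing sol_df from the sketch above, with illustrative depth values), one could refit the aggressive (0.9, 0.9) setting at two interaction depths and measure how much of the total relative influence the top five predictors capture.
# Sketch: effect of interaction depth on how concentrated importance becomes
top5_share <- function(depth) {
  set.seed(2025)
  fit <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
             n.trees = 100, interaction.depth = depth,
             shrinkage = 0.9, bag.fraction = 0.9)
  ri <- summary(fit, plotit = FALSE)
  sum(head(ri$rel.inf, 5)) / sum(ri$rel.inf)
}
# Share of relative influence held by the top 5 predictors at depths 1 and 10
sapply(c(1, 10), top5_share)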
This exercise revisits the chemical manufacturing data from Exercises
6.3 and 7.5. We work with the ChemicalManufacturingProcess
data from the AppliedPredictiveModeling package.
data("ChemicalManufacturingProcess")
chem_dat <- ChemicalManufacturingProcess
# Outcome is Yield
summary(chem_dat$Yield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.25 38.75 39.97 40.18 41.48 46.34
sum(is.na(chem_dat$Yield))
## [1] 0
We follow the same general strategy as in earlier exercises:
set.seed(2025)
train_index <- createDataPartition(chem_dat$Yield, p = 0.8, list = FALSE)
chem_train <- chem_dat[train_index, ]
chem_test <- chem_dat[-train_index, ]
# Preprocess: center, scale, and knn-impute missing values
pp_chem <- preProcess(
chem_train,
method = c("center", "scale", "knnImpute")
)
chem_train_pp <- predict(pp_chem, chem_train)
chem_test_pp <- predict(pp_chem, chem_test)
# Check for remaining missing values
sum(is.na(chem_train_pp))
## [1] 0
sum(is.na(chem_test_pp))
## [1] 0
set.seed(2025)
ctrl_chem <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 5
)
Train several tree-based regression models. Which gives the optimal resampling and test set performance?
We consider:
- A single CART regression tree (rpart)
- Random forests (rf)
- Stochastic gradient boosting (gbm)
- Cubist (cubist)
set.seed(2025)
chem_rpart <- train(
Yield ~ .,
data = chem_train_pp,
method = "rpart",
trControl = ctrl_chem,
tuneLength = 10
)
chem_rpart
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01127554 0.8341856 0.3984872 0.6484306
## 0.01368111 0.8301402 0.4003657 0.6462145
## 0.02801216 0.8379098 0.3816499 0.6546597
## 0.02959764 0.8383309 0.3763290 0.6589632
## 0.03420670 0.8470113 0.3627820 0.6710388
## 0.03813039 0.8552697 0.3518925 0.6802027
## 0.04354638 0.8530202 0.3556501 0.6789527
## 0.07672524 0.8461837 0.3489352 0.6737562
## 0.07794709 0.8487569 0.3457890 0.6773511
## 0.38849362 0.9490328 0.2259248 0.7678095
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01368111.
set.seed(2025)
chem_rf <- train(
Yield ~ .,
data = chem_train_pp,
method = "rf",
trControl = ctrl_chem,
tuneLength = 5,
importance = TRUE
)
chem_rf
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6662118 0.6415371 0.5353556
## 15 0.6187853 0.6604179 0.4879570
## 29 0.6244776 0.6433476 0.4875823
## 43 0.6284689 0.6349744 0.4906896
## 57 0.6366873 0.6197229 0.4948959
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
set.seed(2025)
chem_gbm <- train(
Yield ~ .,
data = chem_train_pp,
method = "gbm",
trControl = ctrl_chem,
tuneLength = 5,
verbose = FALSE
)
chem_gbm
## Stochastic Gradient Boosting
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.6701383 0.5813309 0.5202090
## 1 100 0.6623044 0.5923863 0.5172324
## 1 150 0.6591551 0.5942849 0.5154872
## 1 200 0.6592040 0.5936370 0.5168390
## 1 250 0.6549753 0.6006155 0.5135923
## 2 50 0.6523957 0.6043344 0.5170576
## 2 100 0.6387623 0.6195785 0.5080581
## 2 150 0.6305558 0.6314990 0.5003676
## 2 200 0.6239983 0.6387315 0.4935470
## 2 250 0.6200105 0.6426514 0.4900592
## 3 50 0.6619733 0.5932824 0.5181675
## 3 100 0.6451119 0.6146906 0.5033858
## 3 150 0.6320234 0.6293495 0.4936154
## 3 200 0.6259622 0.6352831 0.4878285
## 3 250 0.6236193 0.6385338 0.4852492
## 4 50 0.6417579 0.6145137 0.4999330
## 4 100 0.6345025 0.6253701 0.4952744
## 4 150 0.6285961 0.6349796 0.4938852
## 4 200 0.6257294 0.6388996 0.4927038
## 4 250 0.6241977 0.6407729 0.4920170
## 5 50 0.6477863 0.6092152 0.5069508
## 5 100 0.6302526 0.6310390 0.4956012
## 5 150 0.6223333 0.6396696 0.4886746
## 5 200 0.6193377 0.6436058 0.4867325
## 5 250 0.6174479 0.6460945 0.4853530
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
set.seed(2025)
chem_cubist <- train(
Yield ~ .,
data = chem_train_pp,
method = "cubist",
trControl = ctrl_chem,
tuneLength = 5
)
chem_cubist
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 129, 130, 130, 132, 128, 129, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.7567668 0.5418193 0.5755837
## 1 5 0.6745727 0.6324067 0.5002176
## 1 9 0.7125548 0.5880973 0.5310236
## 10 0 0.6020615 0.6518361 0.4841094
## 10 5 0.5438174 0.7153911 0.4279727
## 10 9 0.5722537 0.6826212 0.4519424
## 20 0 0.5843262 0.6675262 0.4690900
## 20 5 0.5246383 0.7318910 0.4157802
## 20 9 0.5542437 0.7000298 0.4391397
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
chem_resamps <- resamples(
list(
Tree = chem_rpart,
RF = chem_rf,
GBM = chem_gbm,
Cubist = chem_cubist
)
)
summary(chem_resamps)
##
## Call:
## summary.resamples(object = chem_resamps)
##
## Models: Tree, RF, GBM, Cubist
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3186895 0.5446789 0.6300540 0.6462145 0.7443212 1.1127208 0
## RF 0.2486917 0.4236201 0.4722985 0.4879570 0.5581950 0.7091079 0
## GBM 0.2648320 0.3927176 0.4806968 0.4853530 0.5575806 0.7421254 0
## Cubist 0.2394256 0.3579441 0.3858465 0.4157802 0.4737111 0.6848573 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3710557 0.7074982 0.8040287 0.8301402 0.9449259 1.3705496 0
## RF 0.2974471 0.5269086 0.6097822 0.6187853 0.7259485 0.9585881 0
## GBM 0.3688582 0.4995809 0.6216397 0.6174479 0.6912743 0.9629947 0
## Cubist 0.3004727 0.4445891 0.5072237 0.5246383 0.5950507 0.9032936 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.007364676 0.2846470 0.3796015 0.4003657 0.5035961 0.8244362 0
## RF 0.227850834 0.5487260 0.6830511 0.6604179 0.7837325 0.8919338 0
## GBM 0.219644541 0.5653635 0.6784806 0.6460945 0.7558495 0.9084821 0
## Cubist 0.215433544 0.6736650 0.7556208 0.7318910 0.8384317 0.9175856 0
bwplot(chem_resamps, metric = "RMSE")
Exercise 8.7(a): Comparison of resampling RMSE across tree-based models.
models_list <- list(
Tree = chem_rpart,
RF = chem_rf,
GBM = chem_gbm,
Cubist = chem_cubist
)
test_perf <- purrr::map_df(names(models_list), function(mname) {
mod <- models_list[[mname]]
preds <- predict(mod, chem_test_pp)
tibble(
Model = mname,
RMSE = RMSE(preds, chem_test_pp$Yield),
Rsq = R2(preds, chem_test_pp$Yield)
)
})
test_perf
## # A tibble: 4 × 3
## Model RMSE Rsq
## <chr> <dbl> <dbl>
## 1 Tree 0.616 0.478
## 2 RF 0.500 0.707
## 3 GBM 0.450 0.728
## 4 Cubist 0.411 0.784
Interpretation (8.7a)
From the cross-validation results and the test-set performance table, the single CART tree performs worst (resampled RMSE ≈ 0.83, test RMSE ≈ 0.62), random forest and GBM are close to each other (resampled RMSE ≈ 0.62; test RMSE ≈ 0.50 and 0.45), and Cubist gives the best results on both the resamples (RMSE ≈ 0.52, R² ≈ 0.73) and the test set (RMSE ≈ 0.41, R² ≈ 0.78).
Therefore, the Cubist model is the preferred tree-based regression model for predicting yield in this chemical manufacturing process.
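As a hedged follow-up (not part of the original solution), caret's diff() on the resamples object gives paired comparisons of the resampled RMSE values, which can confirm whether the Cubist advantage is consistent across the 50 resamples.
# Sketch: paired differences in resampled RMSE between the four models
chem_diffs <- diff(chem_resamps, metric = "RMSE")
summary(chem_diffs)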
For the optimal tree-based model, which predictors are most important? Do biological or process variables dominate? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
(Here we treat the Cubist model as the optimal tree-based model.)
chem_cubist_imp <- varImp(chem_cubist, scale = FALSE)
chem_cubist_imp
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 47.0
## BiologicalMaterial06 37.5
## BiologicalMaterial02 26.0
## ManufacturingProcess17 23.5
## ManufacturingProcess13 18.5
## BiologicalMaterial03 17.0
## ManufacturingProcess09 15.0
## ManufacturingProcess33 13.5
## ManufacturingProcess39 13.5
## ManufacturingProcess29 13.0
## ManufacturingProcess31 11.0
## ManufacturingProcess04 11.0
## ManufacturingProcess27 10.0
## ManufacturingProcess20 8.5
## ManufacturingProcess18 8.5
## ManufacturingProcess25 8.0
## ManufacturingProcess42 8.0
## ManufacturingProcess01 7.0
## ManufacturingProcess28 5.5
## BiologicalMaterial09 5.5
plot(chem_cubist_imp, top = 20)
Exercise 8.7(b): Cubist variable importance for the chemical manufacturing process.
top10_cubist <- chem_cubist_imp$importance %>%
as_tibble(rownames = "Predictor") %>%
arrange(desc(Overall)) %>%
slice(1:10)
top10_cubist
## # A tibble: 10 × 2
## Predictor Overall
## <chr> <dbl>
## 1 ManufacturingProcess32 47
## 2 BiologicalMaterial06 37.5
## 3 BiologicalMaterial02 26
## 4 ManufacturingProcess17 23.5
## 5 ManufacturingProcess13 18.5
## 6 BiologicalMaterial03 17
## 7 ManufacturingProcess09 15
## 8 ManufacturingProcess39 13.5
## 9 ManufacturingProcess33 13.5
## 10 ManufacturingProcess29 13
Interpretation (8.7b)
The Cubist importance plot shows a small group of predictors with much larger importance scores than the rest. Process variables dominate: seven of the top 10 predictors are ManufacturingProcess measurements (led by ManufacturingProcess32), while three biological material measurements (BiologicalMaterial06, BiologicalMaterial02, and BiologicalMaterial03) also rank highly.
This is consistent with the context of the manufacturing process: yield is strongly driven by how the process is controlled over time, so process variables tend to dominate. Compared with earlier optimal linear and nonlinear models (e.g., penalized regression or boosted models), there is substantial overlap in the top 10 predictors. Some models may rank a few biological measurements slightly higher or lower, but the core set of influential variables is stable across modeling approaches, which gives additional confidence in their relevance.
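A quick, hedged tally of the biological/process balance in the Cubist top 10, based on the top10_cubist table above; the same tabulation could be repeated on varImp() results from the earlier linear and nonlinear fits to check the claimed overlap.
# Sketch: count biological vs. process predictors among the Cubist top 10
top10_cubist %>%
  mutate(Type = ifelse(grepl("^Biological", Predictor),
                       "Biological", "Process")) %>%
  count(Type)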
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Even though the Cubist model is the most accurate overall, a
single regression tree is much more interpretable. We
plot the tuned rpart model and examine the leaf node
distributions.
rpart.plot(
chem_rpart$finalModel,
main = "Regression Tree for Chemical Manufacturing Yield",
type = 2,
extra = 1,
under = TRUE,
faclen = 0
)
Exercise 8.7(c): Tuned regression tree for yield in the chemical manufacturing process.
# Identify terminal node for each training observation
train_nodes <- chem_rpart$finalModel$where
leaf_df <- chem_train_pp %>%
mutate(TerminalNode = factor(train_nodes)) %>%
select(Yield, TerminalNode)
leaf_df %>%
ggplot(aes(x = TerminalNode, y = Yield)) +
geom_boxplot() +
labs(
x = "Terminal Node",
y = "Yield",
title = "Yield Distribution Across Terminal Nodes of the Tree"
)
Exercise 8.7(c): Distribution of yield by terminal node of the regression tree.
Interpretation (8.7c)
The tree plot shows a sequence of splits on a small number of predictors (mostly process variables) that partition the observations into regions with different mean yields. The boxplots of yield by terminal node make these differences visible, showing which combinations of process conditions are associated with higher or lower yields.
Although the single tree is not as predictive as ensemble methods, it provides clear, rule-like descriptions of how certain process variables interact to influence yield. This structure can be valuable for process engineers and domain experts when they interpret and optimize the manufacturing process.
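To attach numbers to the terminal-node view, a short sketch (not part of the original text) summarises the mean preprocessed yield and node size from the leaf_df object created above.
# Sketch: mean (centered and scaled) yield and observation count per terminal node
leaf_df %>%
  group_by(TerminalNode) %>%
  summarise(MeanYield = mean(Yield), n = n()) %>%
  arrange(desc(MeanYield))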