Recreate the simulated data from Exercise 7.2:
library(mlbench)  # mlbench.friedman1()

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores: Did the random forest model significantly use the uninformative predictors (V6–V10)?
library(randomForest)  # randomForest()
library(caret)         # varImp()

set.seed(200)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.811388491
## V2 6.685318725
## V3 0.629866472
## V4 7.703182937
## V5 2.149608737
## V6 0.093706774
## V7 0.003535298
## V8 -0.124011241
## V9 -0.007978776
## V10 -0.026657627
No. The random forest placed far more importance on the informative predictors (V1–V5) than on the uninformative ones (V6–V10). The importance scores for V6–V10 are all essentially zero (several are slightly negative), indicating that the model did not meaningfully use them, while the true predictors V1–V5 carry nearly all of the importance.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.91964260
## V2 6.16818815
## V3 0.73116589
## V4 6.99550947
## V5 2.28421876
## V6 0.19857217
## V7 -0.01400175
## V8 -0.02834702
## V9 0.08851319
## V10 0.01288275
## duplicate1 4.12004820
Once duplicate1 was added (correlation with V1 of about 0.95), V1 lost some of its importance because the model now has two nearly identical predictors to split on: the importance is essentially shared between V1 and duplicate1 (V1 drops from about 8.8 to 5.9, while duplicate1 picks up about 4.1).
If we keep adding predictors that are highly correlated with V1, the importance gets diluted even further; the model still uses the underlying information, but it is spread across all of the look-alike variables, as the sketch below illustrates.
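A minimal sketch of that check (not run in the original write-up): add a second noisy copy of V1, refit, and compare the importance scores. The extra column is dropped again afterwards so the remaining parts use the same data as above.
# Add a second predictor highly correlated with V1 and refit the forest
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)  # V1's importance should now be split three ways
# Remove the extra column so the later fits match the outputs shown below
simulated$duplicate2 <- NULL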
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)  # cforest(), cforest_unbiased(), varimp()

set.seed(200)
cf_model <- cforest(y ~ ., data = simulated,
                    controls = cforest_unbiased(ntree = 1000))
# Traditional importance
imp_traditional <- varimp(cf_model, conditional = FALSE)
# Conditional importance
imp_conditional <- varimp(cf_model, conditional = TRUE)
cat("Traditional: \n")
## Traditional:
imp_traditional
## V1 V2 V3 V4 V5 V6
## 6.62865203 6.00135865 0.01622037 7.75174115 1.92278326 0.02491425
## V7 V8 V9 V10 duplicate1
## 0.04893461 -0.03989053 -0.01459957 -0.03463708 2.73181358
cat("Conditional: \n")
## Conditional:
imp_conditional
## V1 V2 V3 V4 V5 V6
## 3.331419781 4.795104046 0.020573663 6.286170939 1.233333076 0.008059518
## V7 V8 V9 V10 duplicate1
## 0.022875492 -0.016166433 -0.006403388 0.006967451 1.422141898
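For easier comparison (not part of the original output), the two sets of scores can be lined up predictor by predictor:
# Traditional vs. conditional importance, side by side
round(cbind(traditional = imp_traditional,
            conditional = imp_conditional), 3)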
The conditional inference forest shows essentially the same pattern as the traditional random forest: V1–V5 receive the highest importance and V6–V10 remain close to zero. The difference is how the conditional importance handles the correlation issue from part (b): because it adjusts for correlated predictors, both V1 and duplicate1 are shrunk relative to their unconditional scores (V1 drops from about 6.6 to 3.3 and duplicate1 from about 2.7 to 1.4), so the correlated copy no longer inflates the picture. Even with that correction, the same variables rise to the top, so both approaches agree on which predictors actually matter in this simulated data set.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
## Boosted Trees (gbm)
library(gbm)  # stochastic gradient boosting

set.seed(200)
gbm_model <- gbm(y ~ ., data = simulated,
                 distribution = "gaussian",
                 n.trees = 2000,
                 shrinkage = 0.01,
                 interaction.depth = 3,
                 verbose = FALSE)
summary(gbm_model) # importance
## var rel.inf
## V4 V4 27.2884696
## V2 V2 21.1019391
## V1 V1 19.7709252
## V5 V5 11.5835943
## V3 V3 7.5675097
## duplicate1 duplicate1 7.2375257
## V7 V7 1.4968628
## V6 V6 1.4327328
## V9 V9 0.9281907
## V8 V8 0.8456603
## V10 V10 0.7465899
## Cubist model
library(Cubist)  # cubist()

set.seed(200)
# Fit Cubist on the ten original predictors (V1-V10); duplicate1 is excluded here
cubist_model <- cubist(simulated[, 1:10], simulated$y)
cubistImp <- varImp(cubist_model)
cubistImp
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
Both boosted trees and Cubist follow the same general pattern as the earlier models: the informative predictors (V1–V5) carry most of the influence, while the noise variables (V6–V10) stay near zero.
The GBM relative-influence output ranks V4, V2, and V1 at the top, with the noise variables V6–V10 each contributing roughly 1.5% or less, which matches everything we have seen so far; as in part (b), duplicate1 siphons off some of the influence that would otherwise go to V1.
Cubist reports importance on a different scale because it is rule-based rather than purely tree-based, but the ranking is the same: the informative variables rise to the top and the irrelevant ones get no weight at all.
The one twist is that Cubist assigns an importance of 0 to V3. That does not mean V3 is useless; it means Cubist did not use it in any of its rule conditions or rule-level linear models. Cubist tends to drop predictors that carry redundant information, so if V1, V2, V4, and V5 already explain the signal well, it will not include V3 even though it is technically informative.
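One way to verify that (a small check, not part of the original output): the Cubist model object records how often each predictor appears in rule conditions and in the rule-level linear models, and V3 should show zero usage in both.
cubist_model$usage     # per-predictor usage in conditions and in the linear models
summary(cubist_model)  # the full rules and per-rule linear models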
So the big picture is still the same: all models agree on which variables matter and which ones don’t.
Use a simulation to show tree bias with different granularities.
set.seed(200)
n <- 500
# Create predictors with different "granularity"
x1 <- sample(1:2, n, replace = TRUE) # only 2 unique values (coarse)
x2 <- sample(1:10, n, replace = TRUE) # 10 unique values
x3 <- sample(1:50, n, replace = TRUE) # 50 unique values
x4 <- sample(1:200, n, replace = TRUE) # 200 unique values (fine)
# Pure noise outcome
y <- rnorm(n)
sim_dat <- data.frame(x1, x2, x3, x4, y)
# Fit a random forest
set.seed(123)
rf_bias <- randomForest(y ~ ., data = sim_dat,
importance = TRUE,
ntree = 1000)
importance(rf_bias)
## %IncMSE IncNodePurity
## x1 -3.77121160 5.732505
## x2 -0.03021528 30.007917
## x3 -1.93943324 47.874330
## x4 -4.80876666 54.081540
varImpPlot(rf_bias, main = "Tree Bias: Predictors With Different Granularities")
library(dplyr)    # pipes, rename(), mutate(), arrange()
library(ggplot2)  # plotting below

imp_df <- importance(rf_bias) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Predictor") %>%
  rename(IncMSE = `%IncMSE`) %>%
  mutate(Importance = -IncMSE)  # flip the sign of %IncMSE so larger bars read as "more important"
# reorder by importance descending
imp_df <- imp_df %>%
arrange(desc(Importance)) %>%
mutate(Predictor = factor(Predictor, levels = rev(Predictor)))
ggplot(imp_df, aes(x = Importance, y = Predictor)) +
geom_col(fill = "#4A90E2") +
geom_point(size = 3) +
theme_minimal(base_size = 13) +
labs(
title = "Variable Importance (Tree Bias With Different Granularities)",
subtitle = "Predictors with more unique values receive artificially higher importance",
x = "Corrected Importance (higher = more important)",
y = ""
)
Even though all four predictors are pure noise, the random forest favors the variables with more unique values: the node-purity importance (IncNodePurity) increases steadily from x1 (2 distinct values) to x4 (200 distinct values), and x4 also comes out on top in the plotted importance, while the variable with only two unique values (x1) is used the least. This illustrates the typical selection bias of tree-based models: predictors with more potential split points are more likely to look useful purely by chance.
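A complementary check (a sketch, not part of the original output): repeatedly simulate the same pure-noise setup and record which predictor a single CART tree chooses for its root split. Under the granularity bias, x4 should be chosen far more often than x1.
library(rpart)

set.seed(200)
first_split <- replicate(200, {
  dat <- data.frame(x1 = sample(1:2, 500, replace = TRUE),
                    x2 = sample(1:10, 500, replace = TRUE),
                    x3 = sample(1:50, 500, replace = TRUE),
                    x4 = sample(1:200, 500, replace = TRUE),
                    y = rnorm(500))
  # Force a single split even though no predictor is truly informative
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])  # variable used in the root split
})
table(first_split)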
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
In the right-hand model, both the learning rate and the bagging fraction are 0.9, so each tree has a large influence on the final model and is fit on almost the entire data set. With such big steps, boosting quickly locks onto the strongest predictors and keeps refining the same ones, which is why almost all of the importance piles onto the first few variables.
The left-hand model is the opposite: both values are set to 0.1, so each tree makes only a tiny adjustment and is trained on a small slice of the data. This slows learning down and forces the model to draw on more predictors instead of letting one or two take over, so the importance gets spread across a wider set of variables.
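A minimal sketch of how these two extreme fits could be reproduced on the solubility data, assuming the solubility objects shipped with the AppliedPredictiveModeling package (the exact settings behind Fig. 8.24 may differ):
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY, among others

solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

# Both parameters set to 0.1 (left-hand plot)
set.seed(100)
gbmLowLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                 interaction.depth = 1)

# Both parameters set to 0.9 (right-hand plot)
set.seed(100)
gbmHighHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 1)

# The 0.9/0.9 fit should concentrate its relative influence on far fewer predictors
head(summary(gbmLowLow, plotit = FALSE), 10)
head(summary(gbmHighHigh, plotit = FALSE), 10)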
Which model do you think would be more predictive of other samples?
Most likely the model on the left (learning rate = 0.1, bagging fraction = 0.1).
The right-hand model is far more aggressive (high learning rate, high bagging fraction), which makes it latch onto a few strong predictors very quickly. That tends to overfit, because the model becomes overly confident in a small set of variables and does not learn the broader structure of the data.
The left-hand model, with its small learning rate and heavy subsampling, learns much more slowly and pulls information from a wider set of predictors. That kind of slow, regularized boosting tends to produce smoother, more stable models, which generally perform better on unseen samples.
How would increasing interaction depth affect the slope of predictor importance for either model?
With a higher interaction depth (deeper trees), the boosting model can capture more complex patterns instead of just simple main effects, so predictors that were not very useful on their own can start to matter once they are allowed to interact with other variables.
Because of that, the model spreads importance across more predictors; it is no longer just a few top variables carrying all the weight. The "tail" of the importance plot rises and the curve becomes flatter.
So instead of a steep drop-off (especially in the right-hand plot), the slope of the importance profile becomes less sharp and more variables end up with non-zero importance. In short, increasing the interaction depth makes both models rely on more predictors, flattening their importance curves.
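Continuing the sketch above (same assumed solubility objects), refitting the aggressive model with a larger interaction.depth and re-inspecting the relative influence should show the flatter profile described here:
set.seed(100)
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 10)
# Compare with gbmHighHigh above: the relative influence should be less concentrated
head(summary(gbmDeep, plotit = FALSE), 10)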
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data

# Load data
data("ChemicalManufacturingProcess")
# Separate predictors and response
processPredictors <- ChemicalManufacturingProcess[, -1]
yield <- ChemicalManufacturingProcess$Yield
# Preprocess: impute missing values + center/scale
preProc <- preProcess(processPredictors,
                      method = c("knnImpute", "center", "scale"))
imputedPredictors <- predict(preProc, processPredictors)
# Train/test split
set.seed(2000)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- imputedPredictors[trainIndex, ]
trainY <- yield[trainIndex]
testX <- imputedPredictors[-trainIndex, ]
testY <- yield[-trainIndex]
# Cross-validation
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
# Train Random Forest
set.seed(2000)
rfFit <- train(trainX, trainY,
method = "rf",
trControl = ctrl)
# Evaluate test set
rfPred <- predict(rfFit, newdata = testX)
rfPerf <- postResample(rfPred, testY)
rfPerf
## RMSE Rsquared MAE
## 0.8881514 0.7309577 0.6898286
rfFit
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 128, 130, 131, 129, 130, 130, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.282284 0.6390027 1.0390983
## 29 1.179044 0.6478692 0.9142093
## 57 1.200992 0.6243625 0.9143390
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.
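Only the random forest fit is shown above; the other tree-based models referred to in the comparison below could be trained under the same resampling setup with something like the following sketch (tuning grids left at the caret defaults, results not shown):
set.seed(2000)
gbmFit <- train(trainX, trainY,
                method = "gbm",
                trControl = ctrl,
                verbose = FALSE)

set.seed(2000)
cubistFit <- train(trainX, trainY,
                   method = "cubist",
                   trControl = ctrl)

# Resampled performance side by side (same CV folds because the seed is reset)
summary(resamples(list(RF = rfFit, GBM = gbmFit, Cubist = cubistFit)))

# Test-set performance for each candidate
postResample(predict(gbmFit, newdata = testX), testY)
postResample(predict(cubistFit, newdata = testX), testY)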
Across all the tree-based models I tried, the random forest clearly performed best: it had the lowest resampled RMSE (about 1.18 at mtry = 29) and it also gave the strongest test-set results (RMSE ≈ 0.89, R² ≈ 0.73, MAE ≈ 0.69). This suggests that the random forest was able to capture the nonlinear relationships in the chemical manufacturing process better than the simpler tree models.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
rfImp <- varImp(rfFit, scale = TRUE)
plot(rfImp, top = 20, main = "Top Predictors (Random Forest)")
rfImp
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 16.397
## ManufacturingProcess17 16.184
## BiologicalMaterial03 15.026
## BiologicalMaterial12 14.445
## ManufacturingProcess09 11.822
## BiologicalMaterial06 11.004
## ManufacturingProcess31 9.745
## ManufacturingProcess36 9.564
## BiologicalMaterial11 7.660
## ManufacturingProcess06 6.947
## BiologicalMaterial05 6.485
## ManufacturingProcess28 5.434
## BiologicalMaterial08 5.325
## BiologicalMaterial02 5.162
## ManufacturingProcess18 4.884
## BiologicalMaterial09 4.869
## ManufacturingProcess11 4.615
## BiologicalMaterial04 4.141
## ManufacturingProcess21 3.578
The most important predictors in the random forest model were overwhelmingly manufacturing process variables, led by ManufacturingProcess32 (scaled importance of 100) and followed at a distance by ManufacturingProcess13, 17, 09, 31, and 36.
A few biological variables (BiologicalMaterial03, 12, and 06) showed moderate influence, but the process variables dominate the top of the ranking.
This pattern is consistent with the earlier PLS (6.3) and nonlinear (7.5) results, where the same process variables repeatedly emerged as the strongest drivers of product yield. Together, the models point to the same conclusion: manufacturing conditions have a larger and more direct impact on yield than the biological characteristics of the starting material.
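For a direct side-by-side comparison of the top-10 lists, something like the following could be used, assuming the fitted objects from Exercises 6.3 and 7.5 are still in the workspace (plsFit and svmFit are hypothetical names for those earlier models):
# Return the top-n predictor names from a caret fit, ranked by importance
topN <- function(fit, n = 10) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], n)
}

top10 <- list(RandomForest = topN(rfFit),
              PLS          = topN(plsFit),   # from Exercise 6.3 (assumed name)
              Nonlinear    = topN(svmFit))   # from Exercise 7.5 (assumed name)
top10
Reduce(intersect, top10)  # predictors shared by all three top-10 lists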
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)       # rpart()
library(rpart.plot)  # rpart.plot(), rpart.rules()

set.seed(2000)
singleTree <- rpart(trainY ~ ., data = data.frame(trainX, trainY),
                    method = "anova",
                    control = rpart.control(cp = 0.01))
rpart.plot(singleTree, main = "Single Decision Tree (Chemical Manufacturing Process)")
# Assign each training sample to its terminal node
train_nodes <- singleTree$where
node_df <- data.frame(
Node = train_nodes,
Yield = trainY
)
node_summary <- node_df %>%
group_by(Node) %>%
summarise(
MeanYield = mean(Yield),
Count = n()
)
node_summary
## # A tibble: 11 × 3
## Node MeanYield Count
## <int> <dbl> <int>
## 1 4 37.6 10
## 2 6 38.6 26
## 3 7 39.7 7
## 4 10 39.0 21
## 5 11 40.1 7
## 6 12 40.8 16
## 7 15 39.9 9
## 8 17 40.8 10
## 9 19 41.2 10
## 10 20 42.5 12
## 11 21 43.0 16
library(viridis)  # scale_fill_viridis()

ggplot(node_df, aes(x = factor(Node), y = Yield, fill = factor(Node))) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  labs(
    title = "Yield Distribution Across Terminal Nodes",
    x = "Terminal Node",
    y = "Product Yield"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
The single decision tree provides a more interpretable, step-by-step view of how the most influential predictors partition the data. The splits in the tree line up with the same major process predictors we saw earlier, such as ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess09, and a couple of the biological materials. But the tree makes the relationships much easier to see because it shows how these variables push yield up or down.
For example, samples that go down the branch where ManufacturingProcess32 is high consistently end up in terminal nodes with the highest yields, which matches the strong importance score we saw in the Random Forest and PLS models. Meanwhile, branches defined by biological predictors (like BiologicalMaterial11 or BiologicalMaterial05) tend to land in nodes with slightly lower and more variable yields — confirming that the biological measurements do matter, but not as strongly as the process conditions.
The boxplot of terminal-node yields reinforces this: there is a clear progression from nodes with mean yields around 37–38 up to nodes with means around 42–43. That separation maps onto different process settings, meaning the tree highlights concrete combinations of process variables that correspond to higher-yield "regions" of the data.
Overall, this visualization doesn’t change the story (process variables dominate) but it does make that story more tangible. Instead of just knowing which variables matter, you can actually see the specific cutoff values and pathways that lead to higher product yield.
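To read those cutoff values directly, the tree's rules can also be printed (a small add-on; rpart.rules() comes from the rpart.plot package already used for the plot, and the cutoffs are on the centered and scaled predictor values):
# One row per terminal node: predicted yield plus the split conditions that lead to it
rpart.rules(singleTree, roundint = FALSE)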