Let's recreate the simulated data:
> library(mlbench)
> set.seed(200)
> simulated <- mlbench.friedman1(200, sd = 1)
> simulated <- cbind(simulated$x, simulated$y)
> simulated <- as.data.frame(simulated)
> colnames(simulated)[ncol(simulated)] <- "y"
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Loading data and libraries
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
V6–V10 all have importance scores very near zero (some slightly negative), whereas V1–V5 range from about 0.7 to 8.6. The forest did not meaningfully use V6–V10; their contributions are effectively noise.
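As a quick sanity check, we can list the predictors whose permutation importance falls below a small threshold. This is only a sketch: the cutoff of 0.5 is an arbitrary choice, not a formal test, but with the scores printed above only V6–V10 fall below it.
# Sketch: flag predictors with near-zero permutation importance
# (0.5 is an arbitrary cutoff, not a significance threshold)
noise_vars <- rownames(rfImp1)[rfImp1$Overall < 0.5]
noise_vars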
# duplicate of V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cat("Cor(simulated$duplicate1, V1) =", cor(simulated$duplicate1, simulated$V1), "\n")
## Cor(simulated$duplicate1, V1) = 0.9485201
# Re‑fit RF with duplicate1
model2 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
# second duplicate of V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cat("Cor(simulated$duplicate2, V1) =", cor(simulated$duplicate2, simulated$V1), "\n")
## Cor(simulated$duplicate2, V1) = 0.9337221
# Re‑fit RF with both duplicates
model3 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
## Overall
## V1 5.908641677
## V2 6.586726939
## V3 0.559845667
## V4 7.373782389
## V5 1.987341138
## V6 0.162417814
## V7 0.038423138
## V8 0.007497423
## V9 -0.001806331
## V10 0.004023755
## duplicate1 2.351543736
## duplicate2 2.305339113
With one duplicate (duplicate1, r ≈ 0.95), V1's importance drops from about 8.63 to 6.77, while duplicate1 itself earns a sizeable score (≈ 3.08): part of V1's "signal" has shifted to the surrogate.
With two duplicates (duplicate1 and duplicate2), V1 falls further, to about 5.91, and the surrogate credit is spread across duplicate1 (≈ 2.35) and duplicate2 (≈ 2.31).
So each added surrogate for V1 siphons off some of V1’s importance, diluting its score.
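To see the dilution in one place, here is a small sketch that simply lines up the scores already computed above for V1 and its duplicates across the three forests:
# Sketch: collect V1 and duplicate importances from the three fits above
dilution <- data.frame(
  model = c("no duplicates", "one duplicate", "two duplicates"),
  V1 = c(rfImp1["V1", "Overall"], rfImp2["V1", "Overall"], rfImp3["V1", "Overall"]),
  duplicate1 = c(NA, rfImp2["duplicate1", "Overall"], rfImp3["duplicate1", "Overall"]),
  duplicate2 = c(NA, NA, rfImp3["duplicate2", "Overall"]))
dilution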
# Unbiased cforest
cf_ctrl <- cforest_unbiased(ntree = 1000, mtry = floor(sqrt(ncol(simulated)-1)))
cf1 <- cforest(
y ~ .,
data = simulated,
controls = cf_ctrl)
# Traditional permutation importance
imp_cf1 <- varimp(cf1)
print(sort(imp_cf1, decreasing = TRUE))
## V4 V1 V2 duplicate1 V5 duplicate2
## 5.69238840 4.85650670 4.83873965 2.25686987 1.74353145 1.68311274
## V3 V6 V7 V8 V9 V10
## 0.06427383 0.02667855 0.02181069 -0.02407249 -0.02433254 -0.05178070
# Conditional importance
imp_cf1_cond <- varimp(cf1, conditional = TRUE)
print(sort(imp_cf1_cond, decreasing = TRUE))
## V4 V2 V1 V5 duplicate1 duplicate2
## 4.247910855 3.587323075 1.780641419 1.172723715 0.856867376 0.453573306
## V3 V7 V8 V9 V6 V10
## 0.100029765 0.003453134 0.003450580 -0.001294559 -0.003434175 -0.011044080
The traditional permutation measure behaves much like randomForest's: V6–V10 stay near zero, and V1's credit is again shared with its duplicates, which score well above the noise variables.
The conditional version adjusts for correlation among predictors: V1's rank falls below V2, and the scores of V1 and its duplicates all shrink much more sharply than those of the uncorrelated informative predictors.
Both schemes keep V6–V10 negligible and both show the dilution of V1 when duplicates are present, but the conditional measure down-weights the correlated group as a whole, reflecting how little unique information each duplicate carries.
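For a direct look at how the two schemes differ, a short sketch that lines up the two importance vectors computed above, sorted by the traditional score:
# Sketch: traditional vs. conditional cforest importance, side by side
ord <- names(sort(imp_cf1, decreasing = TRUE))
cf_compare <- data.frame(traditional = imp_cf1[ord], conditional = imp_cf1_cond[ord])
cf_compare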
# Boosted trees (gbm)
set.seed(200)
gbm_model <- train(
y ~ .,
data = simulated,
method = "gbm",
trControl = trainControl(method = "none"),
tuneGrid = data.frame(
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01,
n.minobsinnode = 10),
verbose = FALSE)
gbmImp <- varImp(gbm_model, scale = FALSE)
print(gbmImp)
## gbm variable importance
##
## Overall
## V4 42427.9
## V2 34219.6
## V1 33331.5
## V5 18724.5
## V3 11740.8
## duplicate1 5558.0
## duplicate2 3279.0
## V7 1866.8
## V6 1599.7
## V8 822.6
## V9 775.6
## V10 768.8
# Cubist
set.seed(200)
cub_model <- train(
y ~ .,
data = simulated,
method = "cubist",
trControl = trainControl(method = "none"),
tuneGrid = expand.grid(committees = 10, neighbors = 0))
cubImp <- varImp(cub_model, scale = FALSE)
print(cubImp)
## cubist variable importance
##
## Overall
## V1 71.0
## V2 57.5
## V4 48.5
## V5 41.0
## V3 41.0
## V6 13.0
## V8 4.0
## duplicate1 3.5
## V7 0.0
## V9 0.0
## V10 0.0
## duplicate2 0.0
In both gbm and Cubist, the uninformative predictors (V6–V10) remain near the bottom of the importance ranking.
The duplicates draw some importance away from V1 in the boosted model, but Cubist is largely unaffected: V1 keeps nearly all of the credit (duplicate1 ≈ 3.5, duplicate2 = 0).
Across these tree-based learners, adding features highly correlated with an informative predictor tends to diffuse its measured importance, although the strength of the effect depends on the algorithm.
The forest did not substantially use V6–V10.
Introducing one or two surrogates for V1 causes V1’s importance to decline, with its “credit” split among duplicates.
cforest’s traditional importance reproduces the same dilution effect; its conditional importance moderates the bias but still shows the split across correlated features.
Boosted trees show the same pattern, while Cubist keeps the uninformative variables at zero but concentrates almost all of the credit on V1 rather than on its surrogates.
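As a compact summary of how strongly the surrogate competes with the original predictor in each model, the sketch below expresses duplicate1's importance as a fraction of V1's, reusing the importance objects computed above:
# Sketch: duplicate1 importance relative to V1 across the fitted models
c(randomForest = rfImp3["duplicate1", "Overall"] / rfImp3["V1", "Overall"],
  cforest = unname(imp_cf1["duplicate1"] / imp_cf1["V1"]),
  gbm = gbmImp$importance["duplicate1", "Overall"] / gbmImp$importance["V1", "Overall"],
  cubist = cubImp$importance["duplicate1", "Overall"] / cubImp$importance["V1", "Overall"])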
Use a simulation to show tree bias with different granularities.
set.seed(123)
library(rpart)
# sims
nsim <- 500
n <- 500
root.vars <- replicate(nsim, {
# predictors
x_cont <- runif(n) #continuous on [0,1]
x_bin <- factor(sample(0:1, n, replace = TRUE)) #binary {0,1}
x_cat5 <- factor(sample(1:5, n, replace = TRUE)) #categorical with 5 levels
x_cat10 <- factor(sample(1:10, n, replace = TRUE)) #categorical with 10 levels
# pure‐noise outcome
y <- factor(sample(c("A","B"), n, replace = TRUE)) #random class with no dependence on any x
dat <- data.frame(y, x_cont, x_bin, x_cat5, x_cat10)
# fit a depth‐1 tree (one split only)
fit <- rpart(y ~ ., data = dat,
method = "class",
control = rpart.control(maxdepth = 1, minsplit = 2))
# name of the variable used at root
as.character(fit$frame$var[1])
})
tab <- table(root.vars) / nsim
print(tab)
## root.vars
## <leaf> x_bin x_cat10 x_cat5 x_cont
## 0.088 0.014 0.554 0.096 0.248
df <- as.data.frame(tab)
colnames(df) <- c("Variable", "Proportion")
barplot(df$Proportion,
names.arg = df$Variable,
las = 2,
ylim = c(0, max(df$Proportion) * 1.1),
ylab = "Proportion chosen at root",
main = "Root split frequency by variable granularity",
col = "steelblue")
if (!requireNamespace("ggplot2", quietly=TRUE)) install.packages("ggplot2")
library(ggplot2)
ggplot(df, aes(x = reorder(Variable, -Proportion), y = Proportion)) +
geom_col() +
theme_minimal(base_size = 14) +
labs(
title = "Depth 1 Tree: Root Split Frequency",
x = "Variable",
y = "Proportion chosen at root"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# more granularities (2,5,10,20, continuous)
simulate_bias <- function(levels) {
x <- factor(sample(1:levels, n, replace=TRUE))
dat <- data.frame(
y = factor(sample(c("A","B"), n, replace=TRUE)),
x = x,
x_cont = runif(n))
fit <- rpart(y ~ ., data = dat,
method = "class",
control = rpart.control(maxdepth = 1, minsplit = 2))
as.character(fit$frame$var[1])}
# run for each granularity, using simulate_bias() defined above
set.seed(123)
results <- lapply(c(2, 5, 10, 20), function(k) {
  root.vars_k <- replicate(nsim, simulate_bias(k))
  # proportion of simulations in which the k-level factor wins the root split
  prop.table(table(root.vars_k))["x"]})
gran_df <- data.frame(
Levels = c(2,5,10,20),
Proportion = unlist(results))
# plot of bias vs. number of levels
ggplot(gran_df, aes(x = Levels, y = Proportion)) +
geom_line() + geom_point() +
scale_x_continuous(breaks = gran_df$Levels) +
theme_minimal(base_size = 14) +
labs(
title = "Probability that a k level Factor is Chosen",
x = "Number of levels (k)",
y = "Proportion chosen at root")
Notes:
In stochastic gradient boosting, the bagging fraction and learning rate govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how their magnitudes affect the magnitudes of the variable importances. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Which model do you think would be more predictive of other samples?
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
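Figure 8.24 uses the solubility data, which is not loaded in this document, so the following is only an illustrative sketch on the simulated data from the first exercise: two gbm fits with the bagging fraction and learning rate both set low versus both set high, to show how the high-setting fit concentrates its relative influence on fewer predictors. (n.minobsinnode is lowered to 5 so that the 0.1 bagging fraction leaves enough samples per tree; that value is an assumption, not part of the exercise.)
# Sketch on the simulated data (not the solubility data used in Fig. 8.24)
set.seed(200)
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 3, n.minobsinnode = 5, verbose = FALSE)
set.seed(200)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 3, n.minobsinnode = 5, verbose = FALSE)
# Compare how concentrated the relative influence is in each fit
head(summary(gbm_low, plotit = FALSE), 5)
head(summary(gbm_high, plotit = FALSE), 5)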
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(AppliedPredictiveModeling)
library(caret)
library(rpart.plot)
library(randomForest)
library(gbm)
library(Cubist)
set.seed(123)
# load the data frame
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
df <- ChemicalManufacturingProcess
# split out predictors and outcome
X <- df[, -1] # all except "Yield"
y <- df$Yield # percent yield
#–– Preprocess: median‐impute, center, scale ––––––––––––––––––––––
preProc <- preProcess(X, method = c("medianImpute", "center", "scale"))
X_proc <- predict(preProc, X)
#–– Train/test split (70%/30%) –––––––––––––––––––––––––––––––––––
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X_proc[trainIndex, ]
X_test <- X_proc[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
#–– Common resampling control ––––––––––––––––––––––––––––––––––––
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(123)
fit_rpart <- train(
X_train, y_train,
method = "rpart",
trControl = ctrl,
tuneLength= 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)
fit_rf <- train(
X_train, y_train,
method = "rf",
trControl = ctrl,
tuneLength= 5,
importance= TRUE)
set.seed(123)
fit_gbm <- train(
X_train, y_train,
method = "gbm",
trControl = ctrl,
tuneGrid = expand.grid(
n.trees = 1000,
interaction.depth= 3,
shrinkage = 0.1,
n.minobsinnode = 10),
verbose = FALSE)
set.seed(123)
fit_cub <- train(
X_train, y_train,
method = "cubist",
trControl = ctrl,
tuneGrid = expand.grid(
committees = c(1,5,10),
neighbors = c(0,1,3)))
# compare CV‐RMSE
resamps <- resamples(list(
rpart = fit_rpart,
RF = fit_rf,
GBM = fit_gbm,
Cubist = fit_cub))
summary(resamps)$statistics$RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.9736355 1.2816037 1.458387 1.548512 1.770213 2.615971 0
## RF 0.7316652 0.9844040 1.174324 1.213600 1.373255 2.003797 0
## GBM 0.5923449 1.0298047 1.103289 1.156373 1.306123 1.810390 0
## Cubist 0.5391541 0.8245418 1.079492 1.087381 1.300338 2.131965 0
bwplot(resamps, metric="RMSE")
From the table and boxplots, Cubist has the lowest mean CV‐RMSE.
# test‐set RMSE for each model
models <- list(rpart=fit_rpart, RF=fit_rf, GBM=fit_gbm, Cubist=fit_cub)
test_perf<- sapply(models, function(m) {
preds <- predict(m, X_test)
postResample(preds, y_test)["RMSE"]})
test_perf
## rpart.RMSE RF.RMSE GBM.RMSE Cubist.RMSE
## 1.2476269 0.9381969 1.1076705 0.9722108
On the hold-out set, Random Forest actually achieves the lowest RMSE.
I’ll take Random Forest as our “best” (lowest test RMSE) and extract its top 10 drivers:
rfImp <- varImp(fit_rf, scale=FALSE)$importance
top10 <- rfImp[order(-rfImp$Overall), , drop=FALSE][1:10, , drop=FALSE]
top10
## Overall
## ManufacturingProcess32 17.160745
## ManufacturingProcess17 10.692848
## BiologicalMaterial03 9.090432
## ManufacturingProcess13 8.163859
## ManufacturingProcess09 7.854339
## BiologicalMaterial06 7.598265
## BiologicalMaterial12 7.298030
## ManufacturingProcess36 6.882771
## BiologicalMaterial02 6.632651
## BiologicalMaterial11 6.485499
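The exercise asks whether biological or process variables dominate; a quick sketch tallying the two groups in this top 10 shows an even five-and-five split, with process variables holding the very top ranks:
# Sketch: process vs. biological predictors in the RF top 10
table(ifelse(grepl("^Manufacturing", rownames(top10)), "Process", "Biological"))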
Comparing with the linear and nonlinear models:
set.seed(123)
fit_lm <- train(
X_train, y_train,
method = "lm",
trControl = ctrl)
set.seed(123)
fit_svm <- train(
X_train, y_train,
method = "svmRadial",
preProc = c("center","scale"),
trControl = ctrl,
tuneLength= 10)
# Random Forest (our “best” on test set)
rfImp <- varImp(fit_rf, scale=FALSE)$importance
top10_rf <- rfImp[order(-rfImp$Overall), , drop=FALSE][1:10, , drop=FALSE]
# Linear model
lmImp <- varImp(fit_lm)$importance
top10_lm <- lmImp[order(-lmImp$Overall), , drop=FALSE][1:10, , drop=FALSE]
# SVM
svmImp <- varImp(fit_svm)$importance
top10_svm <- svmImp[order(-svmImp$Overall), , drop=FALSE][1:10, , drop=FALSE]
# Print side by side
list(
RandomForest = top10_rf,
LinearModel = top10_lm,
SVM = top10_svm)
## $RandomForest
## Overall
## ManufacturingProcess32 17.160745
## ManufacturingProcess17 10.692848
## BiologicalMaterial03 9.090432
## ManufacturingProcess13 8.163859
## ManufacturingProcess09 7.854339
## BiologicalMaterial06 7.598265
## BiologicalMaterial12 7.298030
## ManufacturingProcess36 6.882771
## BiologicalMaterial02 6.632651
## BiologicalMaterial11 6.485499
##
## $LinearModel
## Overall
## BiologicalMaterial05 100.00000
## ManufacturingProcess32 97.77227
## ManufacturingProcess37 97.08589
## ManufacturingProcess29 92.83812
## ManufacturingProcess09 86.70044
## BiologicalMaterial11 82.75671
## ManufacturingProcess45 82.48512
## ManufacturingProcess04 81.95784
## BiologicalMaterial09 81.50917
## BiologicalMaterial12 75.28948
##
## $SVM
## Overall
## ManufacturingProcess13 100.00000
## ManufacturingProcess17 98.38029
## ManufacturingProcess32 96.84451
## ManufacturingProcess09 85.36014
## BiologicalMaterial06 81.53083
## BiologicalMaterial03 81.06423
## ManufacturingProcess36 79.38472
## BiologicalMaterial12 62.55889
## ManufacturingProcess06 62.52196
## BiologicalMaterial02 61.98064
The side-by-side comparison of the top 10 predictors (MP = ManufacturingProcess, BM = BiologicalMaterial; Cubist importances were not re-extracted here but could be obtained the same way with varImp(fit_cub)):
Rank  Random Forest   Linear Model    SVM
 1    MP32 (17.16)    BM05 (100.0)    MP13 (100.0)
 2    MP17 (10.69)    MP32 ( 97.8)    MP17 ( 98.4)
 3    BM03 ( 9.09)    MP37 ( 97.1)    MP32 ( 96.8)
 4    MP13 ( 8.16)    MP29 ( 92.8)    MP09 ( 85.4)
 5    MP09 ( 7.85)    MP09 ( 86.7)    BM06 ( 81.5)
 6    BM06 ( 7.60)    BM11 ( 82.8)    BM03 ( 81.1)
 7    BM12 ( 7.30)    MP45 ( 82.5)    MP36 ( 79.4)
 8    MP36 ( 6.88)    MP04 ( 82.0)    BM12 ( 62.6)
 9    BM02 ( 6.63)    BM09 ( 81.5)    MP06 ( 62.5)
10    BM11 ( 6.49)    BM12 ( 75.3)    BM02 ( 62.0)
ManufacturingProcess32 and ManufacturingProcess09 appear in every list, and the SVM ranking is very close to the random forest's, whereas the linear model promotes several predictors (BM05, MP37, MP29, MP45, MP04, BM09) that the tree-based model does not.
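A quick sketch to quantify the overlap between the tree model's top 10 and the other two lists (with these rankings the SVM shares 9 of 10 predictors with the random forest, the linear model only 4):
# Sketch: how many of the RF top-10 predictors appear in the other top-10 lists?
sapply(list(LinearModel = top10_lm, SVM = top10_svm),
       function(x) length(intersect(rownames(top10_rf), rownames(x))))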
Notes:
library(partykit)
# Re‐fit the single tree with tuned cp so it carries y_train
tree2 <- rpart(
y_train ~ .,
data = data.frame(y_train, X_train),
control = rpart.control(cp = fit_rpart$bestTune$cp))
ptree <- as.party(tree2)
# Plot with boxplots in the terminal nodes and a main title
plot(
ptree,
main = "Optimal Single-Tree for Yield with Node Boxplots",
terminal_panel = node_boxplot)
The tree's first split is on ManufacturingProcess32 (on the centered and scaled scale):
ManufacturingProcess32 < 0.192 → Node 2 (n = 74)
ManufacturingProcess32 ≥ 0.192 → Node 3 (n = 50)
So ManufacturingProcess32 provides the single most powerful split for yield, consistent with its top ranking in the importance scores.
Does this add knowledge beyond variable‐importance?
Yes. Variable importance only told us that ManufacturingProcess32 matters most; the tree with node boxplots also shows where the split occurs (≈ 0.19 on the standardized scale), how many samples fall on each side (74 vs. 50), and how the entire distribution of yield, not just its average, differs between the branches, including the spread and outliers visible in each terminal node.
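If the node-level numbers are wanted rather than reading them off the boxplots, a short sketch using rpart's fitted leaf assignments (tree2$where maps each training sample to a row of tree2$frame, i.e. to its terminal node):
# Sketch: summary of observed yield within each terminal node of tree2
tapply(y_train, tree2$where, summary)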