Exc. 8.1

Let's recreate the simulated data from Exercise 7.2:

> library(mlbench)
> set.seed(200)
> simulated <- mlbench.friedman1(200, sd = 1)
> simulated <- cbind(simulated$x, simulated$y)
> simulated <- as.data.frame(simulated)
> colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

> library(randomForest)
> library(caret)
> model1 <- randomForest(y ~ ., data = simulated,
+                        importance = TRUE,
+                        ntree = 1000)
> rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

> simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
> cor(simulated$duplicate1, simulated$V1)

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Solution to 8.1:

Loading data and libraries

library(mlbench)
library(randomForest)
library(caret)
library(party)    # cforest / varimp
library(gbm)      # boosted trees
library(Cubist)   # Cubist rule-based model

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

model1 <- randomForest(
  y ~ .,
  data      = simulated,
  importance = TRUE,
  ntree     = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

V6–V10 all have importance scores very near zero (and some even negative), compared with V1–V5 (which range 0.7–8.6). The forest did not meaningfully use V6–V10. Their contributions are effectively noise.
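For reference, the randomForest package reports two importance measures of its own; a quick sketch (reusing the model1 fit above) prints both, and the permutation-based "%IncMSE" column should correspond to the varImp() numbers shown above:

# permutation importance ("%IncMSE") and node-purity importance, side by side
round(randomForest::importance(model1), 3)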

(b)

# duplicate of V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cat("Cor(simulated$duplicate1, V1) =", cor(simulated$duplicate1, simulated$V1), "\n")
## Cor(simulated$duplicate1, V1) = 0.9485201
# Re‑fit RF with duplicate1
model2 <- randomForest(
  y ~ .,
  data      = simulated,
  importance = TRUE,
  ntree     = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                 Overall
## V1          6.774034589
## V2          6.426340527
## V3          0.613805379
## V4          7.135941576
## V5          2.135242904
## V6          0.171933358
## V7          0.142238552
## V8         -0.073192083
## V9         -0.098719872
## V10        -0.009701234
## duplicate1  3.084990840
# second duplicate of V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cat("Cor(simulated$duplicate2, V1) =", cor(simulated$duplicate2, simulated$V1), "\n")
## Cor(simulated$duplicate2, V1) = 0.9337221
# Re‑fit RF with both duplicates
model3 <- randomForest(
  y ~ .,
  data      = simulated,
  importance = TRUE,
  ntree     = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
##                 Overall
## V1          5.908641677
## V2          6.586726939
## V3          0.559845667
## V4          7.373782389
## V5          1.987341138
## V6          0.162417814
## V7          0.038423138
## V8          0.007497423
## V9         -0.001806331
## V10         0.004023755
## duplicate1  2.351543736
## duplicate2  2.305339113

With one duplicate (duplicate1, ρ ≈ 0.949): V1's importance drops from 8.63 to 6.77, while duplicate1 comes in at about 3.08, so part of V1's "signal" has shifted to its surrogate.

With two duplicates (duplicate1 and duplicate2): V1 falls further, to about 5.91, and the surrogate credit is now spread across duplicate1 (≈2.35) and duplicate2 (≈2.31).

So each added surrogate for V1 siphons off some of V1’s importance, diluting its score.
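A minimal sketch, reusing the rfImp1, rfImp2, and rfImp3 objects above, tallies the total importance credited to V1 plus its surrogates in each fit, to see how the credit is redistributed rather than lost:

# total importance assigned to V1 and its correlated surrogates in each model
c(noDup   = rfImp1["V1", "Overall"],
  oneDup  = sum(rfImp2[c("V1", "duplicate1"), "Overall"]),
  twoDups = sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"]))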

(c)

# Unbiased cforest
cf_ctrl <- cforest_unbiased(ntree = 1000, mtry = floor(sqrt(ncol(simulated)-1)))
cf1 <- cforest(
  y ~ .,
  data     = simulated,
  controls = cf_ctrl)

# Traditional permutation importance
imp_cf1 <- varimp(cf1)
print(sort(imp_cf1, decreasing = TRUE))
##          V4          V1          V2  duplicate1          V5  duplicate2 
##  5.69238840  4.85650670  4.83873965  2.25686987  1.74353145  1.68311274 
##          V3          V6          V7          V8          V9         V10 
##  0.06427383  0.02667855  0.02181069 -0.02407249 -0.02433254 -0.05178070
# Conditional importance
imp_cf1_cond <- varimp(cf1, conditional = TRUE)
print(sort(imp_cf1_cond, decreasing = TRUE))
##           V4           V2           V1           V5   duplicate1   duplicate2 
##  4.247910855  3.587323075  1.780641419  1.172723715  0.856867376  0.453573306 
##           V3           V7           V8           V9           V6          V10 
##  0.100029765  0.003453134  0.003450580 -0.001294559 -0.003434175 -0.011044080

The traditional permutation measure behaves much like randomForest's: the informative predictors dominate, V6–V10 stay near zero, and V1's credit is again split across the two duplicates.

The conditional version adjusts for the correlation structure: V1's rank falls below V2, and its score (along with those of the duplicates) drops much more sharply.

Both importance schemes keep V6–V10 negligible, and both show the same dilution effect on V1 when duplicates are present, but the conditional scores distribute credit more evenly among correlated features.
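To make the comparison easier to scan, here is a small sketch that lines up the two cforest importance vectors side by side, reusing imp_cf1 and imp_cf1_cond from above:

# traditional vs conditional cforest importance, one row per predictor
cf_compare <- data.frame(
  traditional = imp_cf1,
  conditional = imp_cf1_cond[names(imp_cf1)])
round(cf_compare[order(-cf_compare$traditional), ], 3)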

(d)

# Boosted trees (gbm)
set.seed(200)
gbm_model <- train(
  y ~ .,
  data      = simulated,
  method    = "gbm",
  trControl = trainControl(method = "none"),
  tuneGrid  = data.frame(
    n.trees       = 1000,
    interaction.depth = 3,
    shrinkage     = 0.01,
    n.minobsinnode  = 10),
  verbose   = FALSE)
gbmImp <- varImp(gbm_model, scale = FALSE)
print(gbmImp)
## gbm variable importance
## 
##            Overall
## V4         42427.9
## V2         34219.6
## V1         33331.5
## V5         18724.5
## V3         11740.8
## duplicate1  5558.0
## duplicate2  3279.0
## V7          1866.8
## V6          1599.7
## V8           822.6
## V9           775.6
## V10          768.8
# Cubist
set.seed(200)
cub_model <- train(
  y ~ .,
  data      = simulated,
  method    = "cubist",
  trControl = trainControl(method = "none"),
  tuneGrid  = expand.grid(committees = 10, neighbors = 0))
cubImp <- varImp(cub_model, scale = FALSE)
print(cubImp)
## cubist variable importance
## 
##            Overall
## V1            71.0
## V2            57.5
## V4            48.5
## V5            41.0
## V3            41.0
## V6            13.0
## V8             4.0
## duplicate1     3.5
## V7             0.0
## V9             0.0
## V10            0.0
## duplicate2     0.0

In both gbm and Cubist, the uninformative predictors remain at very low importance.

The correlated duplicates again draw away some importance from V1.

Across these tree‑based learners, adding features highly correlated with an informative predictor consistently diffuses its measured importance.
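A rough sketch, reusing gbmImp and cubImp from above, of the share of total importance each model assigns to V1 and its two surrogates:

# fraction of total importance credited to V1 and its duplicates in gbm / Cubist
impShare <- function(vi, vars = c("V1", "duplicate1", "duplicate2")) {
  imp <- vi$importance
  setNames(round(imp[vars, "Overall"] / sum(imp$Overall), 3), vars)
}
rbind(gbm = impShare(gbmImp), cubist = impShare(cubImp))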

Conclusions:

  1. The forest did not substantially use V6–V10.

  2. Introducing one or two surrogates for V1 causes V1’s importance to decline, with its “credit” split among duplicates.

  3. cforest’s traditional importance reproduces the same dilution effect; its conditional importance moderates the bias but still shows the split across correlated features.

  4. Boosted trees and Cubist follow the same pattern: uninformative variables stay near zero, and correlated surrogates siphon importance from the original predictor.

Exc. 8.2

Use a simulation to show tree bias with different granularities.

Solution to 8.2:

set.seed(123)
library(rpart)

# simulation settings
nsim <- 500   # number of simulated data sets
n    <- 500   # observations per data set

root.vars <- replicate(nsim, {
  # predictors
  x_cont  <- runif(n) #continuous on [0,1]
  x_bin   <- factor(sample(0:1, n, replace = TRUE)) #binary {0,1}
  x_cat5  <- factor(sample(1:5, n, replace = TRUE)) #categorical with 5 levels
  x_cat10 <- factor(sample(1:10, n, replace = TRUE)) #categorical with 10 levels

  # pure‐noise outcome
  y <- factor(sample(c("A","B"), n, replace = TRUE)) #random class with no dependence on any x

  dat <- data.frame(y, x_cont, x_bin, x_cat5, x_cat10)

  # fit a depth‐1 tree (one split only)
  fit <- rpart(y ~ ., data = dat,
               method = "class",
               control = rpart.control(maxdepth = 1, minsplit = 2))

  # name of the variable used at root
  as.character(fit$frame$var[1])
})

tab <- table(root.vars) / nsim
print(tab)
## root.vars
##  <leaf>   x_bin x_cat10  x_cat5  x_cont 
##   0.088   0.014   0.554   0.096   0.248
df <- as.data.frame(tab)
colnames(df) <- c("Variable", "Proportion")

barplot(df$Proportion,
        names.arg = df$Variable,
        las       = 2,
        ylim      = c(0, max(df$Proportion) * 1.1),
        ylab      = "Proportion chosen at root",
        main      = "Root split frequency by variable granularity",
        col       = "steelblue")

if (!requireNamespace("ggplot2", quietly=TRUE)) install.packages("ggplot2")
library(ggplot2)

ggplot(df, aes(x = reorder(Variable, -Proportion), y = Proportion)) +
  geom_col() +
  theme_minimal(base_size = 14) +
  labs(
    title = "Depth 1 Tree: Root Split Frequency",
    x     = "Variable",
    y     = "Proportion chosen at root"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# more granularities (2, 5, 10, 20) competing against one continuous predictor
simulate_bias <- function(k) {
  # pure-noise outcome with a k-level factor and a continuous predictor
  dat <- data.frame(
    y      = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_k    = factor(sample(1:k, n, replace = TRUE)),
    x_cont = runif(n))
  fit <- rpart(y ~ ., data = dat,
               method  = "class",
               control = rpart.control(maxdepth = 1, minsplit = 2))
  # name of the variable chosen at the root
  as.character(fit$frame$var[1])
}

# run for each granularity and record how often the factor wins the root split
set.seed(123)
results <- lapply(c(2, 5, 10, 20), function(k) {
  root.vars_k <- replicate(nsim, simulate_bias(k))
  prop.table(table(root.vars_k))["x_k"]
})
gran_df <- data.frame(
  Levels      = c(2,5,10,20),
  Proportion  = unlist(results))

# plot of bias vs. number of levels
ggplot(gran_df, aes(x = Levels, y = Proportion)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = gran_df$Levels) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Probability that a k level Factor is Chosen",
    x     = "Number of levels (k)",
    y     = "Proportion chosen at root")

Notes:

  • Even though none of the predictors carries any information about y, the tree "prefers" to split on whichever variable offers the largest number of candidate cut-points. That is why the 10-level factor is chosen most often, followed by the continuous predictor, then the 5-level factor, with the binary predictor almost never selected.
  • Every time I increase the number of levels in a categorical predictor, I boost its chance of producing the best-looking split purely by luck. This is split-point bias in action: more granularity means more candidate cut-points and a higher probability of finding a spuriously large impurity reduction (a rough count of the candidate splits is sketched below).
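A rough count of the candidate binary splits available to each predictor type makes the mechanism concrete; the 2^(k-1) - 1 figure is the standard count of two-group partitions of an unordered k-level factor, and the continuous case assumes all n values are distinct:

# candidate binary splits per predictor type (illustrative counts)
k <- c(2, 5, 10, 20)
data.frame(
  predictor        = c(paste0(k, "-level factor"), "continuous (n = 500)"),
  candidate_splits = c(2^(k - 1) - 1, 500 - 1))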

Exc. 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

(b) Which model do you think would be more predictive of other samples?

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Solution to 8.3:

(a) I think that with a very large learning rate (0.9) and almost no subsampling (bagging fraction = 0.9), each tree is fit on nearly the full residuals and takes a large step along the gradient. As a result, the strongest predictors (e.g., NumCarbon, MolWeight) explain most of the residual structure within the first few trees, leaving little for later trees to pick up, so importance concentrates on just those variables.
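A minimal sketch of this effect, assuming the solubility data shipped with AppliedPredictiveModeling (objects solTrainXtrans and solTrainY); the tree count and other settings here are illustrative, not the ones behind Fig. 8.24:

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solDat <- data.frame(solTrainXtrans, Solubility = solTrainY)

# two extremes: low shrinkage / low bagging fraction vs. high / high
set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solDat, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solDat, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# share of total relative influence captured by the top 5 predictors
topShare <- function(fit) {
  imp <- summary(fit, plotit = FALSE)$rel.inf
  sum(sort(imp, decreasing = TRUE)[1:5]) / sum(imp)
}
c(low = topShare(gbmLow), high = topShare(gbmHigh))

A larger "high" value would indicate that the high-rate/high-bag model concentrates its importance on fewer predictors, which is the pattern in Fig. 8.24.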

(b) I would say the low‑rate/low‑bag model (on the left) is more predictive on new data. Because it is taking tiny gradient steps on different subsamples, it reduces variance and guards against overfitting, even though it needs more trees to converge.

(c) Increasing the interaction depth gives each base learner more freedom to mix predictors within a single tree, which should flatten the importance profile in both models (a continuation of the earlier sketch appears after this list):

  • In the high‑rate/high‑bag case, deeper splits let secondary predictors appear later in the tree, so the drop‑off after the top two or three variables is less sharp.
  • In the low‑rate/low‑bag case, the ensemble already spreads credit broadly, but deeper interactions will pull in still more variables via higher‐order splits, again reducing the steepness of the decline.
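Continuing the sketch from (a), reusing solDat, gbmHigh, and topShare, one way to probe this is to refit the high-rate/high-bag model with deeper base learners and compare how concentrated the importance is:

# same extreme settings but deeper trees (interaction.depth = 5)
set.seed(100)
gbmHighDeep <- gbm(Solubility ~ ., data = solDat, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 5)

# a lower top-5 share for the deeper model would mean a flatter importance profile
c(depth1 = topShare(gbmHigh), depth5 = topShare(gbmHighDeep))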

Exc. 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

(a) Which tree-based regression model gives the optimal resampling and test set performance?

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Solution to 8.7:

library(AppliedPredictiveModeling)
library(caret)
library(rpart.plot)
library(randomForest)
library(gbm)
library(Cubist)
set.seed(123)

# load the data frame
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
df <- ChemicalManufacturingProcess

# split out predictors and outcome
X <- df[, -1]      # all except "Yield"
y <- df$Yield      # percent yield

#–– Preprocess: median‐impute, center, scale ––––––––––––––––––––––
preProc <- preProcess(X, method = c("medianImpute", "center", "scale"))
X_proc <- predict(preProc, X)

#–– Train/test split (70%/30%) –––––––––––––––––––––––––––––––––––
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train    <- X_proc[trainIndex, ]
X_test     <- X_proc[-trainIndex, ]
y_train    <- y[trainIndex]
y_test     <- y[-trainIndex]

#–– Common resampling control ––––––––––––––––––––––––––––––––––––  
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

(a) Train & compare tree-based regressors

set.seed(123)
fit_rpart <- train(
  X_train, y_train,
  method    = "rpart",
  trControl = ctrl,
  tuneLength= 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)
fit_rf <- train(
  X_train, y_train,
  method    = "rf",
  trControl = ctrl,
  tuneLength= 5,
  importance= TRUE)

set.seed(123)
fit_gbm <- train(
  X_train, y_train,
  method    = "gbm",
  trControl = ctrl,
  tuneGrid  = expand.grid(
    n.trees          = 1000,
    interaction.depth= 3,
    shrinkage        = 0.1,
    n.minobsinnode   = 10),
  verbose   = FALSE)

set.seed(123)
fit_cub <- train(
  X_train, y_train,
  method    = "cubist",
  trControl = ctrl,
  tuneGrid  = expand.grid(
    committees = c(1,5,10),
    neighbors  = c(0,1,3)))

# compare CV‐RMSE
resamps <- resamples(list(
  rpart  = fit_rpart,
  RF     = fit_rf,
  GBM    = fit_gbm,
  Cubist = fit_cub))
summary(resamps)$statistics$RMSE
##             Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## rpart  0.9736355 1.2816037 1.458387 1.548512 1.770213 2.615971    0
## RF     0.7316652 0.9844040 1.174324 1.213600 1.373255 2.003797    0
## GBM    0.5923449 1.0298047 1.103289 1.156373 1.306123 1.810390    0
## Cubist 0.5391541 0.8245418 1.079492 1.087381 1.300338 2.131965    0
bwplot(resamps, metric="RMSE")

From the table and boxplots, Cubist has the lowest mean CV‐RMSE.

# test‐set RMSE for each model
models   <- list(rpart=fit_rpart, RF=fit_rf, GBM=fit_gbm, Cubist=fit_cub)
test_perf<- sapply(models, function(m) {
  preds <- predict(m, X_test)
  postResample(preds, y_test)["RMSE"]})
test_perf
##  rpart.RMSE     RF.RMSE    GBM.RMSE Cubist.RMSE 
##   1.2476269   0.9381969   1.1076705   0.9722108

On the hold-out set, Random Forest actually achieves the lowest RMSE.
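For completeness, a small sketch of the full hold-out metrics (RMSE, R-squared, MAE) for each model, reusing the models list defined above:

# hold-out RMSE, R-squared, and MAE for every tree-based model
t(sapply(models, function(m) postResample(predict(m, X_test), y_test)))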

(b) Variable importance in the best model

I’ll take Random Forest as our “best” (lowest test RMSE) and extract its top 10 drivers:

rfImp  <- varImp(fit_rf, scale=FALSE)$importance
top10  <- rfImp[order(-rfImp$Overall), , drop=FALSE][1:10, , drop=FALSE]
top10
##                          Overall
## ManufacturingProcess32 17.160745
## ManufacturingProcess17 10.692848
## BiologicalMaterial03    9.090432
## ManufacturingProcess13  8.163859
## ManufacturingProcess09  7.854339
## BiologicalMaterial06    7.598265
## BiologicalMaterial12    7.298030
## ManufacturingProcess36  6.882771
## BiologicalMaterial02    6.632651
## BiologicalMaterial11    6.485499

Comparing with the linear and nonlinear models:

set.seed(123)
fit_lm <- train(
  X_train, y_train,
  method    = "lm",
  trControl = ctrl)

set.seed(123)
fit_svm <- train(
  X_train, y_train,
  method    = "svmRadial",
  preProc   = c("center","scale"),
  trControl = ctrl,
  tuneLength= 10)

# Random Forest (our “best” on test set)
rfImp <- varImp(fit_rf, scale=FALSE)$importance
top10_rf <- rfImp[order(-rfImp$Overall), , drop=FALSE][1:10, , drop=FALSE]

# Linear model
lmImp <- varImp(fit_lm)$importance
top10_lm <- lmImp[order(-lmImp$Overall), , drop=FALSE][1:10, , drop=FALSE]

# SVM
svmImp <- varImp(fit_svm)$importance
top10_svm <- svmImp[order(-svmImp$Overall), , drop=FALSE][1:10, , drop=FALSE]

# Print side by side
list(
  RandomForest = top10_rf,
  LinearModel  = top10_lm,
  SVM          = top10_svm)
## $RandomForest
##                          Overall
## ManufacturingProcess32 17.160745
## ManufacturingProcess17 10.692848
## BiologicalMaterial03    9.090432
## ManufacturingProcess13  8.163859
## ManufacturingProcess09  7.854339
## BiologicalMaterial06    7.598265
## BiologicalMaterial12    7.298030
## ManufacturingProcess36  6.882771
## BiologicalMaterial02    6.632651
## BiologicalMaterial11    6.485499
## 
## $LinearModel
##                          Overall
## BiologicalMaterial05   100.00000
## ManufacturingProcess32  97.77227
## ManufacturingProcess37  97.08589
## ManufacturingProcess29  92.83812
## ManufacturingProcess09  86.70044
## BiologicalMaterial11    82.75671
## ManufacturingProcess45  82.48512
## ManufacturingProcess04  81.95784
## BiologicalMaterial09    81.50917
## BiologicalMaterial12    75.28948
## 
## $SVM
##                          Overall
## ManufacturingProcess13 100.00000
## ManufacturingProcess17  98.38029
## ManufacturingProcess32  96.84451
## ManufacturingProcess09  85.36014
## BiologicalMaterial06    81.53083
## BiologicalMaterial03    81.06423
## ManufacturingProcess36  79.38472
## BiologicalMaterial12    62.55889
## ManufacturingProcess06  62.52196
## BiologicalMaterial02    61.98064

Side by side, the top 10 predictors are (MP = ManufacturingProcess, BM = BiologicalMaterial):

Rank   Random Forest    Linear Model    SVM
 1     MP32 (17.16)     BM05 (100.0)    MP13 (100.0)
 2     MP17 (10.69)     MP32 (97.8)     MP17 (98.4)
 3     BM03 (9.09)      MP37 (97.1)     MP32 (96.8)
 4     MP13 (8.16)      MP29 (92.8)     MP09 (85.4)
 5     MP09 (7.85)      MP09 (86.7)     BM06 (81.5)
 6     BM06 (7.60)      BM11 (82.8)     BM03 (81.1)
 7     BM12 (7.30)      MP45 (82.5)     MP36 (79.4)
 8     MP36 (6.88)      MP04 (82.0)     BM12 (62.6)
 9     BM02 (6.63)      BM09 (81.5)     MP06 (62.5)
10     BM11 (6.49)      BM12 (75.3)     BM02 (62.0)

Notes:

  • Random Forest yields an even split: 5 process variables (MP32, 17, 13, 09, 36) and 5 biological materials (BM03, 06, 12, 02, 11).
  • Linear Model puts BiologicalMaterial05 at the very top, followed by a procession of process predictors, so its single strongest signal comes from a raw-material measure.
  • SVM (radial) tilts heavily toward process variables (4 of the top 5) but still includes biological materials in spots 5 and 6.
  • Process variables are consistently important across all models; the linear model is the one that gives its largest weight to a biological measurement, while the SVM puts process controls in four of its top five slots.
  • Random Forest shows the most balanced mixture, reflecting both sources of variation roughly equally, whereas SVM leans most heavily on process variables and the linear model leads with a biological one (a quick tally by predictor type is sketched after this list).
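A quick tally by predictor type, reusing top10_rf, top10_lm, and top10_svm from above:

# count biological vs. process predictors among each model's top 10
countTypes <- function(imp10) {
  type <- ifelse(grepl("^Biological", rownames(imp10)), "Biological", "Process")
  table(factor(type, levels = c("Biological", "Process")))
}
sapply(list(RF = top10_rf, LM = top10_lm, SVM = top10_svm), countTypes)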

(c) Optimal single tree + yield distributions

library(partykit)

# Re‐fit the single tree with tuned cp so it carries y_train
tree2 <- rpart(
  y_train ~ .,
  data    = data.frame(y_train, X_train),
  control = rpart.control(cp = fit_rpart$bestTune$cp))

ptree <- as.party(tree2)

# Plot with boxplots in the terminal nodes and a main title
plot(
  ptree,
  main           = "Optimal Single-Tree for Yield with Node Boxplots",
  terminal_panel = node_boxplot)

The tree’s very first split is:

ManufacturingProcess32 < 0.192 → Node 2 (n=74)
≥ 0.192 → Node 3 (n=50)

So ManufacturingProcess32 is the single most powerful predictor of yield.

Does this add knowledge beyond variable‐importance?

Yes. Variable importance only told us that ManufacturingProcess32 matters most; the tree-plus-boxplot view also shows (a numeric per-node summary is sketched after this list):

  • The exact threshold (0.192) at which batches flip from lower to higher yield.
  • The magnitude of the improvement (roughly 2–3 points higher median yield above that threshold).
  • A change in consistency (the higher branch has a narrower IQR, i.e. more reliable yields).
  • An actionable insight: any run with ManufacturingProcess32 below 0.192 will both under-yield and be more erratic, so that is where process engineers should focus adjustments or tighter controls.
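To put numbers on what the node boxplots display, here is a minimal sketch that summarizes yield within each terminal node, reusing ptree, X_train, and y_train from above:

# terminal-node membership for the training data, then yield summaries per node
nodeID <- predict(ptree, newdata = data.frame(y_train, X_train), type = "node")
do.call(rbind, tapply(y_train, nodeID, function(v)
  c(n = length(v), median = round(median(v), 2), IQR = round(IQR(v), 2))))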