Problem 8.1 — Random Forest Variable Importance

Setup: Recreate Simulated Data from Exercise 7.2

library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
library(ggplot2)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

cat("Dimensions:", dim(simulated), "\n")
## Dimensions: 200 11
head(simulated)

The Friedman1 dataset has 5 truly informative predictors (V1–V5) and 5 uninformative noise predictors (V6–V10). The true relationship is:

\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon\]

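As a quick sanity check on the formula (a sketch, assuming the simulated columns V1–V5 correspond to x1–x5 above), we can reconstruct the noise-free signal and compare it to the simulated response; the correlation should be high and the residual standard deviation close to the sd = 1 passed to mlbench.friedman1.

with(simulated, {
  signal <- 10 * sin(pi * V1 * V2) + 20 * (V3 - 0.5)^2 + 10 * V4 + 5 * V5
  c(cor_with_y = cor(signal, y), residual_sd = sd(y - signal))
})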

Part (a) — Fit Random Forest & Examine Uninformative Predictors

set.seed(100)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
imp_df1 <- data.frame(
  Variable   = rownames(rfImp1),
  Importance = rfImp1$Overall
) %>% arrange(desc(Importance))

imp_df1$Variable <- factor(imp_df1$Variable, levels = imp_df1$Variable)

ggplot(imp_df1, aes(x = Variable, y = Importance,
                    fill = grepl("^V[6-9]$|^V10$", Variable))) +
  geom_col() +
  scale_fill_manual(values = c("steelblue", "tomato"),
                    labels = c("Informative", "Uninformative"),
                    name = "Predictor Type") +
  labs(title  = "Random Forest Variable Importance (Model 1)",
       x = "Predictor", y = "Importance (MSE Increase)") +
  theme_minimal(base_size = 13)

Answer: The random forest makes essentially no use of the uninformative predictors V6–V10. Their permutation importance scores are near zero (and sometimes slightly negative) because shuffling those columns barely changes the out-of-bag error, while V1, V4, V2, and V5 dominate, consistent with the true data-generating function. This shows that permutation-based importance from a random forest effectively filters out noise variables.
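For reference, a small sketch: randomForest::importance() on the fitted model reports both the permutation measure (%IncMSE) and the split-based node-purity measure (IncNodePurity); V6–V10 should sit near zero on both.

round(importance(model1), 3)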


Part (b) — Adding a Highly Correlated Duplicate of V1

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cat("Correlation between duplicate1 and V1:",
    round(cor(simulated$duplicate1, simulated$V1), 4), "\n")
## Correlation between duplicate1 and V1: 0.9504
set.seed(100)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cat("Correlation between duplicate2 and V1:",
    round(cor(simulated$duplicate2, simulated$V1), 4), "\n")
## Correlation between duplicate2 and V1: 0.9428
set.seed(100)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
v1_imp <- data.frame(
  Model      = c("Original", "+ duplicate1", "+ duplicate1 & duplicate2"),
  V1_Importance = c(
    rfImp1["V1", "Overall"],
    rfImp2["V1", "Overall"],
    rfImp3["V1", "Overall"]
  )
)
v1_imp
# Side-by-side comparison of top predictors across the 3 models
bind_rows(
  mutate(data.frame(Variable = rownames(rfImp1), Imp = rfImp1$Overall), Model = "Original"),
  mutate(data.frame(Variable = rownames(rfImp2), Imp = rfImp2$Overall), Model = "+ duplicate1"),
  mutate(data.frame(Variable = rownames(rfImp3), Imp = rfImp3$Overall), Model = "+ dup1 & dup2")
) %>%
  filter(Variable %in% c("V1", "V2", "V4", "V5", "duplicate1", "duplicate2")) %>%
  ggplot(aes(x = Variable, y = Imp, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "V1 Importance Diluted by Correlated Duplicates",
       x = "Predictor", y = "Importance") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 13)

Answer: Yes — V1’s importance score drops noticeably with each correlated duplicate added. The two (then three) correlated predictors split the “credit” that V1 alone would receive. This is a well-known limitation of traditional random forest importance: correlated predictors artificially dilute each other’s scores because they compete for the same splits across trees.
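One common mitigation, sketched here with caret::findCorrelation (the 0.90 cutoff is an arbitrary illustrative choice), is to screen out near-duplicate predictors before fitting:

pred_cols <- setdiff(names(simulated), "y")
high_cor  <- findCorrelation(cor(simulated[, pred_cols]), cutoff = 0.90)
pred_cols[high_cor]  # columns flagged for removal (one member of each highly correlated pair)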


Part (c) — Conditional Inference Forest (cforest)

set.seed(100)
cf_model <- cforest(y ~ ., data = simulated,
                    controls = cforest_unbiased(ntree = 500, mtry = 5))

# Traditional importance
cf_imp_trad <- varimp(cf_model, conditional = FALSE)

# Conditional importance (Strobl et al. 2007)
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
cf_compare <- data.frame(
  Variable    = names(cf_imp_trad),
  Traditional = as.numeric(cf_imp_trad),
  Conditional = as.numeric(cf_imp_cond)
) %>%
  tidyr::pivot_longer(cols = c("Traditional", "Conditional"),
                      names_to = "Method", values_to = "Importance") %>%
  mutate(Variable = reorder(Variable, Importance))

ggplot(cf_compare, aes(x = Variable, y = Importance, fill = Method)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  labs(title  = "cforest: Traditional vs. Conditional Variable Importance",
       x = "Predictor", y = "Importance") +
  theme_minimal(base_size = 12)

Answer: The conditional importance scores (Strobl et al. 2007) treat the correlated predictors more fairly. With conditional = TRUE, each predictor is permuted within strata defined by the predictors it is correlated with, so V1, duplicate1, and duplicate2 are no longer credited for the information they share; their scores move toward each variable's unique contribution instead of inflating one another. The uninformative V6–V10 remain near zero under both approaches. The conditional method therefore gives a fairer assessment of each predictor's unique contribution, at a noticeably higher computational cost.
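To see the effect on the correlated trio directly, a quick slice of the comparison table built above (a sketch; exact values depend on the forest's random draws):

cf_compare %>%
  filter(Variable %in% c("V1", "duplicate1", "duplicate2"))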


Part (d) — Boosted Trees and Cubist

set.seed(100)
gbm_model <- gbm(y ~ ., data = simulated,
                 distribution    = "gaussian",
                 n.trees         = 1000,
                 interaction.depth = 4,
                 shrinkage       = 0.01,
                 bag.fraction    = 0.5,
                 cv.folds        = 5,
                 verbose         = FALSE)

best_trees <- gbm.perf(gbm_model, method = "cv", plot.it = FALSE)
gbm_imp    <- summary(gbm_model, n.trees = best_trees, plotit = FALSE)
gbm_imp
set.seed(100)
cubist_x     <- simulated[, setdiff(names(simulated), "y")]
cubist_model <- cubist(x = cubist_x, y = simulated$y, committees = 10)
cubist_imp   <- varImp(cubist_model, scale = FALSE)
# GBM importance
gbm_df <- gbm_imp %>%
  rename(Variable = var, Importance = rel.inf) %>%
  mutate(Model = "GBM") %>%
  arrange(desc(Importance)) %>%
  head(12)

# Cubist importance
cub_df <- data.frame(
  Variable   = rownames(cubist_imp),
  Importance = cubist_imp$Overall,
  Model      = "Cubist"
) %>%
  arrange(desc(Importance)) %>%
  head(12)

bind_rows(gbm_df, cub_df) %>%
  mutate(Variable = reorder(Variable, Importance)) %>%
  ggplot(aes(x = Variable, y = Importance, fill = Model)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Variable Importance: GBM vs. Cubist",
       x = "Predictor", y = "Importance") +
  theme_minimal(base_size = 12)

Answer: Both GBM and Cubist show the same general pattern: true predictors (V1–V5) dominate and uninformative predictors (V6–V10) are suppressed, but both partially suffer from correlated-predictor dilution. GBM is particularly susceptible because early boosting iterations greedily select whichever correlated predictor best fits the current residuals, leaving little signal for the others to claim. Cubist, whose importance reflects how often a variable is used in rule conditions and committee models, spreads importance more evenly among correlated predictors. Neither method is as robust to correlation as the conditional cforest, but all correctly identify V6–V10 as unimportant.
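Cubist also tracks how often each predictor appears in rule conditions versus in the per-rule linear models, which is the usage information caret's varImp summarizes for Cubist. A sketch of how credit is shared between V1 and its duplicates:

cubist_model$usage  # percent usage in conditions and in the linear models, per predictor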


Problem 8.2 — Tree Bias with Different Granularities

set.seed(42)
n <- 500

# Three predictors with very different numbers of unique values
low_gran    <- sample(1:3,  n, replace = TRUE)          # 3 levels only
medium_gran <- sample(1:10, n, replace = TRUE)          # 10 levels
high_gran   <- round(rnorm(n, mean = 50, sd = 10), 2)  # ~continuous

# Response depends ONLY on low_gran
y_bias <- low_gran * 2 + rnorm(n, sd = 0.5)

df_bias <- data.frame(low_gran, medium_gran, high_gran, y_bias)

cat("Unique values per predictor:\n")
## Unique values per predictor:
cat("  low_gran:   ", length(unique(low_gran)), "\n")
##   low_gran:    3
cat("  medium_gran:", length(unique(medium_gran)), "\n")
##   medium_gran: 10
cat("  high_gran:  ", length(unique(high_gran)), "\n")
##   high_gran:   463
set.seed(42)
bias_rf  <- randomForest(y_bias ~ ., data = df_bias,
                         importance = TRUE, ntree = 500)
bias_imp <- varImp(bias_rf, scale = FALSE)
bias_imp
bias_df <- data.frame(
  Predictor  = rownames(bias_imp),
  Importance = bias_imp$Overall,
  Levels     = c("3 levels\n(true signal)", "10 levels\n(noise)", "~continuous\n(noise)")
)

ggplot(bias_df, aes(x = Predictor, y = Importance, fill = Predictor)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = Levels), vjust = -0.5, size = 3.5) +
  scale_fill_brewer(palette = "Set1", guide = "none") +
  labs(
    title    = "Tree Bias: High-Granularity Predictors Inflate Importance",
    subtitle = "y depends ONLY on low_gran (3 levels), yet high_gran dominates",
    x = "Predictor", y = "Importance (MSE Increase)"
  ) +
  theme_minimal(base_size = 13)

Answer: Even though y_bias depends exclusively on low_gran, the random forest assigns the highest importance to high_gran (the continuous predictor). This is classic selection bias in tree-based models: predictors with more unique values offer more candidate split points at each node, making them more likely to be selected by chance during tree construction. Predictors with few levels (like low_gran with 3 categories) are structurally disadvantaged — they cannot achieve the same granularity of splits and therefore appear less important than they truly are. This bias is most severe for single trees; ensembles reduce but do not eliminate it.
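For contrast, a conditional inference forest uses unbiased split selection (Strobl et al. 2007), so the same simulation should rank low_gran first. A sketch, with mtry = 2 chosen only because there are just three predictors:

set.seed(42)
bias_cf <- cforest(y_bias ~ ., data = df_bias,
                   controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(bias_cf)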


Problem 8.3 — Stochastic Gradient Boosting Parameters

library(AppliedPredictiveModeling)
data(solubility)

# Combine train predictors and outcome
sol_train <- solTrainXtrans
sol_train$Solubility <- solTrainY
# Low bagging fraction + low learning rate (0.1 / 0.1)
set.seed(42)
gbm_low <- gbm(Solubility ~ .,
               data              = sol_train,
               distribution      = "gaussian",
               n.trees           = 3000,
               interaction.depth = 3,
               shrinkage         = 0.1,
               bag.fraction      = 0.1,
               cv.folds          = 5,
               verbose           = FALSE)

best_low <- gbm.perf(gbm_low, method = "cv", plot.it = FALSE)
imp_low  <- summary(gbm_low, n.trees = best_low, plotit = FALSE) %>%
  head(25) %>%
  mutate(Model = "bag=0.1, lr=0.1")
# High bagging fraction + high learning rate (0.9 / 0.9)
set.seed(42)
gbm_high <- gbm(Solubility ~ .,
                data              = sol_train,
                distribution      = "gaussian",
                n.trees           = 3000,
                interaction.depth = 3,
                shrinkage         = 0.9,
                bag.fraction      = 0.9,
                cv.folds          = 5,
                verbose           = FALSE)

best_high <- gbm.perf(gbm_high, method = "cv", plot.it = FALSE)
imp_high  <- summary(gbm_high, n.trees = best_high, plotit = FALSE) %>%
  head(25) %>%
  mutate(Model = "bag=0.9, lr=0.9")
bind_rows(imp_low, imp_high) %>%
  rename(Variable = var, Importance = rel.inf) %>%
  group_by(Model) %>%
  mutate(Rank = rank(-Importance),
         Variable = reorder(Variable, -Importance)) %>%
  ggplot(aes(x = Rank, y = Importance, color = Model)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title    = "Variable Importance: Low vs. High Bagging Fraction & Learning Rate",
    subtitle = "High parameters concentrate importance on top predictors",
    x        = "Predictor Rank",
    y        = "Relative Influence (%)"
  ) +
  theme_minimal(base_size = 13)

Part (a) — Why does the right model focus on fewer predictors?

Answer: When both the bagging fraction and learning rate are high (0.9), each tree is built on nearly the full training set and takes large gradient steps. The first few trees aggressively capture the dominant predictors, quickly reducing the residual. Subsequent trees find little remaining signal to model, so only the top predictors are ever chosen. With low values (0.1), each tree sees a small random subsample and takes tiny gradient steps — the ensemble must explore broadly across many iterations, allowing more predictors to contribute over time. The result is a much flatter importance curve.
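One way to quantify this concentration (a sketch; summary.gbm returns relative influences that sum to 100, so each result is the percentage of total influence captured by the five strongest predictors):

full_low  <- summary(gbm_low,  n.trees = best_low,  plotit = FALSE)
full_high <- summary(gbm_high, n.trees = best_high, plotit = FALSE)
c(top5_low  = sum(head(full_low$rel.inf,  5)),
  top5_high = sum(head(full_high$rel.inf, 5)))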

Part (b) — Which model is more predictive?

Answer: The low-parameter model (bag = 0.1, lr = 0.1) is generally more predictive of new samples. A lower learning rate combined with more trees learns slowly and tends to generalize better. A high learning rate commits strongly to the first few trees' fit of the training data and risks overfitting, while a high bagging fraction gives up most of the variance reduction that random subsampling provides. The broader importance spread of the low-parameter model also suggests it has captured more of the true predictor relationships.
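This can be checked directly from the cross-validated error that gbm stores when cv.folds > 1 (a sketch; cv.error holds the per-iteration CV deviance, which is squared error for a gaussian fit):

c(low_bag_lr  = min(gbm_low$cv.error),
  high_bag_lr = min(gbm_high$cv.error))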

Part (c) — Effect of increasing interaction depth

Answer: Increasing interaction depth allows trees to model higher-order predictor interactions, so multiple predictors can jointly contribute to predictions. This tends to flatten the importance slope further — more predictors earn importance through interaction terms even in the high-parameter model. However, deeper trees also increase overfitting risk. In the high bagging/high learning rate setting, deeper trees would compound overfitting. In the low setting, moderate depth increases (e.g., 5–7) typically improve both coverage of predictors and predictive performance.
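A sketch of how this could be checked empirically; the refit below was not part of the original comparison, and the depth of 7 is an arbitrary illustrative choice:

set.seed(42)
gbm_deep <- gbm(Solubility ~ .,
                data              = sol_train,
                distribution      = "gaussian",
                n.trees           = 3000,
                interaction.depth = 7,
                shrinkage         = 0.9,
                bag.fraction      = 0.9,
                cv.folds          = 5,
                verbose           = FALSE)
best_deep <- gbm.perf(gbm_deep, method = "cv", plot.it = FALSE)
# Share of relative influence held by the top 5 predictors; compare with the depth-3 figure above
sum(head(summary(gbm_deep, n.trees = best_deep, plotit = FALSE)$rel.inf, 5))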


Problem 8.7 — Chemical Manufacturing Process: Tree-Based Models

Data Preparation

data(ChemicalManufacturingProcess)
df_cmp <- ChemicalManufacturingProcess

cat("Dimensions:", dim(df_cmp), "\n")
## Dimensions: 176 58
cat("Missing values:", sum(is.na(df_cmp)), "\n")
## Missing values: 106
library(RANN)  # for knnImpute

# Impute missing values with kNN
# (method = "knnImpute" also centers and scales every column, including Yield,
#  so downstream RMSE values are on the standardized scale)
pre_proc <- preProcess(df_cmp, method = "knnImpute")
df_imp   <- predict(pre_proc, df_cmp)

# Remove near-zero variance predictors
nzv      <- nearZeroVar(df_imp)
if (length(nzv) > 0) df_imp <- df_imp[, -nzv]

# Train/test split (80/20)
set.seed(42)
train_idx  <- createDataPartition(df_imp$Yield, p = 0.8, list = FALSE)
train_data <- df_imp[ train_idx, ]
test_data  <- df_imp[-train_idx, ]

cat("Training samples:", nrow(train_data), "\n")
## Training samples: 144
cat("Test samples:    ", nrow(test_data),  "\n")
## Test samples:     32
ctrl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)

Fit Tree-Based Models

library(rpart)
library(rpart.plot)

set.seed(42)
rpart_model <- train(Yield ~ ., data = train_data,
                     method    = "rpart",
                     tuneLength = 10,
                     trControl  = ctrl)
set.seed(42)
rf_model <- train(Yield ~ ., data = train_data,
                  method    = "rf",
                  tuneLength = 5,
                  trControl  = ctrl,
                  importance = TRUE)
set.seed(42)
gbm_model <- train(Yield ~ ., data = train_data,
                   method    = "gbm",
                   tuneLength = 5,
                   trControl  = ctrl,
                   verbose    = FALSE)
set.seed(42)
cubist_model <- train(Yield ~ ., data = train_data,
                      method    = "cubist",
                      tuneLength = 5,
                      trControl  = ctrl)

Part (a) — Model Performance Comparison

# Resampling performance
results <- resamples(list(
  RPART  = rpart_model,
  RF     = rf_model,
  GBM    = gbm_model,
  Cubist = cubist_model
))

summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RPART, RF, GBM, Cubist 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPART  0.5094332 0.5384660 0.6288195 0.6249946 0.6580124 0.8154511    0
## RF     0.3039023 0.4451209 0.4712620 0.4663977 0.5309958 0.5599520    0
## GBM    0.3131834 0.3721980 0.4443555 0.4457566 0.5368358 0.5776493    0
## Cubist 0.3205882 0.3491016 0.3831368 0.4047230 0.4348073 0.5472483    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPART  0.6471415 0.7431495 0.7785785 0.7968623 0.8675633 1.0037398    0
## RF     0.3781068 0.5958784 0.6049589 0.5886004 0.6568276 0.6931466    0
## GBM    0.3626452 0.5304471 0.5643845 0.5731613 0.6867113 0.7572435    0
## Cubist 0.3789029 0.4654099 0.4734809 0.5091773 0.5504458 0.6838888    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPART  0.2570077 0.3370597 0.4031235 0.4033136 0.4558698 0.6058422    0
## RF     0.4821465 0.5740774 0.6470681 0.6633991 0.7718791 0.8522501    0
## GBM    0.3262934 0.5589653 0.6722489 0.6480267 0.7504471 0.8382500    0
## Cubist 0.5551531 0.6714166 0.7387432 0.7280324 0.8066792 0.8413148    0
dotplot(results, metric = "RMSE",
        main = "10-Fold CV RMSE: Tree-Based Models")

test_preds <- sapply(
  list(RPART = rpart_model, RF = rf_model,
       GBM   = gbm_model,  Cubist = cubist_model),
  function(m) {
    preds <- predict(m, test_data)
    c(RMSE = round(RMSE(preds, test_data$Yield), 4),
      R2   = round(cor(preds, test_data$Yield)^2, 4))
  }
)
as.data.frame(t(test_preds))

Answer: Cubist gives the best resampling performance (mean CV RMSE ≈ 0.51, mean R² ≈ 0.73), with GBM close behind (≈ 0.57 / 0.65) and Random Forest competitive but slightly worse (≈ 0.59 / 0.66). The single regression tree (rpart) performs worst (≈ 0.80 / 0.40) because of its high variance and instability: a single tree overfits the training data and generalizes poorly. The ensemble methods (RF, GBM, Cubist) all substantially outperform the single tree, and a similar ordering holds on the held-out test set. Because Yield was centered and scaled during preprocessing, these RMSE values are in standard-deviation units rather than raw yield units.

Part (b) — Variable Importance in Optimal Model

# Plot top 20 predictors for GBM and Cubist
p1 <- plot(varImp(gbm_model),    top = 20, main = "GBM — Top 20 Predictors")
p2 <- plot(varImp(cubist_model), top = 20, main = "Cubist — Top 20 Predictors")
p1

p2

# Top 10 predictors from the GBM fit (used for the importance discussion below)
gbm_vi <- varImp(gbm_model)$importance %>%
  tibble::rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  mutate(Type = ifelse(grepl("ManufacturingProcess", Predictor),
                       "Process", "Biological"))
gbm_vi

Answer: Manufacturing process variables dominate the top predictors (e.g., ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17), consistent with what was found in the linear and nonlinear models from Chapters 6–7. Biological material variables appear lower in the rankings. This makes practical sense: process variables are controllable and show stronger, more systematic relationships with yield. The top 10 predictors are largely consistent across model types, which increases our confidence that these variables genuinely drive yield.
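To back up the cross-model consistency claim, a quick check (a sketch) of how many of the GBM top 10 also appear in the Cubist top 10:

cubist_vi_top <- varImp(cubist_model)$importance %>%
  tibble::rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  head(10)
intersect(gbm_vi$Predictor, cubist_vi_top$Predictor)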

Part (c) — Plot the Optimal Single Tree

rpart.plot(
  rpart_model$finalModel,
  type    = 4,
  extra   = 101,
  fallen.leaves = TRUE,
  shadow.col    = "gray",
  main    = "Single Regression Tree: Chemical Manufacturing Yield"
)

Answer: The single tree provides an interpretable decision structure for yield prediction. Terminal nodes show the mean yield and the proportion of training samples at each leaf. The tree reveals clear thresholds: for example, certain values of ManufacturingProcess32 and ManufacturingProcess13 cleanly partition high-yield from low-yield batches. This view offers actionable process knowledge — operators can trace the decision path to understand which process conditions are necessary for high yield. However, the single tree is less accurate than ensemble methods and should be treated as a qualitative guide rather than a precise predictor.
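As a cross-check (a sketch), the split variables actually used by the pruned tree can be pulled from the rpart frame and compared with the importance rankings above:

used_splits <- rpart_model$finalModel$frame$var
unique(as.character(used_splits[used_splits != "<leaf>"]))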


Summary

Problem   Key Finding
8.1a      Random forest correctly ignores uninformative predictors V6–V10
8.1b      Correlated duplicates dilute V1’s importance, a known RF limitation
8.1c      Conditional cforest is most robust to correlated-predictor importance distortion
8.1d      GBM and Cubist show similar dilution but correctly suppress noise variables
8.2       Tree bias inflates importance for high-granularity predictors even when they are uninformative
8.3a      High bagging fraction/learning rate concentrates importance on top predictors; low values spread it broadly
8.3b      Low bagging fraction + low learning rate generalizes better to new samples
8.3c      Greater interaction depth flattens the importance slope by enabling predictor interactions
8.7       Cubist/GBM perform best; manufacturing process variables dominate importance

Rendered with R 4.5.2 — Applied Predictive Modeling, Kuhn & Johnson