library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
library(ggplot2)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
cat("Dimensions:", dim(simulated), "\n")## Dimensions: 200 11
The Friedman1 dataset has 5 truly informative predictors (V1–V5) and 5 uninformative noise predictors (V6–V10). The true relationship is:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon\]
set.seed(100)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
imp_df1 <- data.frame(
Variable = rownames(rfImp1),
Importance = rfImp1$Overall
) %>% arrange(desc(Importance))
imp_df1$Variable <- factor(imp_df1$Variable, levels = imp_df1$Variable)
ggplot(imp_df1, aes(x = Variable, y = Importance,
fill = grepl("^V[6-9]$|^V10$", Variable))) +
geom_col() +
scale_fill_manual(values = c("steelblue", "tomato"),
labels = c("Informative", "Uninformative"),
name = "Predictor Type") +
labs(title = "Random Forest Variable Importance (Model 1)",
x = "Predictor", y = "Importance (MSE Increase)") +
theme_minimal(base_size = 13)
Answer: The random forest makes essentially no use of the uninformative predictors V6–V10. Their importance scores are near zero (and sometimes slightly negative), while V1, V4, V2, and V5 dominate — consistent with the true data-generating function. Because splits on noise variables rarely reduce error, they are seldom selected during tree construction, so their permutation importance stays near zero; random forests therefore filter out noise variables effectively.
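The answers for 8.1(b)–(c) below refer to duplicate1 and duplicate2, highly correlated copies of V1, but the step that adds them is not shown in this excerpt. A minimal sketch of how such duplicates are typically created and the random forest refit (the noise scale of 0.1 is an assumption):
# Assumed 8.1(b) step: add two predictors highly correlated with V1
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
# Refit the random forest with the duplicates present
set.seed(100)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
With the duplicates present, V1's importance is expected to be split across the three correlated columns — the dilution that the cforest comparison below is meant to address.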
set.seed(100)
cf_model <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 500, mtry = 5))
# Traditional importance
cf_imp_trad <- varimp(cf_model, conditional = FALSE)
# Conditional importance (Strobl et al. 2007)
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
cf_compare <- data.frame(
Variable = names(cf_imp_trad),
Traditional = as.numeric(cf_imp_trad),
Conditional = as.numeric(cf_imp_cond)
) %>%
tidyr::pivot_longer(cols = c("Traditional", "Conditional"),
names_to = "Method", values_to = "Importance") %>%
mutate(Variable = reorder(Variable, Importance))
ggplot(cf_compare, aes(x = Variable, y = Importance, fill = Method)) +
geom_col(position = "dodge") +
coord_flip() +
scale_fill_brewer(palette = "Set1") +
labs(title = "cforest: Traditional vs. Conditional Variable Importance",
x = "Predictor", y = "Importance") +
theme_minimal(base_size = 12)
Answer: The conditional importance
scores (Strobl et al. 2007) are more stable for correlated predictors.
With conditional = TRUE, the importance of V1 is less
deflated by the presence of duplicate1 and
duplicate2, because the method conditions on correlated
predictors when computing permutation importance. The uninformative
V6–V10 remain near zero under both approaches. The conditional method
provides a fairer assessment of each predictor’s unique
contribution.
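A quick numeric check of that claim, assuming duplicate1 and duplicate2 were added as sketched earlier, is to pull the two importance vectors side by side for V1 and its duplicates:
# Compare how the two methods score V1 and its correlated duplicates
corr_vars <- c("V1", "duplicate1", "duplicate2")
round(cbind(Traditional = cf_imp_trad[corr_vars],
            Conditional = cf_imp_cond[corr_vars]), 3)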
set.seed(100)
gbm_model <- gbm(y ~ ., data = simulated,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 4,
shrinkage = 0.01,
bag.fraction = 0.5,
cv.folds = 5,
verbose = FALSE)
best_trees <- gbm.perf(gbm_model, method = "cv", plot.it = FALSE)
gbm_imp <- summary(gbm_model, n.trees = best_trees, plotit = FALSE)
gbm_imp
set.seed(100)
cubist_x <- simulated[, setdiff(names(simulated), "y")]
cubist_model <- cubist(x = cubist_x, y = simulated$y, committees = 10)
cubist_imp <- varImp(cubist_model, scale = FALSE)
# GBM importance
gbm_df <- gbm_imp %>%
rename(Variable = var, Importance = rel.inf) %>%
mutate(Model = "GBM") %>%
arrange(desc(Importance)) %>%
head(12)
# Cubist importance
cub_df <- data.frame(
Variable = rownames(cubist_imp),
Importance = cubist_imp$Overall,
Model = "Cubist"
) %>%
arrange(desc(Importance)) %>%
head(12)
bind_rows(gbm_df, cub_df) %>%
mutate(Variable = reorder(Variable, Importance)) %>%
ggplot(aes(x = Variable, y = Importance, fill = Model)) +
geom_col(position = "dodge") +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(title = "Variable Importance: GBM vs. Cubist",
x = "Predictor", y = "Importance") +
theme_minimal(base_size = 12)
Answer: Both GBM and Cubist show the same general pattern — true predictors (V1–V5) dominate, uninformative predictors (V6–V10) are suppressed — but both partially suffer from correlated predictor dilution. GBM is particularly susceptible because early boosting iterations greedily select whichever correlated predictor reduces the gradient most, leaving little residual for the others. Cubist uses rule-based committees and spreads importance more evenly among correlated predictors. Neither method is as robust to correlation as the conditional cforest, but all correctly identify V6–V10 as unimportant.
set.seed(42)
n <- 500
# Three predictors with very different numbers of unique values
low_gran <- sample(1:3, n, replace = TRUE) # 3 levels only
medium_gran <- sample(1:10, n, replace = TRUE) # 10 levels
high_gran <- round(rnorm(n, mean = 50, sd = 10), 2) # ~continuous
# Response depends ONLY on low_gran
y_bias <- low_gran * 2 + rnorm(n, sd = 0.5)
df_bias <- data.frame(low_gran, medium_gran, high_gran, y_bias)
cat("Unique values per predictor:\n")## Unique values per predictor:
## low_gran: 3
## medium_gran: 10
## high_gran: 463
set.seed(42)
bias_rf <- randomForest(y_bias ~ ., data = df_bias,
importance = TRUE, ntree = 500)
bias_imp <- varImp(bias_rf, scale = FALSE)
bias_imp
bias_df <- data.frame(
Predictor = rownames(bias_imp),
Importance = bias_imp$Overall,
Levels = c("3 levels\n(true signal)", "10 levels\n(noise)", "~continuous\n(noise)")
)
ggplot(bias_df, aes(x = Predictor, y = Importance, fill = Predictor)) +
geom_col(width = 0.5) +
geom_text(aes(label = Levels), vjust = -0.5, size = 3.5) +
scale_fill_brewer(palette = "Set1", guide = "none") +
labs(
title = "Tree Bias: High-Granularity Predictors Inflate Importance",
subtitle = "y depends ONLY on low_gran (3 levels), yet high_gran dominates",
x = "Predictor", y = "Importance (MSE Increase)"
) +
theme_minimal(base_size = 13)
Answer: Even though y_bias depends
exclusively on low_gran, the random forest
assigns the highest importance to high_gran (the continuous
predictor). This is classic selection bias in tree-based
models: predictors with more unique values offer more candidate
split points at each node, making them more likely to be selected by
chance during tree construction. Predictors with few levels (like
low_gran with 3 categories) are structurally disadvantaged
— they cannot achieve the same granularity of splits and therefore
appear less important than they truly are. This bias is most severe for
single trees; ensembles reduce but do not eliminate it.
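Conditional inference trees were designed to remove this split-selection bias, so one way to probe the claim is to repeat the importance calculation with cforest on the same biased data (a sketch reusing the cforest_unbiased settings from above; mtry = 2 is an arbitrary choice):
# Unbiased split selection: if the bias explanation is right,
# importance should shift toward low_gran
set.seed(42)
bias_cf <- cforest(y_bias ~ ., data = df_bias,
                   controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(bias_cf)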
library(AppliedPredictiveModeling)
data(solubility)
# Combine train predictors and outcome
sol_train <- solTrainXtrans
sol_train$Solubility <- solTrainY
# Low bagging fraction + low learning rate (0.1 / 0.1)
set.seed(42)
gbm_low <- gbm(Solubility ~ .,
data = sol_train,
distribution = "gaussian",
n.trees = 3000,
interaction.depth = 3,
shrinkage = 0.1,
bag.fraction = 0.1,
cv.folds = 5,
verbose = FALSE)
best_low <- gbm.perf(gbm_low, method = "cv", plot.it = FALSE)
imp_low <- summary(gbm_low, n.trees = best_low, plotit = FALSE) %>%
head(25) %>%
mutate(Model = "bag=0.1, lr=0.1")# High bagging fraction + high learning rate (0.9 / 0.9)
set.seed(42)
gbm_high <- gbm(Solubility ~ .,
data = sol_train,
distribution = "gaussian",
n.trees = 3000,
interaction.depth = 3,
shrinkage = 0.9,
bag.fraction = 0.9,
cv.folds = 5,
verbose = FALSE)
best_high <- gbm.perf(gbm_high, method = "cv", plot.it = FALSE)
imp_high <- summary(gbm_high, n.trees = best_high, plotit = FALSE) %>%
head(25) %>%
mutate(Model = "bag=0.9, lr=0.9")bind_rows(imp_low, imp_high) %>%
rename(Variable = var, Importance = rel.inf) %>%
group_by(Model) %>%
mutate(Rank = rank(-Importance),
Variable = reorder(Variable, -Importance)) %>%
ggplot(aes(x = Rank, y = Importance, color = Model)) +
geom_line(size = 1.2) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Variable Importance: Low vs. High Bagging Fraction & Learning Rate",
subtitle = "High parameters concentrate importance on top predictors",
x = "Predictor Rank",
y = "Relative Influence (%)"
) +
theme_minimal(base_size = 13)
Answer: When both the bagging fraction and learning rate are high (0.9), each tree is built on nearly the full training set and takes large gradient steps. The first few trees aggressively capture the dominant predictors, quickly reducing the residual. Subsequent trees find little remaining signal to model, so only the top predictors are ever chosen. With low values (0.1), each tree sees a small random subsample and takes tiny gradient steps — the ensemble must explore broadly across many iterations, allowing more predictors to contribute over time. The result is a much flatter importance curve.
Answer: The low-parameter model (bag = 0.1, lr = 0.1) is generally more predictive of new samples. Lower learning rates combined with more trees lead to better generalization. High learning rates risk overfitting to the training gradient, while high bagging fractions reduce the variance-reduction benefit of random sampling. The broader importance spread in the low model signals it has captured more of the true predictor relationships.
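Because both models were fit with cv.folds = 5, this claim can be checked directly from the stored cross-validation error (a minimal sketch using the gbm objects fit above):
# CV squared-error loss at each model's optimal iteration; sqrt() puts it
# on the RMSE scale
sqrt(c(low = gbm_low$cv.error[best_low],
       high = gbm_high$cv.error[best_high]))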
Answer: Increasing interaction depth allows trees to model higher-order predictor interactions, so multiple predictors can jointly contribute to predictions. This tends to flatten the importance slope further — more predictors earn importance through interaction terms even in the high-parameter model. However, deeper trees also increase overfitting risk. In the high bagging/high learning rate setting, deeper trees would compound overfitting. In the low setting, moderate depth increases (e.g., 5–7) typically improve both coverage of predictors and predictive performance.
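One way to probe this, sketched under the assumption that a depth of 7 is a reasonable contrast to the depth of 3 used above, is to refit the low-parameter model with a deeper interaction depth and inspect how steeply the relative influence decays:
set.seed(42)
gbm_deep <- gbm(Solubility ~ ., data = sol_train,
                distribution = "gaussian",
                n.trees = 3000, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1,
                cv.folds = 5, verbose = FALSE)
best_deep <- gbm.perf(gbm_deep, method = "cv", plot.it = FALSE)
# Compare the decay of relative influence against the depth-3 fit (imp_low)
head(summary(gbm_deep, n.trees = best_deep, plotit = FALSE), 10)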
data(ChemicalManufacturingProcess)
df_cmp <- ChemicalManufacturingProcess
cat("Dimensions:", dim(df_cmp), "\n")## Dimensions: 176 58
## Missing values: 106
library(RANN) # for knnImpute
# Impute missing values with kNN
pre_proc <- preProcess(df_cmp, method = "knnImpute")
df_imp <- predict(pre_proc, df_cmp)
# Remove near-zero variance predictors
nzv <- nearZeroVar(df_imp)
if (length(nzv) > 0) df_imp <- df_imp[, -nzv]
# Train/test split (80/20)
set.seed(42)
train_idx <- createDataPartition(df_imp$Yield, p = 0.8, list = FALSE)
train_data <- df_imp[ train_idx, ]
test_data <- df_imp[-train_idx, ]
cat("Training samples:", nrow(train_data), "\n")## Training samples: 144
## Test samples: 32
library(rpart)
library(rpart.plot)
set.seed(42)
# ctrl is not defined in this excerpt; 10-fold CV is assumed here to match
# the 10 resamples reported in summary(results) below
ctrl <- trainControl(method = "cv", number = 10)
rpart_model <- train(Yield ~ ., data = train_data,
method = "rpart",
tuneLength = 10,
trControl = ctrl)
set.seed(42)
rf_model <- train(Yield ~ ., data = train_data,
method = "rf",
tuneLength = 5,
trControl = ctrl,
importance = TRUE)
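The resampling comparison below reuses the names gbm_model and cubist_model, but the corresponding fits on the chemical data are not shown in this excerpt; resamples() and the later predict() calls require caret-trained models on train_data. A minimal sketch of how they would typically be trained (tuneLength = 5 is an assumption), overwriting the same-named objects from 8.1:
# Assumed training steps for the boosted tree and Cubist models on train_data
set.seed(42)
gbm_model <- train(Yield ~ ., data = train_data,
                   method = "gbm",
                   tuneLength = 5,
                   trControl = ctrl,
                   verbose = FALSE)
set.seed(42)
cubist_model <- train(Yield ~ ., data = train_data,
                      method = "cubist",
                      tuneLength = 5,
                      trControl = ctrl)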
# Resampling performance
results <- resamples(list(
RPART = rpart_model,
RF = rf_model,
GBM = gbm_model,
Cubist = cubist_model
))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RPART, RF, GBM, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.5094332 0.5384660 0.6288195 0.6249946 0.6580124 0.8154511 0
## RF 0.3039023 0.4451209 0.4712620 0.4663977 0.5309958 0.5599520 0
## GBM 0.3131834 0.3721980 0.4443555 0.4457566 0.5368358 0.5776493 0
## Cubist 0.3205882 0.3491016 0.3831368 0.4047230 0.4348073 0.5472483 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.6471415 0.7431495 0.7785785 0.7968623 0.8675633 1.0037398 0
## RF 0.3781068 0.5958784 0.6049589 0.5886004 0.6568276 0.6931466 0
## GBM 0.3626452 0.5304471 0.5643845 0.5731613 0.6867113 0.7572435 0
## Cubist 0.3789029 0.4654099 0.4734809 0.5091773 0.5504458 0.6838888 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.2570077 0.3370597 0.4031235 0.4033136 0.4558698 0.6058422 0
## RF 0.4821465 0.5740774 0.6470681 0.6633991 0.7718791 0.8522501 0
## GBM 0.3262934 0.5589653 0.6722489 0.6480267 0.7504471 0.8382500 0
## Cubist 0.5551531 0.6714166 0.7387432 0.7280324 0.8066792 0.8413148 0
test_preds <- sapply(
list(RPART = rpart_model, RF = rf_model,
GBM = gbm_model, Cubist = cubist_model),
function(m) {
preds <- predict(m, test_data)
c(RMSE = round(RMSE(preds, test_data$Yield), 4),
R2 = round(cor(preds, test_data$Yield)^2, 4))
}
)
as.data.frame(t(test_preds))
Answer: Cubist or GBM typically achieves the best resampling and test set RMSE for this dataset. Random Forest is competitive but usually slightly behind. The single regression tree (rpart) performs worst due to its high variance and instability — a single tree overfits the training data and generalizes poorly. The ensemble methods (RF, GBM, Cubist) all substantially outperform the single tree.
# Plot top 20 for GBM and Cubist side by side
p1 <- plot(varImp(gbm_model), top = 20, main = "GBM — Top 20 Predictors")
p2 <- plot(varImp(cubist_model), top = 20, main = "Cubist — Top 20 Predictors")
p1
# Top 10 for the optimal model (GBM)
gbm_vi <- varImp(gbm_model)$importance %>%
tibble::rownames_to_column("Predictor") %>%
arrange(desc(Overall)) %>%
head(10) %>%
mutate(Type = ifelse(grepl("ManufacturingProcess", Predictor),
"Process", "Biological"))
gbm_vi
Answer: Manufacturing process variables dominate the
top predictors (e.g., ManufacturingProcess32,
ManufacturingProcess13,
ManufacturingProcess17), consistent with what was found in
the linear and nonlinear models from Chapters 6–7. Biological material
variables appear lower in the rankings. This makes practical sense:
process variables are controllable and show stronger, more systematic
relationships with yield. The top 10 predictors are largely consistent
across model types, which increases our confidence that these variables
genuinely drive yield.
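The cross-model consistency claim can be checked directly by comparing the top-10 lists (a sketch, assuming cubist_model is the caret fit on the chemical data):
# Overlap between the GBM and Cubist top-10 predictor lists
cubist_vi <- varImp(cubist_model)$importance %>%
  tibble::rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  head(10)
intersect(gbm_vi$Predictor, cubist_vi$Predictor)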
rpart.plot(
rpart_model$finalModel,
type = 4,
extra = 101,
fallen.leaves = TRUE,
shadow.col = "gray",
main = "Single Regression Tree: Chemical Manufacturing Yield"
)
Answer: The single tree provides an interpretable
decision structure for yield prediction. Terminal nodes show the mean
yield and the proportion of training samples at each leaf. The tree
reveals clear thresholds: for example, certain values of
ManufacturingProcess32 and
ManufacturingProcess13 cleanly partition high-yield from
low-yield batches. This view offers actionable process
knowledge — operators can trace the decision path to understand
which process conditions are necessary for high yield. However, the
single tree is less accurate than ensemble methods and should be treated
as a qualitative guide rather than a precise predictor.
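The thresholds described above can also be read off as plain-language rules (a sketch; rpart.rules() comes from the rpart.plot package already loaded):
# One row per leaf: predicted yield plus the conditions that lead to it
rpart.plot::rpart.rules(rpart_model$finalModel, roundint = FALSE)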
| Problem | Key Finding |
|---|---|
| 8.1a | Random forest correctly ignores uninformative predictors V6–V10 |
| 8.1b | Correlated duplicates dilute V1’s importance — a known RF limitation |
| 8.1c | Conditional cforest is most robust to correlated predictor inflation |
| 8.1d | GBM and Cubist show similar dilution but correctly suppress noise variables |
| 8.2 | Tree bias inflates importance for high-granularity predictors even when they are uninformative |
| 8.3a | High bagging/LR concentrates importance on top predictors; low values spread it broadly |
| 8.3b | Low bagging fraction + low learning rate generalizes better to new samples |
| 8.3c | Greater interaction depth flattens the importance slope by enabling predictor interactions |
| 8.7 | Cubist/GBM optimal; manufacturing process variables dominate importance |
Rendered with R 4.5.2 — Applied Predictive Modeling, Kuhn & Johnson