library(AppliedPredictiveModeling)
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(rpart.plot)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
The random forest clearly separates the informative from the uninformative predictors. V1, V2, and V4 have the largest importance scores, V5 contributes moderately and V3 only weakly, while V6 through V10 sit near zero (some slightly negative). The model therefore did not rely on the noise variables V6-V10 for its predictions; the algorithm's inherent resistance to irrelevant predictors is evident here.
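The discussion of the conditional forest below refers to a near-copy of V1 named duplicate1, a step whose code is not shown above. A sketch of that step, assuming the usual construction (V1 plus a small amount of Gaussian noise) and working on a copy so that the objects above are left unchanged:
simulated_dup <- simulated
set.seed(200)
simulated_dup$duplicate1 <- simulated_dup$V1 + rnorm(nrow(simulated_dup)) * 0.1
cor(simulated_dup$duplicate1, simulated_dup$V1)   # correlation close to 1
model_dup <- randomForest(y ~ ., data = simulated_dup,
                          importance = TRUE, ntree = 1000)
varImp(model_dup, scale = FALSE)   # V1's importance is now shared with duplicate1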
model_cforest <- cforest(y ~ ., data = simulated)
cfImp_traditional <- varimp(model_cforest, conditional = FALSE)
cfImp_conditional <- varimp(model_cforest, conditional = TRUE)
par(mfrow = c(1, 2), mar = c(5, 8, 4, 2))
barplot(sort(cfImp_traditional), main = "Traditional Importance",
        horiz = TRUE, las = 1, cex.names = 0.6)
barplot(sort(cfImp_conditional), main = "Conditional Importance",
        horiz = TRUE, las = 1, cex.names = 0.6)
Both importance measures from the conditional inference forest show a pattern similar to the standard random forest, with the informative predictors clearly dominating. The conditional importance measure attempts to adjust for correlations between predictors by permuting each variable within strata defined by the predictors it is correlated with. When a highly correlated near-copy of V1 (such as duplicate1 above) is included, the traditional measure again splits importance between V1 and the copy, whereas the conditional approach gives a more stable assessment of each predictor's unique contribution.
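A small check of that claim, refitting the conditional inference forest on the copy that contains duplicate1 (an illustrative sketch, not part of the results above; the conditional computation can be slow):
cf_dup <- cforest(y ~ ., data = simulated_dup,
                  controls = cforest_unbiased(ntree = 500))
varimp(cf_dup, conditional = FALSE)   # traditional: V1 and duplicate1 share importance
varimp(cf_dup, conditional = TRUE)    # conditional: adjusts for their correlation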
model_gbm <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 5, shrinkage = 0.1, verbose = FALSE)
gbmImp <- summary(model_gbm, plotit = FALSE)
par(mar = c(5, 8, 4, 2))
barplot(gbmImp$rel.inf, names.arg = gbmImp$var,
        main = "Gradient Boosting Variable Influence",
        horiz = TRUE, las = 1, cex.names = 0.7)
model_cubist <- cubist(x = simulated[, -ncol(simulated)], y = simulated$y)
cubistImp <- varImp(model_cubist)
par(mar = c(5, 8, 4, 2))
barplot(sort(setNames(cubistImp$Overall, rownames(cubistImp))),   # keep predictor names so bars are labeled
        main = "Cubist Variable Importance",
        horiz = TRUE, las = 1, cex.names = 0.7)
Both gradient boosting and Cubist exhibit the same general pattern: the informative predictors (V1-V5) dominate while the noise variables (V6-V10) have negligible importance. Cubist's rule-based approach creates especially sharp distinctions between relevant and irrelevant predictors. The consistency across all of the tree-based methods confirms their robustness to noise variables, although, as with the random forest and duplicate1, their importance scores are also diluted when highly correlated predictors are included.
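The same dilution can be sketched for the other two models by refitting them on the copy that contains duplicate1 (again an illustrative check, not part of the results above):
gbm_dup <- gbm(y ~ ., data = simulated_dup, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 5, shrinkage = 0.1,
               verbose = FALSE)
summary(gbm_dup, plotit = FALSE)   # V1's relative influence is shared with duplicate1
cubist_dup <- cubist(x = subset(simulated_dup, select = -y), y = simulated_dup$y)
varImp(cubist_dup)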
set.seed(100)
n <- 500
continuous_pred <- runif(n, 0, 100)
binary_pred <- sample(c(0, 1), n, replace = TRUE)
categorical_pred <- sample(1:5, n, replace = TRUE)
y <- rnorm(n, 50, 10)
bias_data <- data.frame(y = y,
                        continuous = continuous_pred,
                        binary = binary_pred,
                        categorical = as.factor(categorical_pred))
tree_model <- rpart(y ~ ., data = bias_data)
tree_imp <- tree_model$variable.importance
print("Single Tree Variable Importance:")
## [1] "Single Tree Variable Importance:"
print(tree_imp)
## continuous categorical binary
## 4343.4922 668.0006 494.7215
rf_model <- randomForest(y ~ ., data = bias_data, importance = TRUE, ntree = 500)
print("Random Forest Importance:")
## [1] "Random Forest Importance:"
print(importance(rf_model))
## %IncMSE IncNodePurity
## continuous 6.465590 4943.5781
## binary -1.806497 415.6632
## categorical 1.884428 1394.3981
This simulation reveals tree bias toward variables with higher granularity. Even though none of the predictors actually relate to the outcome (y is pure noise), the continuous variable with the most unique values tends to be selected for splits more frequently than the binary or categorical variables. Single trees exhibit this bias strongly, while random forests reduce but don’t eliminate it through aggregation. This granularity bias is why conditional inference trees and other bias-corrected methods were developed.
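Two quick checks of this claim (illustrative sketches, not part of the results above): repeat the noise simulation and record which variable rpart chooses for its first split, and fit a conditional inference tree, whose test-based split selection should make no splits on pure noise.
first_splits <- replicate(20, {
  dat <- data.frame(y = rnorm(n, 50, 10),
                    continuous = runif(n, 0, 100),
                    binary = sample(c(0, 1), n, replace = TRUE),
                    categorical = as.factor(sample(1:5, n, replace = TRUE)))
  as.character(rpart(y ~ ., data = dat)$frame$var[1])   # variable used at the root node
})
table(first_splits)   # the high-granularity continuous predictor should appear most often
ctree(y ~ ., data = bias_data)   # bias-corrected tree: typically a single root node here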
In the figure comparing variable importance for the two boosted models, the model on the right (bagging fraction = 0.9, learning rate = 0.9) concentrates importance heavily on the first few predictors, creating a steep decline, while the left model (both parameters = 0.1) distributes importance more evenly across many predictors.
This occurs because the high learning rate on the right allows each tree to make large adjustments to the predictions, while the high bagging fraction means each tree sees most of the data. Early trees capture the strongest signals aggressively, leaving little residual variation for subsequent trees. The model quickly identifies and exploits the most predictive features.
Conversely, the left model uses a slow learning rate and small bagging fraction, forcing gradual refinement with diverse data subsets. This allows more predictors to contribute incrementally, preventing any single variable from dominating and resulting in a flatter importance distribution.
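A rough way to see this effect (an illustrative sketch, not the figure from the text, so the exact numbers will differ) is to refit boosting on the simulated data with the two parameter combinations and compare how concentrated the relative influence is; n.minobsinnode is lowered so the 10% bag still allows splits:
gbm_left <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 5,
                shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5, verbose = FALSE)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 5,
                 shrinkage = 0.9, bag.fraction = 0.9,
                 n.minobsinnode = 5, verbose = FALSE)
head(summary(gbm_left, plotit = FALSE))    # influence spread across more predictors
head(summary(gbm_right, plotit = FALSE))   # influence concentrated in the top few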
The left model would likely generalize better to new samples. Distributed importance across multiple predictors typically indicates a more robust model less prone to overfitting. The right model’s extreme concentration on few predictors suggests it may have memorized training-specific patterns. While it might excel on training data, it’s vulnerable to distribution shifts in new data where those top predictors behave differently.
Increasing interaction depth would steepen the importance slope in both models, but more dramatically in the right model. Greater depth enables trees to capture complex multi-way interactions between predictors. In the left model, the conservative learning rate and bagging fraction would maintain relatively distributed importance, though top predictors would separate more from the rest. In the right model, importance would become even more concentrated at the top, as aggressive learning with complex interactions amplifies the dominance of the strongest predictors.
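Pushing the interaction depth up under the aggressive settings illustrates the expected further concentration (again only a sketch, with the same caveats as above):
gbm_right_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                      n.trees = 1000, interaction.depth = 10,
                      shrinkage = 0.9, bag.fraction = 0.9,
                      n.minobsinnode = 5, verbose = FALSE)
head(summary(gbm_right_deep, plotit = FALSE))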
data(ChemicalManufacturingProcess)
chem_data <- ChemicalManufacturingProcess
set.seed(123)
predictors <- chem_data[, -1]
response <- chem_data[, 1]
preProcValues <- preProcess(predictors, method = c("knnImpute"))
predictors_imputed <- predict(preProcValues, predictors)
trainIndex <- createDataPartition(response, p = 0.75, list = FALSE)
x_train <- predictors_imputed[trainIndex, ]
y_train <- response[trainIndex]
x_test <- predictors_imputed[-trainIndex, ]
y_test <- response[-trainIndex]
preProcTransform <- preProcess(x_train, method = c("center", "scale", "nzv"))
x_train_processed <- predict(preProcTransform, x_train)
x_test_processed <- predict(preProcTransform, x_test)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
rf_fit <- train(x = x_train_processed, y = y_train,
                method = "rf", trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(5, 10, 20, 30)),
                importance = TRUE, ntree = 500)
set.seed(123)
gbm_fit <- train(x = x_train_processed, y = y_train,
                 method = "gbm", trControl = ctrl, verbose = FALSE,
                 tuneGrid = expand.grid(n.trees = c(500, 1000),
                                        interaction.depth = c(3, 5, 7),
                                        shrinkage = c(0.01, 0.1),
                                        n.minobsinnode = 10))
set.seed(123)
cubist_fit <- train(x = x_train_processed, y = y_train,
                    method = "cubist", trControl = ctrl)
results <- resamples(list(RF = rf_fit, GBM = gbm_fit, Cubist = cubist_fit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.6076523 0.6944567 0.8088437 0.8822529 1.1173640 1.167822 0
## GBM 0.5773781 0.7891426 0.8764320 0.8821428 0.9686792 1.175333 0
## Cubist 0.4500026 0.6230469 0.7658646 0.7994515 0.9070274 1.363668 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.8136616 0.9060301 1.121126 1.127693 1.336541 1.493872 0
## GBM 0.7878795 1.0096775 1.063428 1.098657 1.241325 1.386665 0
## Cubist 0.5392508 0.8355637 1.087400 1.035826 1.186082 1.582674 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.2802695 0.6534556 0.7182918 0.6858588 0.7821643 0.8432278 0
## GBM 0.3676868 0.5739102 0.6975686 0.6704409 0.7642087 0.8503834 0
## Cubist 0.2419630 0.6359267 0.7115661 0.6897072 0.8411836 0.8887299 0
rf_pred <- predict(rf_fit, x_test_processed)
rf_rmse_test <- RMSE(rf_pred, y_test)
rf_r2_test <- R2(rf_pred, y_test)
gbm_pred <- predict(gbm_fit, x_test_processed)
gbm_rmse_test <- RMSE(gbm_pred, y_test)
gbm_r2_test <- R2(gbm_pred, y_test)
cubist_pred <- predict(cubist_fit, x_test_processed)
cubist_rmse_test <- RMSE(cubist_pred, y_test)
cubist_r2_test <- R2(cubist_pred, y_test)
test_results <- data.frame(
  Model = c("Random Forest", "Gradient Boosting", "Cubist"),
  Test_RMSE = c(rf_rmse_test, gbm_rmse_test, cubist_rmse_test),
  Test_R2 = c(rf_r2_test, gbm_r2_test, cubist_r2_test)
)
print(test_results)
## Model Test_RMSE Test_R2
## 1 Random Forest 1.1612788 0.5640622
## 2 Gradient Boosting 1.2000131 0.5394872
## 3 Cubist 0.9417241 0.7131673
Both the resampling results and the test set performance point to Cubist as the optimal model: it has the lowest mean cross-validated RMSE and MAE, and on the test set it achieves the lowest RMSE (0.94) and the highest R² (0.71), compared with roughly 1.16 / 0.56 for the random forest and 1.20 / 0.54 for gradient boosting.
best_model <- rf_fit   # variable importance below is examined for the random forest fit
tree_imp <- varImp(best_model, scale = FALSE)
plot(tree_imp, top = 20, main = "Top 20 Predictors - Tree Model")
bio_predictors <- grep("^Bio", names(x_train_processed), value = TRUE)
process_predictors <- grep("^Manufacturing", names(x_train_processed), value = TRUE)
top_10_vars <- rownames(tree_imp$importance)[order(tree_imp$importance$Overall,
                                                   decreasing = TRUE)[1:10]]
print("Top 10 Important Predictors:")
## [1] "Top 10 Important Predictors:"
print(top_10_vars)
## [1] "ManufacturingProcess32" "ManufacturingProcess31" "BiologicalMaterial03"
## [4] "ManufacturingProcess17" "ManufacturingProcess30" "ManufacturingProcess36"
## [7] "ManufacturingProcess09" "ManufacturingProcess13" "BiologicalMaterial02"
## [10] "BiologicalMaterial12"
bio_count <- sum(top_10_vars %in% bio_predictors)
process_count <- sum(top_10_vars %in% process_predictors)
print(paste("Biological predictors in top 10:", bio_count))
## [1] "Biological predictors in top 10: 3"
print(paste("Process predictors in top 10:", process_count))
## [1] "Process predictors in top 10: 7"
The top predictors show which variables most strongly influence yield, and manufacturing process variables dominate the rankings (7 of the top 10, versus 3 biological predictors). This makes practical sense, since the process variables are controllable factors in production: while the biological predictors describe raw material quality, the process variables represent actionable levers that can be adjusted to optimize yield. Compared with the linear models from earlier chapters, the tree-based model may rank different variables highly because it can capture non-linear relationships and interactions.
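For the comparison with the linear models of the earlier chapters, one option is to refit a PLS model on the same processed training data and inspect its top predictors; this is a sketch that assumes the pls package is installed:
set.seed(123)
pls_fit <- train(x = x_train_processed, y = y_train,
                 method = "pls", tuneLength = 20, trControl = ctrl)
plsImp <- varImp(pls_fit, scale = FALSE)
head(rownames(plsImp$importance)[order(plsImp$importance$Overall,
                                       decreasing = TRUE)], 10)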
set.seed(123)
single_tree <- rpart(y_train ~ ., data = data.frame(x_train_processed, y_train),
                     control = rpart.control(cp = 0.01))
rpart.plot(single_tree, main = "Decision Tree for Yield Prediction")
terminal_nodes <- single_tree$frame[single_tree$frame$var == "<leaf>", ]
node_stats <- data.frame(
  node = rownames(terminal_nodes),
  n = terminal_nodes$n,
  yval = terminal_nodes$yval
)
print("Terminal Node Statistics:")
## [1] "Terminal Node Statistics:"
print(node_stats)
## node n yval
## 1 8 8 37.30500
## 2 36 17 38.13529
## 3 37 11 39.28909
## 4 19 23 39.49565
## 5 5 17 40.67588
## 6 48 14 40.04071
## 7 49 12 40.95417
## 8 25 7 42.00857
## 9 13 10 42.32700
## 10 7 13 43.04692
The single tree visualization reveals how different combinations of predictor values relate to yield. Each split represents a decision rule based on a predictor threshold, and the terminal nodes show the predicted yield for observations meeting those criteria. This provides insight into which predictor ranges are associated with high versus low yield.
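The specific split thresholds can also be read directly off the fitted tree with rpart.plot's rule printer (a quick supplement to the plot above):
rpart.rules(single_tree)   # one row of conditions per terminal node, with predicted yield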
Manufacturing process predictors typically appear higher in the tree (earlier splits), indicating they provide the strongest initial separation of high and low yield batches. Biological predictors may appear in later splits, refining predictions within groups. The tree structure suggests specific process parameter ranges that maximize yield, offering actionable guidance for process optimization. For instance, if a manufacturing variable appears early with a clear threshold, maintaining values on the favorable side of that threshold could be a practical recommendation for improving future batches.
The visualization demonstrates that ManufacturingProcess32 provides the primary split, suggesting it’s the most critical factor in distinguishing high from low yield. The combination of manufacturing and biological predictors throughout the tree indicates that both types contribute to yield prediction, though manufacturing variables dominate the upper levels. This hierarchical structure implies that while biological material quality matters, the manufacturing process settings have a stronger direct impact on final yield outcomes.