Exercise 8.1

Recreate Simulated Data

library(mlbench)
set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)

colnames(simulated)[ncol(simulated)] <- "y"

(a) Random Forest Model + Variable Importance

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Question:
Did the random forest model significantly use the uninformative predictors (V6–V10)?

No, the random forest model did not significantly use the uninformative predictors (V6–V10). The top predictors have substantially larger positive importance scores, while V6–V10 have values close to zero or slightly negative. Importance values near zero indicate little to no predictive contribution, and negative values suggest the variables may introduce noise rather than useful signal. This shows the model correctly focused on the informative predictors and largely ignored the uninformative ones.
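
One quick way to see this separation is randomForest's built-in importance plot; a minimal sketch (it does not change the fitted model):

# Sort and plot the permutation importances; the gap between V1-V5 and V6-V10 stands out.
varImpPlot(model1, type = 1, main = "Permutation importance, model1")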

(b) Add Correlated Predictor

set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
# Fit new model
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          6.00709780
## V2          6.05937899
## V3          0.58465293
## V4          6.86363294
## V5          2.19939891
## V6          0.10898039
## V7          0.06104207
## V8         -0.04059204
## V9          0.06123662
## V10         0.09999339
## duplicate1  4.43323167

Questions:

  • Did the importance score for V1 change?

    The importance score for V1 decreased slightly after adding the highly correlated predictor, dropping from around 8.7 to about 6.01. This indicates that some of V1’s importance has been redistributed to the new correlated variable.

  • What happens when adding a highly correlated predictor?

    When a highly correlated predictor is added, the model spreads the importance between the correlated variables instead of assigning it all to one. Because both predictors contain similar information, the random forest can use either one in different trees, which reduces the individual importance of each variable. As a result, their importance scores decrease even though the overall predictive power of the model stays about the same.
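
One way to probe this further (a sketch; sim_extra, duplicate2, and model_extra are illustrative names): add a second near-copy of V1 to a working copy of the data and refit. With three correlated carriers of the same signal, V1's importance should be diluted even further.

sim_extra <- simulated
set.seed(200)
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * 0.1   # second near-copy of V1

model_extra <- randomForest(y ~ ., data = sim_extra,
                            importance = TRUE, ntree = 1000)
varImp(model_extra, scale = FALSE)   # expect V1 and the two duplicates to share the importance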

(c) Conditional Inference Forest

library(party)
cf_model <- cforest(y ~ ., data = simulated)

cf_imp <- varimp(cf_model)
cf_imp
##           V1           V2           V3           V4           V5           V6 
##  6.813253411  6.230035644  0.030496476  7.342522924  1.896415492 -0.031302902 
##           V7           V8           V9          V10   duplicate1 
## -0.014426737 -0.029347716 -0.003838825 -0.016841560  2.536782153

Question: Do these importances show the same pattern as the traditional random forest?

Yes, the importance values follow a similar pattern to the traditional random forest. The key predictors like V1, V2, and V4 still have the highest scores, while the uninformative variables from V6 to V10 remain near zero or negative. The correlated variable duplicate1 also takes on some importance, indicating that the model is dividing the predictive signal between related features. Overall, the model continues to emphasize the important variables and minimize the influence of noise, consistent with the earlier results.
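
As a follow-up sketch, party's varimp() also accepts conditional = TRUE, which uses a conditional permutation scheme designed to down-weight predictors whose apparent importance comes mainly from correlation with other variables; it is slower to compute, but it typically separates V1 from duplicate1 more cleanly.

# Conditional permutation importance (slower than the default unconditional version).
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
sort(cf_imp_cond, decreasing = TRUE)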

(d) Other Tree Models

library(gbm)
library(Cubist)
# Fit boosted trees
gbm_model <- gbm(y ~ ., data = simulated,
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3)
summary(gbm_model)

##                   var   rel.inf
## V4                 V4 27.600559
## V1                 V1 20.055311
## V2                 V2 19.404864
## V5                 V5 12.848122
## V3                 V3  7.479351
## duplicate1 duplicate1  4.963427
## V6                 V6  2.035954
## V7                 V7  1.562304
## V9                 V9  1.457551
## V8                 V8  1.394830
## V10               V10  1.197728
# Fit Cubist (drop the response by name so that duplicate1 stays among the predictors)
cubist_model <- cubist(x = simulated[, names(simulated) != "y"],
                       y = simulated$y)

cubist_model
## 
## Call:
## cubist.default(x = simulated[, names(simulated) != "y"], y = simulated$y)
## 
## Number of samples: 200 
## Number of predictors: 11 
## 
## Number of committees: 1 
## Number of rules: 1

Question: Does the same importance pattern occur?

Yes, the same overall pattern is still present. The key variables like V4, V1, and V2 continue to have the highest importance, while variables V6 through V10 remain much lower. The correlated variable also shows some importance, indicating that the model is still splitting influence between related features. Overall, the model continues to prioritize the most informative predictors and give less weight to weaker ones.
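
The Cubist printout above only reports rule counts, so one quick way to check its ranking directly is caret's varImp() method for Cubist objects (usage-based importance); a minimal sketch:

# Usage-based importance for the Cubist model, via caret's varImp method.
varImp(cubist_model)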

Exercise 8.2

Use a simulation to show tree bias with different granularities.

library(randomForest)
library(party)

set.seed(123)

# Simulate data
n <- 1000
y <- rnorm(n)  # pure noise outcome

# Predictors with different granularities
x1 <- rnorm(n)                    # continuous (many split points)
x2 <- sample(1:20, n, replace=TRUE)  # medium granularity
x3 <- sample(1:5, n, replace=TRUE)   # low granularity
x4 <- sample(0:1, n, replace=TRUE)   # binary

data <- data.frame(y, x1, x2, x3, x4)

# Standard Random Forest
rf_model <- randomForest(y ~ ., data=data, importance=TRUE)
importance(rf_model)
##      %IncMSE IncNodePurity
## x1  1.082533     93.738123
## x2 -2.321591     54.245552
## x3 -4.252093     24.725638
## x4 -2.068891      8.619194
# Conditional Inference Forest (less biased)
cf_model <- cforest(y ~ ., data=data,
                    controls = cforest_unbiased(ntree=500, mtry=2))

varimp(cf_model)
##           x1           x2           x3           x4 
## -0.011511280 -0.005763513 -0.017214835 -0.012746774

These results illustrate tree bias caused by different levels of granularity. Even though all predictors are noise, x1, which is continuous and has the most possible split points, shows the highest importance, especially in IncNodePurity. As the number of categories decreases from x2 to x4, their importance also declines. This occurs because variables with more split options are more likely to appear useful by chance. In contrast, the conditional inference results at the bottom show all variables with values near zero, confirming that none are truly important and that this method reduces the bias.
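
Another way to make the bias visible (a sketch, drawing a fresh noise-only data set each time): fit a single-split CART tree repeatedly and tally which predictor wins the first split. With everything being noise, an unbiased procedure would pick each variable about equally often; a biased one favors the high-granularity x1.

library(rpart)

set.seed(456)
first_split <- replicate(200, {
  d <- data.frame(y  = rnorm(100),
                  x1 = rnorm(100),
                  x2 = sample(1:20, 100, replace = TRUE),
                  x3 = sample(1:5, 100, replace = TRUE),
                  x4 = sample(0:1, 100, replace = TRUE))
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1))  # force one split
  as.character(fit$frame$var[1])     # predictor used at the root split
})

table(first_split)                   # x1 is expected to dominate the counts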

Exercise 8.3

Conceptual Questions

(a) Why does the right model focus importance on fewer predictors while the left spreads importance?

The model on the right focuses its importance on just a few predictors because it uses a high learning rate and a large bagging fraction. This means each tree makes strong updates and is built using most of the data, so the model quickly locks onto the strongest predictors and gives them most of the importance. In contrast, the model on the left uses a smaller learning rate and less data per tree, so it learns more gradually and spreads importance across a larger number of predictors.
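
This effect can be illustrated on the simulated data from Exercise 8.1 (a sketch only; the book's figure uses a different data set, and shrinkage = 0.1/0.9 with bag.fraction = 0.1/0.9 are assumed stand-ins for the two panels):

set.seed(100)
gbm_spread <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_greedy <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)

summary(gbm_spread, plotit = FALSE)   # expected: influence spread over more predictors
summary(gbm_greedy, plotit = FALSE)   # expected: influence concentrated on a few predictors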

(b) Which model is more predictive of other samples?

The model on the left would likely be more predictive for new samples. Its lower learning rate and smaller bagging fraction act as regularization, helping prevent overfitting. By spreading importance across more predictors, it relies less on a small set of variables that may capture noise, making it more robust when applied to new data.

(c) How does increasing interaction depth affect predictor importance?

Increasing interaction depth allows the model to capture more complex relationships between predictors. As a result, importance tends to become more concentrated on variables involved in strong interactions, making the importance curve steeper. This effect would be stronger in the right model, where importance is already concentrated, while the left model would still show a more balanced distribution due to its more gradual learning.
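
A small companion sketch (same caveat: illustrative settings on the Exercise 8.1 data, not the book's models): refit with a deeper interaction depth and compare how much of the relative influence the top predictors absorb.

set.seed(100)
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, interaction.depth = 10)
summary(gbm_deep, plotit = FALSE)   # compare the top predictors' share with the shallower fits above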

Exercise 8.7 Tree-Based Models on Chemical Process Data

(a) Best Model Performance

# Libraries
library(caret)
library(gbm)
library(randomForest)
library(rpart)
library(rpart.plot)
library(AppliedPredictiveModeling)
library(dplyr)
set.seed(123)

# Load data
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess

# Split FIRST
trainIndex <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test  <- df[-trainIndex, ]

# Impute missing values using median
preProc <- preProcess(train, method = "medianImpute")

train_imputed <- predict(preProc, train)
test_imputed  <- predict(preProc, test)

# Cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Models
tree_model <- train(Yield ~ ., data = train_imputed,
                    method = "rpart",
                    trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rf_model <- train(Yield ~ ., data = train_imputed,
                  method = "rf",
                  trControl = ctrl,
                  tuneLength = 5)

gbm_model <- train(Yield ~ ., data = train_imputed,
                   method = "gbm",
                   trControl = ctrl,
                   verbose = FALSE,
                   tuneLength = 5)

# Compare
resamples(list(Tree = tree_model,
               RF = rf_model,
               GBM = gbm_model)) %>% summary()
## 
## Call:
## summary.resamples(object = .)
## 
## Models: Tree, RF, GBM 
## Number of resamples: 5 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree 1.0742174 1.0877440 1.0970787 1.1179499 1.1426969 1.1880124    0
## RF   0.6531250 0.7868625 0.8220182 0.8927160 1.0618623 1.1397121    0
## GBM  0.7251532 0.7420665 0.8214573 0.8370968 0.9165505 0.9802565    0
## 
## RMSE 
##           Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Tree 1.3112961 1.3149071 1.315060 1.411250 1.483606 1.631383    0
## RF   0.9832113 0.9886813 1.043563 1.182022 1.317870 1.576785    0
## GBM  0.9115073 1.0433393 1.080267 1.070631 1.103301 1.214740    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree 0.3676634 0.3680714 0.4732691 0.4341356 0.4755086 0.4861655    0
## RF   0.4132577 0.5796483 0.6327096 0.6052876 0.6917760 0.7090465    0
## GBM  0.6101213 0.6430015 0.7007906 0.6810526 0.7212323 0.7301173    0
# Test performance
pred_tree <- predict(tree_model, test_imputed)
pred_rf   <- predict(rf_model, test_imputed)
pred_gbm  <- predict(gbm_model, test_imputed)

postResample(pred_tree, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 1.7003910 0.2248231 1.3158401
postResample(pred_rf, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 1.3225839 0.4840139 0.9976617
postResample(pred_gbm, test_imputed$Yield)
##      RMSE  Rsquared       MAE 
## 1.1050271 0.6464494 0.8851463

Question:
Which model gives the best resampling and test performance?

The gradient boosting model (GBM) provides the best performance among the tree-based models. Based on the resampling and test results, GBM has the lowest RMSE and MAE and the highest R² compared to the single decision tree and random forest. This indicates that GBM is better at capturing complex nonlinear relationships in the data. While random forests improve over a single tree by reducing variance, boosting further improves accuracy by sequentially correcting errors, making it the most predictive model in this case.
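
For the write-up, a resampling plot can make this comparison easier to read than the summary table; a sketch using caret's plotting method for resamples objects:

res <- resamples(list(Tree = tree_model, RF = rf_model, GBM = gbm_model))
dotplot(res, metric = "RMSE")       # lower is better
dotplot(res, metric = "Rsquared")   # higher is better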

(b) Variable Importance

# Variable importance for GBM
gbm_imp <- varImp(gbm_model)

# Print importance
print(gbm_imp)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  21.923
## BiologicalMaterial12    18.711
## ManufacturingProcess31  18.097
## ManufacturingProcess09  15.608
## ManufacturingProcess06  14.515
## BiologicalMaterial03    13.097
## ManufacturingProcess01   9.785
## BiologicalMaterial11     8.634
## ManufacturingProcess04   8.632
## ManufacturingProcess27   7.929
## BiologicalMaterial09     7.670
## ManufacturingProcess13   7.533
## ManufacturingProcess25   6.995
## BiologicalMaterial02     6.558
## ManufacturingProcess37   6.371
## BiologicalMaterial10     6.199
## ManufacturingProcess30   6.111
## ManufacturingProcess43   5.830
## BiologicalMaterial01     5.392
# Plot top 10 variables
plot(gbm_imp, top = 10)

Question:
Which predictors dominate the list, the biological materials or the manufacturing process variables?

The manufacturing process variables dominate. ManufacturingProcess32 is by far the most important predictor, process variables account for seven of the top 10 (and 13 of the 20 shown), and only a few biological materials such as BiologicalMaterial12 and BiologicalMaterial03 appear high in the ranking. This suggests that yield is driven more by the manufacturing process steps, which can be adjusted, than by the properties of the incoming biological material.

(c) Optimal Tree Visualization

rpart.plot(tree_model$finalModel,
           type = 2,
           extra = 101,
           fallen.leaves = TRUE,
           cex = 0.8)

Question:
Does this visualization provide insight into predictor relationships with yield?

Yes, this visualization provides clear insight into how predictors relate to yield. The tree shows that ManufacturingProcess32 is the most important variable since it forms the first split, with values greater than or equal to 160 leading to the highest yield (around 42). When ManufacturingProcess32 is lower, yield decreases and the model further splits on ManufacturingProcess17, indicating an interaction between these variables. Specifically, when both ManufacturingProcess32 is low and ManufacturingProcess17 is below 34, the yield is lowest (around 39), while higher values of ManufacturingProcess17 slightly improve yield (around 40). Overall, the tree highlights important threshold values and demonstrates how combinations of process variables influence yield, providing interpretable decision rules and reinforcing that process conditions play a dominant role.
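
As a complement to the diagram, the same splits can be printed as plain-text rules (a sketch using rpart.plot's rule printer), which is convenient when reporting the exact threshold values:

# Print the fitted tree's splits as decision rules.
rpart.rules(tree_model$finalModel, roundint = FALSE)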