Applied Predictive Modeling - Chapter 8 Exercises: 8.1, 8.2, 8.3, 8.7

8.1. Recreate the simulated data from Exercise 7.2:

  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
#Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

#Fit a random forest model
library(caret)          # provides varImp() and train(), used throughout
library(randomForest)
model1 <- randomForest(y ~ ., data = simulated,
  importance = TRUE,
  ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

The random forest model does not significantly use the uninformative predictors (V6–V10). Their importance scores are close to zero or negative, indicating they do not contribute meaningfully to the model.

  2. Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(200)
simulated$V11 <- simulated$V1 + rnorm(nrow(simulated), sd = 0.1)

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

varImp(model2, scale = FALSE)
##         Overall
## V1   5.82920546
## V2   6.19827153
## V3   0.64690600
## V4   7.18846269
## V5   2.18817746
## V6   0.14271410
## V7   0.08390675
## V8  -0.14854947
## V9  -0.09698485
## V10 -0.04244151
## V11  4.08026296
#add another predictor
simulated$V12 <- simulated$V1 + rnorm(nrow(simulated), sd = 0.1)

After adding a predictor highly correlated with V1, the importance score for V1 decreases because the random forest splits the importance between correlated variables. When an additional correlated predictor is added, the importance of V1 decreases further, as the model distributes importance across all correlated variables.
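The drop in V1's importance after adding the second correlated predictor can be verified by refitting the forest; a minimal sketch (output not shown):

# Refit after V12 has been added and compare the importance scores
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
varImp(model3, scale = FALSE)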

  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
set.seed(200)

# Fit cforest
cf_model <- party::cforest(
  y ~ .,
  data = simulated,
  controls = party::cforest_unbiased(ntree = 1000, mtry = 3)
)

# Traditional importance
cf_imp_traditional <- party::varimp(cf_model, conditional = FALSE)

# Conditional importance
cf_imp_conditional <- party::varimp(cf_model, conditional = TRUE)

# Compare 
importance_compare <- data.frame(
  Variable = names(cf_imp_traditional),
  Traditional = cf_imp_traditional,
  Conditional = cf_imp_conditional
)

importance_compare
##     Variable Traditional  Conditional
## V1        V1  4.68803823  1.677696227
## V2        V2  4.87537563  3.696407509
## V3        V3  0.12841378  0.061150108
## V4        V4  5.66388841  4.154158599
## V5        V5  1.64928419  1.076911294
## V6        V6  0.02153176  0.016131991
## V7        V7  0.03239220 -0.004102661
## V8        V8 -0.04204652 -0.008572804
## V9        V9  0.01701613  0.013092489
## V10      V10 -0.02084033 -0.022601123
## V11      V11  2.75274722  1.101704084
## V12      V12  2.09257232  0.588484329

The conditional inference forest shows a similar overall pattern to the traditional random forest, where the informative predictors (V1–V5) have higher importance than the uninformative predictors (V6–V10). However, the conditional importance values are generally lower, particularly for V1, because they adjust for correlations among predictors. This results in a more accurate and less biased assessment of variable importance compared to the traditional random forest.
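The correlation structure that the conditional importance adjusts for can be inspected directly; a quick check using the simulated data from above:

# Correlation among V1 and its two near-copies V11 and V12
round(cor(simulated[, c("V1", "V11", "V12")]), 2)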

  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)
library(Cubist)

set.seed(200)

# Boosted Trees 
gbm_model <- train(
  y ~ ., data = simulated,
  method = "gbm",
  verbose = FALSE,
  trControl = trainControl(method = "cv"),
  tuneLength = 5
)

gbm_imp <- varImp(gbm_model, scale = FALSE)
gbm_imp
## gbm variable importance
## 
##     Overall
## V4   4693.9
## V2   3567.6
## V1   3115.7
## V5   2131.3
## V3   1292.8
## V11  1288.8
## V12   504.8
## V6    267.8
## V7    237.1
## V9    161.4
## V10   158.9
## V8    127.3
# Cubist Model
cubist_model <- train(
  y ~ ., data = simulated,
  method = "cubist",
  trControl = trainControl(method = "cv"),
  tuneLength = 5
)

cubist_imp <- varImp(cubist_model, scale = FALSE)
cubist_imp
## cubist variable importance
## 
##     Overall
## V1     68.0
## V2     60.5
## V4     48.5
## V3     36.5
## V5     32.5
## V12    19.0
## V6      8.0
## V10     0.0
## V11     0.0
## V9      0.0
## V8      0.0
## V7      0.0

Yes, the same general pattern is observed with boosted trees and Cubist models. The informative predictors (V1–V5) have the highest importance, while the uninformative predictors (V6–V10) have much lower importance. However, boosted trees assign some importance to noise variables, while Cubist more clearly distinguishes between relevant and irrelevant predictors, often assigning zero importance to noise variables.
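The contrast between the two importance profiles is easier to see side by side; a minimal sketch using caret's plot method for varImp objects:

# Plot the top predictors for each model
plot(gbm_imp, top = 12)
plot(cubist_imp, top = 12)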

8.2. Use a simulation to show tree bias with different granularities.

library(rpart)

set.seed(200)

# Simulate data
n <- 1000

sim_data <- data.frame(
  y = rnorm(n),
  
  # Binary predictor: low granularity
  x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
  
  # 5-level categorical predictor: medium granularity
  x_cat5 = factor(sample(letters[1:5], n, replace = TRUE)),
  
  # 20-level categorical predictor: high granularity
  x_cat20 = factor(sample(letters[1:20], n, replace = TRUE)),
  
  # Continuous predictor: very high granularity
  x_cont = runif(n)
)

# Fit regression tree
tree_model <- rpart(y ~ ., data = sim_data, method = "anova")

# Variable importance
tree_model$variable.importance
## NULL
# The single tree above made no splits (y is pure noise), so its importance is NULL
# Repeat the simulation many times and average the variable importance scores
get_importance <- function() {
  sim_data <- data.frame(
    y = rnorm(n),
    x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_cat5 = factor(sample(letters[1:5], n, replace = TRUE)),
    x_cat20 = factor(sample(letters[1:20], n, replace = TRUE)),
    x_cont = runif(n)
  )
  
  tree_model <- rpart(y ~ ., data = sim_data, method = "anova")
  
  imp <- tree_model$variable.importance
  
  data.frame(
    x_binary = ifelse("x_binary" %in% names(imp), imp["x_binary"], 0),
    x_cat5 = ifelse("x_cat5" %in% names(imp), imp["x_cat5"], 0),
    x_cat20 = ifelse("x_cat20" %in% names(imp), imp["x_cat20"], 0),
    x_cont = ifelse("x_cont" %in% names(imp), imp["x_cont"], 0)
  )
}

results <- replicate(100, get_importance(), simplify = FALSE)
results <- do.call(rbind, results)

colMeans(results)
##   x_binary     x_cat5    x_cat20     x_cont 
##  0.1953775  1.0454299 13.4478592  2.5119118

Even though y is unrelated to every predictor, the average importance increases with predictor granularity: the 20-level factor receives by far the largest importance, followed by the continuous predictor, then the 5-level factor, and finally the binary predictor. This illustrates the selection bias of CART-style trees toward predictors with more distinct values, i.e., more candidate split points.
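For contrast, conditional inference trees select split variables with statistical tests and are designed to avoid this granularity bias. A hedged sketch of the same simulation using party::ctree (results not shown; the expectation is that ctree rarely splits pure noise at all, regardless of granularity):

# Proportion of noise simulations in which ctree makes at least one split
set.seed(200)
ctree_splits <- replicate(100, {
  noise_data <- data.frame(
    y = rnorm(n),
    x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_cat5   = factor(sample(letters[1:5], n, replace = TRUE)),
    x_cat20  = factor(sample(letters[1:20], n, replace = TRUE)),
    x_cont   = runif(n)
  )
  ct <- party::ctree(y ~ ., data = noise_data)
  # More than one distinct terminal node means at least one split was made
  length(unique(predict(ct, type = "node"))) > 1
})
mean(ctree_splits)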

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right concentrates importance on a few predictors because the high learning rate and high bagging fraction cause the model to make large updates and to rely heavily on the strongest predictors early in the boosting process. In contrast, the model on the left uses a lower learning rate and a smaller bagging fraction, which produces smaller incremental updates and more randomness in the data used to grow each tree. This allows the model to explore and use a wider range of predictors, spreading the importance across more variables.
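The behavior in Fig. 8.24 can be reproduced on the solubility data by fitting gbm with the two extreme settings; a rough sketch (the interaction depth and number of trees are illustrative choices, not necessarily the settings behind the figure):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

# Left-hand plot: small learning rate and small bagging fraction
set.seed(200)
gbm_small <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 7,
                 shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)

# Right-hand plot: large learning rate and large bagging fraction
set.seed(200)
gbm_large <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 7,
                 shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence (variable importance) for each fit
head(summary(gbm_small, plotit = FALSE), 10)
head(summary(gbm_large, plotit = FALSE), 10)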

  2. Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of new data because the lower learning rate and smaller bagging fraction encourage more gradual learning and reduce overfitting. This results in better generalization, whereas the model on the right is more likely to overfit by focusing too heavily on a small number of predictors.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing interaction depth allows the model to capture more complex relationships between predictors, including interactions. As a result, more variables are used in the model, and the importance is distributed across a larger set of predictors. This would make the slope of the variable importance plot less steep for both models, reducing the dominance of the top predictors and spreading importance more evenly across variables.
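The effect can also be illustrated by refitting the right-hand configuration from the sketch above with a much larger interaction depth (again an illustrative choice):

# Same extreme settings as gbm_large, but with deeper trees
set.seed(200)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 15,
                shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# The relative influence should now be spread over more predictors
head(summary(gbm_deep, plotit = FALSE), 20)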

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  1. Which tree-based regression model gives the optimal resampling and test set performance?
library(dplyr)
library(AppliedPredictiveModeling)
library(rpart)
library(ipred)
library(randomForest)
library(gbm)
library(Cubist)


data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess

set.seed(200)

# Split data
trainIndex <- createDataPartition(chem$Yield, p = 0.8, list = FALSE)
train <- chem[trainIndex, ]
test  <- chem[-trainIndex, ]

# Preprocessing (adjust if your earlier steps differed)
preProc <- preProcess(train, method = c("center", "scale", "knnImpute"))

train_processed <- predict(preProc, train)
test_processed  <- predict(preProc, test)

ctrl <- trainControl(method = "cv", number = 5)

#SINGLE TREE
tree_model <- train(Yield ~ ., data = train_processed,
                    method = "rpart",
                    trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#BAGGED TREE
bag_model <- train(Yield ~ ., data = train_processed,
                   method = "treebag",
                   trControl = ctrl)
#RANDOM FOREST
rf_model <- train(Yield ~ ., data = train_processed,
                  method = "rf",
                  tuneLength = 5,
                  trControl = ctrl)

#BOOSTED
gbm_model <- train(Yield ~ ., data = train_processed,
                   method = "gbm",
                   verbose = FALSE,
                   tuneLength = 5,
                   trControl = ctrl)
#CUBIST
cubist_model <- train(Yield ~ ., data = train_processed,
                      method = "cubist",
                      tuneLength = 5,
                      trControl = ctrl)

#compare
results <- resamples(list(
  Tree = tree_model,
  Bagged = bag_model,
  RF = rf_model,
  GBM = gbm_model,
  Cubist = cubist_model
))

summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: Tree, Bagged, RF, GBM, Cubist 
## Number of resamples: 5 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree   0.4885924 0.5644683 0.6228088 0.5973104 0.6480774 0.6626052    0
## Bagged 0.4841267 0.5552075 0.5681630 0.5593341 0.5845012 0.6046719    0
## RF     0.4007305 0.4198583 0.4871166 0.4852612 0.5387894 0.5798114    0
## GBM    0.4161347 0.4166749 0.4729525 0.4777244 0.5361006 0.5467593    0
## Cubist 0.3459945 0.3929618 0.4227893 0.4249255 0.4765017 0.4863804    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree   0.5779119 0.7256421 0.7513757 0.7296547 0.7924529 0.8008911    0
## Bagged 0.5835629 0.7170413 0.7340909 0.7176371 0.7699474 0.7835432    0
## RF     0.5005852 0.5374702 0.6423127 0.6266487 0.7008744 0.7520007    0
## GBM    0.5295759 0.6183204 0.6412231 0.6262832 0.6630572 0.6792395    0
## Cubist 0.4622149 0.5232249 0.5422054 0.5416877 0.5825185 0.5982750    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree   0.3394757 0.3481945 0.4830917 0.4769579 0.5769464 0.6370811    0
## Bagged 0.4365967 0.4518443 0.4645398 0.5159221 0.5601109 0.6665190    0
## RF     0.4580441 0.6361189 0.6510950 0.6192277 0.6623118 0.6885691    0
## GBM    0.4814893 0.5402406 0.5628029 0.6125113 0.7204292 0.7575944    0
## Cubist 0.6173370 0.6367065 0.6639168 0.6979676 0.7115683 0.8603094    0
bwplot(results)

#test set performance
pred_tree   <- predict(tree_model, test_processed)
pred_bag    <- predict(bag_model, test_processed)
pred_rf     <- predict(rf_model, test_processed)
pred_gbm    <- predict(gbm_model, test_processed)
pred_cubist <- predict(cubist_model, test_processed)

# Actual test values
actual <- test_processed$Yield

# Test set performance
test_results <- rbind(
  Tree   = postResample(pred_tree, actual),
  Bagged = postResample(pred_bag, actual),
  RF     = postResample(pred_rf, actual),
  GBM    = postResample(pred_gbm, actual),
  Cubist = postResample(pred_cubist, actual)
)

test_results
##             RMSE  Rsquared       MAE
## Tree   0.9739748 0.2550851 0.7033114
## Bagged 0.8217159 0.4656204 0.5777935
## RF     0.7510801 0.5828454 0.5313110
## GBM    0.7054155 0.6208059 0.5110799
## Cubist 0.6655470 0.6773248 0.5407647

The Cubist model gives the best overall performance. It has the lowest cross-validated RMSE and MAE and the highest R-squared, and on the test set it again has the lowest RMSE and the highest R-squared (only the test-set MAE is slightly better for GBM).

  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(cubist_model)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess32   94.68
## ManufacturingProcess13   43.62
## ManufacturingProcess39   35.11
## ManufacturingProcess04   32.98
## BiologicalMaterial12     30.85
## BiologicalMaterial06     25.53
## ManufacturingProcess26   22.34
## ManufacturingProcess33   21.28
## ManufacturingProcess10   21.28
## ManufacturingProcess29   20.21
## BiologicalMaterial02     20.21
## BiologicalMaterial09     20.21
## ManufacturingProcess30   17.02
## ManufacturingProcess09   17.02
## BiologicalMaterial08     14.89
## ManufacturingProcess01   13.83
## BiologicalMaterial10     12.77
## ManufacturingProcess14   11.70
## ManufacturingProcess21   10.64

The most important predictors in the Cubist model are primarily manufacturing process variables: ManufacturingProcess17 and ManufacturingProcess32 lead the list, and eight of the top ten predictors are process variables. Process variables therefore dominate, suggesting that the settings of the manufacturing process have a larger influence on yield than the characteristics of the biological materials. Compared with the optimal linear and nonlinear models from the earlier exercises, there is substantial overlap among the top predictors, although the ordering differs; the tree-based model also ranks some additional variables highly and captures more complex relationships, which is consistent with its better predictive performance.
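To make the comparison with the earlier models concrete, the top-10 lists can be intersected; a sketch that assumes the tuned linear and nonlinear models from Exercises 6.3 and 7.5 are still in memory (the object names pls_model and svm_model are hypothetical placeholders):

# Extract the top 10 predictors from a caret model, ranked by importance
top10 <- function(model) {
  imp <- varImp(model)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

top10(cubist_model)
# Hypothetical comparisons with the earlier models:
# intersect(top10(cubist_model), top10(pls_model))   # overlap with the linear model
# intersect(top10(cubist_model), top10(svm_model))   # overlap with the nonlinear model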

  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
rpartTree2 <- as.party(tree_model$finalModel)

# Plot
plot(rpartTree2)

The single tree shows that ManufacturingProcess32 is the most important predictor, as it defines the first split. When its (centered and scaled) value is above about 0.24, yield is highest and most consistent. For lower values, the model splits on BiologicalMaterial11, showing that biological variables influence yield within certain process conditions. This view adds information that the importance rankings alone do not provide: it reveals the specific thresholds and the order in which process and biological variables combine to determine yield.