Overview

This is homework nine of the Fall 2024 edition of DATA 624. The assignment covers questions 8.1, 8.2, 8.3, and 8.7 from the exercise section of Chapter 8 of Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

First, the required libraries.

library(tidyverse)
library(mlbench)
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(partykit)
library(gbm)
library(ranger)
library(ipred)
library(dplyr)
library(Cubist)
library(rpart.plot)

set.seed(90210)

8.1 Recreate the simulated data from exercise 7.2

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Exercise 8.1 has four component questions:

  a. Fit a random forest model to all of the predictors, then estimate the variable importance scores
  b. Now add an additional predictor that is highly correlated with one of the informative predictors
  c. Use the cforest function in the party package to fit a random forest model using conditional inference trees
  d. Repeat this process with different tree models

a. Fit a random forest model to all of the predictors, then estimate the variable importance scores

“Did the random forest model significantly use the uninformative predictors (V6 – V10)?”

modi <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
imps_i <- varImp(modi, scale = FALSE)
imps_i
##         Overall
## V1   8.56587477
## V2   6.70202549
## V3   0.98029219
## V4   8.88406432
## V5   3.54192886
## V6   0.07545916
## V7   0.00164744
## V8  -0.01608193
## V9   0.01575947
## V10 -0.12176974

The importance scores for V6 through V10 are as close to inconsequential as you can get. Therefore, the model did not significantly use the uninformative predictors.

b. Now add an additional predictor that is highly correlated with one of the informative predictors

For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9428568

“Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?”

random_redux <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp_redux <- varImp(random_redux, scale = FALSE)

rfImp_redux
##                 Overall
## V1          6.262675607
## V2          6.547492044
## V3          0.925389204
## V4          8.409424800
## V5          3.243919661
## V6          0.050118823
## V7         -0.007163184
## V8          0.106020239
## V9         -0.010578764
## V10         0.031130586
## duplicate1  4.227151160

V1 decreased in importance, from 8.57 to 6.26, with the highly correlated duplicate1 picking up a score of about 4.23; the two correlated columns effectively split the credit that V1 received on its own. Adding another predictor that is also highly correlated with V1 would dilute V1’s importance further.
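The second half of the question can be checked directly by adding one more correlated copy of V1 and refitting. This is a quick sketch that was not part of the original run; it works on a copy of the data so the later chunks, which expect only duplicate1, are unaffected, and the object names are made up.

# Sketch: a second predictor highly correlated with V1, fit on a copy of the data
sim_extra <- simulated
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * .1
rf_three_copies <- randomForest(y ~ ., data = sim_extra, importance = TRUE, ntree = 1000)
# V1's score should drop again as importance spreads across the correlated columns
varImp(rf_three_copies, scale = FALSE)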

c. Use the cforest function in the party package to fit a random forest model using conditional inference trees

“The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?”

trois_victoire <- cforest(y ~ ., data = simulated)

trois_declare <- varimp(trois_victoire, conditional = TRUE)
trois_declare
##         V1         V2         V3         V4         V5         V6         V7 
##  4.0805117  4.5129016  0.1768500  7.2412870  2.8539857 -0.1688092 -0.2000003 
##         V8         V9        V10 duplicate1 
## -0.1338418 -0.3006841 -0.1801970  2.3730404

V6 through V10 remain inconsequential, with conditional importance scores at or slightly below zero, so the overall pattern mirrors the traditional random forest model: V4, V2, and V1 lead, followed by V5, duplicate1, and V3.
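For a direct side-by-side on the same fit, the traditional (unconditional) permutation importance can be pulled from the identical cforest object; a one-line sketch:

# Sketch: unconditional permutation importance for the same cforest fit,
# for comparison against the conditional scores above
varimp(trois_victoire, conditional = FALSE)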

d. Repeat this process with different tree models.

“Does the same pattern occur?”

Gradient boosting:

gradient_modeler <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 4)
gradient_i <- summary(gradient_modeler)

gradient_i
##                   var   rel.inf
## V4                 V4 27.360418
## V2                 V2 18.922106
## V1                 V1 18.904902
## V5                 V5 12.085940
## V3                 V3  8.882980
## duplicate1 duplicate1  5.199361
## V6                 V6  2.176062
## V9                 V9  1.967197
## V8                 V8  1.609436
## V7                 V7  1.608238
## V10               V10  1.283359

There’s a similar pattern of gradual decline across V6 through V10, in relatively small increments. However, the relative influence of these uninformative predictors (roughly 1.3 to 2.2) is noticeably higher than the near-zero scores they received in the previous models, so boosting spreads a small amount of influence onto the uninformative predictors even though V1 through V5 and duplicate1 still carry most of the signal.

Random forest via ranger (impurity-based importance):

ranger_danger <- ranger(y ~ ., data = simulated, importance = "impurity", num.trees = 1000)
ranger_soar <- importance(ranger_danger)
ranger_soar
##         V1         V2         V3         V4         V5         V6         V7 
##   716.4573   844.4666   266.4189  1077.1533   586.8838   165.6392   150.9944 
##         V8         V9        V10 duplicate1 
##   183.8281   164.2152   165.3381   633.6605

The impurity-based importance values for V6 through V10 are much lower than those for V1 through V5, suggesting these features contribute relatively little to the model’s predictive power. The drop-off from V6 onward is steady, which implies these variables carry redundant or weak signal relative to the informative ones. In other words, V1 through V5 (and duplicate1) capture the primary structure of the data, while V6 through V10 add only marginal value.
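Because impurity importance tends to favor predictors with many split points, a permutation-based measure from the same package may give a fairer comparison with the importance scores used for the other models; a sketch in which only the importance argument changes:

# Sketch: permutation importance in ranger, closer in spirit to the
# varImp / varimp measures used for the other models above
ranger_perm <- ranger(y ~ ., data = simulated, importance = "permutation", num.trees = 1000)
importance(ranger_perm)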

Bagged Trees:

bagger <- bagging(y ~ ., data = simulated, nbagg = 1000)
bag_i <- varImp(bagger)
bag_i
##              Overall
## duplicate1 1.5471771
## V1         1.7712460
## V10        0.6583321
## V2         1.9927395
## V3         1.4285434
## V4         2.4285897
## V5         2.5930639
## V6         0.7761392
## V7         0.7154396
## V8         0.8424390
## V9         0.7164505

These scores reinforce the patterns seen above, with V1 through V5 (and duplicate1) having the highest importance. In contrast, V6 through V10 again have notably lower scores, especially V10, V7, and V9. This suggests that V6 through V10 have diminishing relevance and capture only minor nuances that add little predictive power relative to the top features.

8.2 Use a simulation to show tree bias with different granularities

# Base predictors with different granularities
one_try <- runif(1000, 2, 1000)        # continuous, wide range
two_tries <- runif(1000, 50, 500)      # continuous, narrower range
again_iii <- rnorm(1000, 500, 10)      # tightly clustered around 500

# Derived predictors built from the base variables
repeater <- log(one_try) * two_tries
cinco_mayo <- sin(two_tries) * 100 + rnorm(1000, 0, 5)
sixth_vi <- one_try * rnorm(1000, 1, 0.5)    # noisy multiple of one_try

# Responses of increasing complexity
yessss <- one_try + two_tries
yasss <- two_tries^2 + log(one_try)
yeesshhh <- repeater + cinco_mayo + rnorm(1000, 0, 10)
finally <- one_try + sixth_vi

# Data sets that add the derived predictors one at a time
datamaster_1 <- data.frame(one_try, two_tries, again_iii, y = yessss)
datamaster_2 <- data.frame(one_try, two_tries, again_iii, repeater, y = yasss)
datamaster_3 <- data.frame(one_try, two_tries, again_iii, repeater, cinco_mayo, y = yeesshhh)
datamaster_4 <- data.frame(one_try, two_tries, again_iii, repeater, cinco_mayo, sixth_vi, y = finally)

modi <- cforest(y ~ ., data = datamaster_1, ntree = 10)
modi_dos <- cforest(y ~ ., data = datamaster_2, ntree = 50)
trois_mod <- cforest(y ~ ., data = datamaster_3, ntree = 100)
cuatro_m <- cforest(y ~ ., data = datamaster_4, ntree = 200)

imps_i <- varimp(modi, conditional = FALSE)
imps_ii <- varimp(modi_dos, conditional = FALSE)
imps_iii <- varimp(trois_mod, conditional = FALSE)
imps_iiii <- varimp(cuatro_m, conditional = FALSE)

cat("imps_i \n")
## imps_i
imps_i
##     one_try   two_tries   again_iii 
## 120992.7482  26119.5225   -119.8081
cat("\n imps_ii \n")
## 
##  imps_ii
imps_ii
##    one_try  two_tries  again_iii   repeater 
##  131397569 3962433973   52697191 1624549746
cat("\n imps_iii \n")
## 
##  imps_iii
imps_iii
##      one_try    two_tries    again_iii     repeater   cinco_mayo 
##  17602.53211 159089.99116    -32.86493 546557.19993   4197.98978
cat("\n imps_iiii \n")
## 
##  imps_iiii
imps_iiii
##      one_try    two_tries    again_iii     repeater   cinco_mayo     sixth_vi 
## 132538.83363   3817.81807    -78.98968   7292.98819    178.57121 241053.28584

The results show a clear pattern: as model complexity and feature granularity increase, the tree model tends to bias heavily toward certain variables, like two_tries and sixth_vi. This kind of dominance suggests the model could be overfitting to specific features, potentially amplifying noise or interactions that don’t generalize. Some features, like again_iii, even end up with low or negative importance, indicating they’re likely adding noise or conflicting signals in these high-granularity setups. To get a more balanced model, it might be worth exploring regularization techniques or limiting tree depth to reduce this bias and improve feature stability.
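The granularity bias itself can also be probed more directly by handing a single tree two uninformative predictors that differ only in granularity and counting which one wins the first split. This is an illustrative sketch using rpart (loaded alongside rpart.plot) rather than cforest, and every name in it is made up:

# Sketch: y is pure noise, so neither predictor is informative. Over repeated
# fits, the first split should land on the continuous (high-granularity)
# predictor far more often than on the two-level factor.
library(rpart)
first_split <- replicate(200, {
  toy <- data.frame(coarse = factor(sample(c("a", "b"), 100, replace = TRUE)),
                    fine = rnorm(100),
                    y = rnorm(100))
  fit <- rpart(y ~ ., data = toy, control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])
})
table(first_split)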

8.3 Reviewing stochastic gradient boosting

The question states: “In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:”

Here are the charts referenced in the question, which come from page 219 of the book noted above:

Fig 8.24

There are three component questions to answer:

  a. Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

  b. Which model do you think would be more predictive of other samples?

  c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

a. Fewer versus multiple predictors

“Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?”

In the right plot, the bagging fraction and learning rate are both 0.9, which causes the model to focus heavily on a few top predictors and put most of its weight on the dominant variables. In contrast, the left plot has both parameters set to 0.1, indicating a more cautious, exploratory approach in which the model gives more variables a chance to contribute. Essentially, lower values of the bagging fraction and learning rate encourage the model to explore weaker predictors, while higher values drive the model to double down on the strongest signals early in training.
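The solubility data from the figure isn’t loaded here, but the same contrast can be sketched on the simulated data from 8.1. The parameter values mirror the figure; everything else about this chunk (object names, n.trees) is an assumption:

# Sketch: boosted models at the two parameter extremes from Fig. 8.24. The
# expectation is that relative influence concentrates on a few predictors at
# 0.9 and spreads across more of them at 0.1.
gbm_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)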

b. Which model do you think would be more predictive of other samples?

The model on the left is likely to generalize better to other samples. By spreading importance across more predictors, it captures a broader range of signals, reducing the risk of overfitting to dominant variables. The model on the right, while potentially more accurate on the training data, might overfit due to its reliance on just a few predictors, making it less adaptable to new or slightly different data.

c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would likely steepen the slope of predictor importance in both models. With a higher interaction depth, the models can capture more complex relationships between predictors, which tends to amplify the importance of the strongest predictors while diminishing the role of weaker ones. In the left model, increasing interaction depth would shift more importance toward the top predictors, making the importance distribution less spread out. In the right model, a greater interaction depth would likely reinforce this concentration, further elevating the importance of dominant variables.
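This could be checked on the same simulated data by refitting the high-shrinkage sketch above with a larger interaction.depth; the depth value here is arbitrary:

# Sketch: increase interaction.depth on the high-shrinkage fit and compare the
# relative-influence profile with the depth-1 version above
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
summary(gbm_deep, plotit = FALSE)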

8.7 Training tree models on chemical manufacturing data

The question states: “Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  a. Which tree-based regression model gives the optimal resampling and test set performance?
  b. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
  c. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?”

a. Which tree-based regression model gives the optimal resampling and test set performance?

data("ChemicalManufacturingProcess")
bags_c <- preProcess(ChemicalManufacturingProcess, method=c('center','bagImpute'))
chem_data <- predict(bags_c, ChemicalManufacturingProcess)

cd_x <- subset(chem_data, select = -Yield)
cd_y <- chem_data$Yield

chem_train <- createDataPartition(cd_y, p = .80, list = FALSE)
x_train <- cd_x[chem_train,]
x_test <- cd_x[-chem_train,]
y_train <- cd_y[chem_train]
y_test <- cd_y[-chem_train]

Cubist Model Trees:

cubix <- cubist(x_train, y_train)
predict_c <- predict(cubix, x_test)

postResample(predict_c, y_test)
##      RMSE  Rsquared       MAE 
## 1.5649536 0.3326191 1.2177863

Gradient Boosting Machine:

gbm_m <- train(
  x_train, y_train,
  method = 'gbm',
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE
)

gbm_p <- predict(gbm_m, x_test)
postResample(gbm_p, y_test)
##      RMSE  Rsquared       MAE 
## 1.0815089 0.6687162 0.8492307

The gradient boosting machine performed best, with a test-set R squared of 0.67 versus 0.33 for the Cubist model.
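Only the GBM was tuned through caret above, so for a like-for-like resampling comparison the Cubist model could also be run through train with the same 5-fold cross-validation. This is a sketch that was not part of the original run, and the tuning grid is an assumption:

# Sketch: tune Cubist with the same resampling scheme as the GBM so the
# cross-validated performance of the two models is directly comparable
cubist_m <- train(x_train, y_train,
                  method = "cubist",
                  tuneGrid = expand.grid(committees = c(1, 10, 50),
                                         neighbors = c(0, 5, 9)),
                  trControl = trainControl(method = "cv", number = 5))
cubist_m$bestTune
postResample(predict(cubist_m, x_test), y_test)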

b. Which predictors are most important in the optimal tree-based regression model?

“Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?”

gbm_i <- varImp(gbm_m)
head(gbm_i, 10)
## $importance
##                            Overall
## BiologicalMaterial01     3.0071719
## BiologicalMaterial02     2.2054566
## BiologicalMaterial03    16.8808747
## BiologicalMaterial04     0.7364199
## BiologicalMaterial05     7.0513951
## BiologicalMaterial06    12.9188818
## BiologicalMaterial07     0.0000000
## BiologicalMaterial08     0.9885668
## BiologicalMaterial09    10.5930125
## BiologicalMaterial10     3.4089771
## BiologicalMaterial11     2.0724027
## BiologicalMaterial12    11.6771076
## ManufacturingProcess01   4.2037246
## ManufacturingProcess02   5.6225541
## ManufacturingProcess03   1.2341096
## ManufacturingProcess04   7.3198018
## ManufacturingProcess05   6.1168730
## ManufacturingProcess06   4.1256288
## ManufacturingProcess07   0.0000000
## ManufacturingProcess08   0.0000000
## ManufacturingProcess09  23.3783987
## ManufacturingProcess10   0.5490797
## ManufacturingProcess11   1.6912894
## ManufacturingProcess12   0.0000000
## ManufacturingProcess13  12.0314508
## ManufacturingProcess14   2.6880207
## ManufacturingProcess15  11.3642651
## ManufacturingProcess16   2.8322270
## ManufacturingProcess17  21.5166081
## ManufacturingProcess18   2.3735967
## ManufacturingProcess19   0.7958890
## ManufacturingProcess20   1.4838795
## ManufacturingProcess21   0.4539369
## ManufacturingProcess22   2.2451235
## ManufacturingProcess23   1.9539362
## ManufacturingProcess24   3.6757675
## ManufacturingProcess25   4.8112117
## ManufacturingProcess26   1.3289717
## ManufacturingProcess27   7.1855029
## ManufacturingProcess28   1.1917968
## ManufacturingProcess29   3.7043498
## ManufacturingProcess30   0.0000000
## ManufacturingProcess31  29.7851996
## ManufacturingProcess32 100.0000000
## ManufacturingProcess33   4.2649357
## ManufacturingProcess34   3.3016577
## ManufacturingProcess35   4.5918989
## ManufacturingProcess36   0.0000000
## ManufacturingProcess37   3.5100462
## ManufacturingProcess38   0.0000000
## ManufacturingProcess39   9.1510994
## ManufacturingProcess40   0.0000000
## ManufacturingProcess41   0.0000000
## ManufacturingProcess42   0.0000000
## ManufacturingProcess43   3.9327511
## ManufacturingProcess44   1.6811919
## ManufacturingProcess45   1.6766836
## 
## $model
## [1] "gbm"
## 
## $calledFrom
## [1] "varImp"

The process variables lead the list: ManufacturingProcess32 is by far the most important predictor, and process variables take six of the top ten spots, with four biological materials filling out the rest. This tilts the list somewhat more toward the process variables than the optimal linear and nonlinear models, where the split was roughly even each time.

c. Plot the optimal single tree with the distribution of yield in the terminal nodes.

“Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?”

optimal_trees <- gbm_m$bestTune$n.trees

# Structure of the final boosted tree in the ensemble
single_tree <- pretty.gbm.tree(gbm_m$finalModel, i.tree = optimal_trees)

# The boosted ensemble does not expose a single terminal-node assignment, so its
# predictions are used below as a proxy grouping of the training samples
pred_node <- predict(gbm_m$finalModel, x_train, n.trees = optimal_trees, type = "response")

train_nodes <- x_train
train_nodes$Yield <- y_train
train_nodes$TerminalNode <- pred_node

distr_yield <- train_nodes %>%
  group_by(TerminalNode) %>%
  summarize(Yield_Mean = mean(Yield), Yield_SD = sd(Yield), n = n())

ggplot(distr_yield, aes(x = TerminalNode, y = Yield_Mean)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = Yield_Mean - Yield_SD, ymax = Yield_Mean + Yield_SD), width = 0.2) +
  labs(title = "Yield distribution in terminal nodes of GBM tree",
       x = "Terminal node",
       y = "Yield") +
  theme_minimal()

Because the grouping variable here is the ensemble prediction rather than a literal terminal node, the nearly linear relationship mainly shows that predicted and observed yield track each other closely, with each group forming a stable, low-variance segment of Yield. That consistency still indicates that the predictors provide a strong, structured, and largely additive signal for predicting Yield, with minimal noise introduced by the segmentation.
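The question also asks for the optimal single tree itself. Since the GBM grouping above is only a proxy, an actual single regression tree can be fit and plotted with rpart.plot (loaded earlier but unused). This is a sketch rather than part of the original analysis, and the cp value is an assumption:

# Sketch: a single regression tree whose terminal nodes display the mean yield
# and the share of training observations falling in each node
library(rpart)
cart_dat <- data.frame(x_train, Yield = y_train)
single_cart <- rpart(Yield ~ ., data = cart_dat, cp = 0.01)
rpart.plot(single_cart)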