This is homework nine of the Fall 2024 edition of DATA 624. The assignment covers questions 8.1, 8.2, 8.3, and 8.7 from the exercise section of Chapter 8 in Applied Predictive Modeling by Max Kuhn and Kjell Johnson.
First, the required libraries.
library(tidyverse)
library(mlbench)
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(partykit)
library(gbm)
library(ranger)
library(ipred)
library(dplyr)
library(Cubist)
library(rpart.plot)
set.seed(90210)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Exercise 8.1 has four component questions: a. Fit a random forest model to all of the predictors, then estimate the variable importance scores. b. Now add an additional predictor that is highly correlated with one of the informative predictors. c. Use the cforest function in the party package to fit a random forest model using conditional inference trees. d. Repeat this process with different tree models.
“Did the random forest model significantly use the uninformative predictors (V6 – V10)?”
modi <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
imps_i <- varImp(modi, scale = FALSE)
imps_i
## Overall
## V1 8.56587477
## V2 6.70202549
## V3 0.98029219
## V4 8.88406432
## V5 3.54192886
## V6 0.07545916
## V7 0.00164744
## V8 -0.01608193
## V9 0.01575947
## V10 -0.12176974
V6 – V10 are as close to inconsequential as you can get. Therefore, the model did not significantly use the uninformative predictors.
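The outputs below include a predictor named duplicate1, so the highly correlated copy of V1 from part (b) was evidently added before the cforest fit, even though that step does not appear above. A minimal sketch of it, following the code suggested in the exercise (the exact noise term is an assumption):
# Part (b): add a predictor highly correlated with the informative V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)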
“The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?”
trois_victoire <- cforest(y ~ ., data = simulated)
trois_declare <- varimp(trois_victoire, conditional = TRUE)
trois_declare
## V1 V2 V3 V4 V5 V6 V7
## 4.0805117 4.5129016 0.1768500 7.2412870 2.8539857 -0.1688092 -0.2000003
## V8 V9 V10 duplicate1
## -0.1338418 -0.3006841 -0.1801970 2.3730404
V6 – V10 remain inconsequential; if anything, their conditional importances dip slightly below zero. So these importances show the same broad pattern as the traditional random forest model.
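For a side-by-side check, the traditional (unconditional) measure can be pulled from the same cforest fit by toggling the conditional argument; this extra call is not part of the original output:
# Traditional (unconditional) importance from the same conditional inference forest
varimp(trois_victoire, conditional = FALSE)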
“Does the same pattern occur?”
Gradient boosting:
gradient_modeler <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000, interaction.depth = 4)
gradient_i <- summary(gradient_modeler)
gradient_i
## var rel.inf
## V4 V4 27.360418
## V2 V2 18.922106
## V1 V1 18.904902
## V5 V5 12.085940
## V3 V3 8.882980
## duplicate1 duplicate1 5.199361
## V6 V6 2.176062
## V9 V9 1.967197
## V8 V8 1.609436
## V7 V7 1.608238
## V10 V10 1.283359
The same general pattern occurs: importance drops from V6 through V10 in relatively small increments. However, the baseline for V6 (a relative influence of about 2.2) is noticeably higher than in the previous models, suggesting the model front-loads its gains on the informative predictors and picks up only marginal contributions from V6 – V10.
Extremely randomized trees:
ranger_danger <- ranger(y ~ ., data = simulated, importance = "impurity", num.trees = 1000)
ranger_soar <- importance(ranger_danger)
ranger_soar
## V1 V2 V3 V4 V5 V6 V7
## 716.4573 844.4666 266.4189 1077.1533 586.8838 165.6392 150.9944
## V8 V9 V10 duplicate1
## 183.8281 164.2152 165.3381 633.6605
The importance values of V6 – V10 are much lower, suggesting these features contribute relatively little to the model’s predictive power. The drop-off from V6 onward is steady, which implies these later variables carry redundant or weaker signal relative to the earlier ones. This indicates that V1 – V5 captured the primary structure of the data, while V6 – V10 added only marginal value.
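Since ranger’s impurity importance is not on the same scale as randomForest’s permutation measure, a permutation-based version (an extra fit, not shown in the original) makes the comparison more direct:
# Permutation importance from ranger, comparable in spirit to randomForest's importance = TRUE measure
ranger_perm <- ranger(y ~ ., data = simulated, importance = "permutation", num.trees = 1000)
importance(ranger_perm)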
Bagged Trees:
bagger <- bagging(y ~ ., data = simulated, nbagg = 1000)
bag_i <- varImp(bagger)
bag_i
## Overall
## duplicate1 1.5471771
## V1 1.7712460
## V10 0.6583321
## V2 1.9927395
## V3 1.4285434
## V4 2.4285897
## V5 2.5930639
## V6 0.7761392
## V7 0.7154396
## V8 0.8424390
## V9 0.7164505
These scores reinforce the patterns seen above, with V1 – V5 having the highest importance. In contrast, V6 – V10 again have notably lower scores, especially V10, V7, and V6. This suggests that V6 – V10 have diminishing relevance or capture only minor nuances that don’t add as much predictive power as the top features.
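One more tree model could round out part (d); a minimal sketch using a single CART tree (caret’s varImp has an rpart method), offered purely as an extra illustration:
library(rpart)
# A single CART tree as an additional "different tree model"
cart_one <- rpart(y ~ ., data = simulated)
varImp(cart_one)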
Exercise 8.2 asks for a simulation that shows tree bias with different granularities. Below, four data sets of increasing complexity are built and conditional inference forests are fit to each.
one_try <- runif(1000, 2, 1000)
two_tries <- runif(1000, 50, 500)
again_iii <- rnorm(1000, 500, 10)
repeater <- log(one_try) * two_tries
cinco_mayo <- sin(two_tries) * 100 + rnorm(1000, 0, 5)
sixth_vi <- one_try * rnorm(1000, 1, 0.5)
yessss <- one_try + two_tries
yasss <- two_tries^2 + log(one_try)
yeesshhh <- repeater + cinco_mayo + rnorm(1000, 0, 10)
finally <- one_try + sixth_vi
datamaster_1 <- data.frame(one_try, two_tries, again_iii, y = yessss)
datamaster_2 <- data.frame(one_try, two_tries, again_iii, repeater, y = yasss)
datamaster_3 <- data.frame(one_try, two_tries, again_iii, repeater, cinco_mayo, y = yeesshhh)
datamaster_4 <- data.frame(one_try, two_tries, again_iii, repeater, cinco_mayo, sixth_vi, y = finally)
modi <- cforest(y ~ ., data = datamaster_1, ntree = 10)
modi_dos <- cforest(y ~ ., data = datamaster_2, ntree = 50)
trois_mod <- cforest(y ~ ., data = datamaster_3, ntree = 100)
cuatro_m <- cforest(y ~ ., data = datamaster_4, ntree = 200)
imps_i <- varimp(modi, conditional = FALSE)
imps_ii <- varimp(modi_dos, conditional = FALSE)
imps_iii <- varimp(trois_mod, conditional = FALSE)
imps_iiii <- varimp(cuatro_m, conditional = FALSE)
cat("imps_i \n")
## imps_i
imps_i
## one_try two_tries again_iii
## 120992.7482 26119.5225 -119.8081
cat("\n imps_ii \n")
##
## imps_ii
imps_ii
## one_try two_tries again_iii repeater
## 131397569 3962433973 52697191 1624549746
cat("\n imps_iii \n")
##
## imps_iii
imps_iii
## one_try two_tries again_iii repeater cinco_mayo
## 17602.53211 159089.99116 -32.86493 546557.19993 4197.98978
cat("\n imps_iiii \n")
##
## imps_iiii
imps_iiii
## one_try two_tries again_iii repeater cinco_mayo sixth_vi
## 132538.83363 3817.81807 -78.98968 7292.98819 178.57121 241053.28584
The results show a clear pattern: as model complexity and feature granularity increase, the tree model tends to bias heavily toward certain variables, like two_tries and sixth_vi. This kind of dominance suggests the model could be overfitting to specific features, potentially amplifying noise or interactions that don’t generalize. Some features, like again_iii, even end up with low or negative importance, indicating they’re likely adding noise or conflicting signals in these high-granularity setups. To get a more balanced model, it might be worth exploring regularization techniques or limiting tree depth to reduce this bias and improve feature stability.
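As a sketch of the depth-limiting idea (maxdepth = 3 is an arbitrary illustrative value; the other control settings mirror partykit’s cforest defaults):
# Depth-limited conditional inference forest for the most complex data set
depth_limited <- cforest(y ~ ., data = datamaster_4, ntree = 200,
                         control = ctree_control(teststat = "quad", testtype = "Univ",
                                                 mincriterion = 0, saveinfo = FALSE, maxdepth = 3))
varimp(depth_limited, conditional = FALSE)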
Exercise 8.3 states: “In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:”
Here are the charts referenced in the question, which come from page 219 of the book noted above:
There are now three component questions to answer: a. Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors? b. Which model do you think would be more predictive of other samples? c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
“Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?”
In the right plot, setting both the bagging fraction and the learning rate to 0.9 causes the model to focus heavily on a few top predictors: each tree sees most of the data and each update takes a large step, so most of the weight lands on the dominant variables early. In contrast, the left plot has both parameters at 0.1, a more cautious, exploratory setting in which more variables get a chance to contribute across the many small updates. Essentially, lower values of the bagging fraction and learning rate encourage the model to explore weaker predictors, while higher values drive the model to double down on the strongest signals early in training.
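A rough sketch of how those two extremes could be refit on the solubility data (solTrainXtrans and solTrainY come from the AppliedPredictiveModeling package loaded above; n.trees is kept small here purely for speed, so this illustrates the settings rather than reproducing Fig. 8.24):
# Boosted models at the two extremes discussed above: bagging fraction and
# learning rate (shrinkage) both at 0.1 versus both at 0.9
data(solubility)
sol_df <- cbind(solTrainXtrans, y = solTrainY)
gbm_low <- gbm(y ~ ., data = sol_df, distribution = "gaussian", n.trees = 100,
               shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = sol_df, distribution = "gaussian", n.trees = 100,
                shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)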
The model on the left is likely to generalize better to other samples. By spreading importance across more predictors, it captures a broader range of signals, reducing the risk of overfitting to dominant variables. The model on the right, while potentially more accurate on the training data, might overfit due to its reliance on just a few predictors, making it less adaptable to new or slightly different data.
Increasing the interaction depth would likely steepen the slope of predictor importance in both models. With a higher interaction depth, the models can capture more complex relationships between predictors, which tends to amplify the importance of the strongest predictors while diminishing the role of weaker ones. In the left model, increasing interaction depth would shift more importance toward the top predictors, making the importance distribution less spread out. In the right model, a greater interaction depth would likely reinforce this concentration, further elevating the importance of dominant variables.
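To see the effect of interaction depth directly, the low-setting model could be refit with deeper trees (this reuses sol_df from the sketch above; interaction.depth = 10 is an arbitrary illustrative value):
# Deeper trees tend to concentrate relative influence on the top predictors
gbm_deep <- gbm(y ~ ., data = sol_df, distribution = "gaussian", n.trees = 100,
                shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)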
Exercise 8.7 states: “Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.”
data("ChemicalManufacturingProcess")
bags_c <- preProcess(ChemicalManufacturingProcess, method=c('center','bagImpute'))
chem_data <- predict(bags_c, ChemicalManufacturingProcess)
cd_x <- subset(chem_data, select = -Yield)
cd_y <- chem_data$Yield
chem_train <- createDataPartition(cd_y, p = .80, list = FALSE)
x_train <- cd_x[chem_train,]
x_test <- cd_x[-chem_train,]
y_train <- cd_y[chem_train]
y_test <- cd_y[-chem_train]
Cubist Model Trees:
cubix <- cubist(x_train, y_train)
predict_c<- predict(cubix, x_test)
postResample(predict_c, y_test)
## RMSE Rsquared MAE
## 1.5649536 0.3326191 1.2177863
Gradient Boosting Machine:
gbm_m <- train(
x_train, y_train,
method = 'gbm',
tuneLength = 10,
trControl = trainControl(method = "cv", number = 5),
verbose = FALSE
)
gbm_p <- predict(gbm_m, x_test)
postResample(gbm_p, y_test)
## RMSE Rsquared MAE
## 1.0815089 0.6687162 0.8492307
The Gradient Boosting Machine model performed best, with an R squared of about 0.67 versus 0.33 for the Cubist model.
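For completeness, a third tree-based model could be fit on the same split with the same resampling scheme; a minimal random forest sketch (its results are not shown here, so it is not part of the comparison above):
rf_m <- train(
x_train, y_train,
method = 'rf',
tuneLength = 10,
trControl = trainControl(method = "cv", number = 5)
)
rf_p <- predict(rf_m, x_test)
postResample(rf_p, y_test)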
“Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?”
gbm_i <-varImp(gbm_m)
head(gbm_i,10)
## $importance
## Overall
## BiologicalMaterial01 3.0071719
## BiologicalMaterial02 2.2054566
## BiologicalMaterial03 16.8808747
## BiologicalMaterial04 0.7364199
## BiologicalMaterial05 7.0513951
## BiologicalMaterial06 12.9188818
## BiologicalMaterial07 0.0000000
## BiologicalMaterial08 0.9885668
## BiologicalMaterial09 10.5930125
## BiologicalMaterial10 3.4089771
## BiologicalMaterial11 2.0724027
## BiologicalMaterial12 11.6771076
## ManufacturingProcess01 4.2037246
## ManufacturingProcess02 5.6225541
## ManufacturingProcess03 1.2341096
## ManufacturingProcess04 7.3198018
## ManufacturingProcess05 6.1168730
## ManufacturingProcess06 4.1256288
## ManufacturingProcess07 0.0000000
## ManufacturingProcess08 0.0000000
## ManufacturingProcess09 23.3783987
## ManufacturingProcess10 0.5490797
## ManufacturingProcess11 1.6912894
## ManufacturingProcess12 0.0000000
## ManufacturingProcess13 12.0314508
## ManufacturingProcess14 2.6880207
## ManufacturingProcess15 11.3642651
## ManufacturingProcess16 2.8322270
## ManufacturingProcess17 21.5166081
## ManufacturingProcess18 2.3735967
## ManufacturingProcess19 0.7958890
## ManufacturingProcess20 1.4838795
## ManufacturingProcess21 0.4539369
## ManufacturingProcess22 2.2451235
## ManufacturingProcess23 1.9539362
## ManufacturingProcess24 3.6757675
## ManufacturingProcess25 4.8112117
## ManufacturingProcess26 1.3289717
## ManufacturingProcess27 7.1855029
## ManufacturingProcess28 1.1917968
## ManufacturingProcess29 3.7043498
## ManufacturingProcess30 0.0000000
## ManufacturingProcess31 29.7851996
## ManufacturingProcess32 100.0000000
## ManufacturingProcess33 4.2649357
## ManufacturingProcess34 3.3016577
## ManufacturingProcess35 4.5918989
## ManufacturingProcess36 0.0000000
## ManufacturingProcess37 3.5100462
## ManufacturingProcess38 0.0000000
## ManufacturingProcess39 9.1510994
## ManufacturingProcess40 0.0000000
## ManufacturingProcess41 0.0000000
## ManufacturingProcess42 0.0000000
## ManufacturingProcess43 3.9327511
## ManufacturingProcess44 1.6811919
## ManufacturingProcess45 1.6766836
##
## $model
## [1] "gbm"
##
## $calledFrom
## [1] "varImp"
The process variables dominate the list: ManufacturingProcess32 is by far the most important predictor, and process variables hold the top four spots and six of the top ten, with biological materials filling the remaining four. That is a more process-heavy top ten than in the previous linear and non-linear models, where the split was pretty much even each time.
“Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?”
optimal_trees <- gbm_m$bestTune$n.trees # number of boosting iterations chosen by cross-validation
single_tree <- pretty.gbm.tree(gbm_m$finalModel, i.tree = optimal_trees) # structure of the final boosted tree
pred_node <- predict(gbm_m$finalModel, x_train, n.trees = optimal_trees, type = "response") # ensemble predictions on the training set
train_nodes <- x_train
train_nodes$Yield <- y_train
train_nodes$TerminalNode <- pred_node # predicted yield, used below as the grouping variable
distr_yield <- train_nodes %>%
group_by(TerminalNode) %>%
summarize(Yield_Mean = mean(Yield), Yield_SD = sd(Yield), n = n()) # yield distribution within each group
ggplot(distr_yield, aes(x = TerminalNode, y = Yield_Mean)) +
geom_point(size = 2) +
geom_errorbar(aes(ymin = Yield_Mean - Yield_SD, ymax = Yield_Mean + Yield_SD), width = 0.2) +
labs(title = "Yield distribution in terminal nodes of GBM tree",
x = "Terminal node",
y = "Yield") +
theme_minimal()
This plot reveals a clear, almost linear relationship between terminal nodes and Yield, which suggests that the model is capturing mostly additive effects from the predictors rather than complex interactions. The tight clustering along this line implies that each terminal node represents a stable, low-variance segment of Yield, indicating that the predictors have a consistent and predictable influence on the outcome. Overall, this view supports the idea that the predictors in the model provide a strong, structured signal for predicting Yield, with minimal noise introduced by the segmentation.
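As a complementary view of the terminal-node question, a single regression tree can be plotted directly with the rpart.plot package loaded above; this is an extra illustration rather than the GBM tree itself, and the object names are new to this sketch:
library(rpart)
# Single CART tree on the training data; each terminal node is labelled with
# its mean predicted yield and the share of training samples it contains
rpart_df <- cbind(x_train, Yield = y_train)
single_rpart <- rpart(Yield ~ ., data = rpart_df)
rpart.plot(single_rpart)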