8.1. Recreate the simulated data from Exercise 7.2
8.1a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
library(mlbench)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.86329776
## V2 6.72851763
## V3 0.84145353
## V4 7.60284159
## V5 2.26864193
## V6 0.11268425
## V7 0.07374772
## V8 -0.07210708
## V9 -0.06913906
## V10 -0.10577619
No, the random forest did not significantly use the uninformative predictors. Variables V6–V10 all have importance values near zero (or negative), showing they didn’t contribute meaningfully to the model.
8.1b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example: simulated\(duplicate1 <- simulated\)V1 + rnorm(200) * .1 cor(simulated\(duplicate1, simulated\)V1)
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
model2 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.82920546
## V2 6.19827153
## V3 0.64690600
## V4 7.18846269
## V5 2.18817746
## V6 0.14271410
## V7 0.08390675
## V8 -0.14854947
## V9 -0.09698485
## V10 -0.04244151
## duplicate1 4.08026296
Yes, the importance score for V1 decreased after adding the correlated variable. The importance is now shared between V1 and the new predictor, and adding more correlated variables spreads the importance even further.
8.1c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
set.seed(200)
cf_model <- cforest(
y ~ .,
data = simulated,
controls = cforest_unbiased(ntree = 1000)
)
# Conditional importance
cf_imp_cond <- varimp(cf_model, conditional = TRUE)
# Traditional importance
cf_imp_trad <- varimp(cf_model, conditional = FALSE)
cf_imp_cond
## V1 V2 V3 V4 V5 V6
## 3.33759413 4.82582001 0.01146137 6.22339905 1.25721990 0.01194947
## V7 V8 V9 V10 duplicate1
## 0.01685248 -0.01632860 0.00389434 0.00277754 1.39182243
cf_imp_trad
## V1 V2 V3 V4 V5 V6
## 6.738928518 6.080946911 0.006032679 7.766461332 1.899235131 0.005251184
## V7 V8 V9 V10 duplicate1
## 0.041476323 -0.034157422 -0.035006239 -0.019734869 2.769462305
Yes, the general pattern is similar, with the main predictors still being the most important. However, the conditional importance gives a more balanced view and reduces the bias caused by correlated variables.
8.1d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(caret)
library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
ctrl <- trainControl(method = "cv", number = 10)
# Boosted Trees
boost_model <- train(
y ~ .,
data = simulated,
method = "gbm",
trControl = ctrl,
verbose = FALSE
)
varImp(boost_model)
## gbm variable importance
##
## Overall
## V4 1.000e+02
## V2 7.198e+01
## V1 7.129e+01
## V5 3.922e+01
## V3 2.658e+01
## duplicate1 1.089e+01
## V6 3.126e+00
## V7 2.547e+00
## V9 1.181e+00
## V8 1.342e-03
## V10 0.000e+00
# Cubist
cubist_model <- train(
y ~ .,
data = simulated,
method = "cubist",
trControl = ctrl
)
varImp(cubist_model)
## cubist variable importance
##
## Overall
## V1 100.00
## V2 79.17
## V4 68.06
## V3 58.33
## V5 52.08
## duplicate1 0.00
## V8 0.00
## V7 0.00
## V6 0.00
## V10 0.00
## V9 0.00
Yes, the same pattern occurs. The important predictors (V1–V5) still dominate, and the correlated variable causes importance to be shared, while the uninformative predictors remain near zero.
8.2. Use a simulation to show tree bias with different granularities.
library(rpart)
set.seed(123)
n <- 200
# Simulation
x_cont <- runif(n) # continuous predictor
x_cat <- factor(sample(c("A", "B"), n, replace = TRUE)) # categorical predictor
y <- rnorm(n) # noise response
data_sim <- data.frame(x_cont, x_cat, y)
# Fit tree
tree_model <- rpart(y ~ ., data = data_sim)
# Show tree (first split)
tree_model
## n= 200
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 200 197.379900 0.04212110
## 2) x_cont< 0.1111159 14 9.881818 -0.51053480 *
## 3) x_cont>=0.1111159 186 182.900200 0.08371886
## 6) x_cont>=0.1471042 177 170.627500 0.04529377
## 12) x_cont< 0.2742751 28 20.481330 -0.34331380
## 24) x_cont>=0.2199433 15 5.405710 -0.61773550 *
## 25) x_cont< 0.2199433 13 12.642620 -0.02667349 *
## 13) x_cont>=0.2742751 149 145.123100 0.11832070
## 26) x_cont>=0.320928 139 130.866500 0.06812094
## 52) x_cont< 0.9556685 132 125.002400 0.03054460
## 104) x_cont>=0.9077436 9 6.687146 -0.70377380 *
## 105) x_cont< 0.9077436 123 113.107200 0.08427522
## 210) x_cont< 0.3745884 11 11.700110 -0.48129310 *
## 211) x_cont>=0.3745884 112 97.542950 0.13982210 *
## 53) x_cont>=0.9556685 7 2.163031 0.77670330 *
## 27) x_cont< 0.320928 10 9.037462 0.81609730 *
## 7) x_cont< 0.1471042 9 6.871697 0.83941230 *
# Repeat simulation to demonstrate bias
results <- replicate(100, {
x_cont <- runif(n)
x_cat <- factor(sample(c("A", "B"), n, replace = TRUE))
y <- rnorm(n)
df <- data.frame(x_cont, x_cat, y)
model <- rpart(y ~ ., data = df)
as.character(model$frame$var[1]) # first split variable
})
# how often each variable is chosen
table(results)
## results
## x_cat x_cont
## 7 93
The tree almost always splits on x_cont instead of x_cat, even though both predictors are just noise. Across the simulations, x_cont was chosen 93 times vs. only 7 times for x_cat, showing that trees are biased toward variables with more possible split points.
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
8.3a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a high learning rate and high bagging fraction (both 0.9), which causes the boosting algorithm to learn very quickly. Because of this, the model places a large amount of importance on the strongest predictors early in the process. As a result, only a few predictors dominate, leading to a steep importance slope.
In contrast, the model on the left uses a lower learning rate and smaller bagging fraction (both 0.1). This slows down the learning process and allows the model to incorporate more predictors gradually. Consequently, the importance is distributed more evenly across predictors, resulting in a flatter importance slope.
8.3b) Which model do you think would be more predictive of other samples?
The model on the left is likely to be more predictive for new samples. This is because the lower learning rate reduces the risk of overfitting by making smaller, incremental updates to the model. Additionally, the smaller bagging fraction introduces more randomness, which helps improve generalization.
In contrast, the model on the right learns too aggressively and is more likely to overfit the training data, reducing its ability to perform well on new data.
8.3c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
library(caret)
library(gbm)
library(mlbench)
set.seed(123)
# Generate data
sim <- mlbench.friedman1(200, sd = 1)
df <- data.frame(sim$x)
df$y <- sim$y
# Train Control
ctrl <- trainControl(method = "cv", number = 5)
# Model with low interaction depth
set.seed(123)
gbm_low <- train(
y ~ .,
data = df,
method = "gbm",
trControl = ctrl,
tuneGrid = expand.grid(
interaction.depth = 1,
n.trees = 100,
shrinkage = 0.1,
n.minobsinnode = 10
),
verbose = FALSE
)
# Model with high interaction depth
set.seed(123)
gbm_high <- train(
y ~ .,
data = df,
method = "gbm",
trControl = ctrl,
tuneGrid = expand.grid(
interaction.depth = 5,
n.trees = 100,
shrinkage = 0.1,
n.minobsinnode = 10
),
verbose = FALSE
)
# Variable Importance
imp_low <- varImp(gbm_low)
imp_high <- varImp(gbm_high)
imp_low
## gbm variable importance
##
## Overall
## X4 100.0000
## X2 69.6510
## X1 31.5082
## X5 26.9567
## X3 18.0759
## X6 1.8695
## X8 0.4042
## X7 0.0000
## X9 0.0000
## X10 0.0000
imp_high
## gbm variable importance
##
## Overall
## X4 100.000
## X2 62.731
## X5 32.578
## X1 30.524
## X3 18.896
## X8 3.629
## X6 3.486
## X7 1.291
## X10 1.119
## X9 0.000
Increasing the interaction depth allows the model to capture interactions between predictors, which spreads importance across more variables. As a result, the importance slope becomes less steep, since more predictors contribute instead of only a few dominating.
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
8.7a) Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
library(caret)
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
data(ChemicalManufacturingProcess)
y <- ChemicalManufacturingProcess$Yield
x <- ChemicalManufacturingProcess[, -which(names(ChemicalManufacturingProcess) == "Yield")]
preProc <- preProcess(x, method = c("knnImpute", "center", "scale"))
x_imputed <- predict(preProc, x)
df <- data.frame(x_imputed, Yield = y)
set.seed(123)
trainIndex <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
cart_model <- train(Yield ~ ., data = train, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
set.seed(123)
rf_model <- train(Yield ~ ., data = train, method = "rf", trControl = ctrl)
set.seed(123)
gbm_model <- train(Yield ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)
# CV performance
cart_model$results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.06890550 1.459821 0.4227058 1.153820 0.3806593 0.1742005 0.2741902
## 2 0.08728537 1.473217 0.4166955 1.183735 0.3918565 0.1824338 0.2734764
## 3 0.42292490 1.711277 0.2983001 1.372367 0.2471270 0.1679511 0.1771943
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 1.225690 0.6500114 0.9888028 0.3026576 0.1335207 0.2044473
## 2 29 1.111872 0.6733679 0.8712198 0.3464082 0.1513388 0.2408945
## 3 57 1.129295 0.6507304 0.8643180 0.3656140 0.1529143 0.2437373
gbm_model$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 1 0.1 1 10 50 1.177715 0.6282387
## 4 0.1 2 10 50 1.197713 0.6203947
## 7 0.1 3 10 50 1.128815 0.6561905
## 2 0.1 1 10 100 1.156282 0.6465120
## 5 0.1 2 10 100 1.142907 0.6490741
## 8 0.1 3 10 100 1.111797 0.6544727
## 3 0.1 1 10 150 1.147562 0.6486036
## 6 0.1 2 10 150 1.140255 0.6521379
## 9 0.1 3 10 150 1.096569 0.6609543
## MAE RMSESD RsquaredSD MAESD
## 1 0.9462110 0.3259866 0.1384237 0.2499189
## 4 0.9262079 0.3691944 0.1726710 0.2632443
## 7 0.8802184 0.3582641 0.1485189 0.2558811
## 2 0.9167662 0.3360332 0.1407597 0.2522941
## 5 0.9001427 0.3992805 0.1837742 0.2883595
## 8 0.8580807 0.3706501 0.1701075 0.2565727
## 3 0.9162734 0.3710523 0.1631749 0.2898520
## 6 0.8969534 0.4064194 0.1846293 0.2935339
## 9 0.8501470 0.3725611 0.1714498 0.2662285
# Test performance
postResample(predict(cart_model, test), test$Yield)
## RMSE Rsquared MAE
## 1.7003910 0.2248231 1.3158401
postResample(predict(rf_model, test), test$Yield)
## RMSE Rsquared MAE
## 1.2815608 0.5215459 0.9636237
postResample(predict(gbm_model, test), test$Yield)
## RMSE Rsquared MAE
## 1.0309183 0.6996226 0.7771521
The GBM (boosted tree) model gives the best performance. It has the lowest RMSE (~1.096) and highest R² (~0.661), outperforming both the random forest and CART models.
8.7b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(gbm_model)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31 36.109
## BiologicalMaterial03 23.986
## BiologicalMaterial12 22.959
## ManufacturingProcess17 22.309
## ManufacturingProcess15 18.340
## ManufacturingProcess13 18.054
## BiologicalMaterial09 14.943
## ManufacturingProcess09 11.880
## BiologicalMaterial04 10.978
## ManufacturingProcess06 10.774
## ManufacturingProcess37 9.551
## ManufacturingProcess43 8.199
## ManufacturingProcess39 7.440
## ManufacturingProcess04 7.271
## BiologicalMaterial10 6.680
## BiologicalMaterial11 6.479
## BiologicalMaterial02 6.317
## BiologicalMaterial05 6.168
## ManufacturingProcess26 5.389
The most important predictors are mainly ManufacturingProcess variables, with a few BiologicalMaterial variables also appearing. Overall, the process variables dominate the list. Compared to the linear and nonlinear models, the top predictors are generally similar, though the tree-based model highlights nonlinear relationships and may rank some variables differently.
8.7c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpart_tree <- cart_model$finalModel
rpart_party <- as.party(rpart_tree)
plot(rpart_party)
Yes, this view provides additional insight. It shows clear threshold effects, where predictors like ManufacturingProcess32 and ManufacturingProcess17 split the data into groups with different yield levels, highlighting how specific process conditions impact yield.