8.1. Recreate the simulated data from Exercise 7.2:
#Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Fit a random forest model
library(randomForest)
library(caret) # provides varImp(), train(), and the other caret helpers used below
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
The random forest model does not significantly use the uninformative predictors (V6–V10). Their importance scores are close to zero or negative, indicating they do not contribute meaningfully to the model.
set.seed(200)
simulated$V11 <- simulated$V1 + rnorm(nrow(simulated), sd = 0.1)
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
varImp(model2, scale = FALSE)
## Overall
## V1 5.82920546
## V2 6.19827153
## V3 0.64690600
## V4 7.18846269
## V5 2.18817746
## V6 0.14271410
## V7 0.08390675
## V8 -0.14854947
## V9 -0.09698485
## V10 -0.04244151
## V11 4.08026296
# Add a second predictor that is also highly correlated with V1
simulated$V12 <- simulated$V1 + rnorm(nrow(simulated), sd = 0.1)
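Refitting the forest with both correlated predictors (V11 and V12) in the data shows the further dilution discussed below; a minimal sketch (importance output not reproduced here):
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
varImp(model3, scale = FALSE)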
After adding V11, a predictor highly correlated with V1, the importance score for V1 decreases because the random forest splits the credit between the correlated variables. When a second correlated predictor (V12) is added and the forest is refit, the importance of V1 decreases further, as the model distributes importance across all of the correlated variables.
library(party)
set.seed(200)
# Fit cforest
cf_model <- party::cforest(
y ~ .,
data = simulated,
controls = party::cforest_unbiased(ntree = 1000, mtry = 3)
)
# Traditional importance
cf_imp_traditional <- party::varimp(cf_model, conditional = FALSE)
# Conditional importance
cf_imp_conditional <- party::varimp(cf_model, conditional = TRUE)
# Compare
importance_compare <- data.frame(
Variable = names(cf_imp_traditional),
Traditional = cf_imp_traditional,
Conditional = cf_imp_conditional
)
importance_compare
## Variable Traditional Conditional
## V1 V1 4.68803823 1.677696227
## V2 V2 4.87537563 3.696407509
## V3 V3 0.12841378 0.061150108
## V4 V4 5.66388841 4.154158599
## V5 V5 1.64928419 1.076911294
## V6 V6 0.02153176 0.016131991
## V7 V7 0.03239220 -0.004102661
## V8 V8 -0.04204652 -0.008572804
## V9 V9 0.01701613 0.013092489
## V10 V10 -0.02084033 -0.022601123
## V11 V11 2.75274722 1.101704084
## V12 V12 2.09257232 0.588484329
The conditional inference forest shows a similar overall pattern to the traditional random forest, where the informative predictors (V1–V5) have higher importance than the uninformative predictors (V6–V10). However, the conditional importance values are generally lower, particularly for V1, because they adjust for correlations among predictors. This results in a more accurate and less biased assessment of variable importance compared to the traditional random forest.
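To see how much the conditional adjustment shrinks the scores of the correlated predictors (V1, V11, V12), the two importance measures can be plotted side by side; a minimal sketch using base graphics and the importance_compare data frame built above:
# Grouped bars: traditional vs. conditional importance for each predictor
barplot(t(as.matrix(importance_compare[, c("Traditional", "Conditional")])),
        beside = TRUE, las = 2, legend.text = TRUE,
        main = "cforest importance: traditional vs. conditional")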
library(gbm)
library(Cubist)
set.seed(200)
# Boosted Trees
gbm_model <- train(
y ~ ., data = simulated,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv"),
tuneLength = 5
)
gbm_imp <- varImp(gbm_model, scale = FALSE)
gbm_imp
## gbm variable importance
##
## Overall
## V4 4693.9
## V2 3567.6
## V1 3115.7
## V5 2131.3
## V3 1292.8
## V11 1288.8
## V12 504.8
## V6 267.8
## V7 237.1
## V9 161.4
## V10 158.9
## V8 127.3
# Cubist Model
cubist_model <- train(
y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv"),
tuneLength = 5
)
cubist_imp <- varImp(cubist_model, scale = FALSE)
cubist_imp
## cubist variable importance
##
## Overall
## V1 68.0
## V2 60.5
## V4 48.5
## V3 36.5
## V5 32.5
## V12 19.0
## V6 8.0
## V10 0.0
## V11 0.0
## V9 0.0
## V8 0.0
## V7 0.0
Yes, the same general pattern is observed with boosted trees and Cubist models. The informative predictors (V1–V5) have the highest importance, while the uninformative predictors (V6–V10) have much lower importance. However, boosted trees assign some importance to noise variables, while Cubist more clearly distinguishes between relevant and irrelevant predictors, often assigning zero importance to noise variables.
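For a quick visual comparison of the two importance profiles, caret's plot method for varImp objects can be used; a small sketch:
# Dot plots of the top predictors for the boosted tree and Cubist models
plot(gbm_imp, top = 12)
plot(cubist_imp, top = 12)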
8.2. Use a simulation to show tree bias with different granularities.
library(rpart)
set.seed(200)
# Simulate data
n <- 1000
sim_data <- data.frame(
y = rnorm(n),
# Binary predictor: low granularity
x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
# 5-level categorical predictor: medium granularity
x_cat5 = factor(sample(letters[1:5], n, replace = TRUE)),
# 20-level categorical predictor: high granularity
x_cat20 = factor(sample(letters[1:20], n, replace = TRUE)),
# Continuous predictor: very high granularity
x_cont = runif(n)
)
# Fit regression tree
tree_model <- rpart(y ~ ., data = sim_data, method = "anova")
# Variable importance
tree_model$variable.importance
## NULL
# A single fit to pure-noise data may not split at all (importance is NULL), so repeat the simulation and average the importance scores over many fits
get_importance <- function() {
sim_data <- data.frame(
y = rnorm(n),
x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
x_cat5 = factor(sample(letters[1:5], n, replace = TRUE)),
x_cat20 = factor(sample(letters[1:20], n, replace = TRUE)),
x_cont = runif(n)
)
tree_model <- rpart(y ~ ., data = sim_data, method = "anova")
imp <- tree_model$variable.importance
data.frame(
x_binary = ifelse("x_binary" %in% names(imp), imp["x_binary"], 0),
x_cat5 = ifelse("x_cat5" %in% names(imp), imp["x_cat5"], 0),
x_cat20 = ifelse("x_cat20" %in% names(imp), imp["x_cat20"], 0),
x_cont = ifelse("x_cont" %in% names(imp), imp["x_cont"], 0)
)
}
results <- replicate(100, get_importance(), simplify = FALSE)
results <- do.call(rbind, results)
colMeans(results)
## x_binary x_cat5 x_cat20 x_cont
## 0.1953775 1.0454299 13.4478592 2.5119118
Even though the response is pure noise and none of the predictors carries any signal, the average importance increases with the granularity of the predictor: the 20-level factor and the continuous predictor receive far larger importance than the binary predictor. This illustrates the selection bias of CART-style trees toward predictors with many possible split points.
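Another way to see the bias, assuming the same simulation setup, is to record which predictor rpart chooses for the root split across many replications; since every predictor is pure noise, an unbiased procedure would pick them roughly equally often, whereas the higher-granularity predictors are expected to be chosen far more frequently. A minimal sketch:
# Count which predictor is chosen for the root split across 100 noise-only fits
first_split <- replicate(100, {
  sim <- data.frame(
    y = rnorm(n),
    x_binary = factor(sample(c("A", "B"), n, replace = TRUE)),
    x_cat5 = factor(sample(letters[1:5], n, replace = TRUE)),
    x_cat20 = factor(sample(letters[1:20], n, replace = TRUE)),
    x_cont = runif(n)
  )
  fit <- rpart(y ~ ., data = sim, method = "anova",
               control = rpart.control(cp = 0, maxdepth = 1))
  as.character(fit$frame$var[1])
})
table(first_split)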
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right concentrates importance on a few predictors because the high learning rate and high bagging fraction cause the model to make large updates and rely heavily on the strongest predictors early in the boosting process. In contrast, the model on the left uses a lower learning rate and smaller bagging fraction, which results in smaller incremental updates and more randomness. This allows the model to explore and use a wider range of predictors, spreading the importance across more variables.
The model on the left would likely be more predictive of new data because the lower learning rate and smaller bagging fraction encourage more gradual learning and reduce overfitting. This results in better generalization, whereas the model on the right is more likely to overfit by focusing too heavily on a small number of predictors.
Increasing interaction depth allows the model to capture more complex relationships between predictors, including interactions. As a result, more variables are used in the model, and the importance is distributed across a larger set of predictors. This would make the slope of the variable importance plot less steep for both models, reducing the dominance of the top predictors and spreading importance more evenly across variables.
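These effects can be checked directly on the solubility data; the sketch below is a rough illustration (not the tuning behind Figure 8.24), assuming the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(200)
# Left-hand settings: small learning rate and bagging fraction
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                interaction.depth = 1)
# Right-hand settings: large learning rate and bagging fraction
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                 interaction.depth = 1)
# Relative influence: the right-hand fit should concentrate on fewer predictors
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)
# Increasing interaction.depth should spread influence over more predictors
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)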
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(dplyr)
library(AppliedPredictiveModeling)
library(rpart)
library(ipred)
library(randomForest)
library(gbm)
library(Cubist)
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
set.seed(200)
# Split data
trainIndex <- createDataPartition(chem$Yield, p = 0.8, list = FALSE)
train <- chem[trainIndex, ]
test <- chem[-trainIndex, ]
# Pre-processing: centering, scaling, and kNN imputation, as in the earlier exercises
preProc <- preProcess(train, method = c("center", "scale", "knnImpute"))
train_processed <- predict(preProc, train)
test_processed <- predict(preProc, test)
ctrl <- trainControl(method = "cv", number = 5)
#SINGLE TREE
tree_model <- train(Yield ~ ., data = train_processed,
method = "rpart",
trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#BAGGED TREE
bag_model <- train(Yield ~ ., data = train_processed,
method = "treebag",
trControl = ctrl)
#RANDOM FOREST
rf_model <- train(Yield ~ ., data = train_processed,
method = "rf",
tuneLength = 5,
trControl = ctrl)
#BOOSTED
gbm_model <- train(Yield ~ ., data = train_processed,
method = "gbm",
verbose = FALSE,
tuneLength = 5,
trControl = ctrl)
#CUBIST
cubist_model <- train(Yield ~ ., data = train_processed,
method = "cubist",
tuneLength = 5,
trControl = ctrl)
#compare
results <- resamples(list(
Tree = tree_model,
Bagged = bag_model,
RF = rf_model,
GBM = gbm_model,
Cubist = cubist_model
))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: Tree, Bagged, RF, GBM, Cubist
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.4885924 0.5644683 0.6228088 0.5973104 0.6480774 0.6626052 0
## Bagged 0.4841267 0.5552075 0.5681630 0.5593341 0.5845012 0.6046719 0
## RF 0.4007305 0.4198583 0.4871166 0.4852612 0.5387894 0.5798114 0
## GBM 0.4161347 0.4166749 0.4729525 0.4777244 0.5361006 0.5467593 0
## Cubist 0.3459945 0.3929618 0.4227893 0.4249255 0.4765017 0.4863804 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.5779119 0.7256421 0.7513757 0.7296547 0.7924529 0.8008911 0
## Bagged 0.5835629 0.7170413 0.7340909 0.7176371 0.7699474 0.7835432 0
## RF 0.5005852 0.5374702 0.6423127 0.6266487 0.7008744 0.7520007 0
## GBM 0.5295759 0.6183204 0.6412231 0.6262832 0.6630572 0.6792395 0
## Cubist 0.4622149 0.5232249 0.5422054 0.5416877 0.5825185 0.5982750 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3394757 0.3481945 0.4830917 0.4769579 0.5769464 0.6370811 0
## Bagged 0.4365967 0.4518443 0.4645398 0.5159221 0.5601109 0.6665190 0
## RF 0.4580441 0.6361189 0.6510950 0.6192277 0.6623118 0.6885691 0
## GBM 0.4814893 0.5402406 0.5628029 0.6125113 0.7204292 0.7575944 0
## Cubist 0.6173370 0.6367065 0.6639168 0.6979676 0.7115683 0.8603094 0
bwplot(results)
#test set performance
pred_tree <- predict(tree_model, test_processed)
pred_bag <- predict(bag_model, test_processed)
pred_rf <- predict(rf_model, test_processed)
pred_gbm <- predict(gbm_model, test_processed)
pred_cubist <- predict(cubist_model, test_processed)
# Actual test values
actual <- test_processed$Yield
# Test set performance
test_results <- rbind(
Tree = postResample(pred_tree, actual),
Bagged = postResample(pred_bag, actual),
RF = postResample(pred_rf, actual),
GBM = postResample(pred_gbm, actual),
Cubist = postResample(pred_cubist, actual)
)
test_results
## RMSE Rsquared MAE
## Tree 0.9739748 0.2550851 0.7033114
## Bagged 0.8217159 0.4656204 0.5777935
## RF 0.7510801 0.5828454 0.5313110
## GBM 0.7054155 0.6208059 0.5110799
## Cubist 0.6655470 0.6773248 0.5407647
The Cubist model provides the best performance based on both the resampling and the test set results. It has the lowest cross-validated RMSE and MAE and the highest R-squared, and on the test set it again has the lowest RMSE and the highest R-squared (only its test-set MAE is slightly higher than GBM's).
varImp(cubist_model)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess32 94.68
## ManufacturingProcess13 43.62
## ManufacturingProcess39 35.11
## ManufacturingProcess04 32.98
## BiologicalMaterial12 30.85
## BiologicalMaterial06 25.53
## ManufacturingProcess26 22.34
## ManufacturingProcess33 21.28
## ManufacturingProcess10 21.28
## ManufacturingProcess29 20.21
## BiologicalMaterial02 20.21
## BiologicalMaterial09 20.21
## ManufacturingProcess30 17.02
## ManufacturingProcess09 17.02
## BiologicalMaterial08 14.89
## ManufacturingProcess01 13.83
## BiologicalMaterial10 12.77
## ManufacturingProcess14 11.70
## ManufacturingProcess21 10.64
The most important predictors in the Cubist model are primarily process variables: ManufacturingProcess17 and ManufacturingProcess32 dominate the list, and eight of the top ten predictors are manufacturing process variables, with a few biological materials (e.g., BiologicalMaterial12 and BiologicalMaterial06) also contributing. This suggests that the manufacturing process settings play a larger role in determining yield than the biological materials. Compared with the optimal linear and nonlinear models from the earlier exercises, there is substantial overlap in the top predictors, particularly the dominant process variables. However, the tree-based model also surfaces additional variables and captures more complex relationships, which is consistent with its improved predictive performance.
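A dot plot of the Cubist importance scores makes the dominance of the top process variables easier to see; a quick sketch using caret's plot method:
plot(varImp(cubist_model), top = 20)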
library(partykit)
rpartTree2 <- as.party(tree_model$finalModel)
# Plot
plot(rpartTree2)
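The exact split variables and cutpoints referenced below can also be read from the text representation of the fitted rpart model; a quick check:
# Text view of the final single tree (root split variable and cutpoint appear first)
print(tree_model$finalModel)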
The single tree shows that ManufacturingProcess32 is the most important predictor, as it appears in the first split. When its value is above about 0.24, yield is highest and more consistent. For lower values, the model splits on BiologicalMaterial11, showing that biological variables influence yield within certain process conditions.