8.1.
Recreate the simulated data from Exercise 7.2:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then
estimate the variable importance scores
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(122)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 %>%
mutate (var = rownames(rfImp1)) %>%
ggplot(aes(Overall, reorder(var, Overall, sum), var)) +
geom_col(fill = 'pink') +
labs(title = 'Variable Importance' , y = 'Variable')

varImp(model1)
## Overall
## V1 56.8894847
## V2 46.7900310
## V3 11.8772831
## V4 51.0737047
## V5 23.4440838
## V6 2.3399939
## V7 0.7521566
## V8 -2.8220872
## V9 -0.4757812
## V10 -1.0011703
Fit another random forest model to these data. Did the importance
score for V1 change?
set.seed(123)
model2 <- randomForest(y ~ ., data = simulated2,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
varImp(model2)
## Overall
## V1 54.694421
## V2 46.803860
## V3 11.010007
## V4 51.399770
## V5 22.598665
## V6 2.498134
## V7 1.277084
## V8 -2.832192
## V9 -2.386409
## V10 -2.883679
V1 value has reduced from 56.8894847 to 54.694421.
c) Use the cforest function in the party package to fit a random
forest model using conditional inference trees. The party package
function varimp can calculate predictor importance. The conditional
argument of that function toggles between the traditional importance
measure and the modified version described in Strobl et al. (2007). Do
these importances show the same pattern as the traditional random forest
model?
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.3.3
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
set.seed(125)
party_model <- cforest(y ~ ., data = simulated,
control = cforest_control(ntree = 50)
)
varimp(party_model, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 2.174880680 4.696646749 0.008201785 4.248047662 0.580985750 0.018011045
## V7 V8 V9 V10 duplicate1
## -0.118920221 0.017864104 0.005922653 -0.026902135 0.268407770
varimp(party_model, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 8.068908321 7.183965697 -0.239228211 7.386741675 2.207413850 0.114762931
## V7 V8 V9 V10 duplicate1
## -0.077129200 0.009350458 0.024143583 0.095343477 1.680259504
The conditional importance (conditional = TRUE) and traditional
importance (conditional = FALSE) methods show similar patterns, with
variables V1, V2, and V4 being the most important in both cases.
However, the conditional importance values are generally smaller and
more stable, as they account for correlations between variables. In
contrast, the traditional importance measure tends to give higher values
and can overestimate the importance of some variables.
(d) Repeat this process with different tree models, such as boosted
trees and Cubist. Does the same pattern occur?
gbmgrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
n.trees = seq(100, 500, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 3)
set.seed(125)
gbmtune <- train(y ~ ., data = simulated,
method = "gbm",
tuneGrid = gbmgrid,
verbose = FALSE)
gbmtune
## Stochastic Gradient Boosting
##
## 200 samples
## 11 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 4.025985 0.6095260 3.292619
## 0.01 1 150 3.723283 0.6539989 3.049312
## 0.01 1 200 3.480240 0.6875306 2.852652
## 0.01 1 250 3.275988 0.7122247 2.686021
## 0.01 1 300 3.106441 0.7321012 2.545712
## 0.01 1 350 2.965219 0.7460205 2.428331
## 0.01 1 400 2.847171 0.7588521 2.327823
## 0.01 1 450 2.743987 0.7707793 2.237843
## 0.01 1 500 2.654961 0.7803971 2.159588
## 0.01 3 100 3.506789 0.7114229 2.863898
## 0.01 3 150 3.123933 0.7453157 2.551115
## 0.01 3 200 2.849568 0.7689326 2.322464
## 0.01 3 250 2.640269 0.7897172 2.147377
## 0.01 3 300 2.479832 0.8065596 2.009731
## 0.01 3 350 2.354946 0.8188133 1.902602
## 0.01 3 400 2.257452 0.8281084 1.819125
## 0.01 3 450 2.179718 0.8354630 1.754417
## 0.01 3 500 2.120568 0.8405717 1.704437
## 0.01 5 100 3.339009 0.7355446 2.726759
## 0.01 5 150 2.948649 0.7641741 2.405704
## 0.01 5 200 2.679295 0.7870458 2.182475
## 0.01 5 250 2.491700 0.8032041 2.024575
## 0.01 5 300 2.351596 0.8161738 1.904030
## 0.01 5 350 2.249159 0.8256267 1.816579
## 0.01 5 400 2.172045 0.8327599 1.751218
## 0.01 5 450 2.115061 0.8377408 1.701069
## 0.01 5 500 2.070469 0.8418534 1.661168
## 0.01 7 100 3.258143 0.7503666 2.656502
## 0.01 7 150 2.865226 0.7748558 2.335328
## 0.01 7 200 2.606393 0.7942265 2.120253
## 0.01 7 250 2.428986 0.8090075 1.971420
## 0.01 7 300 2.304386 0.8196597 1.865051
## 0.01 7 350 2.215941 0.8277239 1.790378
## 0.01 7 400 2.151839 0.8334940 1.734929
## 0.01 7 450 2.102330 0.8381347 1.691689
## 0.01 7 500 2.067593 0.8412875 1.661831
## 0.10 1 100 2.136897 0.8310945 1.715040
## 0.10 1 150 1.984691 0.8428682 1.582699
## 0.10 1 200 1.936564 0.8462321 1.537561
## 0.10 1 250 1.934382 0.8445579 1.533205
## 0.10 1 300 1.929622 0.8449412 1.528449
## 0.10 1 350 1.936562 0.8433022 1.535880
## 0.10 1 400 1.946188 0.8410562 1.542785
## 0.10 1 450 1.950621 0.8400485 1.545190
## 0.10 1 500 1.954161 0.8396701 1.546084
## 0.10 3 100 1.966961 0.8442204 1.555410
## 0.10 3 150 1.940794 0.8465077 1.535860
## 0.10 3 200 1.938968 0.8464030 1.532269
## 0.10 3 250 1.940590 0.8456343 1.532829
## 0.10 3 300 1.940490 0.8453770 1.531802
## 0.10 3 350 1.941539 0.8451250 1.532339
## 0.10 3 400 1.940327 0.8452512 1.530979
## 0.10 3 450 1.940930 0.8450715 1.531387
## 0.10 3 500 1.941448 0.8449902 1.531750
## 0.10 5 100 2.022356 0.8350884 1.610613
## 0.10 5 150 2.012704 0.8358934 1.602173
## 0.10 5 200 2.009653 0.8359832 1.601010
## 0.10 5 250 2.009444 0.8358968 1.600470
## 0.10 5 300 2.008110 0.8360816 1.599553
## 0.10 5 350 2.008066 0.8360464 1.599460
## 0.10 5 400 2.007877 0.8360414 1.599252
## 0.10 5 450 2.007704 0.8360490 1.599133
## 0.10 5 500 2.007531 0.8360793 1.598984
## 0.10 7 100 2.036401 0.8347817 1.628206
## 0.10 7 150 2.028265 0.8351252 1.622561
## 0.10 7 200 2.026025 0.8353165 1.620206
## 0.10 7 250 2.025847 0.8352982 1.619681
## 0.10 7 300 2.026154 0.8351966 1.619884
## 0.10 7 350 2.026202 0.8351785 1.619972
## 0.10 7 400 2.026093 0.8351947 1.619910
## 0.10 7 450 2.026110 0.8351914 1.619977
## 0.10 7 500 2.026097 0.8351911 1.619974
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 300, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 3.
The best model shows n.trees = 300, interaction.depth = 1, shrinkage
=0.1 and n.minobsinnode = 3.
best_model <- gbmtune$finalModel
# Calculating variable importance
varImp(best_model)
## Overall
## V1 4346.43701
## V2 4171.54558
## V3 1711.07190
## V4 4827.29219
## V5 1914.76704
## V6 221.73680
## V7 162.21968
## V8 143.19312
## V9 64.69469
## V10 47.48801
## duplicate1 307.19967
V1, V4, and V2 are the most important predictors, as they have the
highest importance scores (4346.44, 4827.29, and 4171.55,
respectively).
Cubist
Cugrid <- expand.grid(committees = seq(1, 10, by = 1),
neighbors = seq(1, 9, by=1)
)
set.seed(126)
Cutune <- train(y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv", n = 10),
tuneGrid = Cugrid,
verbose = FALSE)
Cutune
## Cubist
##
## 200 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 1 2.711553 0.7238966 2.215276
## 1 2 2.348045 0.7876534 1.871487
## 1 3 2.234907 0.8062750 1.765299
## 1 4 2.214369 0.8107231 1.764077
## 1 5 2.163153 0.8178812 1.726898
## 1 6 2.151031 0.8183637 1.711522
## 1 7 2.154950 0.8169671 1.724436
## 1 8 2.151490 0.8177512 1.722473
## 1 9 2.149519 0.8181752 1.723557
## 2 1 2.539919 0.7559803 2.061367
## 2 2 2.284123 0.7983346 1.821549
## 2 3 2.180270 0.8169137 1.744395
## 2 4 2.144847 0.8241536 1.718802
## 2 5 2.119247 0.8271542 1.699976
## 2 6 2.109801 0.8275539 1.699869
## 2 7 2.109570 0.8263442 1.702196
## 2 8 2.108970 0.8264649 1.699717
## 2 9 2.111416 0.8260197 1.703254
## 3 1 2.484749 0.7643661 2.041314
## 3 2 2.211704 0.8096779 1.766315
## 3 3 2.102951 0.8286532 1.673652
## 3 4 2.083523 0.8332421 1.655971
## 3 5 2.054823 0.8364660 1.634239
## 3 6 2.048148 0.8361307 1.631841
## 3 7 2.054346 0.8342385 1.643854
## 3 8 2.055017 0.8344609 1.644232
## 3 9 2.055026 0.8344156 1.648627
## 4 1 2.516065 0.7595799 2.051457
## 4 2 2.239858 0.8045211 1.788455
## 4 3 2.116739 0.8242333 1.693169
## 4 4 2.084080 0.8308408 1.668192
## 4 5 2.055999 0.8340954 1.643708
## 4 6 2.048131 0.8341428 1.645326
## 4 7 2.049288 0.8332597 1.652279
## 4 8 2.046210 0.8340774 1.653007
## 4 9 2.047109 0.8340515 1.657664
## 5 1 2.495456 0.7635865 2.054273
## 5 2 2.211787 0.8099751 1.766575
## 5 3 2.091259 0.8292124 1.665582
## 5 4 2.065969 0.8345129 1.649284
## 5 5 2.031579 0.8386623 1.620889
## 5 6 2.024578 0.8386888 1.616626
## 5 7 2.024815 0.8378728 1.627215
## 5 8 2.020938 0.8388108 1.627668
## 5 9 2.020922 0.8388672 1.631744
## 6 1 2.467345 0.7681707 2.008971
## 6 2 2.186298 0.8142880 1.737625
## 6 3 2.065182 0.8328439 1.643727
## 6 4 2.032385 0.8391985 1.617222
## 6 5 2.001747 0.8425161 1.591420
## 6 6 1.994132 0.8425788 1.591613
## 6 7 1.990264 0.8423694 1.595081
## 6 8 1.985101 0.8435032 1.591010
## 6 9 1.986109 0.8434881 1.596037
## 7 1 2.433043 0.7745426 1.994020
## 7 2 2.145324 0.8212838 1.706675
## 7 3 2.026929 0.8392774 1.610952
## 7 4 2.003993 0.8440500 1.594171
## 7 5 1.975467 0.8470947 1.566880
## 7 6 1.971246 0.8466234 1.565913
## 7 7 1.970731 0.8460242 1.573014
## 7 8 1.966066 0.8470901 1.569619
## 7 9 1.966153 0.8471663 1.573976
## 8 1 2.451774 0.7707967 1.991866
## 8 2 2.172288 0.8163225 1.725709
## 8 3 2.054929 0.8339464 1.635521
## 8 4 2.025539 0.8395677 1.611045
## 8 5 1.997153 0.8427181 1.585633
## 8 6 1.992250 0.8424542 1.586280
## 8 7 1.988710 0.8422403 1.591324
## 8 8 1.983612 0.8434243 1.589274
## 8 9 1.985278 0.8433297 1.595619
## 9 1 2.423801 0.7762024 1.977252
## 9 2 2.146443 0.8209390 1.708224
## 9 3 2.032620 0.8379540 1.614166
## 9 4 2.007640 0.8429132 1.594746
## 9 5 1.978776 0.8461551 1.568630
## 9 6 1.975994 0.8456213 1.569041
## 9 7 1.974672 0.8451025 1.578013
## 9 8 1.968685 0.8463615 1.574942
## 9 9 1.969094 0.8463836 1.580368
## 10 1 2.438556 0.7735543 1.980604
## 10 2 2.165106 0.8179721 1.723205
## 10 3 2.054465 0.8343055 1.638588
## 10 4 2.025128 0.8397046 1.612519
## 10 5 1.995723 0.8431040 1.584958
## 10 6 1.993518 0.8424311 1.588204
## 10 7 1.990957 0.8420289 1.593642
## 10 8 1.985681 0.8431933 1.591474
## 10 9 1.987938 0.8430046 1.598182
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 7 and neighbors = 8.
The best model shows committees = 7 and neighbors = 8.
best_model_cu <- Cutune$finalModel
varImp(best_model_cu)
## Overall
## V1 71.5
## V3 36.5
## V2 54.0
## V4 50.0
## V5 50.0
## duplicate1 21.5
## V6 9.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
GBM gives high importance to many variables, with V1 and V4 as the
top features. All variables contribute to some degree. Cubist focuses on
fewer variables, with V1 being the most important, and some variables
(V7,V8,V9) are not used at all. In conclusion, in the Cubist model, the
importance of variables resembles the pattern seen in both the random
forest and GBM models: the first five predictors are the most
influential, while predictors V6 through V10 hold much lower or
negligible importance.
8.3
In stochastic gradient boosting the bagging fraction and learning
rate will govern the construction of the trees as they are guided by the
gradient. Although the optimal values of these parameters should be
obtained through the tuning process, it is helpful to understand how the
magnitudes of these parameters affect magnitudes of variable importance.
Figure 8.24 provides the variable importance plots for boosting using
two extreme values for the bagging fraction (0.1 and 0.9) and the
learning rate (0.1 and 0.9) for the solubility data. The left-hand plot
has both parameters set to 0.1, and the right-hand plot has both set to
0.9:
a) Why does the model on the right focus its importance on just the
first fewof predictors, whereas the model on the left spreads importance
across more predictors?
The model on the right, with higher bagging fraction and learning
rate, focuses on just a few key predictors because it aggressively fits
the data and prioritizes the most predictive features. The model on the
left, with lower values, is more conservative and spreads importance
across many predictors, exploring a wider range of features.
b) Which model would be more predictive of other samples?
The model on the left would likely be more predictive of other
samples, as it spreads importance across more features, reducing the
chance of overfitting.
c) How would increasing interaction depth affect the slope of
predictor importance?
Increasing interaction depth would likely make the slope steeper for
both models by giving more importance to predictors involved in complex
interactions. This effect would be stronger in the right model,
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical
manufacturing process. Use the same data imputation, data splitting, and
pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
chem <- ChemicalManufacturingProcess
preProcValues <- preProcess(chem, method = c("knnImpute"))
data_imp <- predict(preProcValues, chem)
set.seed(120)
index <- createDataPartition(data_imp$Yield, p=0.8, list=FALSE)
Train <- data_imp[index, ]
Test <- data_imp[-index, ]
trans_train <- preProcess(Train, method = c("center", "scale"))
trans_test <- preProcess(Test, method = c("center", "scale"))
Train_prep <- predict(trans_train, Train)
Test_prep <- predict(trans_test, Test)
performance2 <- data.frame(matrix(ncol = 3, nrow = 0,
dimnames = list(NULL, c("RMSE", "Rsquared", "MAE"))),
stringsAsFactors = FALSE)
Random Forest
set.seed(120)
rf_model <- train(Yield ~ ., data = Train_prep, method = "rf",
trControl = trainControl(method = "cv", number = 5))
# Predict on the test set
rf_pred <- predict(rf_model, Test_prep)
# Evaluate the model
rf_performance <- postResample(rf_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = rf_performance[1], Rsquared = rf_performance[2], MAE = rf_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Random Forest"
Bagged Tree
set.seed(120)
bagged_tree_model <- train(Yield ~ ., data = Train_prep, method = "treebag",
trControl = trainControl(method = "cv", number = 5))
# Predict on the test set
bagged_pred <- predict(bagged_tree_model, Test_prep)
# Evaluate the model
bagged_performance <- postResample(bagged_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = bagged_performance[1], Rsquared = bagged_performance[2], MAE = bagged_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Bagged Tree"
Boosted Tree (using GBM)
set.seed(120)
boosted_tree_model <- train(Yield ~ ., data = Train_prep, method = "gbm",
trControl = trainControl(method = "cv", number = 5),
verbose = FALSE)
# Predict on the test set
boosted_pred <- predict(boosted_tree_model, Test_prep)
# Evaluate the model
boosted_performance <- postResample(boosted_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = boosted_performance[1], Rsquared = boosted_performance[2], MAE = boosted_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Boosted Tree"
Cubist
set.seed(120)
cubist_model <- train(Yield ~ ., data = Train_prep, method = "cubist",
trControl = trainControl(method = "cv", number = 5))
# Predict on the test set
cubist_pred <- predict(cubist_model, Test_prep)
# Evaluate the model
cubist_performance <- postResample(cubist_pred, Test_prep$Yield)
performance2 <- rbind(performance2, data.frame(RMSE = cubist_performance[1], Rsquared = cubist_performance[2], MAE = cubist_performance[3]))
rownames(performance2)[nrow(performance2)] <- "Cubist"
b) Which predictors are most important in the optimal tree-based
regression model? Do either the biological or process variables dominate
the list? How do the top 10 important predictors compare to the top 10
predictors from the optimal linear and nonlinear models?
plot(varImp(rf_model), 10)

A significant number of the top predictors are related to the
ManufacturingProcess variables (ManufacturingProcess32,
ManufacturingProcess13, ManufacturingProcess17, etc.), indicating that
the process-related variables are highly influential in predicting the
outcome.BiologicalMaterial variables such as BiologicalMaterial12,
BiologicalMaterial03, and BiologicalMaterial06 also appear prominently
in the top 10 list, showing that biological factors contribute
substantially to the model’s predictions as well. Both biological and
process variables appear in the top 10, but ManufacturingProcess
variables tend to dominate the list, with ManufacturingProcess32 having
the highest importance score (100). This suggests that the
process-related variables are likely more influential than the
biological variables in determining the outcome.
Previous analysis noted a more balanced influence between
manufacturing and biological predictors.The current findings indicate a
shift toward process dominance, where ManufacturingProcess variables
clearly outperform biological predictors in terms of importance for
yield optimization. This mirrors the conclusion in the previous analysis
but with a stronger emphasis on manufacturing processes as the key
factor in improving yield.
c) Plot the optimal single tree with the distribution of yield in
the terminalnodes. Does this view of the data provide additional
knowledge about the biological or process predictors and their
relationship with yield?
single_tree <- rpart(Yield ~ ., method = "anova", data = Train_prep)
rpart.plot(single_tree)

While ManufacturingProcess32 is at the top accounting for 100% of
the yield, BiologicalMaterial11 and ManufacturingProcess31 split that
into 58% and 42% respectively, with the BiologiclMaterial having
slightly more impact to final yield than the ManufacturingProcess at
that level of the split. There are less biological predictors present in
the single tree model than were seen in the more robust tree models.
That makes sense because the simpler tree model focuses on the most
influential variables, potentially simplifying the decision-making
process by highlighting the key manufacturing parameters that are most
critical for optimizing yield.