Recreate the simulated data from Exercise 7.2:
# Code provided by the textbook
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
# Code provided by the textbook
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
print(rfImp1)
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
The model did not significantly use these predictors. The minimum value for the informative predictors (V1-V5) is 0.72, but none of the uninformative predictors have a value greater than 0.124.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
# Code provided by the textbook
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
The importance score for V1 decreased from approximately 8.627 in model 1 to approximately 6.774 in model 2.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9337221
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
## Overall
## V1 5.908641677
## V2 6.586726939
## V3 0.559845667
## V4 7.373782389
## V5 1.987341138
## V6 0.162417814
## V7 0.038423138
## V8 0.007497423
## V9 -0.001806331
## V10 0.004023755
## duplicate1 2.351543736
## duplicate2 2.305339113
The inclusion of another predictor that is highly correlated with V1 further decreased the importance score for V1 from approximately 6.774 to approximately 5.909.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Create new models using the party package
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.3.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.3.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.3.3
model_1c <- cforest(y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10, data = simulated)
model_2c <- cforest(y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + duplicate1, data = simulated)
model_3c <- cforest(y ~ ., data = simulated)
# Test the original model using the two importance measures
party::varimp(model_1c, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 8.823565206 6.746026308 0.018035965 8.299190525 1.831594395 -0.028250263
## V7 V8 V9 V10
## 0.012750215 -0.039005012 -0.009094921 -0.033128463
party::varimp(model_1c, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 5.518010680 5.401157309 0.044177984 6.514907991 1.137068919 0.021115517
## V7 V8 V9 V10
## -0.027382651 -0.015742699 -0.004532868 -0.023613044
# Test the model with 1 highly correlated predictor using the two importance measures
party::varimp(model_2c, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 6.782152367 6.374613432 -0.013099108 7.754921977 1.863663487 0.013229404
## V7 V8 V9 V10 duplicate1
## 0.002077785 -0.023110264 -0.033893360 -0.033671540 2.502969467
party::varimp(model_2c, conditional = TRUE)
## V1 V2 V3 V4 V5
## 3.3503506078 4.9501885378 0.0286519677 6.2690998330 1.1679495248
## V6 V7 V8 V9 V10
## -0.0003033275 0.0024905037 0.0068110285 -0.0140140085 -0.0055103784
## duplicate1
## 1.0811244519
# Test the model with 2 highly correlated predictors using the two importance measures
party::varimp(model_3c, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 6.475712820 6.285831841 -0.006362714 7.409278398 1.856033791 -0.027114126
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.081256113 -0.016766298 -0.043501215 -0.037083393 1.771759077 1.123783521
party::varimp(model_3c, conditional = TRUE)
## V1 V2 V3 V4 V5
## 2.8932752990 4.8521261504 0.0107989655 5.9612930097 1.2492712797
## V6 V7 V8 V9 V10
## 0.0066221631 -0.0009753738 -0.0085238454 -0.0112670161 -0.0212391636
## duplicate1 duplicate2
## 0.6708616356 0.3836462015
The importances generated by the party package show a similar pattern to those generated by the traditional random forest model, but with some key differences. The importance of V3 is substantially reduced by the party package models, becoming more similar to V6-V10 in importance than it is to V1, V2, V4, or V5. However, the addition of highly correlated predictors still has a substantial impact on the importance of V1.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
boosted_tree_1 <- train(x = simulated[,-c(11:13)],
y = simulated$y,
method = "bstTree")
bstree1_vi <- varImp(boosted_tree_1, scale = FALSE)
print(bstree1_vi)
## loess r-squared variable importance
##
## Overall
## V4 0.289801
## V1 0.277043
## V2 0.260337
## V5 0.134314
## V3 0.090934
## V9 0.023943
## V10 0.021639
## V8 0.015209
## V6 0.008499
## V7 0.005977
boosted_tree_2 <- train(x = simulated[,-c(11,13)],
y = simulated$y,
method = "bstTree")
bstree2_vi <- varImp(boosted_tree_2, scale = FALSE)
print(bstree2_vi)
## loess r-squared variable importance
##
## Overall
## V4 0.289801
## V1 0.277043
## V2 0.260337
## duplicate1 0.224231
## V5 0.134314
## V3 0.090934
## V9 0.023943
## V10 0.021639
## V8 0.015209
## V6 0.008499
## V7 0.005977
boosted_tree_3 <- train(x = simulated[,-11],
y = simulated$y,
method = "bstTree")
bstree3_vi <- varImp(boosted_tree_3, scale = FALSE)
print(bstree3_vi)
## loess r-squared variable importance
##
## Overall
## V4 0.289801
## V1 0.277043
## V2 0.260337
## duplicate1 0.224231
## duplicate2 0.218969
## V5 0.134314
## V3 0.090934
## V9 0.023943
## V10 0.021639
## V8 0.015209
## V6 0.008499
## V7 0.005977
The boosted trees model shows a similar result in that the predictor variables V1 - V5 have a higher importance than V6 - V10, but unlike the random forest model, the inclusion of correlated predictors does not reduce the importance of V1.
cubist_1 <- train(x = simulated[,-c(11:13)],
y = simulated$y,
method = "cubist",
tuneLength = 10)
cubist1_vi <- varImp(cubist_1, scale = FALSE)
print(cubist1_vi)
## cubist variable importance
##
## Overall
## V1 72.0
## V2 54.5
## V4 49.0
## V3 42.0
## V5 40.0
## V6 11.0
## V9 0.0
## V10 0.0
## V8 0.0
## V7 0.0
cubist_2 <- train(x = simulated[,-c(11,13)],
y = simulated$y,
method = "cubist",
tuneLength = 10)
cubist2_vi <- varImp(cubist_2, scale = FALSE)
print(cubist2_vi)
## cubist variable importance
##
## Overall
## V1 72.0
## V2 60.5
## V4 47.0
## V3 39.0
## V5 37.0
## V6 10.5
## V8 3.0
## duplicate1 2.5
## V10 0.0
## V9 0.0
## V7 0.0
cubist_3 <- train(x = simulated[,-11],
y = simulated$y,
method = "cubist",
tuneLength = 10)
cubist3_vi <- varImp(cubist_3, scale = FALSE)
print(cubist3_vi)
## cubist variable importance
##
## Overall
## V1 64.5
## V2 60.5
## V3 50.5
## V4 48.0
## V5 32.5
## V6 15.0
## duplicate1 5.0
## duplicate2 2.5
## V8 2.0
## V10 0.5
## V7 0.0
## V9 0.0
In the cubist models, as in the random forest and boosted trees models, V1-V5 are more important than V6-V10. However, the original cubist model uses V6 more than the previous models did. Additionally, the inclusion of only one predictor correlated to V1 did not impact the importance of V1, but the inclusion of a second predictor correlated with V1 did reduce the importance of that variable.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right has both parameters set to 0.9. This means that almost all of the training set is used to build each tree (because of the large bagging fraction) and each tree is of high-importance (because of the large learning rate). This results in the initial predictors having a very large weight, and then overcompensating on the remaining predictors. On the other hand, the model on the left uses only a small fraction of the training set to build each tree and each tree is of low importance, resulting in a more “cautious” model that takes a while to arrive at optimal parameters, causing the variable importance to be more evenly distributed.
Which model do you think would be more predictive of other samples?
I’d be inclined toward the model on the left. A large learning rate has a relatively high risk of overfitting, and that’s what it looks like this variable importance distribution might result in. The model on the left should be more stable.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# Code provided by the textbook
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Impute missing values using knn
library(impute)
chemicals_impute <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 31415)
chemicals_impute <- as.data.frame(chemicals_impute$data)
# Create a test/train split
training_rows_chemicals <- createDataPartition(chemicals_impute$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals_impute[training_rows_chemicals,-1]
train_yield <- chemicals_impute[training_rows_chemicals, 1]
test_predictors_chemicals <- chemicals_impute[-training_rows_chemicals,-1]
test_yield <- chemicals_impute[-training_rows_chemicals, 1]
chemicals_random_forest <- train(x = train_predictors_chemicals,
y = train_yield,
method = "rf",
preProc = c("center", "scale"),
tuneLength = 10)
chemicals_boosted <- train(x = train_predictors_chemicals,
y = train_yield,
method = "bstTree",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
chemicals_cubist <- train(x = train_predictors_chemicals,
y = train_yield,
method = "cubist",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
Which tree-based regression model gives the optimal resampling and test set performance?
print(chemicals_random_forest$results[chemicals_random_forest$results$mtry == chemicals_random_forest$bestTune$mtry,])
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 3 14 1.293772 0.5685713 0.984159 0.1512345 0.08039841 0.1230234
print(chemicals_boosted$results[(chemicals_boosted$results$maxdepth == chemicals_boosted$bestTune$maxdepth) & (chemicals_boosted$results$nu == chemicals_boosted$bestTune$nu) & (chemicals_boosted$results$mstop == chemicals_boosted$bestTune$mstop),])
## maxdepth nu mstop RMSE Rsquared MAE RMSESD RsquaredSD
## 20 2 0.1 500 1.264543 0.5708952 0.9555421 0.1352694 0.07864738
## MAESD
## 20 0.09737125
print(chemicals_cubist$results[(chemicals_cubist$results$committees == chemicals_cubist$bestTune$committees) & (chemicals_cubist$results$neighbors == chemicals_cubist$bestTune$neighbors),])
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 8 20 5 1.12639 0.6563459 0.851632 0.1224567 0.08832962
## MAESD
## 8 0.08875341
The cubist model produced the best resample performance, with an R-squared of 0.6563443 (the highest of the three models) and an RMSE of 1.126354 (the lowest of the three models).
rf_predict <- predict(chemicals_random_forest, test_predictors_chemicals)
yield_rf <- data.frame(obs = test_yield, pred = rf_predict)
print(defaultSummary(yield_rf))
## RMSE Rsquared MAE
## 0.9079207 0.6908367 0.7537983
boosted_predict <- predict(chemicals_boosted, test_predictors_chemicals)
yield_boosted <- data.frame(obs = test_yield, pred = boosted_predict)
print(defaultSummary(yield_boosted))
## RMSE Rsquared MAE
## 1.0094673 0.6498047 0.7807759
cubist_predict <- predict(chemicals_cubist, test_predictors_chemicals)
yield_cubist <- data.frame(obs = test_yield, pred = cubist_predict)
print(defaultSummary(yield_cubist))
## RMSE Rsquared MAE
## 0.9381122 0.6751391 0.7515165
The random forest model produced the best test set performance, with an R-squared of 0.6908367 (the highest of the three models) and an RMSE of 0.9079207 (the lowest of the three models).
Which predictors are most important in the optimal tree-based regression model? Do either biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(chemicals_random_forest, scale = FALSE)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 79.672
## ManufacturingProcess31 43.350
## ManufacturingProcess13 32.645
## BiologicalMaterial12 29.530
## BiologicalMaterial03 27.170
## BiologicalMaterial06 17.668
## ManufacturingProcess17 17.381
## ManufacturingProcess09 16.165
## ManufacturingProcess06 14.787
## ManufacturingProcess28 14.515
## BiologicalMaterial02 14.359
## BiologicalMaterial11 12.934
## ManufacturingProcess36 10.717
## BiologicalMaterial04 9.420
## BiologicalMaterial08 8.506
## ManufacturingProcess39 7.687
## ManufacturingProcess11 7.396
## ManufacturingProcess20 7.093
## BiologicalMaterial09 6.845
## BiologicalMaterial05 6.748
Manufacturing Process variables dominate the top 10 list, taking 7 of the top 10 slots (including a sweep of the top 3). As with previous models utilizing this dataset, the top 20 are a little more balanced, with 11 Manufacturing Process variables and 9 Biological Material variables on the list.
The top ten important predictors from the linear model (HW 7) were Manufacturing Processes 32, 36, 13, 09, 17, 06, and 33, as well as Biological Materials 02, 06, and 08.
The top ten important predictors from the SVM model (HW 8) were Manufacturing Processes 13, 32, 17, 31, 09, 36, and 06, as well as Biological Materials 06, 03, and 12.
The top ten important predictors from this Random Forest model were Manufacturing Processes 32, 31, 13, 17, 09, 06, and 28, as well as Biological Materials 12, 03, and 06.
Appearing on all three top-ten lists were MP32, MP13, MP17, MP09, MP06, and BM06. MP28 was the only variable that appeared on this list but not on the lists for either the linear or svm models from previous assignments.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
single_tree <- train(x = train_predictors_chemicals,
y = train_yield,
method = "rpart",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
library(rpart.plot)
## Loading required package: rpart
rpart.plot(single_tree$finalModel)
It’s very interesting to me that this tree uses Biological Material 11 and Manufacturing Process 25, which did not appear on any of the lists of most important variables. It makes sense that the first decision point involves Manufacturing Process 32, which was the most consistently important variable across the models (it was the most important variable in the linear model and random forest model, and the 2nd most important variable in the SVM model).