Exercise 8.1

Recreate the simulated data from Exercise 7.2:

# Code provided by the textbook

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

8.1(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

# Code provided by the textbook
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?


print(rfImp1)
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

The model did not significantly use these predictors. The smallest importance score among the informative predictors (V1-V5) is 0.72 (for V3), while none of the uninformative predictors (V6-V10) scores above 0.124.
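
A quick numeric check of that separation, using the rfImp1 object computed above:

# Compare the smallest informative score with the largest uninformative score
min(rfImp1[paste0("V", 1:5), "Overall"])    # ~0.72 (V3)
max(rfImp1[paste0("V", 6:10), "Overall"])   # ~0.12 (V6)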

8.1(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

# Code provided by the textbook
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?


model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                 Overall
## V1          6.774034589
## V2          6.426340527
## V3          0.613805379
## V4          7.135941576
## V5          2.135242904
## V6          0.171933358
## V7          0.142238552
## V8         -0.073192083
## V9         -0.098719872
## V10        -0.009701234
## duplicate1  3.084990840

The importance score for V1 decreased from approximately 8.627 in model 1 to approximately 6.774 in model 2. Because duplicate1 can stand in for V1 at many splits, the importance credit that V1 previously received on its own is now shared between the two correlated predictors.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9337221
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
##                 Overall
## V1          5.908641677
## V2          6.586726939
## V3          0.559845667
## V4          7.373782389
## V5          1.987341138
## V6          0.162417814
## V7          0.038423138
## V8          0.007497423
## V9         -0.001806331
## V10         0.004023755
## duplicate1  2.351543736
## duplicate2  2.305339113

The inclusion of another predictor that is highly correlated with V1 further decreased the importance score for V1 from approximately 6.774 to approximately 5.909.
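
The trend is easier to see when V1's scores from the three fits are placed side by side; this small sketch just pulls those rows out of the importance objects computed above.

# Show how V1's importance shrinks as correlated copies are added
v1_trend <- data.frame(
  model = c("no duplicates", "one duplicate", "two duplicates"),
  V1_importance = c(rfImp1["V1", "Overall"],
                    rfImp2["V1", "Overall"],
                    rfImp3["V1", "Overall"])
)
print(v1_trend)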

8.1(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?


# Create new models using the party package
library(party)
model_1c <- cforest(y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10, data = simulated)
model_2c <- cforest(y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + duplicate1, data = simulated)
model_3c <- cforest(y ~ ., data = simulated)

# Test the original model using the two importance measures
party::varimp(model_1c, conditional = FALSE)
##           V1           V2           V3           V4           V5           V6 
##  8.823565206  6.746026308  0.018035965  8.299190525  1.831594395 -0.028250263 
##           V7           V8           V9          V10 
##  0.012750215 -0.039005012 -0.009094921 -0.033128463
party::varimp(model_1c, conditional = TRUE)
##           V1           V2           V3           V4           V5           V6 
##  5.518010680  5.401157309  0.044177984  6.514907991  1.137068919  0.021115517 
##           V7           V8           V9          V10 
## -0.027382651 -0.015742699 -0.004532868 -0.023613044
# Test the model with 1 highly correlated predictor using the two importance measures
party::varimp(model_2c, conditional = FALSE)
##           V1           V2           V3           V4           V5           V6 
##  6.782152367  6.374613432 -0.013099108  7.754921977  1.863663487  0.013229404 
##           V7           V8           V9          V10   duplicate1 
##  0.002077785 -0.023110264 -0.033893360 -0.033671540  2.502969467
party::varimp(model_2c, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  3.3503506078  4.9501885378  0.0286519677  6.2690998330  1.1679495248 
##            V6            V7            V8            V9           V10 
## -0.0003033275  0.0024905037  0.0068110285 -0.0140140085 -0.0055103784 
##    duplicate1 
##  1.0811244519
# Test the model with 2 highly correlated predictors using the two importance measures
party::varimp(model_3c, conditional = FALSE)
##           V1           V2           V3           V4           V5           V6 
##  6.475712820  6.285831841 -0.006362714  7.409278398  1.856033791 -0.027114126 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.081256113 -0.016766298 -0.043501215 -0.037083393  1.771759077  1.123783521
party::varimp(model_3c, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  2.8932752990  4.8521261504  0.0107989655  5.9612930097  1.2492712797 
##            V6            V7            V8            V9           V10 
##  0.0066221631 -0.0009753738 -0.0085238454 -0.0112670161 -0.0212391636 
##    duplicate1    duplicate2 
##  0.6708616356  0.3836462015

The importances generated by the party package show a similar overall pattern to the traditional random forest model, but with some key differences. The importance of V3 is substantially reduced in the cforest models, making it look more like V6-V10 than like V1, V2, V4, or V5. The conditional measure also shrinks the scores of the correlated predictors (for example, duplicate1 drops from about 2.50 to 1.08 in model_2c), which is the adjustment described in Strobl et al. (2007). Even so, adding highly correlated predictors still substantially reduces the importance of V1.
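
Because varimp relies on random permutations, repeated calls give slightly different values; putting the two measures for the original model in one table (a small sketch reusing model_1c from above) makes them easier to compare.

# Unconditional vs. conditional importance for the original cforest model
# (values vary slightly from run to run)
imp_compare <- data.frame(
  unconditional = party::varimp(model_1c, conditional = FALSE),
  conditional   = party::varimp(model_1c, conditional = TRUE)
)
print(round(imp_compare, 3))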

8.1(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?


boosted_tree_1 <- train(x = simulated[,-c(11:13)],
                        y = simulated$y,
                        method = "bstTree")

bstree1_vi <- varImp(boosted_tree_1, scale = FALSE)
print(bstree1_vi)
## loess r-squared variable importance
## 
##      Overall
## V4  0.289801
## V1  0.277043
## V2  0.260337
## V5  0.134314
## V3  0.090934
## V9  0.023943
## V10 0.021639
## V8  0.015209
## V6  0.008499
## V7  0.005977
boosted_tree_2 <- train(x = simulated[,-c(11,13)],
                        y = simulated$y,
                        method = "bstTree")

bstree2_vi <- varImp(boosted_tree_2, scale = FALSE)
print(bstree2_vi)
## loess r-squared variable importance
## 
##             Overall
## V4         0.289801
## V1         0.277043
## V2         0.260337
## duplicate1 0.224231
## V5         0.134314
## V3         0.090934
## V9         0.023943
## V10        0.021639
## V8         0.015209
## V6         0.008499
## V7         0.005977
boosted_tree_3 <- train(x = simulated[,-11],
                        y = simulated$y,
                        method = "bstTree")

bstree3_vi <- varImp(boosted_tree_3, scale = FALSE)
print(bstree3_vi)
## loess r-squared variable importance
## 
##             Overall
## V4         0.289801
## V1         0.277043
## V2         0.260337
## duplicate1 0.224231
## duplicate2 0.218969
## V5         0.134314
## V3         0.090934
## V9         0.023943
## V10        0.021639
## V8         0.015209
## V6         0.008499
## V7         0.005977

The boosted tree model shows a similar split in that V1-V5 score higher than V6-V10. Note, however, the header of the output: caret has no model-specific importance method for bstTree, so varImp falls back to a model-free loess R-squared filter. Each score therefore reflects a predictor's marginal relationship with y, independent of the fitted ensemble, which is why the V1-V10 values are identical across the three fits, why adding correlated predictors does not reduce V1's score, and why duplicate1 and duplicate2 score nearly as high as V1.
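
To see how a boosted model's own importance measure responds to the correlated predictors, one option (my addition, not required by the exercise) is to refit with gbm, whose relative-influence importance is model-specific; a sketch using caret's default tuning:

# Refit the model that includes both duplicates with gbm so that varImp
# uses relative influence rather than the loess filter
library(gbm)
gbm_3 <- train(x = simulated[, -11],
               y = simulated$y,
               method = "gbm",
               verbose = FALSE)
varImp(gbm_3, scale = FALSE)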

cubist_1 <- train(x = simulated[,-c(11:13)],
                  y = simulated$y,
                  method = "cubist",
                  tuneLength = 10)

cubist1_vi <- varImp(cubist_1, scale = FALSE)
print(cubist1_vi)
## cubist variable importance
## 
##     Overall
## V1     72.0
## V2     54.5
## V4     49.0
## V3     42.0
## V5     40.0
## V6     11.0
## V9      0.0
## V10     0.0
## V8      0.0
## V7      0.0
cubist_2 <- train(x = simulated[,-c(11,13)],
                  y = simulated$y,
                  method = "cubist",
                  tuneLength = 10)

cubist2_vi <- varImp(cubist_2, scale = FALSE)
print(cubist2_vi)
## cubist variable importance
## 
##            Overall
## V1            72.0
## V2            60.5
## V4            47.0
## V3            39.0
## V5            37.0
## V6            10.5
## V8             3.0
## duplicate1     2.5
## V10            0.0
## V9             0.0
## V7             0.0
cubist_3 <- train(x = simulated[,-11],
                  y = simulated$y,
                  method = "cubist",
                  tuneLength = 10)

cubist3_vi <- varImp(cubist_3, scale = FALSE)
print(cubist3_vi)
## cubist variable importance
## 
##            Overall
## V1            64.5
## V2            60.5
## V3            50.5
## V4            48.0
## V5            32.5
## V6            15.0
## duplicate1     5.0
## duplicate2     2.5
## V8             2.0
## V10            0.5
## V7             0.0
## V9             0.0

In the Cubist models, as in the random forest and boosted tree models, V1-V5 are more important than V6-V10, although Cubist makes more use of V6 than the previous models did. Adding a single predictor correlated with V1 did not change V1's importance (72.0 in both fits), but adding a second correlated predictor reduced it to 64.5.
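
Cubist's importance scores are based on how often each predictor appears in rule conditions and in the terminal linear models; the usage table on the final model shows that breakdown directly (this assumes the usage component of the Cubist object; summary(cubist_3$finalModel) prints the same information along with the rules).

# How often each predictor is used in rule conditions vs. linear models
cubist_3$finalModel$usage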

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

[Figure 8.24: variable importance plots for boosting on the solubility data, with the bagging fraction and learning rate both set to 0.1 (left) and both set to 0.9 (right)]

8.3(a)

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?


The model on the right has both parameters set to 0.9. With a bagging fraction of 0.9, nearly all of the training set is used to build each tree, so successive trees see almost the same data and keep selecting the same few strong predictors. With a learning rate of 0.9, each tree contributes almost its full prediction to the ensemble, so the first few trees (and the predictors they split on) account for most of the reduction in error, concentrating importance on a handful of predictors. The model on the left, with both parameters at 0.1, builds each tree on a different small sample and adds only a small fraction of each tree's contribution, so many more trees (and therefore more predictors) participate in the fit and the importance is spread more evenly.
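
For reference, here is a rough sketch of how the two extreme fits behind Figure 8.24 could be reproduced on the solubility data; the number of trees and interaction depth used here are my assumptions, not the book's exact settings.

# Fit gbm models with bagging fraction and learning rate both 0.1 vs. both 0.9
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
grid_low  <- expand.grid(n.trees = 100, interaction.depth = 7,
                         shrinkage = 0.1, n.minobsinnode = 10)
grid_high <- expand.grid(n.trees = 100, interaction.depth = 7,
                         shrinkage = 0.9, n.minobsinnode = 10)
gbm_low  <- train(solTrainXtrans, solTrainY, method = "gbm",
                  tuneGrid = grid_low, bag.fraction = 0.1, verbose = FALSE)
gbm_high <- train(solTrainXtrans, solTrainY, method = "gbm",
                  tuneGrid = grid_high, bag.fraction = 0.9, verbose = FALSE)
plot(varImp(gbm_low), top = 25)
plot(varImp(gbm_high), top = 25)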

8.3(b)

Which model do you think would be more predictive of other samples?


I’d be inclined toward the model on the left. A large learning rate and bagging fraction carry a relatively high risk of overfitting, and the concentration of importance on just a few predictors in the right-hand plot is consistent with that. The model on the left should be more stable and generalize better to new samples.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

# Code provided by the textbook
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Impute missing values using knn
library(impute)
chemicals_impute <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 31415)
chemicals_impute <- as.data.frame(chemicals_impute$data)

# Create a test/train split
training_rows_chemicals <- createDataPartition(chemicals_impute$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals_impute[training_rows_chemicals,-1]
train_yield <- chemicals_impute[training_rows_chemicals, 1]
test_predictors_chemicals <- chemicals_impute[-training_rows_chemicals,-1]
test_yield <- chemicals_impute[-training_rows_chemicals, 1]

chemicals_random_forest <- train(x = train_predictors_chemicals,
                                 y = train_yield,
                                 method = "rf",
                                 preProc = c("center", "scale"),
                                 tuneLength = 10)

chemicals_boosted <- train(x = train_predictors_chemicals,
                           y = train_yield,
                           method = "bstTree",
                           preProc = c("center", "scale"),
                           tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
chemicals_cubist <- train(x = train_predictors_chemicals,
                          y = train_yield,
                          method = "cubist",
                          preProc = c("center", "scale"),
                          tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07

8.7(a)

Which tree-based regression model gives the optimal resampling and test set performance?


print(chemicals_random_forest$results[chemicals_random_forest$results$mtry == chemicals_random_forest$bestTune$mtry,])
##   mtry     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 3   14 1.293772 0.5685713 0.984159 0.1512345 0.08039841 0.1230234
print(chemicals_boosted$results[(chemicals_boosted$results$maxdepth == chemicals_boosted$bestTune$maxdepth) & (chemicals_boosted$results$nu == chemicals_boosted$bestTune$nu) & (chemicals_boosted$results$mstop == chemicals_boosted$bestTune$mstop),])
##    maxdepth  nu mstop     RMSE  Rsquared       MAE    RMSESD RsquaredSD
## 20        2 0.1   500 1.264543 0.5708952 0.9555421 0.1352694 0.07864738
##         MAESD
## 20 0.09737125
print(chemicals_cubist$results[(chemicals_cubist$results$committees == chemicals_cubist$bestTune$committees) & (chemicals_cubist$results$neighbors == chemicals_cubist$bestTune$neighbors),])
##   committees neighbors    RMSE  Rsquared      MAE    RMSESD RsquaredSD
## 8         20         5 1.12639 0.6563459 0.851632 0.1224567 0.08832962
##        MAESD
## 8 0.08875341

The Cubist model produced the best resampling performance, with an R-squared of 0.6563459 (the highest of the three models) and an RMSE of 1.12639 (the lowest of the three models).
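
caret's resamples helper gives a more direct side-by-side view of the three resampling distributions; a quick sketch using the models trained above (for a strict comparison the three models would share the same resampling indices via trainControl, but this still gives a useful summary):

# Compare resampled RMSE / R-squared / MAE across the three models
resamp <- resamples(list(RandomForest = chemicals_random_forest,
                         BoostedTree  = chemicals_boosted,
                         Cubist       = chemicals_cubist))
summary(resamp)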

rf_predict <- predict(chemicals_random_forest, test_predictors_chemicals)
yield_rf <- data.frame(obs = test_yield, pred = rf_predict)
print(defaultSummary(yield_rf))
##      RMSE  Rsquared       MAE 
## 0.9079207 0.6908367 0.7537983
boosted_predict <- predict(chemicals_boosted, test_predictors_chemicals)
yield_boosted <- data.frame(obs = test_yield, pred = boosted_predict)
print(defaultSummary(yield_boosted))
##      RMSE  Rsquared       MAE 
## 1.0094673 0.6498047 0.7807759
cubist_predict <- predict(chemicals_cubist, test_predictors_chemicals)
yield_cubist <- data.frame(obs = test_yield, pred = cubist_predict)
print(defaultSummary(yield_cubist))
##      RMSE  Rsquared       MAE 
## 0.9381122 0.6751391 0.7515165

The random forest model produced the best test set performance, with an R-squared of 0.6908367 (the highest of the three models) and an RMSE of 0.9079207 (the lowest of the three models).
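
The three test-set summaries can also be collected into one table for easier reading, using the prediction frames computed above:

# One table of test-set metrics for the three models
rbind(RandomForest = defaultSummary(yield_rf),
      BoostedTree  = defaultSummary(yield_boosted),
      Cubist       = defaultSummary(yield_cubist))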

8.7(b)

Which predictors are most important in the optimal tree-based regression model? Do either biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?


varImp(chemicals_random_forest, scale = FALSE)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  79.672
## ManufacturingProcess31  43.350
## ManufacturingProcess13  32.645
## BiologicalMaterial12    29.530
## BiologicalMaterial03    27.170
## BiologicalMaterial06    17.668
## ManufacturingProcess17  17.381
## ManufacturingProcess09  16.165
## ManufacturingProcess06  14.787
## ManufacturingProcess28  14.515
## BiologicalMaterial02    14.359
## BiologicalMaterial11    12.934
## ManufacturingProcess36  10.717
## BiologicalMaterial04     9.420
## BiologicalMaterial08     8.506
## ManufacturingProcess39   7.687
## ManufacturingProcess11   7.396
## ManufacturingProcess20   7.093
## BiologicalMaterial09     6.845
## BiologicalMaterial05     6.748

Manufacturing Process variables dominate the top 10 list, taking 7 of the top 10 slots (including a sweep of the top 3). As with previous models utilizing this dataset, the top 20 are a little more balanced, with 11 Manufacturing Process variables and 9 Biological Material variables on the list.

The top ten important predictors from the linear model (HW 7) were Manufacturing Processes 32, 36, 13, 09, 17, 06, and 33, as well as Biological Materials 02, 06, and 08.

The top ten important predictors from the SVM model (HW 8) were Manufacturing Processes 13, 32, 17, 31, 09, 36, and 06, as well as Biological Materials 06, 03, and 12.

The top ten important predictors from this Random Forest model were Manufacturing Processes 32, 31, 13, 17, 09, 06, and 28, as well as Biological Materials 12, 03, and 06.

Appearing on all three top-ten lists were MP32, MP13, MP17, MP09, MP06, and BM06. MP28 was the only variable that appeared on this list but not on the lists for either the linear or SVM models from previous assignments.
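
The top-ten names can also be pulled programmatically from the importance object rather than read off the printed list; a small sketch using the random forest fit above:

# Extract the ten highest-ranked predictors from the random forest importance
rf_imp <- varImp(chemicals_random_forest, scale = FALSE)$importance
head(rownames(rf_imp)[order(rf_imp$Overall, decreasing = TRUE)], 10)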

8.7(c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?


single_tree <- train(x = train_predictors_chemicals,
                     y = train_yield,
                     method = "rpart",
                     preProc = c("center", "scale"),
                     tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
library(rpart.plot)
## Loading required package: rpart
rpart.plot(single_tree$finalModel)

It’s very interesting to me that this tree splits on Biological Material 11 and Manufacturing Process 25, neither of which appeared in the top-ten importance lists (Biological Material 11 does show up further down the random forest ranking, in 12th place). It makes sense that the first split uses Manufacturing Process 32, which was the most consistently important variable across the models (the most important variable in the linear and random forest models, and the second most important in the SVM model).
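
The rpart.plot view shows the mean yield in each terminal node; to show the distribution of yield in the terminal nodes, as the question asks, one option (assuming the partykit package is available) is to convert the rpart fit to a party object, whose default plot draws a boxplot of the response in each leaf:

# Plot the single tree with boxplots of yield in the terminal nodes
library(partykit)
plot(as.party(single_tree$finalModel))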