8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No; all five uninformative predictors (V6-V10) had importance scores with absolute values below 0.12, and every one of those scores was smaller in absolute value than the scores of the five informative predictors (the smallest of which, V3, was about 0.72).
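
For a quick visual check of the same conclusion, randomForest's built-in varImpPlot() can display the permutation and node-purity importance measures side by side (a minimal sketch; output not shown here):

# plot both importance measures for model1 from part (a)
varImpPlot(model1, sort = TRUE, main = "Random forest variable importance")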

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

print(rfImp2[1, ])
## [1] 6.774035

The importance score for V1 decreased, from about 8.63 in the first model to about 6.77 after adding the highly correlated duplicate1.
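
To see how much of that importance is now shared with the near-duplicate, the relevant rows of the importance table can be printed together (a sketch; output not shown here):

# compare V1 with the correlated predictor added above
rfImp2[c("V1", "duplicate1"), , drop = FALSE]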

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .11
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9211579
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3[1,])
## [1] 6.507008

The importance score for V1 decreased again from model 2 to model 3 (from about 6.77 to 6.51), though by less than the drop from model 1 to model 2.

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

suppressWarnings(library(party))
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
cforest <- cforest(y ~ ., data = simulated, controls = bagCtrl)

cfImp <- varimp(cforest)
cfImp
##           V1           V2           V3           V4           V5           V6 
##  8.712613935  7.885522241  0.015540587  9.936236080  2.301086578  0.005971381 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.036254619 -0.027258381 -0.029171336  0.044816202  0.656011087  0.528821468

Broadly, yes: this variable importance profile shows the same pattern as the traditional random forest model, with the informative predictors V1, V2, V4, and V5 receiving far greater importance than the uninformative predictors V6-V10, and both highly correlated duplicates also ranking above the uninformative variables. The one difference is V3, whose importance drops to roughly the level of the noise predictors here.
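
The defaults used above correspond to the traditional (unconditional) permutation importance. The modified measure of Strobl et al. (2007) can be requested by setting conditional = TRUE in varimp (a sketch only; the conditional permutation scheme is slow and its output is not shown here):

# conditional permutation importance (Strobl et al. 2007)
cfImp_cond <- varimp(cforest, conditional = TRUE)
cfImp_cond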

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(doParallel)
library(foreach)

num_cores <- detectCores()

cl <- makeCluster(num_cores - 1)  
registerDoParallel(cl)
library(gbm)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(5, 10, 20))
set.seed(1989)
gbmTune <- train(simulated[,-11], simulated[,11],
                  method = "gbm",
                  tuneGrid = gbmGrid,
                  verbose = FALSE)

varImp(gbmTune)
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V1          77.1465
## V2          75.7360
## V5          40.8770
## V3          25.0376
## duplicate2   9.7790
## duplicate1   6.9342
## V6           2.0708
## V7           1.3095
## V9           0.6408
## V8           0.1254
## V10          0.0000

Yes, the same pattern occurs with gradient boosting: the informative variables (V1-V5) have the greatest importance, followed by the two highly correlated duplicates, followed by the uninformative variables.
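
caret's varImp() for a gbm model is derived from gbm's relative influence; the raw values can also be pulled directly from the tuned final model (a sketch; output not shown here):

# relative influence from the final gbm model, without the plot
summary(gbmTune$finalModel, plotit = FALSE)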

library(Cubist)
cubistTuned <- train(simulated[,-11], simulated[,11], method = "cubist")
varImp(cubistTuned)
## cubist variable importance
## 
##            Overall
## V1         100.000
## V2          88.722
## V3          76.692
## V4          72.180
## V5          52.632
## V6          17.293
## duplicate1   9.774
## V8           5.263
## V10          0.000
## V9           0.000
## duplicate2   0.000
## V7           0.000

The variable importance profile for the Cubist model is slightly different. The five informative predictors are still the most important, but several uninformative variables and duplicate2 received zero importance, and one uninformative variable (V6) ranked above duplicate1.
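
Cubist's importance in caret is based on how often each predictor is used in the rule conditions and in the rule-level linear models; the fitted rules and the underlying attribute usage can be inspected directly (a sketch; output not shown here):

# print the Cubist rules and attribute usage
summary(cubistTuned$finalModel)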

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

knitr::include_graphics("/Users/mollysiebecker/Desktop/Screenshot 2024-11-11 at 8.04.15 PM.png")

(a)

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Since the learning rate in the model on the right is so high, each individual learner contributes a large fraction of its prediction to the ensemble, and because the learners are fit sequentially, the early trees strongly shape the residuals that later trees fit. A high learning rate therefore lets the few variables favored by the first learners dominate the importance profile of the entire model. The model on the left, with the lower learning rate, limits how much any one learner can influence the ensemble, which spreads importance across more predictors and leads to the less steep variable importance profile.

Additionally, the high bagging fraction in the model on the right means each learner is trained on almost the entire data set, which reduces the variability between trees and causes them to keep selecting the same predictors. The lower bagging fraction in the model on the left introduces diversity among the learners, producing trees that emphasize different variables to different extents and again leading to the less steep variable importance profile.
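
To make the contrast concrete, the two extremes in Fig. 8.24 could be reproduced on the solubility data roughly as follows (a hedged sketch, assuming the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package; not run here):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

# left-hand panel: low learning rate and low bagging fraction
gbm_low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)

# right-hand panel: high learning rate and high bagging fraction
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)

summary(gbm_low, plotit = FALSE)   # expected: importance spread across many predictors
summary(gbm_high, plotit = FALSE)  # expected: importance concentrated in a few predictors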

(b)

Which model do you think would be more predictive of other samples?

I think the model on the left, with both parameters set to 0.1, would be more predictive of other samples, because the high learning rate and high bagging fraction of the model on the right put it at greater risk of overfitting the training data.

(c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would allow each tree to have more splits. For either model, this would increase complexity and allow more variables to contribute to the model. Therefore, increasing the interaction depth would decrease the slope of predictor importance and lead to a less steep variable importance profile for either model.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(VIM)
data(ChemicalManufacturingProcess)
# use k-nearest neighbors imputation with k=5
ChemicalManufacturingProcess <- kNN(ChemicalManufacturingProcess, k=5)

# remove predictors with near zero variance
cmp_near_zero_var <- nearZeroVar(ChemicalManufacturingProcess)
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[,-cmp_near_zero_var]

# center and scale
ChemicalManufacturingProcess <- as.data.frame(scale(ChemicalManufacturingProcess))

#create train/test split
set.seed(1989)
trainingRows <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.80, list=FALSE)

cmp_train <- ChemicalManufacturingProcess[trainingRows,]
cmp_test <- ChemicalManufacturingProcess[-trainingRows,]
train_control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
cmp_rf <- train(cmp_train[-1], 
                cmp_train$Yield,
                method = "rf",
                trControl = train_control,
                preProc = c("center", "scale"))
cmp_cforest <- train(cmp_train[-1], 
                cmp_train$Yield,
                method = "cforest",
                trControl = train_control,
                preProc = c("center", "scale"))
set.seed(1989)
cmp_gbm <- train(cmp_train[-1], 
                cmp_train$Yield,
                method = "gbm",
                tuneGrid = gbmGrid,
                trControl = train_control,
                preProc = c("center", "scale"),
                verbose = FALSE)
cmp_cubist <- train(cmp_train[-1], 
                cmp_train$Yield,
                method = "cubist",
                trControl = train_control,
                preProc = c("center", "scale"),
                verbose = FALSE)
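
Once training is finished, the parallel cluster registered earlier should be shut down so the workers do not linger in the session (a small housekeeping sketch):

stopCluster(cl)
registerDoSEQ()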

(a)

Which tree-based regression model gives the optimal resampling and test set performance?

rfPred <- predict(cmp_rf, newdata = cmp_test[-1])
postResample(pred = rfPred, obs = cmp_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7400475 0.6005565 0.5704017
cforestPred <- predict(cmp_cforest, newdata = cmp_test[-1])
postResample(pred = cforestPred, obs = cmp_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.8019226 0.5090654 0.6292775
gbmPred <- predict(cmp_gbm, newdata = cmp_test[-1])
postResample(pred = gbmPred, obs = cmp_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.7676040 0.5435367 0.5914653
cubistPred <- predict(cmp_cubist, newdata = cmp_test[-1])
postResample(pred = cubistPred, obs = cmp_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6183869 0.7155763 0.4588960

The Cubist model gives the best test-set performance, with the highest \(R^2\) (approximately 0.72) and the lowest RMSE and MAE. This is also the best performance seen for these data in any of the linear or nonlinear models so far.
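
The resampling (5-fold cross-validation) performance can be compared alongside the test-set results with caret's resamples() (a sketch; output not shown here, and the comparison is strictly fair only if all models were trained on the same folds, e.g., by setting the same seed before each call to train):

# compare cross-validated performance across the four tree-based models
cmp_resamples <- resamples(list(rf = cmp_rf, cforest = cmp_cforest,
                                gbm = cmp_gbm, cubist = cmp_cubist))
summary(cmp_resamples)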

(b)

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cmp_cubist)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 59)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09  42.069
## ManufacturingProcess04  37.241
## ManufacturingProcess29  31.034
## ManufacturingProcess33  31.034
## ManufacturingProcess45  28.966
## ManufacturingProcess37  24.828
## ManufacturingProcess13  24.828
## ManufacturingProcess27  21.379
## BiologicalMaterial05    20.690
## ManufacturingProcess17  20.000
## ManufacturingProcess31  20.000
## ManufacturingProcess24  13.103
## ManufacturingProcess01  11.724
## BiologicalMaterial12    11.724
## BiologicalMaterial02    11.034
## BiologicalMaterial09    10.345
## ManufacturingProcess10  10.345
## ManufacturingProcess12   7.586
## BiologicalMaterial03     6.897

Manufacturing process variables dominate the list in the optimal Cubist model even more than they did in the optimal linear and nonlinear models. Only three of the top 10 predictors here also appear in the top 10 of both the optimal linear and nonlinear models: ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess13.
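
A quick way to isolate the top 10 predictors for this comparison is caret's plot method for varImp objects (a sketch; plot not shown here):

# top 10 predictors in the Cubist model
plot(varImp(cmp_cubist), top = 10)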

(c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)
grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))

cmp_single_tree <- train(Yield ~ .,
                data = cmp_train,
                method = "rpart",
                preProc = c("center", "scale"),
                tuneGrid = grid)

best_tree <- cmp_single_tree$finalModel

suppressWarnings(rpart.plot(best_tree))

Viewing the optimal single tree confirms what the variable importance profile of the Cubist model showed: ManufacturingProcess32 is the most important predictor, since its value determines the first split in the tree. When ManufacturingProcess32 is sufficiently small, the next split is on the second most important Cubist variable, ManufacturingProcess09; when it is not, the magnitude of ManufacturingProcess32 itself determines the next split. The tree also suggests that both ManufacturingProcess32 and ManufacturingProcess09 are positively associated with yield, since higher values of each lead to terminal nodes with higher yield.
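
The prompt also asks for the distribution of yield in the terminal nodes; rpart.plot shows the mean predicted yield in each node, but converting the rpart fit to a party object draws the terminal nodes as boxplots of the response instead (a sketch, assuming the partykit package is installed; plot not shown here):

# plot the same tree with boxplots of Yield in each terminal node
library(partykit)
plot(as.party(best_tree))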