Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret) 
model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for the uninformative predictors (V6–V10) are near zero, far below those of the informative predictors, so the random forest did not make meaningful use of them.
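
As a quick check (a sketch; varImp() above returns a data frame with an Overall column and the predictors as row names), sorting the unscaled scores makes the gap between the informative and uninformative predictors obvious:

# importance sorted from largest to smallest; V6-V10 should sit near zero
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]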

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data.

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2

Did the importance score for V1 change?

Yes. V1's importance dropped noticeably compared with the first random forest model, because the importance formerly attributed to V1 is now shared with the highly correlated duplicate1.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3

Adding a second predictor that is highly correlated with V1 decreases V1's importance again, since the importance is now diluted across V1, duplicate1, and duplicate2, while most of the other predictors remain roughly unchanged.
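
A compact way to see the dilution (a sketch that reuses the importance data frames computed above):

# V1's unscaled importance across the three fits
data.frame(model1 = rfImp1["V1", "Overall"],
           model2 = rfImp2["V1", "Overall"],
           model3 = rfImp3["V1", "Overall"],
           row.names = "V1")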

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
#trad random forest importances
rfImp1
#conditional inference forest importances
rfImp4
##            V1            V2            V3            V4            V5 
##  5.699856e+00  5.190114e+00  6.219633e-03  6.741766e+00  1.182547e+00 
##            V6            V7            V8            V9           V10 
## -1.104739e-02  1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02

Yes, these importances show a similar pattern: both models agree on which predictors matter (V1, V2, and V4 in particular) and which do not (V6–V10). The conditional method of Strobl et al. (2007) produces a slightly different ranking (V4 > V1), but the overall pattern is consistent with the traditional random forest.
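
For a direct side-by-side comparison (a sketch; rfImp1 is a data frame with an Overall column, while varimp() from party returns a named numeric vector):

# traditional RF importance vs. conditional cforest importance for V1-V10
vars <- paste0("V", 1:10)
data.frame(traditional = rfImp1[vars, "Overall"],
           conditional = rfImp4[vars],
           row.names = vars)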

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted trees

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(1419)
boostMod <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid, verbose = FALSE)

rfImp5 <- varImp(boostMod$finalModel, scale = FALSE)

rfImp5

The boosted trees model shows the same pattern, with V1, V2, V4, and V5 as the top predictors, but its importances decay more gradually and it gives higher relative importance to V7.

Cubist

set.seed(3134)

cubMod <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")

rfImp6 <- varImp(cubMod$finalModel, scale = FALSE)

rfImp6

Cubist shows the same pattern, and arguably the cleanest one: V1–V5 are the most important predictors, while V6–V10 receive an importance of zero, which makes this the most interpretable importance profile.
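
To confirm which predictors Cubist dropped entirely (a sketch; varImp() on the Cubist final model also returns a data frame with an Overall column):

# predictors with zero importance in the Cubist model
rownames(rfImp6)[rfImp6$Overall == 0]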

Exercise 8.2

Use a simulation to show tree bias with different granularities.

library(tree)

In this simulation, y depends only on x1_low, which takes just three distinct values; x2_mid and x3_high are copies of x1_low plus noise, rounded to one and two decimal places respectively, so the three predictors differ mainly in granularity. In the fitted tree below, x1_low appears twice (including the first split at the root), while x3_high (fine granularity) appears three times deeper in the tree and x2_mid (medium granularity) does not appear at all. To demonstrate the bias, we can run the simulation repeatedly.

set.seed(4356)
df <- data.frame(x1_low = sample(-1:1, 250, replace = TRUE),
                 y = NA,
                 x2_mid = NA,
                 x3_high = NA)

df$y <- df$x1_low + rnorm(250, mean=0, sd=1)
df$x2_mid <- df$x1_low + round(rnorm(250, mean=0, sd=1), 1)
df$x3_high <- df$x1_low + round(rnorm(250, mean=0, sd=1), 2)

#fit tree
treeMod <- tree(y ~ ., data = df)
summary(treeMod) 
## 
## Regression tree:
## tree(formula = y ~ ., data = df)
## Variables actually used in tree construction:
## [1] "x1_low"  "x3_high"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.8891 = 216.9 / 244 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.45700 -0.62120  0.02463  0.00000  0.58850  2.71300
treeMod$frame[treeMod$frame$var != "<leaf>", "var"]
## [1] x1_low  x1_low  x3_high x3_high x3_high
## Levels: <leaf> x1_low x2_mid x3_high

Running 100 simulations, the summary below shows that x1_low is used in all 100 trees, x3_high (fine granularity) in 86 of them, and x2_mid (medium granularity) in only 67, illustrating the tree's bias toward predictors with more distinct values.

set.seed(4356)
results <- replicate(100, {
  x1_low <- sample(-1:1, 250, replace = TRUE)
  y <- x1_low + rnorm(250, 0, 1)
  x2_mid <- x1_low + round(rnorm(250, 0, 1), 1)
  x3_high <- x1_low + round(rnorm(250, 0, 1), 2)
  df <- data.frame(x1_low, x2_mid, x3_high, y)
  treeMod <- tree(y ~ ., data = df)
  used_vars <- unique(as.character(treeMod$frame[treeMod$frame$var != "<leaf>", "var"]))
  return(used_vars)
})

table(unlist(results))
## 
##  x1_low  x2_mid x3_high 
##     100      67      86
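
The bias is even clearer if we record only the variable chosen for the root split in each simulated tree (a sketch using the same data-generating process):

set.seed(4356)
rootVar <- replicate(100, {
  x1_low <- sample(-1:1, 250, replace = TRUE)
  y <- x1_low + rnorm(250, 0, 1)
  x2_mid <- x1_low + round(rnorm(250, 0, 1), 1)
  x3_high <- x1_low + round(rnorm(250, 0, 1), 2)
  fit <- tree(y ~ ., data = data.frame(x1_low, x2_mid, x3_high, y))
  as.character(fit$frame$var[1])  # variable used at the root split
})
table(rootVar)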

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The left-hand plot spreads importance fairly evenly across many predictors. The low learning rate (0.1) means the model learns gradually over many trees, and the low bagging fraction (0.1) means each tree sees only 10% of the data. With so little data per tree, the algorithm cannot reliably single out the strongest predictors, so different trees split on different variables and importance is distributed across many of them.

The right-hand plot concentrates importance on a few top predictors. The high bagging fraction (0.9) gives each tree a stable view of the data, so the strongest predictors are identified consistently, while the high learning rate makes each tree's contribution large. The model therefore commits quickly to the strongest variables, and importance concentrates on them.
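
To see this concretely, here is a minimal sketch, assuming the solubility data from the AppliedPredictiveModeling package and the gbm package are available (the names solTrain, fitGbm, gbmLeft, and gbmRight are illustrative, not from the text):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# helper to fit one boosted model with a given learning rate and bagging fraction
fitGbm <- function(shrink, bagfrac) {
  gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
      n.trees = 500, interaction.depth = 3,
      shrinkage = shrink, bag.fraction = bagfrac, verbose = FALSE)
}

gbmLeft  <- fitGbm(0.1, 0.1)   # left-hand plot: both parameters at 0.1
gbmRight <- fitGbm(0.9, 0.9)   # right-hand plot: both parameters at 0.9

# relative influence: spread out for gbmLeft, concentrated on a few predictors for gbmRight
head(summary(gbmLeft,  plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)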

(b) Which model do you think would be more predictive of other samples?

I believe the model on the left, with the lower bagging fraction and learning rate, would be more predictive of new samples. The lower learning rate allows more gradual learning over many iterations, and the small bagging fraction adds randomness, both of which reduce overfitting compared with the aggressive settings on the right.
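
Continuing the sketch from part (a) (again hypothetical object names), the held-out solubility test set can be used to compare the two settings directly:

# held-out RMSE for the two extreme settings
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(solTestY, predict(gbmLeft,  newdata = solTestXtrans, n.trees = 500))
rmse(solTestY, predict(gbmRight, newdata = solTestXtrans, n.trees = 500))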

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would flatten the slope of predictor importance in both models. Deeper trees capture more complex interactions and allow weaker predictors to enter splits that were previously dominated by the strong predictors, so the importance of the top predictors decreases, that of the weaker ones increases, and importance is spread more evenly across variables.
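
Continuing the same sketch, refitting the aggressive (right-hand) configuration with deeper trees shows whether the relative influence spreads across more predictors:

# same settings as gbmRight but with much deeper trees
gbmRightDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                    n.trees = 500, interaction.depth = 10,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbmRightDeep, plotit = FALSE), 10)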

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
Chemical <- Chemical[, -nearZeroVar(Chemical)]

sum(is.na(Chemical))
## [1] 0
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
trainX <- Chemical[index, -1]
trainY <- Chemical[index, 1]

testX <- Chemical[-index, -1]
testY <- Chemical[-index, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(9548)
RF_grid <- expand.grid(mtry=seq(7,49,by=3))
rfMod <- train(trainX, trainY, method = "rf", 
                tuneGrid = RF_grid)
impRF <- varImp(rfMod, scale = FALSE)
impRF <- impRF$importance %>% rownames_to_column("var") 
impRF <- impRF[order(-impRF$Overall), , drop=FALSE] %>%
  remove_rownames()

impRF

Cubist

cubeGrid <- expand.grid(committees = c(1,5,10,20,40,80),
                        neighbors = c(0,1,3,5,7,9))

cubeMod = train(trainX, trainY, 
                   method = "cubist", tuneGrid = cubeGrid) 

impCube <- data.frame(varImp(cubeMod)$importance) %>%
  rownames_to_column("var")
impCube <- impCube[order(-impCube$Overall), , drop=FALSE] %>%
  remove_rownames()

impCube

Gradient Boosting

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

set.seed(5342)
boostMod <- train(trainX, trainY, method = "gbm", 
                   tuneGrid = gbmGrid, verbose = FALSE)

impBoost <- data.frame(varImp(boostMod)$importance) %>%
  rownames_to_column

impBoost <- impBoost[order(-impBoost$Overall), 
                     , drop=FALSE]%>% 
  remove_rownames()

impBoost

Performance on the Training Data

Calling predict() on a caret train object without new data returns predictions for the training set, so the table below shows apparent (training) performance rather than resampling estimates.

train_rf_pred <- predict(rfMod)
train_cubist_pred <- predict(cubeMod)

train_boost_pred <- predict(boostMod)

results_table <- data.frame(
  rbind(
    postResample(pred = train_cubist_pred, obs = trainY),
    postResample(pred = train_boost_pred, obs = trainY),
    postResample(pred = train_rf_pred, obs = trainY)
  ),
  row.names = c("Cubist", "Gradient Boost", "Random Forest")
)

kable(results_table, 
      caption = "Model Performance Comparison",
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model Performance Comparison

                    RMSE  Rsquared     MAE
Cubist            0.0000    1.0000  0.0000
Gradient Boost    0.0026    1.0000  0.0009
Random Forest     0.4302    0.9669  0.3317
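
The cross-validated/bootstrap resampling estimates that caret recorded during tuning can be summarized directly from the train objects; a sketch (for a strictly paired comparison, the three models would ideally share the same trainControl resampling indices):

# resampling performance stored in each train object
resamp <- resamples(list(Cubist = cubeMod,
                         GradientBoost = boostMod,
                         RandomForest = rfMod))
summary(resamp)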

Test Set Performance

predRf <- predict(rfMod, newdata = testX)
predBoost <- predict(boostMod, newdata = testX)
predCube <- predict(cubeMod, newdata = testX)
kable(data.frame(rbind(postResample(pred = predCube, obs = testY), 
                       postResample(pred = predBoost, obs = testY),
                       postResample(pred = predRf, obs = testY)),
                 row.names = c("Cubist","Gradient Boost", "Random Forest")),
      caption = "Test Set Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Test Set Performance

                      RMSE   Rsquared        MAE
Cubist            1.250654  0.6714124  0.7667326
Gradient Boost    1.264608  0.6612296  0.8916532
Random Forest     1.221934  0.7702507  0.8472754

Based on these results, the Cubist model gives the best overall performance among the tree-based models: it fits the training data almost perfectly and posts the lowest test-set MAE, although the random forest has a slightly lower test RMSE and higher R-squared. Cubist is therefore treated as the optimal model in part (b).

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

kable(head(impCube,10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
var                        Overall
ManufacturingProcess32   100.00000
ManufacturingProcess09    75.75758
ManufacturingProcess17    40.40404
ManufacturingProcess13    37.37374
BiologicalMaterial03      34.34343
BiologicalMaterial06      33.33333
BiologicalMaterial02      33.33333
ManufacturingProcess04    30.30303
ManufacturingProcess29    29.29293
ManufacturingProcess33    27.27273

Process variables dominate the list, occupying 7 of the top 10 positions and holding the highest importance scores; biological variables appear only lower down. The top 10 also overlaps substantially with the top 10 from the optimal linear and nonlinear models: ManufacturingProcess32 and ManufacturingProcess09 appear near the top of all of them, and the nonlinear and Cubist models share biological predictors such as BiologicalMaterial02, BiologicalMaterial03, and BiologicalMaterial06 in their top 10.
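
For reference, the top-10 lists from the three tree-based models can be lined up side by side (a sketch using the importance tables built above; impBoost's name column is rowname because rownames_to_column() was called without an argument). The same construction could include the linear and nonlinear models from Exercises 6.3 and 7.5 if those fits are still in the session.

# top-10 predictors by tree-based model
data.frame(Cubist = head(impCube$var, 10),
           GBM    = head(impBoost$rowname, 10),
           RF     = head(impRF$var, 10))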

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(partykit)
library(gridExtra)

Yes, this view of the data adds some insight into the predictors and their relationship with yield: the splits in the single tree are dominated by ManufacturingProcess predictors, confirming that the process variables are the primary drivers of yield, and the terminal-node distributions show how yield shifts as those variables cross the split thresholds.

set.seed(1345)

#cross val control
ctrl <- trainControl(method = "cv", number = 10)

#tune grid
treeGrid <- expand.grid(maxdepth = seq(1, 15, by = 1))

# train treemod
treeMod <- train(trainX, trainY, 
                 method = "rpart2", 
                 tuneGrid = treeGrid, 
                 trControl = ctrl)


plot(as.party(treeMod$finalModel), 
     gp = gpar(fontsize = 11, fontfamily = "sans"),
     main = "Optimal Regression Tree",
     ip_args = list(abbreviate = FALSE, 
                    id = FALSE))