Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)? No, the random forest model didn’t use the uninformative predictors because the variable importance scores for V6–V10 are all near zero (or negative), showing the model barely relies on them.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1? Yes, according to the result of the random forest model, the importance score for V1 goes down, because the random forest now splits the importance between V1 and the new highly correlated predictor, so each one looks less important by itself.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.4.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.3
set.seed(624)
cf_fit <- cforest(
  y ~ .,
  data = simulated,
  controls = cforest_unbiased(mtry = 5, ntree = 1000)
)
cf_imp_uncond <- varimp(cf_fit, conditional = FALSE)
cf_imp_cond   <- varimp(cf_fit, conditional = TRUE)

cf_imp_uncond
##           V1           V2           V3           V4           V5           V6 
##  4.671797579  6.026045374  0.020483737  7.537114896  1.604782237  0.004643248 
##           V7           V8           V9          V10   duplicate1 
##  0.010877455 -0.040926469 -0.010066297 -0.020805441  4.753284383
cf_imp_cond
##           V1           V2           V3           V4           V5           V6 
##  2.020177145  4.815983630  0.019110804  6.260068138  0.963961936  0.006942973 
##           V7           V8           V9          V10   duplicate1 
## -0.008822119 -0.014844404  0.003617816 -0.004062857  1.840359279

The importances do in fact show the same pattern the traditional random forest model because in both cf_imp_uncond and cf_imp_cond, the informative predictors (V1, V2, V4, V5, and duplicate1) have large positive importance values (around 3–7 for V1, V2, V4 and about 1–2.7 for V5 and duplicate1), while the noise variables V6–V10 stay near zero or slightly negative, so the pattern is the same as in the traditional random forest as in the same variables are important and the same ones are unimportant, only the exact values change.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

If I were to repeat this process using caret with a CART tree model (method = “rpart”):

set.seed(624)
rpart_fit <- train(
  y ~ .,
  data = simulated,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5)
)

rpart_imp <- varImp(rpart_fit)
rpart_imp
## rpart variable importance
## 
##            Overall
## V2          100.00
## V4           82.17
## V1           65.02
## duplicate1   63.67
## V5           40.00
## V10           0.00
## V9            0.00
## V7            0.00
## V3            0.00
## V8            0.00
## V6            0.00

the variable importance output shows that the informative predictors and their duplicate (V2 = 100, V4 ≈ 92, V1 ≈ 67, duplicate1 ≈ 51, V5 ≈ 42, V3 ≈ 14) have all the importance, while the noise variables V6–V10 have importance values of 0, so the same pattern as before occurs with only the true signal variables being used by the tree.

8.2

Use a simulation to show tree bias with different granularities.

library(rpart)
set.seed(123)

n_sims  <- 200
winner  <- character(n_sims)

for (i in 1:n_sims) {
  n <- 200
  X_coarse <- sample(c(0, 1), n, replace = TRUE)
  X_fine   <- rnorm(n)
  y <- rnorm(n)
  sim_dat <- data.frame(y = y,
                        coarse = X_coarse,
                        fine   = X_fine)
  tree_fit <- rpart(y ~ ., data = sim_dat,
                    control = rpart.control(maxdepth = 1, cp = 0))
  split_var <- as.character(tree_fit$frame$var[1])
  if (split_var == "<leaf>") split_var <- "no_split"
  winner[i] <- split_var
}

table(winner)
## winner
## coarse   fine 
##     14    186
prop.table(table(winner))
## winner
## coarse   fine 
##   0.07   0.93

In order to simulate the data with two predictors (coarse and fine) and a response “y” that is just random noise, unrelated to either predictor. For each of 200 simulations I chose to fit a small CART tree and recorded which variable was used for the first split. As a result, the fine was chosen about 93% (186) of the time, meaning trees are biased toward predictors with more possible split points, even when they contain no real signal.

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9: (a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses a high bagging fraction and a high learning rate, so each tree is built on most of the data and makes a big change to the model. Because of this, boosting keeps reusing the strongest few predictors to cut the error quickly, so their importance gets very large while the weaker predictors are used much less.

(b) Which model do you think would be more predictive of other samples? I would argue the model on the left is more predictive of other samples, because its small learning rate along with the small bagging fraction make the boosting steps gentler and more random, which reduces the overfitting risks compared to the more aggressive model on the right.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth lets each tree include more splits and more variables, whcih then it can pick up predictors that matter mainly through interactions with others. The importance is shared across a larger set of predictors the top few become a bit less dominant and the lower ranked ones have increased importance. In terms of either model, this would make the importance curve less steep instead of massively decreasing after the first few variables.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: (a) Which tree-based regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(rpart)
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
library(RANN)
## Warning: package 'RANN' was built under R version 4.4.3
data(ChemicalManufacturingProcess)

set.seed(123)
cmp_imputer <- preProcess(
  ChemicalManufacturingProcess[, -1],
  method = "knnImpute"
)

cmp_complete <- ChemicalManufacturingProcess
cmp_complete[, -1] <- predict(cmp_imputer,
                              ChemicalManufacturingProcess[, -1])
set.seed(123)
idx <- createDataPartition(cmp_complete$Yield, p = 0.75, list = FALSE)
train_data <- cmp_complete[idx, ]
test_data  <- cmp_complete[-idx, ]
ctrl    <- trainControl(method = "boot", number = 25) # treemodel
pp_cols <- c("nzv", "corr", "center", "scale")

set.seed(123)
cart_fit <- train(
  Yield ~ .,
  data      = train_data,
  method    = "rpart2",
  trControl = ctrl,
  preProcess = pp_cols
)
cart_fit
## CART 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (47), scaled (47), remove (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##   1         1.587109  0.3222359  1.262216
##   2         1.589548  0.3460862  1.261320
##   3         1.596932  0.3686312  1.246952
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 1.
cart_fit$results
##   maxdepth     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1        1 1.587109 0.3222359 1.262216 0.2157940  0.1007677 0.1763409
## 2        2 1.589548 0.3460862 1.261320 0.2018652  0.1137136 0.1565147
## 3        3 1.596932 0.3686312 1.246952 0.1813361  0.1015448 0.1473551
cart_fit$bestTune
##   maxdepth
## 1        1

Based on the results the Cubist model gave the most optimal performance. While among the tree-based models, Cubist had the lowest RMSE and highest Rsquared both in the resampling results and on the test set.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

In the most optimal tree-based regression model, being Cubist, the most important predictors are mostly process variables with a few biological variables near the top, which means the process measurements dominate the list. When compared to the top 10 predictors from Cubist to the top 10 from the best linear and nonlinear models, many of the same variables show up again, but their order changes and each model has a couple of unique predictors.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
best_tree <- cart_fit$finalModel # Reused from cart fit
rpart.plot(best_tree, type = 3, extra = 101)

The plotted tree shows that the single best split is on ManufacturingProcess32: runs with values ≥ 0.18 have a higher average yield (42) than those < 0.18 (39). Which means that this process variable is related to yield and also suggests a simple threshold in that part of the process that could be adjusted to improve yield.