Exercise 8.1

library(mlbench)      # mlbench.friedman1() simulation
library(randomForest) # traditional random forests
library(caret)        # train(), varImp(), resampling utilities
library(party)        # cforest() conditional inference forests
library(gbm)          # stochastic gradient boosting
library(Cubist)       # Cubist rule-based models
library(tidyverse)    # data manipulation and plotting
set.seed(200)

(1) Simulate

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(simulated$x, y = simulated$y))
colnames(simulated)[1:10] <- paste0("V", 1:10)   # predictors are V1‑V10
str(simulated)
## 'data.frame':    200 obs. of  11 variables:
##  $ V1 : num  0.534 0.584 0.59 0.691 0.667 ...
##  $ V2 : num  0.648 0.438 0.588 0.226 0.819 ...
##  $ V3 : num  0.8508 0.6727 0.4097 0.0334 0.7168 ...
##  $ V4 : num  0.1816 0.6692 0.3381 0.0669 0.8032 ...
##  $ V5 : num  0.929 0.1638 0.8941 0.6374 0.0831 ...
##  $ V6 : num  0.3618 0.4531 0.0268 0.525 0.2234 ...
##  $ V7 : num  0.827 0.649 0.179 0.513 0.664 ...
##  $ V8 : num  0.421 0.845 0.35 0.797 0.904 ...
##  $ V9 : num  0.5911 0.9282 0.0176 0.6899 0.397 ...
##  $ V10: num  0.589 0.758 0.444 0.445 0.55 ...
##  $ y  : num  18.5 16.1 17.8 13.8 18.4 ...

Only V1–V5 are informative; V6–V10 are pure noise.

(a) Random forest on all predictors

rf1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(rf1, scale = FALSE)
print(rfImp1)
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

Discussion The five largest importance scores belong to V1–V5, while the scores for V6–V10 are essentially zero, so the random forest largely ignores the uninformative predictors. Note that V3, although informative, receives a much smaller score than the other four.

(b) Add a highly‑correlated duplicate of V1

simulated$duplicate1 <- simulated$V1 + rnorm(200, sd = .1)   # correlation with V1 ≈ 0.95
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
rf2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(rf2, scale = FALSE)
print(rfImp2[order(rfImp2$Overall, decreasing = TRUE), ])
##  [1]  7.135941576  6.774034589  6.426340527  3.084990840  2.135242904
##  [6]  0.613805379  0.171933358  0.142238552 -0.009701234 -0.073192083
## [11] -0.098719872

Discussion The importance formerly concentrated in V1 is now shared between V1 and duplicate1, so V1's score drops noticeably relative to part (a). This illustrates how permutation importance in a traditional random forest is diluted when predictors are highly correlated: the credit is spread across the correlated variables.
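The exercise also asks what happens when yet another predictor correlated with V1 is added. A minimal sketch of that follow-up (working on a copy so the later parts of this exercise, which expect only duplicate1, are unaffected; output not shown):

sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200, sd = .1)   # second highly correlated copy of V1

rf3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(rf3, scale = FALSE)
# Expectation: the credit once assigned to V1 is now divided among V1,
# duplicate1 and duplicate2, so V1's score drops even further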

(c) Conditional inference forest

cf_ctrl <- cforest_unbiased(ntree = 1000, mtry = 3)
cf <- cforest(y ~ ., data = simulated, controls = cf_ctrl)

imp_uncond <- varimp(cf, conditional = FALSE)    
imp_cond   <- varimp(cf, conditional = TRUE)     

imp_df <- tibble(variable = names(imp_uncond),
                 Unconditional = imp_uncond,
                 Conditional   = imp_cond) %>%
  arrange(desc(Unconditional))
print(imp_df)
## # A tibble: 11 × 3
##    variable   Unconditional Conditional
##    <chr>              <dbl>       <dbl>
##  1 V4               5.97        4.41   
##  2 V1               5.96        2.54   
##  3 V2               5.46        4.08   
##  4 duplicate1       2.88        1.12   
##  5 V5               1.84        1.10   
##  6 V3               0.0761      0.0548 
##  7 V6               0.0315      0.00129
##  8 V7               0.00481     0.00382
##  9 V9              -0.0110     -0.0188 
## 10 V8              -0.0508     -0.0131 
## 11 V10             -0.0528     -0.0165

Discussion

* Unconditional importance reproduces the dilution seen in the traditional random forest: V1 and duplicate1 share the credit.
* Conditional importance adjusts each score for its correlated peers: both V1 and duplicate1 drop relative to their unconditional values, while the uncorrelated informative predictors (V2, V4, V5) retain proportionally more of theirs, so the ranking better reflects each variable's unique contribution.

(d) Other ensemble tree models

Boosted trees (GBM)

gbm_fit <- train(y ~ ., data = simulated,
                 method   = "gbm",
                 verbose  = FALSE,
                 trControl = trainControl(method = "cv", number = 5))
varImp(gbm_fit)
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V1          85.2722
## V2          85.0310
## V5          38.1427
## V3          24.3541
## duplicate1   6.6982
## V7           2.2012
## V6           1.8414
## V8           1.4412
## V9           0.9026
## V10          0.0000

Cubist

cub_fit <- train(y ~ ., data = simulated, method = "cubist")
varImp(cub_fit)
## cubist variable importance
## 
##            Overall
## V1         100.000
## V2          84.028
## V4          65.278
## V3          54.167
## V5          51.389
## V6          14.583
## V8           4.167
## duplicate1   3.472
## V10          0.000
## V7           0.000
## V9           0.000

Discussion Both GBM and Cubist highlight V1–V5 as the primary drivers. With the duplicate predictor present, neither model spreads much importance onto it: GBM assigns duplicate1 only a small fraction of V1's score, and Cubist's rule sets rely almost entirely on V1. The dilution of importance across correlated predictors is therefore most pronounced for the random forest, and its magnitude varies across tree-based algorithms.
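One way to confirm how Cubist treats the correlated pair is to print the fitted rules and check whether duplicate1 ever appears in the conditions or linear models (output not shown):

summary(cub_fit$finalModel)   # prints the Cubist rules and their linear models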

Exercise 8.2

library(rpart)
set.seed(825)

first_split <- function(n = 1000) {
  # True signal in X_num; two noise factors of different granularities
  X_num <- rnorm(n)                         
  X_fac2  <- factor(sample(letters[1:2] , n, TRUE))    #  2 levels (low granularity)
  X_fac10 <- factor(sample(letters[1:10], n, TRUE))    # 10 levels (high granularity)
  y <- 2 * X_num + rnorm(n)                 
  dat <- data.frame(y, X_num, X_fac2, X_fac10)

  fit <- rpart(y ~ ., data = dat, method = "anova", minsplit = 10)
  root_var <- as.character(fit$frame$var[1])   # variable chosen for the first (root) split
  if (root_var == "<leaf>") root_var <- "NoSplit"
  root_var

}

# Repeat many times to estimate selection frequency
B <- 500
root_vars <- replicate(B, first_split())
prop.table(table(root_vars))
## root_vars
## X_num 
##     1
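With a genuine signal in X_num, the tree splits on the informative numeric predictor essentially every time, so the noise factors never compete. The granularity bias shows up when no predictor carries signal. Below is a minimal sketch of that case (a hypothetical noise_split() helper mirroring first_split(), with cp = 0 to force at least one split; output not shown), where the high-granularity predictors X_num and X_fac10 would be expected to win the root split far more often than the two-level factor X_fac2:

noise_split <- function(n = 1000) {
  # Same predictors as above, but y is pure noise: nothing is truly informative
  X_num   <- rnorm(n)
  X_fac2  <- factor(sample(letters[1:2] , n, TRUE))
  X_fac10 <- factor(sample(letters[1:10], n, TRUE))
  y <- rnorm(n)
  dat <- data.frame(y, X_num, X_fac2, X_fac10)
  fit <- rpart(y ~ ., data = dat, method = "anova", minsplit = 10, cp = 0)
  root_var <- as.character(fit$frame$var[1])
  if (root_var == "<leaf>") root_var <- "NoSplit"
  root_var
}

set.seed(826)
prop.table(table(replicate(B, noise_split())))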

Exercise 8.3

Figure 8.24 compares how steeply variable importance declines under two extreme tuning settings:

* Left plot: small learning rate (0.1) and small bagging fraction (0.1)
* Right plot: large learning rate (0.9) and large bagging fraction (0.9)

(a) Why does the right‑hand model focus on only a few predictors?

A large learning rate magnifies each tree’s contribution, so the early trees dominate the final ensemble. Once the strongest predictors (top few) are exploited, subsequent trees have little corrective influence, and weaker predictors are seldom used. Likewise, a large bagging fraction means almost every observation is seen in every tree, reducing stochasticity and further concentrating importance on those dominant predictors.
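This can be checked empirically. Below is a minimal sketch on the simulated data from Exercise 8.1 (the settings mirror the figure, but the data and the exact argument values are only illustrative; output not shown):

set.seed(300)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)

# Relative influence: the high learning-rate / high bag-fraction fit should
# concentrate its influence on far fewer predictors than the low-setting fit
summary(gbm_low,  plotit = FALSE)
summary(gbm_high, plotit = FALSE)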

(b) Which model is likely to generalise better?

The left-hand model (low learning rate, low bagging fraction) generally yields better out-of-sample performance because it:

* learns gradually, requiring many small trees, which acts as regularisation and prevents over-fitting to early gradients;
* uses more stochastic subsampling (bagging fraction = 0.1), injecting diversity that helps average out noise.

Consequently, it spreads importance across a broader set of predictors and tends to be more robust on new data.
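One way to check this claim on the simulated data is to compare cross-validated error for the two settings directly in gbm (a sketch only; the seed and argument values are illustrative, and output is not shown):

set.seed(302)
cv_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               n.minobsinnode = 5, cv.folds = 10)
set.seed(302)
cv_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
               n.minobsinnode = 5, cv.folds = 10)

# Cross-validated squared error at each fit's best iteration
c(low  = min(cv_low$cv.error),
  high = min(cv_high$cv.error))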

(c) Effect of increasing interaction depth

Interaction depth controls the maximum number of splits per tree (tree complexity). Increasing it allows each tree to capture higher-order interactions, which typically flattens the importance slope:

* Strong main-effect predictors still rank high, but interaction terms bring additional variables into the model.
* The relative gap between the top few and the remaining predictors narrows because importance is shared among variables involved in deeper interaction paths.

In both Figure 8.24 scenarios, raising the interaction depth would therefore make the right-hand slope slightly less steep and the left-hand slope even more uniform.
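To illustrate, the high-setting sketch above could be refit with a larger interaction.depth and its relative-influence table compared against gbm_high (again only a sketch, with illustrative values and output not shown):

set.seed(303)
gbm_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10, n.minobsinnode = 5)

# With deeper trees, relative influence is typically spread over more
# predictors than with interaction.depth = 1 (the default used in gbm_high)
summary(gbm_deep, plotit = FALSE)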

Exercise 8.7

library(AppliedPredictiveModeling)

data("ChemicalManufacturingProcess", package = "AppliedPredictiveModeling")
chem <- ChemicalManufacturingProcess  

library(tidymodels)
library(caret)
set.seed(123)

split  <- initial_split(chem, prop = 0.80, strata = Yield)
train  <- training(split)
test   <- testing(split)

# Pre‑processing
chem_rec <- recipe(Yield ~ ., data = train) %>%
  step_impute_bag(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

prep_rec  <- prep(chem_rec)
train_pp  <- juice(prep_rec)
test_pp   <- bake(prep_rec, new_data = test)

(a) Which tree‑based model performs best?

We compare Random Forest, Gradient Boosting (GBM), and Cubist using repeated 10-fold cross-validation (5 repeats).

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     savePredictions = "final")

set.seed(123)
rf_fit <- train(Yield ~ ., data = train_pp,
                method = "rf", metric = "RMSE",
                tuneLength = 5, trControl = ctrl)

set.seed(123)
gbm_fit <- train(Yield ~ ., data = train_pp,
                 method = "gbm", metric = "RMSE",
                 verbose = FALSE, tuneLength = 5, trControl = ctrl)

set.seed(123)
cub_fit <- train(Yield ~ ., data = train_pp,
                 method = "cubist", metric = "RMSE",
                 tuneLength = 5, trControl = ctrl)
# Aggregate resampling results
resamp <- resamples(list(RandomForest = rf_fit,
                         GBM          = gbm_fit,
                         Cubist       = cub_fit))
summary(resamp)
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: RandomForest, GBM, Cubist 
## Number of resamples: 50 
## 
## MAE 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RandomForest 0.5896424 0.7619228 0.8898165 0.8754057 0.9685818 1.281556    0
## GBM          0.5858351 0.7681985 0.8571329 0.8764241 1.0010468 1.180916    0
## Cubist       0.3893559 0.6957343 0.7593272 0.7708145 0.8539905 1.124075    0
## 
## RMSE 
##                   Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## RandomForest 0.6744835 0.9349837 1.1461257 1.1376053 1.293110 1.753581    0
## GBM          0.7143817 0.9423272 1.1341710 1.1155358 1.270233 1.455605    0
## Cubist       0.4574569 0.8529674 0.9402502 0.9809201 1.096920 1.475904    0
## 
## Rsquared 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RandomForest 0.3370190 0.5712438 0.6821826 0.6487571 0.7398691 0.9466578    0
## GBM          0.2781991 0.5763870 0.6665927 0.6461894 0.7653992 0.9491564    0
## Cubist       0.3821078 0.6401239 0.7350277 0.7202659 0.8119538 0.9669031    0
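Exercise 8.7 (a) also asks about test-set performance. A quick check on the held-out data could look like the following sketch, using caret::postResample() and the test_pp data frame prepared above (output not shown):

test_models <- list(RandomForest = rf_fit, GBM = gbm_fit, Cubist = cub_fit)
sapply(test_models,
       function(m) postResample(predict(m, newdata = test_pp), test_pp$Yield))
# Columns are models; rows are RMSE, Rsquared and MAE on the test set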
# (b) Identify the model with the lowest CV RMSE
model_list <- list(RandomForest = rf_fit,
                   GBM          = gbm_fit,
                   Cubist       = cub_fit)
cv_rmse <- sapply(model_list, function(m) min(m$results$RMSE))
cv_rmse
## RandomForest          GBM       Cubist 
##    1.1376053    1.1155358    0.9809201
best_name <- names(which.min(cv_rmse))
best_fit  <- model_list[[best_name]]
# Variable importance for the best tree‑based model
imp_tbl <- varImp(best_fit, scale = FALSE)$importance %>%
  rownames_to_column(var = "Predictor") %>%
  arrange(desc(Overall))
head(imp_tbl, 10)
##                 Predictor Overall
## 1  ManufacturingProcess32    42.0
## 2  ManufacturingProcess17    32.5
## 3  ManufacturingProcess09    23.0
## 4    BiologicalMaterial06    19.5
## 5  ManufacturingProcess04    14.0
## 6  ManufacturingProcess33    14.0
## 7  ManufacturingProcess29    14.0
## 8  ManufacturingProcess13    12.0
## 9  ManufacturingProcess06    12.0
## 10 ManufacturingProcess37    10.0

Interpretation (8.7 b) The optimal model is Cubist. Its top-10 importance list is dominated by process variables (e.g., ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess09) rather than biological assay measurements, echoing what we saw in the linear model but with a stronger emphasis on interaction-sensitive process factors. Compared with the optimal linear model (Exercise 6.3) and the non-linear GAM/NN models (Exercise 7.5), only four variables overlap, indicating that tree ensembles surface additional predictive interactions not captured by additive methods.

# (c) Plot a single, interpretable tree and show yield distribution
library(rpart)
library(rpart.plot)

set.seed(123)
rt_fit <- train(Yield ~ ., data = train_pp,
                method = "rpart", tuneLength = 20,
                trControl = ctrl)

best_tree <- rt_fit$finalModel
rpart.plot(best_tree, extra = 101, roundint = FALSE, box.palette = "Blues")
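The heading above also asks for the yield distribution. One way to show it is to plot Yield by terminal node, using the rpart where element, which records the terminal node each training row falls into (a sketch, assuming train_pp's row order matches the fitted model, i.e. no rows were dropped; plot not shown):

node_df <- train_pp %>%
  mutate(node = factor(best_tree$where))   # terminal node id for each training row

ggplot(node_df, aes(x = node, y = Yield)) +
  geom_boxplot() +
  labs(x = "Terminal node (row of the tree frame)", y = "Yield")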

Interpretation (8.7 c) The tree highlights two dominant process variables near the top: ManufacturingProcess32 and ManufacturingProcess06. Terminal nodes with the highest yields combine favourable settings of these processes with BiologicalMaterial03 below its split threshold, consistent with the variable-importance results above. This visual reinforces that process controls, rather than assay biomarkers, drive yield, and it pinpoints actionable combinations of settings to target for optimization.