LFMG-Lab-9
Trees and Rules

8.1
Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a)
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
library(caret)
Loading required package: ggplot2
Attaching package: 'ggplot2'
The following object is masked from 'package:randomForest':
margin
Loading required package: lattice
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:randomForest':
combine
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tibble)
library(knitr)
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |> arrange(desc(Overall)) |> knitr::kable()
|    |    Overall|
|:---|----------:|
|V1  |  8.7322354|
|V4  |  7.6151188|
|V2  |  6.4153694|
|V5  |  2.0235246|
|V3  |  0.7635918|
|V6  |  0.1651112|
|V7  | -0.0059617|
|V10 | -0.0749448|
|V9  | -0.0952927|
|V8  | -0.1663626|
Did the random forest model significantly use the uninformative predictors (V6–V10)?
The five informative predictors (V1–V5) have substantially positive importance scores, with V1, V2, and V4 being the most dominant. By contrast, the five uninformative predictors (V6–V10) all have importance scores very close to zero (slightly negative in places). In other words, the forest essentially ignored V6–V10; they were not used in any meaningful way by the model.
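As a quick check (a sketch that assumes the rfImp1 data frame produced above, whose row names are V1–V10), we can pull out just the rows for the noise predictors:

# importance scores for the uninformative predictors only
rfImp1[paste0("V", 6:10), , drop = FALSE]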
b)
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
# re-fit with the first duplicate
model2 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
  arrange(desc(Overall)) |>
  knitr::kable(
    digits = 4,
    caption = "Variable importance with one V1-duplicate"
  )
|           | Overall|
|:----------|-------:|
|V4         |  7.0475|
|V2         |  6.0690|
|V1         |  5.6912|
|duplicate1 |  4.2833|
|V5         |  1.8724|
|V3         |  0.6297|
|V6         |  0.1357|
|V10        |  0.0289|
|V9         |  0.0084|
|V7         | -0.0135|
|V8         | -0.0437|
After adding one duplicate, V1's importance fell from its original 8.73 to 5.69, while duplicate1 picked up the slack with an importance of 4.28. The credit that previously belonged to V1 alone is now split roughly 60/40 between V1 and duplicate1.
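A two-line sketch (assuming the rfImp2 table above; v1_grp is just a temporary name) makes the split explicit:

# share of the combined V1 + duplicate1 importance held by each
v1_grp <- rfImp2[c("V1", "duplicate1"), "Overall"]
round(v1_grp / sum(v1_grp), 2)  # roughly 0.57 and 0.43 with the values above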
# add a second highly-correlated copy of V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cat("cor(duplicate2, V1) =", cor(simulated$duplicate2, simulated$V1), "\n")

cor(duplicate2, V1) = 0.9408631

# re-fit with both duplicates
model3 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
  arrange(desc(Overall)) |>
  knitr::kable(
    digits = 4,
    caption = "Variable importance with two V1-duplicates"
  )
|           | Overall|
|:----------|-------:|
|V4         |  7.0487|
|V2         |  6.5282|
|V1         |  4.9169|
|duplicate1 |  3.8007|
|V5         |  2.0312|
|duplicate2 |  1.8772|
|V3         |  0.5871|
|V6         |  0.1421|
|V7         |  0.1099|
|V10        |  0.0923|
|V9         | -0.0108|
|V8         | -0.0841|
After adding a second duplicate, V1 dropped again to 4.92, with duplicate1 at 3.80 and duplicate2 at 1.88. The original signal is now divided three ways, so each correlated predictor receives only a fraction of V1's original score.
c)
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# 1) load party and set up cforest
library(party)
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Attaching package: 'party'
The following object is masked from 'package:dplyr':
where
cf_ctrl <- cforest_control(ntree = 1000,
                           mtry = floor(sqrt(ncol(simulated) - 1)))
cf_model <- cforest(
  y ~ .,
  data = simulated,
  controls = cf_ctrl
)
Now we compute the "raw" permutation importance (analogous to randomForest's) and then the "conditional" importance.
imp_raw <- varimp(cf_model, conditional = FALSE)
imp_raw <- sort(imp_raw, decreasing = TRUE)

imp_cond <- varimp(cf_model, conditional = TRUE)
imp_cond <- sort(imp_cond, decreasing = TRUE)

imp_tbl <- data.frame(
  Variable = names(imp_raw),
  Raw_Importance = round(imp_raw, 3),
  Conditional_Importance = round(imp_cond[names(imp_raw)], 3)
)
imp_tbl |>
  kable(
    caption = "cforest Variable Importances: Raw vs. Conditional"
  )
|Variable   | Raw_Importance| Conditional_Importance|
|:----------|--------------:|----------------------:|
|V4         |          5.593|                  3.413|
|V2         |          4.958|                  3.166|
|duplicate1 |          3.819|                  0.880|
|V1         |          3.645|                  1.027|
|duplicate2 |          1.776|                  0.291|
|V5         |          1.487|                  0.742|
|V7         |          0.069|                  0.027|
|V3         |          0.062|                  0.027|
|V10        |          0.004|                 -0.019|
|V6         |         -0.021|                  0.001|
|V9         |         -0.033|                 -0.004|
|V8         |         -0.054|                 -0.011|
In the raw permutation importance from cforest (conditional = FALSE), we see the same "dilution" effect that occurs in a traditional random forest: V1's predictive power is split among itself and its two near-duplicates (duplicate1 and duplicate2), so each of those three variables receives only a portion of what was originally V1's entire importance. The remaining low-signal predictors (V3, V6–V10) stay very close to zero.
The conditional importance (conditional = TRUE), in contrast, asks how much each variable contributes beyond the predictors it is correlated with. Under this measure V1 edges ahead of duplicate1, and duplicate2's importance collapses toward zero, showing that once we account for V1 its noisy copies add much less new information. The conditional measure helps correct the bias introduced by correlated features, giving a clearer picture of which predictors truly matter.
d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
set.seed(200)
gbm_fit <- train(
  y ~ .,
  data = simulated,
  method = "gbm",
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE,
  tuneGrid = expand.grid(
    n.trees = 1000,
    interaction.depth = 3,
    shrinkage = 0.01,
    n.minobsinnode = 10
  )
)
gbm_imp <- varImp(gbm_fit, scale = FALSE)$importance
gbm_imp |>
  arrange(desc(Overall)) |>
  knitr::kable(caption = "GBM variable importance")
|           |    Overall|
|:----------|----------:|
|V4         | 43631.3487|
|V2         | 33053.5717|
|V1         | 23176.4461|
|V5         | 18146.9778|
|duplicate1 | 16302.5178|
|V3         | 11809.8170|
|duplicate2 |  4193.8147|
|V7         |  1672.9984|
|V6         |  1189.5452|
|V9         |  1049.2769|
|V8         |   760.6454|
|V10        |   676.3745|
set.seed(200)
cubist_fit <- train(
  y ~ .,
  data = simulated,
  method = "cubist",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(committees = c(1, 5, 10), neighbors = c(0, 5))
)
cubist_imp <- varImp(cubist_fit, scale = FALSE)$importance
cubist_imp |>
  arrange(desc(Overall)) |>
  knitr::kable(caption = "Cubist variable importance")
|           | Overall|
|:----------|-------:|
|V2         |    70.0|
|V1         |    57.5|
|V4         |    52.5|
|V5         |    50.0|
|duplicate2 |    28.5|
|duplicate1 |    26.0|
|V3         |    25.0|
|V6         |     9.0|
|V8         |     4.0|
|V7         |     0.0|
|V9         |     0.0|
|V10        |     0.0|
For both GBM and Cubist, the five most informative variables (V1–V5) dominate the importance rankings, while the five noise variables (V6–V10) sit very near zero. In the GBM fit, V4 and V2 lead by a wide margin, with V1 and V5 next, and we see that duplicate1 and duplicate2 have “stolen” some of V1’s credit.
In the Cubist model we see the same idea: V2 and V1 are most important, V4 and V5 follow, and the two duplicates pick up moderate importance while V6–V10 drop to single digits or zero.
This mirrors what we saw with the randomForest and cforest raw importances: any impurity- or permutation-based measure will split a feature's importance among its highly correlated copies. The fact that the duplicates register substantial (though lower) importance while the pure noise features register essentially none is normal and expected for tree-based models that do not adjust for conditional associations.
8.2
Use a simulation to show tree bias with different granularities.
We can run multiple null-signal simulations that repeatedly generate predictors driven entirely by noise (a continuous variable plus factors with 2, 5, or 10 levels) and track how often each one ends up as the very first split. Since none of the features truly influence y, any systematic over-representation can only be due to split bias. Aggregating across hundreds of runs gives a clear picture of that bias in action.
library(rpart)
library(tidyr)
set.seed(456)
n.sim    <- 500          # 500 datasets to simulate
n        <- 200          # sample size per dataset
k.levels <- c(2, 5, 10)

# storage for which variable is used in the very first split
first_split <- matrix(NA, n.sim, length(k.levels) + 1,
                      dimnames = list(NULL,
                                      c("X_cont", paste0("X_cat_", k.levels))))

for (i in seq_len(n.sim)) {
  # simulate one dataset
  X_cont   <- runif(n)
  X_cat_2  <- factor(sample(letters[1:2], n, TRUE))
  X_cat_5  <- factor(sample(letters[1:5], n, TRUE))
  X_cat_10 <- factor(sample(letters[1:10], n, TRUE))
  y        <- rnorm(n)
  df       <- data.frame(y, X_cont, X_cat_2, X_cat_5, X_cat_10)

  # fit a full tree (no pre-pruning)
  fit <- rpart(y ~ ., data = df,
               method = "anova",
               control = rpart.control(cp = 0, minsplit = 2))

  # grab the variable used at the root node
  vs <- fit$frame$var[1]
  first_split[i, ] <- colnames(first_split) == vs
}
# compute selection frequencies
freq_df <- as_tibble(first_split) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Freq")

# plot
ggplot(freq_df, aes(x = Variable, y = Freq)) +
  geom_col() +
  labs(
    title = "Frequency of First Split by Predictor Granularity",
    y = "Proportion of Simulations",
    x = NULL
  ) +
  theme_minimal(base_size = 14)
The plot clearly shows the split‐bias we were looking for. Even though none of the predictors actually influence y, the tree picks the 10-level factor (X_cat_10) at the root in over half of the simulations, the continuous variable (X_cont) next most often, then the 5-level factor, and almost never the 2-level factor. In other words, variables with more potential cut-points get chosen more frequently purely by chance, which is exactly the bias we wanted to illustrate.
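The mechanism is the number of candidate splits each predictor offers the exhaustive search. As a rough back-of-the-envelope count (a sketch, not part of the simulation above; n_obs and k are throwaway names): a continuous predictor with n distinct values offers n - 1 cut-points, while an unordered k-level factor can be partitioned into two groups in 2^(k-1) - 1 ways.

# candidate split counts for the predictors used above (200 observations)
n_obs <- 200
k     <- c(2, 5, 10)
data.frame(
  predictor        = c("X_cont", paste0("X_cat_", k)),
  candidate_splits = c(n_obs - 1, 2^(k - 1) - 1)  # 199, then 1, 15, 511
)

The ordering of these counts (X_cat_10 > X_cont > X_cat_5 > X_cat_2) matches the first-split frequencies seen in the plot.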
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
a)
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
When we crank both the bagging fraction and the learning rate up to 0.9, each tree in the ensemble sees nearly the entire data set and takes a very large "step" toward minimizing the loss. As a result, the strongest predictors absorb almost all of the residual error immediately, so the booster keeps splitting on those few variables over and over, leaving the weaker signals nearly untouched.
In the other case, with a low bag fraction (0.1) and low shrinkage (0.1), each tree is built on only 10% of the data and only nudges the fit a little bit. That randomness and those tiny steps force the algorithm to revisit and exploit secondary predictors in later iterations, so importance gets spread more broadly across many features.
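The solubility data behind Fig. 8.24 are not loaded in this lab, but a minimal sketch on the simulated data from 8.1 (using the gbm package loaded earlier; gbm_low, gbm_high, the depth of 3, and n.minobsinnode = 5 are arbitrary choices for illustration) shows the same concentration effect:

# two extreme settings analogous to Fig. 8.24 (illustrative sketch only)
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5,
                shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 3, n.minobsinnode = 5,
                shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: gbm_high should pile importance onto the top few predictors
head(summary(gbm_low,  plotit = FALSE), 5)
head(summary(gbm_high, plotit = FALSE), 5)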
b)
Which model do you think would be more predictive of other samples?
The model on the left, with both the bagging fraction and the learning rate equal to 0.1, will most likely generalize better to new data. By subsampling heavily and taking only tiny steps along the gradient, it builds many small, diverse trees that each capture a sliver of the remaining signal, making it much less likely to over-fit to any one strong predictor or noise fluctuation.
c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth in boosting controls how many splits (or tree levels) each tree can have, which lets the algorithm capture higher-order interactions. If we increase the depth, the strongest predictors explain even more of the residual error (in part by interacting with each other), so they soak up disproportionately more "credit" and the ranking curve falls off more sharply. In other words, deeper trees amplify the top predictors' importance at the expense of the weaker ones, making the slope of importance versus rank steeper.
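One way to probe this empirically (again a sketch on the simulated data rather than the solubility set; ri_shallow and ri_deep are names I introduce here, and the outcome will depend on the data and the other tuning values) is to compare how concentrated the relative influence is at a shallow versus a deep interaction depth:

# relative influence profiles at two interaction depths (illustrative sketch)
set.seed(100)
ri_shallow <- summary(gbm(y ~ ., data = simulated, distribution = "gaussian",
                          n.trees = 1000, shrinkage = 0.1,
                          interaction.depth = 1),  plotit = FALSE)
ri_deep    <- summary(gbm(y ~ ., data = simulated, distribution = "gaussian",
                          n.trees = 1000, shrinkage = 0.1,
                          interaction.depth = 10), plotit = FALSE)

# proportion of total influence captured by the three top-ranked predictors
c(depth_1  = sum(head(ri_shallow$rel.inf, 3)) / sum(ri_shallow$rel.inf),
  depth_10 = sum(head(ri_deep$rel.inf, 3))    / sum(ri_deep$rel.inf))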
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp <- data.frame(ChemicalManufacturingProcess)
library(RANN)
# Impute missing values
preProc <- preProcess(cmp, method = "knnImpute")  # or "medianImpute"
# Applying the preprocessing to fill in missing values
cmp_imputed <- predict(preProc, newdata = cmp)
# Checking if any NAs remain
sum(is.na(cmp_imputed))
[1] 0
# Use the imputed data from earlier
df <- cmp_imputed
# Remove near-zero variance predictors
nzv <- nearZeroVar(df)
df <- df[, -nzv]
# Splitting data into training and testing
set.seed(48)
split_index <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train_data <- df[split_index, ]
test_data <- df[-split_index, ]
# set up repeated CV
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  savePredictions = "final"
)

# we start by training the tree-based models and create a list called models
models <- list(
  CART   = train(Yield ~ ., data = train_data, method = "rpart", trControl = ctrl),
  Bagged = train(Yield ~ ., data = train_data, method = "treebag", trControl = ctrl),
  RF     = train(Yield ~ ., data = train_data, method = "rf", trControl = ctrl),
  GBM    = train(Yield ~ ., data = train_data,
                 method = "gbm", trControl = ctrl, verbose = FALSE),
  Cubist = train(Yield ~ ., data = train_data,
                 method = "cubist", trControl = ctrl,
                 tuneGrid = expand.grid(committees = c(1, 5, 10), neighbors = c(0, 5)))
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
# compare resampling distributions of the models in the list
resamps <- resamples(models)
print(summary(resamps))
Call:
summary.resamples(object = resamps)
Models: CART, Bagged, RF, GBM, Cubist
Number of resamples: 50
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
CART 0.4738215 0.6678340 0.7131400 0.7295551 0.7961814 0.9970055 0
Bagged 0.2997462 0.4381138 0.4976493 0.5127379 0.5713255 0.8791791 0
RF 0.2764440 0.4198121 0.4656958 0.4830576 0.5431133 0.6953974 0
GBM 0.2649495 0.4186403 0.4722164 0.4844123 0.5567139 0.7210251 0
Cubist 0.2745832 0.3822608 0.4290977 0.4309295 0.4793674 0.6008285 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
CART 0.5640456 0.8002778 0.9097597 0.9015343 1.0019018 1.3160264 0
Bagged 0.3551259 0.5690975 0.6450312 0.6677866 0.7650986 1.1363785 0
RF 0.3948928 0.5619098 0.6047475 0.6339438 0.7290587 0.9109366 0
GBM 0.3361831 0.5580418 0.5962691 0.6227318 0.7115450 0.9238865 0
Cubist 0.3515904 0.4806503 0.5429953 0.5509858 0.6473205 0.7889099 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
CART 0.001198163 0.1654309 0.3043729 0.2901753 0.3967191 0.6251660 0
Bagged 0.162801691 0.5366377 0.6344274 0.6150948 0.7118715 0.8633077 0
RF 0.373154893 0.5602971 0.6446800 0.6601402 0.7559946 0.8920301 0
GBM 0.345088951 0.5682870 0.6745453 0.6530730 0.7398090 0.8684146 0
Cubist 0.353517039 0.6724127 0.7232309 0.7193237 0.8190026 0.8891353 0
bwplot(resamps, metric = "RMSE")
bwplot(resamps, metric = "Rsquared")
a)
Which tree-based regression model gives the optimal resampling and test set performance?
# predictions on the test set
test_preds <- lapply(models, predict, newdata = test_data)

# compute test-set RMSE and R2 for each
test_perf <- sapply(test_preds, function(p) {
  c(
    RMSE = caret::RMSE(p, test_data$Yield),
    R2   = caret::R2(p, test_data$Yield)
  )
})
round(test_perf, 3)
CART Bagged RF GBM Cubist
RMSE 0.473 0.519 0.479 0.620 0.532
R2 0.671 0.618 0.664 0.488 0.655
In the resampling RMSE plot the Cubist model shows the smallest median RMSE and the tightest spread, so it is the best-performing model in terms of cross-validated prediction error. Similarly, in the R² plot Cubist has the highest median (around 0.72), clearly separated from the other models, making it the winner on explained variance in cross-validation.
However, on the independent 20% hold-out sample, CART actually delivered the best generalization, with RMSE = 0.473 and R² = 0.671, beating Cubist (RMSE 0.532, R² 0.655), RF, GBM, and bagging. CART's simplicity may have worked in its favor here: the single tree happened to fit the particular quirks of this test split better than the more complex models, which either under- or over-fit slightly.
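To put the two views side by side (a small sketch that assumes the resamps and test_perf objects created above; cv_rmse is just a temporary name):

# median cross-validated RMSE next to the hold-out RMSE for each model
cv_rmse <- summary(resamps)$statistics$RMSE[, "Median"]
round(rbind(CV_median = cv_rmse,
            Test      = test_perf["RMSE", names(cv_rmse)]), 3)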
b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
We can examine both Cubist and CART, since either could be considered optimal depending on whether cross-validated or test-set performance is prioritized.
# Start with CART
cart_imp <- varImp(models$CART, scale = FALSE)$importance %>%
  rownames_to_column("var") %>%
  arrange(desc(Overall))
top_cart <- cart_imp %>%
  slice(1:10) %>%
  mutate(
    type = case_when(
      grepl("^ManufacturingProcess", var) ~ "Process",
      grepl("^BiologicalMaterial", var) ~ "Biological",
      TRUE ~ "Other"
    )
  )

# Then we continue with Cubist
cubist_imp <- varImp(models$Cubist, scale = FALSE)$importance %>%
  rownames_to_column("var") %>%
  arrange(desc(Overall))
top_cubist <- cubist_imp %>%
  slice(1:10) %>%
  mutate(
    type = case_when(
      grepl("^ManufacturingProcess", var) ~ "Process",
      grepl("^BiologicalMaterial", var) ~ "Biological",
      TRUE ~ "Other"
    )
  )
cat("Top 10 — CART:\n")
Top 10 — CART:
print(top_cart)
var Overall type
1 ManufacturingProcess32 0.3545986 Process
2 BiologicalMaterial12 0.2820327 Biological
3 ManufacturingProcess13 0.2755249 Process
4 ManufacturingProcess36 0.2621354 Process
5 ManufacturingProcess31 0.2581273 Process
6 BiologicalMaterial01 0.0000000 Biological
7 BiologicalMaterial02 0.0000000 Biological
8 BiologicalMaterial03 0.0000000 Biological
9 BiologicalMaterial04 0.0000000 Biological
10 BiologicalMaterial05 0.0000000 Biological
cat("\nTop 10 — Cubist:\n")
Top 10 — Cubist:
print(top_cubist)
var Overall type
1 ManufacturingProcess32 56.5 Process
2 ManufacturingProcess17 52.5 Process
3 ManufacturingProcess09 28.0 Process
4 ManufacturingProcess01 17.5 Process
5 BiologicalMaterial02 17.5 Biological
6 ManufacturingProcess13 16.5 Process
7 ManufacturingProcess29 14.5 Process
8 BiologicalMaterial03 13.0 Biological
9 ManufacturingProcess27 12.5 Process
10 BiologicalMaterial04 11.0 Biological
Across both CART and Cubist, the single most important predictor is ManufacturingProcess32. In the CART model it’s followed by BiologicalMaterial12, then three more process variables before all remaining biological measurements drop to zero importance. In the Cubist model, the top five are all process measures, with four more process variables rounding out the top ten alongside three biologicals at much lower importance scores.
In both cases, process variables dominate the top 10 (see the quick tally below):
CART: 4 out of the top 5 are process-related (and the next five are biological but with zero weight).
Cubist: 7 of the top 10 are process variables, versus just 3 biological.
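A quick tally of the two top-10 tables (a sketch assuming the top_cart and top_cubist data frames built above) makes the balance explicit:

# count predictor types in each top-10 list
sapply(list(CART = top_cart, Cubist = top_cubist),
       function(d) table(factor(d$type, levels = c("Process", "Biological"))))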
To compare with the linear and nonlinear models, we can go back to the results from previous labs.
SVM Predictors: 7 out of 10 are process-related variables, while 3 are biological. Process variables dominate in the SVM model, making up 70% of the top 10 predictors. This might indicate that process control factors (such as equipment settings, temperature, timing) are more influential in determining chemical yield than the biological materials used.
PLS Predictors: 6 variables are shared across both models, which is a strong indicator of consistent importance. SVM picks up a few unique predictors (Process31, Bio12, Bio03).
Across all models, whether tree-based (CART and Cubist), linear (PLS), or nonlinear (SVM), the same process measurements consistently emerge as the strongest predictors. In particular, ManufacturingProcess32, Process13, and Process36 sit at or near the top of every top-10 list, and Process31 appears in SVM and CART as well. This heavy overlap (often 6–7 shared predictors across methods) shows that these process control factors are by far the most influential determinants of yield, regardless of whether we fit a single tree, an ensemble, a linear latent-variable model, or an SVM.
c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
prp(
  models$CART$finalModel,
  type = 2,            # label nodes by split criteria
  extra = 101,         # show the fitted value plus n and % of observations in each node
  fallen.leaves = TRUE,
  main = "Optimal CART Tree with Yield Distributions"
)
We can see that the single-tree split on ManufacturingProcess32 at 0.19 creates two very distinct yield regimes:
Left branch (MP32 < 0.19) (Terminal 2) has a median Yield around -0.5, with a wide spread down to about -3 and a few outliers up near +1.
Right branch (MP32 ≥ 0.19) (Terminal 3) jumps to a median Yield of roughly +0.7 and is much tighter, with most values sitting between 0 and +2.
Since no biological variables ever appear in the splits, this view confirms that process conditions (in this case, MP32) dominate the yield relationship.
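We can corroborate the two regimes directly from the training data (a sketch using the dplyr verbs already loaded; "branch" is a temporary helper column, Yield and ManufacturingProcess32 are on the centered and scaled scale produced by knnImpute, and 0.19 is the cut-point reported by the tree above):

# yield summary on either side of the root split (standardized units)
train_data %>%
  mutate(branch = ifelse(ManufacturingProcess32 < 0.19,
                         "MP32 < 0.19", "MP32 >= 0.19")) %>%
  group_by(branch) %>%
  summarise(n = n(), median_yield = median(Yield), IQR_yield = IQR(Yield))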
# grab each training row's terminal-node number
train_data$terminal <- models$CART$finalModel$where

# then plot
ggplot(train_data, aes(x = factor(terminal), y = Yield)) +
  geom_boxplot() +
  labs(
    title = "Yield Distribution Across CART Terminal Nodes",
    x = "Terminal Node",
    y = "Yield"
  ) +
  theme_minimal(base_size = 14)
The box plots don't reveal any secondary splits on the biological materials. Any remaining variability within a node is probably random noise.
Overall, plotting the yield distributions in each terminal node makes the size and consistency of the process-variable effect very clear and shows that the biological measurements add little explanatory power once MP32 is partitioned on.