Assignment 9: Regression Trees and Rule-Based Models

Author

Amanda Rose Knudsen

Published

April 27, 2025

This assignment covers Exercises 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson’s Applied Predictive Modeling, which is referenced throughout.

library(tidyverse)
library(caret)
library(pls)
library(glmnet)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)
library(AppliedPredictiveModeling)
library(mlbench)
library(earth)
library(kernlab)
library(randomForest)
library(party)
library(gbm)
library(Cubist)

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(5889)
simulated <- mlbench.friedman1(200, sd = 1) 
simulated <- cbind(simulated$x, simulated$y) 
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Now we have a dataset with 10 predictors (V1-V10) and a response (y).

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

set.seed(5889)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
        Overall
V1   5.04568474
V2   5.47816849
V3   0.59183147
V4  11.35909806
V5   2.39778578
V6   0.04174732
V7   0.07879051
V8   0.17006476
V9  -0.08921946
V10  0.17140784

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

As expected, the informative predictors V1-V5 show substantially higher importance than the uninformative predictors V6-V10, whose scores sit near zero (V9 is even slightly negative). V4 has the highest importance. The random forest model did not significantly use the uninformative predictors and correctly concentrates on the true signal variables.
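
As a quick sanity check on this reading, the scores can be sorted and the two groups compared side by side. This is a small optional sketch, not part of the required answer; it just reuses the rfImp1 object from above.

# Sort the importance scores (rfImp1 is the caret::varImp data frame from above)
rfImp1_sorted <- rfImp1[order(-rfImp1$Overall), , drop = FALSE]
rfImp1_sorted

# Range of scores for the informative vs. uninformative blocks
range(rfImp1[paste0("V", 1:5), "Overall"])   # informative predictors V1-V5
range(rfImp1[paste0("V", 6:10), "Overall"])  # uninformative predictors V6-V10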

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(5889)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9481188
set.seed(5889)
rf_model_8b1 <- randomForest(y ~ ., data = simulated,
                              importance = TRUE,
                              ntree = 1000)
rf_Imp_8b1 <- varImp(rf_model_8b1, scale=FALSE)
rf_Imp_8b1
               Overall
V1          3.98204125
V2          5.41719046
V3          0.52995395
V4         10.65944548
V5          1.97444733
V6         -0.05345410
V7          0.03304348
V8          0.11024692
V9         -0.20587498
V10         0.14401905
duplicate1  2.59203861

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Yes, the importance score for V1 changed: it dropped (from about 5.05 to 3.98) once the highly correlated duplicate1 was added, with duplicate1 itself picking up an importance of about 2.59.

When we add another correlated predictor:

set.seed(5889)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

set.seed(102)
rf_model_8b2 <- randomForest(y ~ ., data = simulated,
                             importance = TRUE,
                             ntree = 1000)

rfImp_8b2 <- varImp(rf_model_8b2, scale = FALSE)
rfImp_8b2
               Overall
V1          3.43315091
V2          5.68683233
V3          0.51668776
V4         11.74375014
V5          1.84501003
V6         -0.01603326
V7          0.11891280
V8         -0.03674017
V9         -0.08560529
V10         0.12468069
duplicate1  2.03759968
duplicate2  2.00470726

We can see that the importance of V1 drops further, and the importance attributable to the V1 signal gets diluted across the correlated variables. Random forests distribute variable importance across correlated predictors, so the predictive power is shared with the duplicates, which is what dilutes V1’s individual score. This shows that when variables are highly correlated, the model spreads the importance across them, which can make it harder to tell which variable is actually driving the predictions.
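
One way to see the splitting directly is to compare V1’s score on its own with the combined score of V1 plus its copies. This is a small illustration using the three importance tables already computed above (rfImp1, rf_Imp_8b1, rfImp_8b2).

# V1 alone vs. V1 combined with its correlated copies: V1's individual score
# drops as copies are added, while the combined score across the group does not.
v1_alone    <- rfImp1["V1", "Overall"]
v1_plus_one <- sum(rf_Imp_8b1[c("V1", "duplicate1"), "Overall"])
v1_plus_two <- sum(rfImp_8b2[c("V1", "duplicate1", "duplicate2"), "Overall"])
c(original = v1_alone, with_one_duplicate = v1_plus_one, with_two_duplicates = v1_plus_two)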

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

First we’ll fit a random forest model using conditional inference trees. Then we’ll compute variable importance the traditional way and then the conditional way.

f <- as.formula(y ~ .)

set.seed(5889)
cf_model <- cforest(f, data = simulated,
                    controls = cforest_unbiased(mtry = 5, ntree = 1000))

cf_imp_traditional <- varimp(cf_model, conditional = FALSE)

cf_imp_conditional <- varimp(cf_model, conditional = TRUE)
cf_imp_traditional
          V1           V2           V3           V4           V5           V6 
 2.318941908  4.916702156  0.105360047 11.118553721  1.773865705 -0.041995949 
          V7           V8           V9          V10   duplicate1   duplicate2 
 0.030269845 -0.009877498 -0.050613852  0.039145539  1.796421200  1.262580326 
cf_imp_conditional
           V1            V2            V3            V4            V5 
 1.5995865863  3.4616445083  0.0245315080  8.9905780961  1.3208012588 
           V6            V7            V8            V9           V10 
-0.0156487542  0.0041797095  0.0008722701 -0.0318388378  0.0182324740 
   duplicate1    duplicate2 
 0.6019709275  0.3493596000 

In the previous section, we saw that when variables are highly correlated, a random forest splits the importance between them. This made it hard to tell how important V1 really was on its own.

The conditional inference forest uses a different way to measure importance that accounts for this correlation. When we use conditional = TRUE, the importance score for V1 doesn’t drop as much even with duplicates. This method gives a better picture of which variables are truly important by trying not to confuse shared information with shared importance.

What we see with the traditional cforest model is a pattern similar to the previous section where importance is split more across V1 and its duplicates. The total contribution is inflated and split across all three correlated variables (V1 and its 2 duplicates).

When we look at the conditional importance from the cforest model, the importance is not split as heavily across the correlated variables: a larger share goes to V1 rather than to its duplicates. All of the scores for V1 and its duplicates shrink, but V1 remains the most important and the duplicates’ importance is reduced. (While V1’s conditional score is lower than its traditional score, what matters is its size relative to the duplicates.) This makes the ranking of variable importance clearer and more meaningful, especially when predictors are correlated.
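
To make that comparison easier to read, the two sets of scores can be placed in one table. This is a small helper sketch using the cf_imp_traditional and cf_imp_conditional vectors computed above.

# Side-by-side view of the traditional and conditional importance measures,
# sorted by the conditional scores
cf_compare <- data.frame(traditional = cf_imp_traditional,
                         conditional = cf_imp_conditional)
cf_compare[order(-cf_compare$conditional), ]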

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

First we’ll repeat this process with boosted trees using gbm.

set.seed(5889)
gbm_model <- gbm(
  formula = y ~ .,
  data = simulated,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 3,
  shrinkage = 0.01,
  verbose = FALSE
)

Now we’ll view the boosted trees variable importance:

gbm_importance <- summary(gbm_model)

gbm_importance
                  var    rel.inf
V4                 V4 33.2131787
V2                 V2 22.0484596
V1                 V1 15.3806323
V5                 V5 11.9388205
V3                 V3  6.5931846
duplicate1 duplicate1  5.2553147
V8                 V8  1.7071948
V10               V10  1.5140607
V6                 V6  0.8980554
V7                 V7  0.7831722
V9                 V9  0.6679265
duplicate2 duplicate2  0.0000000

Now we’ll look at a Cubist model.

First we separate the predictors (x) from the response (y), and then we fit the Cubist model.

x <- simulated[, setdiff(names(simulated), "y")]
y <- simulated$y
set.seed(5889)
cubist_model <- cubist(x = x, y = y)

summary(cubist_model)

Call:
cubist.default(x = x, y = y)


Cubist [Release 2.07 GPL Edition]  Sun Apr 27 16:00:13 2025
---------------------------------

    Target attribute `outcome'

Read 200 cases (13 attributes) from undefined.data

Model:

  Rule 1: [42 cases, mean 11.832113, range 2.174002 to 18.99647, est err 1.477235]

    if
    V2 <= 0.1902741
    then
    outcome = 1.398953 + 10.1 V4 + 10.5 V2 + 6 V5 - 1.3 V6 + 1.1 V1

  Rule 2: [12 cases, mean 12.390821, range 7.595099 to 20.15425, est err 2.541644]

    if
    V4 <= 0.04400872
    then
    outcome = 5.955302 + 149.4 V4 + 6 V5 + 0.8 V2 + 0.7 V1

  Rule 3: [62 cases, mean 14.821066, range 4.265033 to 23.09879, est err 1.300391]

    if
    V1 <= 0.898298
    V2 > 0.1902741
    V3 > 0.4662886
    V5 <= 0.9053603
    then
    outcome = -8.990035 + 11.6 V1 + 10.8 V4 + 9 V3 + 7.5 V2 + 5.7 V5

  Rule 4: [51 cases, mean 16.357525, range 5.881745 to 23.04229, est err 1.272446]

    if
    V1 <= 0.898298
    V2 > 0.1902741
    V3 <= 0.4662886
    V5 <= 0.9053603
    then
    outcome = 2.806488 + 10.1 V4 + 9.9 V1 - 9.6 V3 + 6.2 V2 + 4.4 V5

  Rule 5: [21 cases, mean 16.725746, range 10.37667 to 23.3855, est err 2.270617]

    if
    V1 > 0.898298
    V2 > 0.1902741
    V4 > 0.04400872
    then
    outcome = 6.15789 + 9.5 V4 + 5 V5 + 4.6 V10

  Rule 6: [19 cases, mean 17.492044, range 10.4218 to 23.84154, est err 1.539024]

    if
    V1 <= 0.898298
    V2 > 0.1902741
    V4 > 0.04400872
    V5 > 0.9053603
    then
    outcome = -63.126015 + 73.1 V5 + 9.2 V4 + 9.5 V1 + 2.9 V2


Evaluation on training data (200 cases):

    Average  |error|           1.423071
    Relative |error|               0.36
    Correlation coefficient        0.92


    Attribute usage:
      Conds  Model

       94%    90%    V2
       74%    90%    V1
       64%   100%    V5
       55%    55%    V3
       25%   100%    V4
              20%    V6
              10%    V10


Time: 0.0 secs

Both the boosted trees (GBM) and the Cubist models were able to identify the truly informative predictors and mostly ignore the noise variables. In the GBM results, variables like V4, V2, V1, V5, and V3 had the highest importance scores, which matches what we’d expect based on the original simulation. Variables like duplicate1, V6, V8, and V10 showed up with lower importance, meaning the model didn’t rely heavily on them.

In the Cubist model, we saw a similar pattern. Predictors like V2, V1, V5, and V4 were used most frequently in the rules, while uninformative variables like V6 and V10 were rarely used. Notably, the Cubist model didn’t even list the duplicates in its summary output, which suggests that it completely ignored them.

Compared to traditional random forests, both models seem to be better at down-weighting the unimportant predictors and focusing more cleanly on the truly relevant ones. This makes their importance scores easier to interpret when there’s noise or redundant information in the data.
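
For a numeric summary of the Cubist model, rather than reading the usage table off the summary output, caret’s varImp also has a method for cubist objects; it blends how often each predictor appears in the rule conditions and in the linear models. A short optional sketch:

# Importance for the Cubist model via caret::varImp (blends condition and model usage)
cubist_imp <- varImp(cubist_model)
cubist_imp[order(-cubist_imp$Overall), , drop = FALSE]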

8.2 Use a simulation to show tree bias with different granularities.

We’ll use a simulation setup modeled on the discussion of selection bias in Section 8.1 of Applied Predictive Modeling. Unlike the earlier code blocks, this one includes comments so we can refer back and see what each step is doing.

set.seed(5889)

n <- 200 # n is the number of observations

# Creating binary predictors (10)
binary_predictors <- replicate(10, sample(0:1, n, replace = TRUE))
colnames(binary_predictors) <- paste0("Bin", 1:10)

# Creating categorical predictors (5) with 4 levels
categorical_predictors <- replicate(5, sample(letters[1:4], n, replace = TRUE))
colnames(categorical_predictors) <- paste0("Cat", 1:5)

# Creating continuous predictors (5)
continuous_predictors <- replicate(5, rnorm(n))
colnames(continuous_predictors) <- paste0("Cont", 1:5)

# Combining predictors to X
X <- data.frame(binary_predictors, categorical_predictors, continuous_predictors)

# Converting categorical columns to factors
X[paste0("Cat", 1:5)] <- lapply(X[paste0("Cat", 1:5)], factor)

# Creating noise response
Y <- rnorm(n)

# Final data
sim_data <- cbind(Y, X)
# Fitting a random forest
set.seed(5889)
rf_bias <- randomForest(Y ~ ., data = sim_data, importance = TRUE)

# Now we view variable importance
importance(rf_bias)
         %IncMSE IncNodePurity
Bin1   3.7532171      4.920837
Bin2   1.4660695      3.123283
Bin3  -0.3273303      3.939556
Bin4  -0.9805770      2.587228
Bin5  -0.6816732      3.191380
Bin6   0.6648496      3.363189
Bin7   1.3463142      4.182530
Bin8   3.7662354      5.360696
Bin9  -1.2660624      3.687873
Bin10 -0.2847020      2.532573
Cat1   2.5957344     12.157016
Cat2  -0.3810364     10.855587
Cat3   2.0364899     13.028446
Cat4   0.9073277     14.016578
Cat5  -0.1452788     12.358564
Cont1 -0.5535166     23.344385
Cont2  0.1281624     21.852921
Cont3 -1.3081358     25.289685
Cont4 -0.3740921     24.354348
Cont5  0.4550090     25.451645

We simulated data with three types of predictors: 10 binary, 5 categorical (4 levels each), and 5 continuous. The outcome (Y) was noise and had no relationship with any predictor.

When we fit a random forest, the predictors with more distinct values receive higher importance despite having no real predictive power: the node-purity importance (IncNodePurity) climbs from the binary predictors to the categorical predictors to the continuous predictors.

This simulation illustrates the selection bias of tree-based models, where predictors with more potential split points are favored during tree construction. It is a known issue, and a reason to interpret importance scores from trees with caution.
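
A short follow-up sketch (reusing X and sim_data from above): counting distinct values per predictor makes the granularity differences explicit, and, for contrast, a conditional inference forest from party uses unbiased split selection, so it should not systematically favor the high-granularity predictors on this pure-noise response.

# Number of distinct values per predictor: continuous columns offer far more
# candidate split points than binary or 4-level categorical columns
sapply(X, function(col) length(unique(col)))

# Conditional inference forest on the same noise data (unbiased split selection)
set.seed(5889)
cf_bias <- cforest(Y ~ ., data = sim_data,
                   controls = cforest_unbiased(ntree = 500))
sort(varimp(cf_bias), decreasing = TRUE)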

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

When the learning rate and bagging fraction are both set high (as in the right plot, where both are set to 0.9), each individual tree makes a bigger impact and has access to nearly the full training data. This can cause the model to focus quickly on the strongest predictors and rely heavily on them, leading to a steep dropoff in importance for the rest.

The left model uses more conservative settings (both tuning parameters are set to 0.1), which slow down the learning process. This means more trees are needed to converge, and along the way the model has more chances to explore weaker predictors. As a result, the model spreads importance across more variables.
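
Figure 8.24 itself isn’t reproduced here, but a rough sketch of the two extreme settings could look like the following. It assumes the solubility data that ships with AppliedPredictiveModeling (solTrainXtrans / solTrainY); the exact tuning values behind the book’s figure may differ.

# Solubility training data from AppliedPredictiveModeling
data(solubility)
sol_df <- cbind(solTrainXtrans, Solubility = solTrainY)

# Conservative settings (both parameters 0.1), as in the left-hand plot
set.seed(5889)
gbm_low <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)

# Aggressive settings (both parameters 0.9), as in the right-hand plot
set.seed(5889)
gbm_high <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# The 0.9/0.9 fit should concentrate relative influence in far fewer predictors
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)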

(b) Which model do you think would be more predictive of other samples?

The model on the left, which uses a lower bagging fraction and a lower learning rate (shrinkage), will likely generalize better to new data. It is less likely to overfit the training set because it learns more slowly and uses only a subset of the data at each stage. I think the model on the left would be more predictive of other samples.

The model on the right is at higher risk of overfitting. It relies heavily on a few predictors and may not adapt well to new samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth enables the model to capture more complex patterns and interactions between variables, since it allows deeper trees with more splits per tree. For the right-hand model, this might intensify the focus on the top predictors. For the left-hand model, increasing depth could give more subtle or interacting predictors a chance to show their usefulness. In either case, changing the interaction depth changes how steeply predictor importance falls off.
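
One rough way to probe this, continuing the sketch above (so it assumes sol_df is already defined), is to refit the conservative model with a much larger interaction depth and compare how quickly the relative influence drops off:

# Same conservative shrinkage/bagging settings, but much deeper trees
set.seed(5889)
gbm_deep <- gbm(Solubility ~ ., data = sol_df, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
head(summary(gbm_deep, plotit = FALSE), 10)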

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data(ChemicalManufacturingProcess)

First we’ll impute missing values with bagImpute, then remove zero variance predictors, then split into train/test sets, and then set up the train control variable.

set.seed(5889)

pre_87 <- preProcess(ChemicalManufacturingProcess, method = c("bagImpute"))
cmp_imputed_87 <- predict(pre_87, ChemicalManufacturingProcess)

zero_vars_87 <- nearZeroVar(cmp_imputed_87)
cmp_data_87 <- cmp_imputed_87[, -zero_vars_87]

train_index_87 <- createDataPartition(cmp_data_87$Yield, p = 0.75, list = FALSE)
cmp_train_87 <- cmp_data_87[train_index_87, ]
cmp_test_87  <- cmp_data_87[-train_index_87, ]

ctrl_87 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

Now that we’ve done the same steps as before, we can move on to train several tree-based models. We’ll train:

- Random Forest
- Boosted Trees
- Cubist
- Single Tree

Random Forest

set.seed(5889)
rfFit_87 <- train(Yield ~ ., data = cmp_train_87,
                  method = "rf",
                  preProcess = c("center", "scale"),
                  trControl = ctrl_87,
                  importance = TRUE)

rfPred_87 <- predict(rfFit_87, newdata = cmp_test_87)
rfResults_87 <- postResample(pred = rfPred_87, obs = cmp_test_87$Yield)

Boosted Tree (GBM)

set.seed(5889)
gbmFit_87 <- train(Yield ~ ., data = cmp_train_87,
                   method = "gbm",
                   preProcess = c("center", "scale"),
                   trControl = ctrl_87,
                   verbose = FALSE)

gbmPred_87 <- predict(gbmFit_87, newdata = cmp_test_87)
gbmResults_87 <- postResample(pred = gbmPred_87, obs = cmp_test_87$Yield)

Cubist

set.seed(5889)
cubistFit_87 <- train(Yield ~ ., data = cmp_train_87,
                      method = "cubist",
                      preProcess = c("center", "scale"),
                      trControl = ctrl_87)

cubistPred_87 <- predict(cubistFit_87, newdata = cmp_test_87)
cubistResults_87 <- postResample(pred = cubistPred_87, obs = cmp_test_87$Yield)

Single Tree

set.seed(5889)
rpartFit_87 <- train(Yield ~ ., data = cmp_train_87,
                     method = "rpart",
                     preProcess = c("center", "scale"),
                     trControl = ctrl_87)

rpartPred_87 <- predict(rpartFit_87, newdata = cmp_test_87)
rpartResults_87 <- postResample(pred = rpartPred_87, obs = cmp_test_87$Yield)

Comparing model performance

model_perf_87 <- data.frame(
  Model = c("Random Forest", "Boosted Tree (GBM)", "Cubist", "Single Tree"),
  RMSE = c(rfResults_87["RMSE"], gbmResults_87["RMSE"], 
           cubistResults_87["RMSE"], rpartResults_87["RMSE"]),
  Rsquared = c(rfResults_87["Rsquared"], gbmResults_87["Rsquared"], 
               cubistResults_87["Rsquared"], rpartResults_87["Rsquared"]),
  MAE = c(rfResults_87["MAE"], gbmResults_87["MAE"], 
          cubistResults_87["MAE"], rpartResults_87["MAE"])
)
model_perf_87
               Model     RMSE  Rsquared       MAE
1      Random Forest 1.310293 0.5746280 0.9714808
2 Boosted Tree (GBM) 1.180022 0.6508292 0.8887212
3             Cubist 1.232556 0.6433936 0.7392549
4        Single Tree 1.553991 0.3805440 1.2000345

(a) Which tree-based regression model gives the optimal resampling and test set performance?

The Boosted Tree (GBM) model gives the lowest RMSE and the highest R-squared, meaning it made the most accurate predictions on the test set (lowest typical error) and explained the largest proportion of the variance in Yield across test samples. The Cubist model had a better MAE (0.74 vs. 0.89), which suggests smaller typical absolute errors, but its RMSE was higher and its R-squared lower, indicating it did not handle larger errors or explain variance as well. Overall, the Boosted Tree (GBM) model gives the optimal test set performance among the four models; a resampling comparison is sketched below as a complement.
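
Since every fit used the same seed and the same repeated cross-validation setup in ctrl_87, the resampling results can be lined up with caret’s resamples(). A short sketch to complement the test-set table above (it assumes the fold assignments match across the four train() calls):

# Resampling (10-fold CV repeated 5 times) comparison across the four fits
resamps_87 <- resamples(list(RF     = rfFit_87,
                             GBM    = gbmFit_87,
                             Cubist = cubistFit_87,
                             CART   = rpartFit_87))
summary(resamps_87)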

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

gbm_importance <- varImp(gbmFit_87, scale = TRUE)

plot(gbm_importance, top = 10, main = "Top 10 Most Important Predictors")

The most important predictors in the optimal tree-based model, Boosted Trees (GBM), are a mix of manufacturing process and biological variables. The top predictor by far is ManufacturingProcess32, followed by ManufacturingProcess31 and then BiologicalMaterial03. Eight of the top 10 are manufacturing process variables and two are biological materials, so the process variables dominate the list. That said, the two biological materials rank 3rd and 4th, so they are still important, just less so than the manufacturing processes.
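
For a numeric top-10 list to go with the plot (using the gbm_importance object from above):

# Top 10 scaled importance scores from the GBM fit
gbm_imp_df <- gbm_importance$importance
head(gbm_imp_df[order(-gbm_imp_df$Overall), , drop = FALSE], 10)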

Compared to the optimal linear model in 6.3, where the first 8 of the top 10 important predictors were manufacturing variables, and the optimal nonlinear model in 7.5, where 7 of the top 10 were manufacturing variables but the 5th most important was a biological material, the GBM model falls somewhere between the two. In all three models, ManufacturingProcess32 sits at the very top of the importance ranking. Other top predictors overlap as well: ManufacturingProcess17 appears in the top 5 for all three models, while the Boosted Trees (GBM) model also surfaces predictors not previously seen in the top 10, such as ManufacturingProcess01. BiologicalMaterial03 and BiologicalMaterial12 appeared in the top 10 for the optimal nonlinear model and in the top 20 for the optimal linear model.

This reinforces the value of tree-based ensembles like boosting, which are flexible enough to pick up both strong linear effects and localized nonlinear patterns, leading to improved prediction performance and enriched feature insights.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)
Loading required package: rpart
rpart.plot(rpartFit_87$finalModel,
           box.palette = "Blues", 
           shadow.col = "gray", 
           nn = TRUE,
           main = "Optimal Single Tree for Yield")

The single tree model split the data on ManufacturingProcess32, reinforcing its role as the most important predictor. The model’s structure is simple, with a single decision point at a threshold of 0.19, separating the data into two groups with noticeably different average yields. The tree algorithm decided that ManufacturingProcess32 alone was the best split, and that adding more predictors wouldn’t meaningfully improve RMSE, so it stopped here.

If ManufacturingProcess32 < 0.19, the model sends the data left to node 2, which has a predicted Yield of 39 and includes 60% of the observations. If ManufacturingProcess32 >= 0.19, the data goes right to node 3, which predicts a Yield of 42 for the remaining 40%.
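
For reference, the same split can also be read as a text rule via rpart.plot’s rpart.rules(). This is a small optional sketch; note that the threshold is on the centered-and-scaled predictor scale because of the preProcess step, and caret may display the response under its internal name (.outcome).

# Print the fitted tree as rules (roundint = FALSE avoids integer-rounding warnings)
rpart.rules(rpartFit_87$finalModel, roundint = FALSE)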

Compared to more complex models, this plot offers greater interpretability at the expense of predictive accuracy. However, it provides a straightforward and actionable insight: increasing ManufacturingProcess32 above 0.19 is associated with a higher predicted yield.

This view supports the observation from prior assignments that ManufacturingProcess32 is a key driver of yield.