library(tidyverse)
library(caret)
library(pls)
library(glmnet)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)
library(AppliedPredictiveModeling)
library(mlbench)
library(earth)
library(kernlab)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
Assignment 9: Regression Trees and Rule-Based Models
This assignment covers Exercises 8.1, 8.2, 8.3, and 8.7 from Kuhn and Johnson’s Applied Predictive Modeling. Link to Applied Predictive Modeling for reference.
8.1
Recreate the simulated data from Exercise 7.2:
set.seed(5889)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Now we have a dataset with 10 predictors (V1-V10) and a response (y).
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
set.seed(5889)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
         Overall
V1 5.04568474
V2 5.47816849
V3 0.59183147
V4 11.35909806
V5 2.39778578
V6 0.04174732
V7 0.07879051
V8 0.17006476
V9 -0.08921946
V10 0.17140784
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
As expected, the informative predictors V1-V5 show much higher importance than the uninformative predictors V6-V10, whose scores are near zero (V9 is even slightly negative). The random forest model did not significantly use V6-V10. V4 has the highest importance, confirming that the model correctly focuses on the true predictors.
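The conditional inference forest results in part (c) below include two extra predictors, duplicate1 and duplicate2, that were added to the data in part (b) of this exercise; the code for that step is not shown in this document. A minimal sketch of what it likely looked like, following the textbook's approach (the 0.1 noise scale is an assumption):
# Sketch of the part (b) step: add predictors highly correlated with V1.
# The exact noise level (0.1) follows the textbook's example and is an assumption here.
set.seed(5889)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)  # should be close to 1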
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
First we’ll fit a random forest model using conditional inference trees. Then we’ll compute variable importance the traditional way and then the conditional way.
f <- as.formula(y ~ .)
set.seed(5889)
cf_model <- cforest(f, data = simulated,
controls = cforest_unbiased(mtry = 5, ntree = 1000))
cf_imp_traditional <- varimp(cf_model, conditional = FALSE)
cf_imp_conditional <- varimp(cf_model, conditional = TRUE)
cf_imp_traditional
          V1           V2           V3           V4           V5           V6
2.318941908 4.916702156 0.105360047 11.118553721 1.773865705 -0.041995949
V7 V8 V9 V10 duplicate1 duplicate2
0.030269845 -0.009877498 -0.050613852 0.039145539 1.796421200 1.262580326
cf_imp_conditional
           V1            V2            V3            V4            V5
1.5995865863 3.4616445083 0.0245315080 8.9905780961 1.3208012588
V6 V7 V8 V9 V10
-0.0156487542 0.0041797095 0.0008722701 -0.0318388378 0.0182324740
duplicate1 duplicate2
0.6019709275 0.3493596000
In the previous part (where the correlated duplicate predictors were added), we saw that when variables are highly correlated, a random forest splits the importance between them. This made it hard to tell how important V1 really was on its own.
The conditional inference forest uses a different way to measure importance that accounts for this correlation. When we use conditional = TRUE, the importance score for V1 doesn’t drop as much even with duplicates. This method gives a better picture of which variables are truly important by trying not to confuse shared information with shared importance.
What we see with the traditional cforest importance is a pattern similar to the previous part: the importance is split across V1 and its duplicates, so the contribution of the underlying signal is diluted across all three correlated variables (V1 and its two duplicates).
When we look at the conditional importance from the cforest model, the importance is not split as evenly across the correlated variables: a larger share goes to V1 rather than to its duplicates. All of the scores for V1 and its duplicates shrink, but V1 remains the most important and the duplicates' scores are reduced much more. (While V1 has a lower score under the conditional measure than the traditional one, the comparison that matters is relative to the duplicates.) This makes the ranking of variable importance clearer and more meaningful, especially when predictors are correlated.
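To make the comparison easier to read, the two importance vectors can be placed side by side. This is a small optional sketch that only uses the objects created above.
# Optional sketch: traditional vs. conditional importance, sorted by the traditional score
cf_imp_compare <- data.frame(traditional = cf_imp_traditional,
                             conditional = cf_imp_conditional)
cf_imp_compare[order(-cf_imp_compare$traditional), ]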
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
First we’ll repeat this process with boosted trees using gbm.
set.seed(5889)
gbm_model <- gbm(
formula = y ~ .,
data = simulated,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01,
verbose = FALSE
)
Now we’ll view the boosted trees variable importance:
gbm_importance <- summary(gbm_model)
gbm_importance
                  var    rel.inf
V4 V4 33.2131787
V2 V2 22.0484596
V1 V1 15.3806323
V5 V5 11.9388205
V3 V3 6.5931846
duplicate1 duplicate1 5.2553147
V8 V8 1.7071948
V10 V10 1.5140607
V6 V6 0.8980554
V7 V7 0.7831722
V9 V9 0.6679265
duplicate2 duplicate2 0.0000000
gbm_importance[order(-gbm_importance$rel.inf), ]
                  var    rel.inf
V4 V4 33.2131787
V2 V2 22.0484596
V1 V1 15.3806323
V5 V5 11.9388205
V3 V3 6.5931846
duplicate1 duplicate1 5.2553147
V8 V8 1.7071948
V10 V10 1.5140607
V6 V6 0.8980554
V7 V7 0.7831722
V9 V9 0.6679265
duplicate2 duplicate2 0.0000000
Now we’ll look at a cubist model.
First we have to separate x and y, and then we’ll fit the cubist model
x <- simulated[, setdiff(names(simulated), "y")]
y <- simulated$y
set.seed(5889)
cubist_model <- cubist(x = x, y = y)
summary(cubist_model)
Call:
cubist.default(x = x, y = y)
Cubist [Release 2.07 GPL Edition] Sun Apr 27 16:00:13 2025
---------------------------------
Target attribute `outcome'
Read 200 cases (13 attributes) from undefined.data
Model:
Rule 1: [42 cases, mean 11.832113, range 2.174002 to 18.99647, est err 1.477235]
if
V2 <= 0.1902741
then
outcome = 1.398953 + 10.1 V4 + 10.5 V2 + 6 V5 - 1.3 V6 + 1.1 V1
Rule 2: [12 cases, mean 12.390821, range 7.595099 to 20.15425, est err 2.541644]
if
V4 <= 0.04400872
then
outcome = 5.955302 + 149.4 V4 + 6 V5 + 0.8 V2 + 0.7 V1
Rule 3: [62 cases, mean 14.821066, range 4.265033 to 23.09879, est err 1.300391]
if
V1 <= 0.898298
V2 > 0.1902741
V3 > 0.4662886
V5 <= 0.9053603
then
outcome = -8.990035 + 11.6 V1 + 10.8 V4 + 9 V3 + 7.5 V2 + 5.7 V5
Rule 4: [51 cases, mean 16.357525, range 5.881745 to 23.04229, est err 1.272446]
if
V1 <= 0.898298
V2 > 0.1902741
V3 <= 0.4662886
V5 <= 0.9053603
then
outcome = 2.806488 + 10.1 V4 + 9.9 V1 - 9.6 V3 + 6.2 V2 + 4.4 V5
Rule 5: [21 cases, mean 16.725746, range 10.37667 to 23.3855, est err 2.270617]
if
V1 > 0.898298
V2 > 0.1902741
V4 > 0.04400872
then
outcome = 6.15789 + 9.5 V4 + 5 V5 + 4.6 V10
Rule 6: [19 cases, mean 17.492044, range 10.4218 to 23.84154, est err 1.539024]
if
V1 <= 0.898298
V2 > 0.1902741
V4 > 0.04400872
V5 > 0.9053603
then
outcome = -63.126015 + 73.1 V5 + 9.2 V4 + 9.5 V1 + 2.9 V2
Evaluation on training data (200 cases):
Average |error| 1.423071
Relative |error| 0.36
Correlation coefficient 0.92
Attribute usage:
Conds Model
94% 90% V2
74% 90% V1
64% 100% V5
55% 55% V3
25% 100% V4
20% V6
10% V10
Time: 0.0 secs
Both the boosted trees (GBM) and the Cubist models were able to identify the truly informative predictors and mostly ignore the noise variables. In the GBM results, variables like V4, V2, V1, V5, and V3 had the highest importance scores, which matches what we’d expect based on the original simulation. Variables like duplicate1, V6, V8, and V10 showed up with lower importance, meaning the model didn’t rely heavily on them.
In the Cubist model, we saw a similar pattern. Predictors like V2, V1, V5, and V4 were used most frequently in the rules, while uninformative variables like V6 and V10 were rarely used. Notably, the Cubist model didn’t even list the duplicates in its summary output, which suggests that it completely ignored them.
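As a quick numeric summary of the Cubist usage statistics, caret also provides a varImp() method for Cubist objects that combines the condition and model usage percentages into a single score. A small sketch using the model fit above:
# Sketch: one importance score per predictor derived from the Cubist usage statistics
cubist_imp <- varImp(cubist_model)
cubist_imp[order(-cubist_imp$Overall), , drop = FALSE]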
Compared to traditional random forests, both models seem to be better at down-weighting the unimportant predictors and focusing more cleanly on the truly relevant ones. This makes their importance scores easier to interpret when there’s noise or redundant information in the data.
8.2 Use a simulation to show tree bias with different granularities.
We’ll use the simulation setup outlined in Section 8.1 of Applied Predictive Modeling, which discusses selection bias toward predictors with more distinct values. Unlike the earlier code blocks, this one includes comments so we can refer back and see what each step is doing.
set.seed(5889)
n <- 200 # n is the number of observations
# Creating binary predictors (10)
binary_predictors <- replicate(10, sample(0:1, n, replace = TRUE))
colnames(binary_predictors) <- paste0("Bin", 1:10)
# Creating categorical predictors (5) with 4 levels
categorical_predictors <- replicate(5, sample(letters[1:4], n, replace = TRUE))
colnames(categorical_predictors) <- paste0("Cat", 1:5)
# Creating continuous predictors (5)
continuous_predictors <- replicate(5, rnorm(n))
colnames(continuous_predictors) <- paste0("Cont", 1:5)
# Combining predictors to X
X <- data.frame(binary_predictors, categorical_predictors, continuous_predictors)
# Converting categorical columns to factors
X[paste0("Cat", 1:5)] <- lapply(X[paste0("Cat", 1:5)], factor)
# Creating noise response
Y <- rnorm(n)
# Final data
sim_data <- cbind(Y, X)
# Fitting a random forest
set.seed(5889)
rf_bias <- randomForest(Y ~ ., data = sim_data, importance = TRUE)
# Now we view variable importance
importance(rf_bias)
          %IncMSE IncNodePurity
Bin1 3.7532171 4.920837
Bin2 1.4660695 3.123283
Bin3 -0.3273303 3.939556
Bin4 -0.9805770 2.587228
Bin5 -0.6816732 3.191380
Bin6 0.6648496 3.363189
Bin7 1.3463142 4.182530
Bin8 3.7662354 5.360696
Bin9 -1.2660624 3.687873
Bin10 -0.2847020 2.532573
Cat1 2.5957344 12.157016
Cat2 -0.3810364 10.855587
Cat3 2.0364899 13.028446
Cat4 0.9073277 14.016578
Cat5 -0.1452788 12.358564
Cont1 -0.5535166 23.344385
Cont2 0.1281624 21.852921
Cont3 -1.3081358 25.289685
Cont4 -0.3740921 24.354348
Cont5 0.4550090 25.451645
We simulated data with three types of predictors: 10 binary, 5 categorical (4 levels each), and 5 continuous. The outcome (Y) was noise and had no relationship with any predictor.
When fitting a random forest, we can observe that variables with more unique values (especially the continuous predictors) received higher importance scores, despite having no real predictive power.
This simulation illustrates selection bias in tree-based models, where predictors with more potential split points are favored during tree construction. It is a well-known issue and a reason why importance scores from trees should be interpreted with caution, especially when predictors differ in granularity.
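As an optional follow-up (not part of the original exercise), a conditional inference forest fit to the same noise data should show much flatter importance scores across the binary, categorical, and continuous predictors, since its split selection is designed to reduce this bias. A sketch using the party package loaded above:
# Sketch: conditional inference forest on the same simulated noise data;
# its unbiased split selection should not systematically favor the continuous predictors
set.seed(5889)
cf_bias <- cforest(Y ~ ., data = sim_data,
                   controls = cforest_unbiased(ntree = 500))
sort(varimp(cf_bias), decreasing = TRUE)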
8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
When the learning rate and bagging fraction are both set high (as in the right plot, where both are set to 0.9), each individual tree makes a bigger impact and has access to nearly the full training data. This can cause the model to focus quickly on the strongest predictors and rely heavily on them, leading to a steep dropoff in importance for the rest.
The left model uses more conservative settings (both tuning parameters are set to 0.1), which slow down the learning process. This means more trees are needed to converge, and along the way the model has more chances to explore weaker predictors. As a result, the model spreads importance across more variables.
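To see this behavior directly, we could refit boosting on the solubility data with the two extreme settings from Figure 8.24. This is a sketch only: the choices of n.trees = 1000 and interaction.depth = 7 are assumptions, not the book's exact values.
# Sketch: boosting on the solubility data with both parameters at 0.1 vs. both at 0.9
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(5889)
gbm_low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 7,
               shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
set.seed(5889)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbm_low, plotit = FALSE), 10)   # expected to spread importance across many predictors
head(summary(gbm_high, plotit = FALSE), 10)  # expected to concentrate importance in the top few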
(b) Which model do you think would be more predictive of other samples?
The model on the left, which uses a lower bagging fraction and learning rate (shrinkage), will likely generalize better to new data. It is less likely to overfit the training set because it learns more slowly and uses only a subset of the data at each stage, so I think the model on the left would be more predictive of other samples.
The model on the right is at higher risk of overfitting. It relies heavily on a few predictors and may not adapt well to new samples.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth allows deeper, more complex trees that can capture interactions between variables. For the right-hand model, this might further intensify the focus on the top predictors; for the left-hand model, deeper trees give more subtle or interacting predictors a chance to show their usefulness. In either case, changing the interaction depth would change the slope of the predictor-importance profile.
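A quick way to probe this empirically (again a sketch, reusing the solTrain object from the sketch above) is to refit the low-shrinkage model with a much deeper interaction depth and compare the importance profile:
# Sketch: same low-shrinkage settings as gbm_low, but with deeper trees,
# to see whether importance spreads to additional (possibly interacting) predictors
set.seed(5889)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 15,
                shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
head(summary(gbm_deep, plotit = FALSE), 10)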
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
First we’ll impute missing values with bagImpute, then remove near-zero variance predictors, then split the data into train/test sets, and then set up the train control object.
set.seed(5889)
pre_87 <- preProcess(ChemicalManufacturingProcess, method = c("bagImpute"))
cmp_imputed_87 <- predict(pre_87, ChemicalManufacturingProcess)
zero_vars_87 <- nearZeroVar(cmp_imputed_87)
cmp_data_87 <- cmp_imputed_87[, -zero_vars_87]
train_index_87 <- createDataPartition(cmp_data_87$Yield, p = 0.75, list = FALSE)
cmp_train_87 <- cmp_data_87[train_index_87, ]
cmp_test_87 <- cmp_data_87[-train_index_87, ]
ctrl_87 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
Now that we’ve done the same steps as before, we can move on to train several tree-based models. We’ll train:
- Random Forest
- Boosted Trees
- Cubist
- Single Tree
Random Forest
set.seed(5889)
rfFit_87 <- train(Yield ~ ., data = cmp_train_87,
method = "rf",
preProcess = c("center", "scale"),
trControl = ctrl_87,
importance = TRUE)
rfPred_87 <- predict(rfFit_87, newdata = cmp_test_87)
rfResults_87 <- postResample(pred = rfPred_87, obs = cmp_test_87$Yield)
Boosted Tree (GBM)
set.seed(5889)
gbmFit_87 <- train(Yield ~ ., data = cmp_train_87,
method = "gbm",
preProcess = c("center", "scale"),
trControl = ctrl_87,
verbose = FALSE)
gbmPred_87 <- predict(gbmFit_87, newdata = cmp_test_87)
gbmResults_87 <- postResample(pred = gbmPred_87, obs = cmp_test_87$Yield)
Cubist
set.seed(5889)
cubistFit_87 <- train(Yield ~ ., data = cmp_train_87,
method = "cubist",
preProcess = c("center", "scale"),
trControl = ctrl_87)
cubistPred_87 <- predict(cubistFit_87, newdata = cmp_test_87)
cubistResults_87 <- postResample(pred = cubistPred_87, obs = cmp_test_87$Yield)
Single Tree
set.seed(5889)
rpartFit_87 <- train(Yield ~ ., data = cmp_train_87,
method = "rpart",
preProcess = c("center", "scale"),
trControl = ctrl_87)
rpartPred_87 <- predict(rpartFit_87, newdata = cmp_test_87)
rpartResults_87 <- postResample(pred = rpartPred_87, obs = cmp_test_87$Yield)
Comparing model performance
model_perf_87 <- data.frame(
Model = c("Random Forest", "Boosted Tree (GBM)", "Cubist", "Single Tree"),
RMSE = c(rfResults_87["RMSE"], gbmResults_87["RMSE"],
cubistResults_87["RMSE"], rpartResults_87["RMSE"]),
Rsquared = c(rfResults_87["Rsquared"], gbmResults_87["Rsquared"],
cubistResults_87["Rsquared"], rpartResults_87["Rsquared"]),
MAE = c(rfResults_87["MAE"], gbmResults_87["MAE"],
cubistResults_87["MAE"], rpartResults_87["MAE"])
)
model_perf_87
               Model     RMSE  Rsquared       MAE
1 Random Forest 1.310293 0.5746280 0.9714808
2 Boosted Tree (GBM) 1.180022 0.6508292 0.8887212
3 Cubist 1.232556 0.6433936 0.7392549
4 Single Tree 1.553991 0.3805440 1.2000345
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The Boosted Tree (GBM) model gives the lowest test-set RMSE and the highest R-squared, meaning it made the most accurate predictions on the test set and explained the largest proportion of variance in Yield. The Cubist model had a lower MAE, which suggests smaller typical errors, but its higher RMSE and lower R-squared indicate it did not handle larger errors or explain variance as well. Overall, the Boosted Tree (GBM) model gives the optimal resampling and test set performance (a resampling comparison across the four fits is sketched below).
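The table above compares test-set performance; to compare resampling performance as well, caret's resamples() function can summarize the cross-validation results from the four fits. A small sketch using the train objects created above:
# Sketch: compare resampling (repeated 10-fold CV) performance across the four models
resamp_87 <- resamples(list(RandomForest = rfFit_87,
                            GBM = gbmFit_87,
                            Cubist = cubistFit_87,
                            SingleTree = rpartFit_87))
summary(resamp_87)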
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
gbm_importance <- varImp(gbmFit_87, scale = TRUE)
plot(gbm_importance, top = 10, main = "Top 10 Most Important Predictors")
The most important predictors in the optimal tree-based model, Boosted Trees (GBM), are a mix of manufacturing process and biological variables. The top predictor by far is ManufacturingProcess32, followed by ManufacturingProcess31 and then BiologicalMaterial03. Eight of the top 10 are manufacturing process variables and two are biological materials, so the process variables dominate the list. The two biological materials rank 3rd and 4th, however, so they are still important, just less so than the manufacturing processes.
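The split between process and biological variables in the top 10 can be confirmed directly from the importance object (a small sketch using gbm_importance from above):
# Sketch: count how many of the GBM top-10 predictors are process vs. biological variables
gbm_imp_df <- gbm_importance$importance
top10_87 <- rownames(gbm_imp_df)[order(-gbm_imp_df$Overall)][1:10]
table(ifelse(grepl("^Manufacturing", top10_87), "Process", "Biological"))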
Compared to the optimal linear model from 6.3, where manufacturing predictors made up the first 8 of the top 10, and the optimal nonlinear model from 7.5, where 7 of the top 10 were manufacturing variables and the 5th most important was a biological material, the GBM model falls somewhere in between. In all three models, ManufacturingProcess32 sits at the very top of the importance ranking. Other predictors overlap as well: ManufacturingProcess17 appears in the top 5 for all three models, while the GBM model also surfaces predictors not previously seen in the top 10, such as ManufacturingProcess01. BiologicalMaterial03 and BiologicalMaterial12 appeared in the top 10 for the optimal nonlinear model and in the top 20 for the optimal linear model.
This reinforces the value of tree-based ensembles like boosting, which are flexible enough to pick up both strong linear effects and localized nonlinear patterns, leading to improved prediction performance and enriched feature insights.
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
Loading required package: rpart
rpart.plot(rpartFit_87$finalModel,
box.palette = "Blues",
shadow.col = "gray",
nn = TRUE,
main = "Optimal Single Tree for Yield")The single tree model split the data on ManufacturingProcess32, reinforcing its role as the most important predictor. The model’s structure is simple, with a single decision point at a threshold of 0.19, separating the data into two groups with noticeably different average yields. The tree algorithm decided that ManufacturingProcess32 alone was the best split, and that adding more predictors wouldn’t meaningfully improve RMSE, so it stopped here.
If ManufacturingProcess32 < 0.19, the model sends the data left to node 2, which has a predicted Yield of 39 and includes 60% of the observations. If ManufacturingProcess32 >= 0.19, the data goes right to node 3, which predicts a Yield of 42 for the remaining 40%.
Compared to more complex models, this plot offers greater interpretability at the expense of predictive accuracy. It provides a straightforward, actionable insight: values of ManufacturingProcess32 above the threshold of 0.19 (on the centered and scaled predictor) are associated with a higher predicted yield.
This view supports the observation from prior assignments that ManufacturingProcess32 is a key driver of yield.
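The exercise also asks for the distribution of yield in the terminal nodes, which the rpart.plot node means do not show directly. One way to display it (a sketch that uses the terminal-node assignments the rpart object stores for the training data) is a boxplot of Yield by terminal node:
# Sketch: distribution of training-set Yield within each terminal node of the single tree
node_87 <- factor(rpartFit_87$finalModel$where)
boxplot(cmp_train_87$Yield ~ node_87,
        xlab = "Terminal node", ylab = "Yield",
        main = "Distribution of Yield in Terminal Nodes")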