Exercises

Applied Predictive Modeling
Chapter 8 Regression Trees and Rule-Based Models

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

knit_table(rownames_to_column(rfImp1, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
Random Forest Variable Importance
Predictor Overall
V1 8.8428961
V4 7.7593467
V2 6.7450825
V5 2.2362828
V3 0.6783065
V6 0.1142989
V10 0.0386321
V7 0.0372475
V9 -0.0449562
V8 -0.0534964

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for the uninformative predictors (V6 – V10) are all near zero (several are slightly negative), so the random forest did not make significant use of them compared to V1-V5.
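
As a quick check (a sketch, not part of the original write-up), the forest can be refit using only V1-V5; if the out-of-bag error is essentially unchanged, the model was not relying on the uninformative predictors.

model1_reduced <- randomForest(y ~ ., data = simulated[, c(1:5, 11)],
                               importance = TRUE,
                               ntree = 1000)
model1$mse[model1$ntree]                  # OOB MSE with all ten predictors
model1_reduced$mse[model1_reduced$ntree]  # OOB MSE with V1-V5 only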


b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)

knit_table(rownames_to_column(rfImp2, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
Random Forest Variable Importance
Predictor Overall
V4 7.1870160
V2 6.3089082
V1 6.0083194
duplicate1 3.6181017
V5 2.1310402
V3 0.5716045
V6 0.2113046
V7 0.0251004
V10 0.0248783
V9 -0.0036795
V8 -0.1169800

The importance score for V1 did change: it decreased. The top predictor is now V4, followed by V2, V1, and then the duplicated, highly correlated predictor.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9451703
model3 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

knit_table(rownames_to_column(rfImp3, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
Random Forest Variable Importance
Predictor Overall
V4 7.3629103
V2 6.2444652
V1 4.8055879
duplicate2 2.9624430
duplicate1 2.4952351
V5 2.2847092
V3 0.4479551
V6 0.2551798
V10 0.0145412
V7 -0.0117712
V9 -0.0318780
V8 -0.0621027

When another predictor that is highly correlated with V1 is added, the importance score for V4 increases, while the importance scores for V1 and its correlated duplicates decrease further, since the importance formerly attributed to V1 is now shared across the correlated predictors.
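
As a rough illustration of this dilution (a sketch using the importance objects computed above), the importance attributed to V1 alone in the first model is now shared among V1 and its two correlated copies:

rfImp1["V1", "Overall"]                                     # V1 alone, original model
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"]) # V1 plus its duplicates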


c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforestModel <- cforest(y ~ .,
        data = simulated[,c(1:11)])

knit_table(rownames_to_column(data.frame(Importance = varimp(cforestModel, conditional = TRUE)), var = "Predictor") |>
  arrange(desc(Importance)), 'Conditional Random Forest Variable Importance')
Conditional Random Forest Variable Importance
Predictor Importance
V4 6.8143143
V1 5.7881278
V2 5.3010943
V5 1.2480326
V7 0.0082914
V8 0.0082136
V3 0.0037292
V6 0.0019265
V10 0.0002662
V9 -0.0124501

These importances show a similar pattern to the traditional random forest model, with V1, V2, V4, and V5 clearly more important than the rest, although V3's importance drops essentially to zero under the conditional measure.
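
For a direct comparison (a sketch re-using the fitted cforest object), the traditional and conditional measures can be computed side by side; conditional = FALSE gives the usual permutation importance, while conditional = TRUE gives the Strobl et al. (2007) version:

trad_imp <- varimp(cforestModel, conditional = FALSE)
cond_imp <- varimp(cforestModel, conditional = TRUE)
round(data.frame(Traditional = trad_imp, Conditional = cond_imp), 4)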


d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Tree

# boosted tree
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)

knit_table(rownames_to_column(rfImp5, var = "Predictor") |> arrange(desc(Overall)), 'Boosted Tree Variable Importance')
Boosted Tree Variable Importance
Predictor Overall
V1 4634.0234
V2 4316.6044
V4 4287.0793
V5 1844.2147
V3 1310.7178
V6 416.3966
V7 368.9383
V10 250.6263
V9 246.1400
V8 224.4769

The boosted tree importance scores are on a much larger scale; however, the same predictors (V1-V5) remain the most significant. With the boosted tree, V2 ranks second rather than third.
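
The scale difference is expected: for a gbm fit, varImp reports each predictor's relative influence without normalization. The same ranking can be read directly from the final model (a sketch; summary.gbm scales relative influence to sum to 100):

summary(gbmTune$finalModel, plotit = FALSE)  # relative influence, normalized to 100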


Cubist

# cubist
set.seed(100)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

knit_table(rownames_to_column(rfImp6, var = "Predictor") |> arrange(desc(Overall)), 'Cubist Variable Importance')
Cubist Variable Importance
Predictor Overall
V1 72.0
V2 54.5
V4 49.0
V3 42.0
V5 40.0
V6 0.0
V7 0.0
V8 0.0
V9 0.0
V10 0.0

The Cubist model reports importance on a different scale. The same predictors (V1-V5) remain the most significant; with Cubist, V2 ranks second rather than third, and V3 ranks fourth rather than fifth.
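
Cubist's importance is usage-based: varImp combines how often each predictor appears in rule conditions and in the rules' linear models. A sketch of inspecting this directly (assuming the Cubist package's usage component on the final model):

cubistTuned$finalModel$usage  # percent usage in conditions and in linear models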


8.2

Use a simulation to show tree bias with different granularities.

set.seed(123)

V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1 + V3

df <- data.frame(V1, V2, V3, y)

model7 <- rpart(y ~ ., data = df)

rpart.plot(model7)

varImp(model7)
##      Overall
## V1 3.6830790
## V2 2.5832190
## V3 0.1699149

Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors. The simulation illustrates this: even though each predictor enters y with the same coefficient, V1, the least granular predictor (it spans the widest range of values), is ranked most important, while V3, whose values are concentrated in a narrow band, is ranked least important.
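
A more direct illustration of the selection bias (a sketch added here; the parameters are assumptions, not from the original) is to give the tree two uninformative predictors that differ only in granularity and count how often the root split uses the fine-grained one:

# Both predictors are unrelated to y; X2 simply has many more distinct values
set.seed(123)
first_split <- replicate(200, {
  X1 <- sample(1:2, 500, replace = TRUE)   # coarse: 2 distinct values
  X2 <- rnorm(500)                         # fine-grained: ~500 distinct values
  y  <- rnorm(500)                         # pure noise
  fit <- rpart(y ~ X1 + X2, data = data.frame(X1, X2, y),
               control = rpart.control(cp = 0, maxdepth = 1))
  rownames(fit$splits)[1]                  # predictor used for the root split
})
table(first_split)  # the fine-grained X2 is selected far more often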


8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.

a

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction is the fraction of the training data sampled for each iteration; it adds randomness and reduces variance (overfitting).

The learning rate is the fraction of each new tree's predictions that is added to the previous iteration's predicted values. A small learning rate shrinks the impact of each tree, requiring more trees to reach the same performance but resulting in a better-generalized model.

Since the model on the right has a high bagging fraction (0.9) and a high learning rate (0.9), it uses most of the data in each iteration and learns quickly (greedily), which makes it prone to overfitting. Because each tree makes a large contribution, the strongest predictors dominate the early trees, so the model focuses its importance on just the first few predictors.

The model on the left has a low bagging fraction (0.1) and a low learning rate (0.1). It uses less data in each iteration and learns more slowly, which takes more computation time but incorporates information from more predictors; this model often performs better.
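
As a sketch of how the two extremes in Fig. 8.24 could be reproduced on the solubility data (the remaining settings, such as tree depth and number of trees, are assumptions here): in gbm, bag.fraction is the bagging fraction and shrinkage is the learning rate.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_left  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 5,
                 shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 5,
                 shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: expected to be spread out on the left, concentrated on the right
head(summary(gbm_left,  plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)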


b

Which model do you think would be more predictive of other samples?

I think the model on the left would be more predictive of other samples because it is less overfit and generalizes better across diverse samples.


c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth (tree depth) would allow less important predictors to take part in more splits and therefore contribute more, spreading importance across more predictors and flattening (decreasing) the slope of the predictor importance profile for either model.
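
Building on the solubility sketch from part (a) (same assumed setup), varying only interaction.depth shows the effect on the importance profile:

set.seed(100)
gbm_shallow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 1000, interaction.depth = 1,
                   shrinkage = 0.1, bag.fraction = 0.5)
set.seed(100)
gbm_deep    <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 1000, interaction.depth = 10,
                   shrinkage = 0.1, bag.fraction = 0.5)

# deeper trees allow more predictors to enter splits, flattening the drop-off
head(summary(gbm_shallow, plotit = FALSE), 10)
head(summary(gbm_deep,    plotit = FALSE), 10)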


8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

a

Which tree-based regression model gives the optimal resampling and test set performance?

Impute, Transform, and Scale the Data

data(ChemicalManufacturingProcess)

# impute, transform, and standardize the data
data_prep_process <- preProcess(
  ChemicalManufacturingProcess, 
  method = c("knnImpute", "BoxCox", "center", "scale"))
prep_data <- predict(data_prep_process, ChemicalManufacturingProcess)

# identify predictors with zero or near-zero variance
nzv <- nearZeroVar(prep_data)
colnames(prep_data)[nzv]
# remove columns with zero or near-zero variance
prep_data <- prep_data[, -nzv]

Split the Data

# Set the random number seed so we can reproduce the results
set.seed(20)
trainingRows <- createDataPartition(prep_data$Yield, p = .80, list= FALSE)

#split data into train and test sets
train_chem <- prep_data[trainingRows, ]
test_chem <- prep_data[-trainingRows, ]

Single Tree

set.seed(100)
rpartTune <- train(Yield~., data=train_chem,
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

rpart_pred <- predict(rpartTune, test_chem)
rpart_results_chem = postResample(pred = rpart_pred, obs = test_chem$Yield)
rpart_results_chem
##      RMSE  Rsquared       MAE 
## 0.7739539 0.4304565 0.6472335

Bagged Tree

baggedTree <- bagging(Yield~., data=train_chem)

bagged_pred <- predict(baggedTree, test_chem)
bagged_results_chem = postResample(pred = bagged_pred, obs = test_chem$Yield)
bagged_results_chem
##      RMSE  Rsquared       MAE 
## 0.6260646 0.5316352 0.4960005

Random Forest

set.seed(100)
rfModel <- randomForest(Yield~., data=train_chem,
                        importance = TRUE,
                        ntree = 1000)

rf_pred <- predict(rfModel, test_chem)
rf_results_chem = postResample(pred = rf_pred, obs = test_chem$Yield)
rf_results_chem
##      RMSE  Rsquared       MAE 
## 0.5528423 0.6622202 0.4489977

Boosted Tree

set.seed(100)
gbmTune_chem <- train(Yield~., data=train_chem,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmTune_chem_pred <- predict(gbmTune_chem, test_chem)
gbmTune_results_chem = postResample(pred = gbmTune_chem_pred, obs = test_chem$Yield)
gbmTune_results_chem
##      RMSE  Rsquared       MAE 
## 0.5276659 0.6703243 0.4451713

Cubist

cubistTuned_chem <- train(Yield~., data=train_chem,
                    method = "cubist")

cubistTuned_chem_pred <- predict(cubistTuned_chem, test_chem)
cubistTuned_results_chem = postResample(pred = cubistTuned_chem_pred, obs = test_chem$Yield)
cubistTuned_results_chem
##      RMSE  Rsquared       MAE 
## 0.4624855 0.7530501 0.3558033

Results

# compile all model performance results
final_results_chem = rbind(
  rpart_results_chem,
  bagged_results_chem,
  rf_results_chem,
  gbmTune_results_chem,
  cubistTuned_results_chem
)
# add model name column
final_results_chem = data.frame(Model = c('Single Tree', 'Bagged', 'Random Forest', 'Boosted', 'Cubist'), final_results_chem)
rownames(final_results_chem) <- NULL
# output results
knit_table(final_results_chem |> arrange(RMSE), 'Model Performance Metrics')
Model Performance Metrics
Model RMSE Rsquared MAE
Cubist 0.4624855 0.7530501 0.3558033
Boosted 0.5276659 0.6703243 0.4451713
Random Forest 0.5528423 0.6622202 0.4489977
Bagged 0.6260646 0.5316352 0.4960005
Single Tree 0.7739539 0.4304565 0.6472335

The Cubist model gives the optimal resampling and test set performance: it has the lowest RMSE and MAE and the highest R-squared of the five models.
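
Test-set metrics are only half of the comparison. A sketch of how the resampling performance could also be compared (assuming the tree models are re-fit with caret::train using a shared trainControl, so that resamples() can match their resampling folds):

ctrl <- trainControl(method = "cv", number = 10)
# ... re-fit rpartTune, gbmTune_chem, and cubistTuned_chem with trControl = ctrl ...
resamps <- resamples(list(SingleTree = rpartTune,
                          Boosted    = gbmTune_chem,
                          Cubist     = cubistTuned_chem))
summary(resamps)                  # distribution of resampled RMSE / R-squared / MAE
bwplot(resamps, metric = "RMSE")  # compare resampled RMSE across models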

b

  • Which predictors are most important in the optimal tree-based regression model?
  • Do either the biological or process variables dominate the list?
  • How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cubistTuned_chem), top = 10,
     main = "Cubist Model Variable Importance - Top 10")

# cubist_vi <- varImp(cubistTuned_chem$finalModel, scale = FALSE)
# knit_table(rownames_to_column(cubist_vi, var = "Predictor") |> arrange(desc(Overall)), 'Cubist Variable Importance')

Process variables dominate the most important predictors of the Cubist model: they account for 60% of the top 5 and 70% of the top 10 most important predictors. The same pattern appears in the top 10 predictors of the optimal linear and nonlinear models fit to these data in the previous two assignments.
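
A quick tally of that split (a sketch; predictor names in this data set begin with either BiologicalMaterial or ManufacturingProcess):

cubist_imp <- varImp(cubistTuned_chem)$importance
top10 <- rownames(cubist_imp)[order(cubist_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))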

c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartTree <- rpart(Yield~., data=train_chem)

rpart.plot(rpartTree)

This view provides an interpretable visualization with the reasoning behind each split, which gives insight into the predictors' relationship with Yield, including the direction of that relationship. For example, the decision tree highlights that ManufacturingProcess32 is the primary driver of Yield, with the first split at a (scaled) threshold of 0.22.
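
The rpart.plot view shows only the mean yield in each node. To display the distribution of Yield in the terminal nodes, as the question asks, one option (a sketch, assuming the partykit package is available) is to convert the rpart fit and plot it; for a regression tree this draws a boxplot of Yield at each terminal node.

library(partykit)
plot(as.party(rpartTree))  # terminal nodes show boxplots of Yield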