Applied Predictive Modeling
Chapter 8: Regression Trees and Rule-Based Models
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
knit_table(rownames_to_column(rfImp1, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
| Predictor | Overall |
|---|---|
| V1 | 8.8428961 |
| V4 | 7.7593467 |
| V2 | 6.7450825 |
| V5 | 2.2362828 |
| V3 | 0.6783065 |
| V6 | 0.1142989 |
| V10 | 0.0386321 |
| V7 | 0.0372475 |
| V9 | -0.0449562 |
| V8 | -0.0534964 |
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for the uninformative predictors (V6 – V10) are near zero (and in two cases negative), far below those of the informative predictors V1 – V5, so the random forest did not make meaningful use of them.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
knit_table(rownames_to_column(rfImp2, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
| Predictor | Overall |
|---|---|
| V4 | 7.1870160 |
| V2 | 6.3089082 |
| V1 | 6.0083194 |
| duplicate1 | 3.6181017 |
| V5 | 2.1310402 |
| V3 | 0.5716045 |
| V6 | 0.2113046 |
| V7 | 0.0251004 |
| V10 | 0.0248783 |
| V9 | -0.0036795 |
| V8 | -0.1169800 |
The importance score for V1 did change, dropping from about 8.8 to 6.0. The highest-ranked predictor is now V4, followed by V2, V1, and then the duplicated, highly correlated predictor.
What happens when you add another predictor that is also highly correlated with V1?
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9451703
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
knit_table(rownames_to_column(rfImp3, var = "Predictor") |> arrange(desc(Overall)), 'Random Forest Variable Importance')
| Predictor | Overall |
|---|---|
| V4 | 7.3629103 |
| V2 | 6.2444652 |
| V1 | 4.8055879 |
| duplicate2 | 2.9624430 |
| duplicate1 | 2.4952351 |
| V5 | 2.2847092 |
| V3 | 0.4479551 |
| V6 | 0.2551798 |
| V10 | 0.0145412 |
| V7 | -0.0117712 |
| V9 | -0.0318780 |
| V8 | -0.0621027 |
When another predictor that is highly correlated with V1 is added, the importance score for V4 increases slightly, while the scores for V1 and the correlated duplicates decrease further: the importance that V1 alone carried is now split across the correlated predictors.
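To make the dilution concrete, a small check (a sketch reusing the importance frames computed above) sums the importance attributed to V1 and its duplicates in each fit:
# Importance attributed to the "V1 group" in each fit:
# V1 alone, V1 + duplicate1, and V1 + duplicate1 + duplicate2.
c(model1 = rfImp1["V1", "Overall"],
  model2 = sum(rfImp2[c("V1", "duplicate1"), "Overall"]),
  model3 = sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"]))
From the tables above these group totals are roughly 8.8, 9.6, and 10.3, so the correlated predictors split V1's importance among themselves rather than remove it.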
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
cforestModel <- cforest(y ~ .,
data = simulated[,c(1:11)])
knit_table(data.frame(Importance = varimp(cforestModel, conditional = TRUE)) |>
arrange(desc(Importance)), 'Conditional Random Forest Variable Importance')
| Predictor | Importance |
|---|---|
| V4 | 6.8143143 |
| V1 | 5.7881278 |
| V2 | 5.3010943 |
| V5 | 1.2480326 |
| V7 | 0.0082914 |
| V8 | 0.0082136 |
| V3 | 0.0037292 |
| V6 | 0.0019265 |
| V10 | 0.0002662 |
| V9 | -0.0124501 |
These importances show a similar pattern to the traditional random forest model, with V1, V2, V4, and V5 ranked well above the remaining predictors, although V3 drops essentially to zero under the conditional measure.
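For a direct comparison within the same fit, the traditional (unconditional) permutation importance can be requested by leaving conditional at its default of FALSE; a minimal sketch:
# Traditional (unconditional) permutation importance from the same cforest fit,
# for comparison with the conditional version reported above.
uncond_imp <- varimp(cforestModel, conditional = FALSE)
knit_table(data.frame(Importance = sort(uncond_imp, decreasing = TRUE)),
           'Unconditional cforest Variable Importance')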
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# boosted tree
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)
knit_table(rownames_to_column(rfImp5, var = "Predictor") |> arrange(desc(Overall)), 'Boosted Tree Variable Importance')
| Predictor | Overall |
|---|---|
| V1 | 4634.0234 |
| V2 | 4316.6044 |
| V4 | 4287.0793 |
| V5 | 1844.2147 |
| V3 | 1310.7178 |
| V6 | 416.3966 |
| V7 | 368.9383 |
| V10 | 250.6263 |
| V9 | 246.1400 |
| V8 | 224.4769 |
The boosted tree importance scores are on a much larger scale, but the same predictors (V1 – V5) remain the most important. With the boosted tree, V2 ranks second rather than third.
# cubist
set.seed(100)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)
knit_table(rownames_to_column(rfImp6, var = "Predictor") |> arrange(desc(Overall)), 'Cubist Variable Importance')
| Predictor | Overall |
|---|---|
| V1 | 72.0 |
| V2 | 54.5 |
| V4 | 49.0 |
| V3 | 42.0 |
| V5 | 40.0 |
| V6 | 0.0 |
| V7 | 0.0 |
| V8 | 0.0 |
| V9 | 0.0 |
| V10 | 0.0 |
The Cubist importance scores are also on a different scale. The same predictors (V1 – V5) remain the most important, with V2 ranking second rather than third and V3 ranking fourth rather than fifth.
Use a simulation to show tree bias with different granularities
set.seed(123)
V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1 + V3
df <- data.frame(V1, V2, V3, y)
model7 <- rpart(y ~ ., data = df)
rpart.plot(model7)
varImp(model7)
## Overall
## V1 3.6830790
## V2 2.5832190
## V3 0.1699149
Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors. The simulation illustrates this tendency: although each predictor enters y with the same coefficient, the importance ranking follows the predictors' spread, with V1 (the widest range of values) ranked highest and V3 (tightly clustered values) ranked lowest.
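The simulation above varies the predictors' spread; a complementary sketch (a hypothetical setup, not part of the exercise) isolates granularity itself by comparing two equally uninformative predictors that differ only in their number of distinct values:
library(rpart)

set.seed(321)
first_split <- replicate(200, {
  n <- 200
  x_fine   <- runif(n)                        # ~200 distinct values
  x_coarse <- sample(1:4, n, replace = TRUE)  # only 4 distinct values
  noise_y  <- rnorm(n)                        # response unrelated to either predictor
  fit <- rpart(noise_y ~ x_fine + x_coarse,
               data = data.frame(x_fine, x_coarse, noise_y),
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 20))
  as.character(fit$frame$var[1])              # variable chosen for the first split
})
table(first_split)
Across the repeated runs the first split should land on x_fine far more often than on x_coarse even though neither predictor is related to the response, which is the selection bias described by Strobl et al. (2007).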
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction is the fraction of the training data that is randomly sampled to build each tree; it adds randomness and reduces variance (overfitting).
The learning rate (shrinkage) is the fraction of each new tree's predicted value that is added to the previous iteration's prediction. A small learning rate shrinks the impact of each tree, requiring more trees to reach the same performance but resulting in a better-generalized model.
Since the model on the right has a high bagging fraction (0.9) and a high learning rate (0.9), it sees most of the training data at every iteration and takes large, greedy steps, making it prone to overfitting. Because each tree's contribution is large, the first few strong predictors dominate the fit, so importance is concentrated on just those predictors.
The model on the left has a low bagging fraction (0.1) and a low learning rate (0.1). It includes less data in each iteration and learns slowly, taking more computation time, but it incorporates information from more predictors and spreads the importance across them; such a model often generalizes better.
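A sketch of the Fig. 8.24 setup supports this (assuming the solubility data from the AppliedPredictiveModeling package; the tree depth and number of trees below are arbitrary choices, not the figure's exact settings):
library(gbm)
library(AppliedPredictiveModeling)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbmLeft  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: expected to be concentrated in a handful of predictors
# for gbmRight (0.9/0.9) and spread more evenly for gbmLeft (0.1/0.1).
head(summary(gbmLeft,  plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)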
Which model do you think would be more predictive of other samples?
The model on the left would likely be more predictive of other samples: its low learning rate and bagging fraction make it less prone to overfitting and better at generalizing to new data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth (tree depth) allows each tree to use more predictors, so predictors with lower importance get the chance to contribute. This spreads importance across more predictors and would therefore flatten (decrease) the slope of the predictor importance profile for either model.
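To see this directly, a short sketch (same assumed solubility data as above) compares relative influence at two interaction depths with the other settings held fixed; counting how many predictors carry more than a trivial share of the influence indicates how flat the profile is:
library(gbm)
library(AppliedPredictiveModeling)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbmShallow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                  n.trees = 500, interaction.depth = 1,
                  shrinkage = 0.1, bag.fraction = 0.5)
set.seed(100)
gbmDeep    <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                  n.trees = 500, interaction.depth = 10,
                  shrinkage = 0.1, bag.fraction = 0.5)

# Number of predictors with more than 1% relative influence in each fit;
# a larger count for the deeper fit indicates a flatter importance profile.
sum(summary(gbmShallow, plotit = FALSE)$rel.inf > 1)
sum(summary(gbmDeep,    plotit = FALSE)$rel.inf > 1)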
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Impute, Transform, and Scale the Data
data(ChemicalManufacturingProcess)
# impute, transform, and standardize the data
data_prep_process <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute", "BoxCox", "center", "scale"))
prep_data <- predict(data_prep_process, ChemicalManufacturingProcess)
# identify predictors with zero or near zero variance
nzv <- nearZeroVar(prep_data)
colnames(prep_data)[nzv]
# remove columns with zero or near zero variance
prep_data <- prep_data[, -nzv]
Split the Data
# Set the random number seed so we can reproduce the results
set.seed(20)
trainingRows <- createDataPartition(prep_data$Yield, p = .80, list= FALSE)
#split data into train and test sets
train_chem <- prep_data[trainingRows, ]
test_chem <- prep_data[-trainingRows, ]
set.seed(100)
rpartTune <- train(Yield~., data=train_chem,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
rpart_pred <- predict(rpartTune, test_chem)
rpart_results_chem = postResample(pred = rpart_pred, obs = test_chem$Yield)
rpart_results_chem
## RMSE Rsquared MAE
## 0.7739539 0.4304565 0.6472335
baggedTree <- bagging(Yield~., data=train_chem)
bagged_pred <- predict(baggedTree, test_chem)
bagged_results_chem = postResample(pred = bagged_pred, obs = test_chem$Yield)
bagged_results_chem
## RMSE Rsquared MAE
## 0.6260646 0.5316352 0.4960005
set.seed(100)
rfModel <- randomForest(Yield~., data=train_chem,
importance = TRUE,
ntree = 1000)
rf_pred <- predict(rfModel, test_chem)
rf_results_chem = postResample(pred = rf_pred, obs = test_chem$Yield)
rf_results_chem
## RMSE Rsquared MAE
## 0.5528423 0.6622202 0.4489977
set.seed(100)
gbmTune_chem <- train(Yield~., data=train_chem,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmTune_chem_pred <- predict(gbmTune_chem, test_chem)
gbmTune_results_chem = postResample(pred = gbmTune_chem_pred, obs = test_chem$Yield)
gbmTune_results_chem
## RMSE Rsquared MAE
## 0.5276659 0.6703243 0.4451713
cubistTuned_chem <- train(Yield~., data=train_chem,
method = "cubist")
cubistTuned_chem_pred <- predict(cubistTuned_chem, test_chem)
cubist_results_chem = postResample(pred = cubistTuned_chem_pred, obs = test_chem$Yield)
cubist_results_chem
## RMSE Rsquared MAE
## 0.4624855 0.7530501 0.3558033
# compile all model performance results
final_results_chem = rbind(
rpart_results_chem,
bagged_results_chem,
rf_results_chem,
gbmTune_results_chem,
cubist_results_chem
)
# add model name column
final_results_chem = data.frame(cbind(Model = c('Single Tree', 'Bagged', 'Random Forest', 'Boosted', 'Cubist'), final_results_chem))
rownames(final_results_chem) <- NULL
# output results
knit_table(final_results_chem |> arrange(RMSE), 'Model Performance Metrics')
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Cubist | 0.462485487957998 | 0.753050056396098 | 0.355803265825987 |
| Boosted | 0.527665856779124 | 0.670324317094116 | 0.445171263940013 |
| Random Forest | 0.552842348236997 | 0.662220237346389 | 0.448997662030822 |
| Bagged | 0.626064553345541 | 0.531635151190147 | 0.496000546161113 |
| Single Tree | 0.773953949593253 | 0.430456457503794 | 0.647233512240889 |
The Cubist model gives the optimal test set performance, with the lowest RMSE and MAE and the highest Rsquared. A quick check of the resampling performance for the caret-tuned fits is sketched below.
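For the resampling side of the question, the cross-validated/bootstrap performance at each best tune can be checked for the models fit through train(); the bagging and randomForest models above were built directly and so carry no resampling profiles here:
# Resampled performance at the best tuning parameters for the caret fits.
getTrainPerf(rpartTune)
getTrainPerf(gbmTune_chem)
getTrainPerf(cubistTuned_chem)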
plot(varImp(cubistTuned_chem), top = 10,
main = "Cubist Model Variable Importance - Top 10")
# cubist_vi <- varImp(cubistTuned_chem$finalModel, scale = FALSE)
# knit_table(rownames_to_column(cubist_vi, var = "Predictor") |> arrange(desc(Overall)), 'Cubist Variable Importance')
Process variables dominate the Cubist model's most important predictors: they account for 3 of the top 5 (60%) and 7 of the top 10 (70%). The same pattern appeared in the top 10 predictors of the optimal linear and nonlinear models fit to these data in the previous two assignments.
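A quick tally (a sketch based on the caret importance ranking from cubistTuned_chem) backs up these proportions:
# Count process vs. biological predictors among the 10 most important.
top10 <- varImp(cubistTuned_chem)$importance |>
  rownames_to_column(var = "Predictor") |>
  arrange(desc(Overall)) |>
  head(10)
table(ifelse(grepl("^Manufacturing", top10$Predictor), "Process", "Biological"))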
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree <- rpart(Yield~., data=train_chem)
rpart.plot(rpartTree)
This view provides an interpretable visualization: each split shows which predictor is used, the threshold, and the direction in which it drives Yield, along with the predicted yield and the proportion of samples in each node. For example, the tree highlights ManufacturingProcess32 as the primary driver of yield, splitting at a (centered and scaled) threshold of 0.22.
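To also show the distribution of yield in the terminal nodes, as the exercise asks, one option (a sketch assuming the partykit package is available) is to convert the rpart fit; partykit's plot method then draws a boxplot of Yield in each terminal node:
library(partykit)

# Convert the rpart fit to a party object; plot() then displays the
# distribution of Yield (boxplots) in each terminal node.
rpartParty <- as.party(rpartTree)
plot(rpartParty)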