Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
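The code chunks below assume the following packages are attached (typically in a setup chunk that is not shown here):
library(mlbench)                     # mlbench.friedman1
library(randomForest)                # randomForest
library(caret)                       # varImp, train, preProcess, postResample
library(party)                       # cforest, varimp
library(gbm)                         # gbm
library(Cubist)                      # backend for caret's method = "cubist"
library(ipred)                       # bagging
library(rpart)                       # rpart
library(rpart.plot)                  # rpart.plot
library(partykit)                    # as.party
library(dplyr)                       # arrange, select
library(AppliedPredictiveModeling)   # ChemicalManufacturingProcess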
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |>
arrange(desc(Overall)) |>
knitr::kable()
|    | Overall |
|---|---|
| V1 | 8.7322354 |
| V4 | 7.6151188 |
| V2 | 6.4153694 |
| V5 | 2.0235246 |
| V3 | 0.7635918 |
| V6 | 0.1651112 |
| V7 | -0.0059617 |
| V10 | -0.0749448 |
| V9 | -0.0952927 |
| V8 | -0.1663626 |
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6-V10 are all near zero (several are slightly negative), far below the scores for the informative predictors V1-V5.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
arrange(desc(Overall)) |>
knitr::kable()
|    | Overall |
|---|---|
| V4 | 7.0475224 |
| V2 | 6.0689606 |
| V1 | 5.6911997 |
| duplicate1 | 4.2833158 |
| V5 | 1.8723844 |
| V3 | 0.6297022 |
| V6 | 0.1356906 |
| V10 | 0.0289481 |
| V9 | 0.0084044 |
| V7 | -0.0134564 |
| V8 | -0.0437056 |
The importance score for V1 dropped substantially (from about 8.73 to 5.69), with much of that importance now assigned to duplicate1. The scores for the other informative predictors also decreased slightly.
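An informal check that the importance is being split between the two correlated predictors rather than lost (using the objects computed above):
# V1's importance before adding the duplicate...
rfImp1["V1", "Overall"]
# ...is of the same order as the combined importance of V1 and duplicate1 afterwards
rfImp2["V1", "Overall"] + rfImp2["duplicate1", "Overall"]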
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
arrange(desc(Overall)) |>
knitr::kable()
|    | Overall |
|---|---|
| V4 | 7.0487092 |
| V2 | 6.5281650 |
| V1 | 4.9168733 |
| duplicate1 | 3.8006823 |
| V5 | 2.0311556 |
| duplicate2 | 1.8772196 |
| V3 | 0.5871155 |
| V6 | 0.1421315 |
| V7 | 0.1099199 |
| V10 | 0.0923058 |
| V9 | -0.0107503 |
| V8 | -0.0840569 |
When another highly correlated predictor is added, the importance of V1 decreases even further (to about 4.92), with the original V1 signal now split across V1, duplicate1, and duplicate2.
(c) Use the cforest function in the party
package to fit a random forest model using conditional inference trees.
The party package function varimp can calculate predictor
importance. The conditional argument of that function
toggles between the traditional importance measure and the modified
version described in Strobl et al. (2007). Do these importances show the
same pattern as the traditional random forest model?
cforestModel <- cforest(y ~ .,
data = simulated[,c(1:11)]) # use only y and V1-V10; exclude the duplicates
data.frame(Importance = varimp(cforestModel)) |>
arrange(desc(Importance)) |>
knitr::kable()
|    | Importance |
|---|---|
| V1 | 8.3013758 |
| V4 | 7.3074416 |
| V2 | 6.3849498 |
| V5 | 2.1097038 |
| V3 | 0.2021611 |
| V7 | 0.1021704 |
| V9 | -0.1127212 |
| V10 | -0.1345254 |
| V8 | -0.1382315 |
| V6 | -0.1612203 |
data.frame(Importance = varimp(cforestModel, conditional=T)) |>
arrange(desc(Importance)) |>
knitr::kable()
|    | Importance |
|---|---|
| V1 | 6.2461745 |
| V4 | 6.0852583 |
| V2 | 5.2780486 |
| V5 | 1.4900429 |
| V3 | 0.0514986 |
| V7 | -0.0179179 |
| V6 | -0.1412536 |
| V9 | -0.1834125 |
| V10 | -0.1838288 |
| V8 | -0.2463130 |
The ordering of the informative predictors is the same as in the traditional random forest (V1, V4, V2, V5, V3), for both the unconditional and the conditional measures. However, V3 receives much less importance here, especially under the conditional measure, and V6-V10 remain unimportant.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# boosted model
boostedModel <- gbm(y ~ .,
data = simulated[,c(1:11)],
distribution = "gaussian",
n.trees=1000)
summary(boostedModel, plotit=F) |>
dplyr::select(-var) |>
knitr::kable()
|    | rel.inf |
|---|---|
| V4 | 25.317426 |
| V1 | 24.548002 |
| V2 | 17.731543 |
| V3 | 10.442501 |
| V5 | 10.386558 |
| V7 | 3.113415 |
| V6 | 2.988204 |
| V8 | 2.062039 |
| V10 | 1.720418 |
| V9 | 1.689893 |
The boosted model identifies the same informative (V1-V5) and uninformative (V6-V10) predictors. The ordering is slightly different, however, with V4 receiving a higher relative influence than V1.
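Note that the fit above uses only the original columns (y and V1-V10). To repeat the correlated-predictor check from part (b) for boosting, one could refit on the full data frame that still contains duplicate1 and duplicate2 (a sketch; output not shown):
boostedModel2 <- gbm(y ~ .,
                     data = simulated,  # includes duplicate1 and duplicate2
                     distribution = "gaussian",
                     n.trees = 1000)
summary(boostedModel2, plotit = FALSE)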
# cubist model
cubistModel <- train(y ~ .,
data = simulated[,c(1:11)],
method = "cubist")
varImp(cubistModel$finalModel, scale = FALSE) |>
arrange(desc(Overall)) |>
knitr::kable()
|    | Overall |
|---|---|
| V1 | 72.0 |
| V2 | 54.5 |
| V4 | 49.0 |
| V3 | 42.0 |
| V5 | 40.0 |
| V6 | 11.0 |
| V7 | 0.0 |
| V8 | 0.0 |
| V9 | 0.0 |
| V10 | 0.0 |
In the Cubist model, V1 is again the most important predictor, but the ordering differs: V2 is rated above V4, and V3 above V5. V7-V10 receive zero importance, and although V6 gets a slightly higher score here than in the other models, it is still far below the informative predictors.
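The same correlated-predictor check could be run for Cubist by training on the full data frame with the duplicates included (a sketch; output not shown):
cubistModel2 <- train(y ~ .,
                      data = simulated,  # includes duplicate1 and duplicate2
                      method = "cubist")
varImp(cubistModel2$finalModel, scale = FALSE)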
Use a simulation to show tree bias with different granularities.
Single regression trees are biased toward predictors with more distinct values (finer granularity), since those predictors offer many more candidate split points.
set.seed(613)
a <- sample(1:10 / 10, 200, replace = TRUE)
b <- sample(1:100 / 100, 200, replace = TRUE)
c <- sample(1:1000 / 1000, 200, replace = TRUE)
d <- sample(1:10000 / 10000, 200, replace = TRUE)
y <- a + b + c + rnorm(200, mean=0, sd=5)
simData <- data.frame(a, b, c, d, y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree))
d appears in the tree frequently even though it was not used to generate y. Its many distinct values give it far more candidate split points, so the exhaustive split search tends to select it despite it being pure noise.
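One way to make this bias more visible (a minimal sketch, not part of the exercise code above): repeat the simulation many times, fit a single-split tree each time, and tabulate which predictor is chosen at the root. The expectation, given the selection bias, is that the fine-grained predictors c and d win the root split far more often than the coarse a, even though d is pure noise.
set.seed(613)
rootVar <- replicate(500, {
  a <- sample(1:10 / 10, 200, replace = TRUE)
  b <- sample(1:100 / 100, 200, replace = TRUE)
  c <- sample(1:1000 / 1000, 200, replace = TRUE)
  d <- sample(1:10000 / 10000, 200, replace = TRUE)
  y <- a + b + c + rnorm(200, mean = 0, sd = 5)
  stump <- rpart(y ~ ., data = data.frame(a, b, c, d, y),
                 control = rpart.control(maxdepth = 1, cp = 0))
  as.character(stump$frame$var[1])  # variable used for the root split
})
# frequency with which each predictor is chosen at the root
table(rootVar)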
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a high bagging fraction and a high learning rate (0.9 for both). With a learning rate of 0.9, each new tree's contribution is barely shrunk, so the ensemble commits quickly to the strongest predictors; with a bagging fraction of 0.9, every tree sees nearly the same training data, so the trees are highly correlated and keep splitting on the same few variables. The model on the left (0.1 for both) shrinks each tree's contribution heavily and trains each tree on a small, different subsample, so many more predictors end up contributing and the importance is spread out.
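This contrast can be checked directly. The sketch below (not the code behind Fig. 8.24) fits two boosted models to the solubility data from the AppliedPredictiveModeling package with the two extreme settings and compares their relative-influence tables; n.trees = 100 is an arbitrary choice to keep the sketch fast.
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(613)
gbmSpread <- gbm(Solubility ~ ., data = solTrain,
                 distribution = "gaussian", n.trees = 100,
                 shrinkage = 0.1, bag.fraction = 0.1)
set.seed(613)
gbmFocused <- gbm(Solubility ~ ., data = solTrain,
                  distribution = "gaussian", n.trees = 100,
                  shrinkage = 0.9, bag.fraction = 0.9)
# relative influence should be spread across many predictors for 0.1/0.1
# and concentrated on a handful for 0.9/0.9
head(summary(gbmSpread, plotit = FALSE), 10)
head(summary(gbmFocused, plotit = FALSE), 10)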
(b) Which model do you think would be more predictive of other samples?
The model on the left, with the lower bagging fraction and learning rate, would likely be more predictive of new samples because it is less prone to overfitting the training data. The model on the right is likely overfit and would not generalize as well.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree make more splits, so more predictors are used and variable importance is spread across a larger set of them. The slope of the predictor importance curve would therefore flatten for both models.
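A quick way to check this (again a sketch on the solubility data, with an arbitrary small n.trees): fit the same boosted model at two interaction depths and count how many predictors receive any relative influence.
data(solubility)
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(613)
gbmDepth1 <- gbm(Solubility ~ ., data = solTrain,
                 distribution = "gaussian", n.trees = 100,
                 interaction.depth = 1)
set.seed(613)
gbmDepth10 <- gbm(Solubility ~ ., data = solTrain,
                  distribution = "gaussian", n.trees = 100,
                  interaction.depth = 10)
# deeper trees use more predictors, so more of them pick up nonzero
# relative influence and the importance curve flattens
sum(summary(gbmDepth1, plotit = FALSE)$rel.inf > 0)
sum(summary(gbmDepth10, plotit = FALSE)$rel.inf > 0)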
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
# load data
data(ChemicalManufacturingProcess)
# apply same imputations, data splitting, and pre-processing as done in HW7
# impute
imputations <- preProcess(ChemicalManufacturingProcess,
method = c("knnImpute"),
k=5)
chem_man_imputed <- predict(imputations, ChemicalManufacturingProcess)
# filter nonzero variance variables
chem_man_filtered <- chem_man_imputed[,-nearZeroVar(chem_man_imputed)]
set.seed(613) # for reproducibility
# split into training and testing
train_indices <- sample(nrow(chem_man_filtered), nrow(chem_man_filtered)*.8, replace=F)
trainChem <- chem_man_filtered[train_indices,]
testChem <- chem_man_filtered[-train_indices,]
# train models
# random forest model
rf_mod <- randomForest(Yield~.,
data=trainChem,
importance=T,
ntree=1000)
rf_pred <- predict(rf_mod, testChem)
# bagged tree
bagged_mod <- bagging(Yield ~ .,
data = trainChem)
bagged_pred <- predict(bagged_mod, testChem)
# boosted tree
boosted_mod <- gbm(Yield ~ .,
data = trainChem,
distribution = "gaussian",
n.trees=1000)
boosted_pred <- predict(boosted_mod, testChem)
## Using 1000 trees...
# cubist
cubist_mod <- train(Yield ~ .,
data = trainChem,
method = "cubist")
cubist_pred <- predict(cubist_mod, testChem)
(a) Which tree-based regression model gives the optimal resampling and test set performance?
rbind(RandomForest = postResample(rf_pred, testChem$Yield),
BaggedTree = postResample(bagged_pred, testChem$Yield),
BoostedTree = postResample(boosted_pred, testChem$Yield),
Cubist = postResample(cubist_pred, testChem$Yield))
## RMSE Rsquared MAE
## RandomForest 0.6240582 0.5225684 0.5052709
## BaggedTree 0.6770303 0.4435007 0.5418391
## BoostedTree 0.6618618 0.5656576 0.5076747
## Cubist 0.5847081 0.6158759 0.4676564
The Cubist model gives the best test set performance, with the lowest RMSE and MAE and the highest \(R^2\).
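The table above compares test set performance. To compare resampling performance directly, each model could be refit through caret::train with a shared cross-validation scheme and summarized with resamples(). This is a sketch of that extra step (the models above were fit with their package defaults rather than through train):
ctrl <- trainControl(method = "cv", number = 10)
set.seed(613)
rf_cv <- train(Yield ~ ., data = trainChem, method = "rf", trControl = ctrl)
set.seed(613)
bag_cv <- train(Yield ~ ., data = trainChem, method = "treebag", trControl = ctrl)
set.seed(613)
gbm_cv <- train(Yield ~ ., data = trainChem, method = "gbm",
                trControl = ctrl, verbose = FALSE)
set.seed(613)
cubist_cv <- train(Yield ~ ., data = trainChem, method = "cubist", trControl = ctrl)
summary(resamples(list(RandomForest = rf_cv, BaggedTree = bag_cv,
                       BoostedTree = gbm_cv, Cubist = cubist_cv)))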
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
We can look at the top 10 most important predictors in the cubist model.
plot(varImp(cubist_mod), 10)
The top 10 predictors are mostly process variables, with only one biological predictor present.
The top two predictors are the same for the non-linear SVM model and
the tree model. The third most influential variable in the tree model,
ManufacturingProcess09 is the tenth most important in the
non-linear SVM model. BiologicalMaterial06 appears in the
top four most important predictors for both models.
ManufacturingProcess17 is the only other shared important
predictor for these two models.
ManufacturingProcess32 is the top predictor for both the linear lasso model and the tree model.
ManufacturingProcess13 and
ManufacturingProcess09 appear in the top three, but the
order of importance is switched for these two models. None of the
biological predictors are shared amongst these models.
ManufacturingProcess17 also appears as an important
predictor within this model.
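If the fitted linear and nonlinear models from Exercises 6.3 and 7.5 are still available, the overlap in top-10 predictors can be computed directly. The object names svm_mod and lasso_mod below are hypothetical placeholders for those earlier caret fits:
# top 10 predictor names from a caret train object, by importance
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
intersect(top10(cubist_mod), top10(svm_mod))    # svm_mod: assumed SVM fit (Ex. 7.5)
intersect(top10(cubist_mod), top10(lasso_mod))  # lasso_mod: assumed lasso fit (Ex. 6.3)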
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
single_tree <- rpart(Yield ~ .,
data = trainChem)
rpart.plot(single_tree)
The top split uses the same predictor as the ensemble models, which is unsurprising. More biological predictors appear in the single tree than were seen in the more robust tree models. The single tree is also easier to interpret: each split shows the cutoff value used for a predictor, which makes the relationship between those predictors and yield easy to trace.
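rpart.plot labels each terminal node with its mean predicted yield. To show the distribution of yield within each terminal node, as the exercise asks, the fitted tree can be converted with partykit, whose plot method draws a boxplot of the response in every terminal node:
# boxplots of Yield within each terminal node of the single tree
plot(as.party(single_tree))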