library(AppliedPredictiveModeling)
library(caret)
library(corrplot)
library(earth)
library(e1071)
library(elasticnet)
library(grplasso)
library(kernlab)
library(lars)
library(MASS)
library(mlbench)
library(nnet)
library(penalized)
library(pls)
library(stats)
library(tidyverse)
library(VIM)
library(rpart)
library(Cubist)
library(gbm)
library(ipred)
library(party)
library(partykit)
library(randomForest)
Recreate the simulated data from Exercise 7.2:
set.seed(91)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6– V10)?
rfImp1
## Overall
## V1 8.85240589
## V2 8.62267451
## V3 0.77808305
## V4 7.26700539
## V5 2.36361531
## V6 0.02562361
## V7 -0.04264491
## V8 -0.12836485
## V9 -0.11431802
## V10 -0.13949255
Given the above table, one can say with confidence that the uninformative predictors V6 - V10 were not used by the random forest model significantly.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9430022
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(92)
rf_model <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rf_imp <- varImp(rf_model, scale = FALSE)
rf_imp
## Overall
## V1 6.72598952
## V2 8.74182559
## V3 0.81745852
## V4 6.86628176
## V5 1.99758295
## V6 -0.06402895
## V7 -0.21956188
## V8 -0.20096911
## V9 -0.08310622
## V10 -0.04818898
## duplicate1 3.30979863
simulated$duplicate2 <- simulated$V2 + rnorm(200) * .1
set.seed(93)
rf_model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rf_imp1 <- varImp(rf_model1, scale = FALSE)
rf_imp1
## Overall
## V1 6.79869675
## V2 6.26392745
## V3 0.66037848
## V4 7.02399328
## V5 1.86403957
## V6 0.01567341
## V7 -0.04861496
## V8 -0.10475925
## V9 -0.02233431
## V10 -0.01778979
## duplicate1 3.03549633
## duplicate2 4.19806600
Adding an additional predictor that is highly correlated with one of the informative predictors like V1 does cause a change in the importance score for all of the variables. In this case, duplicating V1 caused it to suffer the most. However, when creating a duplicate of a different highly correlated predictor such as V2, it seems as though V1 recovered a little while V2 dropped further.
Use the cforest function in the party
package to fit a random forest model using conditional inference trees.
The party package function varimp can
calculate predictor importance. The conditional argument of that
function toggles between the traditional importance measure and the
modified version described in Strobl et al. (2007). Do these importances
show the same pattern as the traditional random forest model?
simulated <- simulated %>%
select(c(-(starts_with("d"))))
set.seed(94)
cf_model <- cforest(y ~ .,
data = simulated,
ntree = 1000)
cf_imp <- varimp(cf_model, conditional = TRUE)
cf_imp
## V1 V2 V3 V4 V5 V6
## 6.72178837 6.22083279 0.06375578 5.14898534 1.68356346 -0.17218347
## V7 V8 V9 V10
## -0.43137594 -0.33208587 -0.32078204 -0.18744396
As can be seen in the above results, the conditional influence model does seem to show a similar pattern to that of the random forest one although not with the same magnitudes. The similarities are shown more in the fact that V1 continues to dominate, V3 continues to be a dipping point, and V6 - V10 continue to be seldom used if at all.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(95)
gbm_model <- train(y ~ .,
data = simulated,
method = "gbm",
tuneGrid = gbm_grid,
verbose = FALSE)
gbm_imp <- varImp(gbm_model$finalModel, scale = FALSE)
gbm_imp
## Overall
## V1 42994.8024
## V2 41925.1380
## V3 11513.4247
## V4 39860.0603
## V5 18421.9969
## V6 744.9128
## V7 922.0981
## V8 1566.6877
## V9 721.1009
## V10 909.4869
set.seed(96)
c_model <- train(y ~ .,
data = simulated,
method = "cubist",)
c_imp <- varImp(c_model$finalModel, scale = FALSE)
c_imp
## Overall
## V2 73.5
## V3 37.5
## V1 63.5
## V4 50.0
## V7 32.0
## V5 29.0
## V6 3.0
## V10 0.5
## V8 0.0
## V9 0.0
After testing for both boosted trees and Cubist models, it seems that boosted trees do follow the same pattern as previously seen however Cubist deviates from the pattern making V2 dominant and actually making use of V7.
Use a simulation to show tree bias with different granularities.
set.seed(97)
x <- sample(1:50 / 10, 100, replace = TRUE)
y <- sample(1:100 / 100, 100, replace = TRUE)
z <- sample(1:500 / 1000, 100, replace = TRUE)
xyz <- x + y + z
sim <- data.frame(x,y,z,xyz)
rp_model <- rpart(y ~ .,
data = sim)
plot(as.party(rp_model), gp = gpar(fontsize = 7))
varImp(rp_model)
## Overall
## x 0.6122648
## xyz 1.1494434
## z 0.6221138
As can be seen above, if a variable has a lot of different values, the tree tends to pick that one instead of the others. There’s a greater chance that the model will choose the noisy variables over the useful ones at the top nodes.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
knitr::include_graphics('Fig.8.24.png')
Image 1: Chapter 8 Exercise 3 comparison of variable importance magnitudes for differing values of the bagging fraction and shrinkage parameters (Kuhn et al., 2016)
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Boosting has a sharper importance profile than random forests because its trees depend on each other, leading to similar structures and higher importance scores for the same predictors. Increasing the learning rate and bagging fraction makes the model behave more like boosting, while lowering these values allows more predictors to gain importance.
Which model do you think would be more predictive of other samples?
Models with a smaller number of key predictors tend to be more accurate compared to those with more, since this approach lowers the risk of overfitting.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing interaction depth would lessen the steepness of predictor importance for the model on the left, but it would have a lesser impact on the model on the right.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
cmp <- kNN(ChemicalManufacturingProcess, k = 5) %>%
select(c(1:58)) %>%
select(!nearZeroVar(.))
set.seed(98)
ctrl1 <- trainControl(method = "cv", number = 10)
split <- createDataPartition(cmp$Yield, p = 0.8, list = FALSE)
train.data <- cmp[split, ]
test.data <- cmp[-split, ]
set.seed(99)
st_model <- train(Yield ~ .,
data = train.data,
method = "rpart",
tuneLength = 10,
trControl = ctrl1)
st_pred <- predict(st_model, test.data)
set.seed(100)
bt_model <- bagging(Yield ~ .,
data = train.data)
bt_pred <- predict(bt_model, test.data)
set.seed(101)
rf_model2 <- randomForest(Yield ~ .,
data = train.data,
importance = TRUE,
ntree = 1000)
rf_pred <- predict(rf_model2, test.data)
gbm_grid1 <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(102)
gbm_model1 <- train(Yield ~ .,
data = train.data,
method = "gbm",
tuneGrid = gbm_grid1,
verbose = FALSE)
gbm_pred <- predict(gbm_model1, test.data)
set.seed(103)
c_model1 <- train(Yield ~ .,
data = train.data,
method = "cubist")
c_pred <- predict(c_model1, test.data)
Which tree-based regression model gives the optimal resampling and test set performance?
postResample(st_pred, test.data$Yield)
## RMSE Rsquared MAE
## 1.6104036 0.3436246 1.2285913
postResample(bt_pred, test.data$Yield)
## RMSE Rsquared MAE
## 1.2560876 0.6367270 0.8503925
postResample(rf_pred, test.data$Yield)
## RMSE Rsquared MAE
## 1.1839600 0.7468729 0.8379200
postResample(gbm_pred, test.data$Yield)
## RMSE Rsquared MAE
## 0.9072270 0.8231149 0.6145274
postResample(c_pred, test.data$Yield)
## RMSE Rsquared MAE
## 0.8795236 0.8075330 0.6480669
As we can see from the results above, the boosted tree model
(gbm_model1) outperformed all other models obtaining an
RMSE value of 0.9072270, R^2 of 0.8231149, and an MAE of 0.6145274.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
set.seed(104)
ridge_grid1 <- data.frame(.lambda = seq(0.001, 0.5, length = 25))
ridge_model1 <- train(Yield ~ .,
data = train.data,
method = "ridge",
metric = "Rsquared",
tuneGrid = ridge_grid1,
trControl = ctrl1,
preProc = c("center", "scale"))
set.seed(105)
svm_model1 <- train(Yield ~ .,
data = train.data,
method = "svmRadial",
trControl = ctrl1,
preProcess = c("center","scale"),
tuneLength = 10)
svm_pred1 <- predict(svm_model1, newdata = test.data)
varImp(gbm_model1)
## gbm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 19.305
## ManufacturingProcess31 16.634
## ManufacturingProcess06 16.258
## ManufacturingProcess13 16.026
## BiologicalMaterial03 15.893
## ManufacturingProcess09 12.267
## ManufacturingProcess17 11.995
## BiologicalMaterial05 9.076
## BiologicalMaterial11 8.513
## BiologicalMaterial09 8.469
## BiologicalMaterial06 6.793
## ManufacturingProcess28 6.774
## ManufacturingProcess37 6.400
## ManufacturingProcess05 6.359
## ManufacturingProcess20 6.031
## BiologicalMaterial02 5.926
## ManufacturingProcess04 5.817
## ManufacturingProcess27 5.624
## ManufacturingProcess11 5.563
varImp(ridge_model1)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 92.10
## ManufacturingProcess36 84.55
## ManufacturingProcess13 77.45
## BiologicalMaterial02 75.80
## ManufacturingProcess31 72.75
## BiologicalMaterial03 70.48
## BiologicalMaterial12 64.89
## ManufacturingProcess29 62.84
## ManufacturingProcess17 58.92
## ManufacturingProcess33 58.75
## ManufacturingProcess09 58.54
## BiologicalMaterial04 57.22
## BiologicalMaterial01 51.50
## ManufacturingProcess06 50.10
## BiologicalMaterial08 48.58
## BiologicalMaterial11 44.03
## ManufacturingProcess11 39.56
## ManufacturingProcess26 36.67
## ManufacturingProcess30 30.61
varImp(svm_model1)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 92.10
## ManufacturingProcess36 84.55
## ManufacturingProcess13 77.45
## BiologicalMaterial02 75.80
## ManufacturingProcess31 72.75
## BiologicalMaterial03 70.48
## BiologicalMaterial12 64.89
## ManufacturingProcess29 62.84
## ManufacturingProcess17 58.92
## ManufacturingProcess33 58.75
## ManufacturingProcess09 58.54
## BiologicalMaterial04 57.22
## BiologicalMaterial01 51.50
## ManufacturingProcess06 50.10
## BiologicalMaterial08 48.58
## BiologicalMaterial11 44.03
## ManufacturingProcess11 39.56
## ManufacturingProcess26 36.67
## ManufacturingProcess30 30.61
The first of the above data frames displays which predictors are most
important in the boosted tree model built previously. Out of the top 10
predictors, it seems like the manufacturing process variables dominate
the list. Compared to the the top 10 predictors from the optimal linear
model (the ridge model seen above) and the optimal nonlinear model (the
SVM model below the ridge model), the top 10 important predictors from
this model include BiologicalMaterial03,
BiologicalMaterial12, ManufacturingProcess13,
ManufacturingProcess17,
ManufacturingProcess31, and
ManufacturingProcess32, but exclude
BiologicalMaterial02, BiologicalMaterial06,
ManufacturingProcess29, and
ManufacturingProcess36. However,
ManufacturingProcess32 continues to be of most
importance.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
set.seed(106)
rp_model1 <- rpart(Yield ~ .,
data = train.data)
plot(as.party(rp_model1), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
As can be seen above, this view of the data does provide additional knowledge about the biological or process predictors and their relationship with yield. The main takeaways being that the main factors that influence yield are related to manufacturing processes, and these processes play a much bigger role in determining yield.
Kuhn, M., Johnson, K., & Springer Science+Business Media. (2016). Applied predictive modeling. Springer.