library(ggplot2)
library(magrittr)
Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling (p. 218).
Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest model made little use of the uninformative predictors (V6 – V10); their importance scores are near zero (and in some cases negative) compared with those of the informative predictors.
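As a quick check, the scores for the uninformative predictors can be pulled out of rfImp1 directly:
## Importance of the uninformative predictors only
rfImp1[paste0("V", 6:10), , drop = FALSE]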
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
After adding a predictor that is highly correlated with V1, the importance score for V1 decreased significantly. A further decrease (i.e., dilution of importance) occurred after another highly correlated predictor was added. The newly added redundant predictors, by virtue of their high correlation with V1, themselves receive substantial importance values.
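The dilution is easy to see by lining up V1's score across the three fits; a quick sketch using the rfImp objects computed above:
## V1's importance with zero, one, and two highly correlated copies added
c(noDuplicate = rfImp1["V1", "Overall"],
  oneDuplicate = rfImp2["V1", "Overall"],
  twoDuplicates = rfImp3["V1", "Overall"])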
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## The mtry parameter should be the number of predictors
## (the number of columns minus 1 for the outcome).
sim_partc <- simulated[, 1:11]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 V1 V5 V8
## 6.034951318 4.339605372 3.318573594 0.742661197 -0.011539167
## V7 V9 V10 V6 V3
## 0.008964129 -0.008083828 0.005557381 -0.003772097 -0.001281939
sim_partc <- simulated[, 1:12]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 duplicate1 V5 V1
## 5.996272076 4.670717938 0.791541391 0.675670117 0.588175959
## V10 V7 V6 V3 V9
## 0.017414135 0.011714376 0.007003095 0.003935970 -0.001919159
## V8
## -0.001345279
sim_partc <- simulated[, 1:13]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 V5 duplicate1 V1
## 5.695196980 4.418333064 0.796500365 0.662161136 0.471536152
## V10 duplicate2 V8 V3 V7
## 0.026446640 0.023248430 -0.004928977 0.004543721 0.004361530
## V6 V9
## -0.001310992 0.001260097
In the above case, the decline in importance is much less pronounced than with the traditional random forest model.
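For reference, the unconditional (traditional) permutation importance from the same cforest fit can be obtained by toggling the conditional argument; the exact values vary because the permutations are random:
## Traditional (unconditional) importance from the last cforest fit
imp_uncond <- varimp(baggedTree, conditional = FALSE)
imp_uncond[order(-abs(imp_uncond))]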
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
## Boosted Trees
library(gbm)
sim_data <- simulated[, 1:11]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 30.5443022
## V1 V1 26.3506687
## V2 V2 24.4445984
## V5 V5 10.8300221
## V3 V3 7.3794063
## V6 V6 0.4510024
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
sim_data <- simulated[, 1:12]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 29.1254395
## V2 V2 23.1244921
## duplicate1 duplicate1 17.7606567
## V1 V1 11.9064649
## V5 V5 9.2661917
## V3 V3 8.4007807
## V6 V6 0.2484665
## V9 V9 0.1675079
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
sim_data <- simulated[, 1:13]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 28.2782328
## V2 V2 21.6991520
## duplicate1 duplicate1 14.1929228
## V1 V1 13.5423876
## V5 V5 11.8066539
## V3 V3 8.9282442
## duplicate2 duplicate2 0.8028415
## V7 V7 0.3469890
## V10 V10 0.1481724
## V6 V6 0.1273970
## V9 V9 0.1270067
## V8 V8 0.0000000
## Cubist Model
## Index of the response column ("y" sits in the same position in every subset)
yCol <- which(colnames(simulated) == "y")
sim_data <- simulated[, 1:11]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V1 72.0
## V2 54.5
## V4 49.0
## V3 42.0
## V5 40.0
## V6 11.0
## V8 0.0
## V9 0.0
## V10 0.0
## V7 0.0
sim_data <- simulated[, 1:12]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V2 62.0
## V1 55.5
## V4 50.0
## V3 42.0
## duplicate1 37.0
## V5 31.0
## V6 15.5
## V8 0.0
## V7 0.0
## V9 0.0
## V10 0.0
sim_data <- simulated[, 1:13]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V2 69.5
## V1 54.0
## V4 50.0
## V5 38.0
## V3 32.0
## duplicate1 25.0
## duplicate2 25.0
## V6 10.0
## V8 3.0
## V10 0.0
## V7 0.0
## V9 0.0
With boosted trees and Cubist, the decline in V1's importance is again more gradual than with the traditional random forest model. However, more variables are assigned importance values greater than zero, and the newly added (highly correlated) predictors still appear among the influential variables.
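One way to see that the importance is being split rather than lost is to total the relative influence of V1 and its copies in the last boosted fit; a quick sketch (summary.gbm with plotit = FALSE returns the relative-influence table without drawing the bar chart):
gbmInf <- summary(gbmModel, plotit = FALSE)
sum(gbmInf$rel.inf[gbmInf$var %in% c("V1", "duplicate1", "duplicate2")])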
Use a simulation to show tree bias with different granularities.
varImpSorted <- function(dfVarImp) {
varImpRows <- order(abs(dfVarImp$Overall), decreasing = TRUE)
dfResult <- data.frame(dfVarImp[varImpRows, 1],
row.names = rownames(dfVarImp)[varImpRows])
colnames(dfResult) <- colnames(dfVarImp)
return(dfResult)
}
sim_data <- simulated[, 1:11]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V1 8.85865210
## V4 7.79041405
## V2 6.57684003
## V5 2.15062112
## V3 0.69517829
## V6 0.16551981
## V8 -0.10270669
## V9 -0.07651010
## V7 0.02333943
## V10 -0.01594849
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.554986 0.823782 2.092077
sim_data <- simulated[, 1:12]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V4 6.92636679
## V2 6.19094491
## V1 5.49235852
## duplicate1 4.39153041
## V5 1.88461384
## V3 0.64406350
## V6 0.12649095
## V8 -0.09127384
## V7 0.06673548
## V9 0.02758416
## V10 0.01402926
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.6208229 0.7940312 2.1560878
sim_data <- simulated[, 1:13]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V4 6.913866151
## V2 6.493718347
## V1 5.106983492
## duplicate1 3.570249805
## duplicate2 1.917100445
## V5 1.914067144
## V3 0.486767309
## V6 0.077448314
## V8 -0.068658456
## V9 -0.035937291
## V7 0.014490091
## V10 -0.002880012
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.6300680 0.7706705 2.1649502
With the added (highly correlated) predictors, model performance suffers, based on the RMSE and the other performance metrics shown above. The increase in RMSE as more correlated predictors are added indicates an increase in bias and model underfitting.
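Tree bias with respect to predictor granularity, which the prompt asks about, can also be illustrated with a small simulation; the sketch below uses rpart with two equally uninformative predictors that differ only in their number of distinct values, and the fine-grained predictor is expected to win the first split far more often.
library(rpart)
set.seed(624)
## X1 is continuous (fine granularity), X2 takes only 4 values (coarse);
## neither is related to y, so any preference reflects split-selection bias.
firstSplitVar <- replicate(200, {
  dat <- data.frame(y  = rnorm(100),
                    X1 = rnorm(100),
                    X2 = sample(1:4, 100, replace = TRUE))
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(firstSplitVar)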
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The text states that the importance profile for boosting has a much steeper importance slope than the one for random forests. This is because the trees in boosting are dependent on each other and hence have correlated structures, as the method follows the gradient. Therefore, many of the same predictors are selected across the trees, increasing their contribution to the importance metric. With higher values of the bagging fraction, more of the training data end up being used in each iteration. Likewise, a higher learning rate allows a larger fraction of the current predicted value to be added to the previous iteration's predicted value. In short, the closer these two parameters are to 1, the closer the model is to regular (unregularized) boosting behavior, and therefore the importance curve exhibits the steep slope described in the excerpt from the text. On the flip side, lowering these two tuning parameters (and thereby regularizing the model more) spreads importance over more of the predictors, and at higher levels.
In my opinion, the model on the left, which is more regularized by the lower values of the two tuning parameters, will exhibit smaller variance on new samples, have a lower RMSE, and therefore be more predictive.
Increasing the interaction depth (tree depth) is likely to draw the model closer to a random forest and would therefore flatten the slope of predictor importance more for the left-hand model, but it is unlikely to change the steep slope of the right-hand model.
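For concreteness, the two extremes discussed above can be expressed as gbm fits; a sketch assuming the solubility data from the AppliedPredictiveModeling package (data(solubility), which provides solTrainXtrans and solTrainY) is available:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
## Left-hand settings: small bagging fraction and learning rate
gbmLow <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                  n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                  interaction.depth = 1, verbose = FALSE)
## Right-hand settings: both parameters near 1
gbmHigh <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 1, verbose = FALSE)
## The right-hand model is expected to concentrate its relative influence
## on fewer predictors (a steeper importance profile).
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)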
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Data Pre-Processing
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
preP <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
## Restore the response variable values to original
df$Yield = ChemicalManufacturingProcess$Yield
## Split the data into a training and a test set
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows, ]
df.test <- df[-trainRows, ]
Tree-based Regression Models Training
colYield <- which(colnames(df) == "Yield")
trainX <- df.train[, -colYield]
trainY <- df.train$Yield
testX <- df.test[, -colYield]
testY <- df.test$Yield
## Single Tree Models
## Model 1 tunes over the complexity parameter
st1Model <- train(trainX, trainY,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
st1Model.train.pred <- predict(st1Model)
st1Model.test.pred <- predict(st1Model, newdata = testX)
## Model 2 tunes over the maximum depth
st2Model <- train(trainX, trainY,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
st2Model.train.pred <- predict(st2Model)
st2Model.test.pred <- predict(st2Model, newdata = testX)
## Model Tree Models
library(RWeka)
## Model-Based version
m5Model <- M5P(trainY~., data = trainX)
m5Model.train.pred <- predict(m5Model)
m5Model.test.pred <- predict(m5Model, newdata = testX)
## Rule-Based version
m5RModl <- M5Rules(trainY~., data = trainX)
m5RModl.train.pred <- predict(m5RModl)
m5RModl.test.pred <- predict(m5RModl, newdata = testX)
## Bagged Tree Model
library(party)
bagCtrl <- cforest_control(mtry = ncol(trainX))
bagTree <- cforest(trainY ~., data = trainX, controls = bagCtrl)
bagTree.train.pred <- predict(bagTree)
bagTree.test.pred <- predict(bagTree, newdata = testX)
## Random Forest Model
library(randomForest)
library(caret)
rfModel <- randomForest(trainY ~ .,
data = trainX,
importance = TRUE,
ntree = 1000)
rfModel.train.pred <- predict(rfModel)
rfModel.test.pred <- predict(rfModel, newdata = testX)
## Boosted Trees
library(gbm)
gbmModel <- gbm(trainY ~ ., data = trainX, distribution = "gaussian")
gbmModel.train.pred <- predict(gbmModel, n.trees = 100)
gbmModel.test.pred <- predict(gbmModel, n.trees = 100, newdata = testX)
## Cubist Model
cubistModel <- train(trainX,
trainY,
method = "cubist")
cubistModel.train.pred <- predict(cubistModel)
cubistModel.test.pred <- predict(cubistModel, newdata = testX)
print("Train set performance")
## [1] "Train set performance"
rbind(
"st1Model" = postResample(pred = st1Model.train.pred, obs = trainY),
"st2Model" = postResample(pred = st2Model.train.pred, obs = trainY),
"m5Model" = postResample(pred = m5Model.train.pred, obs = trainY),
"m5RModl" = postResample(pred = m5RModl.train.pred, obs = trainY),
"bagged" = postResample(pred = bagTree.train.pred, obs = trainY),
"rforest" = postResample(pred = rfModel.train.pred, obs = trainY),
"boosted" = postResample(pred = gbmModel.train.pred, obs = trainY),
"cubist" = postResample(pred = cubistModel.train.pred, obs = trainY)
)
## RMSE Rsquared MAE
## st1Model 1.2077967 0.5389803 0.9442928
## st2Model 1.0374397 0.6598601 0.8030060
## m5Model 0.7011910 0.8501936 0.5682746
## m5RModl 0.8687389 0.7662811 0.6683153
## bagged 0.8346806 0.8082896 0.6455589
## rforest 1.1127120 0.6337759 0.8672670
## boosted 0.7789427 0.8305777 0.5979851
## cubist 0.1754463 0.9920887 0.1342076
print("Test set performance")
## [1] "Test set performance"
rbind(
"st1Model" = postResample(pred = st1Model.test.pred, obs = testY),
"st2Model" = postResample(pred = st2Model.test.pred, obs = testY),
"m5Model" = postResample(pred = m5Model.test.pred, obs = testY),
"m5RModl" = postResample(pred = m5RModl.test.pred, obs = testY),
"bagged" = postResample(pred = bagTree.test.pred, obs = testY),
"rforest" = postResample(pred = rfModel.test.pred, obs = testY),
"boosted" = postResample(pred = gbmModel.test.pred, obs = testY),
"cubist" = postResample(pred = cubistModel.test.pred, obs = testY)
)
## RMSE Rsquared MAE
## st1Model 1.532614 0.4659000 1.2471736
## st2Model 1.481003 0.5067299 1.2167997
## m5Model 1.249298 0.6448689 0.9287938
## m5RModl 1.674628 0.3860707 1.2093524
## bagged 1.384970 0.5808880 1.0562028
## rforest 1.244433 0.6967441 0.9453439
## boosted 1.337591 0.6025043 0.9765952
## cubist 1.092527 0.7482687 0.8152786
Based on the training and test performance metrics, the Cubist model appears to be the best performer.
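For reference, the tuning values caret selected for the Cubist model (committees and neighbors) can be read off the train object:
## Final tuning parameters chosen by resampling
cubistModel$bestTune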
varImpSorted <- function(dfVarImp) {
varImpRows <- order(abs(dfVarImp$Overall), decreasing = TRUE)
dfResult <- data.frame(dfVarImp[varImpRows, 1],
row.names = rownames(dfVarImp)[varImpRows])
colnames(dfResult) <- colnames(dfVarImp)
return(dfResult)
}
mdlVarImp <- varImp(cubistModel)
plot(mdlVarImp)
mdlvarImp <- varImpSorted(mdlVarImp$importance)
print("Top 10 important predictors")
## [1] "Top 10 important predictors"
head(mdlvarImp, 10)
## Overall
## ManufacturingProcess17 100.00000
## ManufacturingProcess32 95.65217
## ManufacturingProcess39 52.17391
## ManufacturingProcess09 51.08696
## ManufacturingProcess33 44.56522
## BiologicalMaterial12 36.95652
## BiologicalMaterial03 31.52174
## ManufacturingProcess29 25.00000
## ManufacturingProcess04 22.82609
## ManufacturingProcess22 21.73913
print(paste("Number of process variables:",
length(grep(
"Manuf.*", rownames(mdlvarImp)[which(mdlvarImp$Overall > 0)]
))))
## [1] "Number of process variables: 26"
print(paste("Number of biological variables:",
length(grep(
"Bio.*", rownames(mdlvarImp)[which(mdlvarImp$Overall > 0)]
))))
## [1] "Number of biological variables: 9"
Based on the above numbers, process variables dominate the importance list over biological ones.
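A rough way to quantify that dominance is to total the scaled importance by predictor type, using the mdlvarImp table computed above:
## Total scaled importance, grouped into process vs. biological predictors
predType <- ifelse(grepl("^Manufacturing", rownames(mdlvarImp)),
                   "process", "biological")
tapply(mdlvarImp$Overall, predType, sum)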
Most of the top 10 predictors in this model also appear among the top 10 predictors from the optimal linear and nonlinear models, although with different importance levels, which is to be expected.
library(partykit)
plot(as.party(st2Model$finalModel))
The graphical view of the optimal single tree above provides quick, intuitive insight into the biological and process predictors and their relationship with yield. For example, one can navigate the tree visually and trace the paths that lead to the highest yields; the diagram makes it evident that a handful of key process predictors drive the highest-yield nodes.
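If a text view is preferred, the same splits and node averages can also be printed from the underlying rpart fit:
## Printed form of the final single-tree model (split rules and node means)
print(st2Model$finalModel)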