Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
library(dplyr)   # provides the %>% pipe used below
library(tibble)  # provides rownames_to_column() and remove_rownames()
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE) %>%
rownames_to_column("var")
rfImp1 <- rfImp1[order(-rfImp1$Overall), , drop=FALSE] %>% remove_rownames()
kab_tab2(rfImp1)

| var | Overall |
|---|---|
| V1 | 8.8389 |
| V4 | 7.5882 |
| V2 | 6.4902 |
| V5 | 2.2743 |
| V3 | 0.6758 |
| V6 | 0.1744 |
| V7 | 0.1514 |
| V9 | -0.0299 |
| V8 | -0.0308 |
| V10 | -0.0853 |
Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The uninformative predictors, V6 – V10, were the least important predictors in the random forest model, and they were far less influential than the five informative predictors, V1 – V5: the most important of them, V6 at 0.17, had only about a quarter of the influence of V3, the least influential of the top five at 0.68.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
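The code that constructs the correlated predictor is not shown here; a minimal sketch, assuming the standard construction from the text (V1 plus a small amount of noise), would be:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)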
## [1] 0.9396
Fit another random forest model to these data.
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE) %>%
rownames_to_column("var") %>% remove_rownames()
rfImp2 <- rfImp2[order(-rfImp2$Overall), , drop=FALSE]
rfImp1[11,] <- NA   # pad rfImp1 to 11 rows so it can be column-bound with the 11-row rfImp2
tab1 <- cbind(rfImp1, rfImp2) %>% remove_rownames()
tab1 <- tab1[c(2,1,4,3)]
colnames(tab1) <- c("Model_1", "Model_1_var", "Model_2", "Model_2_var")
kab_tab(tab1)

| Model_1 | Model_1_var | Model_2 | Model_2_var |
|---|---|---|---|
| 8.8389 | V1 | 6.9392 | V4 |
| 7.5882 | V4 | 6.2978 | V1 |
| 6.4902 | V2 | 6.0804 | V2 |
| 2.2743 | V5 | 3.5641 | duplicate1 |
| 0.6758 | V3 | 2.0310 | V5 |
| 0.1744 | V6 | 0.5841 | V3 |
| 0.1514 | V7 | 0.0795 | V6 |
| -0.0299 | V9 | -0.0072 | V10 |
| -0.0308 | V8 | -0.0257 | V7 |
| -0.0853 | V10 | -0.0884 | V9 |
| NA | NA | -0.1101 | V8 |
Did the importance score for V1 change?

Yes, the importance score for V1 was reduced by the addition of a highly correlated variable, and V1 dropped to second place.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007)1.
library(party)
data <- simulated[1:11]   # the original ten predictors plus y (duplicate1 excluded)
CI_model <- cforest(y ~ ., data=data)
CI_Imp_T <- as.data.frame(varimp(CI_model, conditional = T)) %>%
rownames_to_column("var") %>% remove_rownames()
CI_Imp_T <- CI_Imp_T[order(-CI_Imp_T$`varimp(CI_model, conditional = T)`), , drop=FALSE]
CI_Imp_F <- as.data.frame(varimp(CI_model, conditional = F)) %>%
rownames_to_column("var") %>% remove_rownames()
CI_Imp_F <- CI_Imp_F[order(-CI_Imp_F$`varimp(CI_model, conditional = F)`), , drop=FALSE]
tab3 <- cbind(rfImp1[1:10,], CI_Imp_T, CI_Imp_F) %>% remove_rownames()
tab3 <- tab3[c(2,1,4,3,6,5)]
colnames(tab3) <- c("Model_1", "Model_1_var", "CI cond=T", "CI_T_vars", "CI cond=F", "CI_F_vars")
kab_tab(tab3)

| Model_1 | Model_1_var | CI cond=T | CI_T_vars | CI cond=F | CI_F_vars |
|---|---|---|---|---|---|
| 8.8389 | V1 | 6.8480 | V4 | 8.7353 | V1 |
| 7.5882 | V4 | 5.6150 | V1 | 8.2706 | V4 |
| 6.4902 | V2 | 5.2925 | V2 | 6.5767 | V2 |
| 2.2743 | V5 | 1.2380 | V5 | 1.9824 | V5 |
| 0.6758 | V3 | 0.0209 | V7 | 0.0525 | V7 |
| 0.1744 | V6 | 0.0188 | V3 | 0.0370 | V3 |
| 0.1514 | V7 | 0.0000 | V8 | 0.0102 | V6 |
| -0.0299 | V9 | -0.0019 | V9 | -0.0030 | V8 |
| -0.0308 | V8 | -0.0062 | V10 | -0.0174 | V9 |
| -0.0853 | V10 | -0.0067 | V6 | -0.0436 | V10 |
The conditional computation of importance flips the order of the first two variables compared with both the random forest model and the unconditional computation. There are other differences in the pattern of importance visible in the table above; for instance, the importance of V3 drops sharply in both of the conditional inference forest (cforest) measures.
Repeat this process with different tree models, such as boosted trees and Cubist.
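The boosted-tree and Cubist fits below also reference a thirteenth column, duplicate2. Its construction is likewise not shown; an analogous definition is assumed here (again based on V1 — the original does not state which predictor it copies):
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1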
library(gbm)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 10)
# boost_model1: the original ten predictors only
boost_model1 <- train(simulated[1:10], simulated$y, method = "gbm",
tuneGrid = gbmGrid, verbose = FALSE)
# boost_model2: adds duplicate1 (drops y and duplicate2)
boost_model2 <- train(simulated[-c(11,13)], simulated$y, method = "gbm",
tuneGrid = gbmGrid, verbose = FALSE)
# boost_model3: adds both duplicate1 and duplicate2 (drops y only)
boost_model3 <- train(simulated[-11], simulated$y, method = "gbm",
tuneGrid = gbmGrid, verbose = FALSE)
boost_imp1 <- data.frame(varImp(boost_model1)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
boost_imp1 <- boost_imp1[order(-boost_imp1$Overall), , drop=FALSE]
boost_imp2 <- data.frame(varImp(boost_model2)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
boost_imp2 <- boost_imp2[order(-boost_imp2$Overall), , drop=FALSE]
boost_imp3 <- data.frame(varImp(boost_model3)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
boost_imp3 <- boost_imp3[order(-boost_imp3$Overall), , drop=FALSE]
boost_imp1[c(11:12),] <- NA
boost_imp2[12,] <- NA
tab5 <- cbind(boost_imp1, boost_imp2, boost_imp3) %>% remove_rownames()
tab5 <- tab5[c(2,1,4,3,6,5)]
names(tab5) <- c("boost1", "var1", "boost2", "var2", "boost3", "var3")
kab_tab(tab5)

| boost1 | var1 | boost2 | var2 | boost3 | var3 |
|---|---|---|---|---|---|
| 100.000 | V4 | 100.0000 | V4 | 100.000 | V4 |
| 94.533 | V1 | 75.5275 | V2 | 83.154 | V2 |
| 78.198 | V2 | 68.7917 | V1 | 82.446 | V1 |
| 35.322 | V5 | 40.3182 | V5 | 42.792 | V5 |
| 31.846 | V3 | 23.7523 | V3 | 32.617 | V3 |
| 2.700 | V7 | 20.5599 | duplicate1 | 12.054 | duplicate2 |
| 2.220 | V8 | 3.0663 | V7 | 2.993 | V7 |
| 2.198 | V6 | 2.2864 | V6 | 2.940 | duplicate1 |
| 1.366 | V10 | 0.6198 | V9 | 2.870 | V6 |
| 0.000 | V9 | 0.4249 | V8 | 1.007 | V8 |
| NA | NA | 0.0000 | V10 | 0.334 | V10 |
| NA | NA | NA | NA | 0.000 | V9 |
You can see in the table above that the order of importance changes in the boosted tree models when correlated predictors are added, although, unlike in the random forest model, the top predictor (V4) is never bumped out of the top slot. In addition, the duplicate predictors never rise above the 6th slot, whereas in the random forest model duplicate1 took the 4th slot.
library(Cubist)
cubist_model1 = train(simulated[1:10], simulated$y, method = "cubist")
cubist_model2 = train(simulated[-c(11,13)], simulated$y, method = "cubist")
cubist_model3 = train(simulated[-11], simulated$y, method = "cubist")
cubist_imp1 <- data.frame(varImp(cubist_model1)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
cubist_imp1 <- cubist_imp1[order(-cubist_imp1$Overall), , drop=FALSE]
cubist_imp2 <- data.frame(varImp(cubist_model2)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
cubist_imp2 <- cubist_imp2[order(-cubist_imp2$Overall), , drop=FALSE]
cubist_imp3 <- data.frame(varImp(cubist_model3)$importance) %>%
rownames_to_column("var") %>% remove_rownames()
cubist_imp3 <- cubist_imp3[order(-cubist_imp3$Overall), , drop=FALSE]
cubist_imp1[c(11:12),] <- NA
cubist_imp2[12,] <- NA
tab6 <- cbind(cubist_imp1, cubist_imp2, cubist_imp3) %>% remove_rownames()
tab6 <- tab6[c(2,1,4,3,6,5)]
names(tab6) <- c("cubist1", "var1", "cubist2", "var2", "cubist3", "var3")
kab_tab(tab6)

| cubist1 | var1 | cubist2 | var2 | cubist3 | var3 |
|---|---|---|---|---|---|
| 100.00 | V1 | 100.000 | V1 | 100.000 | V1 |
| 75.69 | V2 | 75.694 | V2 | 88.321 | V2 |
| 68.06 | V4 | 68.750 | V4 | 71.533 | V4 |
| 58.33 | V3 | 58.333 | V3 | 62.044 | V5 |
| 55.56 | V5 | 56.944 | V5 | 57.664 | V3 |
| 15.28 | V6 | 13.889 | V6 | 11.679 | V6 |
| 0.00 | V7 | 2.083 | duplicate1 | 5.109 | duplicate1 |
| 0.00 | V8 | 0.000 | V7 | 1.460 | V10 |
| 0.00 | V9 | 0.000 | V8 | 1.460 | duplicate2 |
| 0.00 | V10 | 0.000 | V9 | 0.000 | V7 |
| NA | NA | 0.000 | V10 | 0.000 | V8 |
| NA | NA | NA | NA | 0.000 | V9 |
Of all the models tried, the Cubist model changes the least when correlated predictors are added. The top three slots stay the same in all three scenarios, and the top six stay the same when only one correlated predictor is added. The duplicate predictors also remain lower in the list, in the 7th and 9th slots, compared with the random forest model, where duplicate1 took the 4th slot, and the boosted model, where the duplicates landed in the 6th and 8th slots.
Use a simulation to show tree bias with different granularities.
set.seed(222)
x1_low <- sample(-1:1, 250, replace = T)
y <- x1_low + rnorm(250, mean=0, sd=1)
x2_mid <- x1_low + round(rnorm(250, mean=0, sd=1), 1)
x3_high <- x1_low + round(rnorm(250, mean=0, sd=1), 2)
df <- data.frame(cbind(x1_low, x2_mid, x3_high, y))
str(df)

## 'data.frame': 250 obs. of 4 variables:
## $ x1_low : num 1 -1 0 -1 1 1 0 0 0 -1 ...
## $ x2_mid : num 1.3 -0.8 0.4 1.5 1.7 1.4 0.8 2 1.6 1.1 ...
## $ x3_high: num 1.25 -1.14 -0.35 -0.32 0.33 1.74 0.29 1.26 -0.68 -0.6 ...
## $ y : num 0.651 -0.755 0.602 -1.377 0.63 ...
library(rpart)
tree_mod <- rpart(y ~ ., data = df)
cor_to_y <- c(cor(x1_low, y), cor(x2_mid, y), cor(x3_high, y))
no_unique_vals <- c(length(unique(x1_low)),
length(unique(x2_mid)),
length(unique(x3_high)))
var_imp <- varImp(tree_mod)
tab7 <- cbind(cor_to_y, no_unique_vals, var_imp)
names(tab7)[3] <- "importance"
kab_tab2(tab7)

| | cor_to_y | no_unique_vals | importance |
|---|---|---|---|
| x1_low | 0.5293 | 3 | 0.3214 |
| x2_mid | 0.2283 | 70 | 1.0573 |
| x3_high | 0.2775 | 209 | 1.1188 |
In the simulation above you can see that the predictors with more distinct values were given higher importance scores, even though they have much weaker correlations with the outcome than x1_low, the low-granularity predictor that actually generated y.
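The same bias can be pushed further: a predictor that is pure noise can draw importance away from an informative but coarse predictor simply because it offers many more candidate split points. A minimal sketch (the variable names, seed, and noise level are invented for illustration):
library(rpart)
library(caret)
set.seed(111)
n <- 250
x_signal <- sample(0:1, n, replace = TRUE)   # informative, only two distinct values
x_noise  <- rnorm(n)                         # unrelated to the outcome, ~250 distinct values
y2 <- x_signal + rnorm(n)
noise_tree <- rpart(y2 ~ x_signal + x_noise)
varImp(noise_tree)   # x_noise typically still receives non-trivial importance
                     # despite having no true relationship with y2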
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has a higher bagging fraction, so each tree is fit to a large random sample of the training data, and a higher learning rate, so each tree contributes heavily to the ensemble. With most of the data available at every iteration, the same few dominant predictors drive the early splits again and again, and the large learning rate builds the fit in a few big steps that lean on those predictors. The model on the left, with a small bagging fraction and learning rate, builds the fit up slowly from many trees, each grown on a small, different random sample of the data, so a wider set of predictors ends up being used somewhere in the ensemble. Because variable importance is calculated by averaging a predictor's improvement values across all trees in the ensemble, the right-hand model concentrates importance on a handful of predictors while the left-hand model spreads it over many more.
I would expect the model with the lower bagging fraction and learning rate to be the better predictor of new samples, since it relies on a broader set of predictors and since lower learning rates are generally recommended: they build the predictions up more slowly over more iterations, which tends to reduce overfitting.
Increasing tree depth (interaction depth) spreads the importance over more predictors, because deeper trees have more nodes and therefore more opportunities to split on additional variables. Increasing the depth should therefore reduce the importance values of the top predictors and raise those of the lower-ranked ones, flattening the slope of the predictor-importance curve in either model; a rough sketch of both effects is given below.
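Figure 8.24 itself is not reproduced here, but both effects can be sketched on the solubility data from the AppliedPredictiveModeling package. This is only an illustration under assumed settings (the tree count and the depth of 7 used for the bagging/learning-rate comparison are arbitrary choices, not the values behind the book's figure), fit without resampling purely to compare importance profiles:
library(caret)
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY
# One gbm fit at a fixed tuning combination; bag.fraction is passed through to gbm.fit() by caret
fit_gbm <- function(bag_frac, learn_rate, depth) {
  train(solTrainXtrans, solTrainY, method = "gbm", verbose = FALSE,
        bag.fraction = bag_frac,
        trControl = trainControl(method = "none"),
        tuneGrid = data.frame(interaction.depth = depth, n.trees = 100,
                              shrinkage = learn_rate, n.minobsinnode = 10))
}
# Bagging fraction / learning rate contrast (as in the two panels of Fig. 8.24)
plot(varImp(fit_gbm(0.1, 0.1, 7)), top = 25)   # importance spread over many predictors
plot(varImp(fit_gbm(0.9, 0.9, 7)), top = 25)   # importance concentrated on a few
# Interaction depth contrast at a fixed bagging fraction and learning rate
plot(varImp(fit_gbm(0.5, 0.1, 1)),  top = 25)  # stumps: importance piles up on a few predictors
plot(varImp(fit_gbm(0.5, 0.1, 10)), top = 25)  # deeper trees: importance spread more evenly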
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
CMP_imputed <- predict(imputer, ChemicalManufacturingProcess)
near0 <- nearZeroVar(CMP_imputed)
cmp <- CMP_imputed[,-near0]
trainingRows <- createDataPartition(cmp$Yield, p = .80, list= FALSE)
train <- cmp[trainingRows, ]
trainx <- train[,-1]
trainy <- train$Yield
test <- cmp[-trainingRows, ]
testx <- test[,-1]
testy <- test$Yield

set.seed(123)
RF_grid <- expand.grid(mtry=seq(7,49,by=3))
RF_mod <- train(trainx, trainy, method = "rf",
tuneGrid = RF_grid)
RF_imp <- varImp(RF_mod, scale = FALSE)
RF_imp <- RF_imp$importance %>% rownames_to_column("var")
RF_imp <- RF_imp[order(-RF_imp$Overall), , drop=FALSE] %>% remove_rownames()
plot(RF_mod)

| var | Overall |
|---|---|
| ManufacturingProcess32 | 30.578 |
| ManufacturingProcess13 | 10.805 |
| BiologicalMaterial12 | 6.665 |
| ManufacturingProcess31 | 6.435 |
| BiologicalMaterial06 | 5.722 |
| ManufacturingProcess17 | 5.276 |
| ManufacturingProcess36 | 4.966 |
| ManufacturingProcess06 | 4.593 |
| BiologicalMaterial03 | 4.478 |
| ManufacturingProcess28 | 4.194 |
| BiologicalMaterial11 | 3.133 |
| ManufacturingProcess09 | 3.091 |
library(gbm)
gbmGrid2 <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 100),
.shrinkage = c(0.005, 0.01, 0.1, 0.2),
.n.minobsinnode = seq(5, 16, by = 5))
boost_mod <- train(trainx, trainy, method = "gbm",
tuneGrid = gbmGrid2, verbose = FALSE)
boost_imp <- data.frame(varImp(boost_mod)$importance) %>%
rownames_to_column("var")
boost_imp <- boost_imp[order(-boost_imp$Overall), , drop=FALSE] %>%
remove_rownames()
plot(boost_mod)

| var | Overall |
|---|---|
| ManufacturingProcess32 | 100.000 |
| ManufacturingProcess13 | 28.978 |
| BiologicalMaterial12 | 21.216 |
| ManufacturingProcess17 | 19.744 |
| ManufacturingProcess06 | 19.224 |
| ManufacturingProcess31 | 14.309 |
| ManufacturingProcess09 | 11.045 |
| BiologicalMaterial03 | 8.492 |
| BiologicalMaterial11 | 7.675 |
| BiologicalMaterial09 | 7.099 |
| ManufacturingProcess21 | 6.478 |
| BiologicalMaterial05 | 6.227 |
cub_grid <- expand.grid(committees = c(1,5,10,20,40,80), neighbors = c(0,1,3,5,7,9))
cubist_mod = train(trainx, trainy, method = "cubist", tuneGrid = cub_grid)
cubist_imp <- data.frame(varImp(cubist_mod)$importance) %>%
rownames_to_column("var")
cubist_imp <- cubist_imp[order(-cubist_imp$Overall), , drop=FALSE] %>%
remove_rownames()
plot(cubist_mod)

| var | Overall |
|---|---|
| ManufacturingProcess32 | 100.00 |
| ManufacturingProcess13 | 62.50 |
| ManufacturingProcess29 | 32.95 |
| ManufacturingProcess31 | 31.82 |
| BiologicalMaterial06 | 29.55 |
| ManufacturingProcess09 | 29.55 |
| BiologicalMaterial10 | 27.27 |
| BiologicalMaterial03 | 27.27 |
| BiologicalMaterial02 | 26.14 |
| ManufacturingProcess17 | 23.86 |
| ManufacturingProcess33 | 23.86 |
| ManufacturingProcess15 | 19.32 |
Which tree-based regression model gives the optimal resampling and test set performance?
RF_pred_train <- predict(RF_mod)
boost_pred_train <- predict(boost_mod)
cubist_pred_train <- predict(cubist_mod)
kab_tab2(data.frame(rbind(postResample(pred = cubist_pred_train, obs = trainy),
postResample(pred = boost_pred_train, obs = trainy),
postResample(pred = RF_pred_train, obs = trainy)),
row.names = c("Cubist","Gradient Boost", "Random Forest")))| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Cubist | 0.0000 | 1.0000 | 0.0000 |
| Gradient Boost | 0.1383 | 0.9837 | 0.0924 |
| Random Forest | 0.2580 | 0.9564 | 0.1948 |
RF_pred <- predict(RF_mod, newdata = testx)
boost_pred <- predict(boost_mod, newdata = testx)
cubist_pred <- predict(cubist_mod, newdata = testx)
kab_tab2(data.frame(rbind(postResample(pred = cubist_pred, obs = testy),
postResample(pred = boost_pred, obs = testy),
postResample(pred = RF_pred, obs = testy)),
row.names = c("Cubist","Gradient Boost", "Random Forest")))| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Cubist | 0.3175 | 0.9097 | 0.2494 |
| Gradient Boost | 0.5171 | 0.7808 | 0.3878 |
| Random Forest | 0.6248 | 0.7046 | 0.4496 |
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.9320 0.3706 0.6893
## 1 1 0.9487 0.3780 0.6940
## 1 3 0.9322 0.3787 0.6869
## 1 5 0.9303 0.3785 0.6853
## 1 7 0.9298 0.3762 0.6865
## 1 9 0.9306 0.3738 0.6874
## 5 0 0.7424 0.5058 0.5645
## 5 1 0.7299 0.5256 0.5501
## 5 3 0.7314 0.5210 0.5526
## 5 5 0.7343 0.5165 0.5544
## 5 7 0.7376 0.5121 0.5567
## 5 9 0.7407 0.5083 0.5600
## 10 0 0.7043 0.5444 0.5389
## 10 1 0.6864 0.5683 0.5201
## 10 3 0.6908 0.5619 0.5242
## 10 5 0.6946 0.5568 0.5280
## 10 7 0.6978 0.5521 0.5304
## 10 9 0.7007 0.5489 0.5330
## 20 0 0.6736 0.5801 0.5201
## 20 1 0.6541 0.6028 0.4962
## 20 3 0.6603 0.5961 0.5036
## 20 5 0.6645 0.5909 0.5085
## 20 7 0.6686 0.5857 0.5122
## 20 9 0.6714 0.5828 0.5152
## 40 0 0.6614 0.5963 0.5095
## 40 1 0.6407 0.6195 0.4852
## 40 3 0.6472 0.6132 0.4936
## 40 5 0.6515 0.6082 0.4977
## 40 7 0.6558 0.6029 0.5016
## 40 9 0.6587 0.5998 0.5045
## 80 0 0.6485 0.6161 0.4980
## 80 1 0.6280 0.6379 0.4732
## 80 3 0.6343 0.6319 0.4823
## 80 5 0.6384 0.6275 0.4871
## 80 7 0.6427 0.6222 0.4904
## 80 9 0.6455 0.6195 0.4932
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 80 and neighbors = 1.
| | committees | neighbors |
|---|---|---|
| 32 | 80 | 1 |
The Cubist model gives the best (lowest) RMSE and MAE values for both the training set (resubstitution) predictions and the test set predictions, as well as the highest \(R^2\) (although the \(R^2\) statistic should really only be used to compare tuning parameters within the same model, not to compare across different types of models).
Although the model chosen by the algorithm had 80 committees and 1 neighbor, the Cubist tuning plot above shows that little is gained beyond 40 committees, so the 40-committee, 1-neighbor model could be selected instead to limit complexity.
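If the simpler 40-committee model were preferred, its test set performance could be checked with a quick refit at that single tuning combination (reusing the trainx, trainy, testx, and testy objects from above; this is a sketch, not part of the original analysis):
cubist_small <- train(trainx, trainy, method = "cubist",
                      trControl = trainControl(method = "none"),   # no resampling needed for a single combination
                      tuneGrid = data.frame(committees = 40, neighbors = 1))
postResample(predict(cubist_small, newdata = testx), testy)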
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

| var | Overall |
|---|---|
| ManufacturingProcess32 | 100.00 |
| ManufacturingProcess13 | 62.50 |
| ManufacturingProcess29 | 32.95 |
| ManufacturingProcess31 | 31.82 |
| BiologicalMaterial06 | 29.55 |
| ManufacturingProcess09 | 29.55 |
| BiologicalMaterial10 | 27.27 |
| BiologicalMaterial03 | 27.27 |
| BiologicalMaterial02 | 26.14 |
| ManufacturingProcess17 | 23.86 |
The manufacturing process variables still dominate the list, taking the top four slots; however, as in the optimal nonlinear model, four of the top ten predictors are biological materials, so they are not completely absent as they were in the optimal linear model. The same two predictors, ManufacturingProcess32 and ManufacturingProcess13, came out on top in the optimal tree-based model as in the nonlinear model, although in reverse order. The next two most influential variables, ManufacturingProcess29 and ManufacturingProcess31, ranked much lower in the nonlinear list, at 11 and 17 respectively.
Plot the optimal single tree with the distribution of yield in the terminal nodes.
First we’ll have to train a model since we haven’t trained a single tree model yet!
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
tree_grid <- expand.grid(maxdepth= seq(1,15, by=1))
tree_mod <- train(trainx, trainy, method = "rpart2",
tuneGrid = tree_grid, trControl = ctrl)
library(partykit)
plot(as.party(tree_mod$finalModel), gp=gpar(fontsize=11))

The most notable insight from the tree plot is that, with one exception, the yield distributions in the terminal nodes on the left branch of the top split on ManufacturingProcess32 are lower than all of the distributions on the right branch. This makes it clear why that split was made and why this variable consistently comes out at or near the top of the importance rankings.
Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." BMC Bioinformatics, 8(1), 25.