Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6-V10)?
From the varImp output, the uninformative predictors V6-V10 have importance scores at or near zero (several are slightly negative), so the random forest did not use them in any significant way. The most important variables are V1, V2, V4, and V5.
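As a quick check (a minimal sketch reusing the rfImp1 object above), sorting the unscaled scores makes the gap between the informative and uninformative predictors explicit:
# Sort unscaled importance; V1-V5 separate clearly from V6-V10
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]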
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(123)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9504983
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.183260546
## V2 6.318610678
## V3 0.697206467
## V4 6.952539867
## V5 1.986661056
## V6 0.233829847
## V7 -0.006712779
## V8 -0.083743959
## V9 -0.020932796
## V10 -0.069609685
## duplicate1 4.305479706
Looking at the variable importance output, V1's importance dropped from about 8.7 to 5.2 after the correlated predictor was added, and there was a small decrease overall, which makes sense because the importance is now shared between V1 and its near-copy. The duplicated column itself has an importance of about 4.3.
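To see the shift directly, here is a small sketch that lines up the two sets of scores (duplicate1 is NA in the first model, which did not include it):
# Merge the two importance tables by predictor name
comp <- merge(rfImp1, rfImp2, by = "row.names", all = TRUE,
              suffixes = c("_before", "_after"))
names(comp)[1] <- "Predictor"
comp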
set.seed(234)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .12
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9243674
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.15929157
## V2 6.16758547
## V3 0.43858329
## V4 7.28389142
## V5 1.90255763
## V6 0.21213093
## V7 -0.01960485
## V8 -0.09552319
## V9 -0.06038266
## V10 -0.01800977
## duplicate1 3.14294997
## duplicate2 2.87803982
From the output, V1's importance was reduced even further (to about 4.2) when we added a second highly correlated variable.
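A minimal sketch pulling V1's unscaled importance from the three fits shows the erosion directly:
# V1's importance drops each time a correlated copy is added
c(no_duplicates = rfImp1["V1", "Overall"],
  one_duplicate = rfImp2["V1", "Overall"],
  two_duplicates = rfImp3["V1", "Overall"])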
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
set.seed(345)
cforest1 <- cforest(y ~ ., data = simulated[, 1:11], controls = cforest_control(ntree = 1000)) # original predictors only, excluding the highly correlated columns
cforest2 <- cforest(y ~ ., data = simulated[, 1:12], controls = cforest_control(ntree = 1000)) # adds duplicate1, the first column highly correlated with V1
cforest3 <- cforest(y ~ ., data = simulated, controls = cforest_control(ntree = 1000)) # includes both correlated columns
cforest.imp1 <- varimp(cforest1)
cforest.imp2 <- varimp(cforest2)
cforest.imp3 <- varimp(cforest3)
cforest.imp4 <- varimp(cforest1, conditional = TRUE)
cforest.imp5 <- varimp(cforest2, conditional = TRUE)
cforest.imp6 <- varimp(cforest3, conditional = TRUE)
library(knitr)
var.names <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "Corr1", "Corr2")
cforest.imp1 <- data.frame(Original_Simulated = cforest.imp1, Variable = factor(names(cforest.imp1), levels = var.names))
cforest.imp2 <- data.frame(Original_Simulated_Correlated1 = cforest.imp2, Variable = factor(names(cforest.imp2), levels = var.names))
cforest.imp3 <- data.frame(Original_Simulated_Correlated2 = cforest.imp3, Variable = factor(names(cforest.imp3), levels = var.names))
cforest.imp4 <- data.frame(Original_Simulated2 = cforest.imp4, Variable = factor(names(cforest.imp4), levels = var.names))
cforest.imp5 <- data.frame(Original_Simulated2_Correlated1 = cforest.imp5, Variable = factor(names(cforest.imp5), levels = var.names))
cforest.imp6 <- data.frame(Original_Simulated2_Correlated2 = cforest.imp6, Variable = factor(names(cforest.imp6), levels = var.names))
cforest.imp <- merge(cforest.imp1, cforest.imp2, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp3, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp4, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp5, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp6, all = TRUE)
cforest.imp$Variable <- factor(cforest.imp$Variable, levels = var.names)
cforest.imp <- cforest.imp[order(cforest.imp$Variable), ]
kable(cforest.imp)
|    | Variable | Original_Simulated | Original_Simulated_Correlated1 | Original_Simulated_Correlated2 | Original_Simulated2 | Original_Simulated2_Correlated1 | Original_Simulated2_Correlated2 |
|---|---|---|---|---|---|---|---|
| 1 | V1 | 9.2912099 | 5.6900862 | 4.2663934 | 3.1088424 | 1.1760428 | 0.5722840 |
| 3 | V2 | 7.2699157 | 6.3294000 | 6.1227909 | 3.9328688 | 3.5972613 | 3.4458574 |
| 4 | V3 | 0.0465104 | 0.0468226 | 0.0571431 | -0.0043185 | 0.0004532 | 0.0265667 |
| 5 | V4 | 8.7462832 | 8.0197883 | 7.8767015 | 4.9071478 | 4.6617310 | 4.4435125 |
| 6 | V5 | 2.1798340 | 2.0186085 | 1.7982137 | 0.7181428 | 0.8664706 | 0.7762352 |
| 7 | V6 | 0.0153615 | -0.0050624 | 0.0521642 | 0.0065393 | 0.0042916 | 0.0194132 |
| 8 | V7 | 0.0998168 | 0.0497234 | 0.0420129 | 0.0287170 | 0.0067734 | 0.0099143 |
| 9 | V8 | -0.0447170 | -0.0436383 | -0.0092861 | -0.0193337 | -0.0133336 | 0.0008951 |
| 10 | V9 | -0.0281041 | -0.0261491 | -0.0027388 | 0.0001600 | -0.0064891 | 0.0005056 |
| 2 | V10 | -0.0378497 | -0.0237428 | -0.0056340 | 0.0037756 | -0.0039407 | -0.0004141 |
| 11 | NA | NA | 4.0672712 | 2.9405583 | NA | 0.8991779 | 0.5276705 |
| 12 | NA | NA | 4.0672712 | 2.9405583 | NA | 0.8991779 | 0.2393722 |
| 13 | NA | NA | 4.0672712 | 2.6147749 | NA | 0.8991779 | 0.5276705 |
| 14 | NA | NA | 4.0672712 | 2.6147749 | NA | 0.8991779 | 0.2393722 |
Both the unconditional and conditional cforest importances show the same general pattern as the traditional random forest: V1's importance falls as correlated copies are added (roughly 9.3 to 5.7 to 4.3 unconditionally, and 3.1 to 1.2 to 0.6 conditionally), and the conditional measure of Strobl et al. (2007) shrinks V1 and the duplicates further while leaving V4 and V2 at the top.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(ipred)
set.seed(789)
bg1 <- bagging(y ~ ., data = simulated[, 1:11], nbagg = 50) # original predictors only
bg2 <- bagging(y ~ ., data = simulated[, 1:12], nbagg = 50) # adds duplicate1
bg3 <- bagging(y ~ ., data = simulated, nbagg = 50)         # adds both duplicates
bg.varimp1 <- varImp(bg1)
bg.varimp2 <- varImp(bg2)
bg.varimp3 <- varImp(bg3)
names(bg.varimp1) <- "Original_Simulated"
names(bg.varimp2) <- "Simulated_Correlated1"
names(bg.varimp3) <- "Simulated_Correlated2"
bg.varimp1$Variable <- factor(rownames(bg.varimp1), levels = var.names)
bg.varimp2$Variable <- factor(rownames(bg.varimp2), levels = var.names)
bg.varimp3$Variable <- factor(rownames(bg.varimp3), levels = var.names)
bg.imp <- merge(bg.varimp1, bg.varimp2, all = TRUE)
bg.imp <- merge(bg.imp, bg.varimp3, all = TRUE)
bg.imp <- bg.imp[order(bg.imp$Variable), ]
kable(bg.imp)
|    | Variable | Original_Simulated | Simulated_Correlated1 | Simulated_Correlated2 |
|---|---|---|---|---|
| 1 | V1 | 1.8937773 | 1.9005476 | 1.9169439 |
| 3 | V2 | 2.4074842 | 2.1294167 | 1.9858935 |
| 4 | V3 | 1.4737451 | 0.9737914 | 0.9478109 |
| 5 | V4 | 2.8503501 | 2.6136684 | 2.3375513 |
| 6 | V5 | 2.4934352 | 2.2500432 | 2.0839561 |
| 7 | V6 | 0.9243688 | 0.8501971 | 0.7452643 |
| 8 | V7 | 1.1582842 | 0.8305714 | 0.7755231 |
| 9 | V8 | 0.5758429 | 0.4132404 | 0.4788912 |
| 10 | V9 | 0.7445645 | 0.6010269 | 0.4456381 |
| 2 | V10 | 0.7938636 | 0.6764631 | 0.7833463 |
| 11 | NA | NA | 1.6696019 | 1.7558808 |
| 12 | NA | NA | 1.6696019 | 1.5874815 |
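As a quick summary of the table above (a small sketch using the bg.varimp data frames already built), V1's bagged importance barely moves across the three fits:
# V1's bagged-tree importance with no, one, and two correlated copies
c(original = bg.varimp1["V1", "Original_Simulated"],
  with_duplicate1 = bg.varimp2["V1", "Simulated_Correlated1"],
  with_both = bg.varimp3["V1", "Simulated_Correlated2"])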
For the most part the bagged trees follow the same general trend as the other models, although V1's importance barely changes when the correlated copies are added. I was surprised to see V5 ranking higher here, and the uninformative predictors also receive noticeably higher importance scores than they did under the random forest. Next I fit Cubist models with the Cubist package, setting the committees argument to 100 to specify how many rule-based models to build.
library(Cubist)
set.seed(891)
simulated2 <- simulated[-12] # drop duplicate1 so this set keeps only duplicate2
simulated3 <- simulated[-13] # drop duplicate2 so this set keeps only duplicate1
cb1 <- cubist(x = simulated[, 1:10], y = simulated$y, committees = 100)
cb2 <- cubist(x = simulated2[, names(simulated2) != "y"], y = simulated2$y, committees = 100)
cb3 <- cubist(x = simulated3[, names(simulated3) != "y"], y = simulated3$y, committees = 100)
cb.varimp1 <- varImp(cb1)
cb.varimp2 <- varImp(cb2)
cb.varimp3 <- varImp(cb3)
cb.varimp1
## Overall
## V1 71.5
## V3 47.0
## V2 58.5
## V4 48.0
## V5 33.0
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
cb.varimp2
## Overall
## V3 48.5
## V2 58.5
## V1 58.0
## V4 47.5
## duplicate2 21.5
## V5 36.5
## V6 14.0
## V8 2.0
## V10 1.0
## V9 0.5
## V7 0.0
cb.varimp3
## Overall
## V1 50.0
## V3 46.0
## V2 58.0
## V4 47.5
## V5 30.5
## duplicate1 25.5
## V8 7.5
## V6 3.5
## V7 0.0
## V9 0.0
## V10 0.0
The Cubist models put the highest importance on V1 through V5, as before. V1's score does drop when a correlated copy is present (71.5 with no duplicate versus 58.0 and 50.0 with one), and the duplicates themselves receive only moderate importance (21.5 and 25.5), so the pattern is similar to the random forest but less pronounced.
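A small sketch pulling V1's score from the three Cubist fits makes the comparison explicit (cb1 has no duplicate, cb2 includes duplicate2, cb3 includes duplicate1):
# V1's Cubist importance with and without a correlated copy
c(no_duplicate = cb.varimp1["V1", "Overall"],
  with_duplicate2 = cb.varimp2["V1", "Overall"],
  with_duplicate1 = cb.varimp3["V1", "Overall"])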
Use a simulation to show tree bias with different granularities.
The text describes a selection bias in single regression trees: predictors with more distinct values, and therefore more possible split points, are more likely to be chosen for splits near the top of the tree, even when they have little or no relationship with the response. To demonstrate this with a simulation, I generate two predictors and one response. X1 is a low-granularity predictor that takes only two values and is used to generate the response; X2 is a continuous, high-granularity predictor that is independent of the response. If the tree were unbiased, X1 should come out as the important variable and X2 as negligible.
set.seed(147)
X1 <- rep(1:2, each = 100)              # low granularity: only two distinct values
Y <- X1 + rnorm(200, mean = 0, sd = 4)  # response generated from X1 plus noise
set.seed(148)
X2 <- rnorm(200, mean = 0, sd = 2)      # high granularity, unrelated to Y
simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
library(rpart)
set.seed(149)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
## Overall
## X1 0.05409211
## X2 0.34896693
X2 should not have a higher importance score than X1, since the response was generated from X1 alone, but the output shows the opposite: the tree favors the high-granularity predictor X2, illustrating the selection bias.
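As a sanity check (a small sketch using the simData frame above), the two predictors' granularity and their relationship with the response can be compared directly:
# X2 has far more distinct values than X1, yet Y was generated from X1
# and X2 is independent of Y
c(distinct_X1 = length(unique(simData$X1)),
  distinct_X2 = length(unique(simData$X2)),
  cor_Y_X1 = cor(simData$Y, simData$X1),
  cor_Y_X2 = cor(simData$Y, simData$X2))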
In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (.1 and .9) and the learning rate (.1 and .9) for the solubility data. The left-hand plot has both parameters set to .1 and the right-hand plot has both set to .9:
Why does the model on the right focus its importance on just the first few of the predictors, whereas the model on the left spreads importance across more predictors?
Answer:
As the learning rate approaches 1, the model becomes greedier: each tree corrects a larger share of the residual error, so the ensemble relies on the few predictors with the strongest immediate association with the response instead of spreading structure across many trees. A larger bagging fraction means each tree is built on most of the same data, so the trees are more alike and keep selecting the same predictors. Raising both parameters from .1 to .9 therefore concentrates importance on the first few predictors, whereas the .1/.1 model spreads importance across many more of them.
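Figure 8.24 itself comes from the text, but the effect can be sketched on the solubility data from the AppliedPredictiveModeling package. The settings below (n.trees = 100, interaction depth 1) are assumptions for illustration, not the figure's exact tuning; comparing the two relative-influence tables should still show how importance concentrates at the .9/.9 settings.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility) # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
              interaction.depth = 1)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 1)
head(summary(gbmLow, plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbmHigh, plotit = FALSE), 10)  # importance concentrated in a few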
Which model do you think would be more predictive of other samples?
Answer:
I think the model with both parameters set to .1 would be more predictive of other samples: the smaller learning rate and bagging fraction make the ensemble learn more slowly from more varied subsamples, which generally reduces overfitting and generalizes better to new data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Answer:
Increasing the interaction depth allows each tree to use more predictors, so importance would be spread across more of them and the slope of the predictor importance plot would flatten for both models.
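Continuing the sketch above (same assumed settings, reusing solTrain and the gbm fits), refitting the .9/.9 model with a deeper interaction depth lets each tree draw on more predictors, which should flatten the importance profile:
set.seed(100)
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 7)
head(summary(gbmDeep, plotit = FALSE), 10)  # relative influence spread over more predictors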
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
library(caret)
set.seed(100)
data(ChemicalManufacturingProcess)
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(624)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
Which tree-based regression model gives the optimal resampling and test set performance?
Single Tree
set.seed(100)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 1.5632389 0.5035625 1.1913334
Bagged Trees
set.seed(100)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 1.2311358 0.7141633 0.8911824
Random Forest
set.seed(100)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 1.2598015 0.7216121 0.8998200
Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 1.1824945 0.7195344 0.8678498
Cubist
set.seed(100)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.9804618 0.7964499 0.7363520
The Cubist model has the lowest RMSE (along with the highest R-squared and lowest MAE), giving the best test set performance among the tree-based models:
rbind(cart = postResample(cartPred, test_y),
bagged = postResample(baggedPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## cart 1.5632389 0.5035625 1.1913334
## bagged 1.2311358 0.7141633 0.8911824
## randomForest 1.2598015 0.7216121 0.8998200
## boosted 1.1824945 0.7195344 0.8678498
## cubist 0.9804618 0.7964499 0.7363520
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Among the top 20 predictors of the Cubist model, the manufacturing process variables dominate at a ratio of 16:4, whereas the optimal linear and nonlinear models had ratios of 11:9.
varImp(cubistTuned, scale = TRUE)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 62.35
## ManufacturingProcess09 54.12
## BiologicalMaterial03 54.12
## BiologicalMaterial02 52.94
## ManufacturingProcess13 38.82
## BiologicalMaterial06 36.47
## ManufacturingProcess04 34.12
## ManufacturingProcess33 32.94
## ManufacturingProcess10 22.35
## ManufacturingProcess11 18.82
## ManufacturingProcess14 17.65
## ManufacturingProcess39 17.65
## ManufacturingProcess28 14.12
## ManufacturingProcess16 12.94
## BiologicalMaterial12 12.94
## ManufacturingProcess31 12.94
## ManufacturingProcess29 12.94
## ManufacturingProcess02 12.94
## ManufacturingProcess21 11.76
plot(varImp(cubistTuned), top = 20)
For the tree-based model, only 3 of the top 10 predictors are biological variables, compared to 4 in the linear and nonlinear models. ManufacturingProcess32 is still deemed the most important, and the remaining predictors have lower importance. BiologicalMaterial06 is only the seventh most important here, whereas it was the second most important in the other models. Some predictors that were not previously in the top 10 now appear there, such as ManufacturingProcess 17, 04, 33, and 10.
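As a small check of the top-10 count (reusing the cubistTuned fit above):
# Count biological vs. process variables among the Cubist top 10
cubistImp <- varImp(cubistTuned)$importance
top10 <- rownames(cubistImp)[order(cubistImp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))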
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit) # as.party() and the party plotting method (grid supplies gpar)
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))