# Packages used in this write-up (assumed installed): mlbench, randomForest, caret,
# dplyr, plus party, gbm, Cubist, partykit, caTools, AppliedPredictiveModeling later on
library(mlbench); library(randomForest); library(caret); library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative
predictors (V6 – V10)?
# simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
# cor(simulated$duplicate1, simulated$V1)
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Use the cforest function in the party package to fit
a random forest model using conditional inference trees. The party
package function varimp can calculate predictor importance.
The conditional argument of that function toggles between the
traditional importance measure and the modified version described in
Strobl et al. (2007). Do these importances show the same pattern as the
traditional random forest model?
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
rfImp1 %>%
arrange(desc(Overall))
## Overall
## V1 8.732235404
## V4 7.615118809
## V2 6.415369387
## V5 2.023524577
## V3 0.763591825
## V6 0.165111172
## V7 -0.005961659
## V10 -0.074944788
## V9 -0.095292651
## V8 -0.166362581
The output above shows that the uninformative predictors (V6–V10) contribute very little to the random forest model; in fact, V7 through V10 have slightly negative importance scores.
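As a quick visual check of the same scores (optional, not part of the exercise), randomForest's built-in varImpPlot can be applied to the fitted model:
# permutation-based (%IncMSE) and node-purity importance side by side
varImpPlot(model1, sort = TRUE, main = "Variable importance, model1")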
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
The output above shows that duplicate1 and V1 are highly correlated (about 0.95). Now let's fit another random forest model and look at the variable importance scores again with the new duplicate1 variable included.
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>%
arrange(desc(Overall))
## Overall
## V4 6.86363294
## V2 6.05937899
## V1 6.00709780
## duplicate1 4.43323167
## V5 2.19939891
## V3 0.58465293
## V6 0.10898039
## V10 0.09999339
## V9 0.06123662
## V7 0.06104207
## V8 -0.04059204
Based on the output above, the importance ranking of V1 dropped from 1st place to 3rd, with duplicate1 coming right behind it in 4th place. Now let's add another predictor that is also highly correlated with V1.
set.seed(100)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .15
cor(simulated$duplicate2, simulated$V1)
## [1] 0.8998308
The output above shows us that duplicate2 and
V1 have a correlation of 0.9, which is very high. Let’s fit
a random forest model to the modified dataset.
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 %>%
arrange(desc(Overall))
## Overall
## V4 7.254235714
## V2 6.584371908
## V1 5.760644106
## duplicate1 3.356974030
## V5 2.194899926
## duplicate2 1.830541809
## V3 0.490851860
## V6 0.171059467
## V10 0.013491410
## V8 -0.001042333
## V7 -0.047423191
## V9 -0.100834810
The output above shows that V1 is still in 3rd place and duplicate1 is still in 4th, with duplicate2 two places behind duplicate1. duplicate2 has roughly half the importance of duplicate1, most likely because duplicate1 is slightly more strongly correlated with V1 than duplicate2 is.
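To make the dilution effect easier to see, a small helper table (built only from the objects already computed above) tracks how V1's importance changes as correlated copies are added:
# V1's importance under 0, 1, and 2 highly correlated duplicates
data.frame(
  duplicates_added = c(0, 1, 2),
  V1_importance = c(rfImp1["V1", "Overall"],
                    rfImp2["V1", "Overall"],
                    rfImp3["V1", "Overall"])
)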
cforestModel <- cforest(
y ~ .,
data = simulated
)
cforestImpTrad <- varimp(cforestModel, conditional = FALSE) %>% sort(decreasing = TRUE)
cforestImpStrobl <- varimp(cforestModel, conditional = TRUE) %>% sort(decreasing = TRUE)
cforestImpTrad
## V4 V1 V2 duplicate1 duplicate2 V5
## 6.49466579 6.49080949 5.55767891 4.33895836 2.75129426 2.01489687
## V7 V3 V9 V8 V6 V10
## 0.22587739 0.19886434 0.04520632 0.03562828 0.03543652 -0.30228330
cforestImpStrobl
## V4 V2 V1 duplicate1 V5
## 5.7081811063 4.7418228149 3.3435137816 1.8499787494 1.4786681242
## duplicate2 V3 V6 V7 V8
## 0.7587727058 0.0736672033 0.0557837900 0.0003579684 -0.1045100995
## V9 V10
## -0.1902618378 -0.4713946121
Comparing the traditional cforest importance measure with the random forest rankings above, V1 and V2 swap positions, as do duplicate2 and V5. Note also that under the traditional measure one of the uninformative predictors, V7, sits just above the informative predictor V3, whereas the conditional (Strobl) measure keeps V3 above all of the uninformative predictors and shrinks the importance of the correlated duplicates relative to V1. Overall, both measures show the same broad pattern as the traditional random forest model: the informative predictors and the duplicates at the top, with V6 through V10 near zero.
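A side-by-side view of the two cforest measures (built from the vectors computed above) makes the comparison easier to read:
# line up the conditional importances against the traditional ordering
round(data.frame(
  traditional = cforestImpTrad,
  conditional = cforestImpStrobl[names(cforestImpTrad)]
), 3)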
The approach for extracting and plotting importance from the GBM model follows a Stack Overflow answer.
gbmModel <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
summary.gbm(gbmModel)
## var rel.inf
## V4 V4 30.6526712
## V2 V2 24.3167512
## V1 V1 21.5777329
## V5 V5 10.6053022
## V3 V3 7.5204037
## duplicate1 duplicate1 2.9138287
## duplicate2 duplicate2 1.8524757
## V7 V7 0.1596372
## V6 V6 0.1444553
## V9 V9 0.1408557
## V10 V10 0.1158861
## V8 V8 0.0000000
The output above shows that V4, V2, and V1 are the three most important predictors, while V6 through V10 have little to no relative influence. Also note that although duplicate1 and duplicate2 pick up some importance because of their high correlation with V1, they sit at the low end of the informative group, just above the uninformative predictors.
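As an optional what-if (not part of the original exercise), refitting the GBM without the two duplicate columns shows how the relative influence attributed to them flows back to V1:
# drop the correlated copies and refit with the same defaults as gbmModel
gbmNoDup <- gbm(y ~ . - duplicate1 - duplicate2, data = simulated,
                distribution = "gaussian")
summary.gbm(gbmNoDup, plotit = FALSE)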
cubistTuned <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistTuned)
## cubist variable importance
##
## Overall
## V1 100.00
## V2 79.17
## V4 68.06
## V3 58.33
## V5 52.08
## V6 25.69
## duplicate2 0.00
## V9 0.00
## V8 0.00
## V10 0.00
## duplicate1 0.00
## V7 0.00
Interestingly, the output above shows that V1 through V6 are the only predictors with nonzero importance; the remaining predictors contribute nothing to the Cubist model. V6 is an uninformative predictor yet still picks up some importance, while duplicate1 and duplicate2 receive none, presumably because they convey almost the same information as V1 and the rule-based model simply uses V1 instead.
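One way to confirm this (optional) is to inspect the rules and linear models inside the final Cubist fit; the attribute-usage table at the end of the summary shows which predictors appear in rule conditions versus model terms:
summary(cubistTuned$finalModel)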
Use a simulation to show tree bias with different granularities.
Page 182 in the Applied Predictive Modeling textbook says the following about regression trees:
“Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010).”
We should create a dataset which contains variables with increasing numbers of distinct values in order to illustrate this characteristic of regression trees.
(Tutorial used to create vector of random samples)
set.seed(100)
x1 <- sample(0:1, size = 200, replace = TRUE)
x2 <- sample(0:1000, size = 200, replace = FALSE)
x3 <- sample(0:100000, size = 200, replace = FALSE)
y <- sample(0:1000, size = 200, replace = TRUE)
sample_df <- data.frame(x1, x2, x3, y)
Now that we have our data, let's fit a single-tree model to the dataset and check the variable importance using the varImp function.
set.seed(100)
rpartTune <- train(y ~ ., data = sample_df,
method = "rpart2",
trControl = trainControl(method = "cv"),
tuneLength = 10)
## note: only 8 possible values of the max tree depth from the initial fit.
## Truncating the grid to 8 .
varImp(rpartTune)
## rpart2 variable importance
##
## Overall
## x3 100.00
## x2 83.85
## x1 0.00
The output above shows that x3, which takes 200 distinct values drawn from 0 to 100,000 without replacement, has the highest importance, followed by x2 (drawn from 0 to 1,000 without replacement), with x1 last. Since x1 can only be 0 or 1, it ends up with an importance of 0. Because the response y is pure noise, none of the predictors is truly informative, so this ordering reflects the tree's selection bias toward predictors with more distinct values.
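To show the bias more directly, a minimal sketch (assuming the rpart package) repeats the simulation many times and records which variable is chosen for the very first split. Since y is pure noise, an unbiased procedure would pick x1, x2, and x3 about equally often; a strong tilt toward x3 illustrates the selection bias.
library(rpart)

set.seed(100)
first_split <- replicate(200, {
  dat <- data.frame(
    x1 = sample(0:1, size = 200, replace = TRUE),
    x2 = sample(0:1000, size = 200, replace = FALSE),
    x3 = sample(0:100000, size = 200, replace = FALSE),
    y  = sample(0:1000, size = 200, replace = TRUE)
  )
  # a single-split tree; cp = 0 so the best available split is always kept
  stump <- rpart(y ~ ., data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  as.character(stump$frame$var[1])
})

table(first_split)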
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Which model do you think would be more predictive of other samples?
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
The model on the right focuses its importance on just the first few predictors because the bagging fraction is high: each tree is built on a large fraction of the training data, so the same strong predictors tend to be found tree after tree, and their contribution to the importance metric keeps accumulating. The model on the left spreads its importance across more predictors because the bagging fraction is 0.1 instead of 0.9; each tree sees a small, different sample of the data, which reduces the chance that the same predictors dominate every tree generated in the iterative fashion that stochastic gradient boosting is known for.
The learning rate scales how much each new tree's predictions are added to the running model. With a low learning rate each tree contributes only a little, so many more iterations are needed and the less important predictors get more opportunities to affect the model; with a high learning rate, a few early trees built on the strongest predictors do most of the work. This also helps explain why the importance profile for the model on the left is more spread out than the one on the right, as the sketch below illustrates.
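A minimal sketch of the Fig. 8.24 comparison, assuming the solubility data objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package and leaving the other gbm settings at illustrative defaults, so the numbers will not match the book exactly:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solData <- cbind(solTrainXtrans, y = solTrainY)

set.seed(100)
gbmLow  <- gbm(y ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(y ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: spread out for the 0.1/0.1 fit, concentrated for 0.9/0.9
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)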
Generally speaking, we would prefer a model that is less prone to overfitting. I feel that the model on the right is fit too closely to the training data, while the model on the left, with its more spread-out importance profile, would be less prone to overfitting and could well achieve a lower RMSE if both models were evaluated on a test set.
Page 205 in the Applied Predictive Modeling textbook explains the following about interaction depth:
“When regression trees are used as the base learner, simple gradient boosting for regression has two tuning parameters: tree depth and number of iterations. Tree depth in this context is also known as interaction depth, since each subsequent split can be thought of as a higher-level interaction term with all of the other previous split predictors.”
Therefore, if we were to increase the interaction depth, each tree would have more splits. More splits give the less important predictors more chances to appear across iterations, so importance would be spread over more predictors and the slopes of the importance profiles in Fig. 8.24 would become less steep as interaction depth increases; a sketch of this follows.
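Continuing the same sketch (same solubility setup and assumptions as above), varying only interaction.depth, with 1 versus 10 chosen purely for illustration, shows the relative-influence profile flattening as depth grows:
set.seed(100)
gbmDepth1  <- gbm(y ~ ., data = solData, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 1)
set.seed(100)
gbmDepth10 <- gbm(y ~ ., data = solData, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 10)

# deeper trees give minor predictors more chances to enter, so the drop-off
# in rel.inf from the top predictors is less steep
head(summary(gbmDepth1,  plotit = FALSE), 10)
head(summary(gbmDepth10, plotit = FALSE), 10)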
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Question 6.3 parts a, b, and c were performed in a previous homework and will be replicated here for the data imputation, data splitting, and pre-processing steps.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
impute <- ChemicalManufacturingProcess %>%
preProcess("knnImpute")
imputed <- predict(impute, ChemicalManufacturingProcess)
X <- imputed %>% select(-Yield)
y <- imputed$Yield
high_freq_predictors <- X[,-nearZeroVar(X)]
dim(high_freq_predictors)
## [1] 176 56
We create the training and test set in the code chunk below.
set.seed(1)
sample <- sample.split(y, SplitRatio = 0.8)
train_X <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(y, sample == TRUE) %>% as.vector()
test_y <- subset(y, sample == FALSE) %>% as.vector()
chemdata <- cbind(train_X, train_y)
chemdata <- as.data.frame(chemdata)
colnames(chemdata)[ncol(chemdata)] <- "y"
set.seed(100)
rpartTune <- train(train_X, train_y,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
rfModel <- randomForest(y ~ .,
chemdata,
importance = TRUE,
ntree = 1000)
gbmModel <- gbm(y ~ ., data = chemdata, distribution = 'gaussian')
cubistTuned <- train(y ~ ., data = chemdata, method = "cubist")
rpartPred <- predict(rpartTune, newdata = test_X)
postResample(pred = rpartPred, obs = test_y)
## RMSE Rsquared MAE
## 0.7830512 0.4084461 0.6230981
rfPred <- predict(rfModel, newdata = test_X)
postResample(pred = rfPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5913957 0.5515540 0.4206182
gbmPred <- predict(gbmModel, newdata = data.frame(test_X))
## Using 100 trees...
postResample(pred = gbmPred, obs = test_y)
## RMSE Rsquared MAE
## 0.6612517 0.4709783 0.5090027
cubistPred <- predict(cubistTuned, newdata = data.frame(test_X))
postResample(pred = cubistPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5868265 0.5714771 0.4384501
The output above shows that the Cubist model gives the best test set performance, with the lowest RMSE (0.587) and the highest R-squared (0.571); it explains roughly 57% of the variance in the test-set yield, compared with an R-squared of about 0.47 for the boosted trees model. A sketch for comparing resampling performance on equal footing follows.
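Since part (a) also asks about resampling performance, here is a sketch of how the cross-validated metrics could be compared directly, assuming all four models are refit through caret on shared CV folds (the method names "rf" and "gbm" are caret's built-in wrappers):
set.seed(100)
cvFolds <- createFolds(train_y, k = 10, returnTrain = TRUE)
ctrlCV  <- trainControl(method = "cv", index = cvFolds)

rpartCV  <- train(train_X, train_y, method = "rpart2", tuneLength = 10, trControl = ctrlCV)
rfCV     <- train(train_X, train_y, method = "rf",     trControl = ctrlCV)
gbmCV    <- train(train_X, train_y, method = "gbm",    trControl = ctrlCV, verbose = FALSE)
cubistCV <- train(train_X, train_y, method = "cubist", trControl = ctrlCV)

# resampled RMSE / R-squared computed on identical folds for a fair comparison
summary(resamples(list(rpart = rpartCV, rf = rfCV, gbm = gbmCV, cubist = cubistCV)))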
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31 45.946
## ManufacturingProcess09 43.243
## ManufacturingProcess13 42.342
## ManufacturingProcess04 37.838
## ManufacturingProcess33 36.937
## ManufacturingProcess17 30.631
## BiologicalMaterial03 27.928
## BiologicalMaterial02 26.126
## ManufacturingProcess37 25.225
## ManufacturingProcess28 19.820
## ManufacturingProcess29 18.919
## ManufacturingProcess25 18.919
## ManufacturingProcess27 13.514
## BiologicalMaterial09 13.514
## BiologicalMaterial06 11.712
## ManufacturingProcess30 9.910
## BiologicalMaterial08 9.910
## BiologicalMaterial11 9.910
## BiologicalMaterial12 9.009
The output above lists the 20 predictors with the highest overall importance; let's focus on the top 10. ManufacturingProcess32 has an importance of 100, followed by a sharp drop to 45.95 for ManufacturingProcess31. Of the top 10, only 2 are biological material predictors, so the manufacturing process variables dominate the list.
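A quick tally (a convenience check built on the varImp output above) confirms the split between predictor types in the Cubist top 10:
cubistImp <- varImp(cubistTuned)$importance
top10     <- rownames(cubistImp)[order(cubistImp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))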
For comparison's sake, I have provided the SVM model from Chapter 7 and the PLS model from Chapter 6 below.
svmRTuned <- train(train_X, train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 125, 128, 127, 125, 128, 125, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7893858 0.5054155 0.6470156
## 0.50 0.7240074 0.5453873 0.5959249
## 1.00 0.6622741 0.5886838 0.5427651
## 2.00 0.6263311 0.6334186 0.5091241
## 4.00 0.6139378 0.6515314 0.4969291
## 8.00 0.6245266 0.6412209 0.5079059
## 16.00 0.6237248 0.6421324 0.5080469
## 32.00 0.6237248 0.6421324 0.5080469
## 64.00 0.6237248 0.6421324 0.5080469
## 128.00 0.6237248 0.6421324 0.5080469
## 256.00 0.6237248 0.6421324 0.5080469
## 512.00 0.6237248 0.6421324 0.5080469
## 1024.00 0.6237248 0.6421324 0.5080469
## 2048.00 0.6237248 0.6421324 0.5080469
##
## Tuning parameter 'sigma' was held constant at a value of 0.01440088
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01440088 and C = 4.
svmRPred <- predict(svmRTuned, newdata = test_X)
postResample(pred = svmRPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5987115 0.5605562 0.4612658
varImp(svmRTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 98.03
## BiologicalMaterial06 86.76
## ManufacturingProcess09 84.29
## BiologicalMaterial03 79.41
## ManufacturingProcess17 79.32
## ManufacturingProcess31 76.95
## BiologicalMaterial02 69.88
## BiologicalMaterial12 69.71
## ManufacturingProcess36 67.83
## ManufacturingProcess06 57.40
## ManufacturingProcess11 55.01
## BiologicalMaterial04 54.73
## BiologicalMaterial11 45.83
## ManufacturingProcess29 45.61
## BiologicalMaterial08 44.90
## BiologicalMaterial01 44.41
## ManufacturingProcess33 43.41
## ManufacturingProcess30 41.97
## BiologicalMaterial09 40.99
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
train_X,
train_y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plsTune
## Partial Least Squares
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 125, 126, 126, 127, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8479010 0.4265157 0.6657416
## 2 1.0529488 0.5038503 0.7214306
## 3 0.8474292 0.5771544 0.5984535
## 4 0.9456246 0.5676970 0.6447407
## 5 1.0659868 0.5552396 0.6719471
## 6 1.1568744 0.5409098 0.6957232
## 7 1.2351712 0.5339243 0.7324440
## 8 1.3269965 0.5165428 0.7683998
## 9 1.4055688 0.5197192 0.7652225
## 10 1.5269544 0.5119143 0.7816197
## 11 1.5896791 0.5089670 0.8410855
## 12 1.6910618 0.4993789 0.8852889
## 13 1.8672296 0.4920632 0.9566055
## 14 2.0574667 0.4932648 1.0114094
## 15 2.0859090 0.5021506 1.0113019
## 16 2.1538067 0.5048579 1.0344105
## 17 2.1909744 0.5084537 1.0469477
## 18 2.2070858 0.5063554 1.0582202
## 19 2.2386613 0.5105727 1.0724065
## 20 2.2673578 0.5123984 1.0827180
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plsPred <- predict(plsTune, newdata = test_X)
postResample(pred = plsPred, obs = test_y)
## RMSE Rsquared MAE
## 0.6625784 0.5450552 0.5616085
varImp(plsTune)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 84.32
## ManufacturingProcess13 82.00
## ManufacturingProcess17 78.26
## ManufacturingProcess36 73.23
## ManufacturingProcess11 57.16
## BiologicalMaterial02 56.69
## BiologicalMaterial08 55.83
## BiologicalMaterial06 54.80
## BiologicalMaterial12 53.51
## ManufacturingProcess33 53.10
## ManufacturingProcess06 52.95
## BiologicalMaterial11 52.67
## BiologicalMaterial03 52.19
## BiologicalMaterial01 47.55
## BiologicalMaterial04 46.40
## ManufacturingProcess34 45.04
## ManufacturingProcess12 44.92
## ManufacturingProcess28 43.77
## ManufacturingProcess04 41.58
Through the postResample function, we can compare test set performance across the models presented here: the Cubist model (RMSE 0.587, R-squared 0.571) edges out the SVM model from Chapter 7 (RMSE 0.599, R-squared 0.561), and both outperform the PLS model from Chapter 6 (RMSE 0.663, R-squared 0.545). The top 10 predictors tell a similar story across all three models: ManufacturingProcess32 is ranked first by each of them, and manufacturing process variables dominate, accounting for 8 of the Cubist top 10 and 6 of each of the SVM and PLS top 10.
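For reference, a small helper (written here for convenience; top10_of is not a caret function) lines up the top 10 predictors from each of the three models side by side:
top10_of <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

data.frame(cubist = top10_of(cubistTuned),
           svm    = top10_of(svmRTuned),
           pls    = top10_of(plsTune))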
ggplot(varImp(cubistTuned), top = 10)
ggplot(varImp(svmRTuned), top = 10)
The predictor-importance profile for the SVM model falls off much less steeply than that of the Cubist model.
rpartTree <- as.party(rpartTune$finalModel)
plot(rpartTree)
The plot above shows that the yield is, on average, higher when ManufacturingProcess32 is greater than or equal to 0.192 and below average when it is less than 0.192 (recall that the response was centered and scaled during preprocessing, so negative values correspond to below-average yield). It also shows that the only two biological material predictors appearing in the tree come into play only when ManufacturingProcess32 is less than 0.192.