8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.9388088608
## V2 6.4722961938
## V3 0.7421784299
## V4 7.5816497914
## V5 2.1118763152
## V6 0.1865935610
## V7 0.0957719681
## V8 -0.0005593971
## V9 -0.0300840189
## V10 0.0000201536
Did the random forest model significantly use the uninformative predictors (V6-V10)? No: the importance scores for V6-V10 are close to zero (and in some cases negative), far below those of the informative predictors V1-V5, so the model made little use of them.
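A quick sketch to make that gap explicit, simply sorting the scores already computed above:
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]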
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.943414
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model1b <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model1b, scale = FALSE)
rfImp2
## Overall
## V1 7.25761619
## V2 6.20516602
## V3 0.67427738
## V4 6.94720746
## V5 2.24138202
## V6 0.22981030
## V7 0.10747745
## V8 -0.15202100
## V9 -0.10758150
## V10 0.05539831
## duplicate1 3.20783807
Yes, it did. The importance of V1 is reduced, split between V1 and duplicate1. This is because the trees in the random forest are roughly as likely to pick duplicate1 for a split as V1, since the two are so highly correlated. As a result, the splits that would originally have gone to V1 are "shared" with duplicate1.
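The question also asks what happens when another predictor highly correlated with V1 is added. A minimal sketch (not part of the original run; the extra column is dropped afterwards so the later chunks see the same data):
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model1c <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model1c, scale = FALSE)  # V1's importance should be diluted even further
simulated$duplicate2 <- NULL    # restore the data for the chunks below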
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
library(dplyr)
cforestModel <- cforest(y ~ ., data=simulated)
# Unconditional importance measure
varimp(cforestModel) %>% sort(decreasing = T)
## V4 V1 V2 duplicate1 V5
## 7.9713194601 6.8956566612 6.1704953666 2.2610870168 1.9153771711
## V7 V3 V6 V10 V8
## 0.0390985854 0.0368739498 0.0351874549 -0.0006565073 -0.0209013677
## V9
## -0.0286556191
varimp(cforestModel, conditional=T) %>% sort(decreasing = T)
## V4 V2 V1 V5 duplicate1 V7
## 6.196881150 5.011240164 3.672758137 1.056173178 0.699528754 0.031128777
## V3 V6 V8 V9 V10
## 0.027372701 0.018796813 -0.002783976 -0.002969819 -0.024768453
The importance calculated with conditional set to TRUE is somewhat different from the traditional measure. The uninformative predictors (V6-V10) are still rated as low importance. The importance of the other predictors is reduced, with V1 and its highly correlated copy duplicate1 reduced the most. Both measures rate V4 as the most important predictor, followed by V2.
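To read the two measures side by side, a small sketch that lines them up (it recomputes varimp, so the exact numbers will differ slightly from the output above):
impTrad <- varimp(cforestModel)
impCond <- varimp(cforestModel, conditional = TRUE)
round(data.frame(traditional = impTrad, conditional = impCond[names(impTrad)]), 3)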
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur? For the gradient boosted trees model, I use the gbm function with its default parameters. The summary function gives the relative influence of each predictor on the model.
library(gbm)
gbmModel <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbmModel)
## var rel.inf
## V4 V4 29.8190123
## V2 V2 27.4949042
## V1 V1 23.3681006
## V5 V5 10.3158194
## V3 V3 7.2979335
## duplicate1 duplicate1 0.9060600
## V6 V6 0.5418961
## V9 V9 0.1395007
## V8 V8 0.1167733
## V7 V7 0.0000000
## V10 V10 0.0000000
The same pattern occurs. The uninformative predictors V6-V10 are still rated very low. V4 is still the top-rated predictor, followed by V2. The correlated predictors V1 and duplicate1, though, show a larger gap in importance: the GBM model seems to detect that V1 is much more important than duplicate1.
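As a sanity check, a sketch refitting gbm with more trees and a smaller learning rate (arbitrary illustrative values) to see whether the ranking holds:
gbmModel2 <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 1000, shrinkage = 0.01)
summary(gbmModel2, plotit = FALSE)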
library(Cubist)
cubistModel <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistModel)
## Overall
## V1 70.5
## V3 44.5
## V2 58.0
## V4 49.0
## duplicate1 5.5
## V5 31.5
## V6 10.5
## V8 0.5
## V7 0.0
## V9 0.0
## V10 0.0
Again, the uninformative predictors V6-V10 are rated low in importance. The top-rated predictor is not V4 this time but V1, which differs from the random forest and GBM models. Like the GBM model, the Cubist model rates V1 much higher than duplicate1.
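If I read the Cubist object correctly, its usage table reports how often each predictor appears in rule conditions versus in the linear models, which is what varImp summarizes; a one-line sketch:
cubistModel$usage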
8.2 Use a simulation to show tree bias with different granularities.
Tree models suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.
Below, I simulate 4 variables using the sample function with replacement. All 4 variables take values between 0 and 1, but at different granularities: x1 in steps of 1/10,000, x2 in steps of 1/1,000, x3 in steps of 1/100, and x4 in steps of 1/10.
The target variable y is created by adding x1 and x4, plus a random error term. Note that x2 and x3 are not used to generate the target.
set.seed(1)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)
y <- x1 + x4 + rnorm(200)
df <- data.frame(x1, x2, x3, x4, y)
str(df)
## 'data.frame': 200 obs. of 5 variables:
## $ x1: num 0.266 0.372 0.573 0.908 0.202 ...
## $ x2: num 0.267 0.218 0.517 0.269 0.181 0.519 0.563 0.129 0.256 0.718 ...
## $ x3: num 0.66 0.18 0.96 0.9 0.95 0.73 0.37 0.78 0.01 0.94 ...
## $ x4: num 0.8 1 0.1 0.8 1 1 0.3 0.4 1 0.1 ...
## $ y : num 2.1399 3.2678 0.0699 1.3173 0.7855 ...
Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.
library(rpart)
rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)
## Overall
## x1 1.0385735
## x2 0.8086365
## x3 0.7884261
## x4 0.6570896
As you can see, the tree uses x1 the most for splitting and x4 the least. Interestingly, the tree model also splits on x2 and x3, even though they are not used to generate the target; both have more distinct values than x4. This demonstrates the selection bias in the tree model: it favors predictors with more distinct values.
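For comparison, a sketch fitting a conditional inference forest (using the party package loaded earlier), whose test-based split selection is intended to reduce this bias:
set.seed(1)
cfModel <- cforest(y ~ ., data = df)
sort(varimp(cfModel), decreasing = TRUE)  # x1 and x4 should now stand out more clearly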
8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Gradient boosted trees make predictions by adding up the predictions of a sequence of trees. These trees are dependent on each other: each tree is built on the errors of the previous trees, so their structures are correlated. Suppose one tree splits on a particular feature frequently; the next tree, built on its errors, will likely split on that feature frequently as well. As a result, many of the same predictors are selected across the trees, increasing their contribution to the importance metric.
The learning rate controls the fraction of each tree's prediction that is added to the ensemble. A higher learning rate means a larger fraction of each tree's prediction is added to the final prediction, which effectively strengthens the dependence/correlation structure: more of the same predictors are selected across the trees. This is why the right-hand plot concentrates its importance on just the first few predictors and looks very steep.
The bagging fraction is the fraction of the data used to build each tree. With a small bagging fraction, say 0.1, each iteration randomly samples only 10% of the full data, so each tree may be built on a very different data set and the trees split very differently from one another. In contrast, with a large bagging fraction, say 0.9, each tree essentially sees the same data set, so the trees tend to split similarly. A larger bagging fraction therefore increases the dependence/correlation structure among the boosted trees, which is why the right-hand plot focuses its importance on just the first few predictors.
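These two extremes can be reproduced directly. The sketch below is not the original figure code: it assumes the solubility data from AppliedPredictiveModeling and relies on caret passing bag.fraction through to gbm, fitting each setting once without resampling.
library(AppliedPredictiveModeling)
data(solubility)
gridLow <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
gridHigh <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.9, n.minobsinnode=10)
gbmLow <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = gridLow, bag.fraction = 0.1, trControl = trainControl(method = 'none'), verbose = FALSE)
gbmHigh <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = gridHigh, bag.fraction = 0.9, trControl = trainControl(method = 'none'), verbose = FALSE)
plot(varImp(gbmLow))   # importance spread across many predictors
plot(varImp(gbmHigh))  # importance concentrated on the first few predictors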
(b) Which model do you think would be more predictive of other samples?
The learning rate and bagging fraction are important parameters for controlling overfitting in gradient boosting and require tuning. Smaller values of both generally lead to better generalization to unseen samples, so if I had to guess, the model with a 0.1 learning rate and 0.1 bagging fraction would be more predictive of new samples. However, since this involves a bias-variance trade-off, I can only confirm it with cross-validation or a test set.
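One way to check, reusing the gbmLow and gbmHigh fits from the sketch above together with the solubility test set:
postResample(pred = predict(gbmLow, solTestXtrans), obs = solTestY)
postResample(pred = predict(gbmHigh, solTestXtrans), obs = solTestY)
# Expectation (not a guarantee): the 0.1/0.1 model gives the better test RMSE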
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? Below, I train two GBM models with all parameters the same except for the interaction depth: Model 1 uses an interaction depth of 1, while Model 2 uses an interaction depth of 10. The predictor importance of both models is plotted:
library(AppliedPredictiveModeling)
data(solubility)
grid1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid1, verbose = FALSE)
grid2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1, n.minobsinnode=10)
gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid2, verbose = FALSE)
plot(varImp(gbm1))
plot(varImp(gbm2))
As you can see, increasing the interaction depth spreads the importance out more: each tree can grow deeper, giving more features a chance to be involved in the splits. Therefore, increasing the depth reduces the slope of the importance plot.
8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: The missing values in the ChemicalManufacturingProcess data are imputed using the bagImpute method. The data are then split into training and test sets, with 20% assigned to the test set.
data(ChemicalManufacturingProcess)
# Impute missing values using `bagImpute`
(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
Below, 4 regression tree models are trained: a single tree, random forest, gradient boosting, and Cubist. Bootstrapped resampling with 25 repetitions is used to determine the final models, with RMSE as the metric. Centering and scaling are not needed for tree models.
Single tree, tuned over the complexity parameter:
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])
# Train/test split, 20% testing
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]
set.seed(1)
rpartModel <- train(x = X.train,
y = y.train,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth=2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rpartModel
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01143214 1.406955 0.4650570 1.123290
## 0.01620406 1.406955 0.4650570 1.123290
## 0.02156477 1.406955 0.4650570 1.123290
## 0.02456980 1.406955 0.4650570 1.123290
## 0.02550828 1.406955 0.4650570 1.123290
## 0.02805096 1.406955 0.4650570 1.123290
## 0.04854745 1.413676 0.4604097 1.133811
## 0.06127500 1.407444 0.4637199 1.130400
## 0.07019415 1.406903 0.4645587 1.130002
## 0.47674176 1.645284 0.4289123 1.352884
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07019415.
set.seed(1)
rfModel <- train(x = X.train,
y = y.train,
method = 'rf',
tuneLength = 10)
rfModel
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.334763 0.5755652 1.0622661
## 8 1.238707 0.6087402 0.9664760
## 14 1.215154 0.6117694 0.9416574
## 20 1.204967 0.6138399 0.9313799
## 26 1.201247 0.6111992 0.9252312
## 32 1.196467 0.6104437 0.9179918
## 38 1.201453 0.6051999 0.9214785
## 44 1.203379 0.6009496 0.9203457
## 50 1.207303 0.5969586 0.9226280
## 57 1.214638 0.5909974 0.9283741
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.
Gradient boosting, tuned over the number of trees, interaction depth, shrinkage rate, and minimum terminal node size:
set.seed(1)
grid <- expand.grid(n.trees=c(50, 100, 150, 200),
interaction.depth=c(1, 5, 10, 15),
shrinkage=c(0.01, 0.1, 0.5),
n.minobsinnode=c(5, 10, 15))
gbmModel <- train(x = X.train,
y = y.train,
method = 'gbm',
tuneGrid = grid,
verbose = FALSE)
plot(gbmModel)
gbmModel$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 64 200 5 0.1 5
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 57 predictors of which 56 had non-zero influence.
set.seed(1)
cubistModel <- train(x = X.train,
y = y.train,
method = 'cubist')
cubistModel
## Cubist
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.828754 0.3291015 1.3056173
## 1 5 1.803375 0.3424485 1.2794785
## 1 9 1.810739 0.3396976 1.2895186
## 10 0 1.254953 0.5642174 0.9604272
## 10 5 1.236735 0.5768528 0.9416704
## 10 9 1.246940 0.5706782 0.9508531
## 20 0 1.187407 0.6082339 0.9136121
## 20 5 1.166623 0.6210345 0.8940021
## 20 9 1.176697 0.6154686 0.9045637
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The resampling performance of all four models is summarized below:
resamp <- resamples(list(SingleTree=rpartModel, RandomForest=rfModel, GradientBoosting=gbmModel, Cubist=cubistModel))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: SingleTree, RandomForest, GradientBoosting, Cubist
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.9362303 1.0587955 1.1117806 1.1300022 1.2306692 1.280239
## RandomForest 0.6764093 0.8772126 0.9113982 0.9179918 0.9761576 1.104029
## GradientBoosting 0.7355846 0.8348256 0.8950613 0.8947279 0.9590113 1.016490
## Cubist 0.7396698 0.8388576 0.9018535 0.8940021 0.9394838 1.018107
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.1430661 1.312677 1.409903 1.406903 1.483944 1.604770 0
## RandomForest 0.8693020 1.112714 1.194372 1.196467 1.288142 1.450791 0
## GradientBoosting 0.9347463 1.072781 1.183065 1.166904 1.247199 1.340630 0
## Cubist 0.9744006 1.070400 1.171816 1.166623 1.231114 1.420784 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.3215285 0.3908312 0.4696616 0.4645587 0.5338047 0.6071253
## RandomForest 0.4629570 0.5665089 0.5905094 0.6104437 0.6568450 0.7820165
## GradientBoosting 0.5136675 0.5634120 0.6010607 0.6180209 0.6725748 0.7152428
## Cubist 0.4766788 0.5686033 0.6362580 0.6210345 0.6676087 0.7404939
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
Looking at the mean RMSE, the gradient boosting and Cubist models are essentially tied (about 1.167), both ahead of the random forest and well ahead of the single tree; I take the gradient boosting model as the optimal resampling model.
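A boxplot of the resampling distributions makes the comparison easier to see (sketch):
bwplot(resamp, metric = 'RMSE')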
The test set performance is calculated below:
testPerf <- function(models, testData, testTarget) {
  # Compute RMSE, R-squared, and MAE on the test set for each fitted model
  method <- c()
  res <- data.frame()
  for (model in models) {
    method <- c(method, model$method)
    pred <- predict(model, newdata = testData)
    res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
  }
  row.names(res) <- method
  return(res)
}
models <- list(rpartModel, rfModel, gbmModel, cubistModel)
performance <- testPerf(models, X.test, y.test)
performance
## RMSE Rsquared MAE
## rpart 1.8616397 0.1224457 1.4768675
## rf 1.3568448 0.4812336 1.0309896
## gbm 1.3270583 0.5140878 1.0287075
## cubist 0.9130396 0.7639094 0.7021449
The test set performance suggests that the Cubist model is the best. However, I would trust the resampled metrics more than the test set metrics here, since the data set is very small and the test set performance may not be a close approximation of the true model performance. Therefore, I would choose the gradient boosting model.
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The predictor importance can be found using the varImp function:
varImp(gbmModel)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 18.251
## BiologicalMaterial12 13.408
## BiologicalMaterial06 12.755
## ManufacturingProcess13 12.566
## ManufacturingProcess06 11.476
## BiologicalMaterial11 10.530
## BiologicalMaterial03 9.245
## ManufacturingProcess21 8.646
## BiologicalMaterial05 7.782
## BiologicalMaterial09 6.548
## ManufacturingProcess01 6.485
## ManufacturingProcess43 5.681
## ManufacturingProcess09 5.162
## ManufacturingProcess20 4.720
## ManufacturingProcess04 4.362
## ManufacturingProcess36 3.797
## ManufacturingProcess45 3.631
## ManufacturingProcess31 3.487
## ManufacturingProcess23 3.252
Of the top 20 important variables, 14 are ManufacturingProcess predictors and 6 are BiologicalMaterial predictors; the process variables dominate the list.
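The same counts can be produced programmatically (a small sketch):
imp <- varImp(gbmModel)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl('^Manufacturing', top20), 'Process', 'Biological'))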
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartModel$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 484.5825 40.18903
## 2) ManufacturingProcess32< 159.5 85 132.7408 39.13376 *
## 3) ManufacturingProcess32>=159.5 59 120.8210 41.70932 *
The optimal single tree model is plotted below. The model makes a single split on ManufacturingProcess32 at the value 159.5: if the value is less than 159.5, the predicted yield is 39.13376; otherwise the predicted yield is 41.70932.
plot(rpartModel$finalModel)
text(rpartModel$finalModel)
This view provides two additional pieces of knowledge about the relationship: higher values of ManufacturingProcess32 lead to higher yield (and vice versa), and the assigned values are the means of the two groups split by ManufacturingProcess32 at 159.5, as demonstrated below:
trainData <- data.frame(X.train, y.train)
less <- subset(trainData, ManufacturingProcess32 < 159.5)
more <- subset(trainData, ManufacturingProcess32 >= 159.5)
mean(less$y.train)
## [1] 39.13376
mean(more$y.train)
## [1] 41.70932
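Since the question asks about the distribution of yield in the terminal nodes, a small sketch of that distribution (the 159.5 cutpoint comes from the tree above):
node <- ifelse(X.train$ManufacturingProcess32 < 159.5, 'MP32 < 159.5', 'MP32 >= 159.5')
boxplot(y.train ~ node, xlab = 'Terminal node', ylab = 'Yield')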