Recreate the simulated data from Exercise 7.2:
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The random forest correctly identified that only the first five predictors (V1–V5) are relevant to the response; the sharp drop-off in importance after those five makes this clear.
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
varImpPlot(model1)
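For a numeric complement to the plot, the stored scores can also be printed; something like the following (assuming varImp returns its usual single Overall column) sorts them from largest to smallest:
rfImp1[order(-rfImp1$Overall), , drop = FALSE]  # permutation importance; V6-V10 should sit near zero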
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
V1 and duplicate1 are both still identified as important (both are listed ahead of any of the noise variables). However, the importance appears to have been split between them, and V1 is no longer the single most important predictor. This illustrates why relying on these plots alone can be misleading: they may not clearly indicate which predictors are most directly related to the response.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
varImpPlot(model1)
Yes, the conditional inference forest (cforest, below) ranks the same predictors in roughly the same order of importance as the traditional random forest. This provides additional evidence about which predictors are most critical for predicting the response.
library(party)
model1 <- cforest(y ~ ., data = simulated)
rfImp1 <- varImp(model1)
rfImp1
## Overall
## V1 4.6171158805
## V2 6.0579730772
## V3 0.0003116115
## V4 7.6223892727
## V5 1.7161194047
## V6 -0.0289427183
## V7 0.0465374951
## V8 -0.0380965511
## V9 0.0046062409
## V10 -0.0310326410
## duplicate1 5.0941897280
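party::varimp also has a conditional argument that adjusts the permutation importance for correlations among predictors; a quick sketch of that comparison (not run here, and noticeably slower than the default) would be:
party::varimp(model1, conditional = TRUE)  # conditional permutation importance, accounting for the V1/duplicate1 correlation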
The boosted tree model produces radically different results. Although it still identifies relevant predictors, it ranks them quite differently from the previous models. Most importantly, the previously weakest predictor (V3) is assigned zero importance here (as is V5), which is odd since we know both actually contribute to the response. A likely explanation is that gbm.fit was left at its defaults of only 100 trees with a shrinkage of 0.001 (visible in the deviance trace below), leaving the boosted model undertrained.
library(gbm)
## Loaded gbm 2.1.5
gbmTune <- gbm.fit(simulated %>% select(-y), simulated$y, distribution = 'gaussian')
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 24.3727 nan 0.0010 0.0110
## 2 24.3586 nan 0.0010 0.0106
## 3 24.3468 nan 0.0010 0.0113
## 4 24.3331 nan 0.0010 0.0087
## 5 24.3218 nan 0.0010 0.0092
## 6 24.3071 nan 0.0010 0.0103
## 7 24.2943 nan 0.0010 0.0105
## 8 24.2830 nan 0.0010 0.0093
## 9 24.2709 nan 0.0010 0.0083
## 10 24.2602 nan 0.0010 0.0080
## 20 24.1329 nan 0.0010 0.0100
## 40 23.8941 nan 0.0010 0.0097
## 60 23.6501 nan 0.0010 0.0105
## 80 23.4133 nan 0.0010 0.0100
## 100 23.1724 nan 0.0010 0.0097
summary(gbmTune)
## var rel.inf
## V1 V1 40.88644
## V2 V2 21.56516
## duplicate1 duplicate1 20.10280
## V4 V4 17.44560
## V3 V3 0.00000
## V5 V5 0.00000
## V6 V6 0.00000
## V7 V7 0.00000
## V8 V8 0.00000
## V9 V9 0.00000
## V10 V10 0.00000
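If the undertraining explanation above is right, giving boosting more iterations and a larger shrinkage should let the weaker signals enter the model. A rough sketch of that check (the settings below are illustrative guesses, not tuned values):
gbmTune2 <- gbm.fit(simulated %>% select(-y), simulated$y,
                    distribution = "gaussian",
                    n.trees = 2000, shrinkage = 0.01,   # far more learning capacity than the defaults
                    interaction.depth = 3, verbose = FALSE)
summary(gbmTune2, plotit = FALSE)                       # relative influence of every predictor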
Use a simulation to show tree bias with different granularities.
I created three variables: a is predictive of the response y but has low granularity (it takes only two values), while b is not predictive of y but has high granularity (it is continuous). e is pure noise added to the response. I then build a single tree on these data and check whether the model chooses to split on the more granular, but meaningless, predictor.
a <- rep(0:1, each=100)          # low-granularity signal: only two distinct values
b <- rnorm(200, mean=0, sd=2)    # high-granularity noise: continuous, unrelated to y
e <- rnorm(200, mean=0, sd=5)    # noise added to the response
y <- a + e
data <- tibble(y=y, a=a, b=b)
It was not difficult to get the tree to split on the meaningless predictor, although I did need to play around with the standard deviations of both b and e. When those standard deviations were small enough, the model correctly split on a; once the data were noisy enough, it selected the more granular predictor nearly every time. (A sketch quantifying "nearly every time" follows the tree plot below.)
library(partykit)
library(rpart)
r <- rpart(y ~ ., data=data, maxdepth=1)
plot(as.party(r))
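To put a number on "nearly every time", the whole simulation can be repeated and the root split tallied; a sketch along these lines (the replication count and seed are arbitrary):
first_split_var <- function() {
  a <- rep(0:1, each = 100)               # low-granularity signal
  b <- rnorm(200, mean = 0, sd = 2)       # high-granularity noise
  y <- a + rnorm(200, mean = 0, sd = 5)   # noisy response
  fit <- rpart(y ~ a + b, data = data.frame(y, a, b), maxdepth = 1)
  as.character(fit$frame$var[1])          # variable used at the root ("<leaf>" if no split)
}
set.seed(123)
table(replicate(500, first_split_var()))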
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
In stochastic gradient boosting, predictor importance is based on the reduction in squared error. The two hyperparameters being varied are the bagging fraction (the proportion of the training set used to build each tree) and the shrinkage (the learning rate, which controls how much each tree contributes to the final model). When both values are low, it is natural for importance to be spread over many more predictors: each bagged sample contains quite different data, producing trees that are much less correlated with one another, so the ensemble learns important facets from many different predictors. Conversely, when the bagging fraction and shrinkage are high, the trees are likely to be highly correlated and therefore tend to rely on the same few predictors. (A sketch of fitting the two extreme models on the solubility data appears after the discussion of interaction depth below.)
I believe the first (left-hand) model is more robust. When creating an ensemble, it is essential that the component models make different kinds of mistakes. If the models are nearly identical, as with the right-hand model in this example, then most of them contribute no new information to the prediction; there may be useful signal in predictors the right-hand model never uses.
Increasing the interaction depth would bring the two importance profiles closer into alignment. With more depth, the right-hand model would give additional importance to the less important predictors, since they would finally get opportunities to split the data, opportunities they were not getting before because the same few predictors were being selected over and over. Conversely, additional depth in the left-hand model increases the likelihood that the most predictive variables are used somewhere in each tree, increasing their importance.
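For reference, the two extreme fits behind Fig. 8.24 could be approximated as follows; the solubility objects come from AppliedPredictiveModeling, while the tree depth and number of trees are my own guesses rather than the book's exact settings:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)   # provides solTrainXtrans and solTrainY

fit_gbm <- function(bag, shrink) {
  gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
          n.trees = 1000, interaction.depth = 7,        # assumed settings
          bag.fraction = bag, shrinkage = shrink, verbose = FALSE)
}

set.seed(100)
low  <- fit_gbm(0.1, 0.1)   # importance spread across many predictors
high <- fit_gbm(0.9, 0.9)   # importance concentrated on a few predictors
head(summary(low,  plotit = FALSE), 10)
head(summary(high, plotit = FALSE), 10)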
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
As directed, I copied the previously used code to create a training and testing split.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Chemical.Y <- ChemicalManufacturingProcess$Yield
Chemical.X <- ChemicalManufacturingProcess %>%
  select(-Yield)
set.seed(123)
imputed.data <- mice::mice(Chemical.X, m=5, maxit=10, seed=123, printFlag=FALSE)
## Warning: Number of logged events: 1350
completed.data <- cbind(Chemical.Y, mice::complete(imputed.data, 1))
set.seed(123)
part <- caret::createDataPartition(completed.data$Chemical.Y, p=0.8, list=FALSE)
training <- completed.data %>%
  filter(row_number() %in% part)
testing <- completed.data %>%
  filter(!row_number() %in% part)
After testing all the models, I settled on a gradient boosted model. Exploring the hyperparameter space gave me the best parameters to use, and the resulting predictions are the best of any of the methods we have explored over the past several homeworks. (A sketch of one of the comparison fits appears after the test-set RMSE below.)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(3))
gbmTune <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                 method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
gbmTune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 1000 7 0.01 3
Metrics::rmse(testing$Chemical.Y, predict(gbmTune, testing %>% select(-Chemical.Y)))
## [1] 0.9564706
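One of the comparison fits alluded to above might look roughly like the following random forest; the tuning grid here is only a sketch, not the exact settings originally tried:
set.seed(123)
rfTune <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                method = "rf", ntree = 500,
                tuneGrid = expand.grid(mtry = c(5, 15, 25, 35)),
                trControl = trainControl(method = "cv"))
Metrics::rmse(testing$Chemical.Y, predict(rfTune, testing %>% select(-Chemical.Y)))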
The full importance ranking is displayed below. The top ten is very similar, though not identical, to the previously tested methods: while those methods weighted the manufacturing process variables more heavily, this model spreads importance more evenly between process and biological variables. All models agree that ManufacturingProcess32 is the most important. (A short tally of biological versus process variables in the top ten follows the output.)
summary(gbmTune)
## var rel.inf
## ManufacturingProcess32 ManufacturingProcess32 24.717991552
## ManufacturingProcess13 ManufacturingProcess13 6.906255989
## ManufacturingProcess09 ManufacturingProcess09 5.018347288
## BiologicalMaterial03 BiologicalMaterial03 4.714757932
## BiologicalMaterial12 BiologicalMaterial12 4.590697073
## ManufacturingProcess17 ManufacturingProcess17 4.553506444
## BiologicalMaterial09 BiologicalMaterial09 3.445449723
## ManufacturingProcess06 ManufacturingProcess06 3.217161275
## BiologicalMaterial11 BiologicalMaterial11 3.197827103
## BiologicalMaterial05 BiologicalMaterial05 2.268780338
## ManufacturingProcess31 ManufacturingProcess31 2.183720557
## BiologicalMaterial06 BiologicalMaterial06 1.735928910
## BiologicalMaterial02 BiologicalMaterial02 1.607461207
## ManufacturingProcess28 ManufacturingProcess28 1.557782902
## ManufacturingProcess15 ManufacturingProcess15 1.437704497
## ManufacturingProcess11 ManufacturingProcess11 1.346729182
## ManufacturingProcess34 ManufacturingProcess34 1.254622026
## ManufacturingProcess29 ManufacturingProcess29 1.241296671
## ManufacturingProcess37 ManufacturingProcess37 1.166343656
## ManufacturingProcess05 ManufacturingProcess05 1.162727156
## ManufacturingProcess21 ManufacturingProcess21 1.158716356
## ManufacturingProcess04 ManufacturingProcess04 1.157864204
## ManufacturingProcess20 ManufacturingProcess20 1.065752381
## ManufacturingProcess30 ManufacturingProcess30 1.052053868
## BiologicalMaterial01 BiologicalMaterial01 1.037805386
## ManufacturingProcess01 ManufacturingProcess01 1.020311387
## ManufacturingProcess39 ManufacturingProcess39 1.017458711
## BiologicalMaterial08 BiologicalMaterial08 0.971268482
## ManufacturingProcess43 ManufacturingProcess43 0.961235266
## BiologicalMaterial04 BiologicalMaterial04 0.943858845
## ManufacturingProcess18 ManufacturingProcess18 0.871847148
## ManufacturingProcess14 ManufacturingProcess14 0.857693718
## BiologicalMaterial10 BiologicalMaterial10 0.848342810
## ManufacturingProcess33 ManufacturingProcess33 0.778537642
## ManufacturingProcess24 ManufacturingProcess24 0.767948048
## ManufacturingProcess19 ManufacturingProcess19 0.745493681
## ManufacturingProcess22 ManufacturingProcess22 0.742638242
## ManufacturingProcess27 ManufacturingProcess27 0.734728821
## ManufacturingProcess10 ManufacturingProcess10 0.682639019
## ManufacturingProcess25 ManufacturingProcess25 0.670080734
## ManufacturingProcess02 ManufacturingProcess02 0.659994397
## ManufacturingProcess45 ManufacturingProcess45 0.584973196
## ManufacturingProcess35 ManufacturingProcess35 0.547901351
## ManufacturingProcess16 ManufacturingProcess16 0.514224281
## ManufacturingProcess42 ManufacturingProcess42 0.463439222
## ManufacturingProcess23 ManufacturingProcess23 0.422899176
## ManufacturingProcess44 ManufacturingProcess44 0.415134943
## ManufacturingProcess26 ManufacturingProcess26 0.293688857
## ManufacturingProcess38 ManufacturingProcess38 0.203918351
## ManufacturingProcess36 ManufacturingProcess36 0.167220095
## ManufacturingProcess03 ManufacturingProcess03 0.138542162
## ManufacturingProcess07 ManufacturingProcess07 0.092393415
## ManufacturingProcess08 ManufacturingProcess08 0.050644767
## ManufacturingProcess40 ManufacturingProcess40 0.015448403
## ManufacturingProcess12 ManufacturingProcess12 0.015117883
## ManufacturingProcess41 ManufacturingProcess41 0.003093273
## BiologicalMaterial07 BiologicalMaterial07 0.000000000
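A quick tally backs up the "more evenly spread" observation by counting biological versus process variables among the ten most important predictors (assuming the usual summary.gbm output columns):
top10 <- head(summary(gbmTune$finalModel, plotit = FALSE)$var, 10)
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))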
Searching over the hyperparameter space shows that a maximum depth of 2 is ideal for a single tree. The tree is refit at that depth and plotted below. Ultimately this view does provide some additional insight: after the initial split (on the most important predictor, as expected), the second split differs between the two branches, which raises the importance of BiologicalMaterial11 much higher than anticipated.
single.tree <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                     method = 'rpart2', tuneLength = 40,
                     trControl = trainControl(method = 'cv'))
## note: only 11 possible values of the max tree depth from the initial fit.
## Truncating the grid to 11 .
single.tree
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 128, 131, 131, 130, 130, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.429123 0.4535543 1.149425
## 2 1.384670 0.5028967 1.111596
## 3 1.414418 0.4828775 1.101937
## 4 1.460226 0.4468509 1.133803
## 5 1.450373 0.4551442 1.124306
## 6 1.443212 0.4694187 1.106913
## 7 1.456020 0.4672857 1.110311
## 8 1.437315 0.4801473 1.087532
## 9 1.426246 0.4894606 1.071656
## 10 1.432154 0.4850517 1.073811
## 11 1.436393 0.4822823 1.083132
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 2.
library(partykit)
library(rpart)
r <- rpart(Chemical.Y ~ ., data=training, maxdepth=2)
plot(as.party(r))
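As a numeric complement to the plot, the fitted tree's variable importances (which fold in surrogate splits) can be inspected directly:
r$variable.importance   # rpart's importance scores, including surrogate-split contributions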