Questions

Question 8.1

Recreate the simulated data from Exercise 7.2:

  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The random forest identified that only the first five predictors were relevant to the response variable; the sharp drop-off in importance after those five predictors makes this clear.

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
varImpPlot(model1)
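
The unscaled importance scores behind the plot can also be printed to show the drop-off numerically; a small addition (the sorting step is mine, not part of the original code):

# Print the unscaled importance scores in decreasing order
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]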

  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

V1 and duplicate1 are both still identified as important (both are listed ahead of the noise variables), but the importance has been split between them, and V1 is no longer the single most important predictor. This illustrates why relying on these plots alone can be misleading: they may not indicate which predictors are most directly related to the response. Adding a second predictor correlated with V1 should dilute V1's importance even further; a sketch of that check follows the code below.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
varImpPlot(model1)
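
The second part of the question, adding another predictor that is also highly correlated with V1, can be checked the same way; a sketch (done on a copy of the data so the later fits are unaffected, and not run here):

# Add a second predictor correlated with V1 on a copy of the data
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model2 <- randomForest(y ~ ., data = sim2,
                       importance = TRUE, ntree = 1000)
varImp(model2, scale = FALSE)

With three highly correlated copies of the same signal, I would expect the importance of V1 to be diluted further still.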

  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Yes, this model ranks roughly the same predictors in roughly the same order of importance, which provides additional evidence about which predictors are most useful for predicting the response.

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
model1 <- cforest(y ~ ., data = simulated)
rfImp1 <- varImp(model1)
rfImp1
##                  Overall
## V1          4.6171158805
## V2          6.0579730772
## V3          0.0003116115
## V4          7.6223892727
## V5          1.7161194047
## V6         -0.0289427183
## V7          0.0465374951
## V8         -0.0380965511
## V9          0.0046062409
## V10        -0.0310326410
## duplicate1  5.0941897280
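
The conditional measure described in Strobl et al. (2007) can be requested directly from party::varimp by toggling the conditional argument; a sketch, not run here (conditional importance is slower and typically down-weights correlated predictors such as duplicate1):

# Traditional vs. conditional permutation importance for the cforest model
party::varimp(model1)                      # traditional measure
party::varimp(model1, conditional = TRUE)  # conditional measure (Strobl et al., 2007)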

  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The boosted tree model produces radically different results. Although it still identifies the relevant predictors, it ranks them at different levels of importance than the previous models. Most importantly, the previously weakest informative predictor (V3) is assigned no importance at all here, which is odd because we know that predictor is in fact related to the response. A Cubist fit is sketched after the boosting output below.

library(gbm)
## Loaded gbm 2.1.5
gbmTune <- gbm.fit(simulated %>% select(-y), simulated$y, distribution = 'gaussian')
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1       24.3727             nan     0.0010    0.0110
##      2       24.3586             nan     0.0010    0.0106
##      3       24.3468             nan     0.0010    0.0113
##      4       24.3331             nan     0.0010    0.0087
##      5       24.3218             nan     0.0010    0.0092
##      6       24.3071             nan     0.0010    0.0103
##      7       24.2943             nan     0.0010    0.0105
##      8       24.2830             nan     0.0010    0.0093
##      9       24.2709             nan     0.0010    0.0083
##     10       24.2602             nan     0.0010    0.0080
##     20       24.1329             nan     0.0010    0.0100
##     40       23.8941             nan     0.0010    0.0097
##     60       23.6501             nan     0.0010    0.0105
##     80       23.4133             nan     0.0010    0.0100
##    100       23.1724             nan     0.0010    0.0097
summary(gbmTune)

##                   var  rel.inf
## V1                 V1 40.88644
## V2                 V2 21.56516
## duplicate1 duplicate1 20.10280
## V4                 V4 17.44560
## V3                 V3  0.00000
## V5                 V5  0.00000
## V6                 V6  0.00000
## V7                 V7  0.00000
## V8                 V8  0.00000
## V9                 V9  0.00000
## V10               V10  0.00000
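
A Cubist model can be checked the same way; a minimal sketch (assuming the Cubist package is available; committees = 10 is an arbitrary choice), not run here:

library(Cubist)
# Fit a committee Cubist model and inspect its predictor importance
cubistMod <- cubist(x = simulated %>% select(-y), y = simulated$y, committees = 10)
varImp(cubistMod)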

Question 8.2

Use a simulation to show tree bias with different granularities.

I created three variables: a is predictive of the response y but has low granularity (it takes only two values), b has high granularity but is unrelated to y, and e is noise added to the response. I then build a single-split tree on the data and check whether the model splits on the more granular, but meaningless, predictor.

a <- rep(0:1, each=100)        # low granularity: only two distinct values, predictive of y
b <- rnorm(200, mean=0, sd=2)  # high granularity: continuous, unrelated to y
e <- rnorm(200, mean=0, sd=5)  # noise added to the response
y <- a + e

data <- tibble(y=y, a=a, b=b)

It was not difficult to get the tree to split on the meaningless predictor, although I needed to experiment with the standard deviations of both b and e. When the noise was small enough, the model correctly chose to split on a; once the data were noisy enough, the model selected the more granular predictor nearly every time. A sketch quantifying this follows the tree plot below.

library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal, varimp
library(rpart)
r <- rpart(y ~ ., data=data, maxdepth=1)
plot(as.party(r))
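
To quantify "nearly every time", the simulation can be repeated and the first splitting variable tallied; a sketch of that check (the loop and seed are mine, not part of the original code):

# Repeat the simulation and tally which variable rpart splits on first
set.seed(1)
first.split <- replicate(500, {
  a <- rep(0:1, each = 100)
  b <- rnorm(200, mean = 0, sd = 2)
  y <- a + rnorm(200, mean = 0, sd = 5)
  fit <- rpart(y ~ ., data = data.frame(y = y, a = a, b = b), maxdepth = 1)
  as.character(fit$frame$var[1])   # splitting variable, or "<leaf>" if no split
})
table(first.split)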

Question 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

In stochastic gradient boosting, predictor importance is based on the reduction in squared error. The two hyperparameters in play are the bagging fraction (the proportion of the training data used to build each tree) and shrinkage (the learning rate, i.e. how much each tree contributes to the ensemble). When these values are low, importance is naturally spread over more predictors: each tree sees a very different random subset of the data, so the trees are less correlated with one another and the model picks up useful structure from many different predictors. Conversely, when the bagging fraction and shrinkage are high, the trees are highly correlated and tend to rely on the same few predictors.
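
A sketch of how the two extremes could be reproduced on the solubility data (using solTrainXtrans and solTrainY from the AppliedPredictiveModeling package, as in the text; n.trees = 100 is held fixed here just for illustration):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand model: bagging fraction and shrinkage both 0.1
gbmLeft <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand model: both set to 0.9
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLeft, plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)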

  2. Which model do you think would be more predictive of other samples?

I believe the first (left-hand) model is more robust. When building an ensemble it is essential that the individual models make different kinds of mistakes; if the models are nearly identical, as in the right-hand case here, most of them contribute no new information to the prediction. In this example, there may be useful signal in predictors that the right-hand model never uses.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would bring the two predictor-importance slopes closer together. The right-hand model would assign more importance to the less important predictors, because deeper trees give those predictors the chance to be used in splits; previously the same few predictors were consistently selected first. Conversely, additional depth in the left-hand model increases the likelihood that the most predictive variables are used somewhere in each tree, increasing their importance. A sketch of how this could be checked empirically follows.
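
To check this empirically, the two extreme models from the sketch above could be refit with a larger interaction.depth and the importance rankings compared (same assumptions as before; a depth of 10 is an arbitrary choice):

# Refit both extremes with deeper trees and compare the importance slopes
gbmLeftDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                   interaction.depth = 10)
gbmRightDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    interaction.depth = 10)
head(summary(gbmLeftDeep, plotit = FALSE), 10)
head(summary(gbmRightDeep, plotit = FALSE), 10)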

Question 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  1. Which tree-based regression model gives the optimal resampling and test set performance?

As directed, I copied the previously used code to create a training and testing split.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Chemical.Y <- ChemicalManufacturingProcess$Yield
Chemical.X <- ChemicalManufacturingProcess %>%
  select(-Yield)

set.seed(123)
imputed.data <- mice::mice(Chemical.X, m=5, maxit=10, seed=123, printFlag=FALSE)
## Warning: Number of logged events: 1350
completed.data <- cbind(Chemical.Y, mice::complete(imputed.data, 1)) 

set.seed(123)

part <- caret::createDataPartition(completed.data$Chemical.Y, p=0.8, list=FALSE)
training <- completed.data %>%
  filter(row_number() %in% part)
testing <- completed.data %>%
  filter(!row_number() %in% part)

After testing several tree-based models, I settled on a gradient boosting model (a sketch of how the alternatives could be compared is given after the results below). Exploring the hyperparameter space gave the best parameters to use, and the resulting predictions are the best of any of the methods we have explored over the past several homeworks.

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = c(3))
gbmTune <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                 method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## (this warning was repeated for each resample)
gbmTune$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 76    1000                 7      0.01              3
Metrics::rmse(testing$Chemical.Y, predict(gbmTune, testing %>% select(-Chemical.Y)))
## [1] 0.9564706
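
For reference, a sketch of how the alternatives (random forest and Cubist) could be fit and compared on the same test set; the methods and tuneLength values here are my assumptions, not the exact settings used:

# Random forest and Cubist fits for comparison with the boosted model
rfTune <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                method = "rf", tuneLength = 5)
cubTune <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                 method = "cubist", tuneLength = 5)
Metrics::rmse(testing$Chemical.Y, predict(rfTune, testing %>% select(-Chemical.Y)))
Metrics::rmse(testing$Chemical.Y, predict(cubTune, testing %>% select(-Chemical.Y)))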

  2. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The full importance list is displayed below. The top ten are very similar, though not identical, to those from the previously tested methods. While the earlier linear and nonlinear models weighted the manufacturing process variables more heavily, this model spreads importance more evenly between the biological and process variables. All of the models agree that ManufacturingProcess32 is the most important predictor. A short tally of the top ten follows the output.

summary(gbmTune)

##                                           var      rel.inf
## ManufacturingProcess32 ManufacturingProcess32 24.717991552
## ManufacturingProcess13 ManufacturingProcess13  6.906255989
## ManufacturingProcess09 ManufacturingProcess09  5.018347288
## BiologicalMaterial03     BiologicalMaterial03  4.714757932
## BiologicalMaterial12     BiologicalMaterial12  4.590697073
## ManufacturingProcess17 ManufacturingProcess17  4.553506444
## BiologicalMaterial09     BiologicalMaterial09  3.445449723
## ManufacturingProcess06 ManufacturingProcess06  3.217161275
## BiologicalMaterial11     BiologicalMaterial11  3.197827103
## BiologicalMaterial05     BiologicalMaterial05  2.268780338
## ManufacturingProcess31 ManufacturingProcess31  2.183720557
## BiologicalMaterial06     BiologicalMaterial06  1.735928910
## BiologicalMaterial02     BiologicalMaterial02  1.607461207
## ManufacturingProcess28 ManufacturingProcess28  1.557782902
## ManufacturingProcess15 ManufacturingProcess15  1.437704497
## ManufacturingProcess11 ManufacturingProcess11  1.346729182
## ManufacturingProcess34 ManufacturingProcess34  1.254622026
## ManufacturingProcess29 ManufacturingProcess29  1.241296671
## ManufacturingProcess37 ManufacturingProcess37  1.166343656
## ManufacturingProcess05 ManufacturingProcess05  1.162727156
## ManufacturingProcess21 ManufacturingProcess21  1.158716356
## ManufacturingProcess04 ManufacturingProcess04  1.157864204
## ManufacturingProcess20 ManufacturingProcess20  1.065752381
## ManufacturingProcess30 ManufacturingProcess30  1.052053868
## BiologicalMaterial01     BiologicalMaterial01  1.037805386
## ManufacturingProcess01 ManufacturingProcess01  1.020311387
## ManufacturingProcess39 ManufacturingProcess39  1.017458711
## BiologicalMaterial08     BiologicalMaterial08  0.971268482
## ManufacturingProcess43 ManufacturingProcess43  0.961235266
## BiologicalMaterial04     BiologicalMaterial04  0.943858845
## ManufacturingProcess18 ManufacturingProcess18  0.871847148
## ManufacturingProcess14 ManufacturingProcess14  0.857693718
## BiologicalMaterial10     BiologicalMaterial10  0.848342810
## ManufacturingProcess33 ManufacturingProcess33  0.778537642
## ManufacturingProcess24 ManufacturingProcess24  0.767948048
## ManufacturingProcess19 ManufacturingProcess19  0.745493681
## ManufacturingProcess22 ManufacturingProcess22  0.742638242
## ManufacturingProcess27 ManufacturingProcess27  0.734728821
## ManufacturingProcess10 ManufacturingProcess10  0.682639019
## ManufacturingProcess25 ManufacturingProcess25  0.670080734
## ManufacturingProcess02 ManufacturingProcess02  0.659994397
## ManufacturingProcess45 ManufacturingProcess45  0.584973196
## ManufacturingProcess35 ManufacturingProcess35  0.547901351
## ManufacturingProcess16 ManufacturingProcess16  0.514224281
## ManufacturingProcess42 ManufacturingProcess42  0.463439222
## ManufacturingProcess23 ManufacturingProcess23  0.422899176
## ManufacturingProcess44 ManufacturingProcess44  0.415134943
## ManufacturingProcess26 ManufacturingProcess26  0.293688857
## ManufacturingProcess38 ManufacturingProcess38  0.203918351
## ManufacturingProcess36 ManufacturingProcess36  0.167220095
## ManufacturingProcess03 ManufacturingProcess03  0.138542162
## ManufacturingProcess07 ManufacturingProcess07  0.092393415
## ManufacturingProcess08 ManufacturingProcess08  0.050644767
## ManufacturingProcess40 ManufacturingProcess40  0.015448403
## ManufacturingProcess12 ManufacturingProcess12  0.015117883
## ManufacturingProcess41 ManufacturingProcess41  0.003093273
## BiologicalMaterial07     BiologicalMaterial07  0.000000000
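
To make the biological/process comparison explicit, the top ten can be tallied directly from the final model; a small sketch:

# Tally biological vs. process variables among the ten most important predictors
gbmImp <- summary(gbmTune$finalModel, plotit = FALSE)
top10 <- head(gbmImp, 10)
table(ifelse(grepl("^Biological", top10$var), "Biological", "Process"))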

  3. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Searching over the hyperparameter space shows that a maximum tree depth of 2 is ideal for a single tree. The tree is trained and plotted below. This view does provide some additional insight: after the initial split (on the most important predictor, as expected), the second split differs between the two branches, which raises the importance of BiologicalMaterial11 well above what the earlier rankings suggested.

single.tree <- train(training %>% select(-Chemical.Y), training$Chemical.Y,
                     method = 'rpart2', tuneLength = 40,
                     trControl = trainControl(method = 'cv'))
## note: only 11 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 11 .
single.tree
## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 128, 131, 131, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.429123  0.4535543  1.149425
##    2        1.384670  0.5028967  1.111596
##    3        1.414418  0.4828775  1.101937
##    4        1.460226  0.4468509  1.133803
##    5        1.450373  0.4551442  1.124306
##    6        1.443212  0.4694187  1.106913
##    7        1.456020  0.4672857  1.110311
##    8        1.437315  0.4801473  1.087532
##    9        1.426246  0.4894606  1.071656
##   10        1.432154  0.4850517  1.073811
##   11        1.436393  0.4822823  1.083132
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 2.
library(partykit)
library(rpart)
r <- rpart(Chemical.Y ~ ., data=training, maxdepth=2)
plot(as.party(r))