DATA 624 - Homework # 9

8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

rfImp1

##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

No it did not. The predictors V6 ~ V10 have very small importance, compare to other predictors such as V1, V2, V4, and V5.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

Yes, it did. The importance for V1 is reduced, splitted between V1 and duplicated1. This is because the trees in the random forest are just as likely to pick V1 for a split than to pick duplicated1, since they are so correlated. As a result, the total number of splits that originally belonged to V1 is “shared” with duplicated1.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
library(dplyr)
cforestModel <- cforest(y ~ ., data=simulated)

# Unconditional importance measure
varimp(cforestModel) %>% sort(decreasing = T)

##            V4            V2    duplicate1            V1            V5 
##  7.6223892727  6.0579730772  5.0941897280  4.6171158805  1.7161194047 
##            V7            V9            V3            V6           V10 
##  0.0465374951  0.0046062409  0.0003116115 -0.0289427183 -0.0310326410 
##            V8 
## -0.0380965511

# Conditional importance measure
varimp(cforestModel, conditional=T) %>% sort(decreasing = T)

##           V4           V2   duplicate1           V1           V5 
##  6.175211816  4.806626676  2.039771619  1.894092913  1.028543061 
##           V6           V3           V9           V7          V10 
##  0.021733840  0.020998486  0.003766844 -0.006026025 -0.015318278 
##           V8 
## -0.022310557

The new importance calculated (with conditional set to TRUE) is slightly different than the traditional measurement. The uninformative predictors (V6 ~ V10) are still rated low importances. The importance of other predictors are reduced, with V1 and its highly correlated duplicated1 being reduced the most. Both measurement rated V4 as the top most important predictor, followed by V2.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

For the gradient boosted trees model, I use the gbm function. The model is built using its default input parameters. The summary function can be used to get the relative influence of the predictors on the model.

library(gbm)
gbmModel <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 27.8345107
## V2                 V2 23.2038908
## V1                 V1 17.6456066
## duplicate1 duplicate1 12.3248580
## V5                 V5 10.5288759
## V3                 V3  7.9846250
## V7                 V7  0.2611591
## V6                 V6  0.2164740
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

The same pattern occurs. The uninformative predictors V6 ~ V10 are still rated very low. V4 is still the top rated predictor, followed by V2. The correlated predictors V1 and duplicate1, tho, have higher differences in their rated importance. The GBM model seems to detect that V1 is much more important than duplicated1. This is because the trees from boosting are dependent on each other and will have correlated structures. Many same predictors will be selected across the trees. As a result, the magnitude of their contribution to the importance metric looks more “stretched” comparing to other tree models.

For the Cubist model, I use the cubist function. I set the committees to 100, to match that of the default number of trees in the gbm model. The varImp function calculats the variable importance of a Cubist model.

library(Cubist)
cubistModel <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistModel)

##            Overall
## V3            43.5
## V1            52.5
## V2            59.5
## duplicate1    27.5
## V4            46.0
## V8             4.0
## V5            27.0
## V6            10.0
## V10            1.0
## V7             0.0
## V9             0.0

Again, the uninformative predictors V6 ~ V10 are rated low in their importance. The top rated predictor, tho, is not V4 this time, but V2. This is different than the random forest and GBM models. For V1 and its correlated predictor duplicated1, the Cubist model values V1 higher than duplicated1 just like the GBM model.

8.2 Use a simulation to show tree bias with different granularities.

The tree models suffer from selection bias that the predictors having a higher number of distinct values are favored over more granular predictors.

Below, I simulated 4 variables using the sample function with replacement. All 4 variables are values ranging from 0 to 1, with the following granularity:

The x1 variable has 10001 possible values
The x2 variable has 1001 possible values
The x3 variable has 101 possible values
The x4 variable has 11 possible values

The target variable y is created by adding x1 and x4, plus a random error term. Note that x2 and x3 are not used to generate the target.

set.seed(1)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)

y <- x1 + x4 + rnorm(200) 

df <- data.frame(x1, x2, x3, x4, y)
str(df)

## 'data.frame':    200 obs. of  5 variables:
##  $ x1: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x2: num  0.267 0.218 0.517 0.269 0.181 0.519 0.563 0.129 0.256 0.718 ...
##  $ x3: num  0.66 0.18 0.96 0.9 0.95 0.73 0.37 0.78 0.01 0.94 ...
##  $ x4: num  0.8 1 0.1 0.8 1 1 0.3 0.4 1 0.1 ...
##  $ y : num  2.1399 3.2678 0.0699 1.3173 0.7855 ...

Below, a regression tree is fitted using rpart, and the variable importance is calculated using varImp.

library(rpart)

rpartTree <- rpart(y ~ ., data=df)
varImp(rpartTree)

##      Overall
## x1 1.0385735
## x2 0.8086365
## x3 0.7884261
## x4 0.6570896

As you can see, the tree uses x1 mostly to split; while it uses x4 the least. Interestingly, the tree model also uses x2 and x3 to split, even though they are not used to generate the target. X2 and x3 all have more distinct values than x4. This demonstrates that there is a selection bias in the tree model, that it favors predictors with more distinct values.

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Gradient boosting trees make predictions by adding the predictions of each subsequent tree. These trees are dependent on each other - each tree is built upon the errors of previous tree. Therefore they have correlated structures. Suppose one of the tree split on one feature frequently. Its subsequent tree, which is built using its errors, will likely split on that feature frequently as well. As a result, many of the same predictors will be selected across the trees, increasing their contribution to the importance metric.

Learning rate controls the fraction of the predictions of each tree being added. A higher learning rate means that larger fraction of each tree’s predictions are added to the final prediction. This effectively means that a higher learning rate increases the dependent / correlation structure. More of the same predictors will be selected among the trees. This is why the right-hand plot has its importance focus on just the first few of the predictors, and look very steep.

Bagging fraction is the fraction of data being used in each iteration of the trees. When you have a small bagging fraction, say 0.1, on each iteration just 10% of the full data is randomly sampled. So each tree may be built using very different dataset. Since the dataset are very different, the trees will be splitting very differently from each other. On the contrast, when you have large bagging fraction, say 0.9, essentially on each iteration the trees are seeing the same dataset - they will likely split similarly. This means that larger bagging fraction increases the dependent / correlated structure in the boosting trees. Therefore, the right-hand plot with a larger bagging fraction has its importance focus on just the first few of the predictors.

(b) Which model do you think would be more predictive of other samples?

Learning rate and bagging fraction are important parameters to control the overfitting of the gradient boosting model that requires tuning. A smaller learning rate and bagging fraction leads to better generalization ability over unseen samples. If I have to guess, the model with 0.1 learning rate and bagging fraction will be more predictive of out of bag samples. However, since this invovles a trade off between bias-variance, I can only confirm using cross validation or a test set.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Below, I train two GBM models with all parameters the same except for the interaction depth. Model 1 has interaction depth of 1, while Model 2 has interaction depth of 10. The predictor importance of both models are plotted:

library(AppliedPredictiveModeling)

data(solubility)

grid1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid1, verbose = FALSE)

grid2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1, n.minobsinnode=10)
gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = grid2, verbose = FALSE)

plot(varImp(gbm1))

plot(varImp(gbm2))

As you can see, increasing the interaction depth seems to spread out the importance more, since each tree now can grow deeper, and has more chance for other features to be involved in the splitting process. Therefore, increasing the depth reduce the slope of the importance plot.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

The missing values in the ChemicalManufacturingProcess data are imputed using the bagImpute method. The train test set are splitted, with 20% of the data assigned to the test set.

data(ChemicalManufacturingProcess)

# Impute missing values using `bagImpulte`
(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))

## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - bagged tree imputation (57)
##   - ignored (0)

cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])

# Train/test plitting data, 20% testing
set.seed(1)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X.train <- cmp[trainRow, ]
y.train <- ChemicalManufacturingProcess$Yield[trainRow]
X.test <- cmp[-trainRow, ]
y.test <- ChemicalManufacturingProcess$Yield[-trainRow]

Below, 4 regression tree models are trained: single tree, random forest, gradient boosting, and cubist. The bootstrapped resampling method is used with 25 repetition to determine the final models. RMSE is used as the metric. There is no need for centering and scaling for the tree models.

Single tree, tuned over the complexity parameter:

set.seed(1)
rpartModel <- train(x = X.train,
                    y = y.train,
                    method = "rpart",
                    tuneLength = 10,
                    control = rpart.control(maxdepth=2))
rpartModel

## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.01143214  1.406955  0.4650570  1.123290
##   0.01620406  1.406955  0.4650570  1.123290
##   0.02156477  1.406955  0.4650570  1.123290
##   0.02456980  1.406955  0.4650570  1.123290
##   0.02550828  1.406955  0.4650570  1.123290
##   0.02805096  1.406955  0.4650570  1.123290
##   0.04854745  1.413676  0.4604097  1.133811
##   0.06127500  1.407444  0.4637199  1.130400
##   0.07019415  1.406903  0.4645587  1.130002
##   0.47674176  1.645284  0.4289123  1.352884
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07019415.

Random forest, tuned over number of features to use in each tree:

set.seed(1)
rfModel <- train(x = X.train,
                 y = y.train,
                 method = 'rf',
                 tuneLength = 10)
rfModel

## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.332743  0.5778432  1.0623662
##    8    1.237215  0.6096541  0.9669610
##   14    1.220952  0.6081323  0.9443334
##   20    1.207311  0.6111087  0.9321872
##   26    1.198862  0.6126788  0.9225103
##   32    1.195359  0.6115664  0.9187137
##   38    1.200667  0.6049647  0.9218494
##   44    1.205700  0.5990895  0.9228859
##   50    1.211269  0.5935762  0.9251367
##   57    1.216546  0.5883913  0.9299937
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 32.

Gradient boosting, tuned over the number of trees, interaction depth, shrinkage rate, and minimum terminal node size:

set.seed(1)
grid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
gbmModel <- train(x = X.train,
                  y = y.train, 
                  method = 'gbm', 
                  tuneGrid = grid, 
                  verbose = FALSE)

plot(gbmModel)

gbmModel$bestTune

##    n.trees interaction.depth shrinkage n.minobsinnode
## 64     200                 5       0.1              5

gbmModel$finalModel

## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 57 predictors of which 55 had non-zero influence.

Cubist, tuned over the number of committees and neighbors:

set.seed(1)
cubistModel <- train(x = X.train,
                     y = y.train,
                     method = 'cubist')
cubistModel

## Cubist 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.939537  0.3009835  1.3306156
##    1          5          1.921018  0.3086535  1.3044559
##    1          9          1.928331  0.3067974  1.3155708
##   10          0          1.272976  0.5524599  0.9579863
##   10          5          1.258384  0.5636896  0.9420485
##   10          9          1.265884  0.5594987  0.9513648
##   20          0          1.222938  0.5869734  0.9307929
##   20          5          1.208383  0.5968280  0.9119309
##   20          9          1.216740  0.5923736  0.9230380
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

(a) Which tree-based regression model gives the optimal resampling and test set performance?

The resampling performance of all the models are calculated below:

resamp <- resamples(list(SingleTree=rpartModel, RandomForest=rfModel, GradientBoosting=gbmModel, Cubist=cubistModel))

summary(resamp)

## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: SingleTree, RandomForest, GradientBoosting, Cubist 
## Number of resamples: 25 
## 
## MAE 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## SingleTree       0.9362303 1.0587955 1.1117806 1.1300022 1.2306692
## RandomForest     0.6722850 0.8648722 0.9082083 0.9187137 0.9868557
## GradientBoosting 0.7361555 0.8508808 0.9156229 0.9020898 0.9701113
## Cubist           0.7468634 0.8576607 0.8925733 0.9119309 0.9351510
##                      Max. NA's
## SingleTree       1.280239    0
## RandomForest     1.093441    0
## GradientBoosting 1.035262    0
## Cubist           1.171089    0
## 
## RMSE 
##                       Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## SingleTree       1.1430661 1.312677 1.409903 1.406903 1.483944 1.604770
## RandomForest     0.8731812 1.112670 1.200417 1.195359 1.286659 1.432061
## GradientBoosting 0.9561651 1.107102 1.193341 1.181825 1.268725 1.358972
## Cubist           0.9925675 1.090693 1.200062 1.208383 1.262978 1.508471
##                  NA's
## SingleTree          0
## RandomForest        0
## GradientBoosting    0
## Cubist              0
## 
## Rsquared 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## SingleTree       0.3215285 0.3908312 0.4696616 0.4645587 0.5338047
## RandomForest     0.4596517 0.5553790 0.6011582 0.6115664 0.6563815
## GradientBoosting 0.4982489 0.5658860 0.5931751 0.6093390 0.6735463
## Cubist           0.4233126 0.5340132 0.6092376 0.5968280 0.6544410
##                       Max. NA's
## SingleTree       0.6071253    0
## RandomForest     0.7956340    0
## GradientBoosting 0.7096665    0
## Cubist           0.7386835    0

Looking at the Mean of the RMSE metric, it appears the gradient boosting model is optimal.

The test set performance is calculated below:

testPerf <- function(models, testData, testTarget) {
  method <- c()
  res <- data.frame()
  for(model in models){
    method <- c(method, model$method)
    pred <- predict(model, newdata=testData)
    res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))
  }
  row.names(res) <- method
  return(res)
}

models <- list(rpartModel, rfModel, gbmModel, cubistModel)

performance <- testPerf(models, X.test, y.test)
performance

##             RMSE  Rsquared       MAE
## rpart  1.8616397 0.1224457 1.4768675
## rf     1.3135442 0.5137743 0.9873300
## gbm    1.2144783 0.5913214 0.9505038
## cubist 0.8749209 0.7821383 0.6652048

The test set performance suggests that the cubist model is the best. Here, I would trust the resampled metrics than the metric reported by the test set, since the size of the data is very small and the test set performance may not be a close approximation of the true model performance. There, I would choose grandient boosting model.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The predictor importance can be found using the varImp function:

varImp(gbmModel)

## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  18.122
## BiologicalMaterial06    13.884
## ManufacturingProcess13  13.277
## BiologicalMaterial12    12.973
## ManufacturingProcess06  11.708
## BiologicalMaterial03    11.029
## ManufacturingProcess21   9.212
## BiologicalMaterial11     8.169
## BiologicalMaterial09     8.004
## ManufacturingProcess09   7.516
## ManufacturingProcess01   7.436
## ManufacturingProcess31   5.225
## ManufacturingProcess04   5.053
## BiologicalMaterial05     4.584
## ManufacturingProcess27   3.877
## ManufacturingProcess19   3.854
## ManufacturingProcess43   3.850
## BiologicalMaterial10     3.705
## ManufacturingProcess03   3.537

Out of top 20 important variables, 13 are ManufacturingProcess predictors and 7 are BiologicalMaterial predictors. The ManufacturingProcess predictors dominated the list.

The top 10 predictors from the optimal linear model are:

## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess36 
##            0.651786278            0.372412766            0.260394128 
## ManufacturingProcess17 ManufacturingProcess13 ManufacturingProcess37 
##            0.257187426            0.215257776            0.163439748 
## ManufacturingProcess06   BiologicalMaterial06 ManufacturingProcess34 
##            0.163095669            0.133816040            0.126158248 
## ManufacturingProcess15   
##            0.115667061

The top 10 predictors from the optimal nonlinear models are:

## ManufacturingProcess32  100.00
## BiologicalMaterial06     84.75
## ManufacturingProcess36   73.66
## BiologicalMaterial03     70.69
## ManufacturingProcess13   68.69
## BiologicalMaterial02     64.16
## BiologicalMaterial12     60.11
## ManufacturingProcess17   57.19
## ManufacturingProcess09   55.42
## ManufacturingProcess33   55.39

As you can see, all three model agrees that ManufacturingProcess32 is the top most important predictor. In the optimal linear model, almost all of the top 10 predictors are from the ManufacturingProcess predictors; while for the optimal tree and nonlinear models, the top 10 predictors are not as dominated by the ManufacturingProcess predictors.

Also, notice that in the optimal tree model (gradient boosting), the importance metric drops off quickly. The metric suggest that the 2nd important predictor is 18.122/100-1 = 82% less important than the 1st important predictor. On the other hand, in the nonlinear and linear models, this drop is no as significant. For the linear model, the drop in important, for example between the 1st and 2nd most important feature is a drop of 0.372/0.652-1 = 43%. For the nonlinear model, the drop is just 84.75-100-1 = 15%. The importance decrease much more slowly.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Below, the optimal single tree model is plotted. It appears the model made one single split using ManufacturingProcess32 at the value of 159.5. If it’s less than 159.5, yield is assigned to be 39.13376. If it’s equal or more than 159.5, the yield is assigned to be 41.70932.

rpartModel$finalModel

## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 144 484.5825 40.18903  
##   2) ManufacturingProcess32< 159.5 85 132.7408 39.13376 *
##   3) ManufacturingProcess32>=159.5 59 120.8210 41.70932 *

plot(rpartModel$finalModel)
text(rpartModel$finalModel)

This view provide two additional knowledge about the relationships:

Higher value of ManufacturingProcess32 leads to higher yield, and vice versa.
The assign values are the means of the groups splitted by ManufacturingProcess32 at 159.5, as demonstrated below:

trainData <- data.frame(X.train, y.train)
less <- subset(trainData, ManufacturingProcess32 < 159.5)
more <- subset(trainData, ManufacturingProcess32 >= 159.5)

mean(less$y.train)

## [1] 39.13376

mean(more$y.train)

## [1] 41.70932

DATA 624 - Homework # 9

Jun Yan

April 14, 2019