8.1

Recreate the simulated data from Exercise 7.2

  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(vip)       # vip() importance plot (built on ggplot2)
library(ggplot2)   # for ggtitle()
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- model1$importance   # equivalently: caret::varImp(model1, scale = FALSE)
vip(model1, color = 'red', fill = 'green') + ggtitle('Model1 Var Imp')

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

Based on the chart above, it appears that the random forest model did not make significant use of the uninformative predictors (V6-V10).

  1. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Adding a highly correlated predictor changes the variable importance results. V1's score changed: it was the most important variable in model1, but after adding duplicate1, which is highly correlated with V1, its importance score decreased. The scores for V6-V10 also changed, increasing somewhat in importance.

library(caret)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
vip(model2, color = 'green', fill='red') + ggtitle('Model2 Var Imp')
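
To answer the second part of the question, a minimal sketch (working on a copy of the data so the later models are unaffected; duplicate2 is a name introduced here for illustration) adds a second predictor that is also highly correlated with V1 and refits:

simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1   # second correlated copy of V1
model2b <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)

With two near-duplicates of V1 in the data, the importance credited to V1 would be expected to be diluted even further, since the trees can split on any of the three nearly interchangeable predictors.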

  1. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

When variable importance is computed conditionally, the correlation between V1 and duplicate1 is taken into account and the importance scores are adjusted accordingly. With the unconditional approach, V1 and duplicate1 receive similarly inflated scores. The patterns here do not match those of the traditional random forest model.

library(partykit)   # successor to party; provides cforest() and varimp() as well
model3 <- cforest(y ~ ., data = simulated, ntree = 1000)
# Conditional variable importance (Strobl et al., 2007)
cfImp_cond <- varimp(model3, conditional = TRUE)
# Unconditional (traditional) variable importance
cfImp_uncond <- varimp(model3, conditional = FALSE)

Conditional

barplot(sort(cfImp_cond),horiz = TRUE, main = 'Conditional')

Unconditional

barplot(sort(cfImp_uncond),horiz = TRUE, main = 'Unconditional')

  1. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(Cubist)
library(caret)
model4 <- cubist(x = simulated[, setdiff(names(simulated), 'y')], y = simulated$y)
# Note: varImp() for Cubist models does not use a conditional argument, so the
# two calls below return the same usage-based scores
cImp_cond <- varImp(model4, conditional = TRUE)
cImp_uncond <- varImp(model4, conditional = FALSE)

For Cubist, the conditional and unconditional patterns are identical: the conditional argument does not apply to Cubist models, so varImp() returns the same scores in both cases.

barplot((t(cImp_cond)),horiz = TRUE, main = 'Conditional')

barplot((t(cImp_uncond)),horiz = TRUE, main = 'Unconditional')
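
The question also mentions boosted trees. A minimal sketch with gbm (assuming the simulated data frame from above, with duplicate1 still included; tuning values here are illustrative) would be:

library(gbm)
set.seed(200)
gbm_sim <- gbm(y ~ ., data = simulated, distribution = 'gaussian',
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)
# relative influence from summary() is the boosted-tree analogue of variable importance
summary(gbm_sim, plotit = FALSE)

As with the random forest, the importance attributed to V1 would be expected to be shared with duplicate1, while V6-V10 remain near the bottom.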


8.2

Use a simulation to show tree bias with different granularities.

V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1 

df <- data.frame(V1, V2, V3, y)
model5 <- cforest(y ~ ., data = df, ntree = 10)

# Unconditional variable importance
cfImp_model5 <- varimp(model5, conditional = FALSE)

With V1 having the most distinct values, the random forest gives this variable the highest score. V2 has a higher score than V3, since V2 has more distinct values (and, unlike V3, V2 is actually used to generate y).

barplot(sort(cfImp_model5), horiz = TRUE, main = 'Unconditional')
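
To see the selection bias more directly, a small sketch (all names here are introduced for illustration) can count which predictor a single CART tree picks for its first split when the response is pure noise:

library(rpart)
set.seed(624)
first_split <- replicate(200, {
  sim <- data.frame(x1 = runif(100),                        # many distinct values
                    x2 = sample(1:10, 100, replace = TRUE), # coarser granularity
                    x3 = sample(1:2, 100, replace = TRUE),  # binary
                    y  = rnorm(100))                        # response unrelated to x1-x3
  fit <- rpart(y ~ ., data = sim,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(first_split)

Even though none of the predictors is related to the response, the finely grained x1 is typically chosen for the first split far more often than the coarser x2 and x3, which is the same bias the simulation above illustrates.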


8.3

Figure 8.24

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.

LEFT = {bagging fraction = 0.1, learning rate = 0.1} | RIGHT = {bagging fraction = 0.9, learning rate = 0.9}

  1. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has a higher bagging fraction and learning rate. When the bagging fraction is higher, a larger fraction of the training data is used to construct each tree. As the bagging fraction approaches 1, the bootstrap samples become more similar to one another, and the same dominant predictors are selected again and again, concentrating the importance on just a few of them.

The model on the left has a lower bagging fraction and learning rate. The lower the learning rate, the less greedy the model is, which makes it more likely to identify a larger set of predictors as important, as the sketch below illustrates.
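
A minimal sketch that reproduces the two settings (assuming the solubility data from the AppliedPredictiveModeling package; object names and n.trees are illustrative) might look like:

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
# Left-hand panel: bagging fraction = 0.1, learning rate = 0.1
gbm_left  <- gbm(Solubility ~ ., data = sol_train, distribution = 'gaussian',
                 n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1)
# Right-hand panel: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm(Solubility ~ ., data = sol_train, distribution = 'gaussian',
                 n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9)

summary(gbm_left,  plotit = FALSE)[1:10, ]   # expected: importance spread across more predictors
summary(gbm_right, plotit = FALSE)[1:10, ]   # expected: importance concentrated in a few predictors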

  1. Which model do you think would be more predictive of other samples?

The model on the left, with the lower bagging fraction and lower learning rate, would likely generalize better to other samples than the model on the right.

  1. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

As we increase the interaction depth, variable importance is more likely to be spread across additional predictors, since deeper trees give more predictors a chance to be used for splitting. This would flatten the slope of the predictor-importance curve for both models; a sketch follows.
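
To check this, the same sketch can be rerun with a larger interaction depth (reusing the sol_train object from the sketch above; the value 10 is illustrative):

set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = sol_train, distribution = 'gaussian',
                n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9,
                interaction.depth = 10)
# With deeper trees, more predictors get a chance to be used, so the importance
# profile is expected to flatten relative to gbm_right
summary(gbm_deep, plotit = FALSE)[1:10, ]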


8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

Split train/test.

cmp_predictors = as.matrix(ChemicalManufacturingProcess[,2:58])
cmp_yield = ChemicalManufacturingProcess[,1]  

set.seed(100)
train_select <- createDataPartition(cmp_yield, p=0.75, list=F) #create train set
train_x <- ChemicalManufacturingProcess[train_select,-1]
train_y <-  ChemicalManufacturingProcess[train_select,1]
test_x <- ChemicalManufacturingProcess[-train_select,-1]
test_y <-  ChemicalManufacturingProcess[-train_select,1]


pre_process <- c("nzv",  "corr", "center","scale", "medianImpute")
set.seed(200)
ctrl <- trainControl(method = "boot", number = 25)
  1. Which tree-based regression model gives the optimal resampling and test set performance?

Based on the resampling RMSE, Cubist gives the best performance (RMSE 1.12 with committees = 100 and neighbors = 1); a test-set comparison sketch follows the model summaries below.

Recursive Partitioning

set.seed(300)
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(x = train_x, y = train_y, method = "rpart2",metric = "Rsquared", tuneGrid = rpartGrid,
                       trControl = ctrl, preProcess=pre_process)

Random Forest

set.seed(300)
rfGrid <- expand.grid(mtry=seq(2,38,by=3))
rf_model <- train(x = train_x, y = train_y, method = "rf", tuneGrid = rfGrid, metric = "Rsquared", importance = TRUE, 
                  trControl = ctrl,preProcess=pre_process)

Generalized Boosted Regression

library(gbm)
## Loaded gbm 2.1.5
set.seed(300)
gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1),
                       n.trees=c(25,50,100,200),
                       shrinkage=c(0.01,0.05,0.1,0.2),
                       n.minobsinnode=5)

gb_model <- train(x = train_x, y = train_y,method = "gbm", metric = "Rsquared",verbose = FALSE, 
                  tuneGrid = gbmGrid, trControl = ctrl, preProcess=pre_process)

Cubist

set.seed(300)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100), 
                          neighbors = c(0, 1, 3, 5, 7))

cubist_model <- train(x = train_x, y = train_y,method = "cubist", 
                        verbose = FALSE, metric = "Rsquared", tuneGrid = cubistGrid,trControl = ctrl, preProcess=pre_process)

Recursive Partitioning: RMSE 1.496662

rp_model
## CART 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (47), scaled (47), median imputation (47),
##  remove (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.446007  0.4220720  1.143513
##    2        1.502728  0.3892350  1.192220
##    3        1.494603  0.4095253  1.180323
##    4        1.494340  0.4239887  1.182846
##    5        1.496662  0.4248578  1.181303
##    6        1.520564  0.4165578  1.191895
##    7        1.520294  0.4229997  1.186132
##    8        1.525088  0.4228835  1.191509
##    9        1.530088  0.4196601  1.193589
##   10        1.538378  0.4155667  1.199175
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 5.

Random Forest: RMSE 1.200401

rf_model
## Random Forest 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (47), scaled (47), median imputation (47),
##  remove (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.320376  0.5691294  1.0436938
##    5    1.242293  0.6013719  0.9720371
##    8    1.218744  0.6081985  0.9513966
##   11    1.209857  0.6090369  0.9398972
##   14    1.200401  0.6124067  0.9299892
##   17    1.199905  0.6092975  0.9313788
##   20    1.194800  0.6104796  0.9261917
##   23    1.198033  0.6077147  0.9287821
##   26    1.199683  0.6038839  0.9295573
##   29    1.200444  0.6031840  0.9290230
##   32    1.200941  0.6018622  0.9276243
##   35    1.211061  0.5935499  0.9362114
##   38    1.210580  0.5940910  0.9371074
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.

Generalized Boosting: RMSE 1.194412

gb_model
## Stochastic Gradient Boosting 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (47), scaled (47), median imputation (47),
##  remove (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   0.01       1                   25      1.713421  0.4819946  1.3571651
##   0.01       1                   50      1.607853  0.4972286  1.2674166
##   0.01       1                  100      1.476911  0.5221612  1.1654662
##   0.01       1                  200      1.351535  0.5431167  1.0657285
##   0.01       2                   25      1.689819  0.4995401  1.3406533
##   0.01       2                   50      1.562149  0.5209369  1.2328571
##   0.01       2                  100      1.413837  0.5388320  1.1145509
##   0.01       2                  200      1.298976  0.5553711  1.0192666
##   0.01       3                   25      1.680068  0.5097741  1.3337627
##   0.01       3                   50      1.544032  0.5332486  1.2208632
##   0.01       3                  100      1.391206  0.5444963  1.0940655
##   0.01       3                  200      1.277749  0.5639034  1.0022225
##   0.01       4                   25      1.672308  0.5243584  1.3282058
##   0.01       4                   50      1.533669  0.5367465  1.2134469
##   0.01       4                  100      1.376582  0.5529506  1.0852364
##   0.01       4                  200      1.264614  0.5710770  0.9931650
##   0.01       5                   25      1.669260  0.5257313  1.3273246
##   0.01       5                   50      1.531100  0.5388978  1.2129025
##   0.01       5                  100      1.372559  0.5535882  1.0834247
##   0.01       5                  200      1.260669  0.5723870  0.9894489
##   0.01       6                   25      1.663472  0.5355138  1.3231436
##   0.01       6                   50      1.521538  0.5441659  1.2071121
##   0.01       6                  100      1.362701  0.5587429  1.0757300
##   0.01       6                  200      1.250047  0.5794039  0.9794532
##   0.05       1                   25      1.431720  0.5212763  1.1296549
##   0.05       1                   50      1.325706  0.5373127  1.0440116
##   0.05       1                  100      1.277340  0.5492961  1.0003337
##   0.05       1                  200      1.271666  0.5535652  0.9887431
##   0.05       2                   25      1.381431  0.5312744  1.0855944
##   0.05       2                   50      1.288525  0.5498220  1.0096878
##   0.05       2                  100      1.252644  0.5648314  0.9756844
##   0.05       2                  200      1.227951  0.5814545  0.9544170
##   0.05       3                   25      1.340778  0.5494681  1.0566491
##   0.05       3                   50      1.261921  0.5620340  0.9899643
##   0.05       3                  100      1.232201  0.5791961  0.9618919
##   0.05       3                  200      1.208345  0.5949999  0.9365272
##   0.05       4                   25      1.335872  0.5477303  1.0539631
##   0.05       4                   50      1.259873  0.5647446  0.9876622
##   0.05       4                  100      1.225381  0.5845456  0.9538416
##   0.05       4                  200      1.204639  0.5988190  0.9314796
##   0.05       5                   25      1.333420  0.5433711  1.0533135
##   0.05       5                   50      1.250932  0.5664684  0.9791900
##   0.05       5                  100      1.216350  0.5863053  0.9429517
##   0.05       5                  200      1.194412  0.6015103  0.9212233
##   0.05       6                   25      1.326120  0.5489203  1.0478633
##   0.05       6                   50      1.241881  0.5743299  0.9720595
##   0.05       6                  100      1.211085  0.5908748  0.9388077
##   0.05       6                  200      1.196340  0.5998556  0.9215901
##   0.10       1                   25      1.337647  0.5254716  1.0498406
##   0.10       1                   50      1.287789  0.5397814  1.0055440
##   0.10       1                  100      1.281903  0.5450445  0.9991686
##   0.10       1                  200      1.283337  0.5469108  1.0001891
##   0.10       2                   25      1.290522  0.5408344  1.0120448
##   0.10       2                   50      1.258719  0.5590678  0.9848854
##   0.10       2                  100      1.237131  0.5752840  0.9669757
##   0.10       2                  200      1.228188  0.5826838  0.9528319
##   0.10       3                   25      1.292146  0.5375782  1.0147189
##   0.10       3                   50      1.264751  0.5559729  0.9902035
##   0.10       3                  100      1.244973  0.5722418  0.9674498
##   0.10       3                  200      1.233130  0.5806102  0.9541416
##   0.10       4                   25      1.275248  0.5464811  1.0032062
##   0.10       4                   50      1.244531  0.5671624  0.9713187
##   0.10       4                  100      1.228697  0.5793449  0.9551650
##   0.10       4                  200      1.219937  0.5850983  0.9465322
##   0.10       5                   25      1.250406  0.5620594  0.9787226
##   0.10       5                   50      1.222669  0.5794037  0.9472771
##   0.10       5                  100      1.205240  0.5928691  0.9256528
##   0.10       5                  200      1.197315  0.5983532  0.9176540
##   0.10       6                   25      1.253090  0.5644710  0.9775822
##   0.10       6                   50      1.220784  0.5856716  0.9449732
##   0.10       6                  100      1.207360  0.5948927  0.9307187
##   0.10       6                  200      1.199964  0.5989132  0.9247113
##   0.20       1                   25      1.302057  0.5301085  1.0234120
##   0.20       1                   50      1.312754  0.5214456  1.0251724
##   0.20       1                  100      1.324777  0.5194423  1.0283380
##   0.20       1                  200      1.324010  0.5238115  1.0246157
##   0.20       2                   25      1.289058  0.5401963  1.0043093
##   0.20       2                   50      1.269608  0.5617679  0.9786454
##   0.20       2                  100      1.257876  0.5711676  0.9711393
##   0.20       2                  200      1.254104  0.5748248  0.9692057
##   0.20       3                   25      1.291760  0.5375625  0.9966757
##   0.20       3                   50      1.277594  0.5509380  0.9814454
##   0.20       3                  100      1.263415  0.5604979  0.9723079
##   0.20       3                  200      1.258556  0.5645535  0.9659650
##   0.20       4                   25      1.289691  0.5342971  0.9943173
##   0.20       4                   50      1.270047  0.5501486  0.9774601
##   0.20       4                  100      1.264751  0.5552468  0.9731044
##   0.20       4                  200      1.262690  0.5575242  0.9717571
##   0.20       5                   25      1.275473  0.5481703  0.9926977
##   0.20       5                   50      1.261985  0.5573279  0.9781802
##   0.20       5                  100      1.258264  0.5607697  0.9748566
##   0.20       5                  200      1.255744  0.5632469  0.9727153
##   0.20       6                   25      1.247444  0.5684286  0.9682643
##   0.20       6                   50      1.231271  0.5802438  0.9573174
##   0.20       6                  100      1.227512  0.5829011  0.9525557
##   0.20       6                  200      1.227286  0.5832301  0.9520439
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200,
##  interaction.depth = 5, shrinkage = 0.05 and n.minobsinnode = 5.

Cubist: RMSE 1.123671

cubist_model
## Cubist 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (47), scaled (47), median imputation (47),
##  remove (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##     1         0          1.900356  0.3391244  1.3774697
##     1         1          1.852243  0.3783705  1.3491538
##     1         3          1.868236  0.3621745  1.3523916
##     1         5          1.876522  0.3557701  1.3553085
##     1         7          1.877604  0.3526216  1.3568618
##     5         0          1.360823  0.5274541  1.0542433
##     5         1          1.316455  0.5636903  1.0076885
##     5         3          1.326182  0.5525188  1.0238896
##     5         5          1.333516  0.5471739  1.0342985
##     5         7          1.339091  0.5435583  1.0401352
##    10         0          1.299510  0.5541688  0.9979394
##    10         1          1.248361  0.5897690  0.9448069
##    10         3          1.263809  0.5777453  0.9637459
##    10         5          1.273117  0.5719587  0.9741553
##    10         7          1.279562  0.5681643  0.9799410
##    20         0          1.247859  0.5784621  0.9687138
##    20         1          1.187194  0.6178969  0.9038855
##    20         3          1.204245  0.6063998  0.9282984
##    20         5          1.218000  0.5979068  0.9424389
##    20         7          1.224935  0.5936971  0.9479289
##    50         0          1.188141  0.6117330  0.9181411
##    50         1          1.128215  0.6481615  0.8572959
##    50         3          1.144807  0.6390132  0.8771819
##    50         5          1.158805  0.6302804  0.8914944
##    50         7          1.165503  0.6262431  0.8962858
##   100         0          1.181725  0.6160486  0.9109076
##   100         1          1.123671  0.6505370  0.8493752
##   100         3          1.139919  0.6421493  0.8696818
##   100         5          1.153927  0.6335253  0.8835815
##   100         7          1.160370  0.6297168  0.8904697
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 100 and neighbors
##  = 1.
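
As a sketch of the test-set comparison referenced above (assuming the four fitted train objects and the held-out test_x/test_y from the split), the models can be scored with postResample():

tree_models <- list(rpart = rp_model, rf = rf_model, gbm = gb_model, cubist = cubist_model)
test_perf <- sapply(tree_models,
                    function(m) postResample(predict(m, newdata = test_x), obs = test_y))
round(test_perf, 3)

The Cubist model would be expected to stay on top here as well; the test-set numbers are the ones that answer the second half of the question.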
  1. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The top 10 predictors from the Cubist model are ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial06, BiologicalMaterial03, BiologicalMaterial11, ManufacturingProcess33, ManufacturingProcess39, and BiologicalMaterial09. Manufacturing process variables dominate the list (six of the top ten); a tally sketch follows the importance plot below.

cubist_imp <- varImp(cubist_model, scale = FALSE)
plot(cubist_imp, top=15, scales = list(y = list(cex = 0.8)))
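
A quick way to tally the split between process and biological variables in the top 10 (a small sketch based on the cubist_imp object above):

imp_df <- cubist_imp$importance
top10 <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl('^Manufacturing', top10), 'Process', 'Biological'))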

  1. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

As expected, the tree splits the data at the very top using the most important variable, ManufacturingProcess32, and the terminal nodes show how the yield distribution shifts as the process and biological predictors change.

# Convert the rpart final model to a party object so the terminal-node yield
# distributions are displayed
plot(as.party(rp_model$finalModel), gp = gpar(fontsize = 10))