Exercise 8.1

Recreate the simulated data from Exercise 7.2:
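The data can be regenerated as in the book's exercise text for 7.2 using mlbench.friedman1 (the seed follows the book; treat this as a sketch of the setup rather than the exact session that produced the results below):

```r
library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)   # 200 samples, 10 predictors, 5 informative
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"   # predictors end up named V1-V10
```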

Part (a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:
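One way to obtain this fit, following the code given in the exercise text (the seed is an assumption, so the exact importance values below will vary slightly from run to run):

```r
library(randomForest)
library(caret)

set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)        # permutation-based importance
rfImp1[order(-rfImp1$Overall), , drop = FALSE] # sort by importance
```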

var Overall
V1 8.8389
V4 7.5882
V2 6.4902
V5 2.2743
V3 0.6758
V6 0.1744
V7 0.1514
V9 -0.0299
V8 -0.0308
V10 -0.0853
Did the random forest model significantly use the uninformative predictors (V6–V10)?

No. The uninformative predictors, V6–V10, were the least important predictors in the random forest model and were far less influential than the five informative predictors, V1–V5: V6, the highest-scoring of the uninformative set at 0.17, has only about a quarter of the influence of V3, the least influential of the top five at 0.68.

Part (b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
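Following the exercise text, the correlated predictor can be added like this (the correlation printed below depends on the random draw, so it will only approximately match):

```r
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
```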

## [1] 0.9396

Fit another random forest model to these data.

Model_1 Model_1_var Model_2 Model_2_var
8.8389 V1 6.9392 V4
7.5882 V4 6.2978 V1
6.4902 V2 6.0804 V2
2.2743 V5 3.5641 duplicate1
0.6758 V3 2.0310 V5
0.1744 V6 0.5841 V3
0.1514 V7 0.0795 V6
-0.0299 V9 -0.0072 V10
-0.0308 V8 -0.0257 V7
-0.0853 V10 -0.0884 V9
NA NA -0.1101 V8
Did the importance score for V1 change?

Yes, the importance score for V1 was reduced by the addition of a highly correlated variable, and V1 dropped to second place.

What happens when you add another predictor that is also highly correlated with V1?
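A second correlated predictor can be added the same way (again, the printed correlation depends on the random draw):

```r
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate2, simulated$V1)
```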
## [1] 0.9313
Model_1 Model_1_var Model_2 Model_2_var Model_3 Model_3_var
8.8389 V1 6.9392 V4 7.2802 V4
7.5882 V4 6.2978 V1 6.9574 V2
6.4902 V2 6.0804 V2 5.6564 V1
2.2743 V5 3.5641 duplicate1 2.6550 duplicate2
0.6758 V3 2.0310 V5 2.5663 duplicate1
0.1744 V6 0.5841 V3 2.0942 V5
0.1514 V7 0.0795 V6 0.5397 V3
-0.0299 V9 -0.0072 V10 0.1412 V6
-0.0308 V8 -0.0257 V7 0.0928 V7
-0.0853 V10 -0.0884 V9 0.0168 V10
NA NA -0.1101 V8 -0.0075 V9
NA NA NA NA -0.0963 V8

Once again, the importance score for V1 was reduced by the addition of a second highly correlated variable, and V1 dropped to third place. Most of the other variables' importance values stayed roughly the same, with the exception of V3, which was also reduced by each additional correlated variable; their ranks, however, shifted as the duplicates were inserted into the ordering.

Part (c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007)1.
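A sketch of the conditional inference forest fit and the two importance calculations (the seed and the ntree setting are assumptions rather than the exact values used for the table below):

```r
library(party)

set.seed(200)
cf_model <- cforest(y ~ ., data = simulated,
                    controls = cforest_unbiased(ntree = 1000))

varimp(cf_model, conditional = FALSE)  # traditional permutation importance
varimp(cf_model, conditional = TRUE)   # conditional importance (Strobl et al. 2007)
```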

Do these importances show the same pattern as the traditional random forest model?
Model_1 Model_1_var CI cond=T CI_T_vars CI cond=F CI_F_vars
8.8389 V1 6.8480 V4 8.7353 V1
7.5882 V4 5.6150 V1 8.2706 V4
6.4902 V2 5.2925 V2 6.5767 V2
2.2743 V5 1.2380 V5 1.9824 V5
0.6758 V3 0.0209 V7 0.0525 V7
0.1744 V6 0.0188 V3 0.0370 V3
0.1514 V7 0.0000 V8 0.0102 V6
-0.0299 V9 -0.0019 V9 -0.0030 V8
-0.0308 V8 -0.0062 V10 -0.0174 V9
-0.0853 V10 -0.0067 V6 -0.0436 V10

The conditional computation of importance flips the order of the first two variables compared with the random forest model and the non-conditional computation. There are also other differences in the pattern of importance visible in the table above; for instance, the importance measure of V3 drops sharply in both conditional inference models.

Part (d)

Repeat this process with different tree models, such as boosted trees and Cubist.

Does the same pattern occur?
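A sketch of how the boosted tree and Cubist fits can be obtained through caret (default tuning grids are assumed here; the importance values in the tables below are scaled to 100 by varImp):

```r
library(caret)

set.seed(200)
gbm_fit <- train(y ~ ., data = simulated, method = "gbm", verbose = FALSE)

set.seed(200)
cubist_fit <- train(y ~ ., data = simulated, method = "cubist")

varImp(gbm_fit)     # relative influence from the boosted model
varImp(cubist_fit)  # predictor usage from the Cubist rules/models
```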

Boosted Trees Model

boost1 var1 boost2 var2 boost3 var3
100.000 V4 100.0000 V4 100.000 V4
94.533 V1 75.5275 V2 83.154 V2
78.198 V2 68.7917 V1 82.446 V1
35.322 V5 40.3182 V5 42.792 V5
31.846 V3 23.7523 V3 32.617 V3
2.700 V7 20.5599 duplicate1 12.054 duplicate2
2.220 V8 3.0663 V7 2.993 V7
2.198 V6 2.2864 V6 2.940 duplicate1
1.366 V10 0.6198 V9 2.870 V6
0.000 V9 0.4249 V8 1.007 V8
NA NA 0.0000 V10 0.334 V10
NA NA NA NA 0.000 V9

You can see in the table above that the order of importance changes in the boosted tree models when correlated predictors are added, although, unlike in the random forest model, the top predictor (V4) is never bumped out of the top slot. In addition, the duplicate predictors never move above the 6th slot, while in the random forest model they took the 4th and 5th slots.

Cubist Model

cubist1 var1 cubist2 var2 cubist3 var3
100.00 V1 100.000 V1 100.000 V1
75.69 V2 75.694 V2 88.321 V2
68.06 V4 68.750 V4 71.533 V4
58.33 V3 58.333 V3 62.044 V5
55.56 V5 56.944 V5 57.664 V3
15.28 V6 13.889 V6 11.679 V6
0.00 V7 2.083 duplicate1 5.109 duplicate1
0.00 V8 0.000 V7 1.460 V10
0.00 V9 0.000 V8 1.460 duplicate2
0.00 V10 0.000 V9 0.000 V7
NA NA 0.000 V10 0.000 V8
NA NA NA NA 0.000 V9

Of all the models tried, the Cubist model changes the least when correlated predictors are added. The top 3 slots stay the same in all three scenarios, and the top 6 stay the same when only one correlated predictor is added. The duplicate predictors also remain lower in the list, in the 7th and 9th slots, compared with both the random forest model, where they were in the 4th and 5th slots, and the boosted model, where they were in the 6th and 8th slots.

Exercise 8.2

Use a simulation to show tree bias with different granularities.
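The exact data-generating code isn't shown, but a hypothetical sketch of this kind of simulation (all coefficients, seeds, and names here are assumptions) could look like the following; the data actually used is summarized below:

```r
library(rpart)
library(caret)

set.seed(100)
n <- 250
x1_low  <- sample(c(-1, 0, 1), n, replace = TRUE)  # coarse: 3 distinct values
x2_mid  <- round(rnorm(n, mean = 1), 1)            # moderate granularity
x3_high <- round(rnorm(n), 2)                      # nearly continuous
y <- 0.5 * x1_low + 0.2 * x2_mid + 0.2 * x3_high + rnorm(n, sd = 0.5)
sim <- data.frame(x1_low, x2_mid, x3_high, y)

cor(sim[, 1:3], sim$y)            # linear relationship of each predictor with y
varImp(rpart(y ~ ., data = sim))  # split-selection bias favors the finer-grained predictors
```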

## 'data.frame':    250 obs. of  4 variables:
##  $ x1_low : num  1 -1 0 -1 1 1 0 0 0 -1 ...
##  $ x2_mid : num  1.3 -0.8 0.4 1.5 1.7 1.4 0.8 2 1.6 1.1 ...
##  $ x3_high: num  1.25 -1.14 -0.35 -0.32 0.33 1.74 0.29 1.26 -0.68 -0.6 ...
##  $ y      : num  0.651 -0.755 0.602 -1.377 0.63 ...
cor_to_y no_unique_vals importance
x1_low 0.5293 3 0.3214
x2_mid 0.2283 70 1.0573
x3_high 0.2775 209 1.1188

In the simulation above you can see that the predictor with the most distinct values, x3_high, was given the highest importance score even though it has a much weaker correlation with the outcome than the coarse-grained x1_low.

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
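A sketch of the two extreme fits on the solubility data (using the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package; n.trees, the interaction depth, and the seed are assumptions):

```r
library(gbm)
library(AppliedPredictiveModeling)

data(solubility)
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low,  plotit = FALSE), 25)  # importance spread across many predictors
head(summary(gbm_high, plotit = FALSE), 25)  # importance concentrated in a few predictors
```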

Part (a)

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right uses both a high bagging fraction (0.9) and a high learning rate (0.9). The high learning rate means each tree contributes a large fraction of its predicted value to the ensemble, so the fit commits quickly to the structure found by the first few trees, which are dominated by the strongest predictors; subsequent trees mostly refine those same relationships. The high bagging fraction means each tree is built on nearly the entire training set, so the trees are very similar to one another and keep selecting the same few predictors. The model on the left, with a small learning rate and a small bagging fraction, builds the fit up slowly from many trees grown on different random subsets of the data, giving a wider set of predictors the chance to be selected and to contribute. Since variable importance is calculated by averaging each predictor's improvement across all trees in the ensemble, the concentrated model ends up with a few very large importance values and many near zero, while the slower, more varied model spreads importance across more predictors.

Part (b)

Which model do you think would be more predictive of other samples?

I would expect the model with the lower bagging fraction and learning rate to be more predictive of other samples. It spreads its reliance across a larger set of predictors rather than leaning heavily on just a few, and smaller learning rates are generally recommended because they build up the predictions more slowly over more iterations and are less likely to overfit the training data.

Part (c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth spreads the importance out over more predictors, because deeper trees contain more splits and therefore more opportunities for additional predictors to be used. Increasing the depth should therefore reduce the importance values of the top predictors and increase those of the lesser ones, flattening the slope of the predictor importance plot for either model.

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
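A sketch of how the three fits can be set up with caret (the object names for the imputed and split training data and most tuning choices are assumptions; the Cubist grid matches the resampling output shown further below):

```r
library(caret)

set.seed(300)
rf_fit  <- train(train_x, train_y, method = "rf", importance = TRUE)

set.seed(300)
gbm_fit <- train(train_x, train_y, method = "gbm", verbose = FALSE)

set.seed(300)
cub_fit <- train(train_x, train_y, method = "cubist",
                 tuneGrid = expand.grid(committees = c(1, 5, 10, 20, 40, 80),
                                        neighbors  = c(0, 1, 3, 5, 7, 9)))

varImp(rf_fit); varImp(gbm_fit); varImp(cub_fit)
```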

Random Forest

var Overall
ManufacturingProcess32 30.578
ManufacturingProcess13 10.805
BiologicalMaterial12 6.665
ManufacturingProcess31 6.435
BiologicalMaterial06 5.722
ManufacturingProcess17 5.276
ManufacturingProcess36 4.966
ManufacturingProcess06 4.593
BiologicalMaterial03 4.478
ManufacturingProcess28 4.194
BiologicalMaterial11 3.133
ManufacturingProcess09 3.091

Gradient Boosting Machine

var Overall
ManufacturingProcess32 100.000
ManufacturingProcess13 28.978
BiologicalMaterial12 21.216
ManufacturingProcess17 19.744
ManufacturingProcess06 19.224
ManufacturingProcess31 14.309
ManufacturingProcess09 11.045
BiologicalMaterial03 8.492
BiologicalMaterial11 7.675
BiologicalMaterial09 7.099
ManufacturingProcess21 6.478
BiologicalMaterial05 6.227

Cubist Model

var Overall
ManufacturingProcess32 100.00
ManufacturingProcess13 62.50
ManufacturingProcess29 32.95
ManufacturingProcess31 31.82
BiologicalMaterial06 29.55
ManufacturingProcess09 29.55
BiologicalMaterial10 27.27
BiologicalMaterial03 27.27
BiologicalMaterial02 26.14
ManufacturingProcess17 23.86
ManufacturingProcess33 23.86
ManufacturingProcess15 19.32

Part (a)

Model Comparison

Which tree-based regression model gives the optimal resampling and test set performance?

Test Set Performance
RMSE Rsquared MAE
Cubist 0.3175 0.9097 0.2494
Gradient Boost 0.5171 0.7808 0.3878
Random Forest 0.6248 0.7046 0.4496
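The test-set numbers above can be computed along these lines (a sketch; test_x, test_y, and the fitted-model object names are assumptions):

```r
library(caret)

t(data.frame(
  Cubist        = postResample(predict(cub_fit, test_x), test_y),
  GradientBoost = postResample(predict(gbm_fit, test_x), test_y),
  RandomForest  = postResample(predict(rf_fit,  test_x), test_y)
))
```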
## Cubist 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE    Rsquared  MAE   
##    1          0          0.9320  0.3706    0.6893
##    1          1          0.9487  0.3780    0.6940
##    1          3          0.9322  0.3787    0.6869
##    1          5          0.9303  0.3785    0.6853
##    1          7          0.9298  0.3762    0.6865
##    1          9          0.9306  0.3738    0.6874
##    5          0          0.7424  0.5058    0.5645
##    5          1          0.7299  0.5256    0.5501
##    5          3          0.7314  0.5210    0.5526
##    5          5          0.7343  0.5165    0.5544
##    5          7          0.7376  0.5121    0.5567
##    5          9          0.7407  0.5083    0.5600
##   10          0          0.7043  0.5444    0.5389
##   10          1          0.6864  0.5683    0.5201
##   10          3          0.6908  0.5619    0.5242
##   10          5          0.6946  0.5568    0.5280
##   10          7          0.6978  0.5521    0.5304
##   10          9          0.7007  0.5489    0.5330
##   20          0          0.6736  0.5801    0.5201
##   20          1          0.6541  0.6028    0.4962
##   20          3          0.6603  0.5961    0.5036
##   20          5          0.6645  0.5909    0.5085
##   20          7          0.6686  0.5857    0.5122
##   20          9          0.6714  0.5828    0.5152
##   40          0          0.6614  0.5963    0.5095
##   40          1          0.6407  0.6195    0.4852
##   40          3          0.6472  0.6132    0.4936
##   40          5          0.6515  0.6082    0.4977
##   40          7          0.6558  0.6029    0.5016
##   40          9          0.6587  0.5998    0.5045
##   80          0          0.6485  0.6161    0.4980
##   80          1          0.6280  0.6379    0.4732
##   80          3          0.6343  0.6319    0.4823
##   80          5          0.6384  0.6275    0.4871
##   80          7          0.6427  0.6222    0.4904
##   80          9          0.6455  0.6195    0.4932
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 80 and neighbors = 1.
committees neighbors
32 80 1

The Cubist model gives the best (lowest) RMSE and MAE values for both the resampling and the test set predictions, as well as the highest \(R^2\) (although the \(R^2\) statistic really shouldn't be used to compare across different types of models, only to compare different tuning parameters for the same model).

Although the model chosen by the algorithm was the 80-committee, 1-neighbor model, the resampling results above show that little is gained after 40 committees, so the 40-committee, 1-neighbor model could be selected instead to limit the complexity.

Part (b)

Which predictors are most important in the optimal tree-based regression model?
Do either the biological or process variables dominate the list?
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
var Overall
ManufacturingProcess32 100.00
ManufacturingProcess13 62.50
ManufacturingProcess29 32.95
ManufacturingProcess31 31.82
BiologicalMaterial06 29.55
ManufacturingProcess09 29.55
BiologicalMaterial10 27.27
BiologicalMaterial03 27.27
BiologicalMaterial02 26.14
ManufacturingProcess17 23.86

The manufacturing processes still dominate the list, taking the top 4 slots; however, as in the optimal nonlinear model, 4 of the top 10 predictors (though only 1 of the top 6) are biological materials, so they are not completely absent as they were in the optimal linear model. The same two predictors came out on top in the optimal tree model as in the nonlinear model, ManufacturingProcess32 and ManufacturingProcess13, although in reverse order. The next two most influential variables, ManufacturingProcess29 and ManufacturingProcess31, came out much lower in the nonlinear list, at ranks 11 and 17 respectively.

Part (c)

Plot the optimal single tree with the distribution of yield in the terminal nodes.

First we’ll have to train a model since we haven’t trained a single tree model yet!
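A sketch of one way to do this with rpart and partykit (the data object names, the response name Yield, and the lack of explicit cp tuning are assumptions):

```r
library(rpart)
library(partykit)

single_tree <- rpart(Yield ~ ., data = cbind(train_x, Yield = train_y))

# Converting to a party object lets plot() draw the yield distribution
# (boxplots) in each terminal node.
plot(as.party(single_tree))
```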

Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The most notable insight from the tree plot is that, with one exception, the yield distributions in the terminal nodes on the left branch of the top split on ManufacturingProcess32 are lower than all of the distributions on the right branch. This makes it clear why that split was made and why that variable consistently comes out at or near the top of the importance rankings.

Footnotes


  1. Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007). “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics, 8(1), 25.