8.1, 8.2, 8.3, and 8.7
8.1. Recreate the simulated data from Exercise 7.2:
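For reference, the simulation code from the exercise text (mlbench package, same seed as in Exercise 7.2):

```r
library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)   # Friedman (1991) benchmark data
simulated <- cbind(simulated$x, simulated$y)  # combine predictors and response
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
```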
The histograms of the simulated data show the distributions of the generated predictors and of the response. The response ‘y’ is approximately normally distributed, while the predictors are roughly uniformly distributed over the range 0 to 1.
- Fit a random forest model to all of the predictors, then estimate the variable importance scores:
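A sketch of the fit, following the code given in the exercise text (the exact layout of the importance table below may differ):

```r
library(randomForest)
library(caret)

# Random forest with permutation-based importance, as in the exercise text
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

# Unscaled importance scores, sorted from most to least important
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1[order(-rfImp1$Overall), , drop = FALSE]
```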
## overall names
## 1 57.4506930 V1
## 4 52.7991593 V4
## 2 46.0366873 V2
## 5 22.2954807 V5
## 3 9.8217121 V3
## 6 3.2482485 V6
## 7 2.7239894 V7
## 9 -0.6204323 V9
## 8 -0.6437884 V8
## 10 -1.5041925 V10
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The uninformative predictors (V6 – V10) all receive importance scores near or below zero, while the top five predictors by importance are the informative ones, V1 – V5.
- Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
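Following the exercise text, a noisy copy of V1 is added and its correlation with V1 is checked:

```r
# Add a predictor that is V1 plus a small amount of noise
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
```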
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
## overall names
## 4 51.7654739 V4
## 2 45.4568836 V2
## 1 33.7040129 V1
## 11 24.7043635 duplicate1
## 5 23.2785698 V5
## 3 9.5329892 V3
## 6 1.6701722 V6
## 10 -0.1393049 V10
## 7 -0.4936772 V7
## 9 -1.9061987 V9
## 8 -2.2341346 V8
The importance of V1 dropped from first to third once we added duplicate1, the predictor highly correlated with V1, with part of its importance shifting to duplicate1. Adding another predictor that is also highly correlated with V1 would be expected to dilute V1’s importance further, since the importance is shared across the correlated set.
- Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
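A sketch using the party package (the control settings are assumptions; the conditional importance calculation can be slow for a large number of trees):

```r
library(party)

set.seed(200)
cforestModel <- cforest(y ~ ., data = simulated,
                        controls = cforest_unbiased(ntree = 1000))

# Traditional (unconditional) permutation importance
varimp(cforestModel)

# Conditional importance as described in Strobl et al. (2007)
varimp(cforestModel, conditional = TRUE)
```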
Conditional inference trees appear to be resilient to the inclusion of highly correlated predictors such as duplicate1. The top three predictors are again V4, V2 and V1 (the same set as in the traditional random forest model before the addition of duplicate1), and duplicate1 ranks well below them.
## overall names
## 4 10.14481309 V4
## 1 8.41556912 V1
## 2 7.76850810 V2
## 5 2.30344986 V5
## 11 1.32708646 duplicate1
## 7 0.07769047 V7
## 3 0.01775823 V3
## 10 0.01218696 V10
## 8 -0.01937243 V8
## 6 -0.02006879 V6
## 9 -0.02654575 V9
- Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Using the Cubist algorithm:
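A sketch of the Cubist fit via caret (default tuning grid assumed):

```r
library(caret)

set.seed(200)
cubistModel <- train(y ~ ., data = simulated,
                     method = "cubist")

# Importance as reported by caret for Cubist models
varImp(cubistModel)
```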
## cubist variable importance
##
## Overall
## V1 100.000
## V2 75.694
## V4 68.750
## V3 58.333
## V5 56.944
## V6 13.889
## duplicate1 2.083
## V8 0.000
## V7 0.000
## V9 0.000
## V10 0.000
For both Cubist and the GBM model shown next, the top three predictors are again V1, V2 and V4 (in different orders). In both algorithms, the highly correlated predictor duplicate1 ranks well below V1 in importance.
Using Boosted Trees (GBM) algorithm:
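A sketch of the boosted tree fit via caret (default tuning grid assumed):

```r
library(caret)
library(gbm)

set.seed(200)
gbmModel <- train(y ~ ., data = simulated,
                  method = "gbm",
                  verbose = FALSE)

varImp(gbmModel)
```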
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 80.0621
## V1 66.7313
## V5 38.5909
## V3 31.9838
## duplicate1 22.2548
## V6 3.0471
## V7 0.6677
## V8 0.5199
## V9 0.1241
## V10 0.0000
8.2. Use a simulation to show tree bias with different granularities.
Tree-based algorithms are biased towards selecting predictors with a greater number of distinct values. In this simulation we compare how reducing the number of possible values of a predictor (its granularity) affects that predictor’s importance. We reduce the granularity by rounding the values, which lie in the range 0 to 1, to the nearest 0.05, 0.10, 0.20 and 0.50.
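A minimal sketch of the procedure, assuming the original ten-predictor simulated data from Exercise 8.1 (before duplicate1 was added); the helper function name is ours:

```r
library(randomForest)
library(caret)

# Round each predictor to the nearest multiple of `unit`, refit the random
# forest, and return the unscaled permutation importance scores
importance_at_granularity <- function(data, unit) {
  predictors <- setdiff(names(data), "y")
  data[predictors] <- lapply(data[predictors],
                             function(x) round(x / unit) * unit)
  set.seed(200)
  fit <- randomForest(y ~ ., data = data,
                      importance = TRUE, ntree = 1000)
  varImp(fit, scale = FALSE)
}

# One importance table per level of granularity
granularities <- c(0.05, 0.10, 0.20, 0.50)
imp_by_unit <- lapply(granularities, importance_at_granularity, data = simulated)
names(imp_by_unit) <- paste0("rounded_to_", granularities)
```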
## Init_Var_Imp Init_Var_Score Variable_05 Score_05 Variable_10 Score_10
## 1 V1 57.4506930056448 V4 65.2246939 V4 63.642381
## 2 V2 46.0366872928665 V2 59.0933631 V2 58.026260
## 3 V3 9.82171210587341 V1 39.4527598 V1 39.767729
## 4 V4 52.799159269236 V5 21.5056721 V5 22.094157
## 5 V5 22.2954807159216 V3 7.2946338 V3 9.622636
## 6 V6 3.24824849617659 V7 7.1880668 V6 4.685158
## 7 V7 2.72398939458952 V6 1.7180686 V7 4.536553
## 8 V8 -0.64378842692609 V8 -0.5685687 V8 1.172350
## 9 V9 -0.620432345998556 V10 -1.2439725 V9 -1.654690
## 10 V10 -1.50419249428298 V9 -2.4985204 V10 -2.999434
## Variable_20 Score_20 Variable_50 Score_50
## 1 V4 62.7502318 V4 60.4837864
## 2 V2 58.0672314 V2 59.8735391
## 3 V1 39.2598371 V1 35.8362918
## 4 V5 20.8026223 V5 22.4357103
## 5 V3 8.3063157 V3 9.0340547
## 6 V7 4.6015906 V7 3.7254815
## 7 V6 3.1768937 V6 2.6318389
## 8 V8 -0.3335553 V10 0.3204597
## 9 V9 -0.8916528 V8 -0.2665205
## 10 V10 -1.2239424 V9 -3.2822866
We see that V1 is the most important predictor only when its values are left continuous. As soon as the values are rounded, even at the finest granularity of 0.05, its importance drops to third place behind V4 and V2, which have fewer competing distinct values to dilute their splits.
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
bag.fraction (Subsampling fraction) - the fraction of the training set observations randomly selected to propose the next tree in the expansion.
Shrinkage (Learning Rate) - In the context of GBMs, shrinkage is used for reducing, or shrinking, the impact of each additional fitted base-learner (tree). It reduces the size of incremental steps and thus penalizes the importance of each consecutive iteration.
Effect of learning rate and bag fraction on variable importance: left (0.1, 0.1) vs. right (0.9, 0.9).
The right-hand model has both its learning rate and its bagging fraction set to a higher value than the left-hand model. A fast learning rate (here 0.9) carries a high risk that one non-optimal tree will strongly influence the rest of the ensemble, so the model commits early to the strongest predictors; a slow learning rate reduces this risk and lets later trees capture more of the nuances in the data through less important (overall) predictors.
The high bag fraction tells the right-hand model to use a larger subset of the training observations to build each tree (a bag fraction of 1 uses all of the observations every time), so the trees see very similar data and keep selecting the same dominant predictors. A lower bag fraction produces trees built on more varied subsamples, with greater differences in the predictor mix, and different trees then assign importance to different predictors.
GBM (BOOSTED MODELS) TUNING PARAMETERS
https://www.listendata.com/2015/07/gbm-boosted-models-tuning-parameters.html
Appendix S1: Explanation of the effect of learning rate on predictive stability in boosted regression trees
https://web.stanford.edu/~hastie/Papers/Ecology/ELH_appendixs1-s2.pdf
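For reference, a sketch of how the two extreme models of Fig. 8.24 could be reproduced with caret on the solubility data. The remaining gbm tuning parameters (tree count, interaction depth, minimum node size) are arbitrary assumptions, not the values used in the book:

```r
library(caret)
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

# Fit a boosted tree with a fixed shrinkage (learning rate) and bag fraction
fit_gbm <- function(shrink, bag) {
  grid <- expand.grid(n.trees = 100,            # assumption
                      interaction.depth = 7,    # assumption
                      shrinkage = shrink,
                      n.minobsinnode = 10)      # assumption
  set.seed(100)
  train(solTrainXtrans, solTrainY,
        method = "gbm",
        tuneGrid = grid,
        bag.fraction = bag,   # passed through to gbm
        verbose = FALSE)
}

gbm_low  <- fit_gbm(shrink = 0.1, bag = 0.1)   # left-hand plot of Fig. 8.24
gbm_high <- fit_gbm(shrink = 0.9, bag = 0.9)   # right-hand plot of Fig. 8.24

plot(varImp(gbm_low),  top = 25)
plot(varImp(gbm_high), top = 25)
```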
(b) Which model do you think would be more predictive of other samples?
The slow-learning, low-bag-fraction model should be less prone to over-fitting and therefore better able to predict new samples. The fast learner with a high bag fraction could be too specific to the training data set.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth controls the number of splits (and hence nodes) in each tree; each additional split acts as a “higher-level interaction term with all of the other previous split predictors”. A greater interaction depth therefore brings more of the lesser predictors into the trees, spreading importance across a larger number of predictors and flattening the slope of the predictor-importance plot for either model.
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
This data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) that are used to model the production yield of the 176 recorded manufacturing runs.
Before modelling, the data will be pre-processed to:
- Remove predictors with near-zero variance.
- Impute missing values using the predictive mean matching (pmm) algorithm.
- Transform the predictor values with a Box-Cox transformation, centre them to a mean of zero, and scale them so that all predictors have a similar range of values.
The data set will then be split into a training and a test set with an 80/20 split ratio; a sketch of these pre-processing and splitting steps is shown below.
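A sketch of these steps, assuming the mice package for pmm imputation (object names are ours):

```r
library(caret)
library(mice)                        # predictive mean matching imputation
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess

# Remove near-zero-variance predictors (if any are flagged)
nzv <- nearZeroVar(chem)
if (length(nzv) > 0) chem <- chem[, -nzv]

# Impute missing values with predictive mean matching (pmm)
set.seed(100)
chem <- complete(mice(chem, method = "pmm", printFlag = FALSE))

# Box-Cox transform, centre and scale (applied here to the full data set,
# which is why the yield values in part (c) appear on a standardized scale)
pp   <- preProcess(chem, method = c("BoxCox", "center", "scale"))
chem <- predict(pp, chem)

# 80/20 training/test split (Yield is the first column)
set.seed(100)
in_train <- createDataPartition(chem$Yield, p = 0.8, list = FALSE)
train_x <- chem[in_train, -1];  train_y <- chem$Yield[in_train]
test_x  <- chem[-in_train, -1]; test_y  <- chem$Yield[-in_train]
```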
(a) Which tree-based regression model gives the optimal resampling and test set performance?
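A sketch of the tree-based fits with caret, reusing the train_x / train_y objects from the pre-processing sketch above; the tuneLength and resampling settings are assumptions. The linear and nonlinear models from Exercises 6.3 and 7.5 are shown alongside in the tables that follow.

```r
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)

rpartModel  <- train(train_x, train_y, method = "rpart",
                     tuneLength = 10, trControl = ctrl)
bagModel    <- train(train_x, train_y, method = "treebag",
                     trControl = ctrl)
rfModel     <- train(train_x, train_y, method = "rf",
                     tuneLength = 10, trControl = ctrl, importance = TRUE)
gbmModel    <- train(train_x, train_y, method = "gbm",
                     tuneLength = 10, trControl = ctrl, verbose = FALSE)
cubistModel <- train(train_x, train_y, method = "cubist",
                     tuneLength = 10, trControl = ctrl)

# Test-set performance for, e.g., the random forest
postResample(predict(rfModel, test_x), test_y)
```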
## Linear_Model RMSE Rsquare
## 1 Linear Regression 0.9884417 0.4922378
## 3 Partial Least Squares 0.6834779 0.6372129
## 2 Robust Linear Model 0.6061122 0.6499379
## 4 Ridge-regression 0.6466016 0.6573752
## Non_Linear_Model RMSE Rsquare
## 1 Support Vector Machine 0.5913078 0.6635761
## Non_Linear_Model RMSE Rsquare
## 1 k-Nearest Neighbors 0.7120312 0.5058445
## 3 Support Vector Machine 0.5913078 0.6635761
## 2 Neural Network avNNet 0.1098640 0.9884795
## Tree_Model RMSE Rsquare
## 1 Single Tree 0.6150260 0.6346226
## 3 Bagged Tree 0.6020142 0.6590350
## 5 Cubist 0.6020142 0.6590350
## 4 Gradient Boosted Machine 0.5474177 0.7070415
## 2 Random Forest 0.5313520 0.7458404
As the tables above show, the best-performing tree-based model (the random forest, with an Rsquare of 0.75) outperforms the best linear regression model (ridge regression, Rsquare of 0.66). Both, however, are outperformed by the top non-linear model (the averaged neural network avNNet, Rsquare of 0.99).
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
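The importance rankings can be extracted from the tuned caret objects; a sketch for the random forest (the same call applies to the earlier linear and nonlinear train objects):

```r
# Top 10 predictors for the best tree-based model (random forest)
rfImp <- varImp(rfModel)
plot(rfImp, top = 10)
```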
As in the linear and non-linear models, ManufacturingProcess32 is the most important predictor for the tree-based model. In the tree-based model, however, ManufacturingProcess32 is far more important than any of the remaining predictors.
In the linear and non-linear models, by contrast, the importance is spread more evenly across a larger number of predictors rather than being concentrated in a single dominant one, as it is in the tree-based model.
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Using the rpart.plot package, the structure of the single tree model trained with rpart is plotted below.
Each node shows the predicted yield value and the percentage of observations that reach that node.
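A sketch of the plotting and rule-extraction calls, where rpartModel is the tuned single tree from part (a); the abbreviated predictor names in the table below (MP = ManufacturingProcess, BM = BiologicalMaterial) were shortened separately:

```r
library(rpart.plot)

finalTree <- rpartModel$finalModel   # the underlying rpart object

# Tree diagram: each node is labelled with the predicted yield and the
# percentage of training observations that reach it
rpart.plot(finalTree)

# The same splits expressed as rules, with the coverage of each terminal node
rpart.rules(finalTree, cover = TRUE)
```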
## Yield cover
## -1.85 when MP32 < 0.22 & BM12 < -0.57 & BM05 >= 0.58 5%
## -1.33 when MP32 < 0.22 & BM12 < -0.57 & BM09 >= 0.31 & BM05 < 0.58 5%
## -0.74 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 < -0.21 10%
## -0.65 when MP32 < 0.22 & BM12 < -0.57 & BM09 < 0.31 & BM05 < 0.58 10%
## -0.53 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 >= -0.21 & MP04 < 0.67 11%
## -0.11 when MP32 >= 0.22 & MP17 >= -1.4 & MP09 < -0.49 8%
## 0.20 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 >= -0.21 & MP04 >= 0.67 6%
## 0.32 when MP32 >= 0.22 & BM09 >= -0.46 & MP17 >= -1.4 & MP09 >= -0.49 & MP31 < 0.2 12%
## 0.41 when MP32 < 0.22 & BM12 >= -0.57 & MP25 < 0.019 10%
## 0.88 when MP32 >= 0.22 & BM09 >= -0.46 & MP17 >= -1.4 & MP09 >= -0.49 & MP31 >= 0.2 5%
## 1.08 when MP32 >= 0.22 & BM09 < -0.46 & MP17 >= -1.4 & MP09 >= -0.49 9%
## 1.48 when MP32 >= 0.22 & MP17 < -1.4 8%
The Biological Material (BM) predictors appear mainly in the splits that lead to the lower-yield terminal nodes, while the Manufacturing Process (MP) predictors drive the splits on the higher-yield side of the distribution. The table above also shows, in rule form, the splits the single tree model uses on the most important predictors to partition the response “Yield”.
Plotting rpart trees with the rpart.plot package