8.1, 8.2, 8.3, and 8.7
8.1. Recreate the simulated data from Exercise 7.2:
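For reference, the simulation code from the exercise text (mlbench package, same seed as in Exercise 7.2):

```r
library(mlbench)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)   # Friedman (1991) benchmark data
simulated <- cbind(simulated$x, simulated$y)  # combine predictors and response
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
```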
The histograms of the simulated data show the distributions of the generated predictors and of the response. The response ‘y’ is approximately normally distributed, while the predictors are roughly uniformly distributed over the range 0 to 1.
- Fit a random forest model to all of the predictors, then estimate the variable importance scores:
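A sketch of the fit, following the code given in the exercise text (the exact layout of the importance table below may differ):

```r
library(randomForest)
library(caret)

# Random forest with permutation-based importance, as in the exercise text
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

# Unscaled importance scores, sorted from most to least important
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1[order(-rfImp1$Overall), , drop = FALSE]
```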
## overall names
## 1 57.4506930 V1
## 4 52.7991593 V4
## 2 46.0366873 V2
## 5 22.2954807 V5
## 3 9.8217121 V3
## 6 3.2482485 V6
## 7 2.7239894 V7
## 9 -0.6204323 V9
## 8 -0.6437884 V8
## 10 -1.5041925 V10
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The uninformative predictors (V6 – V10) all receive importance scores near or below zero, while the top five predictors by importance are the informative ones, V1 – V5.
- Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
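Following the exercise text, a noisy copy of V1 is added and its correlation with V1 is checked:

```r
# Add a predictor that is V1 plus a small amount of noise
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
```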
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
## overall names
## 4 51.7654739 V4
## 2 45.4568836 V2
## 1 33.7040129 V1
## 11 24.7043635 duplicate1
## 5 23.2785698 V5
## 3 9.5329892 V3
## 6 1.6701722 V6
## 10 -0.1393049 V10
## 7 -0.4936772 V7
## 9 -1.9061987 V9
## 8 -2.2341346 V8
The importance of V1 dropped from first to third once we added duplicate1, the predictor highly correlated with V1, with part of its importance shifting to duplicate1. Adding another predictor that is also highly correlated with V1 would be expected to dilute V1’s importance further, since the importance is shared across the correlated set.
- Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
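A sketch using the party package (the control settings are assumptions; the conditional importance calculation can be slow for a large number of trees):

```r
library(party)

set.seed(200)
cforestModel <- cforest(y ~ ., data = simulated,
                        controls = cforest_unbiased(ntree = 1000))

# Traditional (unconditional) permutation importance
varimp(cforestModel)

# Conditional importance as described in Strobl et al. (2007)
varimp(cforestModel, conditional = TRUE)
```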
Conditional inference trees appear to be resilient to the inclusion of highly correlated predictors such as duplicate1. The top three predictors are again V4, V2 and V1 (the same set as in the traditional random forest model before the addition of duplicate1), and duplicate1 ranks well below them.
## overall names
## 4 10.14481309 V4
## 1 8.41556912 V1
## 2 7.76850810 V2
## 5 2.30344986 V5
## 11 1.32708646 duplicate1
## 7 0.07769047 V7
## 3 0.01775823 V3
## 10 0.01218696 V10
## 8 -0.01937243 V8
## 6 -0.02006879 V6
## 9 -0.02654575 V9
- Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Using the Cubist algorithm:
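A sketch of the Cubist fit via caret (default tuning grid assumed):

```r
library(caret)

set.seed(200)
cubistModel <- train(y ~ ., data = simulated,
                     method = "cubist")

# Importance as reported by caret for Cubist models
varImp(cubistModel)
```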
## cubist variable importance
##
## Overall
## V1 100.000
## V2 75.694
## V4 68.750
## V3 58.333
## V5 56.944
## V6 13.889
## duplicate1 2.083
## V8 0.000
## V7 0.000
## V9 0.000
## V10 0.000
For both Cubist and the GBM model shown next, the top three predictors are again V1, V2 and V4 (in different orders). In both algorithms, the highly correlated predictor duplicate1 ranks well below V1 in importance.
Using Boosted Trees (GBM) algorithm:
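A sketch of the boosted tree fit via caret (default tuning grid assumed):

```r
library(caret)
library(gbm)

set.seed(200)
gbmModel <- train(y ~ ., data = simulated,
                  method = "gbm",
                  verbose = FALSE)

varImp(gbmModel)
```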
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 80.0621
## V1 66.7313
## V5 38.5909
## V3 31.9838
## duplicate1 22.2548
## V6 3.0471
## V7 0.6677
## V8 0.5199
## V9 0.1241
## V10 0.0000
8.2. Use a simulation to show tree bias with different granularities.
Tree-based algorithms are biased towards selecting predictors with a greater number of distinct values. In this simulation we compare how reducing the number of possible values of a predictor (its granularity) affects that predictor’s importance. We reduce the granularity by rounding the values, which lie in the range 0 to 1, to the nearest 0.05, 0.10, 0.20 and 0.50.
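A minimal sketch of the procedure, assuming the original ten-predictor simulated data from Exercise 8.1 (before duplicate1 was added); the helper function name is ours:

```r
library(randomForest)
library(caret)

# Round each predictor to the nearest multiple of `unit`, refit the random
# forest, and return the unscaled permutation importance scores
importance_at_granularity <- function(data, unit) {
  predictors <- setdiff(names(data), "y")
  data[predictors] <- lapply(data[predictors],
                             function(x) round(x / unit) * unit)
  set.seed(200)
  fit <- randomForest(y ~ ., data = data,
                      importance = TRUE, ntree = 1000)
  varImp(fit, scale = FALSE)
}

# One importance table per level of granularity
granularities <- c(0.05, 0.10, 0.20, 0.50)
imp_by_unit <- lapply(granularities, importance_at_granularity, data = simulated)
names(imp_by_unit) <- paste0("rounded_to_", granularities)
```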
## Init_Var_Imp Init_Var_Score Variable_05 Score_05 Variable_10 Score_10
## 1 V1 57.4506930056448 V4 65.2246939 V4 63.642381
## 2 V2 46.0366872928665 V2 59.0933631 V2 58.026260
## 3 V3 9.82171210587341 V1 39.4527598 V1 39.767729
## 4 V4 52.799159269236 V5 21.5056721 V5 22.094157
## 5 V5 22.2954807159216 V3 7.2946338 V3 9.622636
## 6 V6 3.24824849617659 V7 7.1880668 V6 4.685158
## 7 V7 2.72398939458952 V6 1.7180686 V7 4.536553
## 8 V8 -0.64378842692609 V8 -0.5685687 V8 1.172350
## 9 V9 -0.620432345998556 V10 -1.2439725 V9 -1.654690
## 10 V10 -1.50419249428298 V9 -2.4985204 V10 -2.999434
## Variable_20 Score_20 Variable_50 Score_50
## 1 V4 62.7502318 V4 60.4837864
## 2 V2 58.0672314 V2 59.8735391
## 3 V1 39.2598371 V1 35.8362918
## 4 V5 20.8026223 V5 22.4357103
## 5 V3 8.3063157 V3 9.0340547
## 6 V7 4.6015906 V7 3.7254815
## 7 V6 3.1768937 V6 2.6318389
## 8 V8 -0.3335553 V10 0.3204597
## 9 V9 -0.8916528 V8 -0.2665205
## 10 V10 -1.2239424 V9 -3.2822866
We see that V1 is the most important predictor only when its values are left continuous. As soon as the values are rounded, even at the finest granularity of 0.05, its importance drops to third place behind V4 and V2, which have fewer competing distinct values to dilute their splits.
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
bag.fraction (Subsampling fraction) - the fraction of the training set observations randomly selected to propose the next tree in the expansion.
Shrinkage (Learning Rate) - In the context of GBMs, shrinkage is used for reducing, or shrinking, the impact of each additional fitted base-learner (tree). It reduces the size of incremental steps and thus penalizes the importance of each consecutive iteration.
Effect of learning rate and bag fraction on variable importance: left (0.1, 0.1) vs. right (0.9, 0.9).
The right-hand model has both its learning rate and its bagging fraction set to a higher value than the left-hand model. A fast learning rate (here 0.9) carries a high risk that one non-optimal tree will strongly influence the rest of the ensemble, so the model commits early to the strongest predictors; a slow learning rate reduces this risk and lets later trees capture more of the nuances in the data through less important (overall) predictors.
The high bag fraction tells the right-hand model to use a larger subset of the training observations to build each tree (a bag fraction of 1 uses all of the observations every time), so the trees see very similar data and keep selecting the same dominant predictors. A lower bag fraction produces trees built on more varied subsamples, with greater differences in the predictor mix, and different trees then assign importance to different predictors.
GBM (BOOSTED MODELS) TUNING PARAMETERS
https://www.listendata.com/2015/07/gbm-boosted-models-tuning-parameters.html
Appendix S1: Explanation of the effect of learning rate on predictive stability in boosted regression trees
https://web.stanford.edu/~hastie/Papers/Ecology/ELH_appendixs1-s2.pdf
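For reference, a sketch of how the two extreme models of Fig. 8.24 could be reproduced with caret on the solubility data. The remaining gbm tuning parameters (tree count, interaction depth, minimum node size) are arbitrary assumptions, not the values used in the book:

```r
library(caret)
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

# Fit a boosted tree with a fixed shrinkage (learning rate) and bag fraction
fit_gbm <- function(shrink, bag) {
  grid <- expand.grid(n.trees = 100,            # assumption
                      interaction.depth = 7,    # assumption
                      shrinkage = shrink,
                      n.minobsinnode = 10)      # assumption
  set.seed(100)
  train(solTrainXtrans, solTrainY,
        method = "gbm",
        tuneGrid = grid,
        bag.fraction = bag,   # passed through to gbm
        verbose = FALSE)
}

gbm_low  <- fit_gbm(shrink = 0.1, bag = 0.1)   # left-hand plot of Fig. 8.24
gbm_high <- fit_gbm(shrink = 0.9, bag = 0.9)   # right-hand plot of Fig. 8.24

plot(varImp(gbm_low),  top = 25)
plot(varImp(gbm_high), top = 25)
```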
(b) Which model do you think would be more predictive of other samples?
The slow-learning, low-bag-fraction model should be less prone to over-fitting and therefore better able to predict new samples. The fast learner with a high bag fraction could be too specific to the training data set.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth controls the number of splits (and hence nodes) in each tree; each additional split acts as a “higher-level interaction term with all of the other previous split predictors”. A greater interaction depth therefore brings more of the lesser predictors into the trees, spreading importance across a larger number of predictors and flattening the slope of the predictor-importance plot for either model.
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
This data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) that are used to model the production yield of the 176 recorded manufacturing runs.
Before modelling, the data will be pre-processed to:
- Remove predictors with near-zero variance.
- Impute missing values using the predictive mean matching (pmm) algorithm.
- Transform the predictor values with a Box-Cox transformation, centre them to a mean of zero, and scale them so that all predictors have a similar range of values.
The data set will then be split into a training and a test set with an 80/20 split ratio; a sketch of these pre-processing and splitting steps is shown below.
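A sketch of these steps, assuming the mice package for pmm imputation (object names are ours):

```r
library(caret)
library(mice)                        # predictive mean matching imputation
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess

# Remove near-zero-variance predictors (if any are flagged)
nzv <- nearZeroVar(chem)
if (length(nzv) > 0) chem <- chem[, -nzv]

# Impute missing values with predictive mean matching (pmm)
set.seed(100)
chem <- complete(mice(chem, method = "pmm", printFlag = FALSE))

# Box-Cox transform, centre and scale (applied here to the full data set,
# which is why the yield values in part (c) appear on a standardized scale)
pp   <- preProcess(chem, method = c("BoxCox", "center", "scale"))
chem <- predict(pp, chem)

# 80/20 training/test split (Yield is the first column)
set.seed(100)
in_train <- createDataPartition(chem$Yield, p = 0.8, list = FALSE)
train_x <- chem[in_train, -1];  train_y <- chem$Yield[in_train]
test_x  <- chem[-in_train, -1]; test_y  <- chem$Yield[-in_train]
```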
(a) Which tree-based regression model gives the optimal resampling and test set performance?
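A sketch of the tree-based fits with caret, reusing the train_x / train_y objects from the pre-processing sketch above; the tuneLength and resampling settings are assumptions. The linear and nonlinear models from Exercises 6.3 and 7.5 are shown alongside in the tables that follow.

```r
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)

rpartModel  <- train(train_x, train_y, method = "rpart",
                     tuneLength = 10, trControl = ctrl)
bagModel    <- train(train_x, train_y, method = "treebag",
                     trControl = ctrl)
rfModel     <- train(train_x, train_y, method = "rf",
                     tuneLength = 10, trControl = ctrl, importance = TRUE)
gbmModel    <- train(train_x, train_y, method = "gbm",
                     tuneLength = 10, trControl = ctrl, verbose = FALSE)
cubistModel <- train(train_x, train_y, method = "cubist",
                     tuneLength = 10, trControl = ctrl)

# Test-set performance for, e.g., the random forest
postResample(predict(rfModel, test_x), test_y)
```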
## Linear_Model RMSE Rsquare
## 1 Linear Regression 0.9884417 0.4922378
## 3 Partial Least Squares 0.6834779 0.6372129
## 2 Robust Linear Model 0.6061122 0.6499379
## 4 Ridge-regression 0.6466016 0.6573752
## Non_Linear_Model RMSE Rsquare
## 1 Support Vector Machine 0.5913078 0.6635761
## Non_Linear_Model RMSE Rsquare
## 1 k-Nearest Neighbors 0.7120312 0.5058445
## 3 Support Vector Machine 0.5913078 0.6635761
## 2 Neural Network avNNet 0.1098640 0.9884795
## Tree_Model RMSE Rsquare
## 1 Single Tree 0.6150260 0.6346226
## 3 Bagged Tree 0.6020142 0.6590350
## 5 Cubist 0.6020142 0.6590350
## 4 Gradient Boosted Machine 0.5474177 0.7070415
## 2 Random Forest 0.5313520 0.7458404
As the tables above show, the best-performing tree-based model (the random forest, with an Rsquare of 0.75) outperforms the best linear regression model (ridge regression, Rsquare of 0.66). Both, however, are outperformed by the top non-linear model (the averaged neural network avNNet, Rsquare of 0.99).
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
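The importance rankings can be extracted from the tuned caret objects; a sketch for the random forest (the same call applies to the earlier linear and nonlinear train objects):

```r
# Top 10 predictors for the best tree-based model (random forest)
rfImp <- varImp(rfModel)
plot(rfImp, top = 10)
```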
As in the linear and non-linear models, ManufacturingProcess32 is the most important predictor for the tree-based model. In the tree-based model, however, ManufacturingProcess32 is far more important than any of the remaining predictors.
In the linear and non-linear models, by contrast, the importance is spread more evenly across a larger number of predictors rather than being concentrated in a single dominant one, as it is in the tree-based model.
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Using the rpart.plot package, the structure of the single tree model trained with rpart is plotted below.
Each node shows the predicted yield value and the percentage of observations that reach that node.
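A sketch of the plotting and rule-extraction calls, where rpartModel is the tuned single tree from part (a); the abbreviated predictor names in the table below (MP = ManufacturingProcess, BM = BiologicalMaterial) were shortened separately:

```r
library(rpart.plot)

finalTree <- rpartModel$finalModel   # the underlying rpart object

# Tree diagram: each node is labelled with the predicted yield and the
# percentage of training observations that reach it
rpart.plot(finalTree)

# The same splits expressed as rules, with the coverage of each terminal node
rpart.rules(finalTree, cover = TRUE)
```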
## Yield cover
## -1.85 when MP32 < 0.22 & BM12 < -0.57 & BM05 >= 0.58 5%
## -1.33 when MP32 < 0.22 & BM12 < -0.57 & BM09 >= 0.31 & BM05 < 0.58 5%
## -0.74 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 < -0.21 10%
## -0.65 when MP32 < 0.22 & BM12 < -0.57 & BM09 < 0.31 & BM05 < 0.58 10%
## -0.53 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 >= -0.21 & MP04 < 0.67 11%
## -0.11 when MP32 >= 0.22 & MP17 >= -1.4 & MP09 < -0.49 8%
## 0.20 when MP32 < 0.22 & BM12 >= -0.57 & MP25 >= 0.019 & BM11 >= -0.21 & MP04 >= 0.67 6%
## 0.32 when MP32 >= 0.22 & BM09 >= -0.46 & MP17 >= -1.4 & MP09 >= -0.49 & MP31 < 0.2 12%
## 0.41 when MP32 < 0.22 & BM12 >= -0.57 & MP25 < 0.019 10%
## 0.88 when MP32 >= 0.22 & BM09 >= -0.46 & MP17 >= -1.4 & MP09 >= -0.49 & MP31 >= 0.2 5%
## 1.08 when MP32 >= 0.22 & BM09 < -0.46 & MP17 >= -1.4 & MP09 >= -0.49 9%
## 1.48 when MP32 >= 0.22 & MP17 < -1.4 8%
The Biological Material (BM) predictors appear mainly in the splits that lead to the lower-yield terminal nodes, while the Manufacturing Process (MP) predictors drive the splits on the higher-yield side of the distribution. The table above also shows, in rule form, the splits the single tree model uses on the most important predictors to partition the response “Yield”.
Plotting rpart trees with the rpart.plot package