library(gt)
library(mlbench)
library(caret)
library(skimr)
library(AppliedPredictiveModeling)
library(rpart)
library(tidyverse)
library(tidymodels)
library(vip)
library(ggthemes)
library(randomForest)
library(gbm)
library(party)
library(Cubist)

Part A.

Did the random forest model significantly use the uninformative predictors (V6-V10)?

No. Variables V6-V10 had a negligible impact on the model; their importance scores were very low.
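A minimal sketch of how such importance scores might be obtained, assuming the simulated data come from `mlbench.friedman1` as in the textbook exercise. The seed, `ntree` value, and object names are illustrative assumptions, not taken from the original analysis.

```r
library(mlbench)
library(randomForest)

# Assumed setup: Friedman benchmark data, where only V1-V5 drive the response
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)   # unnamed matrix columns become V1..V10
simulated$y <- sim$y

# Fit a random forest and extract unscaled permutation importance (%IncMSE)
rf_fit <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp <- importance(rf_fit, type = 1, scale = FALSE)

# Sort so the informative predictors can be compared with V6-V10
rf_imp[order(rf_imp[, 1], decreasing = TRUE), , drop = FALSE]
```

If the answer above holds, V6-V10 should sit at the bottom of the sorted matrix with scores close to zero.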

Part B.

Add an additional predictor that is highly correlated with one of the informative predictors. For example:

## [1] 0.9402881

Adding a second correlated variable reduces the importance of V1 as well as the importance of the first correlated variable. The importance of V3 also appears to drop slightly.
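The dilution effect described here can be sketched as follows. The names `dupe1` and `dupe2` follow the write-up, but the noise scale (`rnorm(200) * 0.1`), the seed, and the other object names are assumptions made for illustration.

```r
library(mlbench)
library(randomForest)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)
simulated$y <- sim$y

# Assumed construction: near-copies of V1 with a small amount of Gaussian noise
simulated$dupe1 <- simulated$V1 + rnorm(200) * 0.1
simulated$dupe2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$dupe1, simulated$V1)

# Refit and inspect how importance is shared between V1 and its near-copies
rf_dupes <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
imp_dupes <- importance(rf_dupes, type = 1, scale = FALSE)
imp_dupes[c("V1", "dupe1", "dupe2"), , drop = FALSE]
```

Because the three columns carry nearly the same information, the forest can split on any of them, so the importance that V1 held alone tends to be divided among them.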

Part C.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional inference tree model seems to be a more extreme version of M3, above. V1, V3, dupe1 and dupe2 are all less important than in the previous plot in Part B.
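A sketch of the comparison the question asks for, using `cforest` and `varimp` from the party package. The `ntree` and `mtry` values are assumptions chosen to keep the conditional importance calculation quick, and the data setup repeats the assumed simulation with a single correlated copy of V1.

```r
library(mlbench)
library(party)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)
simulated$y <- sim$y
simulated$dupe1 <- simulated$V1 + rnorm(200) * 0.1

# Conditional inference forest; a small ntree keeps conditional importance fast
cf_fit <- cforest(y ~ ., data = simulated,
                  controls = cforest_unbiased(ntree = 100, mtry = 3))

# Traditional permutation importance versus the conditional version
imp_traditional <- varimp(cf_fit, conditional = FALSE)
imp_conditional <- varimp(cf_fit, conditional = TRUE)

round(cbind(traditional = imp_traditional, conditional = imp_conditional), 3)
```

The conditional version permutes each predictor within groups defined by its correlated neighbours, which is why correlated predictors such as V1 and dupe1 can end up with lower scores than under the traditional measure.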

Part D.

The gradient boosting (GB) model was similar to the M3 random forest, but V3 appears to be more important in the GB model than in the RF model.

Of all the models, the Cubist model appears to be the best: it ignored the unimportant variables and also handled the correlated variables much better.
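A rough sketch of how the boosted-tree and Cubist importances could be produced on the same assumed simulation. The tuning values (`n.trees`, `interaction.depth`, `shrinkage`, `committees`) are placeholders, not the settings used for the results above.

```r
library(mlbench)
library(gbm)
library(Cubist)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)
simulated$y <- sim$y
simulated$dupe1 <- simulated$V1 + rnorm(200) * 0.1

# Boosted trees: relative influence per predictor (normalized to sum to 100)
gbm_fit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
gbm_imp <- summary(gbm_fit, plotit = FALSE)

# Cubist: percentage of rule conditions and linear models that use each predictor
predictors <- simulated[, setdiff(names(simulated), "y")]
cubist_fit <- cubist(x = predictors, y = simulated$y, committees = 10)
cubist_use <- cubist_fit$usage

gbm_imp
cubist_use
```

The Cubist `usage` table is not on the same scale as relative influence, so the two are best compared by ranking rather than by raw value.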

  1. Use a simulation to show tree bias with different granularities.

Tree models are known to suffer from selection bias: predictors with more distinct values are favored over predictors with fewer distinct values.
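One way such a simulation might be set up is sketched below. The granularities, the seed, and the additive form of `y` are assumptions; the summary that follows is consistent with four predictors on [0, 1] and 2000 rows, but the original generating code is not shown.

```r
library(rpart)

set.seed(123)
n <- 2000

# Assumed granularities: from about 10,000 possible values (p1) down to 11 (p4)
df <- data.frame(
  p1 = sample(seq(0, 1, by = 0.0001), n, replace = TRUE),
  p2 = sample(seq(0, 1, by = 0.001),  n, replace = TRUE),
  p3 = sample(seq(0, 1, by = 0.01),   n, replace = TRUE),
  p4 = sample(seq(0, 1, by = 0.1),    n, replace = TRUE)
)

# Every predictor contributes equally to the response, so differences in
# importance reflect how the tree selects splits rather than true signal
df$y <- df$p1 + df$p2 + df$p3 + df$p4

tree_fit <- rpart(y ~ ., data = df)
tree_imp <- tree_fit$variable.importance

# Distinct values per predictor, followed by the tree's importance scores
sapply(df[, c("p1", "p2", "p3", "p4")], function(x) length(unique(x)))
tree_imp
```

Note that `variable.importance` from rpart includes surrogate-split contributions, so its values will not match the `Overall` column that `caret::varImp` reports.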

Data summary

Name                     df
Number of rows           2000
Number of columns        5
Column type frequency:
  numeric                5
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd    p0   p25   p50   p75  p100  hist
p1                     0              1  0.50  0.29  0.00  0.25  0.50  0.76  1.00  ▇▇▇▇▇
p2                     0              1  0.49  0.29  0.00  0.24  0.50  0.72  1.00  ▇▇▇▇▇
p3                     0              1  0.51  0.29  0.00  0.25  0.51  0.77  1.00  ▇▇▇▇▇
p4                     0              1  0.50  0.32  0.00  0.20  0.50  0.80  1.00  ▇▆▅▆▅
y                      0              1  2.00  0.60  0.23  1.59  2.00  2.41  3.75  ▁▅▇▅▁

##     Overall
## p1 4.958308
## p2 2.982601
## p3 2.450068
## p4 1.637977

Simulation Observation:

The simulation above demonstrates the tendency of tree models to show selection bias toward predictors with more distinct values. The importance scores were ordered from the predictor with the most distinct values (p1) down to the predictor with the fewest (p4). The bias was weaker when a fifth “error term” of the form rnorm(2000) was added to the equation.


  1. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

fig8.3


Part A

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

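The model on the left has a learning rate of 0.1, so importance gets spread over more predictors. The model on the right uses a learning rate of 0.9, so each tree's prediction is added almost unshrunk and the first few trees, which split on the strongest predictors, account for most of the fit. Its bagging fraction of 0.9 also means each tree sees nearly the same data and tends to choose the same predictors. Both settings concentrate importance on a smaller set of variables.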
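The concentration of importance can be checked with a short sketch on the solubility data from AppliedPredictiveModeling. The helper `top_share`, the number of trees, and the tree depth are assumptions for illustration and are not the settings behind Figure 8.24.

```r
library(AppliedPredictiveModeling)
library(gbm)

data(solubility)  # provides solTrainXtrans and solTrainY

# Hypothetical helper: share of total relative influence held by the top k predictors
top_share <- function(fit, k = 5) {
  inf <- summary(fit, plotit = FALSE)$rel.inf   # sorted descending, sums to 100
  sum(head(inf, k))
}

set.seed(100)
fit_low <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 1,
                   shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
fit_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# A larger share means importance is concentrated in fewer predictors
c(low = top_share(fit_low), high = top_share(fit_high))
```

A larger top-five share for the 0.9 settings would match the steep right-hand plot, though the exact numbers depend on the seed and the number of trees.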


Part B

Which model do you think would be more predictive of other samples?

The model on the left should do better. With a learning rate of 0.1 it builds the fit from many small steps, and the small bagging fraction makes the individual trees less alike, so it is more likely to generalize. The model on the right is more likely to overfit the training data.
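This expectation could be checked against the held-out solubility test set. The sketch below assumes 100 trees of depth 1 for both models and uses a hypothetical helper `fit_gbm` that sets both parameters to the same value, mirroring the two panels of the figure.

```r
library(AppliedPredictiveModeling)
library(gbm)

data(solubility)  # provides training and test sets (solTestXtrans, solTestY)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Hypothetical helper: learning rate and bagging fraction share one value
fit_gbm <- function(rate) {
  gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
          n.trees = 100, interaction.depth = 1,
          shrinkage = rate, bag.fraction = rate, verbose = FALSE)
}

set.seed(100)
fits <- list(left = fit_gbm(0.1), right = fit_gbm(0.9))

# Test-set RMSE for each configuration; lower is better
test_rmse <- sapply(fits, function(f) {
  rmse(solTestY, predict(f, newdata = solTestXtrans, n.trees = 100))
})
test_rmse
```

A single split like this is only indicative; resampling over several seeds would give a more reliable comparison.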


Part C

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth will spread the importance out more: each tree can grow deeper, which gives more predictors a chance to be used in splits. Increasing the depth therefore reduces the slope of the importance plot.
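A small sketch of how the effect of depth might be examined on the solubility data. The depths tried, the fixed learning rate and bagging fraction, and the top-five summary are all assumptions chosen for illustration.

```r
library(AppliedPredictiveModeling)
library(gbm)

data(solubility)  # provides solTrainXtrans and solTrainY

depths <- c(1, 3, 7)

set.seed(100)
# Share of relative influence held by the top five predictors at each depth
top5 <- sapply(depths, function(d) {
  fit <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                 n.trees = 100, interaction.depth = d,
                 shrinkage = 0.1, bag.fraction = 0.5, verbose = FALSE)
  sum(head(summary(fit, plotit = FALSE)$rel.inf, 5))
})
names(top5) <- paste0("depth_", depths)
top5
```

If the reasoning above is right, the top-five share should tend to fall as depth grows, which corresponds to a flatter importance plot.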