Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Random forest variable importance
predictors Overall
V1 8.492
V4 7.892
V2 6.551
V5 2.190
V3 0.683
V6 0.105
V10 0.032
V7 0.009
V9 -0.016
V8 -0.083

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

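Looking at the table above, we can see that the random forest model did not significantly use predictors V6 through V10: their importance scores all lie close to zero (between -0.083 and 0.105), well below those of V1 through V5.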
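A minimal sketch of how such a table could be produced is shown here. It assumes the randomForest package, 1000 trees, and a second seed for the forest itself; none of these settings are stated in the write-up, so the exact scores will not necessarily match the table.

```r
library(mlbench)
library(randomForest)

# Recreate the simulated data exactly as in the exercise
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(simulated$x, simulated$y))
colnames(simulated)[ncol(simulated)] <- "y"

# Assumption: seed and ntree for the forest are illustrative choices
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)

# Permutation importance (mean increase in MSE), unscaled, sorted
imp1 <- importance(model1, type = 1, scale = FALSE)
imp1 <- imp1[order(imp1[, 1], decreasing = TRUE), , drop = FALSE]
print(round(imp1, 3))
```

The informative predictors should sort to the top, with V6 through V10 close to zero.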

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data.

For example:

set.seed(2)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
simulated$duplicate1_2 <- simulated$V1 + rnorm(200) * .1

Did the importance score for V1 change?

Model2 Variable importance
predictors Overall
V4 6.549
V2 5.959
V1 5.004
duplicate1 4.515
V5 2.044
V3 0.453
V6 0.206
V10 -0.005
V7 -0.014
V9 -0.096
V8 -0.134

In Model1, V1 had the highest importance (8.492). In Model2 its importance falls to 5.004 and it drops to third place. This is because duplicate1 carries nearly the same information as V1, so the forest splits on it in place of V1 part of the time and the two variables share the importance that V1 previously held alone.
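To see how close duplicate1 is to V1, the correlation can be checked directly. The sketch below uses a stand-in for V1 drawn from a uniform distribution on [0, 1], which is my understanding of how mlbench.friedman1 generates its predictors; it is not the exact V1 column from the exercise.

```r
# Stand-in for simulated$V1 (assumption: friedman1 predictors are U(0, 1))
set.seed(200)
V1 <- runif(200)

# Same construction as in the exercise: V1 plus noise with sd 0.1
set.seed(2)
duplicate1 <- V1 + rnorm(200) * 0.1

# Sample correlation between the original and the noisy copy
sample_cor <- cor(duplicate1, V1)

# Theoretical correlation: sd(V1) / sqrt(var(V1) + 0.1^2), with var(V1) = 1/12
theoretical_cor <- sqrt((1 / 12) / (1 / 12 + 0.1^2))

print(c(sample = sample_cor, theoretical = theoretical_cor))
```

A correlation of roughly 0.94 means a split on duplicate1 is almost interchangeable with a split on V1.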

What happens when you add another predictor that is also highly correlated with V1?

Model3 Variable importance
predictors Overall
V4 54.442
V2 51.460
V5 24.275
V1 23.484
duplicate1 22.329
duplicate1_2 20.907
V3 9.237
V6 3.009
V7 1.148
V9 0.626
V10 0.457
V8 -0.064

Adding another predictor that is highly correlated with V1 and duplicate1 spreads the importance over the three variables, reducing the importance of V1 and duplicate1 and assigning some of it to the new variable duplicate1_2.

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforest Variable importance
predictors Overall
V4 9.527
V2 7.818
V1 5.306
duplicate1 3.565
V5 2.281
duplicate1_2 0.877
V7 0.061
V10 0.050
V6 0.011
V3 0.009
V9 -0.024
V8 -0.031

The importances for the cforest model and the randomForest models show the same overall trends, but they treat the correlated predictors differently. The cforest model concentrates importance on V1 (5.306) relative to duplicate1 (3.565) and duplicate1_2 (0.877), whereas the traditional randomForest model spreads the importance more evenly across all three variables.
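A hedged sketch of the cforest fit and both importance measures follows. The control settings (cforest_unbiased with 100 trees) and the seed for the forest are my assumptions, chosen mainly to keep the conditional importance calculation quick; the write-up does not say which settings were used.

```r
library(mlbench)
library(party)

# Recreate the data with both correlated predictors, as in parts (a) and (b)
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(sim$x, sim$y))
colnames(simulated)[ncol(simulated)] <- "y"
set.seed(2)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate1_2 <- simulated$V1 + rnorm(200) * 0.1

# Assumption: 100 trees and this seed are illustrative choices
set.seed(200)
cf <- cforest(y ~ ., data = simulated,
              controls = cforest_unbiased(ntree = 100))

# Traditional permutation importance
imp_traditional <- varimp(cf, conditional = FALSE)

# Conditional importance (Strobl et al.), which adjusts for correlated predictors
imp_conditional <- varimp(cf, conditional = TRUE)

print(round(sort(imp_traditional, decreasing = TRUE), 3))
print(round(sort(imp_conditional, decreasing = TRUE), 3))
```

Comparing the two printed vectors shows how much of the importance given to V1 and its duplicates survives once the correlation among them is taken into account.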

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Tree

Using a boosted tree to predict the simulated y follows the same pattern as the randomForest models in parts (a) and (b). That is, the importance of V1 keeps decreasing as more highly correlated predictors are added. V1’s rank drops from second to third when one duplicated predictor is added, and drops again to fourth when a second is added.

Cubist

Looking at the plots above, the Cubist model reacts differently to duplicated or highly correlated predictors. The variable importance for V1 changes only slightly as more duplicated predictors are added.
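The two fits could be sketched as below. The tuning values (1000 trees, depth 2, shrinkage 0.01 for gbm; 10 committees for Cubist) are placeholders of mine, not the settings behind the plots, and only the data set with both duplicates is fitted here.

```r
library(mlbench)
library(gbm)
library(Cubist)

# Recreate the data with both correlated predictors
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(cbind(sim$x, sim$y))
colnames(simulated)[ncol(simulated)] <- "y"
set.seed(2)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate1_2 <- simulated$V1 + rnorm(200) * 0.1

# Boosted trees (assumption: illustrative tuning values)
set.seed(200)
gbm_fit <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
gbm_imp <- summary(gbm_fit, n.trees = 1000, plotit = FALSE)
print(gbm_imp)

# Cubist (assumption: 10 committees); usage reports how often each
# predictor appears in rule conditions and in the linear models
predictors <- simulated[, setdiff(names(simulated), "y")]
cubist_fit <- cubist(x = predictors, y = simulated$y, committees = 10)
print(cubist_fit$usage)
```

Refitting after dropping duplicate1_2, and then duplicate1, gives the sequence of importances compared in the text.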

Exercise 8.2

Use a simulation to show tree bias with different granularities.

set.seed(701)

v1 <- sample(0:1000 / 1000, 200, replace = TRUE) # most granular
v2 <- sample(0:100 / 100, 200, replace = TRUE) # middle granularity
v3 <- sample(0:10 / 10, 200, replace = TRUE) # least granular
y <- v1 + v3 + rnorm(200)

Variable importance for a random forest model using simulated data
Predictors Overall
v1 32.790
v3 24.840
v2 8.691

Looking at the table above, we see that the random forest model gives the most importance to v1, the most granular predictor, and less to v3, the least granular predictor, even though both enter y with the same coefficient. The lowest score goes to v2, which is unrelated to y, yet it still receives a clearly nonzero importance (8.691). Tree models favor more granular predictors because a variable with more distinct values offers more candidate split points, which creates selection bias. It also means trees could be more affected by noise variables, since these are often more granular.
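The simulation can be rerun as a sketch like the one below, which also counts the distinct values each predictor actually takes. The randomForest settings (1000 trees, permutation importance) are my assumption, so the scores will not necessarily reproduce the table.

```r
library(randomForest)

# Same simulation as in the exercise
set.seed(701)
v1 <- sample(0:1000 / 1000, 200, replace = TRUE) # most granular
v2 <- sample(0:100 / 100, 200, replace = TRUE)   # middle granularity
v3 <- sample(0:10 / 10, 200, replace = TRUE)     # least granular
y <- v1 + v3 + rnorm(200)                        # v2 plays no role in y
sim <- data.frame(v1, v2, v3, y)

# Number of distinct values, i.e. how many split points each predictor offers
n_distinct <- sapply(sim[, c("v1", "v2", "v3")],
                     function(v) length(unique(v)))
print(n_distinct)

# Assumption: ntree and the importance type are illustrative choices
rf <- randomForest(y ~ ., data = sim, importance = TRUE, ntree = 1000)
imp <- importance(rf, type = 1)
print(round(imp, 3))
```

Setting the importance scores next to the distinct-value counts makes the link between granularity and importance easy to inspect.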

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The gradient boosting method fits an initial tree and then adds the predictions of each subsequent tree to the running total. A higher learning rate means that a greater portion of each tree’s predictions is added to the final prediction. The learning rate is a regularization parameter that counters the algorithm’s greedy tendency to choose the optimal learner at each stage. In the right-hand plot the learning rate of 0.9 applies almost no shrinkage, so the same variables tend to dominate the process as the trees build upon one another. Similarly, the greater the bagging fraction, the more of the training data each tree sees, so a subsequent stage more often chooses the same variables as a prior stage. Therefore the right-hand plot, with the larger learning rate and bagging fraction, has its importance focused on just the first few predictors.
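The update behind this argument can be written out as a sketch. The notation is mine rather than the exercise's: F_m is the ensemble after m trees, \lambda the learning rate, \eta the bagging fraction, n the number of training samples, and h_m the tree fit to the current residuals (for squared-error loss) on a random subsample S_m.

```latex
F_m(x) = F_{m-1}(x) + \lambda \, h_m(x),
\qquad
h_m \text{ fit to } \bigl\{ \bigl(x_i,\; y_i - F_{m-1}(x_i)\bigr) : i \in S_m \bigr\},
\qquad
|S_m| \approx \eta \, n

% Expected overlap of two independent subsamples, as a share of each:
\frac{\mathbb{E}\,|S_m \cap S_{m+1}|}{|S_m|} \approx \frac{\eta^2 n}{\eta n} = \eta
```

With \lambda = 0.9 each tree's fit is added almost in full, and with \eta = 0.9 consecutive trees share about 90% of their training samples, compared with about 10% when \eta = 0.1.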

(b) Which model do you think would be more predictive of other samples?

Given that the model on the right seems to be overfitting the data, I think the model on the left with the lower bagging fraction and learning rate parameters would perform better on other samples. I suspect that the more dissimilar weak learners contributing to a final model give that model more flexibility with observations from other data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

For the model on the left, I assume we would see a more dramatic change as we increase the interaction depth, since there are more variables for the model to consider unimportant. For the model on the right, I assume that an increase in interaction depth would also reduce the importance of variables. The effect, however, might not be as clear, given that the NumCarbon predictor is already assigned most of the predictor importance.

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

The best random forest model uses 9 predictors, producing an \(R^2\) of 0.7694 with a resampled test \(R^2\) of 0.7093. The model therefore performed consistently on training and test data.

Boosted Tree

The boosted tree model producing the best results has a shrinkage of 0.01, an interaction depth of 5, and 1000 trees. This model produced a training \(R^2\) of 0.822 and a resampled \(R^2\) of 0.7378. These metrics are also very similar, so we expect the model to fit new data with accuracy similar to that on the test data.

Cubist

The Cubist model with the best \(R^2\), 0.8039, was the one using 10 committees and 5 neighbors. This model, however, greatly underperformed on the test data, producing an \(R^2\) of only 0.2434.

Looking at the graph above we see that the optimal tree-based regression model is the boosted tree model.
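The comparison can also be made from the figures quoted above. In this small sketch the column names r2_first and r2_second are placeholders of mine for the two \(R^2\) values reported per model, in the order they appear in the text.

```r
# R^2 values copied from the three model summaries above
results <- data.frame(
  model = c("Random Forest", "Boosted Tree", "Cubist"),
  r2_first = c(0.7694, 0.822, 0.8039),
  r2_second = c(0.7093, 0.7378, 0.2434),
  stringsAsFactors = FALSE
)

# Drop from the first to the second figure; a large drop suggests overfitting
results$gap <- results$r2_first - results$r2_second

# Model with the highest second R^2
best <- results$model[which.max(results$r2_second)]

print(results)
print(best)
```

The boosted tree has the highest second \(R^2\), while Cubist shows by far the largest gap between its two figures.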

(b)

- Which predictors are most important in the optimal tree-based regression model?
- Do either the biological or process variables dominate the list?
- How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Top 10 Predictor Categories
Type Number Percent
Biological 4 0.4
Process 6 0.6

As the table shows, 6 Process predictors and 4 Biological predictors make up the Top 10 variable importance list for the Tree based model. This matches the split for the Non-Linear model, whose Top 10 also comprised 6 Process and 4 Biological predictors, whereas the Linear model’s Top 10 from HW #7 question 6.3.e comprised 7 Process and 3 Biological predictors.

Each model differed in the specific predictors that made its top 10 list. The predictors common to all the models were BiologicalMaterial11, BiologicalMaterial09, ManufacturingProcess33, and ManufacturingProcess36.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The tree diagram above shows the same trend we have been seeing with the Linear and Non-Linear models: higher values of the Biological Material predictors are associated with higher yield, whereas higher values of the Manufacturing Process predictors are associated with lower yield.