8.1, 8.2, 8.3, 8.7
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
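For reference, a minimal sketch of code that could have produced the scores below, following the simulation setup specified in the exercise (object names such as `rfModel` are assumptions):

```r
# Simulate the Friedman benchmark data described in the exercise, then fit a
# random forest and compute unscaled variable importance
library(mlbench)
library(randomForest)
library(caret)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

rfModel <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(rfModel, scale = FALSE)
```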
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6 through V10 are either very small or negative, so the random forest did not make significant use of the uninformative predictors.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
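A sketch of how the correlated predictor could be added and the forest refit (the 0.1 noise scale follows the book's example and is an assumption here):

```r
# Create a near-duplicate of V1, check the correlation, then refit the forest
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)

rfModel2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(rfModel2, scale = FALSE)
```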
## [1] 0.9460206
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Yes, the importance is now split between the two correlated predictors: V1's score drops and duplicate1 picks up the slack. Adding yet another predictor that is highly correlated with V1 would dilute V1's score further, spreading the importance across all of the correlated copies.
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
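A sketch of the conditional inference forest fit (the ntree value is an assumption):

```r
# Fit a forest of conditional inference trees and compare the two importance measures
library(party)

cfModel <- cforest(y ~ ., data = simulated,
                   controls = cforest_unbiased(ntree = 1000))
varimp(cfModel)                       # traditional permutation importance
varimp(cfModel, conditional = TRUE)   # conditional importance (Strobl et al., 2007)
```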
## V1 V2 V3 V4 V5 V6
## 4.717052339 6.202960979 0.002124085 7.597897956 1.647855387 -0.002763382
## V7 V8 V9 V10 duplicate1
## -0.009188887 -0.044026153 0.037599747 0.003582451 5.027062808
## V1 V2 V3 V4 V5 V6
## 1.993307775 4.936195081 0.006540315 6.233000527 1.040846448 0.001450660
## V7 V8 V9 V10 duplicate1
## -0.030621571 -0.011665644 0.008592973 -0.020742648 2.028815743
The pattern is roughly the same: the uninformative predictors (V6–V10) remain near zero, although the conditional measure changes how importance is distributed among the informative predictors and noticeably shrinks the scores of the correlated pair V1 and duplicate1.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
1) Boosted tree
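A sketch of the boosted tree fit (gbm defaults such as 100 trees are assumptions; leaving the distribution unspecified produces the gaussian message below):

```r
# Boosted regression trees; summary() reports each predictor's relative influence
library(gbm)

gbmModel <- gbm(y ~ ., data = simulated)
summary(gbmModel, plotit = FALSE)
```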
## Distribution not specified, assuming gaussian ...
## var rel.inf
## V4 V4 29.8571887
## V2 V2 25.4113586
## V1 V1 18.6495880
## V5 V5 10.9958687
## duplicate1 duplicate1 7.6226186
## V3 V3 7.1729942
## V7 V7 0.2903832
## V6 V6 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
2) Cubist
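A sketch of the Cubist fit, with caret::varImp supplying the importance scores shown below:

```r
# Cubist expects the predictors and the response as separate arguments
library(Cubist)
library(caret)

cubistModel <- cubist(x = simulated[, names(simulated) != "y"],
                      y = simulated$y)
varImp(cubistModel)
```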
## Overall
## V3 43.5
## V1 52.5
## V2 59.5
## duplicate1 27.5
## V4 46.0
## V8 4.0
## V5 27.0
## V6 10.0
## V10 1.0
## V7 0.0
## V9 0.0
The two models show a similar overall pattern, except that V6 and V8 receive somewhat more importance in the Cubist model.
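For Exercise 8.2 (tree bias with predictors of differing granularity), here is a sketch of how the two simulated data frames summarized below might be built; the sample size, distributions, and coefficients are assumptions, and only the structure (three predictors, the same response, and a low-granularity copy with predictors rounded to the nearest 100) follows the description given after the output:

```r
# Simulate three continuous predictors and a response, then create a second data
# frame whose predictors are rounded to the nearest 100 (lower granularity)
library(randomForest)
library(caret)

set.seed(624)
n  <- 1000
v1 <- runif(n, 0, 1000)
v2 <- runif(n, 0, 1000)
v3 <- runif(n, 0, 1000)
target <- v1 + 2 * v2 + 5 * v3 + rnorm(n, sd = 500)

dfHigh <- data.frame(v1, v2, v3, target)
dfLow  <- data.frame(v1 = round(v1, -2),
                     v2 = round(v2, -2),
                     v3 = round(v3, -2),
                     target)

rfHigh <- randomForest(target ~ ., data = dfHigh, importance = TRUE)
rfLow  <- randomForest(target ~ ., data = dfLow,  importance = TRUE)

rfHigh; varImp(rfHigh, scale = FALSE)
rfLow;  varImp(rfLow,  scale = FALSE)
```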
##
## Call:
## randomForest(formula = target ~ ., data = dfHigh, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1203335
## % Var explained: 67.52
## Overall
## v1 135438.0
## v2 467955.5
## v3 4149434.1
##
## Call:
## randomForest(formula = target ~ ., data = dfLow, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1207423
## % Var explained: 67.41
## Overall
## v1 161870.8
## v2 446547.5
## v3 3994244.0
I used two data frames of simulated data, each with three predictors and the same response. The only difference was that the predictors in the low-granularity data frame were rounded to the nearest 100. The high-granularity model explained slightly more of the variance than the low-granularity model and tended to spread the explanatory power somewhat more evenly among the predictors.
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
When both parameters are set high, the model has a tendency to overfit. In some models, overfitting picks up small effects in the training data and so adds predictors to the model; in boosting, however, it tends to concentrate importance on fewer predictors. Because boosting is iterative, a high learning rate lets each tree's contribution be large, so the strongest predictors dominate the early iterations and leave little residual signal for the rest, and a high bagging fraction means every tree sees nearly the same data and selects the same predictors repeatedly. The net effect is importance focused on just a few predictors.
(b) Which model do you think would be more predictive of other samples?
I would expect the left-hand model to be more predictive of other samples, since its lower learning rate and bagging fraction make it less prone to overfitting. On the other hand, a model with a very low bagging fraction could also underfit the data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth specifies the maximum depth of each tree (i.e., the highest level of variable interactions allowed), starting from a single node. Increasing the depth also works in the direction of overfitting, so the tendency will again be toward fewer predictors with higher importance, i.e., a steeper slope of predictor importance.
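A hypothetical gbm call showing where the parameters discussed above appear; it reuses the Exercise 8.1 `simulated` data purely for illustration, whereas Fig. 8.24 itself was built on the solubility data:

```r
# Left-hand settings in Fig. 8.24: small bagging fraction and learning rate;
# right-hand settings: both parameters raised to 0.9
library(gbm)

gbmLeft  <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 1)
gbmRight <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.9, bag.fraction = 0.9, interaction.depth = 1)

# Part (c): a larger interaction.depth allows deeper trees at each iteration
gbmDeep  <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 10)
```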
(a) Which tree-based regression model gives the optimal resampling and test set performance?
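A sketch of one way the three models below could have been fit and evaluated on the chemical manufacturing data; the imputation, split, and object names (`dfTrain`, `dfTest`, `dfTrain2`, `yTrain`) are assumptions inferred from the output:

```r
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(gbm)
library(Cubist)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[, -1]   # 57 biological/process predictors
yield      <- ChemicalManufacturingProcess$Yield

# Impute missing predictor values (knnImpute also centers and scales), then split
imputed <- predict(preProcess(predictors, method = "knnImpute"), predictors)
set.seed(624)
inTrain <- createDataPartition(yield, p = 0.8, list = FALSE)
dfTrain <- data.frame(imputed[inTrain, ],  Yield = yield[inTrain])
dfTest  <- data.frame(imputed[-inTrain, ], Yield = yield[-inTrain])

# 1) Random forest
rfFit <- randomForest(Yield ~ ., data = dfTrain, importance = TRUE)
postResample(predict(rfFit, dfTest), dfTest$Yield)

# 2) Boosted tree (gbm defaults; gbm assumes a gaussian distribution here)
gbmFit <- gbm(Yield ~ ., data = dfTrain)
postResample(predict(gbmFit, dfTest, n.trees = 100), dfTest$Yield)

# 3) Cubist
dfTrain2 <- dfTrain[, names(dfTrain) != "Yield"]
yTrain   <- dfTrain$Yield
cubFit   <- cubist(x = dfTrain2, y = yTrain)
postResample(predict(cubFit, dfTest[, names(dfTest) != "Yield"]), dfTest$Yield)
```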
1) Random Forest
## RMSE Rsquared MAE
## 1.0996245 0.6943453 0.9388005
2) Boosted
## Distribution not specified, assuming gaussian ...
## gbm(formula = Yield ~ ., data = dfTrain)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 57 predictors of which 39 had non-zero influence.
## RMSE Rsquared MAE
## 1.0570916 0.6975547 0.8547055
3) Cubist
##
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
##
## Number of samples: 143
## Number of predictors: 57
##
## Number of committees: 1
## Number of rules: 3
## RMSE Rsquared MAE
## 1.3214809 0.5397992 1.0173420
The boosted tree model gives the best test set performance, with the lowest RMSE and MAE and the highest R².
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Manufacturing process variables are more prominent than biological variables in this model, and one predictor, ManufacturingProcess32, dominates the list.
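The table below can be obtained from the boosted model's relative influence, for example (assuming the hypothetical `gbmFit` object from the sketch in part (a)):

```r
# Relative influence of each predictor in the boosted model
summary(gbmFit, plotit = FALSE)
```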
## var rel.inf
## ManufacturingProcess32 ManufacturingProcess32 28.390778
## ManufacturingProcess31 ManufacturingProcess31 8.636307
## BiologicalMaterial12 BiologicalMaterial12 8.218515
## BiologicalMaterial04 BiologicalMaterial04 6.242313
## ManufacturingProcess09 ManufacturingProcess09 5.636546
## ManufacturingProcess13 ManufacturingProcess13 5.347599
## ManufacturingProcess06 ManufacturingProcess06 5.211052
## ManufacturingProcess17 ManufacturingProcess17 3.099549
## ManufacturingProcess37 ManufacturingProcess37 2.766831
## BiologicalMaterial06 BiologicalMaterial06 2.419689
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
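One way to fit and plot a single regression tree with the distribution of yield shown in the terminal nodes; the rpart/partykit combination is an assumption, and the original figure may have been produced differently:

```r
# Fit a single CART tree and plot it with boxplots of Yield in the terminal nodes
library(rpart)
library(partykit)

treeFit <- rpart(Yield ~ ., data = dfTrain)
plot(as.party(treeFit))
```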
The tree confirms the importance of ManufacturingProcess32 in determining yield, even though it was not in the top 10 for the non-tree models. While manufacturing predictors dominate at every level of the tree, biological predictors are also present; there is only one path from the root node to a terminal node that follows manufacturing predictors exclusively.