8.1, 8.2, 8.3, 8.7
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
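For reference, a minimal sketch of code that could have produced the scores below, following the simulation setup specified in the exercise (object names such as `rfModel` are assumptions):

```r
# Simulate the Friedman benchmark data described in the exercise, then fit a
# random forest and compute unscaled variable importance
library(mlbench)
library(randomForest)
library(caret)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

rfModel <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(rfModel, scale = FALSE)
```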
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6 through V10 are either very small or negative, so the random forest did not make significant use of the uninformative predictors.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
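A sketch of how the correlated predictor could be added and the forest refit (the 0.1 noise scale follows the book's example and is an assumption here):

```r
# Create a near-duplicate of V1, check the correlation, then refit the forest
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)

rfModel2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(rfModel2, scale = FALSE)
```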
## [1] 0.9460206
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Yes, the importance is now split between the two correlated predictors: V1's score drops and duplicate1 picks up the slack. Adding yet another predictor that is highly correlated with V1 would dilute V1's score further, spreading the importance across all of the correlated copies.
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
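A sketch of the conditional inference forest fit (the ntree value is an assumption):

```r
# Fit a forest of conditional inference trees and compare the two importance measures
library(party)

cfModel <- cforest(y ~ ., data = simulated,
                   controls = cforest_unbiased(ntree = 1000))
varimp(cfModel)                       # traditional permutation importance
varimp(cfModel, conditional = TRUE)   # conditional importance (Strobl et al., 2007)
```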
## V1 V2 V3 V4 V5 V6
## 4.717052339 6.202960979 0.002124085 7.597897956 1.647855387 -0.002763382
## V7 V8 V9 V10 duplicate1
## -0.009188887 -0.044026153 0.037599747 0.003582451 5.027062808
## V1 V2 V3 V4 V5 V6
## 1.993307775 4.936195081 0.006540315 6.233000527 1.040846448 0.001450660
## V7 V8 V9 V10 duplicate1
## -0.030621571 -0.011665644 0.008592973 -0.020742648 2.028815743
The pattern is roughly the same: the uninformative predictors (V6–V10) remain near zero, although the conditional measure changes how importance is distributed among the informative predictors and noticeably shrinks the scores of the correlated pair V1 and duplicate1.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
1) Boosted tree
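A sketch of the boosted tree fit (gbm defaults such as 100 trees are assumptions; leaving the distribution unspecified produces the gaussian message below):

```r
# Boosted regression trees; summary() reports each predictor's relative influence
library(gbm)

gbmModel <- gbm(y ~ ., data = simulated)
summary(gbmModel, plotit = FALSE)
```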
## Distribution not specified, assuming gaussian ...
## var rel.inf
## V4 V4 29.8571887
## V2 V2 25.4113586
## V1 V1 18.6495880
## V5 V5 10.9958687
## duplicate1 duplicate1 7.6226186
## V3 V3 7.1729942
## V7 V7 0.2903832
## V6 V6 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
2) Cubist
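A sketch of the Cubist fit, with caret::varImp supplying the importance scores shown below:

```r
# Cubist expects the predictors and the response as separate arguments
library(Cubist)
library(caret)

cubistModel <- cubist(x = simulated[, names(simulated) != "y"],
                      y = simulated$y)
varImp(cubistModel)
```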
## Overall
## V3 43.5
## V1 52.5
## V2 59.5
## duplicate1 27.5
## V4 46.0
## V8 4.0
## V5 27.0
## V6 10.0
## V10 1.0
## V7 0.0
## V9 0.0
The two models show a similar overall pattern, except that V6 and V8 receive somewhat more importance in the Cubist model.
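For Exercise 8.2 (tree bias with predictors of differing granularity), here is a sketch of how the two simulated data frames summarized below might be built; the sample size, distributions, and coefficients are assumptions, and only the structure (three predictors, the same response, and a low-granularity copy with predictors rounded to the nearest 100) follows the description given after the output:

```r
# Simulate three continuous predictors and a response, then create a second data
# frame whose predictors are rounded to the nearest 100 (lower granularity)
library(randomForest)
library(caret)

set.seed(624)
n  <- 1000
v1 <- runif(n, 0, 1000)
v2 <- runif(n, 0, 1000)
v3 <- runif(n, 0, 1000)
target <- v1 + 2 * v2 + 5 * v3 + rnorm(n, sd = 500)

dfHigh <- data.frame(v1, v2, v3, target)
dfLow  <- data.frame(v1 = round(v1, -2),
                     v2 = round(v2, -2),
                     v3 = round(v3, -2),
                     target)

rfHigh <- randomForest(target ~ ., data = dfHigh, importance = TRUE)
rfLow  <- randomForest(target ~ ., data = dfLow,  importance = TRUE)

rfHigh; varImp(rfHigh, scale = FALSE)
rfLow;  varImp(rfLow,  scale = FALSE)
```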
##
## Call:
## randomForest(formula = target ~ ., data = dfHigh, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1203335
## % Var explained: 67.52
## Overall
## v1 135438.0
## v2 467955.5
## v3 4149434.1
##
## Call:
## randomForest(formula = target ~ ., data = dfLow, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 1207423
## % Var explained: 67.41
## Overall
## v1 161870.8
## v2 446547.5
## v3 3994244.0
I used two data frames of simulated data, each with three predictors and the same response. The only difference was that the predictors in the low-granularity data frame were rounded to the nearest 100. The high-granularity model explained slightly more of the variance than the low-granularity model and tended to spread the explanatory power somewhat more evenly among the predictors.
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
When both parameters are set high, the model has a tendency to overfit. In some models, overfitting picks up small effects in the training data and so adds predictors to the model; in boosting, however, it tends to concentrate importance on fewer predictors. Because boosting is iterative, a high learning rate lets each tree's contribution be large, so the strongest predictors dominate the early iterations and leave little residual signal for the rest, and a high bagging fraction means every tree sees nearly the same data and selects the same predictors repeatedly. The net effect is importance focused on just a few predictors.
(b) Which model do you think would be more predictive of other samples?
I would expect the left-hand model to be more predictive of other samples, since its lower learning rate and bagging fraction make it less prone to overfitting. On the other hand, a model with a very low bagging fraction could also underfit the data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth specifies the maximum depth of each tree (i.e., the highest level of variable interactions allowed), starting from a single node. Increasing the depth also works in the direction of overfitting, so the tendency will again be toward fewer predictors with higher importance, i.e., a steeper slope of predictor importance.
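A hypothetical gbm call showing where the parameters discussed above appear; it reuses the Exercise 8.1 `simulated` data purely for illustration, whereas Fig. 8.24 itself was built on the solubility data:

```r
# Left-hand settings in Fig. 8.24: small bagging fraction and learning rate;
# right-hand settings: both parameters raised to 0.9
library(gbm)

gbmLeft  <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 1)
gbmRight <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.9, bag.fraction = 0.9, interaction.depth = 1)

# Part (c): a larger interaction.depth allows deeper trees at each iteration
gbmDeep  <- gbm(y ~ ., data = simulated, distribution = "gaussian", n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1, interaction.depth = 10)
```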
(a) Which tree-based regression model gives the optimal resampling and test set performance?
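A sketch of one way the three models below could have been fit and evaluated on the chemical manufacturing data; the imputation, split, and object names (`dfTrain`, `dfTest`, `dfTrain2`, `yTrain`) are assumptions inferred from the output:

```r
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(gbm)
library(Cubist)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[, -1]   # 57 biological/process predictors
yield      <- ChemicalManufacturingProcess$Yield

# Impute missing predictor values (knnImpute also centers and scales), then split
imputed <- predict(preProcess(predictors, method = "knnImpute"), predictors)
set.seed(624)
inTrain <- createDataPartition(yield, p = 0.8, list = FALSE)
dfTrain <- data.frame(imputed[inTrain, ],  Yield = yield[inTrain])
dfTest  <- data.frame(imputed[-inTrain, ], Yield = yield[-inTrain])

# 1) Random forest
rfFit <- randomForest(Yield ~ ., data = dfTrain, importance = TRUE)
postResample(predict(rfFit, dfTest), dfTest$Yield)

# 2) Boosted tree (gbm defaults; gbm assumes a gaussian distribution here)
gbmFit <- gbm(Yield ~ ., data = dfTrain)
postResample(predict(gbmFit, dfTest, n.trees = 100), dfTest$Yield)

# 3) Cubist
dfTrain2 <- dfTrain[, names(dfTrain) != "Yield"]
yTrain   <- dfTrain$Yield
cubFit   <- cubist(x = dfTrain2, y = yTrain)
postResample(predict(cubFit, dfTest[, names(dfTest) != "Yield"]), dfTest$Yield)
```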
1) Random Forest
## RMSE Rsquared MAE
## 1.0996245 0.6943453 0.9388005
2) Boosted
## Distribution not specified, assuming gaussian ...
## gbm(formula = Yield ~ ., data = dfTrain)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 57 predictors of which 39 had non-zero influence.
## RMSE Rsquared MAE
## 1.0570916 0.6975547 0.8547055
3) Cubist
##
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
##
## Number of samples: 143
## Number of predictors: 57
##
## Number of committees: 1
## Number of rules: 3
## RMSE Rsquared MAE
## 1.3214809 0.5397992 1.0173420
The boosted tree model gives the best test set performance, with the lowest RMSE and MAE and the highest R².
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Manufacturing process variables are more prominent than biological variables in this model, and one predictor, ManufacturingProcess32, dominates the list.
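The table below can be obtained from the boosted model's relative influence, for example (assuming the hypothetical `gbmFit` object from the sketch in part (a)):

```r
# Relative influence of each predictor in the boosted model
summary(gbmFit, plotit = FALSE)
```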
## var rel.inf
## ManufacturingProcess32 ManufacturingProcess32 28.390778
## ManufacturingProcess31 ManufacturingProcess31 8.636307
## BiologicalMaterial12 BiologicalMaterial12 8.218515
## BiologicalMaterial04 BiologicalMaterial04 6.242313
## ManufacturingProcess09 ManufacturingProcess09 5.636546
## ManufacturingProcess13 ManufacturingProcess13 5.347599
## ManufacturingProcess06 ManufacturingProcess06 5.211052
## ManufacturingProcess17 ManufacturingProcess17 3.099549
## ManufacturingProcess37 ManufacturingProcess37 2.766831
## BiologicalMaterial06 BiologicalMaterial06 2.419689
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
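One way to fit and plot a single regression tree with the distribution of yield shown in the terminal nodes; the rpart/partykit combination is an assumption, and the original figure may have been produced differently:

```r
# Fit a single CART tree and plot it with boxplots of Yield in the terminal nodes
library(rpart)
library(partykit)

treeFit <- rpart(Yield ~ ., data = dfTrain)
plot(as.party(treeFit))
```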
The tree confirms the importance of ManufacturingProcess32 in determining yield, even though it was not in the top 10 for the non-tree models. While manufacturing predictors dominate at every level of the tree, biological predictors are also present; there is only one path from the root node to a terminal node that follows manufacturing predictors exclusively.