Questions

8.1:

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
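
The chunks in this section assume the following packages are attached. The original setup chunk is not shown here, so this is a sketch, assuming the usual CRAN package for each function used below:

library(mlbench)                     # mlbench.friedman1
library(randomForest)                # randomForest
library(caret)                       # varImp, train, preProcess, bagControl, createDataPartition
library(party)                       # cforest, varimp
library(ipred)                       # bagging
library(Cubist)                      # cubist
library(gbm)                         # boosted trees via caret's 'gbm' method
library(rpart)                       # rpart.control
library(rpart.plot)                  # rpart.plot
library(AppliedPredictiveModeling)   # solubility and ChemicalManufacturingProcess data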

(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
  importance = TRUE,
  ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Predictors V6 through V10 have by far the lowest importance scores of the ten predictors, all hovering near zero (several are slightly negative). This is what we would expect: the random forest did not make significant use of the uninformative predictors.

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
  importance = TRUE,
  ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1

model3 <- randomForest(y ~ ., data = simulated,
  importance = TRUE,
  ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

We can see that the importance of V1 drops once a highly correlated predictor is added, although each added duplicate still scores higher than the uninformative predictors. Adding the second correlated predictor reduced V1's importance by much less than adding the first did, since the trees now split the shared information across three correlated variables.

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y ~ ., data = simulated[,-c(12,13)])
varimp(model4, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  5.699856e+00  5.190114e+00  6.219633e-03  6.741766e+00  1.182547e+00 
##            V6            V7            V8            V9           V10 
## -1.104739e-02  1.491985e-05 -8.383608e-03 -7.355064e-03 -2.342891e-02

The conditional importances follow a broadly similar pattern: V1, V2, and V4 again dominate (with V4 the largest here), V5 is smaller but clearly positive, and V6 through V10 sit near zero. The notable difference is V3, whose conditional importance drops to essentially zero, comparable to the uninformative predictors.
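
As an aside not shown in the original output, the two importance measures can also be compared on the version of the data that still contains the correlated duplicates (columns 12 and 13); the conditional permutation scheme of Strobl et al. (2007) is intended to discount importance that is merely borrowed from correlated copies of V1. A sketch:

model4b <- cforest(y ~ ., data = simulated)   # includes duplicate1 and duplicate2
varimp(model4b, conditional = FALSE)          # traditional permutation importance
varimp(model4b, conditional = TRUE)           # slower, but adjusts for correlated predictors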

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

set.seed(4)
bagModel <- bagging(y ~ ., data = simulated[,-c(12,13)], nbagg = 44)
(bagImp <- varImp(bagModel))
##       Overall
## V1  1.8450874
## V10 0.8078330
## V2  2.2561427
## V3  1.3791569
## V4  2.7519637
## V5  2.4373769
## V6  0.9309052
## V7  0.9629567
## V8  0.6600959
## V9  0.6630257

We see a broadly similar pattern for the bagged model: the informative predictors V1 through V5 rank highest, although the uninformative predictors receive noticeably larger scores here than they did under the random forest.

cubistModel <- cubist(simulated[,c(1:10)],simulated$y, committees = 50)
(cubistImp <- varImp(cubistModel))
##     Overall
## V1     71.0
## V3     44.5
## V2     57.5
## V4     48.5
## V5     35.0
## V6     15.5
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

For the Cubist model the ordering is somewhat different: V4 is no longer near the top, and the informative predictors rank 1-2-4-3-5, compared with 1-4-2-5-3 for the random forest and 4-5-2-1-3 for the bagged model. As before, the uninformative predictors V6 through V10 receive little or no importance.
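
For context (not part of the original output), caret's varImp for a Cubist model is derived from the attribute-usage statistics that Cubist reports, averaging the percentage of rule conditions and of linear models in which each predictor appears. The raw usage table can be inspected on the fitted object:

cubistModel$usage        # Conditions / Model usage percentages per predictor
# summary(cubistModel)   # full rule set, including the attribute usage section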


8.2:

Use a simulation to show tree bias with different granularities.

# three predictors with increasing granularity: 4, 16, and 32 distinct values
v1 <- rep(1:4, each = 200)
v2 <- rep(1:16, each = 50)
v3 <- rep(1:32, each = 25)
# the response is built from v1 and v2 only; v3 plays no role in generating y
data8_2 <- data.frame(v1 = sample(v1), v2 = sample(v2), v3 = sample(v3),
                      y = (v1 + v2) * rnorm(2, 0.5))
model8_2 <- randomForest(y ~ . , data = data8_2,
  importance = TRUE,
  ntree = 1000)
(varImp(model8_2))
##     Overall
## v1 3.746256
## v2 4.566638
## v3 3.172335

In this simulation v1 takes 4 distinct values, v2 takes 16, and v3 takes 32. The response y is generated from v1 and v2; v3 is not used at all. Even so, the importance table above shows v3 receiving a score comparable to the two variables that actually drive y. This illustrates the selection bias of tree-based models toward predictors with more distinct values (finer granularity): given more candidate split points, even an uninformative predictor has more chances to produce a seemingly good split.
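
To see the mechanism more directly, here is a minimal sketch (not part of the original answer, using rpart) in which the response is pure noise, so neither predictor is informative; a single CART stump still tends to split on the finer-grained predictor because it offers far more candidate split points:

set.seed(100)
first_split <- replicate(500, {
  d <- data.frame(coarse = sample(rep(1:4, each = 64)),   # 4 distinct values
                  fine   = sample(rep(1:32, each = 8)),   # 32 distinct values
                  y      = rnorm(256))                    # response is pure noise
  fit <- rpart(y ~ coarse + fine, data = d,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])
})
table(first_split)   # 'fine' is typically chosen far more often than 'coarse'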


8.3:

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a)

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction controls what fraction of the training data is randomly sampled to build each tree. The left-hand plot comes from a bagging fraction of 0.1, so each tree sees only a small subset of the data; different predictors can look strong in different subsets, and the importance ends up spread across many predictors once the trees are combined.

The learning rate controls how much of each tree's contribution is added to the ensemble at each iteration. With a high learning rate (0.9), the first few trees account for most of the fit, so the predictors those trees rely on accumulate most of the importance.

The model on the right therefore concentrates its importance on the few predictors that keep reappearing in the early, heavily weighted trees, while the model on the left is more likely to pick up different predictors from tree to tree and spreads the overall importance out.
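
As a sketch of how the two extremes in Fig. 8.24 could be reproduced on the solubility data (not part of the original write-up; the number of trees and interaction depth here are assumptions), bag.fraction is the bagging fraction and shrinkage is the learning rate in gbm:

data(solubility)                               # from AppliedPredictiveModeling
solub <- data.frame(solTrainXtrans, y = solTrainY)
gbm_left  <- gbm(y ~ ., data = solub, distribution = "gaussian", n.trees = 100,
                 bag.fraction = 0.1, shrinkage = 0.1, interaction.depth = 1)
gbm_right <- gbm(y ~ ., data = solub, distribution = "gaussian", n.trees = 100,
                 bag.fraction = 0.9, shrinkage = 0.9, interaction.depth = 1)
head(summary(gbm_left,  plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbm_right, plotit = FALSE), 10)   # importance concentrated at the top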

(b)

Which model do you think would be more predictive of other samples?

The model on the right, with both parameters at 0.9, is more likely to be overfit to the training data, so the left-hand model would probably predict other samples better. Realistically, the ideal settings likely lie somewhere between these two extremes; ideally we would tune both parameters through resampling and find out.
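
For instance, a tuning sketch with caret's 'gbm' method might look like the following (the grid values are illustrative; caret does not treat the bagging fraction as a tuning parameter, though bag.fraction can usually be passed through to gbm):

gbm_grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(1, 5, 10),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm',
                   tuneGrid = gbm_grid, verbose = FALSE,
                   trControl = trainControl(method = 'cv', number = 10))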

(c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

data(solubility)
setting_1 <- expand.grid(n.trees = 100, interaction.depth = 1,  shrinkage = 0.1, n.minobsinnode = 10)  # shallow trees (stumps)
setting_2 <- expand.grid(n.trees = 100, interaction.depth = 20, shrinkage = 0.1, n.minobsinnode = 10)  # deep trees

gbm_1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = setting_1, verbose = FALSE)

gbm_2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = setting_2, verbose = FALSE)
gbm1_Imp <- varImp(gbm_1)
gbm2_Imp <- varImp(gbm_2)
plot(gbm1_Imp, top = 30)

plot(gbm2_Imp, top = 30)

We can see from the two plots above that the model with the greater interaction depth spreads importance over many more predictors, so the slope of the importance profile becomes flatter rather than steeper. Deeper trees can involve more predictors in each tree, which increases the number of predictors that contribute to the final model.


8.7:

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data(ChemicalManufacturingProcess)
(Chem_Imp <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('knnImpute')))
## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - centered (57)
##   - ignored (0)
##   - 5 nearest neighbor imputation (57)
##   - scaled (57)
chem_mod <- predict(Chem_Imp, ChemicalManufacturingProcess[,-c(1)])
train_row <- sort(sample(nrow(chem_mod), nrow(chem_mod)*.7))
train_x_set <- chem_mod[train_row,]
test_x_set <- chem_mod[-train_row,]
train_y_set <- ChemicalManufacturingProcess[train_row,1]
test_y_set <- ChemicalManufacturingProcess[-train_row,1]
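
Note that the split above is not seeded, so the exact numbers in the resampling output below will vary from run to run. A reproducible, stratified alternative (a sketch, not what was used to produce the results shown) would be:

set.seed(8)                                   # illustrative seed
train_row <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                 p = 0.7, list = FALSE)[, 1]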

Random Forest

control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
model8_7_1 <- train(train_x_set, 
                    train_y_set, 
                    method = 'rf',
                    metric = metric,
                    tuneLength = 15,
                    trControl = control)
model8_7_1
## Random Forest 
## 
## 123 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 111, 111, 111, 111, 111, 109, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.226868  0.6084847  1.0201283
##    5    1.160600  0.6290117  0.9635216
##    9    1.133526  0.6364267  0.9406978
##   13    1.112987  0.6474797  0.9186701
##   17    1.113671  0.6423437  0.9123679
##   21    1.111052  0.6411034  0.9046681
##   25    1.108662  0.6426592  0.8993534
##   29    1.109705  0.6379578  0.8999850
##   33    1.110398  0.6383133  0.8941634
##   37    1.117584  0.6305988  0.9013689
##   41    1.122421  0.6259761  0.9023138
##   45    1.125407  0.6240853  0.9019940
##   49    1.127803  0.6226682  0.9028689
##   53    1.134733  0.6158645  0.9060036
##   57    1.143268  0.6119118  0.9133298
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 25.
plot(model8_7_1)

cforest

(model8_7_2 <- train(train_x_set, 
                    train_y_set, 
                    method = 'cforest',
                    metric = metric,
                    tuneLength = 15,
                    trControl = control))
## Conditional Inference Random Forest 
## 
## 123 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 111, 111, 111, 110, 111, 110, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.435462  0.5334834  1.2223561
##    5    1.271072  0.5754500  1.0652924
##    9    1.215658  0.5892568  1.0116227
##   13    1.194926  0.5910066  0.9920529
##   17    1.183503  0.5916274  0.9772004
##   21    1.186010  0.5857331  0.9813590
##   25    1.186382  0.5812055  0.9826090
##   29    1.183606  0.5806410  0.9817397
##   33    1.186818  0.5776648  0.9852693
##   37    1.188050  0.5733839  0.9868000
##   41    1.185921  0.5755445  0.9870229
##   45    1.195139  0.5682492  0.9939751
##   49    1.194675  0.5688634  0.9922801
##   53    1.203696  0.5618467  0.9963802
##   57    1.208599  0.5566277  1.0017364
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
plot(model8_7_2)

cubist

set.seed(4)
(model8_7_3 <- train(train_x_set, 
                    train_y_set, 
                    method = 'cubist',
                    metric = metric,
                    tuneLength = 15,
                    trControl = control))
## Cubist 
## 
## 123 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 111, 111, 110, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.477597  0.4998202  1.1096340
##    1          5          1.364317  0.5769665  0.9784651
##    1          9          1.391616  0.5542093  0.9995467
##   10          0          1.178752  0.5976878  0.9508272
##   10          5          1.099292  0.6532886  0.8520550
##   10          9          1.115220  0.6384497  0.8800726
##   20          0          1.173249  0.6008806  0.9465768
##   20          5          1.086935  0.6579468  0.8442398
##   20          9          1.108596  0.6426958  0.8738122
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
plot(model8_7_3)

bagged

# bag conditional inference trees via caret's bag() interface
B_control <- bagControl(fit = ctreeBag$fit,
                        predict = ctreeBag$pred,
                        aggregate = ctreeBag$aggregate)

(model8_7_4 <- train(train_x_set, 
                    train_y_set, 
                    method = 'bag',
                    metric = metric,
                    B = 20,
                    trControl = control,
                    bagControl = B_control))
## Warning: executing %dopar% sequentially: no parallel backend registered
## Bagged Model 
## 
## 123 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 109, 111, 111, 111, 111, 111, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE  
##   1.254697  0.5371473  1.034
## 
## Tuning parameter 'vars' was held constant at a value of 57

(a)

Which tree-based regression model gives the optimal resampling and test set performance?

summary(resamples(list(RandomForest = model8_7_1,
                       cForest = model8_7_2,
                       cubist = model8_7_3,
                       bagged = model8_7_4
                       )
                  )
        )
## 
## Call:
## summary.resamples(object = resamples(list(RandomForest = model8_7_1, cForest
##  = model8_7_2, cubist = model8_7_3, bagged = model8_7_4)))
## 
## Models: RandomForest, cForest, cubist, bagged 
## Number of resamples: 30 
## 
## MAE 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RandomForest 0.4956501 0.7479578 0.9065731 0.8993534 1.0144817 1.440293    0
## cForest      0.6377331 0.8592052 1.0010812 0.9772004 1.0671166 1.342674    0
## cubist       0.5320834 0.7214715 0.8237791 0.8442398 0.9631806 1.513275    0
## bagged       0.5441523 0.9110368 1.0182415 1.0340000 1.1946844 1.532016    0
## 
## RMSE 
##                   Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RandomForest 0.6449084 0.9078077 1.079606 1.108662 1.319344 1.719904    0
## cForest      0.7938529 1.1040161 1.196971 1.183503 1.292766 1.562193    0
## cubist       0.6495250 0.9250516 1.068274 1.086935 1.214117 1.838520    0
## bagged       0.6910230 1.1223437 1.273920 1.254697 1.434562 1.811232    0
## 
## Rsquared 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RandomForest 0.3348552 0.4907648 0.6782945 0.6426592 0.7664907 0.8558077    0
## cForest      0.1850696 0.5253804 0.5796824 0.5916274 0.6964444 0.8583958    0
## cubist       0.3556403 0.6073666 0.6612749 0.6579468 0.7297899 0.9016940    0
## bagged       0.1660380 0.4426571 0.5509883 0.5371473 0.6556451 0.8414997    0

The best mean \(RMSE\) metric comes from our cubist model - let’s take a look at the test performance metrics:

models <- list(model8_7_1, model8_7_2, model8_7_3, model8_7_4)

# compute test-set RMSE and R-squared for each fitted caret model
run_test_data <- function(mods, testX, testY){
  meths <- c()
  RMSEs <- c()
  R_SQs <- c()
  for (m in mods){
    meths <- c(meths,m$method)
    prediction <- predict(m,testX)
    RMSEs <- c(RMSEs, Metrics::rmse(testY,prediction))
    R_SQs <- c(R_SQs, cor(testY,prediction)^2)
  }
  df <- data.frame(rbind(RMSEs,R_SQs))
  names(df) <- meths
  return(df)
}
(run_test_data(models, test_x_set, test_y_set))
##              rf   cforest    cubist       bag
## RMSEs 1.4152741 1.5004756 1.0311772 1.6413117
## R_SQs 0.5145061 0.4585449 0.7340574 0.3169009

Again here we see that test set performance based on both \(R^{2}\) and \(RMSE\) appears best for our cubist model.

(b)

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?

cubeImp <- varImp(model8_7_3)
plot(cubeImp,top = 15)

As with all of the previous models, manufacturing process variables make up the majority of the most important predictors for the Cubist model. There are, however, a few biological material predictors in the top 10: 06, 02, 12, and 03.
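
To quantify this rather than eyeballing the plot, one can tally the predictor types among the top 10 (a sketch using the varImp object above; the grepl pattern relies on the standard column names in this data set):

imp_df <- cubeImp$importance
top10  <- rownames(imp_df)[order(-imp_df$Overall)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))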

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Manufacturing Processes 32 and 09 consistently appear at the top of every list we have generated, regardless of model type, suggesting that these predictors are genuinely strong indicators of yield. Below the top two or three the rankings diverge, with considerable shuffling among the remaining predictors, suggesting that their relative importance depends more on how the data are modeled.

(c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

model_rpart <- train(train_x_set, 
                    train_y_set, 
                    method = 'rpart',
                    metric = metric,
                    tuneLength = 15,
                    control = rpart.control(maxdepth = 4, cp = 0.001))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
(finalTree <- model_rpart$finalModel)
## n= 123 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 123 385.95810 40.28813  
##   2) ManufacturingProcess32< 0.006316353 63  84.05776 39.11444 *
##   3) ManufacturingProcess32>=0.006316353 60 123.99130 41.52050 *
rpart.plot(finalTree, faclen = 0, cex = 0.8, extra = 1)

This simple single tree makes only one split, on ManufacturingProcess32, dividing the data roughly in half with predicted yields of about 39.1 and 41.5 in the two terminal nodes. The histogram below shows that more than half of the training data falls between these two values.

hist(train_y_set, breaks = 8, prob = TRUE)
lines(density(train_y_set))