library(tidyverse)
library(mlbench)
library(caret)
library(Cubist)
library(gbm)
library(ipred)
library(party)
library(partykit)
library(randomForest)
library(rpart)
library(RWeka)
library(AppliedPredictiveModeling)
library(rattle)

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200) 
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6-V10 are close to zero, and several are negative, so the random forest made essentially no use of the uninformative predictors.
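
For reference, the fitted randomForest object also stores a node-purity importance measure alongside the permutation measure reported above; printing both is a quick way to confirm the point (nothing new is fit here):

# Permutation-based (%IncMSE) and split-based (IncNodePurity) importance;
# V6-V10 should score low on both measures.
randomForest::importance(model1)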

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

The importance score for V1 fell, and the scores for most of the other variables dropped slightly as well, while duplicate1 picked up a sizable score of its own. With highly correlated predictors, the forest can split on either variable, so the importance is shared between them; this typically changes the importance ranking, and V4 is now the top predictor. Adding yet another predictor correlated with V1 would dilute V1's score even further.
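
The second half of the question (adding yet another predictor correlated with V1) can be sketched the same way; duplicate2, sim_dup2, model2b, and rfImp2b are illustrative names, and the fit is done on a copy so the later chunks that use simulated are unaffected:

# Sketch: add a second noisy copy of V1 on a copy of the data and refit.
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1

model2b <- randomForest(y ~ ., data = sim_dup2,
                        importance = TRUE,
                        ntree = 1000)
rfImp2b <- varImp(model2b, scale = FALSE)
rfImp2b
# Expectation: V1's score drops further, since its importance is now shared
# across V1, duplicate1, and duplicate2.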

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model3 <- party::cforest(y ~ ., data = simulated,
                         controls = party::cforest_control(mtry = ncol(simulated) - 1))

party::varimp(model3)
##           V1           V2           V3           V4           V5           V6 
##  3.887369983  7.744416028  0.050926368 10.241797325  2.064600935 -0.018782373 
##           V7           V8           V9          V10   duplicate1 
##  0.029094384 -0.023311385  0.009101122  0.026801558  5.756908601

With conditional = FALSE (the default), the importance scores follow the same pattern as the traditional random forest.

party::varimp(model3, conditional = TRUE)
##           V1           V2           V3           V4           V5           V6 
##  0.555456266  4.725776440  0.007057617  6.054858562  0.698371838 -0.008996735 
##           V7           V8           V9          V10   duplicate1 
##  0.015960555 -0.008461915  0.005040511  0.023136037  0.864870599

With conditional = TRUE, the scores shrink considerably, particularly for V1 and duplicate1, since the conditional measure accounts for the correlation between them; a few of the weaker predictors improve slightly.

On the whole, V1-V5 (plus duplicate1) remain the most important in both cases, so the conditional inference forest follows the same broad pattern as the random forest.
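
As a cross-check (a sketch, not part of the original fits), dropping duplicate1 and refitting the conditional forest shows whether V1's conditional importance recovers once the correlated copy is removed:

model_nodup <- party::cforest(y ~ ., data = subset(simulated, select = -duplicate1),
                              controls = party::cforest_control(mtry = ncol(simulated) - 2))
party::varimp(model_nodup, conditional = TRUE)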

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted

gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 28.1586654
## V2                 V2 21.9983634
## duplicate1 duplicate1 15.8249555
## V1                 V1 15.1558423
## V5                 V5  9.9637365
## V3                 V3  8.0879655
## V6                 V6  0.8104713
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

Cubist

cubist_model <- cubist(simulated[, c(1:10, 12)], simulated$y, committees = 100)
varImp(cubist_model)
##            Overall
## V3            43.5
## V1            52.5
## V2            59.5
## duplicate1    27.5
## V4            46.0
## V8             4.0
## V5            27.0
## V6            10.0
## V10            1.0
## V7             0.0
## V9             0.0

The boosted model shows the same broad pattern: V1-V5 and duplicate1 carry nearly all of the relative influence, and V7-V10 are ignored. Cubist differs in the details: it is the only model that ranks V2 as the most important variable, and V6, which was uninformative in the previous models, receives a non-trivial score. Even so, the informative predictors still dominate in every model.
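
The scores above sit on different scales (permutation importance, relative influence, and Cubist usage), so it is the rankings that should be compared. A small sketch that collects them into one table, reusing the objects fit earlier:

vars <- rownames(rfImp2)
gbm_imp    <- summary(gbmModel, plotit = FALSE)   # rel.inf, row names are the variables
cubist_imp <- varImp(cubist_model)

data.frame(randomForest = rfImp2[vars, "Overall"],
           gbm          = gbm_imp[vars, "rel.inf"],
           cubist       = cubist_imp[vars, "Overall"],
           row.names    = vars)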

8.2

Use a simulation to show tree bias with different granularities.

set.seed(300)
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)

set.seed(300)
zy <- (.4 * x1) + (.2 * x2) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * zy) + 10

simulated2 <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)

rpartfit <- rpart(y ~., data = simulated2)

varImp(rpartfit)
##     Overall
## x1 3.556750
## x2 1.823912
## x3 1.755870

According to the text (p. 182), regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors, i.e., those with fewer distinct values. In this simulation all three predictors are continuous, so they differ mainly in spread: x1, which has the largest coefficient and the smallest standard deviation, receives the highest importance score.
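
A minimal sketch that isolates the granularity effect (illustrative names; neither predictor is related to the response, but x_fine has many more distinct values than x_coarse): repeating the fit many times, the root split lands on the high-granularity predictor far more often than chance, which is the selection bias described in the text.

set.seed(301)
first_split_var <- replicate(200, {
  dat <- data.frame(x_coarse = sample(1:2, 100, replace = TRUE),  # 2 distinct values
                    x_fine   = rnorm(100),                        # ~100 distinct values
                    y0       = rnorm(100))                        # pure-noise response
  fit <- rpart(y0 ~ x_coarse + x_fine, data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  rownames(fit$splits)[1]                                         # variable chosen at the root
})
table(first_split_var)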

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Bagging fraction - the fraction of the training set that is randomly sampled to build each tree.

Shrinkage - the learning rate, i.e., how much of each new tree's contribution is added to the ensemble; the larger the value, the faster the model learns.

With both parameters set to 0.9, each tree sees almost all of the training data and each iteration takes a large step down the gradient, so the same few dominant predictors are selected over and over and the importance concentrates on them; the model also risks overshooting the optimum and overfitting. With both set to 0.1, each tree is built on a small random subsample and contributes only a little, so across many iterations more predictors get a chance to enter the model and importance is spread out.
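
The effect can be sketched directly on the solubility data by fitting gbm at the two extremes from the figure (n.trees = 100 and the default interaction depth are assumptions for a quick illustration, not the settings used in the book):

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# With shrinkage and bag.fraction at 0.9, the same dominant predictors are chosen
# at nearly every iteration, so the relative influence concentrates on a few variables.
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)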

(b) Which model do you think would be more predictive of other samples?

The model with both parameters at 0.1 should be more predictive of new samples: the small learning rate takes many small, incremental steps down the gradient and the small bagging fraction adds randomness, both of which guard against overfitting, and the resulting model draws on more predictors rather than leaning on just a few.
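
Continuing the sketch above (gbm_low and gbm_high come from the previous chunk), the two settings can be compared on the solubility test set:

pred_low  <- predict(gbm_low,  solTestXtrans, n.trees = 100)
pred_high <- predict(gbm_high, solTestXtrans, n.trees = 100)
c(RMSE_low  = RMSE(pred_low,  solTestY),
  RMSE_high = RMSE(pred_high, solTestY))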

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
data(solubility)
gbmGrid1 <- expand.grid(.interaction.depth = 1, 
                        .n.trees = 100, 
                        .shrinkage = 0.1,
                        .n.minobsinnode=10)

set.seed(100)
gbmTune1 <- train(solTrainXtrans, 
                  solTrainY,
                  method = "gbm", 
                  tuneGrid = gbmGrid1, 
                  verbose = FALSE)

plot(varImp(gbmTune1), top = 30)

gbmGrid2 <- expand.grid(.interaction.depth = 20, 
                        .n.trees = 100, 
                        .shrinkage = 0.1,
                        .n.minobsinnode=10)

set.seed(100)
gbmTune2 <- train(solTrainXtrans, 
                  solTrainY,
                  method = "gbm", 
                  tuneGrid = gbmGrid2, 
                  verbose = FALSE)

plot(varImp(gbmTune2), top = 30)

Comparing the two plots, increasing the interaction.depth parameter spreads the importance across more predictors, which flattens the slope of the predictor importance curve.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Preprocessing

data("ChemicalManufacturingProcess")

preprocessing <- preProcess(ChemicalManufacturingProcess[, -1],
                            method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])

yield <- as.matrix(ChemicalManufacturingProcess$Yield)

set.seed(789)
split2 <- yield %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

Xtrain.data  <- Xpreprocess[split2, ] #chem train
xtest.data <- Xpreprocess[-split2, ] #chem test
Ytrain.data  <- yield[split2, ] #yield train
ytest.data <- yield[-split2, ] #yield test
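
A couple of quick sanity checks on the preprocessed and split data (a sketch; knnImpute should leave no missing values, and the partition should be roughly 80/20):

sum(is.na(Xpreprocess))   # expect 0 after knnImpute
dim(Xtrain.data)
dim(xtest.data)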

Modeling

Single Trees

set.seed(100)
chem_rpart_tuned <- train(Xtrain.data, Ytrain.data, 
                          method = "rpart2", 
                          tuneLength = 10, 
                          trControl = trainControl(method = "cv"))

Model Trees

set.seed(100)
chem_m5_tuned <- train(Xtrain.data, Ytrain.data, 
                  method = "M5", 
                  tuneLength = 10, 
                  control = Weka_control(M = 10))

Bagged Trees

set.seed(100)
chem_bagged_tuned <- train(Xtrain.data, Ytrain.data, 
                      method = "treebag", 
                      tuneLength = 10, 
                      trControl = trainControl(method = "cv", number = 10))

Random Forest

set.seed(100)

chem_rf_tuned <- train(Xtrain.data, Ytrain.data, 
                      method = "rf", 
                      tuneLength = 10, 
                      trControl = trainControl(method = "cv", number = 10))

Boosted Trees

set.seed(100)

gbmGrid <- expand.grid(.interaction.depth = 1, 
                       .n.trees = 100, 
                       .shrinkage = 0.1,
                       .n.minobsinnode = 10)


chem_gbm_tuned <- train(Xtrain.data, Ytrain.data, 
                       method = "gbm", 
                       tuneGrid = gbmGrid,
                       verbose = FALSE)

Cubist

set.seed(100)
chem_cubist_tuned <- train(Xtrain.data, Ytrain.data, 
                       method = "cubist")

Evaluation

Training
RMSE <- c(min(chem_rpart_tuned$results$RMSE), min(chem_m5_tuned$results$RMSE),
          min(chem_bagged_tuned$results$RMSE), min(chem_rf_tuned$results$RMSE),
          min(chem_gbm_tuned$results$RMSE), min(chem_cubist_tuned$results$RMSE))

Rsquared <- c(max(chem_rpart_tuned$results$Rsquared), max(chem_m5_tuned$results$Rsquared),
              max(chem_bagged_tuned$results$Rsquared), max(chem_rf_tuned$results$Rsquared),
              max(chem_gbm_tuned$results$Rsquared), max(chem_cubist_tuned$results$Rsquared))

MAE <- c(min(chem_rpart_tuned$results$MAE), min(chem_m5_tuned$results$MAE),
         min(chem_bagged_tuned$results$MAE), min(chem_rf_tuned$results$MAE),
         min(chem_gbm_tuned$results$MAE), min(chem_cubist_tuned$results$MAE))

results <- cbind(RMSE, Rsquared, MAE) %>%
  data.frame(row.names = c("RPART", "M5", "BAG", "RF", "GBM", "CUBIST"))

kableExtra::kable(results) %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
            RMSE  Rsquared       MAE
RPART   1.334645 0.5054817 1.0618190
M5      1.398013 0.4903795 1.0569324
BAG     1.192763 0.6121492 0.9418477
RF      1.126824 0.6742688 0.8929892
GBM     1.264458 0.5518754 0.9923112
CUBIST  1.100763 0.6565921 0.8554157
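
Because the min() and max() above are taken over each model's entire tuning grid, the RMSE and Rsquared in a row may come from different tuning parameter values. As an alternative (a sketch), caret::getTrainPerf() reports the resampled metrics at each model's selected parameters:

purrr::map_dfr(list(RPART  = chem_rpart_tuned,
                    M5     = chem_m5_tuned,
                    BAG    = chem_bagged_tuned,
                    RF     = chem_rf_tuned,
                    GBM    = chem_gbm_tuned,
                    CUBIST = chem_cubist_tuned),
               caret::getTrainPerf, .id = "model")
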
Test
rpart_pred <- predict(chem_rpart_tuned, newdata = xtest.data)
rpartpv <- postResample(pred = rpart_pred, obs = ytest.data)

m5_pred <- predict(chem_m5_tuned, newdata = xtest.data)
m5pv <- postResample(pred = m5_pred, obs = ytest.data)

bagged_pred <- predict(chem_bagged_tuned, newdata = xtest.data)
baggedpv <- postResample(pred = bagged_pred, obs = ytest.data)

rf_pred <- predict(chem_rf_tuned, newdata = xtest.data)
rfpv <- postResample(pred = rf_pred, obs = ytest.data)

gbm_pred <- predict(chem_gbm_tuned, newdata = xtest.data)
gbmpv <- postResample(pred = gbm_pred, obs = ytest.data)

cubist_pred <- predict(chem_cubist_tuned, newdata = xtest.data)
cubistpv <- postResample(pred = cubist_pred, obs = ytest.data)

data.frame(rpartpv, m5pv, baggedpv, rfpv, gbmpv, cubistpv) %>%
  kableExtra::kable() %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
            rpartpv      m5pv  baggedpv      rfpv     gbmpv  cubistpv
RMSE      1.4975873 1.2776163 1.2846622 1.0746261 1.2134944 0.9541025
Rsquared  0.2261018 0.4002705 0.4053841 0.5644691 0.4802251 0.6613366
MAE       1.1315189 0.9872927 0.9218436 0.8051689 0.8915011 0.7247069

(a) Which tree-based regression model …

gives the optimal resampling and test set performance?

Based on the tables above, the Cubist model gives the best performance: it has the lowest RMSE and MAE in both the resampling results and on the test set, as well as the highest test-set R-squared (random forest edges it out slightly on resampled R-squared).
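
For the three models that were fit with the same seed and 10-fold CV (rpart2, treebag, rf), caret's resamples() gives a fold-by-fold comparison; Cubist, GBM, and M5 used the default bootstrap here, so they cannot be pooled into the same resamples object (a sketch):

cv_resamps <- resamples(list(RPART = chem_rpart_tuned,
                             BAG   = chem_bagged_tuned,
                             RF    = chem_rf_tuned))
summary(cv_resamps)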

chem_cubist_tuned
## Cubist 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.654406  0.4227947  1.1951170
##    1          5          1.630817  0.4359900  1.1694179
##    1          9          1.639838  0.4320087  1.1753534
##   10          0          1.206217  0.5960610  0.9349341
##   10          5          1.180251  0.6128340  0.9062233
##   10          9          1.191380  0.6057719  0.9181023
##   20          0          1.126936  0.6401021  0.8842087
##   20          5          1.100763  0.6565921  0.8554157
##   20          9          1.112568  0.6494242  0.8663721
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
chem_cubist_tuned$bestTune
##   committees neighbors
## 8         20         5
chem_cubist_tuned$finalModel
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## Number of samples: 144 
## Number of predictors: 56 
## 
## Number of committees: 20 
## Number of rules per committee: 2, 1, 1, 1, 7, 1, 6, 1, 6, 1, 4, 1, 1, 1, 7, 1, 3, 1, 4, 1
ggplot(chem_cubist_tuned)

plotmo::plotmo(chem_cubist_tuned)
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                            -0.1070431          -0.04306519           -0.1062217
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##           -0.07283723         -0.004683137          -0.09620684
##  BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
##               0.06681          -0.04830923           -0.1178766
##  BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
##            -0.1012727          -0.03863564              0.1056672
##  ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
##               0.5096271              0.1087038              0.3424324
##  ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
##             -0.06529069             -0.2229103              0.4390925
##  ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
##               0.8941637              0.1066231             -0.1030952
##  ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
##              0.02020002             -0.4806937           -0.007834829
##  ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
##              0.04826216            -0.09295527             0.06169755
##  ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
##             0.005007187             0.06617593             -0.1360039
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##               0.0688801             -0.1744786             -0.1218132
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
##             -0.01031118             -0.1438567              0.0651293
##  ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
##              0.06432695             0.06918722              0.7255096
##  ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
##              -0.0066778             0.03954225             0.09273307
##  ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
##             -0.08632349              0.1836771              0.1182687
##  ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
##             -0.05513017              0.4884872            -0.03063781
##  ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
##               0.7174727               0.231727             -0.4626528
##  ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
##              -0.4405878              0.2027957             -0.1289558
##  ManufacturingProcess44 ManufacturingProcess45
##               0.2946725              0.1522024

(b) Which predictors are most important …

in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(chem_cubist_tuned)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32  94.667
## ManufacturingProcess09  64.000
## BiologicalMaterial03    60.000
## BiologicalMaterial02    50.667
## ManufacturingProcess13  46.667
## ManufacturingProcess04  42.667
## ManufacturingProcess33  42.667
## BiologicalMaterial12    42.667
## ManufacturingProcess39  34.667
## BiologicalMaterial06    34.667
## ManufacturingProcess15  26.667
## ManufacturingProcess29  25.333
## ManufacturingProcess37  20.000
## ManufacturingProcess02  18.667
## BiologicalMaterial08    18.667
## ManufacturingProcess26  14.667
## ManufacturingProcess27  14.667
## ManufacturingProcess31  13.333
## BiologicalMaterial09     9.333
plot(varImp(chem_cubist_tuned))
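
A quick tally of the Cubist top 10 by predictor type (a sketch using the tidyverse helpers already loaded):

varImp(chem_cubist_tuned)$importance %>%
  rownames_to_column("predictor") %>%
  arrange(desc(Overall)) %>%
  slice_head(n = 10) %>%
  mutate(type = ifelse(grepl("^Manufacturing", predictor), "Process", "Biological")) %>%
  count(type)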

Comparison
Ex 6.3 Linear (Ridge)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(101)

ridgeRegFit <- train(Xtrain.data, Ytrain.data,
                     method = "ridge",
                     tuneGrid = ridgeGrid,
                     trControl = trainControl(method = "cv", number = 10))

predictions <- ridgeRegFit %>% predict(xtest.data)

cbind(
  RMSE = RMSE(predictions, ytest.data),
  R_squared = caret::R2(predictions, ytest.data)
)
##         RMSE R_squared
## [1,] 1.06489 0.5657402
varImp(ridgeRegFit)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   97.82
## ManufacturingProcess17   86.84
## BiologicalMaterial06     83.64
## BiologicalMaterial03     78.13
## ManufacturingProcess09   72.37
## BiologicalMaterial12     72.20
## ManufacturingProcess36   70.51
## BiologicalMaterial02     63.10
## ManufacturingProcess06   61.88
## ManufacturingProcess31   58.39
## BiologicalMaterial11     56.85
## ManufacturingProcess33   47.06
## ManufacturingProcess11   45.94
## BiologicalMaterial04     45.43
## ManufacturingProcess29   44.71
## BiologicalMaterial08     44.36
## ManufacturingProcess12   38.22
## BiologicalMaterial01     35.56
## BiologicalMaterial09     33.79
plot(varImp(ridgeRegFit))

Ex. 7.5 Non-Linear (MARS)
marsGrid <- expand.grid(.degree = 1:3, .nprune = 2:100)
set.seed(100)

chem_mars_tuned <- train(Xtrain.data, Ytrain.data,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv", number = 10))

marspred <- predict(chem_mars_tuned, newdata = xtest.data)
marspv <- postResample(pred = marspred, obs = ytest.data)

marspv
##      RMSE  Rsquared       MAE 
## 1.1997927 0.4779621 0.9605993
varImp(chem_mars_tuned)
## earth variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   60.24
## ManufacturingProcess14    0.00
## ManufacturingProcess22    0.00
## BiologicalMaterial02      0.00
## ManufacturingProcess30    0.00
## ManufacturingProcess38    0.00
## ManufacturingProcess29    0.00
## BiologicalMaterial01      0.00
## BiologicalMaterial05      0.00
## ManufacturingProcess40    0.00
## ManufacturingProcess01    0.00
## BiologicalMaterial06      0.00
## BiologicalMaterial10      0.00
## ManufacturingProcess21    0.00
## ManufacturingProcess33    0.00
## ManufacturingProcess31    0.00
## ManufacturingProcess28    0.00
## ManufacturingProcess44    0.00
## ManufacturingProcess12    0.00
plot(varImp(chem_mars_tuned))

Notes:

  • All three models rank manufacturing process variables among the most important predictors of product yield.

  • ManufacturingProcess32 appears in the top three predictors for all three models.

  • The non-linear (MARS) model assigns non-zero importance to only two predictors, both manufacturing processes, so its importance is far less spread out across predictors than in the other models.

  • Judging by the performance scores, tree-based modeling (Cubist in particular) fits these data best.

(c) Plot the optimal single tree …

with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

fancyRpartPlot(chem_rpart_tuned$finalModel, palettes = 'PuRd')

Yes, the plotted tree provides additional insight: each split shows which predictor and cutpoint partition the samples, and each terminal node reports the predicted (mean) yield along with the share of observations falling into it, so you can trace which combinations of process and biological values are associated with higher or lower yield.
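
An alternative rendering (a sketch; best_depth and chem_tree are illustrative names) refits the tree at the depth caret selected and converts it to a partykit object, whose default plot draws a boxplot of yield in each terminal node:

best_depth <- chem_rpart_tuned$bestTune$maxdepth
chem_tree  <- rpart(y ~ ., data = data.frame(Xtrain.data, y = Ytrain.data),
                    control = rpart.control(maxdepth = best_depth))
plot(partykit::as.party(chem_tree))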