8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)

set.seed(200) 
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)

model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000) 
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6–V10 are all small (near zero), so the random forest did not make meaningful use of the uninformative predictors.

b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000) 
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2

The importance score for V1 is lower, and the scores for the other variables fell as well. Multicollinearity usually affects the ranking of variable importance: when predictors are highly correlated, the splits (and therefore the importance) are shared among them, so the individual scores become harder to interpret. V4 is now the most important predictor. Adding a second predictor that is highly correlated with V1 dilutes V1's score even further, as the sketch below illustrates.
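
A minimal sketch (my own addition, not part of the original answer) of what happens when a second near-copy of V1 is added. It works on a copy of the data so the later chunks are unaffected; the names sim_dup, duplicate2, and model_dup are mine.

sim_dup <- simulated
set.seed(200)
sim_dup$duplicate2 <- sim_dup$V1 + rnorm(200) * .1

model_dup <- randomForest(y ~ ., data = sim_dup, 
                          importance = TRUE, 
                          ntree = 1000)
# with two near-copies of V1 in the data, the importance V1 originally carried
# is split three ways, so its score drops again
varImp(model_dup, scale = FALSE)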

c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)

model3 <- cforest(y ~ ., data = simulated)

varimp(model3)
##          V1          V2          V3          V4          V5          V6 
##  6.14253842  6.10410006  0.10632869  6.54418720  2.07799547 -0.11606490 
##          V7          V8          V9         V10  duplicate1 
##  0.05872350 -0.08585961 -0.08030916 -0.14555774  5.73199392

With conditional = FALSE, the importance scores are very similar to those from the traditional random forest.

varimp(model3, conditional = TRUE)
##          V1          V2          V3          V4          V5          V6 
##  3.22747234  5.19402511  0.07065979  5.87536300  1.29939244 -0.14118239 
##          V7          V8          V9         V10  duplicate1 
## -0.09302146 -0.30875505 -0.05486725 -0.18846005  2.62605554

Setting conditional = TRUE, the scores are smaller overall, with the largest drops for the correlated predictors V1 and duplicate1; some of the less important predictors improved a bit.

The informative predictors (V1–V5) are the most important in both cases (V3 is weak here, just as it was for the random forest), so they follow the pattern of the traditional random forest.
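
The two measures can be lined up directly; a small sketch of my own (the permutation-based scores are random, so the values will differ slightly from the output above):

data.frame(traditional = varimp(model3),
           conditional = varimp(model3, conditional = TRUE))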

d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted

library(gbm)

gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

Cubist

library(Cubist)

cubist_model <- cubist(simulated[, c(1:10, 12)], simulated$y, committees = 100)
varImp(cubist_model)

Only the Cubist model ranks V2 as the most important predictor, and V6, which was uninformative in the previous models, is given a noticeable importance score by Cubist. So the broad pattern is similar, but not identical.
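
A small sketch of my own to put the two rankings side by side: gbm reports relative influence and Cubist reports usage-based importance, so only the ordering is comparable.

gbm_imp <- summary(gbmModel, plotit = FALSE)  # relative influence, already sorted
cub_imp <- varImp(cubist_model)

gbm_imp
cub_imp[order(cub_imp$Overall, decreasing = TRUE), , drop = FALSE]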

8.2

Use a simulation to show tree bias with different granularities.

library(rpart)

set.seed(100)
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)

set.seed(100)
zy <- (.4 * x1) + (.2 * x2) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * zy) + 10

simulated2 <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)

rpartfit <- rpart(y ~., data = simulated2)

varImp(rpartfit)

Regression trees suffer from selection bias: the granularity of a predictor influences how likely it is to be chosen for a split. In this simulation, x1 has the smallest standard deviation and also the largest coefficient in the generating equation, and it comes out as the strongest predictor. The sketch below isolates the granularity effect using predictors that carry no signal at all.
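
A complementary sketch of my own, under the common definition of granularity as the number of distinct values: both predictors are pure noise with respect to the response, so an unbiased procedure would pick each about half the time, while a biased one favors the finer-grained predictor. The names x_fine, x_coarse, and y_noise are mine.

set.seed(200)
first_split <- replicate(200, {
  x_fine   <- rnorm(100)                        # roughly 100 distinct values
  x_coarse <- sample(1:4, 100, replace = TRUE)  # only 4 distinct values
  y_noise  <- rnorm(100)                        # unrelated to both predictors
  fit <- rpart(y_noise ~ x_fine + x_coarse,
               data = data.frame(x_fine, x_coarse, y_noise),
               control = rpart.control(cp = 0, maxdepth = 1))
  # record which variable was used in the root split (NA if no split was made)
  if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA_character_
})
table(first_split)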

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction is the fraction of the training set that is randomly sampled to build each tree.

The shrinkage parameter, also known as the learning rate, controls how much each tree contributes to the ensemble; the larger the value, the bigger the step taken down the gradient.

With both parameters set to 0.9, each tree sees almost all of the data and each contributes heavily, so the first few trees, built around the strongest predictors, dominate the model and the importance is concentrated on just a few variables. A high learning rate also risks stepping past the optimum and overfitting. With both parameters at 0.1, the added randomness and the small steps force the model to build many trees that draw on a wider variety of predictors, which spreads the importance out.
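
A sketch of my own that mimics the two extremes in Fig. 8.24 with gbm() on the solubility data; only the bagging fraction and shrinkage values come from the exercise, the other settings (100 trees, default depth, the solTrain name) are my assumptions.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: the 0.9/0.9 model should concentrate its influence on a
# handful of predictors, while the 0.1/0.1 model should spread it out
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)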

b

Which model do you think would be more predictive of other samples?

The model with both parameters set to 0.1 would likely be more predictive of other samples: the slower learning rate and smaller bagging fraction spread the importance over more predictors and are less prone to overfitting. Small incremental steps down the gradient generally work best; the sketch below checks the two extreme fits on the solubility test set.
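
A follow-up sketch of my own, reusing gbm_low and gbm_high from the sketch in part (a), to compare the two extremes on the held-out solubility samples.

pred_low  <- predict(gbm_low,  solTestXtrans, n.trees = 100)
pred_high <- predict(gbm_high, solTestXtrans, n.trees = 100)
rbind(low  = postResample(pred_low,  solTestY),
      high = postResample(pred_high, solTestY))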

c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

library(AppliedPredictiveModeling)
data(solubility)
gbmGrid1 <- expand.grid(.interaction.depth = 1, 
                        .n.trees = 100, 
                        .shrinkage = 0.1,
                        .n.minobsinnode=10)

set.seed(100)
gbmTune1 <- train(solTrainXtrans, 
                  solTrainY,
                  method = "gbm", 
                  tuneGrid = gbmGrid1, 
                  verbose = FALSE)

plot(varImp(gbmTune1), top = 30)

gbmGrid2 <- expand.grid(.interaction.depth = 20, 
                        .n.trees = 100, 
                        .shrinkage = 0.1,
                        .n.minobsinnode=10)

set.seed(100)
gbmTune2 <- train(solTrainXtrans, 
                  solTrainY,
                  method = "gbm", 
                  tuneGrid = gbmGrid2, 
                  verbose = FALSE)

plot(varImp(gbmTune2), top = 30)

Comparing the two plots, increasing the interaction.depth parameter spreads the importance across more predictors, which flattens the slope of the variable-importance profile.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Preprocessing

data("ChemicalManufacturingProcess")

preprocessing <- preProcess(ChemicalManufacturingProcess[, -1], 
                            method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])

yield <- as.matrix(ChemicalManufacturingProcess$Yield)

set.seed(789)
split2 <- yield %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

Xtrain.data  <- Xpreprocess[split2, ] #chem train
xtest.data <- Xpreprocess[-split2, ] #chem test
Ytrain.data  <- yield[split2, ] #yield train
ytest.data <- yield[-split2, ] #yield test
set.seed(100)
chem_rpart_tuned <- train(Xtrain.data, Ytrain.data, 
                          method = "rpart2", 
                          tuneLength = 10, 
                          trControl = trainControl(method = "cv"))

library(RWeka)  # provides Weka_control() for the M5 model

chem_m5_tuned <- train(Xtrain.data, Ytrain.data, 
                       method = "M5", 
                       tuneLength = 10, 
                       control = Weka_control(M = 10))

chem_bagged_tuned <- train(Xtrain.data, Ytrain.data, 
                      method = "treebag", 
                      tuneLength = 10, 
                      trControl = trainControl(method = "cv", number = 10))

chem_rf_tuned <- train(Xtrain.data, Ytrain.data, 
                      method = "rf", 
                      tuneLength = 10, 
                      trControl = trainControl(method = "cv", number = 10))

gbmGrid <- expand.grid(.interaction.depth = 1, 
                       .n.trees = 100, 
                       .shrinkage = 0.1,
                       .n.minobsinnode = 10)


chem_gbm_tuned <- train(Xtrain.data, Ytrain.data, 
                       method = "gbm", 
                       tuneGrid = gbmGrid,
                       verbose = FALSE)

chem_cubist_tuned <- train(Xtrain.data, Ytrain.data, 
                       method = "cubist")

# Note: these take the best value of each metric across the full tuning grid, so a
# model's RMSE, Rsquared, and MAE may come from different tuning-parameter settings.
RMSE <- c(min(chem_rpart_tuned$results$RMSE), min(chem_m5_tuned$results$RMSE), 
          min(chem_bagged_tuned$results$RMSE), min(chem_rf_tuned$results$RMSE),
          min(chem_gbm_tuned$results$RMSE), min(chem_cubist_tuned$results$RMSE))

Rsquared <- c(max(chem_rpart_tuned$results$Rsquared), max(chem_m5_tuned$results$Rsquared), 
              max(chem_bagged_tuned$results$Rsquared), max(chem_rf_tuned$results$Rsquared),
              max(chem_gbm_tuned$results$Rsquared), max(chem_cubist_tuned$results$Rsquared))

MAE <- c(min(chem_rpart_tuned$results$MAE), min(chem_m5_tuned$results$MAE), 
         min(chem_bagged_tuned$results$MAE), min(chem_rf_tuned$results$MAE), 
         min(chem_gbm_tuned$results$MAE), min(chem_cubist_tuned$results$MAE))

results <- cbind(RMSE, Rsquared, MAE) %>% data.frame(row.names = c("RPART", "M5", "BAG", "RF", "GBM", "CUBIST"))

kable(results) %>% kable_styling(bootstrap_options = "striped")
           RMSE  Rsquared       MAE
RPART  1.334645 0.5054817 1.0618190
M5     1.517643 0.4258644 1.0991966
BAG    1.138023 0.6642550 0.8995001
RF     1.140752 0.6978560 0.8950110
GBM    1.248370 0.5706191 0.9786424
CUBIST 1.180610 0.6165652 0.9265535

(a) Which tree-based regression model …

gives the optimal resampling and test set performance?

Based on the table above, the Cubist model is selected as the best performer, with a low RMSE and solid Rsquared. (The bagged tree and random forest report similar resampling numbers, but they were resampled with 10-fold cross-validation while GBM and Cubist used the default bootstrap, so the values are not strictly comparable.) Test-set performance is checked in the sketch below.
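
A quick sketch of my own to back up the test-set part of the claim: predict each tuned model on the held-out data and summarize with postResample().

test_perf <- sapply(
  list(RPART = chem_rpart_tuned, M5 = chem_m5_tuned, BAG = chem_bagged_tuned,
       RF = chem_rf_tuned, GBM = chem_gbm_tuned, CUBIST = chem_cubist_tuned),
  function(fit) postResample(predict(fit, xtest.data), ytest.data)
)
round(t(test_perf), 3)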

chem_cubist_tuned
## Cubist 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.801862  0.3562192  1.3305668
##    1          5          1.786225  0.3629854  1.3000932
##    1          9          1.790586  0.3608555  1.3102473
##   10          0          1.268888  0.5639529  0.9992920
##   10          5          1.246434  0.5793699  0.9715563
##   10          9          1.258836  0.5717701  0.9842097
##   20          0          1.204203  0.6013281  0.9524467
##   20          5          1.180610  0.6165652  0.9265535
##   20          9          1.194745  0.6077350  0.9389085
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

The final values used for the model were committees = 20 and neighbors = 5.

chem_cubist_tuned$bestTune
chem_cubist_tuned$finalModel
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## Number of samples: 144 
## Number of predictors: 56 
## 
## Number of committees: 20 
## Number of rules per committee: 2, 1, 1, 1, 7, 1, 6, 1, 6, 1, 4, 1, 1, 1, 7, 1, 3, 1, 4, 1
ggplot(chem_cubist_tuned)

library(plotmo)
plotmo(chem_cubist_tuned)
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                            -0.1070431          -0.04306519           -0.1062217
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##           -0.07283723         -0.004683137          -0.09620684
##  BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
##               0.06681          -0.04830923           -0.1178766
##  BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
##            -0.1012727          -0.03863564              0.1056672
##  ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
##               0.5096271              0.1087038              0.3424324
##  ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
##             -0.06529069             -0.2229103              0.4390925
##  ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
##               0.8941637              0.1066231             -0.1030952
##  ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
##              0.02020002             -0.4806937           -0.007834829
##  ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
##              0.04826216            -0.09295527             0.06169755
##  ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
##             0.005007187             0.06617593             -0.1360039
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##               0.0688801             -0.1744786             -0.1218132
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
##             -0.01031118             -0.1438567              0.0651293
##  ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
##              0.06432695             0.06918722              0.7255096
##  ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
##              -0.0066778             0.03954225             0.09273307
##  ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
##             -0.08632349              0.1836771              0.1182687
##  ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
##             -0.05513017              0.4884872            -0.03063781
##  ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
##               0.7174727               0.231727             -0.4626528
##  ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
##              -0.4405878              0.2027957             -0.1289558
##  ManufacturingProcess44 ManufacturingProcess45
##               0.2946725              0.1522024

(b)

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(chem_cubist_tuned)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32  94.667
## ManufacturingProcess09  64.000
## BiologicalMaterial03    60.000
## BiologicalMaterial02    50.667
## ManufacturingProcess13  46.667
## BiologicalMaterial12    42.667
## ManufacturingProcess04  42.667
## ManufacturingProcess33  42.667
## ManufacturingProcess39  34.667
## BiologicalMaterial06    34.667
## ManufacturingProcess15  26.667
## ManufacturingProcess29  25.333
## ManufacturingProcess37  20.000
## BiologicalMaterial08    18.667
## ManufacturingProcess02  18.667
## ManufacturingProcess27  14.667
## ManufacturingProcess26  14.667
## ManufacturingProcess31  13.333
## ManufacturingProcess01   9.333
plot(varImp(chem_cubist_tuned))
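
A quick tally (my own sketch) of how many of the Cubist model's top 20 predictors are process versus biological variables:

cub_imp_chem <- varImp(chem_cubist_tuned)$importance
top20 <- rownames(cub_imp_chem)[order(cub_imp_chem$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))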

Comparison

Ex. 6.3 Linear (Ridge)

ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(101)

ridgeRegFit <- train(Xtrain.data, Ytrain.data, 
                     method = "ridge", 
                     tuneGrid = ridgeGrid, 
                     trControl = trainControl(method = "cv", number = 10))

predictions <- ridgeRegFit %>% predict(xtest.data)

cbind(
  RMSE = RMSE(predictions, ytest.data),
  R_squared = R2(predictions, ytest.data)
)
##         RMSE R_squared
## [1,] 1.06489 0.5657402
varImp(ridgeRegFit)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   97.82
## ManufacturingProcess17   86.84
## BiologicalMaterial06     83.64
## BiologicalMaterial03     78.13
## ManufacturingProcess09   72.37
## BiologicalMaterial12     72.20
## ManufacturingProcess36   70.51
## BiologicalMaterial02     63.10
## ManufacturingProcess06   61.88
## ManufacturingProcess31   58.39
## BiologicalMaterial11     56.85
## ManufacturingProcess33   47.06
## ManufacturingProcess11   45.94
## BiologicalMaterial04     45.43
## ManufacturingProcess29   44.71
## BiologicalMaterial08     44.36
## ManufacturingProcess12   38.22
## BiologicalMaterial01     35.56
## BiologicalMaterial09     33.79
plot(varImp(ridgeRegFit))

Ex. 7.5 Non-Linear (MARS)

marsGrid <- expand.grid(.degree = 1:3, .nprune = 2:100)
set.seed(100)

chem_mars_tuned <- train(Xtrain.data, Ytrain.data,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv", number = 10))
marspred <- predict(chem_mars_tuned, newdata = xtest.data)
marspv <- postResample(pred = marspred, obs = ytest.data)

marspv
##      RMSE  Rsquared       MAE 
## 1.1997927 0.4779621 0.9605993
varImp(chem_mars_tuned)
## earth variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   60.24
## ManufacturingProcess14    0.00
## ManufacturingProcess22    0.00
## BiologicalMaterial02      0.00
## ManufacturingProcess30    0.00
## ManufacturingProcess38    0.00
## ManufacturingProcess29    0.00
## BiologicalMaterial01      0.00
## BiologicalMaterial05      0.00
## ManufacturingProcess40    0.00
## ManufacturingProcess01    0.00
## BiologicalMaterial06      0.00
## BiologicalMaterial10      0.00
## ManufacturingProcess21    0.00
## ManufacturingProcess33    0.00
## ManufacturingProcess31    0.00
## ManufacturingProcess28    0.00
## ManufacturingProcess44    0.00
## ManufacturingProcess12    0.00
plot(varImp(chem_mars_tuned))

Notes:

All three models rank manufacturing process variables among the most important predictors.

All models have ManufacturingProcess32 in their top 3.

Only the non-linear (MARS) model narrowed the list down to just two important predictors, and both are manufacturing process variables.

Based on the performance scores, the tree-based (Cubist) model is the best fit for these data.
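
To make the comparison concrete, a sketch of my own that lines up the top 10 predictors from each model (the helper top10 is mine):

top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
data.frame(Cubist = top10(chem_cubist_tuned),
           Ridge  = top10(ridgeRegFit),
           MARS   = top10(chem_mars_tuned))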

(c) Plot the optimal single tree …

with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rattle)  # fancyRpartPlot()

fancyRpartPlot(chem_rpart_tuned$finalModel, palettes = 'PuRd')

Yes, a graphical view of the data does provide a better picture. The tree shows which predictors are used to split, and at which thresholds, at each level of depth, and it depicts how combinations of biological and process predictors relate to higher or lower yield in the terminal nodes. The sketch below adds the distribution of the observed yield within each terminal node.
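
A closing sketch of my own: the terminal node that each training sample falls into, with the distribution of observed yield per node.

node_id <- factor(chem_rpart_tuned$finalModel$where)
boxplot(Ytrain.data ~ node_id,
        xlab = "terminal node", ylab = "yield",
        main = "Yield distribution within the terminal nodes")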