Exercise 8.1:

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a):

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6-V10)?

From the varImp output, the uninformative predictors V6-V10 received importance scores near zero (several are slightly negative), so the random forest did not make significant use of them. The most important variables are V1, V2, V4, and V5.
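
For a quick look at the ranking implied by these scores, the importance table can simply be re-sorted (a one-line check on the object above):

rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]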

(b):

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(123)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1

cor(simulated$duplicate1, simulated$V1)
## [1] 0.9504983

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                 Overall
## V1          5.183260546
## V2          6.318610678
## V3          0.697206467
## V4          6.952539867
## V5          1.986661056
## V6          0.233829847
## V7         -0.006712779
## V8         -0.083743959
## V9         -0.020932796
## V10        -0.069609685
## duplicate1  4.305479706

Looking at the variable importance output, V1's importance dropped from about 8.73 to 5.18, a decrease of roughly 3.5, and most of the informative predictors decreased somewhat as well, which makes sense: the trees can now split on duplicate1 in place of V1. The duplicated column itself receives an importance of about 4.31.
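
A side-by-side view makes the dilution easier to see. This is just a small sketch that merges the two importance tables computed above; the column labels are my own:

# compare importances before and after adding duplicate1
imp.compare <- merge(rfImp1, rfImp2, by = "row.names", all = TRUE)
colnames(imp.compare) <- c("Variable", "Without_duplicate1", "With_duplicate1")
imp.compare$Change <- imp.compare$With_duplicate1 - imp.compare$Without_duplicate1
imp.compare[order(-imp.compare$Without_duplicate1), ]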

set.seed(234)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .12

cor(simulated$duplicate2, simulated$V1)
## [1] 0.9243674
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.15929157
## V2          6.16758547
## V3          0.43858329
## V4          7.28389142
## V5          1.90255763
## V6          0.21213093
## V7         -0.01960485
## V8         -0.09552319
## V9         -0.06038266
## V10        -0.01800977
## duplicate1  3.14294997
## duplicate2  2.87803982

From the output, V1's importance was reduced even further (to about 4.16) when a second highly correlated predictor was added; the importance is now shared across V1, duplicate1, and duplicate2.

(c):

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
set.seed(345)
cforest1 <- cforest(y ~ ., data = simulated[, 1:11], controls = cforest_control(ntree = 1000)) # original predictors only; excludes the correlated duplicates

cforest2 <- cforest(y ~ ., data = simulated[, 1:12], controls = cforest_control(ntree = 1000)) # adds duplicate1, highly correlated with V1

cforest3 <- cforest(y ~ ., data = simulated, controls = cforest_control(ntree = 1000)) # adds both duplicate1 and duplicate2
cforest.imp1 <- varimp(cforest1)
cforest.imp2 <- varimp(cforest2)
cforest.imp3 <- varimp(cforest3)

cforest.imp4 <- varimp(cforest1, conditional = TRUE)
cforest.imp5 <- varimp(cforest2, conditional = TRUE)
cforest.imp6 <- varimp(cforest3, conditional = TRUE)
library(knitr)

var.names <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "Corr1", "Corr2")

cforest.imp1 <- data.frame(Original_Simulated = cforest.imp1, Variable = factor(names(cforest.imp1), levels = var.names))
cforest.imp2 <- data.frame(Original_Simulated_Correlated1 = cforest.imp2, Variable = factor(names(cforest.imp2), levels = var.names))

cforest.imp3 <- data.frame(Original_Simulated_Correlated2 = cforest.imp3, Variable = factor(names(cforest.imp3), levels = var.names))
cforest.imp4 <- data.frame(Original_Simulated2 = cforest.imp4, Variable = factor(names(cforest.imp4), levels = var.names))
cforest.imp5 <- data.frame(Original_Simulated2_Correlated1 = cforest.imp5, Variable = factor(names(cforest.imp5), levels = var.names))
cforest.imp6 <- data.frame(Original_Simulated2_Correlated2 = cforest.imp6, Variable = factor(names(cforest.imp6), levels = var.names))


cforest.imp <- merge(cforest.imp1, cforest.imp2, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp3, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp4, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp5, all = TRUE)
cforest.imp <- merge(cforest.imp, cforest.imp6, all = TRUE)

cforest.imp$Variable <- factor(cforest.imp$Variable, levels = var.names)
cforest.imp <- cforest.imp[order(cforest.imp$Variable), ]
kable(cforest.imp)
|Variable |Original_Simulated |Original_Simulated_Correlated1 |Original_Simulated_Correlated2 |Original_Simulated2 |Original_Simulated2_Correlated1 |Original_Simulated2_Correlated2 |
|:--------|------------------:|------------------------------:|------------------------------:|-------------------:|-------------------------------:|-------------------------------:|
|V1  |9.2912099 |5.6900862 |4.2663934 |3.1088424 |1.1760428 |0.5722840 |
|V2  |7.2699157 |6.3294000 |6.1227909 |3.9328688 |3.5972613 |3.4458574 |
|V3  |0.0465104 |0.0468226 |0.0571431 |-0.0043185 |0.0004532 |0.0265667 |
|V4  |8.7462832 |8.0197883 |7.8767015 |4.9071478 |4.6617310 |4.4435125 |
|V5  |2.1798340 |2.0186085 |1.7982137 |0.7181428 |0.8664706 |0.7762352 |
|V6  |0.0153615 |-0.0050624 |0.0521642 |0.0065393 |0.0042916 |0.0194132 |
|V7  |0.0998168 |0.0497234 |0.0420129 |0.0287170 |0.0067734 |0.0099143 |
|V8  |-0.0447170 |-0.0436383 |-0.0092861 |-0.0193337 |-0.0133336 |0.0008951 |
|V9  |-0.0281041 |-0.0261491 |-0.0027388 |0.0001600 |-0.0064891 |0.0005056 |
|V10 |-0.0378497 |-0.0237428 |-0.0056340 |0.0037756 |-0.0039407 |-0.0004141 |
|NA  |NA |4.0672712 |2.9405583 |NA |0.8991779 |0.5276705 |
|NA  |NA |4.0672712 |2.9405583 |NA |0.8991779 |0.2393722 |
|NA  |NA |4.0672712 |2.6147749 |NA |0.8991779 |0.5276705 |
|NA  |NA |4.0672712 |2.6147749 |NA |0.8991779 |0.2393722 |
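
Reading the table, the conditional importances show the same broad pattern as the traditional random forest: V1, V2, V4, and V5 sit at the top and V6-V10 stay near zero, but the conditional measure shrinks V1 and its correlated duplicates much more aggressively once duplicate1 and duplicate2 are added. The NA rows at the bottom are an artifact of the code above: varimp() names the extra predictors "duplicate1" and "duplicate2", while var.names only contains the placeholder labels "Corr1" and "Corr2", so factor() maps them to NA and the chained merge() calls then cross-join those NA rows (the same mismatch produces the NA rows in the bagged-tree table in part (d)). A minimal fix, if re-running the merging code above, would be to use the actual column names as the factor levels:

var.names <- c(paste0("V", 1:10), "duplicate1", "duplicate2")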

(d):

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(ipred)
set.seed(789)
bg1 <- bagging(y ~ ., data = simulated[, 1:11], nbagg = 50) # original predictors only
bg2 <- bagging(y ~ ., data = simulated[, 1:12], nbagg = 50) # adds duplicate1
bg3 <- bagging(y ~ ., data = simulated, nbagg = 50)         # adds both duplicates

bg.varimp1 <- varImp(bg1)
bg.varimp2 <- varImp(bg2)
bg.varimp3 <- varImp(bg3)

names(bg.varimp1) <- "Original_Simulated"
names(bg.varimp2) <- "Simulated_Correlated1"
names(bg.varimp3) <- "Simulated_Correlated2"

bg.varimp1$Variable <- factor(rownames(bg.varimp1), levels = var.names)
bg.varimp2$Variable <- factor(rownames(bg.varimp2), levels = var.names)
bg.varimp3$Variable <- factor(rownames(bg.varimp3), levels = var.names)

bg.imp <- merge(bg.varimp1, bg.varimp2, all = TRUE)
bg.imp <- merge(bg.imp, bg.varimp3, all = TRUE)

bg.imp <- bg.imp[order(bg.imp$Variable), ]
kable(bg.imp)
|Variable |Original_Simulated |Simulated_Correlated1 |Simulated_Correlated2 |
|:--------|------------------:|---------------------:|---------------------:|
|V1  |1.8937773 |1.9005476 |1.9169439 |
|V2  |2.4074842 |2.1294167 |1.9858935 |
|V3  |1.4737451 |0.9737914 |0.9478109 |
|V4  |2.8503501 |2.6136684 |2.3375513 |
|V5  |2.4934352 |2.2500432 |2.0839561 |
|V6  |0.9243688 |0.8501971 |0.7452643 |
|V7  |1.1582842 |0.8305714 |0.7755231 |
|V8  |0.5758429 |0.4132404 |0.4788912 |
|V9  |0.7445645 |0.6010269 |0.4456381 |
|V10 |0.7938636 |0.6764631 |0.7833463 |
|NA  |NA |1.6696019 |1.7558808 |
|NA  |NA |1.6696019 |1.5874815 |

For the most part, the trend follows the other models. I was surprised to see V5 ranking higher here, and the uninformative predictors (V6-V10) also receive noticeably higher importance scores from the bagged trees than they did from the random forests. Next I will fit Cubist models with the Cubist package, setting committees = 100 so that 100 rule-based models are built and combined.

library(Cubist)
set.seed(891)
simulated2 <- simulated[-12] # drops duplicate1, so this set keeps only duplicate2
simulated3 <- simulated[-13] # drops duplicate2, so this set keeps only duplicate1
cb1 <- cubist(x = simulated[, 1:10], y = simulated$y, committees = 100) # original predictors only
cb2 <- cubist(x = simulated2[, names(simulated2) != "y"], y = simulated2$y, committees = 100)
cb3 <- cubist(x = simulated3[, names(simulated3) != "y"], y = simulated3$y, committees = 100)

cb.varimp1 <- varImp(cb1)
cb.varimp2 <- varImp(cb2)
cb.varimp3 <- varImp(cb3)

cb.varimp1
##     Overall
## V1     71.5
## V3     47.0
## V2     58.5
## V4     48.0
## V5     33.0
## V6     13.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0
cb.varimp2
##            Overall
## V3            48.5
## V2            58.5
## V1            58.0
## V4            47.5
## duplicate2    21.5
## V5            36.5
## V6            14.0
## V8             2.0
## V10            1.0
## V9             0.5
## V7             0.0
cb.varimp3
##            Overall
## V1            50.0
## V3            46.0
## V2            58.0
## V4            47.5
## V5            30.5
## duplicate1    25.5
## V8             7.5
## V6             3.5
## V7             0.0
## V9             0.0
## V10            0.0

The Cubist models put most of their importance on V1-V5, with V3 far more prominent than it was in the random forest models, and they also assign moderate importance to the duplicate correlated variables. V1's score drops from 71.5 to roughly 50-58 when a duplicate is present, a milder version of the dilution seen earlier.

Exercise 8.2:

Use a simulation to show tree bias with different granularities.

The text mentions that regression trees suffer from selection bias: predictors with more distinct values (and therefore more possible split points) are more likely to be chosen for splits near the top of the tree, even when they have little or no relationship with the response. To demonstrate this with a simulation, we generate two predictors and one response. X1 is a low-granularity predictor that takes only two values and is genuinely related to the response; X2 is a continuous, high-granularity predictor that is unrelated to it. We then fit a tree and compare variable importance: X1 ought to be the important variable, while X2 ought to be negligible.

set.seed(147)

X1 <- rep(1:2, each = 100)
Y <- X1 + rnorm(200, mean = 0, sd = 4)

set.seed(148)
X2 <- rnorm(200, mean = 0, sd = 2)

simData <- data.frame(Y = Y, X1 = X1, X2 = X2)

library(rpart)
set.seed(149)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
##       Overall
## X1 0.05409211
## X2 0.34896693

X2 should not have higher importance than X1, yet the output shows the opposite: the uninformative but high-granularity predictor outranks the informative predictor that has only two possible split points, illustrating the selection bias described above.
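
To make the bias clearer, here is a minimal sketch (rpart is already loaded above) that repeats the simulation many times and records which predictor is chosen for the root split; the high-granularity noise predictor X2 should win the root split in a large share of the runs:

set.seed(150)
firstSplit <- replicate(100, {
  X1 <- rep(1:2, each = 100)          # informative, but only 2 distinct values
  X2 <- rnorm(200, mean = 0, sd = 2)  # uninformative, many distinct values
  Y <- X1 + rnorm(200, mean = 0, sd = 4)
  fit <- rpart(Y ~ X1 + X2, data = data.frame(Y, X1, X2))
  as.character(fit$frame$var[1])      # splitting variable at the root ("<leaf>" if no split)
})
table(firstSplit)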

Exercise 8.3:

In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (.1 and .9) and the learning rate (.1 and .9) for the solubility data. The left-hand plot has both parameters set to .1 and the right-hand plot has both set to .9:

(a):

Why does the model on the right focus its importance on just the first few of the predictors, whereas the model on the left spreads importance across more predictors?

Answer:

The model becomes increasingly greedy as the learning rate approaches 1: each tree contributes a larger fraction of the fit, so the boosting process keeps returning to the same strong predictors and identifies fewer variables associated with the response. As the bagging fraction rises, each tree is built on nearly the same data, which makes the trees more alike and again concentrates the splits on the same predictors. Combined, raising the learning rate and bagging fraction from .1 to .9 means fewer predictors carry the importance, which is why the right-hand plot focuses on just the first few.
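
To see this numerically rather than only from Fig. 8.24, here is a rough sketch on the solubility data (it assumes the gbm and AppliedPredictiveModeling packages are installed; the tree count and interaction depth are arbitrary choices, not the settings behind Fig. 8.24). The relative-influence summary of the .9/.9 fit should be concentrated on a handful of predictors, while the .1/.1 fit spreads it out:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # solTrainXtrans, solTrainY, solTestXtrans, solTestY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(200)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(200)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLow,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbmHigh, plotit = FALSE), 10)   # importance concentrated on the first few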

(b):

Which model do you think would be more predictive of other samples?

Answer:

I think the model with both parameters set to .1 would be more predictive of other samples: the slower learning rate and smaller bagging fraction act like regularization, whereas the .9/.9 model fits greedily and is more likely to overfit the training data.
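
Continuing the sketch from part (a) (it reuses the gbmLow and gbmHigh fits defined there, so it is only illustrative), a quick held-out comparison with caret's RMSE() helper:

predLow  <- predict(gbmLow,  solTestXtrans, n.trees = 100)
predHigh <- predict(gbmHigh, solTestXtrans, n.trees = 100)
c(low_0.1 = RMSE(predLow, solTestY), high_0.9 = RMSE(predHigh, solTestY))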

(c):

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer:

Increasing the interaction depth lets each tree use more predictors, which spreads variable importance over more of them and flattens the slope of the importance profile for either model.
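
One way to check this with the same sketch (again reusing solTrain from part (a)) is to refit the .9/.9 model with a deeper interaction depth and look at the relative-influence summary; deeper trees can involve more predictors per tree, which should spread the importance out:

set.seed(200)
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmDeep, plotit = FALSE), 10)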

Exercise 8.7:

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(caret)

set.seed(100)

data(ChemicalManufacturingProcess)

# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(624)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]

(a):

Which tree-based regression model gives the optimal resampling and test set performance?

Single Tree

set.seed(100)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.5632389 0.5035625 1.1913334

Bagged Trees

set.seed(100)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2311358 0.7141633 0.8911824

Random Forest

set.seed(100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2598015 0.7216121 0.8998200

Boosted Trees

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.1824945 0.7195344 0.8678498

Cubist

set.seed(100)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.9804618 0.7964499 0.7363520

The Cubist model has the lowest test set RMSE and the highest R-squared, making it the best of the tree-based models here.

rbind(cart = postResample(cartPred, test_y),
      bagged = postResample(baggedPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                   RMSE  Rsquared       MAE
## cart         1.5632389 0.5035625 1.1913334
## bagged       1.2311358 0.7141633 0.8911824
## randomForest 1.2598015 0.7216121 0.8998200
## boosted      1.1824945 0.7195344 0.8678498
## cubist       0.9804618 0.7964499 0.7363520
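
The table above compares test set performance; for the resampling side of the question, the caret-tuned fits store their resampled estimates, which can be pulled out with getTrainPerf() (a quick check; baggedTree and rfModel were fit outside train(), so they have no stored resampling results):

getTrainPerf(cartTune)
getTrainPerf(gbmTune)
getTrainPerf(cubistTuned)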

(b):

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The manufacturing process variables dominate the top-20 list below at a ratio of 16:4, whereas the optimal linear and nonlinear models had a more even split of about 11:9.

varImp(cubistTuned, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess17   62.35
## ManufacturingProcess09   54.12
## BiologicalMaterial03     54.12
## BiologicalMaterial02     52.94
## ManufacturingProcess13   38.82
## BiologicalMaterial06     36.47
## ManufacturingProcess04   34.12
## ManufacturingProcess33   32.94
## ManufacturingProcess10   22.35
## ManufacturingProcess11   18.82
## ManufacturingProcess14   17.65
## ManufacturingProcess39   17.65
## ManufacturingProcess28   14.12
## ManufacturingProcess16   12.94
## BiologicalMaterial12     12.94
## ManufacturingProcess31   12.94
## ManufacturingProcess29   12.94
## ManufacturingProcess02   12.94
## ManufacturingProcess21   11.76
plot(varImp(cubistTuned), top = 20) 

For the tree-based model, only 3 of the top 10 predictors are biological variables, compared to 4 in the linear and nonlinear models. ManufacturingProcess32 is still deemed the most important, while the remaining predictors carry less importance. BiologicalMaterial06 is only the seventh most important here, where it was the second most important in the other models. Some predictors that were not previously in the top 10 now appear there, such as ManufacturingProcess17, 04, 33, and 10.

(c):

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(partykit) # for as.party()

rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])

plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))