library(ggplot2)
library(magrittr)
Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling (p. 218).
Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest model made little use of the uninformative predictors (V6 – V10); their importance scores are near zero (and in some cases negative) compared with those of the informative predictors.
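As a quick check, the scores for the uninformative predictors can be pulled out of rfImp1 directly:
## Importance of the uninformative predictors only
rfImp1[paste0("V", 6:10), , drop = FALSE]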
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
After adding a predictor that is highly correlated with V1, the importance score for V1 decreased significantly. A further decrease (i.e., dilution of importance) occurred after another highly correlated predictor was added. The newly added redundant predictors, by virtue of their high correlation with V1, themselves receive substantial importance values.
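The dilution is easy to see by lining up V1's score across the three fits; a quick sketch using the rfImp objects computed above:
## V1's importance with zero, one, and two highly correlated copies added
c(noDuplicate = rfImp1["V1", "Overall"],
  oneDuplicate = rfImp2["V1", "Overall"],
  twoDuplicates = rfImp3["V1", "Overall"])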
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## The mtry parameter should be the number of predictors
## (the number of columns minus 1 for the outcome).
sim_partc <- simulated[, 1:11]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 V1 V5 V8
## 6.034951318 4.339605372 3.318573594 0.742661197 -0.011539167
## V7 V9 V10 V6 V3
## 0.008964129 -0.008083828 0.005557381 -0.003772097 -0.001281939
sim_partc <- simulated[, 1:12]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 duplicate1 V5 V1
## 5.996272076 4.670717938 0.791541391 0.675670117 0.588175959
## V10 V7 V6 V3 V9
## 0.017414135 0.011714376 0.007003095 0.003935970 -0.001919159
## V8
## -0.001345279
sim_partc <- simulated[, 1:13]
bagCtrl <- cforest_control(mtry = ncol(sim_partc) - 1)
baggedTree <- cforest(y ~., data = sim_partc, controls = bagCtrl)
imp_partc <- varimp(baggedTree, conditional = TRUE)
imp_partc[order(-abs(imp_partc))]
## V4 V2 V5 duplicate1 V1
## 5.695196980 4.418333064 0.796500365 0.662161136 0.471536152
## V10 duplicate2 V8 V3 V7
## 0.026446640 0.023248430 -0.004928977 0.004543721 0.004361530
## V6 V9
## -0.001310992 0.001260097
In the above case, the decline in importance is much less pronounced than with the traditional random forest model.
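For reference, the unconditional (traditional) permutation importance from the same cforest fit can be obtained by toggling the conditional argument; the exact values vary because the permutations are random:
## Traditional (unconditional) importance from the last cforest fit
imp_uncond <- varimp(baggedTree, conditional = FALSE)
imp_uncond[order(-abs(imp_uncond))]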
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
## Boosted Trees
library(gbm)
sim_data <- simulated[, 1:11]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 30.5443022
## V1 V1 26.3506687
## V2 V2 24.4445984
## V5 V5 10.8300221
## V3 V3 7.3794063
## V6 V6 0.4510024
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
sim_data <- simulated[, 1:12]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 29.1254395
## V2 V2 23.1244921
## duplicate1 duplicate1 17.7606567
## V1 V1 11.9064649
## V5 V5 9.2661917
## V3 V3 8.4007807
## V6 V6 0.2484665
## V9 V9 0.1675079
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
sim_data <- simulated[, 1:13]
gbmModel <- gbm(y ~., data = sim_data, distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 28.2782328
## V2 V2 21.6991520
## duplicate1 duplicate1 14.1929228
## V1 V1 13.5423876
## V5 V5 11.8066539
## V3 V3 8.9282442
## duplicate2 duplicate2 0.8028415
## V7 V7 0.3469890
## V10 V10 0.1481724
## V6 V6 0.1273970
## V9 V9 0.1270067
## V8 V8 0.0000000
## Cubist Model
## Index of the response column ("y" sits in the same position in every subset)
yCol <- which(colnames(simulated) == "y")
sim_data <- simulated[, 1:11]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V1 72.0
## V2 54.5
## V4 49.0
## V3 42.0
## V5 40.0
## V6 11.0
## V8 0.0
## V9 0.0
## V10 0.0
## V7 0.0
sim_data <- simulated[, 1:12]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V2 62.0
## V1 55.5
## V4 50.0
## V3 42.0
## duplicate1 37.0
## V5 31.0
## V6 15.5
## V8 0.0
## V7 0.0
## V9 0.0
## V10 0.0
sim_data <- simulated[, 1:13]
cubistModel <- train(sim_data[, -yCol],
sim_data[, yCol],
method = "cubist")
varImp(cubistModel, scale = FALSE)
## cubist variable importance
##
## Overall
## V2 69.5
## V1 54.0
## V4 50.0
## V5 38.0
## V3 32.0
## duplicate1 25.0
## duplicate2 25.0
## V6 10.0
## V8 3.0
## V10 0.0
## V7 0.0
## V9 0.0
With boosted trees and Cubist, the decline in V1's importance is again more gradual than with the traditional random forest model. However, more variables are assigned importance values greater than zero, and the newly added (highly correlated) predictors still appear among the influential variables.
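One way to see that the importance is being split rather than lost is to total the relative influence of V1 and its copies in the last boosted fit; a quick sketch (summary.gbm with plotit = FALSE returns the relative-influence table without drawing the bar chart):
gbmInf <- summary(gbmModel, plotit = FALSE)
sum(gbmInf$rel.inf[gbmInf$var %in% c("V1", "duplicate1", "duplicate2")])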
Use a simulation to show tree bias with different granularities.
varImpSorted <- function(dfVarImp) {
varImpRows <- order(abs(dfVarImp$Overall), decreasing = TRUE)
dfResult <- data.frame(dfVarImp[varImpRows, 1],
row.names = rownames(dfVarImp)[varImpRows])
colnames(dfResult) <- colnames(dfVarImp)
return(dfResult)
}
sim_data <- simulated[, 1:11]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V1 8.85865210
## V4 7.79041405
## V2 6.57684003
## V5 2.15062112
## V3 0.69517829
## V6 0.16551981
## V8 -0.10270669
## V9 -0.07651010
## V7 0.02333943
## V10 -0.01594849
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.554986 0.823782 2.092077
sim_data <- simulated[, 1:12]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V4 6.92636679
## V2 6.19094491
## V1 5.49235852
## duplicate1 4.39153041
## V5 1.88461384
## V3 0.64406350
## V6 0.12649095
## V8 -0.09127384
## V7 0.06673548
## V9 0.02758416
## V10 0.01402926
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.6208229 0.7940312 2.1560878
sim_data <- simulated[, 1:13]
rfModel <- randomForest(y ~ .,
data = sim_data,
importance = TRUE,
ntree = 1000)
varImpSorted(varImp(rfModel, scale = FALSE))
## Overall
## V4 6.913866151
## V2 6.493718347
## V1 5.106983492
## duplicate1 3.570249805
## duplicate2 1.917100445
## V5 1.914067144
## V3 0.486767309
## V6 0.077448314
## V8 -0.068658456
## V9 -0.035937291
## V7 0.014490091
## V10 -0.002880012
postResample(pred = predict(rfModel), obs = sim_data$y)
## RMSE Rsquared MAE
## 2.6300680 0.7706705 2.1649502
With the added (highly correlated) predictors, model performance suffers, based on the RMSE and the other performance metrics shown above. The increase in RMSE as more correlated predictors are added indicates an increase in bias and model underfitting.
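Tree bias with respect to predictor granularity, which the prompt asks about, can also be illustrated with a small simulation; the sketch below uses rpart with two equally uninformative predictors that differ only in their number of distinct values, and the fine-grained predictor is expected to win the first split far more often.
library(rpart)
set.seed(624)
## X1 is continuous (fine granularity), X2 takes only 4 values (coarse);
## neither is related to y, so any preference reflects split-selection bias.
firstSplitVar <- replicate(200, {
  dat <- data.frame(y  = rnorm(100),
                    X1 = rnorm(100),
                    X2 = sample(1:4, 100, replace = TRUE))
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(firstSplitVar)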
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The text states that the importance profile for boosting has a much steeper importance slope than the one for random forests. This is because the trees in boosting are dependent on each other and hence have correlated structures, as the method follows the gradient. Therefore, many of the same predictors are selected across the trees, increasing their contribution to the importance metric. With higher values of the bagging fraction, more of the training data end up being used in each iteration. Likewise, a higher learning rate allows a larger fraction of the current predicted value to be added to the previous iteration's predicted value. In short, the closer these two parameters are to 1, the closer the model is to regular (unregularized) boosting behavior, and therefore the importance curve exhibits the steep slope described in the excerpt from the text. On the flip side, lowering these two tuning parameters (and thereby regularizing the model more) spreads importance over more of the predictors, and at higher levels.
In my opinion, the model on the left, which is more regularized by the lower values of the two tuning parameters, will exhibit smaller variance on new samples, have a lower RMSE, and therefore be more predictive.
Increasing the interaction depth (tree depth) is likely to draw the model closer to a random forest and would therefore flatten the slope of predictor importance more for the left-hand model, but it is unlikely to change the steep slope of the right-hand model.
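For concreteness, the two extremes discussed above can be expressed as gbm fits; a sketch assuming the solubility data from the AppliedPredictiveModeling package (data(solubility), which provides solTrainXtrans and solTrainY) is available:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
## Left-hand settings: small bagging fraction and learning rate
gbmLow <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                  n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                  interaction.depth = 1, verbose = FALSE)
## Right-hand settings: both parameters near 1
gbmHigh <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 1, verbose = FALSE)
## The right-hand model is expected to concentrate its relative influence
## on fewer predictors (a steeper importance profile).
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)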
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Data Pre-Processing
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
preP <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale"))
df <- predict(preP, ChemicalManufacturingProcess)
## Restore the response variable values to original
df$Yield = ChemicalManufacturingProcess$Yield
## Split the data into a training and a test set
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
df.train <- df[trainRows, ]
df.test <- df[-trainRows, ]
Tree-based Regression Models Training
colYield <- which(colnames(df) == "Yield")
trainX <- df.train[, -colYield]
trainY <- df.train$Yield
testX <- df.test[, -colYield]
testY <- df.test$Yield
## Single Tree Models
## Model 1 tunes over the complexity parameter
st1Model <- train(trainX, trainY,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
st1Model.train.pred <- predict(st1Model)
st1Model.test.pred <- predict(st1Model, newdata = testX)
## Model 2 tunes over the maximum depth
st2Model <- train(trainX, trainY,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
st2Model.train.pred <- predict(st2Model)
st2Model.test.pred <- predict(st2Model, newdata = testX)
## Model Tree Models
library(RWeka)
## Model-Based version
m5Model <- M5P(trainY~., data = trainX)
m5Model.train.pred <- predict(m5Model)
m5Model.test.pred <- predict(m5Model, newdata = testX)
## Rule-Based version
m5RModl <- M5Rules(trainY~., data = trainX)
m5RModl.train.pred <- predict(m5RModl)
m5RModl.test.pred <- predict(m5RModl, newdata = testX)
## Bagged Tree Model
library(party)
bagCtrl <- cforest_control(mtry = ncol(trainX))
bagTree <- cforest(trainY ~., data = trainX, controls = bagCtrl)
bagTree.train.pred <- predict(bagTree)
bagTree.test.pred <- predict(bagTree, newdata = testX)
## Random Forest Model
library(randomForest)
library(caret)
rfModel <- randomForest(trainY ~ .,
data = trainX,
importance = TRUE,
ntree = 1000)
rfModel.train.pred <- predict(rfModel)
rfModel.test.pred <- predict(rfModel, newdata = testX)
## Boosted Trees
library(gbm)
gbmModel <- gbm(trainY ~ ., data = trainX, distribution = "gaussian")
gbmModel.train.pred <- predict(gbmModel, n.trees = 100)
gbmModel.test.pred <- predict(gbmModel, n.trees = 100, newdata = testX)
## Cubist Model
cubistModel <- train(trainX,
trainY,
method = "cubist")
cubistModel.train.pred <- predict(cubistModel)
cubistModel.test.pred <- predict(cubistModel, newdata = testX)
print("Train set performance")
## [1] "Train set performance"
rbind(
"st1Model" = postResample(pred = st1Model.train.pred, obs = trainY),
"st2Model" = postResample(pred = st2Model.train.pred, obs = trainY),
"m5Model" = postResample(pred = m5Model.train.pred, obs = trainY),
"m5RModl" = postResample(pred = m5RModl.train.pred, obs = trainY),
"bagged" = postResample(pred = bagTree.train.pred, obs = trainY),
"rforest" = postResample(pred = rfModel.train.pred, obs = trainY),
"boosted" = postResample(pred = gbmModel.train.pred, obs = trainY),
"cubist" = postResample(pred = cubistModel.train.pred, obs = trainY)
)
## RMSE Rsquared MAE
## st1Model 1.2077967 0.5389803 0.9442928
## st2Model 1.0374397 0.6598601 0.8030060
## m5Model 0.7011910 0.8501936 0.5682746
## m5RModl 0.8687389 0.7662811 0.6683153
## bagged 0.8346806 0.8082896 0.6455589
## rforest 1.1127120 0.6337759 0.8672670
## boosted 0.7789427 0.8305777 0.5979851
## cubist 0.1754463 0.9920887 0.1342076
print("Test set performance")
## [1] "Test set performance"
rbind(
"st1Model" = postResample(pred = st1Model.test.pred, obs = testY),
"st2Model" = postResample(pred = st2Model.test.pred, obs = testY),
"m5Model" = postResample(pred = m5Model.test.pred, obs = testY),
"m5RModl" = postResample(pred = m5RModl.test.pred, obs = testY),
"bagged" = postResample(pred = bagTree.test.pred, obs = testY),
"rforest" = postResample(pred = rfModel.test.pred, obs = testY),
"boosted" = postResample(pred = gbmModel.test.pred, obs = testY),
"cubist" = postResample(pred = cubistModel.test.pred, obs = testY)
)
## RMSE Rsquared MAE
## st1Model 1.532614 0.4659000 1.2471736
## st2Model 1.481003 0.5067299 1.2167997
## m5Model 1.249298 0.6448689 0.9287938
## m5RModl 1.674628 0.3860707 1.2093524
## bagged 1.384970 0.5808880 1.0562028
## rforest 1.244433 0.6967441 0.9453439
## boosted 1.337591 0.6025043 0.9765952
## cubist 1.092527 0.7482687 0.8152786
Based on the training and test performance metrics, the Cubist model appears to be the best performer.
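For reference, the tuning values caret selected for the Cubist model (committees and neighbors) can be read off the train object:
## Final tuning parameters chosen by resampling
cubistModel$bestTune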
varImpSorted <- function(dfVarImp) {
varImpRows <- order(abs(dfVarImp$Overall), decreasing = TRUE)
dfResult <- data.frame(dfVarImp[varImpRows, 1],
row.names = rownames(dfVarImp)[varImpRows])
colnames(dfResult) <- colnames(dfVarImp)
return(dfResult)
}
mdlVarImp <- varImp(cubistModel)
plot(mdlVarImp)
mdlvarImp <- varImpSorted(mdlVarImp$importance)
print("Top 10 important predictors")
## [1] "Top 10 important predictors"
head(mdlvarImp, 10)
## Overall
## ManufacturingProcess17 100.00000
## ManufacturingProcess32 95.65217
## ManufacturingProcess39 52.17391
## ManufacturingProcess09 51.08696
## ManufacturingProcess33 44.56522
## BiologicalMaterial12 36.95652
## BiologicalMaterial03 31.52174
## ManufacturingProcess29 25.00000
## ManufacturingProcess04 22.82609
## ManufacturingProcess22 21.73913
print(paste("Number of process variables:",
length(grep(
"Manuf.*", rownames(mdlvarImp)[which(mdlvarImp$Overall > 0)]
))))
## [1] "Number of process variables: 26"
print(paste("Number of biological variables:",
length(grep(
"Bio.*", rownames(mdlvarImp)[which(mdlvarImp$Overall > 0)]
))))
## [1] "Number of biological variables: 9"
Based on the above numbers, process variables dominate the importance list over biological ones.
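A rough way to quantify that dominance is to total the scaled importance by predictor type, using the mdlvarImp table computed above:
## Total scaled importance, grouped into process vs. biological predictors
predType <- ifelse(grepl("^Manufacturing", rownames(mdlvarImp)),
                   "process", "biological")
tapply(mdlvarImp$Overall, predType, sum)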
Most of the top 10 predictors in this model also appear among the top 10 predictors from the optimal linear and nonlinear models, although with different importance levels, which is to be expected.
library(partykit)
plot(as.party(st2Model$finalModel))
The graphical view of the optimal single tree above provides quick, intuitive insight into the biological and process predictors and their relationship with yield. For example, one can navigate the tree visually and trace the paths that lead to the highest yields; the diagram makes it evident that a handful of key process predictors drive the highest-yield nodes.
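If a text view is preferred, the same splits and node averages can also be printed from the underlying rpart fit:
## Printed form of the final single-tree model (split rules and node means)
print(st2Model$finalModel)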