Chapter 8 - Regression Trees and Rule-Based Models

8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance score.

Random forests are more computationally efficient than boosting on a tree-by-tree basis because the tree-building process only evaluates a random fraction of the original predictors at each split, although random forests usually require more trees. Combined with the fact that the trees can be built in parallel, this makes random forests more computationally efficient than boosting overall.
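For example, the trees can be grown in parallel across workers. A minimal sketch of this idea (assuming the foreach and doParallel packages are installed; two workers each grow 500 trees and randomForest::combine() merges them into one forest):

library(randomForest)
library(foreach)
library(doParallel)

# start a small cluster and register it as the parallel backend
cl <- makeCluster(2)
registerDoParallel(cl)

# each worker grows part of the forest; combine() merges the pieces
rf_parallel <- foreach(ntree = rep(500, 2), .combine = randomForest::combine,
                       .packages = "randomForest") %dopar% {
  randomForest(y ~ ., data = simulated, ntree = ntree)
}

stopCluster(cl)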

Let’s fit a random forest model to the simulated data and examine the variable (feature) importance.

library(randomForest)
library(caret)

model.rf<- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
model.rf
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.754258
##                     % Var explained: 72.3

Next, we compute the variable importance. The table below shows each predictor and its overall score.

rf.imp <- varImp(model.rf, scale = FALSE)
rf.imp
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

We can use varImpPlot to plot the variables and see their ranking by importance.

varImpPlot(model.rf)

We can see that variables V6 through V10 do not appear to be important at all.

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

simulated$dup1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$dup1, simulated$V1)
## [1] 0.9460206

The new predictor dup1 has a correlation of about 0.95 with V1.

model.rf2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
model.rf2
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.922537
##                     % Var explained: 71.61
rf2.Imp <- varImp(model.rf2, scale = FALSE)
rf2.Imp
##          Overall
## V1    5.69119973
## V2    6.06896061
## V3    0.62970218
## V4    7.04752238
## V5    1.87238438
## V6    0.13569065
## V7   -0.01345645
## V8   -0.04370565
## V9    0.00840438
## V10   0.02894814
## dup1  4.28331581
varImpPlot(model.rf2)

We can see that the importance score for V1 decreased substantially (from about 8.7 to 5.7) after the highly correlated duplicate predictor was added, with part of its importance shifting to dup1.
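As a further check, we could add a second predictor that is also highly correlated with V1. A sketch is below (using a copy of the data so the code that follows is unaffected); with two near-duplicates of V1 in the model, V1's importance would be expected to be diluted even further across the correlated columns.

# work on a copy so the original simulated data stays unchanged
simulated2 <- simulated
simulated2$dup2 <- simulated2$V1 + rnorm(200) * .1

model.rf3 <- randomForest(y ~ ., data = simulated2,
                          importance = TRUE,
                          ntree = 1000)
varImp(model.rf3, scale = FALSE)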

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(data.table)
library(party)

simulated <- subset(simulated, select = -dup1)

model3 <- cforest(y ~ ., data = simulated)
feature.imp <- varimp(model3)
feature.imp
##            V1            V2            V3            V4            V5 
##  9.0625534937  6.7499292185  0.0207328511  8.2653505901  1.8439761772 
##            V6            V7            V8            V9           V10 
##  0.0005993854  0.0344981972 -0.0318008565  0.0046931439 -0.0231819864

The cforest importances show broadly the same pattern as the traditional random forest: V1, V2, V4, and V5 are ranked highest, while V6-V10 (and, here, V3) receive scores near zero.
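The modified importance measure of Strobl et al. (2007) can be requested through the conditional argument of varimp. A sketch (this is considerably slower to compute, and it tends to shrink the importance of predictors that are correlated with other predictors):

# conditional permutation importance (Strobl et al., 2007)
feature.imp.cond <- varimp(model3, conditional = TRUE)
feature.imp.cond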

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(Cubist)
model.cub <- cubist(simulated[,-11], simulated$y)
model.cub
## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## Number of samples: 200 
## Number of predictors: 10 
## 
## Number of committees: 1 
## Number of rules: 1
summary(model.cub)
## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun Apr 28 22:01:09 2019
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (11 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.224012
##     Relative |error|               0.55
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

The Cubist model uses only the most important predictors: V1, V2, V4, and V5 appear in its single rule, while V3 and the uninformative predictors V6-V10 are not used at all.
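The exercise also mentions boosted trees. A minimal sketch of a boosted-tree fit (assuming the gbm package is installed, with illustrative tuning values), where the relative influence reported by summary() plays the role of the importance scores above:

library(gbm)

# stochastic gradient boosting on the same simulated data
model.gbm <- gbm(y ~ ., data = simulated,
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3,
                 shrinkage = 0.1)

# relative influence of each predictor; the uninformative V6-V10
# would be expected to receive little or no influence
summary(model.gbm)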

8.2

Use a simulation to show tree bias with different granularities.

We can use rpart to fit a single regression tree and rpart.plot to visualize it.

library(tidyverse)
library(rpart)
library(rpart.plot)

data.set <- twoClassSim(300, noiseVars = 8, corrVar = 6, corrValue = 0.8) %>%
            mutate(TwoFactor1 = as.factor(round(TwoFactor1, 0)),
                   TwoFactor2 = as.factor(round(TwoFactor2, 0)))

r.fit <- rpart(Linear01 ~ ., data=data.set)
rpart.plot(r.fit)

varImp(r.fit)
##               Overall
## Corr1      0.78249579
## Corr2      0.43282264
## Corr3      0.46687451
## Corr4      0.90525169
## Corr5      0.37270428
## Corr6      0.72610752
## Linear02   0.83759541
## Linear03   0.45900186
## Linear04   0.95666938
## Linear05   1.25476901
## Linear06   0.76111067
## Linear07   0.19864176
## Linear08   0.55827948
## Linear09   0.14981697
## Linear10   0.59264207
## Noise2     0.31739566
## Noise3     0.56824421
## Noise4     1.48251253
## Noise5     0.05101641
## Noise6     0.28014828
## Noise7     0.28239327
## Noise8     0.57478978
## Nonlinear2 0.70332393
## Nonlinear3 0.51860792
## TwoFactor1 0.82678816
## TwoFactor2 0.30394295
## Nonlinear1 0.00000000
## Noise1     0.00000000
## Class      0.00000000

The importance scores show that the tree splits not only on the linear, nonlinear, and correlated predictors but also on the pure noise variables; Noise4 actually receives the highest score. This is consistent with the selection bias of trees toward predictors with more distinct values (finer granularity).
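To see the granularity bias more directly, here is a minimal illustrative simulation (not part of the analysis above): two predictors that are both unrelated to the response, one with many distinct values and one with only a few. Grown without pruning, the tree tends to split far more often on the fine-grained predictor.

set.seed(200)
sim.bias <- data.frame(y        = rnorm(500),                         # response unrelated to both predictors
                       x.fine   = rnorm(500),                         # ~500 distinct values
                       x.coarse = sample(1:4, 500, replace = TRUE))   # only 4 distinct values

bias.fit <- rpart(y ~ ., data = sim.bias,
                  control = rpart.control(cp = 0))

# count how many splits use each variable (plus the number of leaves)
table(bias.fit$frame$var)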

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Which model do you think would be more predictive of other samples?

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
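A sketch of how these two extreme settings could be fit to the solubility data (assuming the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package and the gbm package; the number of trees is an illustrative choice):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)

sol.train <- data.frame(solTrainXtrans, Solubility = solTrainY)

# bagging fraction and learning rate both 0.1 (left-hand plot)
gbm.low  <- gbm(Solubility ~ ., data = sol.train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)

# bagging fraction and learning rate both 0.9 (right-hand plot)
gbm.high <- gbm(Solubility ~ ., data = sol.train, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

summary(gbm.low,  plotit = FALSE)  # importance expected to spread over many predictors
summary(gbm.high, plotit = FALSE)  # importance expected to concentrate in a few predictors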

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  1. Which tree-based regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
library(missForest)

data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
df_imp1 <- missForest(df)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
df_imp <- df_imp1$ximp

data <- df_imp[,2:58]
target <- df_imp[,1]

training <- createDataPartition( target, p=0.75 )
predictor_training <- data[training$Resample1,]
target_training <- target[training$Resample1]

predictor_testing <- data[-training$Resample1,]
target_testing <- target[-training$Resample1]
ctrl <- trainControl(method = "cv", number = 10)

Single Tree Model building and tuning

rt_grid <- expand.grid(maxdepth= seq(1,10,by=1))
rt_Tune <- train(x = predictor_training, y = target_training,
                 method = "rpart2", metric = "Rsquared",
                 tuneGrid = rt_grid, trControl = ctrl)

Predict

rt_pred = predict(rt_Tune, predictor_testing)
postResample(pred = rt_pred, obs = target_testing)
##      RMSE  Rsquared       MAE 
## 1.2487359 0.5731629 0.9551321

Random Forest Model building and tuning

rf_grid <- expand.grid(mtry=seq(2,38,by=3))
rf_Tune <- train(x = predictor_training, y = target_training,
                 method = "rf", metric = "Rsquared",
                 tuneGrid = rf_grid, importance = TRUE,
                 trControl = ctrl)

Predict

rf_pred = predict(rf_Tune, predictor_testing)
postResample(pred = rf_pred, obs = target_testing)

Cubist Model building and tuning

cube_grid <- expand.grid(committees = c(1, 5, 10, 20, 50), neighbors = c(0, 1, 3, 5))

cube_Tune <- train(x = predictor_training, y = target_training,
                   method = "cubist", metric = "Rsquared",
                   tuneGrid = cube_grid, trControl = ctrl)

Predict

cube_pred = predict(cube_Tune, predictor_testing)
postResample(pred = cube_pred, obs = target_testing)
##      RMSE  Rsquared       MAE 
## 1.2013718 0.6230656 0.8570788

Of the models compared, the Cubist model gives the best test-set RMSE (about 1.20, versus about 1.25 for the single tree). One likely reason is that Cubist relies mainly on the most informative predictors.
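The resampling profiles of the three tuned models could also be compared directly with caret's resamples(). A sketch is below; strictly, the three train() calls should share the same resampling indices (for example via the index argument of trainControl) for this comparison to be exact.

# collect the cross-validation results of the three tuned models
res <- resamples(list(SingleTree   = rt_Tune,
                      RandomForest = rf_Tune,
                      Cubist       = cube_Tune))
summary(res)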

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

plot(varImp(cube_Tune), top=10, scales = list(y = list(cex = 0.8)))
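To make the comparison with the top 10 predictors from the linear and nonlinear models easier, the ranked importance scores can also be listed by name (a sketch):

# extract the importance table from the tuned Cubist model and show the top 10
cube.imp <- varImp(cube_Tune)$importance
head(cube.imp[order(-cube.imp$Overall), , drop = FALSE], 10)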

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(partykit)
plot(as.party(rt_Tune$finalModel),gp=gpar(fontsize=11))