HHernandez_DATA624

Regression Trees and Rule-based Models

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1

##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 - V10)? ##### No, the model did not used them, very low (negative) importance compared to V1-V5

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

The importance of V1 decreased. Due to the correlation, the importance to the model was distributed between V1 and duplicate1 since, during tree model creation, any of them can be picked at random and any of them provide a good data split.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
cfModel <- cforest(y ~ ., data=simulated)
cfImp <- varimp(cfModel)
cfImp[order(-cfImp)]

##            V4            V2    duplicate1            V1            V5 
##  7.6223892727  6.0579730772  5.0941897280  4.6171158805  1.7161194047 
##            V7            V9            V3            V6           V10 
##  0.0465374951  0.0046062409  0.0003116115 -0.0289427183 -0.0310326410 
##            V8 
## -0.0380965511

cond_cfImp <- varimp(cfModel, conditional=T)
cond_cfImp[order(-cond_cfImp)]

##           V4           V2   duplicate1           V1           V5 
##  6.175211816  4.806626676  2.039771619  1.894092913  1.028543061 
##           V6           V3           V9           V7          V10 
##  0.021733840  0.020998486  0.003766844 -0.006026025 -0.015318278 
##           V8 
## -0.022310557

Importances differ in couple of ways, for top 5 features, the rank order was the same but the conditional values were lower than traditional, the most drastic reduction was on the highly correlated variables. For the reamining 5, the order was changed completely but the magnitude was not.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

#install.packages("gbm")
library(gbm)
gbmModel <- gbm(y ~ ., data=simulated, distribution='gaussian')
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 27.8345107
## V2                 V2 23.2038908
## V1                 V1 17.6456066
## duplicate1 duplicate1 12.3248580
## V5                 V5 10.5288759
## V3                 V3  7.9846250
## V7                 V7  0.2611591
## V6                 V6  0.2164740
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

library(Cubist)
cubistModel <- cubist(x=simulated[,-(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistModel)

##            Overall
## V3            43.5
## V1            52.5
## V2            59.5
## duplicate1    27.5
## V4            46.0
## V8             4.0
## V5            27.0
## V6            10.0
## V10            1.0
## V7             0.0
## V9             0.0

For Boosted Trees model, the pattern for the top 5 variables was very similar to the Conditional Random Forest. The relative importance between the correlated variables favors V1 over the new one, this is because in boosting trees, one tree improves over the previous which helps reducing collinearity effects.

In Cubist model, the pattern changed completely for the top 5 variables, V3 was ranked top vs V4 in all the previous models. Also, the relative importance between the correlated variables considerable favors V1 even above other variables like V2

Exercise 8.2

Use a simulation to show tree bias with different granularities

The bias-variance tradeoff does depend on the depth of the tree. Decision tree is sensitive to where it splits and how it splits. Therefore, even small changes in input variable values might result in very different tree structure. Tree models suffer from selection bias where predictors having higher number of distinct values are favored over more granular predictors

#Predictors
set.seed(200)
V1 <- sample(0:10 / 10, 200, replace = T)
V2 <- sample(0:100 / 100, 200, replace = T)
V3 <- sample(0:1000 / 1000, 200, replace = T)
V4 <- sample(0:10000 / 10000, 200, replace = T)

# Target variable
y <- V1 + V4

simdata <- data.frame(y, V1,V2,V3,V4)
str(simdata)

## 'data.frame':    200 obs. of  5 variables:
##  $ y : num  0.682 1.269 0.938 0.767 1.503 ...
##  $ V1: num  0.5 0.6 0.6 0.7 0.7 0.9 0.7 0.1 0.5 0.2 ...
##  $ V2: num  0.65 0.44 0.59 0.22 0.82 0.39 0.11 0.85 0.25 0.43 ...
##  $ V3: num  0.851 0.673 0.41 0.033 0.717 0.646 0.768 0.153 0.285 0.781 ...
##  $ V4: num  0.1816 0.6693 0.3381 0.0669 0.8033 ...

# Modeling (Single Tree)

library(rpart)

rpartModel <- rpart(y ~ ., data=simdata)
varImp(rpartModel)

##      Overall
## V1 4.2432046
## V2 0.6640285
## V3 0.5161309
## V4 4.3187553

As expected, V1 and V4 are the variables the tree gives the highest importance (they were used to generate the target). Confirming tree selection bias, V4 showed highest, as it has the most distinct values (compared to V1). Another thing to highlight is that the tree still used the other 2 variables that apparently should not provide any useful information to the tree.

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Gradient boosting involves creating and adding trees to the model sequentially.New trees are created to correct the residual errors in the predictions from the existing sequence of trees. The effect is that the model can quickly fit, then overfit the training dataset.

Learning rate is a technique to slow down the learning in the gradient boosting model by applying a weighting factor for the corrections by new trees when added to the model. Smaller learning rates generally require more trees to be added to the model. The more trees in the model, the more likely to have more predictors used across trees which will increase their overall importance. The right-hand plot learning rate of 0.9 has the opposite effect which focuses on the most relevant predictors within fewer number of trees.

Bagging fraction is the fraction of the training set observations randomly selected to propose the next tree in the expansion. Every tree may potentially be built using very different dataset. Large bagging fractions, help build subsequent trees that leverage very similar datasets (splitting similarly). This means that larger bagging fraction increases the dependent/correlated structure in the boosting trees. The right-hand plot bagging fraction focuses on the most relevant predictors as well.

Which model do you think would be more predictive of other samples?

Learning rate and bagging fraction are important parameters to control the overfitting of a boosting model. Smaller rate and fraction leads to better generalization and better bias-variance trade-off which in turn is supposed to provide better test/validation model performance. Left-hand model (0.1 rate and fraction) will be more predictive

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth controls the maximum nodes per tree - number of splits it has to perform on a tree. As each split increases the total number of nodes by 3 and number of terminal nodes by 2, the total number of nodes in the tree will be 3∗N+1 and the number of terminal nodes 2∗N+. Higher interaction depth spreads out the importance more, since each tree now can grow deeper, and has more chance for other features to be involved in the splitting process

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

(A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.)

Data loading, exploration, splitting & pre-processing/imputation

#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

library(caret)

chmp <- ChemicalManufacturingProcess
predictors <- chmp[,-1]

# Correlations - Find most correlated predictors

corr <- cor(predictors, use='complete.obs')
topcorr <- findCorrelation(corr) 

# Zero to Near-zero variance predictors check

nzv <- nearZeroVar(predictors)

# Final predictors to be considered for modeling (minus top correlated and near-zero variance)

predictors <- predictors[,-c(nzv, topcorr)]
yield <- as.data.frame(chmp[,1])
colnames(yield) <- c("yield")

# Splitting Train and Test datasets

library(caret)

set.seed(500)
train <- createDataPartition(yield$yield, p = 0.7, list = FALSE)

trainPredictors <- predictors[train,]
trainYield <- yield[train,]

testPredictors <- predictors[-train,]
testYield <- yield[-train,]

# Pre-processing

transtrain <- preProcess(trainPredictors, method=c("BoxCox","center","scale", "knnImpute"))
transtest <- preProcess(testPredictors, method=c("BoxCox","center","scale", "knnImpute"))
transTrainPredictors <- predict(transtrain,trainPredictors)
transTestPredictors <- predict(transtest,testPredictors)

Modeling & Validation

Tree-based Models (Single-Tree, Bagged, Random Forest, Boosted)

# Single-Tree

rpartModel <- train(x = transTrainPredictors,
                    y = trainYield,
                    method = "rpart",
                    tuneLength = 10,
                    control = rpart.control(maxdepth=3))
rpartModel

## CART 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.00000000  1.582253  0.3333315  1.242090
##   0.01546150  1.582442  0.3327918  1.242983
##   0.01637835  1.581506  0.3336337  1.242777
##   0.01790651  1.582902  0.3330980  1.244640
##   0.03399488  1.586807  0.3248441  1.255272
##   0.03837772  1.589153  0.3203555  1.261351
##   0.04825734  1.578079  0.3207319  1.251510
##   0.07824823  1.580359  0.3093754  1.263354
##   0.10791698  1.601078  0.2793171  1.280952
##   0.37106585  1.635747  0.2714240  1.344364
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04825734.

# Bagged

bagModel <- train(x = transTrainPredictors,
                    y = trainYield,
                    method = "treebag")
bagModel

## Bagged CART 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.284615  0.5033861  1.020088

# Random Forest

rfModel <- train(x = transTrainPredictors,
                    y = trainYield,
                    method = "cforest",
                    tuneLength = 10)
#modelLookup("cforest")
rfModel

## Conditional Inference Random Forest 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    1.525430  0.4373419  1.237979
##    7    1.367476  0.4854817  1.097711
##   12    1.341693  0.4906098  1.070490
##   17    1.333407  0.4909909  1.059084
##   22    1.333748  0.4863332  1.058292
##   27    1.333708  0.4836690  1.057417
##   32    1.336717  0.4789318  1.057632
##   37    1.339898  0.4767046  1.060028
##   42    1.347497  0.4696276  1.063856
##   47    1.358785  0.4614083  1.072335
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.

# Boosted

boostgrid <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.3, 0.5), 
                    n.minobsinnode=c(5, 10, 15, 20))
gbmModel <- train(x = transTrainPredictors,
                  y = trainYield, 
                  method = 'gbm', 
                  tuneGrid = boostgrid, 
                  verbose = FALSE)

plot(gbmModel)

gbmModel$bestTune

##     n.trees interaction.depth shrinkage n.minobsinnode
## 100     200                10       0.1              5

gbmModel$finalModel

## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 47 predictors of which 47 had non-zero influence.

Which tree-based regression model gives the optimal resampling and test set performance?

Resampling Performance

resampl <- resamples(list(SingleTree=rpartModel, Bagging=bagModel, RandomForest=rfModel, Boosting=gbmModel))
summary(resampl)

## 
## Call:
## summary.resamples(object = resampl)
## 
## Models: SingleTree, Bagging, RandomForest, Boosting 
## Number of resamples: 25 
## 
## MAE 
##                   Min.   1st Qu.    Median      Mean  3rd Qu.     Max.
## SingleTree   0.9529453 1.1478482 1.2601471 1.2515102 1.359692 1.494096
## Bagging      0.7237368 0.9506854 1.0157216 1.0200884 1.104488 1.345302
## RandomForest 0.8370000 0.9666872 1.0268801 1.0590838 1.150407 1.329035
## Boosting     0.7704826 0.8605392 0.9155187 0.9640081 1.072086 1.214469
##              NA's
## SingleTree      0
## Bagging         0
## RandomForest    0
## Boosting        0
## 
## RMSE 
##                   Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## SingleTree   1.2601942 1.450081 1.579719 1.578079 1.716539 1.925961    0
## Bagging      0.8714201 1.209174 1.308471 1.284615 1.423152 1.613547    0
## RandomForest 1.0362395 1.239562 1.295276 1.333407 1.385481 1.672938    0
## Boosting     1.0453886 1.104140 1.223885 1.248895 1.386579 1.696649    0
## 
## Rsquared 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## SingleTree   0.09777539 0.2689006 0.3618896 0.3207319 0.4010188 0.5338808
## Bagging      0.31333004 0.4286663 0.5182934 0.5033861 0.5679770 0.6611859
## RandomForest 0.31770421 0.4195910 0.4916894 0.4909909 0.5520600 0.7201505
## Boosting     0.26777362 0.4882695 0.5552660 0.5524136 0.6183363 0.7478917
##              NA's
## SingleTree      0
## Bagging         0
## RandomForest    0
## Boosting        0

Boosting model appears to be the most optimal amongst all the tree algorithms across all the performance metrics (MAE, RMSE & Rsquared). Closely followed by Bagging

Test set Performance

rpartPred <- predict(rpartModel, newdata = transTestPredictors)
rpartperf <- postResample(pred = rpartPred, obs = testYield)
bagPred <- predict(bagModel, newdata = transTestPredictors)
bagperf <- postResample(pred = bagPred, obs = testYield)
rfPred <- predict(rfModel, newdata = transTestPredictors)
rfperf <- postResample(pred = rfPred, obs = testYield)
gbmPred <- predict(gbmModel, newdata = transTestPredictors)
gbmperf <- postResample(pred = gbmPred, obs = testYield)

library(knitr)
kable(data.frame("SingleTree"=rpartperf, "Bagging"=bagperf, "RandomForest"=rfperf, "Boosting"=gbmperf))

	SingleTree	Bagging	RandomForest	Boosting
RMSE	1.5565189	1.336844	1.3732942	1.1489486
Rsquared	0.3846161	0.541170	0.5857616	0.6588613
MAE	1.1476771	1.026317	1.0486082	0.9182695

For the Test set, the Boosting model confirms to be the most optimal amongst all the algorithms across all the performance metrics (RMSE, Rsquared & MAE).

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

gbmImp <- varImp(gbmModel)
gbmImp

## gbm variable importance
## 
##   only 20 most important variables shown (out of 47)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  22.987
## BiologicalMaterial06    18.743
## ManufacturingProcess06  17.645
## ManufacturingProcess13  12.974
## ManufacturingProcess11  11.390
## ManufacturingProcess30  10.446
## BiologicalMaterial01     8.057
## ManufacturingProcess09   7.777
## ManufacturingProcess25   7.196
## BiologicalMaterial08     7.188
## ManufacturingProcess22   7.010
## BiologicalMaterial05     6.820
## ManufacturingProcess01   6.638
## BiologicalMaterial11     6.634
## BiologicalMaterial09     6.571
## ManufacturingProcess21   6.186
## ManufacturingProcess39   5.759
## ManufacturingProcess28   5.330
## ManufacturingProcess24   4.621

plot(gbmImp, top=10)

# Linear Model (PLS) - For Reference

ctrl <- trainControl(method = "boot", number = 25)
pls_tune <- train(x = transTrainPredictors, y = trainYield,
                 method = "pls",
                 tuneLength = 15,
                 trControl = ctrl)
plsImp <- varImp(pls_tune, scale = FALSE)

# Nonlinear Model (SVM) - For Reference

svmModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
svmImp <- varImp(svmModel)

plot(svmImp, top=10)

plot(plsImp, top=10)

The most important predictors for the tree-based model (Boosting) are primarily ManufacturingProcess predictors (8 out of 10). Interestingly, one of the 2 BiologicalMaterial predictors is the third most relevant for the model. Regarding the linear model (PLS), same proportion of ManufacturingProcess predictors in the top 10 (7 out of 10), but in this model, is more clearly defined the predominant influence over the BiologicalMaterial ones. Compared to the linear (PLS) and Nonlinear (SVM), Boosting trees agrees almost in all the top 5 predictors as well as order of importance, they both consider BiologicalMaterial06 as a relevant variable. In contrast, Linear PLS model prioritizes ManufacturingProcess variables completely (top 6)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartFM <- rpartModel$finalModel
plot(rpartFM)
text(rpartFM, use.n=TRUE, all=TRUE, cex=.8)

trainData <- data.frame(transTrainPredictors, trainYield)
less <- subset(trainData, ManufacturingProcess32 < 0.02642 & ManufacturingProcess06 < 0.248)
more <- subset(trainData, ManufacturingProcess32 >= 0.02642 & ManufacturingProcess06 > 0.248)

mean(less$trainYield)

## [1] 39.04113

mean(more$trainYield)

## [1] 42.44591

The plot shows that after the first split using ManufacturingProcess32, the next two splits that lead to leaf nodes consider BiologicalMaterial06 on one path and ManufacturingProcess06 on the other one as the most important predictors. Maximum yield comes from the full ManufacturingProcess-based branch and is calculated as the mean of the values falling on each leaf node according to the path split values.

HHernandez_DATA624_HW09

humbertohp

April 20, 2020

Applied Predictive Modeling (Kuhn & Johnson)