Data 624 HW 9

library(knitr)
library(rmdformats)

## Global options
options(max.print = 85)
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=85)

library(mlbench)

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gbm)
## Loaded gbm 2.1.8
library(Cubist)
library(rpart)
library(AppliedPredictiveModeling)

library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
library(rpart.plot)

8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

rfImp1
        Overall
V1   8.83890885
V2   6.49023056
V3   0.67583163
V4   7.58822553
V5   2.27426009
V6   0.17436781
V7   0.15136583
V8  -0.03078937
V9  -0.02989832
V10 -0.08529218

Ans:

No, it did not. The importance scores for V6 - V10 are very small (near zero) compared with those of V1 - V5.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
               Overall
V1          6.29780744
V2          6.08038134
V3          0.58410718
V4          6.93924427
V5          2.03104094
V6          0.07947642
V7         -0.02566414
V8         -0.11007435
V9         -0.08839463
V10        -0.00715093
duplicate1  3.56411581

Ans: Yes, it did. The importance of V1 was reduced. This is because each tree in the random forest is roughly as likely to pick duplicate1 as V1 for a split, so the split credit (and hence the importance) is shared between the two correlated predictors.

What happens when you add another predictor that is also highly correlated with V1?

Ans: The splits that originally belonged to V1 will be shared among V1 and its highly correlated copies, so V1's importance score is diluted even further.
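
As a quick check, one could add a second noisy copy of V1 (a hypothetical duplicate2, fit on a copy of the data so the data used by the later models stays untouched) and refit; the importance credit should now be split three ways:

sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * 0.1
model3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)  # V1, duplicate1, and duplicate2 now share the credit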

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

cforestModel <- cforest(y ~ ., data = simulated)

# Conditional variable importance
cfImp3 <- varimp(cforestModel, conditional = T)
# Un-conditional variable importance
cfImp4 <- varimp(cforestModel, conditional = F)
## setup and push viewport (grid/gridBase layout in place of par(mfrow = c(1, 2)))
vp0 <- viewport(x = 0.15, y = 0, just = c("left", "bottom"), width = 0.85, height = 1)
pushViewport(vp0)

## add barplots
par(new = TRUE, fig = gridBase::gridFIG())
barplot(sort(cfImp3), horiz = TRUE, main = "Conditional", col = rainbow(3), las = 1)

barplot(sort(cfImp4), horiz = TRUE, main = "Un-Conditional", col = rainbow(5), las = 1)

Ans: Yes, the overall pattern is the same: the uninformative predictors stay near zero. With conditional = TRUE, the importance of V1 is computed conditionally on its correlated partner duplicate1, so the two are effectively treated as one entity and their collective importance roughly adds up to the importance V1 had in the traditional random forest model. With conditional = FALSE, V1 is evaluated completely independently of duplicate1, which is not entirely correct in a sense.
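
For a direct comparison, the two importance vectors can be laid side by side (reusing cfImp3 and cfImp4 from above):

round(cbind(Conditional = cfImp3, Unconditional = cfImp4), 4)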

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Generalized Boosted Regression Modeling (GBM)

gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")

## start new graphics frame
plot.new()  # prevents printing an unnecessary blank frame

## setup and push viewport
vp0 <- viewport(x = 0.15, y = 0, just = c("left", "bottom"), width = 0.85, height = 1)
pushViewport(vp0)
## add barplot
## print(grid.rect())  # uncomment to check the location of the newly created viewport
par(new = TRUE, fig = gridBase::gridFIG())
summary(gbmModel, las = 1)

                  var    rel.inf
V4                 V4 30.5911663
V2                 V2 23.1301116
V1                 V1 17.5917429
V5                 V5 11.8717883
duplicate1 duplicate1  8.4784093
V3                 V3  7.8259357
V6                 V6  0.3845881
V7                 V7  0.1262577
V8                 V8  0.0000000
V9                 V9  0.0000000
V10               V10  0.0000000

The same pattern occurs: the uninformative predictors V6 - V10 are still ranked at the bottom, and V4 is still the top-ranked predictor, followed by V2 and V1. The GBM model also rates V1 as much more important than duplicate1. Because the trees in boosting are grown sequentially and depend on one another, they tend to reuse the same predictor rather than splitting the credit with its correlated copy.

Cubist

# committees is set to 100; drop the response column (y) from the predictor matrix
cubistModel <- cubist(x = simulated[, -(ncol(simulated) - 1)], y = simulated$y, committees = 100)
cfImp.cubist <- varImp(cubistModel)
# order by importance; base-R ordering keeps the predictor names as row names
# (dplyr::arrange would drop them, losing the bar labels in the plot below)
cfImp.cubist <- cfImp.cubist[order(cfImp.cubist$Overall, decreasing = TRUE), , drop = FALSE]

## start new graphics frame
plot.new()

## setup and push viewport
vp0 <- viewport(x = 0.15, y = 0, just = c("left", "bottom"), width = 0.85, height = 1)
pushViewport(vp0)
## add barplot
par(new = TRUE, fig = gridBase::gridFIG())
barplot(t(cfImp.cubist), horiz = TRUE, main = "Cubist Variable Importance", col = rainbow(3), las = 1)

The Cubist model treats V1 as the most important predictor, and the combined usage of V1 and duplicate1 would rank even higher. These results differ from the boosted tree (gbm) model and the random forest fit that included duplicate1, where V4 was ranked highest. V6 - V10 are still ranked lowest in importance.
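
To check the claim about combined importance, the Cubist usage scores for V1 and duplicate1 can be summed directly (reusing cubistModel from above):

imp.cub <- varImp(cubistModel)
sum(imp.cub[c("V1", "duplicate1"), "Overall"])  # combined credit for the two correlated copies
imp.cub["V4", "Overall"]                        # top predictor in the gbm model, for comparison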

8.2

Tree models inherently suffer from selection bias: predictors with a higher number of distinct values are favored over predictors with fewer distinct values.

I generated random samples for three variables, each with 501 observations drawn with replacement; by construction x1 can take 999 distinct values, x2 only 99, and x3 only 9. The target variable y is created by adding x1 and x3, plus a random error term. Note that x2 is not used to generate the target.

Random Forest (cforest)

set.seed(243)
x1 <- sample(0:998/999, 501, replace = T)
x2 <- sample(0:98/999, 501, replace = T)
x3 <- sample(0:8/999, 501, replace = T)


y <- x1 + x3 + rnorm(501)  # error term matches the sample size (501)

df <- data.frame(x1, x2, x3, y)
random.forest <- cforest(y ~ ., data = df)

vi <- varimp(random.forest, conditional = FALSE)
# summary(vi)
barplot(sort(vi), horiz = TRUE, main = "Un-Conditional", col = rainbow(4))

By design, x1 has many more distinct values than x3. The importance bar plot above shows the selection bias of tree models: predictors with more distinct values are favored over those with fewer.
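
The granularity gap can be confirmed by counting the distinct values of each predictor (reusing df from above):

sapply(df[, c("x1", "x2", "x3")], function(v) length(unique(v)))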

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Ans:

The model on the right uses a bagging fraction of 0.9, so nearly every tree sees almost the same 90% of the data, which reduces the diversity of the variables selected across trees. Its learning rate of 0.9 also means each tree's contribution is barely shrunk, so the first few strong predictors quickly dominate. The model on the left, with both parameters set to 0.1, fits on smaller, more varied subsamples and takes much smaller steps, which is the opposite extreme: importance is spread across many more predictors.
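
To make this concrete, here is a minimal sketch fitting the two extreme settings on the solubility data that ships with AppliedPredictiveModeling. This is my own illustration, not the exact setup behind Fig. 8.24; in particular, the n.trees value and the default interaction depth are assumptions.

data(solubility)  # provides solTrainXtrans and solTrainY

# left-hand settings: bagging fraction = 0.1, learning rate = 0.1
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
    n.trees = 100, bag.fraction = 0.1, shrinkage = 0.1, verbose = FALSE)

# right-hand settings: bagging fraction = 0.9, learning rate = 0.9
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
    n.trees = 100, bag.fraction = 0.9, shrinkage = 0.9, verbose = FALSE)

# relative influence: spread out on the left, concentrated on the right
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)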

(b) Which model do you think would be more predictive of other samples?

Ans: The model with the lower bagging fraction and learning rate (the left-hand model) should be more predictive of other samples. The right-hand model can fit the training data very well but is likely to overfit and generalize poorly to new samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Ans: As we increase the interaction depth, trees are allowed to grow deeper, so more predictors are considered for splits. In turn, importance is spread across more predictors, flattening the slope of the variable importance plot for either model rather than concentrating importance in a handful of predictors.

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Ans: I used the bagImpute method to impute missing values in the dataset.

Data splitting is also performed: 80% training, 20% test set.

data(ChemicalManufacturingProcess)

# Impute missing values using `bagImpute`
(cmpImpute <- preProcess(ChemicalManufacturingProcess[, -c(1)], method = c("bagImpute")))
Created from 152 samples and 57 variables

Pre-processing:
  - bagged tree imputation (57)
  - ignored (0)
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[, -c(1)])

# Train/test splitting of the data, 20% testing
set.seed(89)
train_set <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
X.train <- cmp[train_set, ]
y.train <- ChemicalManufacturingProcess$Yield[train_set]
X.test <- cmp[-train_set, ]
y.test <- ChemicalManufacturingProcess$Yield[-train_set]
# for part c
df.train <- ChemicalManufacturingProcess[train_set, ]

Let’s train several tree-based models. We use bootstrapped resampling with 25 repetitions (the caret default) to select the final tuning parameters. Tree-based models do not require centering and scaling.

single tree (CART)

  • complexity parameter (cp)
set.seed(243)
CART <- train(x = X.train, y = y.train, method = "rpart", tuneLength = 10, control = rpart.control(maxdepth = 2))
CART
CART 

144 samples
 57 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared   MAE     
  0.009864825  1.488400  0.4053676  1.170599
  0.013622514  1.488400  0.4053676  1.170599
  0.017200384  1.488400  0.4053676  1.170599
  0.024694612  1.488400  0.4053676  1.170599
  0.029984706  1.488400  0.4053676  1.170599
  0.032577643  1.488400  0.4053676  1.170599
  0.072428900  1.533057  0.3770369  1.205899
  0.086778383  1.554057  0.3607271  1.229296
  0.100979247  1.562672  0.3538561  1.238299
  0.398898479  1.682577  0.3170981  1.329336

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.03257764.

random forest

  • number of predictors randomly sampled at each split (mtry)
set.seed(243)
rand_Forest_Model <- train(x = X.train, y = y.train, method = "rf", tuneLength = 10)
rand_Forest_Model
Random Forest 

144 samples
 57 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE      
   2    1.293691  0.5855670  1.0163883
   8    1.216723  0.6055263  0.9415555
  14    1.204834  0.6053742  0.9268926
  20    1.205850  0.6009107  0.9268253
  26    1.211720  0.5940648  0.9304042
  32    1.217266  0.5879310  0.9320650
  38    1.222354  0.5836349  0.9348501
  44    1.232185  0.5768113  0.9423205
  50    1.236609  0.5741164  0.9437762
  57    1.245725  0.5675743  0.9523814

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 14.

gradient boosting

  • number of trees, interaction depth, shrinkage rate, and minimum terminal node size
set.seed(243)
grid <- expand.grid(n.trees = c(50, 100, 150, 200), interaction.depth = c(1, 5, 10, 
    15), shrinkage = c(0.01, 0.1, 0.5), n.minobsinnode = c(5, 10, 15))
gbm_Model <- train(x = X.train, y = y.train, method = "gbm", tuneGrid = grid,
    verbose = FALSE)  # verbose = FALSE turns off the gbm training log

plot(gbm_Model)

gbm_Model$bestTune
   n.trees interaction.depth shrinkage n.minobsinnode
75     150                10       0.1              5
gbm_Model$finalModel
A gradient boosted model with gaussian loss function.
150 iterations were performed.
There were 57 predictors of which 54 had non-zero influence.

cubist

  • number of committees and neighbors
set.seed(243)
cubist_Model <- train(x = X.train, y = y.train, method = "cubist")
cubist_Model
Cubist 

144 samples
 57 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared   MAE      
   1          0          1.726414  0.3849193  1.2324170
   1          5          1.698354  0.4080571  1.1965189
   1          9          1.711157  0.3995843  1.2128639
  10          0          1.265148  0.5583630  0.9688390
  10          5          1.238131  0.5754601  0.9387618
  10          9          1.249577  0.5682015  0.9534921
  20          0          1.213816  0.5898924  0.9315042
  20          5          1.186822  0.6072804  0.9019907
  20          9          1.199040  0.5996487  0.9162306

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 20 and neighbors = 5.
resamp_res <- resamples(list(SingleTree = CART, RandomForest = rand_Forest_Model, 
    GradientBoosting = gbm_Model, Cubist = cubist_Model))

summary(resamp_res)

Call:
summary.resamples(object = resamp_res)

Models: SingleTree, RandomForest, GradientBoosting, Cubist 
Number of resamples: 25 

MAE 
                      Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
SingleTree       0.9413484 1.0974450 1.1822222 1.1705995 1.2359558 1.362558
RandomForest     0.7803383 0.8573775 0.8921101 0.9268926 0.9855622 1.126236
GradientBoosting 0.6856090 0.8167832 0.8937282 0.8995661 0.9761865 1.109677
Cubist           0.7135246 0.8327646 0.9022183 0.9019907 0.9878750 1.048360
                 NA's
SingleTree          0
RandomForest        0
GradientBoosting    0
Cubist              0

RMSE 
                      Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
SingleTree       1.1421007 1.369535 1.522238 1.488400 1.596361 1.684370    0
RandomForest     0.9482557 1.096040 1.156528 1.204834 1.308966 1.441086    0
GradientBoosting 0.8889881 1.020132 1.167307 1.172891 1.275066 1.429713    0
Cubist           0.9261991 1.056356 1.211379 1.186822 1.308549 1.442084    0

Rsquared 
                      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
SingleTree       0.2121150 0.3417448 0.4200032 0.4053676 0.4755246 0.5617055
RandomForest     0.4131459 0.5397032 0.6096320 0.6053742 0.6782266 0.7861727
GradientBoosting 0.3779211 0.5516008 0.6210054 0.6095630 0.6909546 0.7516825
Cubist           0.4041096 0.5561325 0.6075966 0.6072804 0.6611404 0.7538830
                 NA's
SingleTree          0
RandomForest        0
GradientBoosting    0
Cubist              0

Based on the mean resampled RMSE, the gradient boosting model wins out with the lowest value (1.173, compared with 1.187 for Cubist, 1.205 for random forest, and 1.488 for the single tree).
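
The mean-RMSE ranking can also be pulled straight from the resamples summary (a small convenience, reusing resamp_res from above):

sort(summary(resamp_res)$statistics$RMSE[, "Mean"])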

Here I create a helper function that computes predictive performance on the test set for each model.

# Compute test-set performance (RMSE, Rsquared, MAE) for a list of caret models
testPerf <- function(models, testData, testTarget) {
    method <- c()
    res <- data.frame()
    for (model in models) {
        method <- c(method, model$method)  # label each row with the caret method name
        pred <- predict(model, newdata = testData)
        res <- rbind(res, t(postResample(pred = pred, obs = testTarget)))
    }
    row.names(res) <- method
    return(res)
}

models <- list(CART, rand_Forest_Model, gbm_Model, cubist_Model)

performance <- testPerf(models, X.test, y.test)
performance
           RMSE  Rsquared       MAE
rpart  1.645732 0.2847704 1.3135636
rf     1.292767 0.4986540 0.9693125
gbm    1.172100 0.5790094 0.8730559
cubist 1.086184 0.6557878 0.8697350

Cubist attains the lowest test-set RMSE among all four models. Since the test set is considerably smaller than the training set, the performance shown above may not be a close approximation of true performance on new samples. Given the choice, I would ensemble the gradient boosting (gbm) model with the Cubist model.
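
A minimal sketch of that ensemble idea, using just an unweighted average of the two models’ test-set predictions (not a tuned stack):

# average the gbm and Cubist predictions on the test set
ens_pred <- (predict(gbm_Model, newdata = X.test) + predict(cubist_Model, newdata = X.test))/2
postResample(pred = ens_pred, obs = y.test)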

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Ans: To follow the flow of these questions, I picked GBM as the optimal tree-based regression model, since it had the lowest mean resampled RMSE.

Predictor Importance

List a

varImp(gbm_Model)
gbm variable importance

  only 20 most important variables shown (out of 57)

                       Overall
ManufacturingProcess32 100.000
ManufacturingProcess13  58.587
BiologicalMaterial06    39.052
BiologicalMaterial11    21.841
ManufacturingProcess09  16.633
ManufacturingProcess06  15.530
ManufacturingProcess31  14.666
BiologicalMaterial12    14.167
ManufacturingProcess17  13.624
BiologicalMaterial03    11.916
BiologicalMaterial09    10.207
ManufacturingProcess23   8.788
ManufacturingProcess19   8.104
ManufacturingProcess39   8.101
BiologicalMaterial05     7.622
ManufacturingProcess01   6.737
ManufacturingProcess36   6.648
ManufacturingProcess04   6.418
ManufacturingProcess24   6.404
ManufacturingProcess15   6.393
# use vip::vip to plot variable importance
vip(gbm_Model, color = "red", fill = "purple")

Out of the top 20 most important variables, 14 are ManufacturingProcess predictors and 6 are BiologicalMaterial predictors, so the ManufacturingProcess predictors dominate the list.
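
A quick tally of predictor types in the top 20 (reusing varImp(gbm_Model)) backs up that count:

imp <- varImp(gbm_Model)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "ManufacturingProcess", "BiologicalMaterial"))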

The top 10 predictors from the optimal linear and non-linear models are:

List b


At first glance, the variables that appear among the top 10 in list a but not in list b are:

Variable                 Importance Score
BiologicalMaterial11                21.84
ManufacturingProcess31              14.67
BiologicalMaterial12                14.17
BiologicalMaterial03                11.92

Conversely, the variables that appear among the top 10 in list b but not in list a are:

Variable                 Importance Score
ManufacturingProcess36              76.32
BiologicalMaterial02                68.17
ManufacturingProcess33              61.27
BiologicalMaterial08                61.12

All in all, ManufacturingProcess32 is the most important predictor in the optimal tree-based regression model.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

single.tree.model <- rpart(Yield ~ ., data = df.train)  # single regression tree on the training data
rpart.plot(single.tree.model)

Ans:

Comparing the means of Yield on either side of the first split (ManufacturingProcess32 < 160 vs. >= 160):

trainData <- data.frame(X.train, y.train)
less_160 <- subset(trainData, ManufacturingProcess32 < 160)
mt_160 <- subset(trainData, ManufacturingProcess32 >= 160)
# comparing means of Yield
mean(less_160$y.train)
[1] 39.19964
mean(mt_160$y.train)
[1] 41.57017

The split at ManufacturingProcess32 = 160 separates samples with a noticeably higher average Yield (41.57) from those with a lower average (39.20). So that first split does add knowledge about how the process predictors relate to Yield.
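
As an optional check, not required by the exercise, a two-sample t-test could be used to gauge whether the difference in mean Yield across that split is more than noise:

t.test(mt_160$y.train, less_160$y.train)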