library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5     v purrr   0.3.2
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(earth)
## Warning: package 'earth' was built under R version 3.6.3
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.6.3
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 3.6.3
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
## Warning: package 'mlbench' was built under R version 3.6.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
model1 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

According to the variable importance scores above, the random forest model did not make meaningful use of the uninformative predictors (V6 – V10): their scores are essentially zero (several are slightly negative), placing them at the bottom of the table.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

After adding a predictor that is highly correlated with V1, the importance score for V1 decreased from 8.73 to 5.69, while the new predictor (duplicate1) received a score of 4.28. Adding a predictor that is highly correlated with an informative one appears to split the importance between the two: the original predictor loses part of its score and the duplicate picks it up.
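
The second part of the question, adding another predictor that is also highly correlated with V1, is not run above. A minimal sketch of how one might check it (working on a copy so the later code that drops duplicate1 is unaffected; the object names are hypothetical):

sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1   # second correlated copy of V1

model3 <- randomForest(y ~ ., data = sim_dup2,
                       importance = TRUE,
                       ntree = 1000)
varImp(model3, scale = FALSE)   # V1's importance is expected to be diluted further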

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Warning: package 'party' was built under R version 3.6.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.6.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.6.3
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:kernlab':
## 
##     prior
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
set.seed(200)
simulated2 <- subset(simulated, select = -duplicate1)

model_cf <- cforest(y ~ ., data = simulated2)

cfImp1 <- varimp(model_cf)

cfImp1
##           V1           V2           V3           V4           V5           V6 
##  8.889688266  6.475678898  0.048199904  8.374714465  1.911482592 -0.026382693 
##           V7           V8           V9          V10 
##  0.003470791 -0.020945307 -0.040833731 -0.017820271
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

The importances from the conditional inference forest show the same overall pattern as the traditional random forest: V1, V2 and V4 dominate while the uninformative predictors (V6 – V10) stay near zero. The informative predictors V1, V2 and V4 gain a little importance, whereas V3, V5 and V6 lose some.
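
Note that varimp() was called with its default conditional = FALSE, i.e. the traditional permutation importance. A short sketch of how the conditional measure of Strobl et al. (2007), which adjusts for correlated predictors (and is noticeably slower), could be requested instead:

cfImp_cond <- varimp(model_cf, conditional = TRUE)   # Strobl et al. (2007) measure
cfImp_cond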

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
## Warning: package 'gbm' was built under R version 3.6.3
## Loaded gbm 2.1.8
#boosted tree model
gbmModel <- gbm(y ~ ., data = simulated2, distribution = "gaussian")
summary(gbmModel)

##     var    rel.inf
## V4   V4 29.6882617
## V1   V1 27.1380761
## V2   V2 23.7362357
## V5   V5 10.9276021
## V3   V3  7.6497297
## V9   V9  0.5830244
## V6   V6  0.2770703
## V7   V7  0.0000000
## V8   V8  0.0000000
## V10 V10  0.0000000

In the boosted tree model we again see that the uninformative predictors (V6 – V10) receive little or no importance. However, the most important predictor in this model is V4 rather than V1.
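
The boosted model above was fit on simulated2, which excludes duplicate1. A quick sketch of how one could also check whether boosting splits importance between V1 and the correlated duplicate, by refitting on the full data set:

gbmModel2 <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel2, plotit = FALSE)   # compare the rel.inf of V1 and duplicate1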

library(Cubist)
## Warning: package 'Cubist' was built under R version 3.6.3
#Cubist model
simulated3 <- subset(simulated2, select = c(-y))
cubistMod <- cubist(simulated3, simulated2$y)
summary(cubistMod)
## 
## Call:
## cubist.default(x = simulated3, y = simulated2$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sat May 07 13:30:44 2022
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (11 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.224012
##     Relative |error|               0.55
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

The Cubist model uses only V1, V2, V4 and V5: it leaves out V3 and, like the other models, makes no use of the uninformative predictors (V6 – V10).
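
To mirror part (b) for Cubist as well, one could refit with the correlated duplicate1 column included and inspect the importance scores (caret provides a varImp method for Cubist models). A minimal sketch with a hypothetical object name:

cubistMod2 <- cubist(subset(simulated, select = -y), simulated$y)
varImp(cubistMod2)   # check whether importance is shared between V1 and duplicate1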

Exercise 8.2

Use a simulation to show tree bias with different granularities.

Granularity here refers to how coarsely a predictor is recorded: the more granular the predictor, the fewer distinct values it takes. Tree models suffer from selection bias, favoring predictors with a larger number of distinct values (and therefore more candidate split points), so they are said to be biased against granular predictors.

Below, we simulate four predictors, each with 200 observations but a different number of distinct values (different granularities), and then fit a regression tree to see which ones it ranks as most important.

set.seed(11)
x1 <- sample(0:10000 / 10000, 200, replace = T)
x2 <- sample(0:1000 / 1000, 200, replace = T)
x3 <- sample(0:100 / 100, 200, replace = T)
x4 <- sample(0:10 / 10, 200, replace = T)

y <- x1 + x2 + x3 + x4 + rnorm(200) 

simulation <- data.frame(x1, x2, x3, x4, y)
str(simulation)
## 'data.frame':    200 obs. of  5 variables:
##  $ x1: num  0.1785 0.0033 0.0695 0.092 0.4239 ...
##  $ x2: num  0.615 0.814 0.19 0.108 0.946 0.623 0.44 0.34 0.482 0.962 ...
##  $ x3: num  0.86 0.24 0.05 0.03 0.01 0.15 0.96 0.23 0.59 0.7 ...
##  $ x4: num  0.2 0.9 0.9 0.6 0.5 0.9 0.8 0.4 0.7 0 ...
##  $ y : num  2.115 1.904 0.937 0.507 2.414 ...
library(rpart)
set.seed(9)
treeModel <- rpart(y ~ ., data = simulation)
varImp(treeModel)
##     Overall
## x1 1.778877
## x2 1.001022
## x3 1.013595
## x4 0.448117

As shown above, the tree model ranks x1, the predictor with the most distinct values (the least granular one), as the most important, while x4, the most granular predictor with only 11 possible values, is ranked least important.
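
Since y above depends on all four predictors, a sharper demonstration of the bias is to use a response that is pure noise: an unbiased procedure would then choose each predictor for the first split about equally often. A minimal sketch of that check (hypothetical, not part of the results above):

set.seed(42)
first_split <- replicate(200, {
  x1 <- sample(0:10000 / 10000, 200, replace = TRUE)   # ~10,001 distinct values
  x4 <- sample(0:10 / 10, 200, replace = TRUE)          # 11 distinct values
  y  <- rnorm(200)                                      # unrelated to x1 and x4
  fit <- rpart(y ~ x1 + x4,
               control = rpart.control(cp = 0, maxdepth = 1, minsplit = 2))
  as.character(fit$frame$var[1])                        # variable used at the root split
})
table(first_split)   # x1 is expected to be chosen far more often than x4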

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

[Figure 8.24: variable importance plots for boosting on the solubility data, with bagging fraction and learning rate both set to 0.1 (left) and both set to 0.9 (right).]

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

This is driven by the larger tuning parameter values used for the model on the right, mainly the learning rate. The learning rate controls how large a fraction of each tree’s predictions is added to the running model: at 0.9 each tree contributes almost fully, so the model quickly commits to the strongest predictors and subsequent trees keep reusing them. The 0.9 bagging fraction compounds this, since every tree is built on nearly all of the training data and the trees end up very similar to one another. The result is that importance is concentrated on the first few predictors, whereas the 0.1/0.1 model grows more varied trees and spreads importance across many predictors. Small values of the learning rate are generally recommended, although they require more trees and therefore more computing time.
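
A sketch of how the two extremes could be reproduced with gbm on the solubility data, assuming the solubility objects from the AppliedPredictiveModeling package (parameter values chosen to mimic the figure, not taken from the original analysis):

library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm_high, plotit = FALSE), 10)   # importance concentrated in a few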

(b) Which model do you think would be more predictive of other samples?

The two models use extreme parameter values to illustrate the magnitude of their effect, and as the book suggests, the optimal values should be obtained through the tuning process. That said, a smaller learning rate and bagging fraction generally give a model that generalizes better to new observations, so I would expect the model on the left to be more predictive of other samples. Performance metrics on resampled or held-out data would help confirm the choice, keeping in mind that over-fitting might be an issue.
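
Continuing the hypothetical sketch above (reusing gbm_low, gbm_high and the solubility test set), one could check this directly:

pred_low  <- predict(gbm_low,  solTestXtrans, n.trees = 100)
pred_high <- predict(gbm_high, solTestXtrans, n.trees = 100)

postResample(pred_low,  solTestY)    # expected to show the better RMSE / R-squared
postResample(pred_high, solTestY)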

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth is the maximum number of splits (nodes) in each tree. Increasing it gives more predictors a chance to be involved in the splitting process, which spreads importance across more of them. Increasing the depth would therefore flatten, i.e. reduce the slope of, the predictor importance profile for either model.
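
A small sketch of this, again reusing the hypothetical solTrain data frame from above together with the right-hand model's settings:

gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)   # compared with gbm_high, importance should spread out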

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 3.6.3
data("ChemicalManufacturingProcess")
preProcValues <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
data_imp <- predict(preProcValues, ChemicalManufacturingProcess)
set.seed(12)
index <- createDataPartition(data_imp$Yield, p=0.8, list=FALSE) 
Train <- data_imp[index, ]
Test <- data_imp[-index, ]
#Pre-processing
trans_train <- preProcess(Train, method = c("center", "scale"))
trans_test <- preProcess(Test, method = c("center", "scale"))

Train_prep <- predict(trans_train, Train)
Test_prep <- predict(trans_test, Test)

Single Tree:

set.seed(123)
treeModel <- train(x = Train[, 2:58],
                  y = Train$Yield,
                  method = "rpart",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

treePred <- predict(treeModel, newdata = Test[, 2:58])

postResample(pred = treePred, obs = Test$Yield)
##      RMSE  Rsquared       MAE 
## 0.9756851 0.2469015 0.7659706
treeImp <- varImp(treeModel)
treeImp
## rpart variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess31   85.17
## ManufacturingProcess32   74.80
## ManufacturingProcess36   53.78
## ManufacturingProcess09   53.23
## ManufacturingProcess06   49.26
## BiologicalMaterial03     46.83
## ManufacturingProcess17   42.69
## ManufacturingProcess01    0.00
## BiologicalMaterial04      0.00
## ManufacturingProcess16    0.00
## ManufacturingProcess15    0.00
## ManufacturingProcess43    0.00
## BiologicalMaterial02      0.00
## BiologicalMaterial12      0.00
## ManufacturingProcess02    0.00
## ManufacturingProcess39    0.00
## ManufacturingProcess24    0.00
## ManufacturingProcess42    0.00
## BiologicalMaterial06      0.00

Random Forest:

set.seed(345)
rfModel <- randomForest(Yield ~ ., data = Train_prep,
                      importance = TRUE,
                      ntree = 1000)

rfPred <- predict(rfModel, newdata = Test_prep[, 2:58])

postResample(pred = rfPred, obs = Test_prep$Yield)
##      RMSE  Rsquared       MAE 
## 0.6027058 0.6800785 0.4658529
rfMImp <- varImp(rfModel)
rfMImp
##                            Overall
## BiologicalMaterial01    5.45312964
## BiologicalMaterial02    7.90529873
## BiologicalMaterial03   13.73440330
## BiologicalMaterial04    8.50520239
## BiologicalMaterial05    9.67634898
## BiologicalMaterial06    8.87028490
## BiologicalMaterial07    0.51323977
## BiologicalMaterial08    4.80475335
## BiologicalMaterial09    6.81725627
## BiologicalMaterial10    2.66440345
## BiologicalMaterial11    9.22768218
## BiologicalMaterial12   11.77884422
## ManufacturingProcess01  6.55044074
## ManufacturingProcess02  4.83859878
## ManufacturingProcess03  1.63772068
## ManufacturingProcess04  1.95093975
## ManufacturingProcess05  0.49073933
## ManufacturingProcess06  6.93700972
## ManufacturingProcess07  0.01544675
## ManufacturingProcess08 -1.34843693
## ManufacturingProcess09  9.26324344
## ManufacturingProcess10  3.40370366
## ManufacturingProcess11  6.30702220
## ManufacturingProcess12  3.97295663
## ManufacturingProcess13 15.54940776
## ManufacturingProcess14  2.09904271
## ManufacturingProcess15  2.45800156
## ManufacturingProcess16  3.59038615
## ManufacturingProcess17 10.42339930
## ManufacturingProcess18  2.87779610
## ManufacturingProcess19  1.39081984
## ManufacturingProcess20  5.74963376
## ManufacturingProcess21  2.04333259
## ManufacturingProcess22  2.32264141
## ManufacturingProcess23 -1.25256497
## ManufacturingProcess24  2.71337188
## ManufacturingProcess25  2.20938726
## ManufacturingProcess26  2.19637233
## ManufacturingProcess27  7.07962729
## ManufacturingProcess28  9.81911260
## ManufacturingProcess29  5.46241082
## ManufacturingProcess30  4.53956144
## ManufacturingProcess31 11.86919981
## ManufacturingProcess32 29.10906122
## ManufacturingProcess33  5.08889548
## ManufacturingProcess34  1.69705751
## ManufacturingProcess35 -0.96068338
## ManufacturingProcess36  9.87648958
## ManufacturingProcess37  0.62012626
## ManufacturingProcess38 -0.21664079
## ManufacturingProcess39  6.65174970
## ManufacturingProcess40 -0.91449131
## ManufacturingProcess41 -0.50459676
## ManufacturingProcess42  2.39530430
## ManufacturingProcess43  4.80070781
## ManufacturingProcess44  2.87179555
## ManufacturingProcess45 -0.61684090

Boosted Tree:

set.seed(456)
boostModel <- gbm(Yield ~ ., data = Train_prep, distribution = "gaussian")

boostPred <- predict(boostModel, newdata = Test_prep[, 2:58])
## Using 100 trees...
postResample(pred = boostPred, obs = Test_prep$Yield)
##      RMSE  Rsquared       MAE 
## 0.6222545 0.6100120 0.4813279
summary(boostModel)

##                                           var    rel.inf
## ManufacturingProcess32 ManufacturingProcess32 31.0685224
## ManufacturingProcess13 ManufacturingProcess13 10.4004456
## BiologicalMaterial03     BiologicalMaterial03  8.8496083
## ManufacturingProcess31 ManufacturingProcess31  7.1013209
## ManufacturingProcess17 ManufacturingProcess17  4.3045431
## ManufacturingProcess06 ManufacturingProcess06  4.0803427
## BiologicalMaterial05     BiologicalMaterial05  3.6772228
## ManufacturingProcess42 ManufacturingProcess42  3.2920887
## ManufacturingProcess09 ManufacturingProcess09  2.6798116
## ManufacturingProcess27 ManufacturingProcess27  2.0404499
## BiologicalMaterial12     BiologicalMaterial12  1.6530930
## ManufacturingProcess37 ManufacturingProcess37  1.6171915
## ManufacturingProcess14 ManufacturingProcess14  1.4792738
## BiologicalMaterial10     BiologicalMaterial10  1.2819583
## BiologicalMaterial06     BiologicalMaterial06  1.2268468
## BiologicalMaterial01     BiologicalMaterial01  0.9842856
## ManufacturingProcess05 ManufacturingProcess05  0.9206469
## ManufacturingProcess04 ManufacturingProcess04  0.9114328
## ManufacturingProcess15 ManufacturingProcess15  0.8787844
## BiologicalMaterial02     BiologicalMaterial02  0.8656439
## ManufacturingProcess21 ManufacturingProcess21  0.8538521
## ManufacturingProcess23 ManufacturingProcess23  0.8280938
## ManufacturingProcess45 ManufacturingProcess45  0.8136093
## ManufacturingProcess39 ManufacturingProcess39  0.8041026
## ManufacturingProcess18 ManufacturingProcess18  0.7000319
## ManufacturingProcess30 ManufacturingProcess30  0.6377560
## ManufacturingProcess29 ManufacturingProcess29  0.6037254
## ManufacturingProcess44 ManufacturingProcess44  0.6006315
## BiologicalMaterial11     BiologicalMaterial11  0.5894105
## ManufacturingProcess12 ManufacturingProcess12  0.5834799
## ManufacturingProcess11 ManufacturingProcess11  0.4960041
## ManufacturingProcess28 ManufacturingProcess28  0.4926695
## ManufacturingProcess20 ManufacturingProcess20  0.4387319
## ManufacturingProcess22 ManufacturingProcess22  0.4363509
## ManufacturingProcess02 ManufacturingProcess02  0.4295104
## ManufacturingProcess07 ManufacturingProcess07  0.3991063
## ManufacturingProcess35 ManufacturingProcess35  0.3736141
## ManufacturingProcess10 ManufacturingProcess10  0.3344932
## ManufacturingProcess40 ManufacturingProcess40  0.2713137
## BiologicalMaterial04     BiologicalMaterial04  0.0000000
## BiologicalMaterial07     BiologicalMaterial07  0.0000000
## BiologicalMaterial08     BiologicalMaterial08  0.0000000
## BiologicalMaterial09     BiologicalMaterial09  0.0000000
## ManufacturingProcess01 ManufacturingProcess01  0.0000000
## ManufacturingProcess03 ManufacturingProcess03  0.0000000
## ManufacturingProcess08 ManufacturingProcess08  0.0000000
## ManufacturingProcess16 ManufacturingProcess16  0.0000000
## ManufacturingProcess19 ManufacturingProcess19  0.0000000
## ManufacturingProcess24 ManufacturingProcess24  0.0000000
## ManufacturingProcess25 ManufacturingProcess25  0.0000000
## ManufacturingProcess26 ManufacturingProcess26  0.0000000
## ManufacturingProcess33 ManufacturingProcess33  0.0000000
## ManufacturingProcess34 ManufacturingProcess34  0.0000000
## ManufacturingProcess36 ManufacturingProcess36  0.0000000
## ManufacturingProcess38 ManufacturingProcess38  0.0000000
## ManufacturingProcess41 ManufacturingProcess41  0.0000000
## ManufacturingProcess43 ManufacturingProcess43  0.0000000

Cubist:

set.seed(567)
cubistModel <- cubist(Train[, 2:58], Train$Yield)

cubistPred <- predict(cubistModel, newdata = Test[, 2:58])

postResample(pred = cubistPred, obs = Test$Yield)
##      RMSE  Rsquared       MAE 
## 0.6833094 0.6403582 0.5343385
summary(cubistModel)
## 
## Call:
## cubist.default(x = Train[, 2:58], y = Train$Yield)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sat May 07 13:30:48 2022
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 144 cases (58 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [144 cases, mean 0.0017274, range -2.19787 to 3.339426, est err 0.5158566]
## 
##  outcome = 0.0070381 + 0.592 ManufacturingProcess32
##            + 0.45 ManufacturingProcess09
## 
## 
## Evaluation on training data (144 cases):
## 
##     Average  |error|          0.3294340
##     Relative |error|               0.42
##     Correlation coefficient        0.85
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    ManufacturingProcess09
##           100%    ManufacturingProcess32
## 
## 
## Time: 0.0 secs

(a) Which tree-based regression model gives the optimal resampling and test set performance?

According to the test-set metrics above, the best-performing model is the random forest, with an RMSE of roughly 0.60 and an R-squared of about 0.68. In contrast, the weakest model is the single regression tree.
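
Only hold-out test-set metrics are computed above. A sketch of how resampling performance could also be compared, by fitting the models through caret::train with a shared cross-validation setup (settings here are illustrative, not from the original analysis):

ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)
rfCV <- train(x = Train[, 2:58], y = Train$Yield,
              method = "rf", preProc = c("center", "scale"),
              trControl = ctrl, tuneLength = 5)

set.seed(100)
cubistCV <- train(x = Train[, 2:58], y = Train$Yield,
                  method = "cubist", preProc = c("center", "scale"),
                  trControl = ctrl)

summary(resamples(list(rf = rfCV, cubist = cubistCV)))   # cross-validated RMSE / R-squared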

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The importance scores printed by varImp() on a randomForest object appear in column order rather than sorted, so the first row (BiologicalMaterial01) is not the most important predictor. Sorting by the Overall score, ManufacturingProcess32 is clearly the most important (about 29), followed by ManufacturingProcess13, BiologicalMaterial03, ManufacturingProcess31, BiologicalMaterial12, ManufacturingProcess17, ManufacturingProcess36, ManufacturingProcess28, BiologicalMaterial05 and ManufacturingProcess09. The process variables, which also outnumber the biological ones among the 57 predictors, dominate the top 10 (seven of ten). ManufacturingProcess32 at the top agrees with the importance rankings from the optimal linear and nonlinear models in the previous exercises.
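
A quick way to see the ranked list (a one-line sketch):

head(rfMImp[order(-rfMImp$Overall), , drop = FALSE], 10)   # top 10 by importance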

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)
rpart.plot(treeModel$finalModel)
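
rpart.plot shows the mean yield in each terminal node; a hypothetical way to look at the full distribution of yield within the terminal nodes is through the fitted rpart object's node assignments:

node <- treeModel$finalModel$where   # terminal node assignment for each training row
boxplot(Train$Yield ~ node, xlab = "Terminal node", ylab = "Yield")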

ManufacturingProcess32 appears at or near the top of every model, and ManufacturingProcess09 also shows up in the top 10 of the other models. The tree plot reinforces the impression that a handful of manufacturing process variables, rather than the biological ones, drive the predicted yield.