Exercise 8.1
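
These solutions assume the following packages are attached; this setup block is my addition (a minimal sketch, adjust to your own environment and loading order):

library(mlbench)                    # mlbench.friedman1()
library(randomForest)               # randomForest()
library(caret)                      # varImp(), train(), preProcess(), createDataPartition()
library(partykit)                   # as.party() for plotting the rpart tree (loaded before party)
library(party)                      # cforest(), varimp()
library(gbm)                        # boosted trees
library(Cubist)                     # cubist()
library(rpart)                      # single regression trees
library(ipred)                      # ipredbagg()
library(earth)                      # MARS models via caret
library(elasticnet)                 # enet models via caret
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess and solubility data
library(grid)                       # gpar() for the tree plot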

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

No. The uninformative predictors V6-V10 have importance scores near zero, while the informative predictors V1-V5 have much higher importance.

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581

The importance score of V1 decreased after introducing the highly correlated variable “duplicate1”, because the trees can now split on either variable and the importance is shared between them.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9312569
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                 Overall
## V1          5.656397024
## V2          6.957366954
## V3          0.539700105
## V4          7.280227792
## V5          2.094226861
## V6          0.141163232
## V7          0.092792498
## V8         -0.096325566
## V9         -0.007463533
## V10         0.016839393
## duplicate1  2.566313355
## duplicate2  2.654958084

The importance score of V1 decreased even further after adding another highly correlated variable, “duplicate2”, since the importance is now split across three nearly interchangeable predictors.
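
One way to see this (an illustrative check of my own, exact values depend on the random seed) is to compare the combined importance of the correlated group against V1's score in the first model:

# Combined importance of V1 and its duplicates vs. V1 alone in model1
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])
rfImp1["V1", "Overall"]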

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

simulated <- subset(simulated, select=-c(duplicate1,duplicate2))
model4 <- cforest(y ~., data = simulated)
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
##          V1          V2          V3          V4          V5          V6 
##  6.13813531  5.38693691 -0.01588944  6.37698904  1.39631552 -0.23028740 
##          V7          V8          V9         V10 
## -0.15979153 -0.43336111 -0.20556298 -0.32988314

Yes, the pattern is similar to the traditional random forest model: the uninformative predictors V6-V10 have importance scores near zero, while V1, V2, V4, and V5 have much higher importance (V3 is also near zero here).
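
For comparison, the traditional (unconditional) importance from the same cforest fit can also be computed; this is a quick illustrative addition, since conditional = FALSE is the function's default:

# Unconditional importance from the same conditional-inference forest
varimp(model4, conditional = FALSE)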

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted trees
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##     var    rel.inf
## V4   V4 29.1936158
## V1   V1 28.0197047
## V2   V2 23.1225041
## V5   V5 10.7920685
## V3   V3  7.9672867
## V6   V6  0.7108707
## V7   V7  0.1939495
## V8   V8  0.0000000
## V9   V9  0.0000000
## V10 V10  0.0000000

Yes, the pattern is similar to the traditional random forest and the conditional inference trees: the V6-V10 predictors receive little or no relative influence, while V1-V5 have much higher importance.
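
The gbm call above uses the package defaults for the number of trees and the learning rate. As an illustrative variation (my own settings, not part of the exercise), a slower-learning fit with more trees can be used to check that the ranking is stable:

# Illustrative refit with explicit (assumed) settings: more trees, smaller learning rate
gbmModel2 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.01, interaction.depth = 2)
summary(gbmModel2, plotit = FALSE)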

Cubist
simulated2 <- subset(simulated, select=-c(y))
CubistMod <- cubist(simulated2, simulated$y)
summary(CubistMod)
## 
## Call:
## cubist.default(x = simulated2, y = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sat Nov 21 17:42:51 2020
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (11 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.224012
##     Relative |error|               0.55
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

Yes, the pattern is similar to the traditional random forest, conditional inference trees, and boosted trees: V6-V10 are not used at all, while V1, V2, V4, and V5 have the highest importance and are the only predictors used in the Cubist model.
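
caret also provides a varImp() method for Cubist models that condenses the attribute-usage table above into a single score per predictor (illustrative, output not shown):

# Importance scores derived from Cubist's condition/model usage statistics
varImp(CubistMod)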

Exercise 8.2.

Use a simulation to show tree bias with different granularities.

As our textbook indicates: “Basic regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.”

I will simulate a simple dataset with 300 rows and 3 columns containing two predictors, X1 and X3. X3 is a random sample of integers between 1 and 10 (only 10 distinct values), while X1 is generated by adding continuous noise to the same kind of values, giving it many more distinct values. The sum of these two variables plus a small amount of noise is used to generate Y.

set.seed(200)
X1 <- sample(1:10, 300, replace=TRUE) + rnorm(300, mean=0, sd=1)
X3 <- sample(1:10, 300, replace=TRUE)
Y <- X1+X3 + rnorm(300) * .1

simulated2 <- as.data.frame(cbind(X1, X3, Y))

rpartTree <- rpart(Y ~ ., data = simulated2)
varImp(rpartTree)
##     Overall
## X1 3.389582
## X3 5.311540

Variable X3 shows a noticeably higher variable importance even though both variables are directly related to Y; I attribute this to X3’s lower variance, as it takes only 10 distinct values while X1 is effectively continuous.
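
To make the granularity difference explicit, the number of distinct values of each predictor can be counted (a quick illustrative check of my own):

length(unique(X1))   # many distinct values (continuous)
length(unique(X3))   # only 10 distinct values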

Exercise 8.3.

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a)

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

According to our textbook, “the trees from boosting are dependent on each other and hence will have correlated structures as the method follows by the gradient. Therefore many of the same predictors will be selected across the trees, increasing their contribution to the importance metric.” With both the bagging fraction and the learning rate set to 0.9, each iteration sees nearly all of the training data (the bagging fraction is the randomly selected fraction of the training data used per tree) and takes large, greedy steps, so the same few strong predictors are selected again and again and importance concentrates on them. With both parameters set to 0.1, each tree is built on a small random subset and contributes only a small update, so the model is less greedy, takes many more small steps, and spreads importance across a larger number of predictors. Small learning rates generally work best, at the cost of more computation time and memory.
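
To reproduce something like Fig. 8.24, the two extreme settings can be fit on the solubility data. This is a sketch under my own assumptions (the solubility data from the AppliedPredictiveModeling package, 100 trees, and the two parameter combinations from the figure):

data(solubility)   # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand model: bagging fraction and learning rate both 0.1
gbmLeft  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand model: bagging fraction and learning rate both 0.9
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLeft,  plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)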

(b)

Which model do you think would be more predictive of other samples?

The model on the left should be more predictive of other samples: its smaller learning rate and bagging fraction make it less greedy and less prone to overfitting, and it captures information from a larger set of the important predictors.

(c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

I assume that increasing the interaction depth will give more predictors a chance to demonstrate their importance, since each tree is grown deeper and makes more splits. Therefore, increasing the depth should spread importance across more predictors and reduce (flatten) the slope of the importance plot for both models.
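
Continuing the earlier sketch (this assumes the illustrative gbmRight / solTrain objects defined above), the effect of a deeper interaction depth could be checked directly:

# Same extreme settings as the right-hand model, but with deeper trees (assumed depth of 10)
gbmRightDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    interaction.depth = 10)
head(summary(gbmRightDeep, plotit = FALSE), 10)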

Exercise 8.7.

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

The first step is to read the data into a data frame and use k-nearest neighbors (kNN) to impute the missing values.

data(ChemicalManufacturingProcess)
df1<-ChemicalManufacturingProcess
dim(df1)
## [1] 176  58
#head(df1)

df1pp <- preProcess(df1[,2:ncol(df1)], method=c('knnImpute'))
df1 <- cbind(df1$Yield,predict(df1pp, df1[,2:ncol(df1)]))
colnames(df1)[1] <- "Yield"
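
A quick sanity check (my addition) that the kNN imputation left no missing values; note that caret’s knnImpute also centers and scales the predictors as part of the imputation:

sum(is.na(df1))   # expect 0 after imputation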

The second step is to split the data into training and test sets; I will use an 80%/20% split.

set.seed(333)
trainingRows <- createDataPartition(df1$Yield, p = 0.8, list = FALSE)

df1_train <- df1[trainingRows, ]
df1_test <- df1[-trainingRows, ]

Single Tree

set.seed(333)
rpartTune <- train(x = df1_train[,2:ncol(df1_train)],
                   y = df1_train$Yield,
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
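
The tree depth selected by cross-validation can be inspected before evaluating on the test set (illustrative, output not shown):

rpartTune$bestTune   # maxdepth chosen by 10-fold CV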

Evaluating the model’s performance:

rpartPred <- predict(rpartTune, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = rpartPred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 1.1952529 0.6376276 0.9207985

Bagged Trees

set.seed(333)
baggedTree <- ipredbagg(df1_train$Yield, df1_train[,2:ncol(df1_train)])

baggedTreePred <- predict(baggedTree, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = baggedTreePred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.9762990 0.8048250 0.7108411

Random Forest

set.seed(333)
rfModel <- randomForest(df1_train[,2:ncol(df1_train)], df1_train$Yield,
                        importance = TRUE,
                        ntree = 1000)

rfModelPred <- predict(rfModel, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = rfModelPred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.9668035 0.8240946 0.6793100

Boosted Trees

set.seed(333)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2), 
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = c(5, 10))

gbmTune <- suppressWarnings(train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
                method = "gbm",
                tuneGrid = gbmGrid,
                verbose = FALSE))

gbmTunePred <- predict(gbmTune, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = gbmTunePred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 0.9089947 0.8251410 0.6251523
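
The tuning parameters selected for the boosted model can be inspected as well (illustrative, output not shown):

gbmTune$bestTune   # selected n.trees, interaction.depth, shrinkage, n.minobsinnode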

Cubist

cubistMod <- cubist(df1_train[,2:ncol(df1_train)], df1_train$Yield)

cubistModPred <- predict(cubistMod, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = cubistModPred, obs = df1_test$Yield)
##      RMSE  Rsquared       MAE 
## 1.1659034 0.6501737 0.8872566

Let’s compare the models and determine which has the best performance:

z<- rbind(
  postResample(pred = rpartPred, obs = df1_test$Yield),
  postResample(pred = baggedTreePred, obs = df1_test$Yield),
  postResample(pred = rfModelPred, obs = df1_test$Yield),
  postResample(pred = gbmTunePred, obs = df1_test$Yield),
  postResample(pred = cubistModPred, obs = df1_test$Yield)
)

data.frame(z,row.names = c('single tree', 'bagged tree', 'random forest', 'boosted tree', 'cubist'))
##                    RMSE  Rsquared       MAE
## single tree   1.1952529 0.6376276 0.9207985
## bagged tree   0.9762990 0.8048250 0.7108411
## random forest 0.9668035 0.8240946 0.6793100
## boosted tree  0.9089947 0.8251410 0.6251523
## cubist        1.1659034 0.6501737 0.8872566

The Boosted Tree model has the best test-set result, with an RMSE of 0.909 and an R-squared of 0.825, slightly ahead of the random forest and bagged tree models. However, that model took a very long time to tune and was computationally intensive, so a cheaper model such as Cubist might still be more usable in practice, even though its R-squared is considerably lower (0.650).

(b)

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(gbmTune)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12    29.749
## ManufacturingProcess13  22.124
## ManufacturingProcess09  21.421
## ManufacturingProcess06  20.227
## BiologicalMaterial03    16.322
## ManufacturingProcess17  16.252
## BiologicalMaterial09    11.750
## BiologicalMaterial05     9.811
## ManufacturingProcess01   9.287
## ManufacturingProcess31   8.783
## BiologicalMaterial11     7.704
## ManufacturingProcess24   7.207
## BiologicalMaterial06     7.181
## ManufacturingProcess30   6.812
## ManufacturingProcess37   6.783
## ManufacturingProcess43   6.237
## BiologicalMaterial08     6.132
## ManufacturingProcess05   6.040
## ManufacturingProcess04   5.939

The list is dominated by manufacturing process variables (6 of the top 10 predictors), which matches the findings from the linear and non-linear models.
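
A dotplot of the top predictors (my addition) makes the dominance of the process variables easier to see:

plot(varImp(gbmTune), top = 10)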

MARS Coefficients

set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(df1_train[,2:ncol(df1_train)],
                  y = df1_train$Yield, 
                  method = "earth",
                  preProc=c("center", "scale"),
                  tuneGrid= marsGrid,
                  trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.469694  0.4099538  1.1524962
##   1        3      1.270394  0.5461842  1.0135638
##   1        4      1.237360  0.5645598  0.9940006
##   1        5      1.243428  0.5465747  0.9978504
##   1        6      1.243240  0.5625398  0.9675818
##   1        7      1.281018  0.5448316  1.0106055
##   1        8      1.261312  0.5602795  1.0181434
##   1        9      1.267466  0.5564907  1.0329659
##   1       10      1.302172  0.5403139  1.0488237
##   1       11      1.296341  0.5445436  1.0586088
##   1       12      1.273445  0.5541629  1.0393424
##   1       13      1.252598  0.5698303  1.0246146
##   1       14      1.229202  0.5807273  1.0091250
##   1       15      1.232702  0.5796333  1.0146955
##   1       16      1.231305  0.5806299  1.0147462
##   1       17      1.235551  0.5768138  1.0097048
##   1       18      1.235551  0.5768138  1.0097048
##   1       19      1.235551  0.5768138  1.0097048
##   1       20      1.235551  0.5768138  1.0097048
##   2        2      1.469694  0.4099538  1.1524962
##   2        3      1.281032  0.5539317  1.0244959
##   2        4      1.245999  0.5686795  1.0027749
##   2        5      1.239588  0.5755069  1.0083431
##   2        6      1.277475  0.5596173  1.0404891
##   2        7      1.290410  0.5601103  1.0537775
##   2        8      1.284861  0.5524504  1.0509238
##   2        9      1.278701  0.5694702  1.0456271
##   2       10      1.294947  0.5617347  1.0661414
##   2       11      1.251266  0.5946481  1.0199754
##   2       12      1.296675  0.5954595  1.0328179
##   2       13      1.313246  0.5910848  1.0133991
##   2       14      1.310340  0.5943687  1.0090514
##   2       15      1.311538  0.5921334  1.0090578
##   2       16      1.320320  0.5905494  1.0068345
##   2       17      1.336474  0.5847754  1.0142035
##   2       18      1.326958  0.5945070  1.0169338
##   2       19      1.314935  0.5768725  1.0182396
##   2       20      1.354841  0.5597531  1.0389166
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.
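
The MARS importance ranking discussed below can be extracted from the tuned model in the same way (illustrative, output not shown here):

varImp(marsTune)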

Linear Model Coefficients

df1_enet <-  train(df1_train[,2:ncol(df1_train)], df1_train$Yield,
                 method='enet',
                 metric='RMSE',
                 trControl = trainControl(method = "repeatedcv"), 
                 preProcess=c('center','scale'))

enetCoef<- predict(df1_enet$finalModel, newx = as.matrix(df1_test[,2:ncol(df1_test)]),
                      s = .1, mode = "fraction",
                      type = "coefficients")
head(sort(enetCoef$coefficients),3)
## ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess17 
##            -0.25153249            -0.11488070            -0.06413573
tail(sort(enetCoef$coefficients),7)
##   BiologicalMaterial03 ManufacturingProcess44 ManufacturingProcess06 
##             0.07826640             0.09057314             0.10343127 
## ManufacturingProcess15 ManufacturingProcess34 ManufacturingProcess09 
##             0.15819222             0.17985524             0.51029411 
## ManufacturingProcess32 
##             0.72011230

Process 32, 9, 34, 15, 6, 44, 3 (biological), 17, 37, and 13 were the strongest predictors in my ENET linear model; 6 of those 10 appear among the Boosted Tree model’s most important variables.

Process 32, 9, 13, 33, 28, 15, 39, 4, 3 (biological), and 23 were the strongest predictors in my MARS non-linear model; 4 of those 10 appear among the Boosted Tree model’s most important variables.

(c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpartTree2 <- as.party(rpartTune$finalModel)
plot(rpartTree2, gp=gpar(fontsize = 8))

This view confirms that, for a single tree, the manufacturing process variables are of higher importance than the biological predictors. Very few splits rely on biological materials, and the initial split is on ManufacturingProcess32.

Sources: https://stackoverflow.com/questions/13751962/how-to-plot-a-large-ctree-to-avoid-overlapping-nodes