Exercises 8.1, 8.2, 8.3 & 8.7 from the K&J book.

library(knitr)
library(ggplot2)
library(tidyr)
library(AppliedPredictiveModeling)
library(partykit)
data(solubility)

8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a) Fit a random forest model to all of the predictors, then estimate the variable importance scores: Did the random forest model significantly use the uninformative predictors \((V6 – V10)\)?

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
kable(rfImp1)
          Overall
V1      8.8389088
V2      6.4902306
V3      0.6758316
V4      7.5882255
V5      2.2742601
V6      0.1743678
V7      0.1513658
V8     -0.0307894
V9     -0.0298983
V10    -0.0852922

No, the random forest model does not use \(V6-V10\) to any significant degree, as we can clearly see from the table above. The average importance score for \(V1-V5\) is 5.1734913, whereas the average for \(V6-V10\) is 0.0359508.
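
As a quick check, the group averages quoted above can be computed directly from rfImp1 (a small sketch; it uses only the Overall column shown in the table):

mean(rfImp1[paste0("V", 1:5),  "Overall"])   # informative predictors
mean(rfImp1[paste0("V", 6:10), "Overall"])   # uninformative predictors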

b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
kable(rfImp2)
               Overall
V1           6.2978074
V2           6.0803813
V3           0.5841072
V4           6.9392443
V5           2.0310409
V6           0.0794764
V7          -0.0256641
V8          -0.1100744
V9          -0.0883946
V10         -0.0071509
duplicate1   3.5641158

The importance of \(V1\) is now split between \(V1\) and \(duplicate1\). Adding another predictor that is also highly correlated with \(V1\) should dilute the importance further, spreading it across all three correlated columns; a quick sketch of that check follows.
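
This sketch (my own addition, working on a temporary copy of the data so the later parts of the exercise are unaffected) adds a second noisy copy of V1 and refits the random forest:

sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)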

c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
library(dplyr)
cforestModel <- cforest(y ~ ., data=simulated)
# varimp() returns scores in the same predictor order as the row names of
# rfImp2, so they can be bound directly; sorting them first would misalign
# the scores with the variable names.
vi <- cbind(rfImp2,
            varimp(cforestModel),
            varimp(cforestModel, conditional = TRUE))
colnames(vi) <- c("Traditional","Unconditional","Conditional")
vi <- data.frame(vi)
kable(vi)
             Traditional   Unconditional   Conditional
V1             6.2978074       7.4857968     6.1094658
V2             6.0803813       6.5716327     4.8073914
V3             0.5841072       6.1101315     3.0775321
V4             6.9392443       2.7223809     1.2533015
V5             2.0310409       1.8891176     0.9104678
V6             0.0794764       0.0104503     0.0127822
V7            -0.0256641       0.0089761     0.0093217
V8            -0.1100744       0.0070814     0.0092025
V9            -0.0883946       0.0053563     0.0036281
V10           -0.0071509      -0.0067516     0.0029042
duplicate1     3.5641158      -0.0360227     0.0024102
vi$vars <- row.names(vi)
vi.tidy <- gather(vi,"condition","value",-vars)
ggplot(vi.tidy,aes(x=vars,y=value, fill = condition,color=condition))+
  geom_bar(stat="identity",position="dodge")+
  ggtitle("Variable Importance by Model")

Whereas the traditional model (green) shows a relatively high degree of importance for the duplicate variable, the cforest model does not. Otherwise, variable importance is broadly similar, with the only other glaring difference being that the traditional model underweights V3 relative to the other two models. All three appear to ignore \(V6-V10\) equally.

d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(Cubist)
library(gbm)
model.cubist <- cubist(x=simulated[,-(ncol(simulated)-1)], 
                      y=simulated$y,committees=10)
model.gbm <- gbm(y ~ ., data=simulated,n.trees=50, distribution='gaussian')
gbm.summary <- summary(model.gbm)

varImp(model.cubist)
##            Overall
## V1            70.5
## V3            41.5
## V2            54.5
## V4            50.0
## V5            47.0
## V6            12.0
## duplicate1     1.0
## V7             0.0
## V8             0.0
## V9             0.0
## V10            0.0
# varImp() on the Cubist model and summary() on the gbm model return rows in
# their own orders, so index them by vi's row names before binding.
vi <- cbind(vi,
            varImp(model.cubist)[row.names(vi), , drop = FALSE],
            gbm.summary[row.names(vi), "rel.inf"])
vi <- data.frame(vi)
vi <- subset(vi, select = -c(vars))
colnames(vi) <- c("Traditional","Unconditional","Conditional","Cubist","GBM")
kable(vi)
             Traditional   Unconditional   Conditional   Cubist         GBM
V1             6.2978074       7.4857968     6.1094658     70.5   30.124489
V2             6.0803813       6.5716327     4.8073914     41.5   25.324228
V3             0.5841072       6.1101315     3.0775321     54.5   24.282176
V4             6.9392443       2.7223809     1.2533015     50.0    9.355944
V5             2.0310409       1.8891176     0.9104678     47.0    6.466597
V6             0.0794764       0.0104503     0.0127822     12.0    4.446567
V7            -0.0256641       0.0089761     0.0093217      1.0    0.000000
V8            -0.1100744       0.0070814     0.0092025      0.0    0.000000
V9            -0.0883946       0.0053563     0.0036281      0.0    0.000000
V10           -0.0071509      -0.0067516     0.0029042      0.0    0.000000
duplicate1     3.5641158      -0.0360227     0.0024102      0.0    0.000000
vi$vars <- row.names(vi)
vi.tidy <- gather(vi,"condition","value",-vars)
ggplot(vi.tidy,aes(x=vars,y=value, fill = condition,color=condition))+
  geom_bar(stat="identity",position="dodge")+
  ggtitle("Variable Importance by Model")

As before, the traditional random forest is the only model that assigns meaningful weight to the duplicate variable. The GBM model distributes importance in relatively similar places to the models tested previously, whereas the Cubist model also assigns some importance to \(V6\) and \(V7\). I re-ran this several times and reproduced the result, so I am unsure whether this is a feature of the model or operator error.

8.2 Use a simulation to show tree bias with different granularities.

set.seed(19)
a <- sample.int(5 , 500, replace = TRUE)/5
b <- sample.int(10, 500, replace = TRUE)/10
c <- sample.int(25, 500, replace = TRUE)/25
d <- sample.int(50, 500, replace = TRUE)/50
e <- sample.int(100, 500, replace = TRUE)/100
target <- a + e * rnorm(500)
df <- data.frame(a, b, c, d, e, target)
str(df)
## 'data.frame':    500 obs. of  6 variables:
##  $ a     : num  1 0.4 0.6 1 0.4 0.2 0.6 1 0.2 1 ...
##  $ b     : num  1 0.6 0.2 1 0.7 0.5 0.6 0.9 0.3 1 ...
##  $ c     : num  0.52 0.08 0.84 0.44 0.56 0.32 0.04 0.12 0.88 0.44 ...
##  $ d     : num  0.04 0.6 0.16 0.48 0.7 0.14 0.46 0.18 0.12 0.92 ...
##  $ e     : num  0.81 0.02 0.1 0.41 0.64 0.23 0.32 0.84 0.99 0.78 ...
##  $ target: num  0.64 0.424 0.678 0.278 -0.444 ...
library(rpart)
library(rpart.plot)
model.rpart <- rpart(target ~ ., data=df)
kable(varImp(model.rpart))
      Overall
a   0.1941175
b   0.2511647
c   0.1834566
d   0.0636462
e   0.2548975
rpart.plot(model.rpart)

In the above, we first generate a set of five random variables with differing granularity (a takes 5 distinct values, e takes 100). The target is the coarsest variable, a, plus noise whose magnitude is scaled by the most granular variable, e.

We then train an rpart model and see in the varImp table that variable e receives a higher importance score than variable a, even though a carries the signal and e only scales the noise - consistent with the tendency of recursive partitioning to favor predictors with more distinct values. The rpart plot is a little cryptic; in hindsight, fewer variables would have made the point more clearly, so a stripped-down two-variable version is sketched below.
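
A smaller sketch (my own variant, with hypothetical names x_coarse and x_fine), in which the coarse variable carries all of the signal and the fine-grained variable is pure noise:

set.seed(19)
x_coarse <- sample.int(4, 500, replace = TRUE) / 4   # 4 distinct values, carries the signal
x_fine   <- runif(500)                               # effectively continuous, pure noise
sim.gran <- data.frame(x_coarse, x_fine, y = x_coarse + rnorm(500, sd = 0.5))
fit.gran <- rpart(y ~ ., data = sim.gran)
varImp(fit.gran)   # the noise variable typically still picks up importance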

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient.

Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Fig 8.24: variable importance plots for the two boosted models (both parameters at 0.1 on the left, 0.9 on the right).

a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

In the context of bagging fraction and learning rate:

The learning rate governs the fraction of each new tree's prediction that is added to the running prediction; the text suggests small values (< 0.01). The bagging fraction determines the proportion of the training data randomly sampled for each tree - the book suggests around 50%.

  • The model on the right has a much higher learning rate, so each tree's contribution is added to the ensemble almost in full. The strongest predictors do most of the work in the first few trees, leaving little residual signal for the rest, which concentrates importance on a handful of variables.
  • The model on the right was also trained with a higher bagging fraction, so every tree sees nearly the same 90% of the data and keeps selecting the same dominant predictors.
  • The model on the left takes small steps on small random subsets of the data, so different predictors get selected from tree to tree and importance is spread across many more variables. A rough sketch of this effect on the solubility data is shown below.
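
The sketch below (my own addition, not from the text) refits the two extreme settings of Fig 8.24 on the solubility data loaded above and compares how concentrated the relative influence is; the object names and the choice of 100 trees are arbitrary:

sol.df <- data.frame(solTrainXtrans, y = solTrainY)
# left-hand settings of Fig 8.24: small steps, small samples per tree
gbm.left  <- gbm(y ~ ., data = sol.df, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# right-hand settings: large steps, nearly all the data per tree
gbm.right <- gbm(y ~ ., data = sol.df, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm.left,  plotit = FALSE), 10)   # influence spread across many predictors
head(summary(gbm.right, plotit = FALSE), 10)   # influence concentrated in a few predictors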

b) Which model do you think would be more predictive of other samples?

We would expect the model on the right to perform well on data sets closely analogous to the training data. However, because it leans so heavily on a few predictors, it will likely also be more sensitive to differences in the out-of-sample data.

Which model I would choose depends on the application (the consequences of being wrong, etc.). Where robustness matters more than squeezing out accuracy, the model on the left is preferred; otherwise, the model on the right.

c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

g1 <- expand.grid(n.trees=100, interaction.depth=1, shrinkage=0.1, n.minobsinnode=10)
g2 <- expand.grid(n.trees=100, interaction.depth=10, shrinkage=0.1,n.minobsinnode=10)
model.gbm1 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g1, verbose = FALSE)
 
model.gbm2 <- train(x = solTrainXtrans, y = solTrainY, method = 'gbm', tuneGrid = g2, verbose = FALSE)
var.imp <- cbind(varImp(model.gbm1)[[1]],varImp(model.gbm2)[[1]])
colnames(var.imp) <- c("Depth1", "Depth10")
kable(var.imp[order(-var.imp$Depth1),][1:25,])
                         Depth1        Depth10
NumCarbon           100.0000000    100.0000000
MolWeight            71.2326619     75.7839450
SurfaceArea2         40.6536164     22.3079238
NumAromaticBonds     36.5934625     16.3954527
SurfaceArea1         22.4156010     42.0861640
NumChlorine          22.0324377      9.3635377
HydrophilicFactor    12.0430919     21.1485802
NumNonHAtoms          8.9283369     12.6230651
FP172                 8.2081308      2.5659970
NumHalogen            4.5214749     10.2322460
NumRotBonds           4.1502299      3.5024223
FP059                 3.8123259      2.0345170
FP116                 2.4919572      1.8422265
FP112                 2.1801740      1.7124038
FP135                 1.8358101      1.1442461
NumOxygen             1.6214354      2.9817220
NumMultBonds          1.5572537      3.9728826
FP075                 1.4608977      3.0037295
FP206                 1.3469506      0.8144075
FP142                 1.0667059      1.8600409
FP125                 0.8456011      0.2060303
FP072                 0.8013671      2.0490774
FP161                 0.7224100      0.6564066
NumHydrogen           0.6908099      5.4346728
FP043                 0.6587296      0.4635920

As we can see above, increasing the interaction depth flattens the importance slope, spreading weight across many more predictors further down the list.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

First we'll load and split the data by recycling code from HW7 & 8.

set.seed(19)
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"
#split 70/30
n <-  floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]

Next we train, analyze and compare several models:

# train all the models
grid.rf <- expand.grid(mtry=seq(5,40,by=5))
model.rf <- train(Yield ~.,
                  data = train, 
                  method = "rf",
                  tuneGrid = grid.rf,
                  metric = "Rsquared",
                  importance = TRUE,
                  trControl = trainControl(method = "cv", number = 10))
grid.crf <- expand.grid(mtry=seq(5,50,by=5))
model.crf <- train(Yield ~.,
                  data = train, 
                  method = "cforest",
                  tuneGrid = grid.crf,
                  metric = "Rsquared",
                  trControl = trainControl(method = "oob"))
grid.cube <- expand.grid(committees = c(1,5,10,15,20,25), 
                         neighbors = c(0,1,3,4,5))
model.cube <- train(Yield ~.,
                   data = train, 
                   method = "cubist", 
                   metric = "Rsquared",
                   tuneGrid = grid.cube, 
                   trControl = trainControl(method = "cv", number = 10))
grid.rpart <- expand.grid(maxdepth= seq(1,10,by=1))
model.rpart <- train(Yield ~.,
                     data = train,
                     method = "rpart2",
                     metric = "Rsquared", 
                     tuneGrid = grid.rpart,
                     trControl = trainControl(method = "cv", number = 10))
grid.gbm <- expand.grid(n.trees=c(50, 100, 150, 200), 
                    interaction.depth=c(1, 5, 10, 15), 
                    shrinkage=c(0.01, 0.1, 0.5), 
                    n.minobsinnode=c(5, 10, 15))
model.gbm <- train(Yield ~.,
                  data = train, 
                  method = 'gbm', 
                  tuneGrid = grid.gbm , 
                  verbose = FALSE)
pred.rf <- predict(model.rf,  newdata = test[,-1])
pred.crf <- predict(model.crf,  newdata = test[,-1])
pred.cube <-  predict(model.cube,  newdata = test[,-1])
pred.rpart <-  predict(model.rpart,  newdata = test[,-1])
pred.gbm <-  predict(model.gbm,  newdata = test[,-1])
train.results <- data.frame(rbind(getTrainPerf(model.rf),
                                  getTrainPerf(model.crf),
                                  getTrainPerf(model.cube),
                                  getTrainPerf(model.rpart),
                                  getTrainPerf(model.gbm)))
row.names(train.results) <- c("RandomForest", "cForest","Cubist","Rpart","GBM")
train.results 
##              TrainRMSE TrainRsquared  TrainMAE  method
## RandomForest 1.2016913     0.6305468 0.9673461      rf
## cForest      1.2582897     0.5203157 0.9938630 cforest
## Cubist       0.9603947     0.7445603 0.7767718  cubist
## Rpart        1.3649699     0.4511684 1.1045945  rpart2
## GBM          1.2596027     0.5135164 0.9994591     gbm
test.results <- data.frame(rbind(postResample(pred = pred.rf, obs = test$Yield),
                        postResample(pred = pred.crf, obs = test$Yield),
                        postResample(pred = pred.cube, obs = test$Yield),
                        postResample(pred = pred.rpart, obs = test$Yield),
                        postResample(pred = pred.gbm, obs = test$Yield)))
row.names(test.results) <- c("RandomForest", "cForest","Cubist","Rpart","GBM")
test.results
##                  RMSE  Rsquared       MAE
## RandomForest 1.300199 0.6932216 0.9957643
## cForest      1.289938 0.6890906 0.9992050
## Cubist       1.014226 0.7519017 0.7666334
## Rpart        1.571047 0.4582121 1.2264348
## GBM          1.169969 0.6935498 0.8961994

a) Which tree-based regression model gives the optimal resampling and test set performance?

From the above tables we can see that the best model in both training and testing was the Cubist model.

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

plot(varImp(model.cube), 
     top=10, 
     scales = list(y = list(cex = 0.8)),
     main="Variable Importance for Cubist")

plot(varImp(model.rf), 
     top=10, scales = list(y = list(cex = 0.8)),
     main="Variable Importance for RandomForest")

plot(varImp(model.crf), 
     top=10, scales = list(y = list(cex = 0.8)),
     main="Variable Importance for cForest")

plot(varImp(model.rpart), 
     top=10, scales = list(y = list(cex = 0.8)),
     main="Variable Importance for Rpart")

plot(varImp(model.gbm),
     top=10, scales = list(y = list(cex = 0.8)),
     main="Variable Importance for GBM")

# The importance tables come back in different row orders for different
# models, so match them by predictor name rather than binding blindly.
imp.list <- list(Cubist       = varImp(model.cube)$importance,
                 RandomForest = varImp(model.rf)$importance,
                 cForest      = varImp(model.crf)$importance,
                 Rpart        = varImp(model.rpart)$importance,
                 GBM          = varImp(model.gbm)$importance)
vars <- row.names(imp.list$Cubist)
var.imp <- data.frame(lapply(imp.list, function(x) x[vars, 1]), row.names = vars)
kable(var.imp[1:10,])
                          Cubist   RandomForest      cForest       Rpart        GBM
BiologicalMaterial06     85.9375       45.97124    4.0486620    63.08239   4.863126
ManufacturingProcess13  100.0000       63.30621   15.2548348    64.28698   6.351365
ManufacturingProcess28   67.1875       72.06596   21.1195833    67.64603  19.563297
BiologicalMaterial12     48.4375       60.63919    3.5450018    55.51352   3.567137
BiologicalMaterial09     25.0000       52.80275    1.5252778   100.00000  21.163190
ManufacturingProcess39   32.8125       74.58936   17.9040659     0.00000  16.727681
ManufacturingProcess17   34.3750       29.60265    0.6560436     0.00000   0.000000
ManufacturingProcess01   21.8750       59.85886    8.9780520     0.00000   3.149193
ManufacturingProcess31   34.3750       54.20562    1.5336713     0.00000  18.388157
ManufacturingProcess32   82.8125       43.09557    0.5293143     0.00000  12.732971

From the plots and table above, the most important variable in the Cubist model (the best model) is ManufacturingProcess32, and its top 10 is dominated by manufacturing process variables. While most of the other models (except Rpart) also rank ManufacturingProcess32 near the top, they show less of a bias toward the process variables: just by eye, their top-10 lists are roughly 50/50 manufacturing vs. biological, whereas the Cubist list is about 80% manufacturing. A rough count is sketched below.
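
The sketch below (my own hypothetical helper, count.types; it assumes each caret varImp table carries a single Overall column) counts manufacturing vs. biological variables in each model's top 10:

count.types <- function(model, n = 10) {
  imp <- varImp(model)$importance
  top <- row.names(imp)[order(-imp$Overall)][1:n]
  table(ifelse(grepl("^Manufacturing", top), "Process", "Biological"))
}
lapply(list(Cubist = model.cube, RandomForest = model.rf, cForest = model.crf,
            Rpart = model.rpart, GBM = model.gbm), count.types)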

c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

plot(as.party(model.rpart$finalModel),gp=gpar(fontsize=11))

There is no single tree to plot for the optimal model (Cubist is a rule-based ensemble), so, given that the models perform reasonably similarly, I chose to plot the rpart model via as.party() from partykit, which is relatively simple.
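
As a quick aside (not required by the exercise), the fitted Cubist rules can at least be printed for inspection:

summary(model.cube$finalModel)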

The top of the tree (i.e. the big decisions) is governed by the manufacturing process variables - consistent with the variable importance results. It is not incredibly clear, but based on the terminal-node distributions it appears that the splits driven by the biological variables may be associated with lower yield outcomes in general; a rough tabulation of mean yield per terminal node is sketched below.
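
A sketch of that tabulation (it assumes the finalModel's $where slot lines up row-for-row with train, which it should since caret fits the final rpart on the full training set):

train %>%
  mutate(node = model.rpart$finalModel$where) %>%
  group_by(node) %>%
  summarise(mean_yield = mean(Yield), n = n()) %>%
  arrange(mean_yield)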