Assignment: Exercises 8.1, 8.2, 8.3, and 8.7 from the KJ textbook

library(caTools)
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

  (a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6-V10 are essentially zero (several are negative), so the random forest did not make meaningful use of the uninformative predictors. V1 is the most important predictor, followed by V4 and V2.

  (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change?

When we add a predictor that is highly correlated with V1, the importance score for V1 decreases (from about 8.8 to 6.3) because the importance is now shared with duplicate1. V4 is now the most important predictor.

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581

What happens when you add another predictor that is also highly correlated with V1?

When we add yet another predictor that is highly correlated with V1, the importance of V1 is diluted further, with the importance now spread across V1, duplicate1, and duplicate2.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .2
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          6.19983255
## V2          6.48305559
## V3          0.42757857
## V4          7.51171970
## V5          2.27902033
## V6          0.12585434
## V7          0.05716457
## V8         -0.08621813
## V9         -0.05339077
## V10        -0.01113780
## duplicate1  3.24628944
## duplicate2  1.09060225

  (c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional importances show the same broad pattern: V6-V10 remain unimportant and V2 and V4 still rank highest. However, V3 now receives almost no importance, and V1 and the two duplicate predictors are also down-weighted relative to the traditional random forest, because the conditional measure adjusts for the correlation among them.

library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
cmodel1 <- cforest(y ~ ., data = simulated)
varimp(cmodel1, conditional = TRUE)
##           V1           V2           V3           V4           V5           V6 
##  2.487590376  4.580552877  0.027953446  5.615551510  1.057098660  0.015134623 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.005205066  0.005728994 -0.008507938 -0.010683929  0.690821538  0.067750307
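
For a direct comparison, the traditional (unconditional) importance from the same cforest fit can be obtained by leaving the conditional argument at its default of FALSE:

varimp(cmodel1)   # conditional = FALSE (default): traditional permutation importance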

  (d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Using an XGBoost tree model, V4 is the most important predictor, followed by V2 and V1. Predictors V6-V10 again receive little importance, so the same pattern holds.

xgbmodel1 <- train(y ~ ., data = simulated, method = "xgbTree")
rfImpx1 <- varImp(xgbmodel1)
rfImpx1
## xgbTree variable importance
## 
##             Overall
## V4         100.0000
## V2          76.3687
## V1          60.5621
## V5          32.5734
## V3          32.3700
## duplicate1  23.3339
## V6           2.2923
## duplicate2   1.1489
## V8           1.0969
## V7           0.3611
## V10          0.1584
## V9           0.0000
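
The exercise also mentions Cubist. A minimal sketch of the analogous fit via caret (method = "cubist"; requires the Cubist package, results not shown here):

cubistmodel1 <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistmodel1)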

8.2. Use a simulation to show tree bias with different granularities.

Tree selection bias occurs when predictors with more distinct values (finer granularity) are favored over predictors with fewer distinct values, even when the latter are more informative. In this simulation the response y depends only on variable a, which takes just two distinct values, so we would expect a to be the most important variable in the tree. Instead, variable b, a pure noise variable unrelated to y but with roughly 200 distinct values, receives a higher importance score than a.

library(rpart)

set.seed(200)
a <- rep(1:2, each=100)            # informative predictor: only 2 distinct values
y <- a + rnorm(200, mean=0, sd=6)  # response depends on a (plus noise)
set.seed(400)
b <- rnorm(200, mean=0, sd=3)      # uninformative noise: ~200 distinct values

dat <- data.frame(y, a, b)
biasMod <- rpart(y ~ ., data = dat)

simImp <- varImp(biasMod)
simImp
##     Overall
## a 0.1102470
## b 0.8259556
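
The bias is easier to see when the simulation is repeated many times and we record which variable rpart chooses for its first split. A hedged sketch (the number of repetitions and the firstSplit helper name are illustrative choices):

set.seed(624)
firstSplit <- replicate(100, {
  a <- rep(1:2, each = 100)            # informative, only 2 distinct values
  y <- a + rnorm(200, mean = 0, sd = 6)
  b <- rnorm(200, mean = 0, sd = 3)    # noise, ~200 distinct values
  fit <- rpart(y ~ a + b, data = data.frame(y, a, b))
  as.character(fit$frame$var[1])       # root split variable ("<leaf>" if no split)
})
table(firstSplit)                      # expect b to win the root split more often than a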

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  (a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Increasing the bagging fraction means each tree is fit on a larger share of the training data, and increasing the learning rate makes the algorithm greedier, so successive trees keep relying on the same strong predictors. With both parameters set to 0.9, these effects combine to concentrate the importance on just a few predictors, whereas the 0.1/0.1 model on the left spreads importance across more predictors.
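
To see this pattern directly, boosted trees could be refit on the solubility data with the two extreme settings and their importance rankings compared. A minimal sketch, assuming the solubility data from AppliedPredictiveModeling; the interaction depth and number of trees are illustrative choices, not the book's exact tuning values:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

# Left-hand panel: bagging fraction = 0.1, learning rate = 0.1
gbmLeft  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 7,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)

# Right-hand panel: bagging fraction = 0.9, learning rate = 0.9
gbmRight <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 7,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence: expect it spread out for gbmLeft, concentrated for gbmRight
head(summary(gbmLeft,  plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)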

  (b) Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of other samples. With the bagging fraction and learning rate both set to 0.9, each tree sees nearly all of the data and the algorithm takes large steps toward the training-set gradient, which makes the model more prone to overfitting.

  (c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would flatten the slope of predictor importance for either model: deeper trees give more predictors a chance to be used in splits, so importance is spread across more of them, making the top predictors relatively less dominant and the weaker predictors relatively more important.
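
This can also be checked empirically by refitting the right-hand model from the sketch above with a larger interaction depth (the value 10 is an arbitrary choice) and comparing how quickly the relative influence falls off:

gbmDeep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 10,
                   shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbmDeep, plotit = FALSE), 10)   # expect a flatter drop-off than gbmRight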

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

First, we load our data, perform imputations and split it into a training and test data set.

#load data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

#imputation: impute the predictors only, then re-attach Yield (which has no missing values)
cmp <- knnImputation(ChemicalManufacturingProcess[, names(ChemicalManufacturingProcess) != "Yield"])
cmp$Yield <- ChemicalManufacturingProcess$Yield

#data splitting
set.seed(101) 
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
cmp_train = subset(cmp, sample == TRUE)
cmp_test  = subset(cmp, sample == FALSE)

cmp_train_X = subset(cmp_train, select = -Yield)
cmp_train_y = cmp_train[,'Yield']

cmp_test_X = subset(cmp_test, select = -Yield)
cmp_test_y = cmp_test[,'Yield']

  (a) Which tree-based regression model gives the optimal resampling and test set performance?

Basic Regression Tree:

cmpst <- train(x = cmp_train_X, y = cmp_train_y, method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
cmpst
## CART 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   cp         RMSE      Rsquared   MAE     
##   0.0999723  1.539236  0.3454827  1.218798
##   0.1033160  1.536178  0.3455412  1.213621
##   0.3706544  1.704810  0.2650904  1.358981
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.103316.
cmpstPred <- predict(cmpst, newdata = cmp_test_X)
postResample(pred = cmpstPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.1900480 0.5084895 0.9149989

Random Forest:

cmprf <- train(x = cmp_train_X, y = cmp_train_y, method = "rf")
cmprf
## Random Forest 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.337582  0.5824334  1.0839623
##   29    1.219222  0.6116325  0.9655119
##   57    1.228168  0.5954942  0.9714635
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.
cmprfPred <- predict(cmprf, newdata = cmp_test_X)
postResample(pred = cmprfPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.1761812 0.5228770 0.8348245

XGBoost:

cmpxgb <- train(x = cmp_train_X, y = cmp_train_y, method = "xgbTree")
cmpxgb
## eXtreme Gradient Boosting 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  RMSE      Rsquared 
##   0.3  1          0.6               0.50        50      1.281073  0.5335262
##   0.3  1          0.6               0.50       100      1.284182  0.5396167
##   0.3  1          0.6               0.50       150      1.279859  0.5465025
##   0.3  1          0.6               0.75        50      1.237670  0.5649550
##   0.3  1          0.6               0.75       100      1.219161  0.5823074
##   0.3  1          0.6               0.75       150      1.208760  0.5904512
##   0.3  1          0.6               1.00        50      1.232206  0.5626837
##   0.3  1          0.6               1.00       100      1.224734  0.5713825
##   0.3  1          0.6               1.00       150      1.224059  0.5745115
##   0.3  1          0.8               0.50        50      1.278181  0.5409179
##   0.3  1          0.8               0.50       100      1.272586  0.5516215
##   0.3  1          0.8               0.50       150      1.261833  0.5614731
##   0.3  1          0.8               0.75        50      1.248205  0.5572935
##   0.3  1          0.8               0.75       100      1.222497  0.5781756
##   0.3  1          0.8               0.75       150      1.220984  0.5812936
##   0.3  1          0.8               1.00        50      1.243247  0.5584073
##   0.3  1          0.8               1.00       100      1.227920  0.5719053
##   0.3  1          0.8               1.00       150      1.228331  0.5743774
##   0.3  2          0.6               0.50        50      1.263841  0.5533002
##   0.3  2          0.6               0.50       100      1.255004  0.5606604
##   0.3  2          0.6               0.50       150      1.254375  0.5617151
##   0.3  2          0.6               0.75        50      1.214977  0.5819239
##   0.3  2          0.6               0.75       100      1.202509  0.5917397
##   0.3  2          0.6               0.75       150      1.200370  0.5933968
##   0.3  2          0.6               1.00        50      1.214779  0.5802159
##   0.3  2          0.6               1.00       100      1.205293  0.5881199
##   0.3  2          0.6               1.00       150      1.204536  0.5890601
##   0.3  2          0.8               0.50        50      1.295514  0.5329512
##   0.3  2          0.8               0.50       100      1.278378  0.5459998
##   0.3  2          0.8               0.50       150      1.275684  0.5480316
##   0.3  2          0.8               0.75        50      1.232420  0.5678913
##   0.3  2          0.8               0.75       100      1.222114  0.5768790
##   0.3  2          0.8               0.75       150      1.220297  0.5784880
##   0.3  2          0.8               1.00        50      1.218493  0.5761531
##   0.3  2          0.8               1.00       100      1.207810  0.5842626
##   0.3  2          0.8               1.00       150      1.205934  0.5858674
##   0.3  3          0.6               0.50        50      1.266340  0.5498553
##   0.3  3          0.6               0.50       100      1.263218  0.5522246
##   0.3  3          0.6               0.50       150      1.263000  0.5523979
##   0.3  3          0.6               0.75        50      1.232254  0.5661501
##   0.3  3          0.6               0.75       100      1.230093  0.5682305
##   0.3  3          0.6               0.75       150      1.229907  0.5683666
##   0.3  3          0.6               1.00        50      1.202574  0.5838424
##   0.3  3          0.6               1.00       100      1.200089  0.5855398
##   0.3  3          0.6               1.00       150      1.199934  0.5856375
##   0.3  3          0.8               0.50        50      1.293319  0.5334740
##   0.3  3          0.8               0.50       100      1.286658  0.5386660
##   0.3  3          0.8               0.50       150      1.286170  0.5390214
##   0.3  3          0.8               0.75        50      1.218838  0.5790970
##   0.3  3          0.8               0.75       100      1.214359  0.5824039
##   0.3  3          0.8               0.75       150      1.214310  0.5824651
##   0.3  3          0.8               1.00        50      1.238530  0.5667261
##   0.3  3          0.8               1.00       100      1.237704  0.5675399
##   0.3  3          0.8               1.00       150      1.237648  0.5675898
##   0.4  1          0.6               0.50        50      1.365700  0.4932415
##   0.4  1          0.6               0.50       100      1.351888  0.5074751
##   0.4  1          0.6               0.50       150      1.345937  0.5141629
##   0.4  1          0.6               0.75        50      1.273148  0.5412808
##   0.4  1          0.6               0.75       100      1.258650  0.5572984
##   0.4  1          0.6               0.75       150      1.256923  0.5598387
##   0.4  1          0.6               1.00        50      1.258376  0.5504259
##   0.4  1          0.6               1.00       100      1.248129  0.5620757
##   0.4  1          0.6               1.00       150      1.244783  0.5673097
##   0.4  1          0.8               0.50        50      1.346017  0.5086315
##   0.4  1          0.8               0.50       100      1.337818  0.5219350
##   0.4  1          0.8               0.50       150      1.341290  0.5242514
##   0.4  1          0.8               0.75        50      1.289376  0.5356260
##   0.4  1          0.8               0.75       100      1.278958  0.5494314
##   0.4  1          0.8               0.75       150      1.271441  0.5572848
##   0.4  1          0.8               1.00        50      1.242731  0.5608155
##   0.4  1          0.8               1.00       100      1.241910  0.5670103
##   0.4  1          0.8               1.00       150      1.243206  0.5695683
##   0.4  2          0.6               0.50        50      1.320338  0.5216594
##   0.4  2          0.6               0.50       100      1.315813  0.5254276
##   0.4  2          0.6               0.50       150      1.315218  0.5259712
##   0.4  2          0.6               0.75        50      1.289582  0.5429952
##   0.4  2          0.6               0.75       100      1.287257  0.5470073
##   0.4  2          0.6               0.75       150      1.287047  0.5475724
##   0.4  2          0.6               1.00        50      1.233368  0.5715938
##   0.4  2          0.6               1.00       100      1.227122  0.5766626
##   0.4  2          0.6               1.00       150      1.225574  0.5776521
##   0.4  2          0.8               0.50        50      1.316470  0.5256837
##   0.4  2          0.8               0.50       100      1.309982  0.5304919
##   0.4  2          0.8               0.50       150      1.309166  0.5311503
##   0.4  2          0.8               0.75        50      1.254402  0.5649274
##   0.4  2          0.8               0.75       100      1.244670  0.5714362
##   0.4  2          0.8               0.75       150      1.243009  0.5726216
##   0.4  2          0.8               1.00        50      1.254566  0.5621069
##   0.4  2          0.8               1.00       100      1.251743  0.5656487
##   0.4  2          0.8               1.00       150      1.251135  0.5661447
##   0.4  3          0.6               0.50        50      1.330358  0.5118063
##   0.4  3          0.6               0.50       100      1.327620  0.5134157
##   0.4  3          0.6               0.50       150      1.327319  0.5136295
##   0.4  3          0.6               0.75        50      1.297246  0.5378903
##   0.4  3          0.6               0.75       100      1.295638  0.5390569
##   0.4  3          0.6               0.75       150      1.295633  0.5390587
##   0.4  3          0.6               1.00        50      1.282014  0.5381810
##   0.4  3          0.6               1.00       100      1.281122  0.5389368
##   0.4  3          0.6               1.00       150      1.281114  0.5389417
##   0.4  3          0.8               0.50        50      1.303177  0.5338694
##   0.4  3          0.8               0.50       100      1.300635  0.5354077
##   0.4  3          0.8               0.50       150      1.300469  0.5354850
##   0.4  3          0.8               0.75        50      1.263667  0.5582641
##   0.4  3          0.8               0.75       100      1.262950  0.5589727
##   0.4  3          0.8               0.75       150      1.262950  0.5589797
##   0.4  3          0.8               1.00        50      1.228379  0.5705836
##   0.4  3          0.8               1.00       100      1.227866  0.5709596
##   0.4  3          0.8               1.00       150      1.227847  0.5709701
##   MAE      
##   0.9968671
##   1.0047690
##   0.9992347
##   0.9736137
##   0.9560512
##   0.9476665
##   0.9687589
##   0.9595069
##   0.9588250
##   1.0025958
##   0.9884565
##   0.9806110
##   0.9721616
##   0.9524760
##   0.9526839
##   0.9824306
##   0.9613286
##   0.9611580
##   0.9896431
##   0.9783558
##   0.9780415
##   0.9478509
##   0.9375196
##   0.9357242
##   0.9462404
##   0.9393341
##   0.9391572
##   1.0006740
##   0.9879809
##   0.9842625
##   0.9456932
##   0.9326957
##   0.9320749
##   0.9447736
##   0.9340364
##   0.9324208
##   0.9956355
##   0.9936119
##   0.9933951
##   0.9605906
##   0.9578711
##   0.9576292
##   0.9348046
##   0.9323701
##   0.9322187
##   1.0012237
##   0.9943317
##   0.9942661
##   0.9500726
##   0.9470249
##   0.9468911
##   0.9664168
##   0.9649426
##   0.9648693
##   1.0734340
##   1.0566926
##   1.0503838
##   1.0101949
##   0.9930406
##   0.9907811
##   0.9894004
##   0.9756055
##   0.9718447
##   1.0560798
##   1.0465640
##   1.0484434
##   1.0152241
##   1.0003764
##   0.9938049
##   0.9788495
##   0.9689052
##   0.9664942
##   1.0358751
##   1.0303242
##   1.0291131
##   0.9958917
##   0.9936324
##   0.9935035
##   0.9597746
##   0.9550805
##   0.9538157
##   1.0333404
##   1.0278694
##   1.0277482
##   0.9673061
##   0.9585069
##   0.9573300
##   0.9693101
##   0.9666240
##   0.9661635
##   1.0423192
##   1.0393788
##   1.0392871
##   1.0046807
##   1.0035617
##   1.0035599
##   0.9999953
##   0.9995995
##   0.9996010
##   1.0416721
##   1.0389396
##   1.0387409
##   0.9645330
##   0.9634168
##   0.9634135
##   0.9491362
##   0.9486436
##   0.9486314
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
##  = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 1.
cmpxgbPred <- predict(cmpxgb, newdata = cmp_test_X)
postResample(pred = cmpxgbPred, obs = cmp_test_y)
##      RMSE  Rsquared       MAE 
## 1.2570559 0.4949626 0.8991779

We considered a basic regression tree (CART), a random forest, and XGBoost. XGBoost had a marginally lower bootstrap RMSE (about 1.20 versus 1.22 for the random forest), but the random forest gave the best test set performance (RMSE = 1.18, R^2 = 0.52), so we take it as the optimal tree-based model.
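
The resampling comparison can also be summarized side by side with caret's resamples() function. Strictly, the three calls to train() should share identical resampling indices (via a common trainControl) for this to be an exact comparison; here each fit used caret's default 25 bootstrap reps:

res <- resamples(list(CART = cmpst, RF = cmprf, XGB = cmpxgb))
summary(res)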

  (b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The optimal linear model (elastic net) had three biological predictors among its top predictors, and they were less important than the manufacturing process variables; the optimal nonlinear model (SVM) included one additional biological predictor among its top predictors. In the random forest model, manufacturing process variables still dominate the list, but one biological predictor, BiologicalMaterial12, ranks third, higher than any biological predictor ranked in the earlier models, which were topped almost entirely by manufacturing processes.

varImp(cmprf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  39.807
## BiologicalMaterial12    34.167
## ManufacturingProcess17  25.162
## ManufacturingProcess31  19.495
## BiologicalMaterial03    16.679
## ManufacturingProcess09  16.276
## BiologicalMaterial06    12.736
## ManufacturingProcess06  11.094
## ManufacturingProcess11  10.634
## BiologicalMaterial11    10.103
## ManufacturingProcess36   8.545
## BiologicalMaterial05     7.346
## ManufacturingProcess24   7.084
## BiologicalMaterial02     7.045
## BiologicalMaterial04     5.823
## ManufacturingProcess18   5.607
## ManufacturingProcess28   5.193
## BiologicalMaterial08     4.852
## ManufacturingProcess30   4.735
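
The drop-off in importance is easier to see graphically; caret's plot method for varImp objects can display just the top 10 predictors:

plot(varImp(cmprf), top = 10)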

  (c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Random forests and XGBoost are ensembles of many trees, so there is no single optimal tree to plot; instead we plot the basic regression tree, even though it underperformed the ensembles. The final CART model contains only two splits, both on manufacturing process variables, which is consistent with the earlier finding that the biological predictors are less important than the process predictors.

library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
plot(as.party(cmpst$finalModel))