Assignment: Exercises 8.1, 8.2, 8.3, and 8.7 from the KJ textbook
library(caTools)
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest makes little to no use of the uninformative predictors: the importance scores for V6-V10 are essentially zero (a few are slightly negative), far below those of the informative predictors. V1 is the most important predictor, followed by V4 and V2.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change?
When we add a predictor (duplicate1) that is highly correlated with V1, the importance score for V1 decreases because the importance is now split between V1 and its surrogate; V4 becomes the most important predictor.
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
What happens when you add another predictor that is also highly correlated with V1?
When we add yet another predictor that is highly correlated with V1, the importance of V1 decreases further as the signal is diluted across V1 and its two near-duplicates.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .2
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 6.19983255
## V2 6.48305559
## V3 0.42757857
## V4 7.51171970
## V5 2.27902033
## V6 0.12585434
## V7 0.05716457
## V8 -0.08621813
## V9 -0.05339077
## V10 -0.01113780
## duplicate1 3.24628944
## duplicate2 1.09060225
The conditional importance scores from a conditional inference forest (below) show broadly the same pattern: V6-V10 remain unimportant, while V3 and the duplicates of V1 receive noticeably lower importance than they did in the traditional random forest.
library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
cmodel1 <- cforest(y ~ ., data = simulated)
varimp(cmodel1, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 2.487590376 4.580552877 0.027953446 5.615551510 1.057098660 0.015134623
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.005205066 0.005728994 -0.008507938 -0.010683929 0.690821538 0.067750307
Using an XGBoost tree model, V4 is the most important predictor, followed by V2 and V1. Predictors V6-V10 are again unimportant.
xgbmodel1 <- train(y ~ ., data = simulated, method = "xgbTree")
rfImpx1 <- varImp(xgbmodel1)
rfImpx1
## xgbTree variable importance
##
## Overall
## V4 100.0000
## V2 76.3687
## V1 60.5621
## V5 32.5734
## V3 32.3700
## duplicate1 23.3339
## V6 2.2923
## duplicate2 1.1489
## V8 1.0969
## V7 0.3611
## V10 0.1584
## V9 0.0000
8.2. Use a simulation to show tree bias with different granularities.
Tree selection bias occurs when predictors with many distinct values (finer granularity) are favored over predictors with only a few distinct values (coarser granularity). In the simulation below, the response y depends only on a, a two-level predictor, so we would expect a to be the most important variable in the tree model. Instead, b, a continuous noise predictor with many distinct values on which y does not depend, receives a higher importance score than a. (A brief conditional inference forest check follows the rpart output below.)
library(rpart)
set.seed(200)
a <- rep(1:2, each=100)
y <- a + rnorm(200, mean=0, sd=6)
set.seed(400)
b <- rnorm(200, mean=0, sd=3)
dat <- data.frame(y, a, b)
biasMod <- rpart(y ~ ., data = dat)
simImp <- varImp(biasMod)
simImp
## Overall
## a 0.1102470
## b 0.8259556
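As an optional follow-up sketch (reusing the party package loaded earlier, with an arbitrary seed), the same simulated data can be passed to a conditional inference forest, whose unbiased split selection is intended to reduce this granularity bias, to see whether the ranking of a and b changes:
#conditional inference forest on the same simulated data
set.seed(200)
cfBias <- cforest(y ~ ., data = dat)
varimp(cfBias)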
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
As the bagging fraction increases, each tree is fit to a larger share of the training data, and as the learning rate increases, the model becomes greedier, with each tree contributing a larger portion of the gradient step. Together these settings lead the model to concentrate its fits, and therefore its importance, on a small number of predictors, which is why the right-hand plot (both parameters at 0.9) focuses importance on far fewer predictors than the left-hand plot (both at 0.1).
The model on the left would likely be more predictive of other samples: the small learning rate and bagging fraction act as a form of regularization, whereas the greedier right-hand model, which sees nearly all of the data at each iteration, is more prone to overfitting.
Increasing the interaction depth lets more predictors enter each tree, spreading importance over a larger set of variables. The slope of the importance plot would therefore flatten: the top predictors would become relatively less dominant while lower-ranked predictors gain importance.
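As a rough illustration of parts (a) and (b), one could fit two boosted models on the solubility data with the two extreme settings and compare their relative-influence rankings. This is only a sketch, not the code behind Figure 8.24; the gbm package is assumed, and the interaction depth, number of trees, and seed are arbitrary choices.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans (predictors) and solTrainY (response)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, interaction.depth = 1,
              shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.9, bag.fraction = 0.9)
#relative influence: we would expect the 0.9/0.9 model to concentrate its
#importance on fewer predictors than the 0.1/0.1 model
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)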
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
First, we load the data, impute missing values, and split it into training and test sets.
#load data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
#imputation (note: the response column is actually named "Yield", so this filter matches nothing and the full data frame, Yield included, is passed to knnImputation; the splitting code below relies on Yield being present in cmp)
cmp <- knnImputation(ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "yield"])
#data splitting
set.seed(101)
sample <- sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
cmp_train <- subset(cmp, sample == TRUE)
cmp_test <- subset(cmp, sample == FALSE)
cmp_train_X <- subset(cmp_train, select = -Yield)
cmp_train_y <- cmp_train[, "Yield"]
cmp_test_X <- subset(cmp_test, select = -Yield)
cmp_test_y <- cmp_test[, "Yield"]
Basic Regression Tree:
cmpst <- train(x = cmp_train_X, y = cmp_train_y, method = "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
cmpst
## CART
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.0999723 1.539236 0.3454827 1.218798
## 0.1033160 1.536178 0.3455412 1.213621
## 0.3706544 1.704810 0.2650904 1.358981
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.103316.
cmpstPred <- predict(cmpst, newdata = cmp_test_X)
postResample(pred = cmpstPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.1900480 0.5084895 0.9149989
Random Forest:
cmprf <- train(x = cmp_train_X, y = cmp_train_y, method = "rf")
cmprf
## Random Forest
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.337582 0.5824334 1.0839623
## 29 1.219222 0.6116325 0.9655119
## 57 1.228168 0.5954942 0.9714635
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.
cmprfPred <- predict(cmprf, newdata = cmp_test_X)
postResample(pred = cmprfPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.1761812 0.5228770 0.8348245
XGBoost:
cmpxgb <- train(x = cmp_train_X, y = cmp_train_y, method = "xgbTree")
cmpxgb
## eXtreme Gradient Boosting
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 1.281073 0.5335262
## 0.3 1 0.6 0.50 100 1.284182 0.5396167
## 0.3 1 0.6 0.50 150 1.279859 0.5465025
## 0.3 1 0.6 0.75 50 1.237670 0.5649550
## 0.3 1 0.6 0.75 100 1.219161 0.5823074
## 0.3 1 0.6 0.75 150 1.208760 0.5904512
## 0.3 1 0.6 1.00 50 1.232206 0.5626837
## 0.3 1 0.6 1.00 100 1.224734 0.5713825
## 0.3 1 0.6 1.00 150 1.224059 0.5745115
## 0.3 1 0.8 0.50 50 1.278181 0.5409179
## 0.3 1 0.8 0.50 100 1.272586 0.5516215
## 0.3 1 0.8 0.50 150 1.261833 0.5614731
## 0.3 1 0.8 0.75 50 1.248205 0.5572935
## 0.3 1 0.8 0.75 100 1.222497 0.5781756
## 0.3 1 0.8 0.75 150 1.220984 0.5812936
## 0.3 1 0.8 1.00 50 1.243247 0.5584073
## 0.3 1 0.8 1.00 100 1.227920 0.5719053
## 0.3 1 0.8 1.00 150 1.228331 0.5743774
## 0.3 2 0.6 0.50 50 1.263841 0.5533002
## 0.3 2 0.6 0.50 100 1.255004 0.5606604
## 0.3 2 0.6 0.50 150 1.254375 0.5617151
## 0.3 2 0.6 0.75 50 1.214977 0.5819239
## 0.3 2 0.6 0.75 100 1.202509 0.5917397
## 0.3 2 0.6 0.75 150 1.200370 0.5933968
## 0.3 2 0.6 1.00 50 1.214779 0.5802159
## 0.3 2 0.6 1.00 100 1.205293 0.5881199
## 0.3 2 0.6 1.00 150 1.204536 0.5890601
## 0.3 2 0.8 0.50 50 1.295514 0.5329512
## 0.3 2 0.8 0.50 100 1.278378 0.5459998
## 0.3 2 0.8 0.50 150 1.275684 0.5480316
## 0.3 2 0.8 0.75 50 1.232420 0.5678913
## 0.3 2 0.8 0.75 100 1.222114 0.5768790
## 0.3 2 0.8 0.75 150 1.220297 0.5784880
## 0.3 2 0.8 1.00 50 1.218493 0.5761531
## 0.3 2 0.8 1.00 100 1.207810 0.5842626
## 0.3 2 0.8 1.00 150 1.205934 0.5858674
## 0.3 3 0.6 0.50 50 1.266340 0.5498553
## 0.3 3 0.6 0.50 100 1.263218 0.5522246
## 0.3 3 0.6 0.50 150 1.263000 0.5523979
## 0.3 3 0.6 0.75 50 1.232254 0.5661501
## 0.3 3 0.6 0.75 100 1.230093 0.5682305
## 0.3 3 0.6 0.75 150 1.229907 0.5683666
## 0.3 3 0.6 1.00 50 1.202574 0.5838424
## 0.3 3 0.6 1.00 100 1.200089 0.5855398
## 0.3 3 0.6 1.00 150 1.199934 0.5856375
## 0.3 3 0.8 0.50 50 1.293319 0.5334740
## 0.3 3 0.8 0.50 100 1.286658 0.5386660
## 0.3 3 0.8 0.50 150 1.286170 0.5390214
## 0.3 3 0.8 0.75 50 1.218838 0.5790970
## 0.3 3 0.8 0.75 100 1.214359 0.5824039
## 0.3 3 0.8 0.75 150 1.214310 0.5824651
## 0.3 3 0.8 1.00 50 1.238530 0.5667261
## 0.3 3 0.8 1.00 100 1.237704 0.5675399
## 0.3 3 0.8 1.00 150 1.237648 0.5675898
## 0.4 1 0.6 0.50 50 1.365700 0.4932415
## 0.4 1 0.6 0.50 100 1.351888 0.5074751
## 0.4 1 0.6 0.50 150 1.345937 0.5141629
## 0.4 1 0.6 0.75 50 1.273148 0.5412808
## 0.4 1 0.6 0.75 100 1.258650 0.5572984
## 0.4 1 0.6 0.75 150 1.256923 0.5598387
## 0.4 1 0.6 1.00 50 1.258376 0.5504259
## 0.4 1 0.6 1.00 100 1.248129 0.5620757
## 0.4 1 0.6 1.00 150 1.244783 0.5673097
## 0.4 1 0.8 0.50 50 1.346017 0.5086315
## 0.4 1 0.8 0.50 100 1.337818 0.5219350
## 0.4 1 0.8 0.50 150 1.341290 0.5242514
## 0.4 1 0.8 0.75 50 1.289376 0.5356260
## 0.4 1 0.8 0.75 100 1.278958 0.5494314
## 0.4 1 0.8 0.75 150 1.271441 0.5572848
## 0.4 1 0.8 1.00 50 1.242731 0.5608155
## 0.4 1 0.8 1.00 100 1.241910 0.5670103
## 0.4 1 0.8 1.00 150 1.243206 0.5695683
## 0.4 2 0.6 0.50 50 1.320338 0.5216594
## 0.4 2 0.6 0.50 100 1.315813 0.5254276
## 0.4 2 0.6 0.50 150 1.315218 0.5259712
## 0.4 2 0.6 0.75 50 1.289582 0.5429952
## 0.4 2 0.6 0.75 100 1.287257 0.5470073
## 0.4 2 0.6 0.75 150 1.287047 0.5475724
## 0.4 2 0.6 1.00 50 1.233368 0.5715938
## 0.4 2 0.6 1.00 100 1.227122 0.5766626
## 0.4 2 0.6 1.00 150 1.225574 0.5776521
## 0.4 2 0.8 0.50 50 1.316470 0.5256837
## 0.4 2 0.8 0.50 100 1.309982 0.5304919
## 0.4 2 0.8 0.50 150 1.309166 0.5311503
## 0.4 2 0.8 0.75 50 1.254402 0.5649274
## 0.4 2 0.8 0.75 100 1.244670 0.5714362
## 0.4 2 0.8 0.75 150 1.243009 0.5726216
## 0.4 2 0.8 1.00 50 1.254566 0.5621069
## 0.4 2 0.8 1.00 100 1.251743 0.5656487
## 0.4 2 0.8 1.00 150 1.251135 0.5661447
## 0.4 3 0.6 0.50 50 1.330358 0.5118063
## 0.4 3 0.6 0.50 100 1.327620 0.5134157
## 0.4 3 0.6 0.50 150 1.327319 0.5136295
## 0.4 3 0.6 0.75 50 1.297246 0.5378903
## 0.4 3 0.6 0.75 100 1.295638 0.5390569
## 0.4 3 0.6 0.75 150 1.295633 0.5390587
## 0.4 3 0.6 1.00 50 1.282014 0.5381810
## 0.4 3 0.6 1.00 100 1.281122 0.5389368
## 0.4 3 0.6 1.00 150 1.281114 0.5389417
## 0.4 3 0.8 0.50 50 1.303177 0.5338694
## 0.4 3 0.8 0.50 100 1.300635 0.5354077
## 0.4 3 0.8 0.50 150 1.300469 0.5354850
## 0.4 3 0.8 0.75 50 1.263667 0.5582641
## 0.4 3 0.8 0.75 100 1.262950 0.5589727
## 0.4 3 0.8 0.75 150 1.262950 0.5589797
## 0.4 3 0.8 1.00 50 1.228379 0.5705836
## 0.4 3 0.8 1.00 100 1.227866 0.5709596
## 0.4 3 0.8 1.00 150 1.227847 0.5709701
## MAE
## 0.9968671
## 1.0047690
## 0.9992347
## 0.9736137
## 0.9560512
## 0.9476665
## 0.9687589
## 0.9595069
## 0.9588250
## 1.0025958
## 0.9884565
## 0.9806110
## 0.9721616
## 0.9524760
## 0.9526839
## 0.9824306
## 0.9613286
## 0.9611580
## 0.9896431
## 0.9783558
## 0.9780415
## 0.9478509
## 0.9375196
## 0.9357242
## 0.9462404
## 0.9393341
## 0.9391572
## 1.0006740
## 0.9879809
## 0.9842625
## 0.9456932
## 0.9326957
## 0.9320749
## 0.9447736
## 0.9340364
## 0.9324208
## 0.9956355
## 0.9936119
## 0.9933951
## 0.9605906
## 0.9578711
## 0.9576292
## 0.9348046
## 0.9323701
## 0.9322187
## 1.0012237
## 0.9943317
## 0.9942661
## 0.9500726
## 0.9470249
## 0.9468911
## 0.9664168
## 0.9649426
## 0.9648693
## 1.0734340
## 1.0566926
## 1.0503838
## 1.0101949
## 0.9930406
## 0.9907811
## 0.9894004
## 0.9756055
## 0.9718447
## 1.0560798
## 1.0465640
## 1.0484434
## 1.0152241
## 1.0003764
## 0.9938049
## 0.9788495
## 0.9689052
## 0.9664942
## 1.0358751
## 1.0303242
## 1.0291131
## 0.9958917
## 0.9936324
## 0.9935035
## 0.9597746
## 0.9550805
## 0.9538157
## 1.0333404
## 1.0278694
## 1.0277482
## 0.9673061
## 0.9585069
## 0.9573300
## 0.9693101
## 0.9666240
## 0.9661635
## 1.0423192
## 1.0393788
## 1.0392871
## 1.0046807
## 1.0035617
## 1.0035599
## 0.9999953
## 0.9995995
## 0.9996010
## 1.0416721
## 1.0389396
## 1.0387409
## 0.9645330
## 0.9634168
## 0.9634135
## 0.9491362
## 0.9486436
## 0.9486314
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 1.
cmpxgbPred <- predict(cmpxgb, newdata = cmp_test_X)
postResample(pred = cmpxgbPred, obs = cmp_test_y)
## RMSE Rsquared MAE
## 1.2570559 0.4949626 0.8991779
We considered a basic regression tree, a random forest, and an XGBoost model; the random forest gives the best test-set performance, with an RMSE of about 1.18 and an R^2 of about 0.52.
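For a side-by-side view, the test-set metrics computed above can be collected into one table (a small convenience sketch reusing the prediction objects created earlier):
#side-by-side test-set comparison of the three tree-based models
rbind(
  CART         = postResample(pred = cmpstPred,  obs = cmp_test_y),
  RandomForest = postResample(pred = cmprfPred,  obs = cmp_test_y),
  XGBoost      = postResample(pred = cmpxgbPred, obs = cmp_test_y)
)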
The optimal linear model (elastic net) included three biological predictors among its top variables, but they were less important than the manufacturing process variables. The optimal nonlinear model (SVM) spread importance over more predictors and included one additional biological predictor in its top list. In the random forest, most of the top predictors are again manufacturing processes, but one biological predictor, BiologicalMaterial12, ranks third, higher than any biological predictor ranked in the earlier models, which sets this model apart.
varImp(cmprf)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 39.807
## BiologicalMaterial12 34.167
## ManufacturingProcess17 25.162
## ManufacturingProcess31 19.495
## BiologicalMaterial03 16.679
## ManufacturingProcess09 16.276
## BiologicalMaterial06 12.736
## ManufacturingProcess06 11.094
## ManufacturingProcess11 10.634
## BiologicalMaterial11 10.103
## ManufacturingProcess36 8.545
## BiologicalMaterial05 7.346
## ManufacturingProcess24 7.084
## BiologicalMaterial02 7.045
## BiologicalMaterial04 5.823
## ManufacturingProcess18 5.607
## ManufacturingProcess28 5.193
## BiologicalMaterial08 4.852
## ManufacturingProcess30 4.735
It does not make sense to plot a single tree from a random forest or XGBoost ensemble, so instead we plot the final basic regression tree, even though it underperformed those models. The tree contains only two splits, both on manufacturing process variables, which is consistent with the finding that the biological material predictors are less important.
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
plot(as.party(cmpst$finalModel))