Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
library(AppliedPredictiveModeling)
Recreate the simulated data from Exercise 7.2:
library(mlbench)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |>
sort_by(rfImp1$Overall)
## Overall
## V8 -0.11724317
## V9 -0.10344797
## V7 0.02927888
## V10 0.04312556
## V6 0.12395003
## V3 0.72305459
## V5 2.13575650
## V2 6.27437240
## V4 7.50258584
## V1 8.62743275
plot(rfImp1)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No, the random forest model did not make significant use of them. V6 – V10 have importance scores near zero, well below those of the informative predictors V1 – V5.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(96)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
sort_by(rfImp2$Overall)
## Overall
## V8 -0.12333601
## V9 -0.05916537
## V7 -0.01873194
## V10 0.01692588
## V6 0.23016625
## V3 0.61572773
## V5 2.11481994
## duplicate1 3.33526493
## V1 6.18726213
## V2 6.49136015
## V4 7.06359432
plot(rfImp2)
Yes, adding the correlated predictor did change the importance score of V1 - per the output above, the score dropped from 8.62743275 to 6.18726213, and V1 now ranks below both V2 and V4.
If we add another highly correlated predictor with V1:
set.seed(97)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9372469
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
sort_by(rfImp3$Overall)
## Overall
## V8 -0.05666557
## V10 0.02602534
## V7 0.03781915
## V9 0.06286232
## V6 0.14070443
## V3 0.50344026
## duplicate1 1.91648994
## V5 2.22788259
## duplicate2 2.92265084
## V1 5.74694217
## V2 6.87763546
## V4 7.37160256
plot(rfImp3)
Adding another highly correlated predictor lowered the importance score of V1 further, to 5.74694217 from the original 8.62743275, and V1 again ranks below V2 and V4.
Note: because new variables were appended to the data frame and the models are refit, the exact scores may differ slightly when the code is re-run (for example, on RPubs).
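A quick sanity check (a sketch, using the rfImp3 scores already computed above): summing the importance of V1 and its two near-copies shows that the importance is shared across the correlated set rather than lost.
# Combined importance of V1 and its correlated copies
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])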
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
I recreated the simulated data so that the “duplicate” variables from the previous part are excluded:
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
set.seed(301)
# Recreate the data
simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
x <- as.matrix(simulated2[, -which(names(simulated2) == "y")])
### Tune the conditional inference forests
model4 <- cforest(y ~ ., data = simulated2)
model4
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 500
##
## Response: y
## Inputs: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10
## Number of observations: 200
rfImp4 <- varImp(model4)
rfImp4 |>
sort_by(rfImp4$Overall)
## Overall
## V9 -0.054383229
## V7 -0.032265145
## V8 -0.005465832
## V10 0.017008383
## V6 0.032425746
## V3 0.183586689
## V5 1.012205621
## V1 4.926082392
## V2 6.814068106
## V4 14.219625841
plot(rfImp4)
The conditional inference forest identifies the same informative predictors, although the ordering differs slightly (V4 is now clearly first). V6 – V10, the uninformative predictors, still receive negligible importance.
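The conditional argument mentioned in the exercise can be explored directly with party::varimp (a sketch; conditional = TRUE adjusts the permutation scheme for correlated predictors and is noticeably slower):
# Traditional vs. conditional permutation importance (Strobl et al., 2007)
imp_traditional <- party::varimp(model4, conditional = FALSE)
imp_conditional <- party::varimp(model4, conditional = TRUE)
sort(imp_traditional)
sort(imp_conditional)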
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted:
# Boosted
set.seed(100)
indx <- createFolds(simulated2$y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(x = x, y = simulated2$y,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl,
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 4.058318 0.5924175 3.310401
## 0.01 1 150 3.770109 0.6489209 3.081699
## 0.01 1 200 3.530493 0.6794660 2.875616
## 0.01 1 250 3.333894 0.7031430 2.719584
## 0.01 1 300 3.164019 0.7223481 2.585931
## 0.01 1 350 3.018967 0.7380079 2.458820
## 0.01 1 400 2.896962 0.7517797 2.361692
## 0.01 1 450 2.794121 0.7633783 2.275418
## 0.01 1 500 2.698804 0.7742894 2.193324
## 0.01 1 550 2.614342 0.7842219 2.117630
## 0.01 1 600 2.537068 0.7932969 2.055217
## 0.01 1 650 2.465999 0.8010597 1.997521
## 0.01 1 700 2.405243 0.8080453 1.951048
## 0.01 1 750 2.351374 0.8134648 1.907801
## 0.01 1 800 2.300874 0.8184132 1.867947
## 0.01 1 850 2.256131 0.8239474 1.832917
## 0.01 1 900 2.216133 0.8279178 1.798277
## 0.01 1 950 2.179846 0.8308984 1.769397
## 0.01 1 1000 2.146775 0.8335465 1.745185
## 0.01 3 100 3.506148 0.7129205 2.828383
## 0.01 3 150 3.126204 0.7404025 2.517388
## 0.01 3 200 2.859800 0.7631779 2.295266
## 0.01 3 250 2.651148 0.7842937 2.122757
## 0.01 3 300 2.481946 0.8026236 1.980995
## 0.01 3 350 2.344854 0.8159933 1.865088
## 0.01 3 400 2.237835 0.8262581 1.774919
## 0.01 3 450 2.158864 0.8335579 1.713537
## 0.01 3 500 2.091503 0.8394954 1.662698
## 0.01 3 550 2.036915 0.8449022 1.623011
## 0.01 3 600 1.999018 0.8483345 1.595316
## 0.01 3 650 1.965885 0.8510384 1.572702
## 0.01 3 700 1.937539 0.8536563 1.547325
## 0.01 3 750 1.915125 0.8556375 1.528509
## 0.01 3 800 1.897566 0.8571803 1.512269
## 0.01 3 850 1.881740 0.8586823 1.496784
## 0.01 3 900 1.869342 0.8602120 1.485079
## 0.01 3 950 1.860725 0.8613150 1.475414
## 0.01 3 1000 1.849139 0.8625925 1.464275
## 0.01 5 100 3.334653 0.7287282 2.683727
## 0.01 5 150 2.953547 0.7545739 2.371161
## 0.01 5 200 2.690003 0.7776444 2.143782
## 0.01 5 250 2.490851 0.7969697 1.966661
## 0.01 5 300 2.340446 0.8115355 1.847823
## 0.01 5 350 2.226874 0.8228967 1.766393
## 0.01 5 400 2.142324 0.8306664 1.700991
## 0.01 5 450 2.077926 0.8365060 1.648810
## 0.01 5 500 2.036130 0.8399066 1.612299
## 0.01 5 550 1.998080 0.8437432 1.583024
## 0.01 5 600 1.969221 0.8467325 1.556868
## 0.01 5 650 1.947325 0.8490996 1.537581
## 0.01 5 700 1.932640 0.8505316 1.521751
## 0.01 5 750 1.917808 0.8522819 1.507764
## 0.01 5 800 1.906614 0.8535401 1.494749
## 0.01 5 850 1.894959 0.8549981 1.483169
## 0.01 5 900 1.892677 0.8553683 1.480204
## 0.01 5 950 1.886431 0.8562238 1.473047
## 0.01 5 1000 1.880637 0.8569135 1.464939
## 0.01 7 100 3.309677 0.7334957 2.646904
## 0.01 7 150 2.917530 0.7633590 2.330623
## 0.01 7 200 2.643672 0.7867698 2.094781
## 0.01 7 250 2.447267 0.8025480 1.920383
## 0.01 7 300 2.301073 0.8169173 1.808000
## 0.01 7 350 2.196660 0.8265607 1.730428
## 0.01 7 400 2.117430 0.8338818 1.668192
## 0.01 7 450 2.064661 0.8384898 1.626317
## 0.01 7 500 2.027496 0.8417270 1.594787
## 0.01 7 550 1.994993 0.8446573 1.567457
## 0.01 7 600 1.970914 0.8469070 1.547581
## 0.01 7 650 1.953145 0.8486043 1.533542
## 0.01 7 700 1.940703 0.8497479 1.522346
## 0.01 7 750 1.930428 0.8510641 1.511353
## 0.01 7 800 1.923945 0.8516247 1.505294
## 0.01 7 850 1.919080 0.8519884 1.499366
## 0.01 7 900 1.914038 0.8526189 1.495252
## 0.01 7 950 1.906083 0.8536098 1.490161
## 0.01 7 1000 1.902139 0.8540804 1.485346
## 0.10 1 100 2.134127 0.8325122 1.727405
## 0.10 1 150 2.007291 0.8423442 1.648037
## 0.10 1 200 1.940182 0.8489443 1.587157
## 0.10 1 250 1.906457 0.8525617 1.547285
## 0.10 1 300 1.916332 0.8501194 1.556286
## 0.10 1 350 1.900535 0.8519577 1.548767
## 0.10 1 400 1.906887 0.8502586 1.560407
## 0.10 1 450 1.918735 0.8479802 1.560396
## 0.10 1 500 1.923622 0.8475754 1.570869
## 0.10 1 550 1.924362 0.8468278 1.565978
## 0.10 1 600 1.924163 0.8473865 1.559245
## 0.10 1 650 1.921287 0.8474421 1.559607
## 0.10 1 700 1.921758 0.8478190 1.561406
## 0.10 1 750 1.928198 0.8461037 1.565294
## 0.10 1 800 1.934524 0.8456940 1.578426
## 0.10 1 850 1.929941 0.8459980 1.580258
## 0.10 1 900 1.927487 0.8468707 1.578319
## 0.10 1 950 1.914378 0.8490282 1.571451
## 0.10 1 1000 1.930940 0.8460777 1.580815
## 0.10 3 100 1.959947 0.8455464 1.532718
## 0.10 3 150 1.915036 0.8509078 1.504204
## 0.10 3 200 1.907346 0.8509073 1.492786
## 0.10 3 250 1.918260 0.8494940 1.500916
## 0.10 3 300 1.926056 0.8478941 1.511760
## 0.10 3 350 1.926462 0.8477982 1.513494
## 0.10 3 400 1.925101 0.8479021 1.513746
## 0.10 3 450 1.936142 0.8465022 1.518243
## 0.10 3 500 1.941133 0.8461583 1.525287
## 0.10 3 550 1.936736 0.8471680 1.521190
## 0.10 3 600 1.939495 0.8467436 1.525985
## 0.10 3 650 1.937868 0.8466814 1.524997
## 0.10 3 700 1.941183 0.8464456 1.526388
## 0.10 3 750 1.941095 0.8464800 1.527839
## 0.10 3 800 1.940002 0.8467537 1.526638
## 0.10 3 850 1.942700 0.8464657 1.528499
## 0.10 3 900 1.942200 0.8465262 1.527372
## 0.10 3 950 1.944581 0.8460697 1.529178
## 0.10 3 1000 1.944101 0.8461030 1.528629
## 0.10 5 100 1.938922 0.8464963 1.543350
## 0.10 5 150 1.897555 0.8506336 1.508761
## 0.10 5 200 1.898881 0.8507215 1.510617
## 0.10 5 250 1.902102 0.8491688 1.511537
## 0.10 5 300 1.908680 0.8482477 1.513541
## 0.10 5 350 1.905923 0.8486424 1.512943
## 0.10 5 400 1.910236 0.8482689 1.513470
## 0.10 5 450 1.917145 0.8470926 1.516738
## 0.10 5 500 1.916255 0.8472018 1.514901
## 0.10 5 550 1.917552 0.8470937 1.514390
## 0.10 5 600 1.917671 0.8471799 1.515314
## 0.10 5 650 1.917617 0.8473505 1.515308
## 0.10 5 700 1.918753 0.8471808 1.515736
## 0.10 5 750 1.918110 0.8472724 1.515217
## 0.10 5 800 1.918515 0.8472319 1.515510
## 0.10 5 850 1.918912 0.8472072 1.515827
## 0.10 5 900 1.918991 0.8471846 1.515720
## 0.10 5 950 1.919123 0.8471737 1.515937
## 0.10 5 1000 1.919023 0.8471928 1.515936
## 0.10 7 100 2.025192 0.8324992 1.587500
## 0.10 7 150 2.013766 0.8320011 1.568014
## 0.10 7 200 2.009108 0.8333405 1.577025
## 0.10 7 250 2.004631 0.8343764 1.574972
## 0.10 7 300 2.007876 0.8345229 1.577478
## 0.10 7 350 2.010494 0.8342342 1.581563
## 0.10 7 400 2.008746 0.8347960 1.578158
## 0.10 7 450 2.011629 0.8346025 1.581348
## 0.10 7 500 2.010311 0.8347182 1.581307
## 0.10 7 550 2.010985 0.8346036 1.581474
## 0.10 7 600 2.010909 0.8347145 1.582414
## 0.10 7 650 2.011737 0.8345464 1.582626
## 0.10 7 700 2.012785 0.8343701 1.583422
## 0.10 7 750 2.013036 0.8343306 1.583656
## 0.10 7 800 2.013004 0.8343629 1.583545
## 0.10 7 850 2.012843 0.8343711 1.583382
## 0.10 7 900 2.012821 0.8343597 1.583525
## 0.10 7 950 2.012731 0.8343750 1.583666
## 0.10 7 1000 2.012658 0.8343713 1.583659
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
## gbm variable importance
##
## Overall
## V4 63068.6
## V2 33875.9
## V1 27952.7
## V5 12875.8
## V3 12690.7
## V6 2204.7
## V7 1548.4
## V10 1235.9
## V9 1207.3
## V8 951.2
Boosting shows the same pattern: the ordering of the informative predictors matches the conditional inference forest, and V6 – V10 remain unimportant. The raw importance values are much larger because gbm reports relative influence (the reduction in squared error attributed to splits on each predictor, accumulated over all trees) rather than a permutation-based score.
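Because the scales differ so much, a quick rescaling (a sketch, using the gbmImp object already computed) makes the pattern easier to compare across models - dividing each importance by the maximum puts everything on a 0 to 1 scale:
# Express gbm importances as a fraction of the largest importance
gbm_rel <- gbmImp$importance / max(gbmImp$importance)
round(gbm_rel[order(-gbm_rel$Overall), , drop = FALSE], 3)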
Cubist:
# Cubist
set.seed(98)
cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100),
neighbors = c(0, 1, 5, 9))
set.seed(100)
cubistTune <- train(x = x, y = simulated2$y,
"cubist",
tuneGrid = cbGrid,
trControl = ctrl)
cubistTune
## Cubist
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 2.486110 0.7496162 1.980196
## 1 1 3.008986 0.6479950 2.301460
## 1 5 2.347183 0.7695990 1.815079
## 1 9 2.281047 0.7814085 1.772161
## 2 0 2.352679 0.7700249 1.876807
## 2 1 2.763001 0.6833978 2.125779
## 2 5 2.245669 0.7852810 1.751135
## 2 9 2.185548 0.7953453 1.716768
## 3 0 2.325927 0.7739165 1.857013
## 3 1 2.737924 0.6900825 2.073857
## 3 5 2.231359 0.7869371 1.738960
## 3 9 2.173200 0.7967481 1.703917
## 4 0 2.343977 0.7694343 1.868838
## 4 1 2.715398 0.6928613 2.083206
## 4 5 2.236297 0.7860729 1.750201
## 4 9 2.185947 0.7944130 1.717669
## 5 0 2.342287 0.7712170 1.867755
## 5 1 2.725670 0.6924045 2.071722
## 5 5 2.236653 0.7867860 1.748587
## 5 9 2.187133 0.7949951 1.716766
## 6 0 2.336457 0.7702762 1.862026
## 6 1 2.711196 0.6933901 2.078510
## 6 5 2.236092 0.7861548 1.750726
## 6 9 2.184520 0.7945777 1.718319
## 7 0 2.327275 0.7729738 1.851675
## 7 1 2.714289 0.6942385 2.070512
## 7 5 2.225451 0.7883570 1.738434
## 7 9 2.175444 0.7965496 1.707181
## 8 0 2.324069 0.7709461 1.858714
## 8 1 2.727350 0.6916902 2.094393
## 8 5 2.228022 0.7870347 1.751677
## 8 9 2.176313 0.7953469 1.718839
## 9 0 2.321950 0.7731831 1.849893
## 9 1 2.731753 0.6922418 2.089778
## 9 5 2.226359 0.7884181 1.742640
## 9 9 2.174664 0.7966778 1.712592
## 10 0 2.313559 0.7737514 1.846239
## 10 1 2.730948 0.6922633 2.092004
## 10 5 2.223155 0.7879975 1.740175
## 10 9 2.170332 0.7967802 1.708415
## 20 0 2.315033 0.7768608 1.836828
## 20 1 2.756712 0.6898296 2.111097
## 20 5 2.223160 0.7899620 1.726474
## 20 9 2.170726 0.7984768 1.689206
## 50 0 2.275530 0.7851917 1.819671
## 50 1 2.758171 0.6883308 2.111673
## 50 5 2.196736 0.7934268 1.714041
## 50 9 2.129414 0.8041390 1.664779
## 75 0 2.266783 0.7861628 1.800135
## 75 1 2.746444 0.6891327 2.092617
## 75 5 2.189819 0.7936866 1.698632
## 75 9 2.120822 0.8046725 1.649021
## 100 0 2.263316 0.7876348 1.797635
## 100 1 2.743127 0.6900732 2.079693
## 100 5 2.192029 0.7940890 1.689447
## 100 9 2.122296 0.8054282 1.642360
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 75 and neighbors = 9.
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))
cbImp <- varImp(cubistTune, scale = FALSE)
cbImp
## cubist variable importance
##
## Overall
## V2 57.5
## V4 57.5
## V1 54.0
## V3 30.0
## V5 26.0
## V7 3.0
## V10 3.0
## V9 1.0
## V6 0.5
## V8 0.0
Cubist also shows the same overall pattern: V1, V2, and V4 top the list while V6 – V10 are essentially unused. The scores are smaller and non-negative because caret derives them from how often each predictor appears in the rule conditions and committee models (a usage measure) rather than from permutation.
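The usage-based nature of the Cubist scores can be inspected in the model summary (a sketch; the attribute usage table near the end of the printed output lists how often each predictor appears in rule conditions and in the models):
# The 'Attribute usage' section at the end of this output underlies varImp()
summary(cubistTune$finalModel)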
Use a simulation to show tree bias with different granularities.
The textbook states that “predictors with a higher number of distinct values are favored over more granular predictors” and that “there is a high probability that the noise variables will be chosen to split the top nodes of the tree.” So let’s add near-duplicate predictors - each an informative variable plus a tiny amount of noise, giving it essentially the same information but a slightly different set of values:
set.seed(99)
simulated3 <- simulated
nrows <- nrow(simulated3)
rands <- 1e-3 * runif(nrows)
# Add near-duplicate predictors (informative variable plus a tiny amount of noise)
simulated3$R1 <- simulated3$V1 + rands
simulated3$R2 <- simulated3$V2 + rands
model_gran_1 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
rfImp_gran_1 <- varImp(model_gran_1, scale = FALSE)
rfImp_gran_1 |>
sort_by(rfImp_gran_1$Overall)
## Overall
## V8 -0.08590449
## V9 -0.03539313
## V10 -0.02639602
## V7 0.05484661
## V6 0.19828435
## V3 0.35486249
## duplicate1 1.21086204
## V5 1.64869393
## duplicate2 1.87416397
## V1 3.66979186
## R1 4.03126666
## V2 4.06406114
## R2 4.27598991
## V4 6.61618188
plot(rfImp_gran_1)
The model splits importance between each informative predictor and its near-copy: R1 and R2 score slightly higher than V1 and V2, and both far outrank the earlier duplicate1 and duplicate2 variables.
Let’s try adding 2 more noise predictors:
set.seed(100)
# Add near-duplicate predictors (informative variable plus a tiny amount of noise)
simulated3$R3 <- simulated3$V3 + rands
simulated3$R4 <- simulated3$V4 + rands
model_gran_2 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
rfImp_gran_2 <- varImp(model_gran_2, scale = FALSE)
rfImp_gran_2 |>
sort_by(rfImp_gran_2$Overall)
## Overall
## V8 -0.09651461
## V7 -0.03406034
## V9 -0.02245271
## V6 0.05992412
## V10 0.06913218
## V3 0.37034446
## R3 0.38376969
## duplicate1 1.09500925
## V5 1.65254570
## duplicate2 1.68789327
## V1 3.51416840
## R2 3.75925216
## R1 3.83141753
## V2 4.08727341
## V4 4.12691674
## R4 4.58042947
plot(rfImp_gran_2)
This model again spreads importance across the near-copies (R4 even outranks V4), showing how the choice among many nearly interchangeable split candidates biases the importance rankings. A more direct simulation of the granularity effect is sketched below.
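Here is a more direct simulation of the granularity effect (a sketch; it assumes the rpart package, and the variable names are made up for illustration). Both predictors are pure noise, but one is continuous while the other has only four distinct values; across repeated simulations the continuous predictor tends to be chosen for the first split far more often.
library(rpart)

set.seed(102)
first_split <- replicate(200, {
  dat <- data.frame(y      = rnorm(100),
                    fine   = runif(100),                          # ~100 distinct values
                    coarse = sample(1:4, 100, replace = TRUE))    # 4 distinct values
  # Force a single split so we can see which variable the tree prefers
  fit <- rpart(y ~ fine + coarse, data = dat,
               control = rpart.control(maxdepth = 1, cp = 0))
  rownames(fit$splits)[1]   # variable used for the primary split at the root
})
table(first_split)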
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
With a bagging fraction of 0.1, each tree is fit to only a small random subsample of the data, and with a learning rate of 0.1 each tree contributes only a small step toward the final fit. The trees are therefore more diverse, many different predictors get a chance to be selected across the ensemble, and importance is spread over more predictors. With both parameters set to 0.9, each tree sees nearly all of the data and contributes heavily, so the same dominant predictors are chosen early and repeatedly, and the importance is concentrated on the first few predictors.
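The effect can be illustrated on the simulated data rather than the solubility data used in Fig. 8.24 (a sketch; gbm is already loaded, and n.minobsinnode is lowered to 5 only because a 0.1 bagging fraction of 200 samples leaves very few rows per tree):
# Mirror the two extremes of Fig. 8.24: (shrinkage, bag.fraction) = (0.1, 0.1) vs. (0.9, 0.9)
set.seed(103)
gbm_left  <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_right <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
# Relative influence: the right-hand settings are expected to concentrate on fewer predictors
summary(gbm_left,  plotit = FALSE)
summary(gbm_right, plotit = FALSE)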
Which model do you think would be more predictive of other samples?
The left-hand model is more likely to predict other samples well: the small learning rate and small bagging fraction act as regularization and usually improve generalization, at the cost of requiring more trees and more computation, so I would choose the model with both parameters set to 0.1.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would flatten the slope of the predictor importance for either model: deeper trees use more predictors per tree, so importance is spread across more predictors and the relative importance of the top predictor decreases.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Get the data ready:
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcessCopy <- ChemicalManufacturingProcess
# Imputation - use mean
for(i in 1:ncol(ChemicalManufacturingProcessCopy)) {
ChemicalManufacturingProcessCopy[is.na(ChemicalManufacturingProcessCopy[,i]), i] <- mean(ChemicalManufacturingProcessCopy[,i], na.rm = TRUE)
}
# Split the data
chem_train_set_x_df <- as.data.frame(ChemicalManufacturingProcessCopy[1:110,])
chem_train_set_x <- ChemicalManufacturingProcessCopy[1:110,]
chem_test_set_x <- ChemicalManufacturingProcessCopy[111:nrow(ChemicalManufacturingProcessCopy),]
chem_train_set_y_df <- as.data.frame(ChemicalManufacturingProcess[1:110,])
chem_train_set_y <- ChemicalManufacturingProcess[1:110,]
# choose response variable
y <- chem_train_set_x$Yield
# all the predictors need to be put into a matrix
x <- as.matrix(chem_train_set_x[, -which(names(chem_train_set_x) == "Yield")])
Let’s try Bagged Trees:
# Bagged Trees
set.seed(101)
indx <- createFolds(y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
treebagTune <- train(x = x, y = y,
method = "treebag",
nbagg = 50,
trControl = ctrl)
treebagTune
## Bagged CART
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.026206 0.7308844 0.7696015
treebagTune$finalModel
##
## Bagging regression trees with 50 bootstrap replications
chem_test_y <- chem_test_set_x[, which(names(chem_test_set_x) == "Yield")]
chem_test_x <- as.matrix(chem_test_set_x[, -which(names(chem_test_set_x) == "Yield")])
treebagTunePred <- predict(treebagTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = treebagTunePred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.5280023 0.2198921 1.2271630
For bagged CART, the cross-validated RMSE is 1.026206 and the Rsquared is 0.7308844 - a reasonable but not outstanding fit.
On the held-out test set the RMSE is 1.5280023 and the Rsquared is 0.2198921, so the bagged model does not generalize well to the test rows.
Now let’s try Random Forests:
# Random Forests
mtryGrid <- data.frame(mtry = floor(seq(10, ncol(x), length = 10)))
### Tune the model using cross-validation ###
set.seed(100)
rfTune <- train(x = x, y = y,
method = "rf",
tuneGrid = mtryGrid,
ntree = 1000,
importance = TRUE,
trControl = ctrl)
rfTune
## Random Forest
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 10 0.9256904 0.8256316 0.7206621
## 15 0.9162436 0.8202740 0.7023533
## 20 0.9054228 0.8162479 0.6954435
## 25 0.9150617 0.8077850 0.6937333
## 30 0.9171069 0.8002505 0.6942974
## 36 0.9185005 0.7972996 0.6924881
## 41 0.9271291 0.7868590 0.6950346
## 46 0.9213070 0.7909617 0.6903651
## 51 0.9333129 0.7793265 0.7017964
## 57 0.9373766 0.7745866 0.7064238
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
rfTune$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 20
##
## Mean of squared residuals: 0.8859545
## % Var explained: 75.3
plot(rfTune)
rfImp <- varImp(rfTune, scale = FALSE)
rfImp
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 26.090
## ManufacturingProcess13 21.594
## ManufacturingProcess09 14.339
## BiologicalMaterial03 14.166
## ManufacturingProcess17 14.107
## BiologicalMaterial08 12.627
## ManufacturingProcess31 11.300
## BiologicalMaterial02 11.255
## BiologicalMaterial12 10.438
## BiologicalMaterial11 9.910
## BiologicalMaterial06 9.102
## ManufacturingProcess28 9.074
## ManufacturingProcess06 8.122
## BiologicalMaterial01 7.559
## ManufacturingProcess36 7.067
## ManufacturingProcess33 6.616
## ManufacturingProcess01 6.600
## ManufacturingProcess11 6.461
## BiologicalMaterial04 6.130
## ManufacturingProcess24 6.047
rfTunePred <- predict(rfTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = rfTunePred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.5161315 0.1618175 1.2556979
### Tune the model using the OOB estimates ###
ctrlOOB <- trainControl(method = "oob")
set.seed(100)
rfTuneOOB <- train(x = x, y = y,
method = "rf",
tuneGrid = mtryGrid,
ntree = 1000,
importance = TRUE,
trControl = ctrlOOB)
rfTuneOOB
## Random Forest
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 10 0.9494792 0.7487014
## 15 0.9349442 0.7563365
## 20 0.9290242 0.7594125
## 25 0.9378709 0.7548086
## 30 0.9325077 0.7576049
## 36 0.9269330 0.7604944
## 41 0.9375693 0.7549663
## 46 0.9436898 0.7517567
## 51 0.9328969 0.7574025
## 57 0.9470804 0.7499696
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 36.
rfTuneOOB$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 36
##
## Mean of squared residuals: 0.8798012
## % Var explained: 75.48
plot(rfTuneOOB)
rfTuneOOBPred <- predict(rfTuneOOB, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = rfTuneOOBPred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.5321968 0.1638437 1.2591023
### Tune the conditional inference forests ####
set.seed(100)
condrfTune <- train(x = x, y = y,
method = "cforest",
tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree = 1000),
trControl = ctrl)
condrfTune
## Conditional Inference Random Forest
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 10 1.168003 0.7133461 0.9102069
## 15 1.149683 0.7017961 0.8961622
## 20 1.141419 0.6922299 0.8858195
## 25 1.126813 0.6923026 0.8690717
## 30 1.128123 0.6869251 0.8664302
## 36 1.121336 0.6850380 0.8565380
## 41 1.117175 0.6848067 0.8525516
## 46 1.117668 0.6823346 0.8503776
## 51 1.122439 0.6758730 0.8520768
## 57 1.123381 0.6746645 0.8511820
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 41.
condrfTune$finalModel
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 1000
##
## Response: .outcome
## Inputs: BiologicalMaterial01, BiologicalMaterial02, BiologicalMaterial03, BiologicalMaterial04, BiologicalMaterial05, BiologicalMaterial06, BiologicalMaterial07, BiologicalMaterial08, BiologicalMaterial09, BiologicalMaterial10, BiologicalMaterial11, BiologicalMaterial12, ManufacturingProcess01, ManufacturingProcess02, ManufacturingProcess03, ManufacturingProcess04, ManufacturingProcess05, ManufacturingProcess06, ManufacturingProcess07, ManufacturingProcess08, ManufacturingProcess09, ManufacturingProcess10, ManufacturingProcess11, ManufacturingProcess12, ManufacturingProcess13, ManufacturingProcess14, ManufacturingProcess15, ManufacturingProcess16, ManufacturingProcess17, ManufacturingProcess18, ManufacturingProcess19, ManufacturingProcess20, ManufacturingProcess21, ManufacturingProcess22, ManufacturingProcess23, ManufacturingProcess24, ManufacturingProcess25, ManufacturingProcess26, ManufacturingProcess27, ManufacturingProcess28, ManufacturingProcess29, ManufacturingProcess30, ManufacturingProcess31, ManufacturingProcess32, ManufacturingProcess33, ManufacturingProcess34, ManufacturingProcess35, ManufacturingProcess36, ManufacturingProcess37, ManufacturingProcess38, ManufacturingProcess39, ManufacturingProcess40, ManufacturingProcess41, ManufacturingProcess42, ManufacturingProcess43, ManufacturingProcess44, ManufacturingProcess45
## Number of observations: 110
plot(condrfTune)
condrfTunePred <- predict(condrfTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = condrfTunePred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.555673 0.111441 1.296947
set.seed(100)
condrfTuneOOB <- train(x = x, y = y,
method = "cforest",
tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree = 1000),
trControl = trainControl(method = "oob"))
condrfTuneOOB
## Conditional Inference Random Forest
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 10 1.172185 0.6977276 0.8991333
## 15 1.143132 0.6929247 0.8738530
## 20 1.133728 0.6868864 0.8626660
## 25 1.126456 0.6811510 0.8570306
## 30 1.119113 0.6848342 0.8491332
## 36 1.115565 0.6824505 0.8426582
## 41 1.109359 0.6857689 0.8369577
## 46 1.116558 0.6766673 0.8436031
## 51 1.117476 0.6750039 0.8339748
## 57 1.120075 0.6729624 0.8386448
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 41.
plot(condrfTuneOOB)
condrfTuneOOBPred <- predict(condrfTuneOOB, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = condrfTuneOOBPred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.5605394 0.1088396 1.3030801
For the random forest tuned with cross-validation, the RMSE is 0.9054228 and the Rsquared is 0.8077850 - the best resampling performance so far. On the held-out test set the RMSE is 1.5161315 and the Rsquared is 0.1618175, a large drop-off.
For the random forest tuned with the OOB estimates, the RMSE is 0.9269330 and the Rsquared is 0.7604944 - also good, but not the best. Its test set RMSE is 1.5321968 with an Rsquared of 0.1638437, again much worse than the resampling estimate.
For the conditional inference forest tuned with cross-validation, the RMSE is 1.117175 and the Rsquared is 0.6848067, a noticeably weaker fit. Its test set RMSE is 1.555673 with an Rsquared of 0.111441.
For the conditional inference forest tuned with the OOB estimates, the RMSE is 1.109359 and the Rsquared is 0.6857689, also relatively weak. Its test set RMSE is 1.5605394 with an Rsquared of 0.1088396. In every case the test set performance falls far short of the resampling estimate.
Now let’s try Boosting:
### Boosting
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(x = x, y = y,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl,
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 1.4483379 0.6538321 1.1440673
## 0.01 1 150 1.3171874 0.6709749 1.0358586
## 0.01 1 200 1.2197950 0.6871786 0.9513279
## 0.01 1 250 1.1543350 0.6967779 0.8891422
## 0.01 1 300 1.1071761 0.7036007 0.8478779
## 0.01 1 350 1.0794501 0.7072811 0.8265490
## 0.01 1 400 1.0569652 0.7127157 0.8135310
## 0.01 1 450 1.0388624 0.7195574 0.8032316
## 0.01 1 500 1.0273715 0.7225962 0.7964284
## 0.01 1 550 1.0164232 0.7256218 0.7904974
## 0.01 1 600 1.0058794 0.7312265 0.7815788
## 0.01 1 650 0.9972844 0.7351424 0.7762622
## 0.01 1 700 0.9891540 0.7388222 0.7684108
## 0.01 1 750 0.9819805 0.7426713 0.7630539
## 0.01 1 800 0.9753691 0.7461526 0.7569755
## 0.01 1 850 0.9688896 0.7505879 0.7513533
## 0.01 1 900 0.9654662 0.7521768 0.7482775
## 0.01 1 950 0.9621877 0.7542609 0.7454933
## 0.01 1 1000 0.9572211 0.7564531 0.7407071
## 0.01 3 100 1.3180069 0.6883527 1.0318341
## 0.01 3 150 1.1834390 0.7102265 0.9078393
## 0.01 3 200 1.1096197 0.7210433 0.8398467
## 0.01 3 250 1.0617475 0.7303726 0.8024690
## 0.01 3 300 1.0232679 0.7426427 0.7726892
## 0.01 3 350 0.9974040 0.7518259 0.7544995
## 0.01 3 400 0.9795083 0.7586085 0.7442238
## 0.01 3 450 0.9664389 0.7645341 0.7346622
## 0.01 3 500 0.9573812 0.7670590 0.7304341
## 0.01 3 550 0.9449853 0.7722196 0.7204149
## 0.01 3 600 0.9365183 0.7751392 0.7143587
## 0.01 3 650 0.9287649 0.7788827 0.7067020
## 0.01 3 700 0.9226300 0.7814043 0.7017734
## 0.01 3 750 0.9170305 0.7832468 0.6991913
## 0.01 3 800 0.9104650 0.7868112 0.6930390
## 0.01 3 850 0.9066445 0.7883897 0.6905453
## 0.01 3 900 0.9016557 0.7904849 0.6859483
## 0.01 3 950 0.8979714 0.7923676 0.6835404
## 0.01 3 1000 0.8962644 0.7932470 0.6819073
## 0.01 5 100 1.3092433 0.6989733 1.0114638
## 0.01 5 150 1.1779116 0.7119220 0.9024741
## 0.01 5 200 1.0947939 0.7286074 0.8330031
## 0.01 5 250 1.0384050 0.7452473 0.7910733
## 0.01 5 300 1.0026726 0.7573724 0.7623054
## 0.01 5 350 0.9769106 0.7639812 0.7429002
## 0.01 5 400 0.9558003 0.7714597 0.7251429
## 0.01 5 450 0.9404104 0.7778427 0.7175250
## 0.01 5 500 0.9296583 0.7814004 0.7100643
## 0.01 5 550 0.9174227 0.7869714 0.6997580
## 0.01 5 600 0.9071742 0.7913948 0.6916150
## 0.01 5 650 0.9010271 0.7932504 0.6870746
## 0.01 5 700 0.8910694 0.7974917 0.6783659
## 0.01 5 750 0.8835269 0.8008954 0.6752252
## 0.01 5 800 0.8787979 0.8026178 0.6715871
## 0.01 5 850 0.8752738 0.8040099 0.6679490
## 0.01 5 900 0.8705713 0.8056741 0.6648670
## 0.01 5 950 0.8686084 0.8061536 0.6647348
## 0.01 5 1000 0.8653855 0.8068112 0.6622400
## 0.01 7 100 1.3059642 0.6987248 1.0142280
## 0.01 7 150 1.1697321 0.7182879 0.8992400
## 0.01 7 200 1.0989604 0.7254342 0.8349470
## 0.01 7 250 1.0520240 0.7347828 0.7963163
## 0.01 7 300 1.0215718 0.7446286 0.7732809
## 0.01 7 350 0.9991260 0.7509128 0.7560098
## 0.01 7 400 0.9842015 0.7556447 0.7451752
## 0.01 7 450 0.9691273 0.7609056 0.7350178
## 0.01 7 500 0.9566394 0.7664677 0.7266722
## 0.01 7 550 0.9417267 0.7729195 0.7174703
## 0.01 7 600 0.9320880 0.7768330 0.7124776
## 0.01 7 650 0.9250302 0.7797262 0.7091338
## 0.01 7 700 0.9150232 0.7838578 0.7040437
## 0.01 7 750 0.9103502 0.7862041 0.7002014
## 0.01 7 800 0.9057571 0.7887962 0.6961219
## 0.01 7 850 0.8984976 0.7921459 0.6900663
## 0.01 7 900 0.8946902 0.7935503 0.6870229
## 0.01 7 950 0.8921999 0.7944583 0.6854384
## 0.01 7 1000 0.8874656 0.7963946 0.6818204
## 0.10 1 100 1.0242459 0.7181196 0.7979675
## 0.10 1 150 1.0063041 0.7287274 0.7795413
## 0.10 1 200 0.9832385 0.7391242 0.7705161
## 0.10 1 250 0.9632324 0.7493378 0.7643176
## 0.10 1 300 0.9541549 0.7505433 0.7478313
## 0.10 1 350 0.9472063 0.7534527 0.7438994
## 0.10 1 400 0.9569712 0.7473762 0.7520343
## 0.10 1 450 0.9522323 0.7499997 0.7512065
## 0.10 1 500 0.9574789 0.7457934 0.7523744
## 0.10 1 550 0.9543862 0.7471706 0.7536521
## 0.10 1 600 0.9524283 0.7473309 0.7480448
## 0.10 1 650 0.9525672 0.7480298 0.7480545
## 0.10 1 700 0.9550885 0.7463772 0.7503086
## 0.10 1 750 0.9534650 0.7472071 0.7503689
## 0.10 1 800 0.9618848 0.7433281 0.7583346
## 0.10 1 850 0.9527264 0.7476301 0.7493534
## 0.10 1 900 0.9534093 0.7474830 0.7490165
## 0.10 1 950 0.9550800 0.7459267 0.7488849
## 0.10 1 1000 0.9549024 0.7461965 0.7482082
## 0.10 3 100 0.8672828 0.7997504 0.6724252
## 0.10 3 150 0.8587408 0.8022805 0.6684485
## 0.10 3 200 0.8491344 0.8051998 0.6655315
## 0.10 3 250 0.8438715 0.8077207 0.6636223
## 0.10 3 300 0.8364416 0.8099339 0.6573446
## 0.10 3 350 0.8361030 0.8095964 0.6570500
## 0.10 3 400 0.8359125 0.8095954 0.6576636
## 0.10 3 450 0.8347453 0.8101226 0.6579651
## 0.10 3 500 0.8336613 0.8104941 0.6579158
## 0.10 3 550 0.8332927 0.8105735 0.6581181
## 0.10 3 600 0.8326773 0.8107482 0.6574003
## 0.10 3 650 0.8326588 0.8107899 0.6574405
## 0.10 3 700 0.8329870 0.8105758 0.6581416
## 0.10 3 750 0.8320001 0.8110068 0.6569590
## 0.10 3 800 0.8321554 0.8108708 0.6571576
## 0.10 3 850 0.8320772 0.8109064 0.6572266
## 0.10 3 900 0.8321044 0.8108823 0.6573080
## 0.10 3 950 0.8320347 0.8109105 0.6573072
## 0.10 3 1000 0.8318342 0.8109653 0.6571798
## 0.10 5 100 0.9593756 0.7549443 0.7419556
## 0.10 5 150 0.9357796 0.7648825 0.7267111
## 0.10 5 200 0.9208813 0.7720514 0.7165350
## 0.10 5 250 0.9083948 0.7773926 0.7018404
## 0.10 5 300 0.9008779 0.7803813 0.6932426
## 0.10 5 350 0.9024242 0.7788952 0.6943978
## 0.10 5 400 0.9017370 0.7788301 0.6936603
## 0.10 5 450 0.8997799 0.7796099 0.6918977
## 0.10 5 500 0.9006895 0.7789687 0.6936504
## 0.10 5 550 0.8985394 0.7799777 0.6912532
## 0.10 5 600 0.8981671 0.7802889 0.6908763
## 0.10 5 650 0.8965494 0.7807599 0.6900992
## 0.10 5 700 0.8956180 0.7811850 0.6889932
## 0.10 5 750 0.8950937 0.7813501 0.6887358
## 0.10 5 800 0.8945658 0.7814615 0.6886371
## 0.10 5 850 0.8940807 0.7816393 0.6880690
## 0.10 5 900 0.8940101 0.7816174 0.6879583
## 0.10 5 950 0.8936770 0.7817423 0.6876149
## 0.10 5 1000 0.8934289 0.7818461 0.6872670
## 0.10 7 100 0.9050152 0.7910557 0.7086485
## 0.10 7 150 0.8759106 0.8036110 0.6886651
## 0.10 7 200 0.8706129 0.8053617 0.6906914
## 0.10 7 250 0.8610847 0.8083093 0.6854024
## 0.10 7 300 0.8635373 0.8071970 0.6890273
## 0.10 7 350 0.8609643 0.8081134 0.6881393
## 0.10 7 400 0.8578950 0.8093314 0.6858182
## 0.10 7 450 0.8562474 0.8099959 0.6850984
## 0.10 7 500 0.8559575 0.8097251 0.6842265
## 0.10 7 550 0.8550067 0.8098136 0.6839183
## 0.10 7 600 0.8541177 0.8100802 0.6827144
## 0.10 7 650 0.8538638 0.8100574 0.6827483
## 0.10 7 700 0.8533798 0.8101361 0.6825868
## 0.10 7 750 0.8533797 0.8101053 0.6828099
## 0.10 7 800 0.8537345 0.8098217 0.6833863
## 0.10 7 850 0.8537310 0.8097881 0.6834586
## 0.10 7 900 0.8540211 0.8095763 0.6836386
## 0.10 7 950 0.8538608 0.8096288 0.6833454
## 0.10 7 1000 0.8539068 0.8096231 0.6834146
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
gbmTune$finalModel
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 57 predictors of which 56 had non-zero influence.
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 333.74
## ManufacturingProcess13 170.02
## ManufacturingProcess06 157.71
## ManufacturingProcess09 72.72
## BiologicalMaterial03 55.38
## ManufacturingProcess31 55.23
## BiologicalMaterial11 51.71
## BiologicalMaterial08 49.84
## ManufacturingProcess17 41.96
## BiologicalMaterial12 37.05
## BiologicalMaterial09 26.68
## ManufacturingProcess15 25.61
## ManufacturingProcess05 24.82
## ManufacturingProcess42 24.30
## ManufacturingProcess01 23.88
## ManufacturingProcess04 23.29
## BiologicalMaterial04 19.70
## ManufacturingProcess20 17.57
## BiologicalMaterial02 16.44
## ManufacturingProcess03 15.67
gbmTunePred <- predict(gbmTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = gbmTunePred, obs = chem_test_y)
## RMSE Rsquared MAE
## 1.6278311 0.1257552 1.2691626
For stochastic gradient boosting, the cross-validated RMSE is 0.8318342 and the Rsquared is 0.8109653 - the best Rsquared so far.
On the held-out test set the RMSE is 1.6278311 and the Rsquared is 0.1257552, so, like the other models, it does not generalize well to the test rows.
Now let’s try Cubist:
### Cubist
cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100),
neighbors = c(0, 1, 5, 9))
set.seed(100)
cubistTune <- train(x, y,
"cubist",
tuneGrid = cbGrid,
trControl = ctrl)
cubistTune
## Cubist
##
## 110 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.9760591 0.7239697 0.7866821
## 1 1 0.8989662 0.8078457 0.6304803
## 1 5 0.8694968 0.7881871 0.6746366
## 1 9 0.8939106 0.7704492 0.7073263
## 2 0 0.9683191 0.7615857 0.7374496
## 2 1 0.8941032 0.8067911 0.6502860
## 2 5 0.8985366 0.7990245 0.6453925
## 2 9 0.9305241 0.7821177 0.6816569
## 3 0 0.8565233 0.8029734 0.6620314
## 3 1 0.8120012 0.8328901 0.5808218
## 3 5 0.8032756 0.8260645 0.5904685
## 3 9 0.8170974 0.8219461 0.6041171
## 4 0 0.8801530 0.7952121 0.6852915
## 4 1 0.8106978 0.8429572 0.5693896
## 4 5 0.8170935 0.8300469 0.5854132
## 4 9 0.8478985 0.8177320 0.6272331
## 5 0 0.8708639 0.8035874 0.6652981
## 5 1 0.7599308 0.8523177 0.5419491
## 5 5 0.8105990 0.8310991 0.5925108
## 5 9 0.8378839 0.8203181 0.6255494
## 6 0 0.8650573 0.8128972 0.6686134
## 6 1 0.7616806 0.8594168 0.5399942
## 6 5 0.7958720 0.8420401 0.5805012
## 6 9 0.8277821 0.8299176 0.6207085
## 7 0 0.8397481 0.8202117 0.6522682
## 7 1 0.7444919 0.8622388 0.5309700
## 7 5 0.7860351 0.8409138 0.5795684
## 7 9 0.8094116 0.8320257 0.6107343
## 8 0 0.8258270 0.8269115 0.6414097
## 8 1 0.7387420 0.8668747 0.5230493
## 8 5 0.7706005 0.8509960 0.5684454
## 8 9 0.8012115 0.8396694 0.6104625
## 9 0 0.8173384 0.8305986 0.6334590
## 9 1 0.7293402 0.8703782 0.5231017
## 9 5 0.7579300 0.8538050 0.5606776
## 9 9 0.7842962 0.8438592 0.5977199
## 10 0 0.8316213 0.8236442 0.6441661
## 10 1 0.7505436 0.8632493 0.5354345
## 10 5 0.7733745 0.8477313 0.5705181
## 10 9 0.8001195 0.8374964 0.6065137
## 20 0 0.8767542 0.8107228 0.6900870
## 20 1 0.7828961 0.8485103 0.5767046
## 20 5 0.8263284 0.8304575 0.6070923
## 20 9 0.8488508 0.8218695 0.6434328
## 50 0 0.8427499 0.8116329 0.6720406
## 50 1 0.7576651 0.8517342 0.5593009
## 50 5 0.7869744 0.8364552 0.5990439
## 50 9 0.8118871 0.8258637 0.6364337
## 75 0 0.8306681 0.8136498 0.6645124
## 75 1 0.7529566 0.8543419 0.5579873
## 75 5 0.7759688 0.8398680 0.5978006
## 75 9 0.7985683 0.8300644 0.6297697
## 100 0 0.8225066 0.8131985 0.6561630
## 100 1 0.7473080 0.8562931 0.5534851
## 100 5 0.7629617 0.8420478 0.5897026
## 100 9 0.7846826 0.8325806 0.6212690
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 9 and neighbors = 1.
cubistTune$finalModel
##
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
##
## Number of samples: 110
## Number of predictors: 57
##
## Number of committees: 9
## Number of rules per committee: 2, 2, 2, 1, 3, 3, 4, 1, 4
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))
cbImp <- varImp(cubistTune, scale = FALSE)
cbImp
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 41.0
## ManufacturingProcess17 38.5
## ManufacturingProcess13 36.5
## ManufacturingProcess09 31.5
## BiologicalMaterial05 20.5
## BiologicalMaterial03 19.5
## BiologicalMaterial08 15.5
## ManufacturingProcess11 13.0
## ManufacturingProcess29 13.0
## ManufacturingProcess24 12.0
## BiologicalMaterial06 12.0
## BiologicalMaterial10 11.0
## ManufacturingProcess22 10.0
## ManufacturingProcess04 10.0
## ManufacturingProcess27 10.0
## BiologicalMaterial09 9.5
## ManufacturingProcess14 8.5
## BiologicalMaterial02 8.5
## BiologicalMaterial12 8.0
## BiologicalMaterial01 8.0
plot(cbImp)
cubistTunePred <- predict(cubistTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = cubistTunePred, obs = chem_test_y)
For Cubist, the cross-validated RMSE is 0.7293402 and the Rsquared is 0.8703782 - the best resampling performance of all the models and a reasonably high proportion of variance explained.
The held-out test set performance comes from the postResample call on the Cubist predictions above and, as with the other models, is much weaker than the resampling estimate.
Which tree-based regression model gives the optimal resampling and test set performance?
Cubist performed the best, with the lowest cross-validated RMSE and the highest Rsquared. Like the other models, its test set performance is much weaker than its resampling performance, but on the resampling estimates it is clearly the strongest of the tree-based models.
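A compact way to compare the cross-validated results of the tuned models side by side (a sketch; all of these models were tuned with the same ctrl folds, which resamples() requires):
# Collect the shared cross-validation results for a side-by-side summary
resamps <- resamples(list(BaggedCART = treebagTune,
                          RandomForest = rfTune,
                          CondForest = condrfTune,
                          GBM = gbmTune,
                          Cubist = cubistTune))
summary(resamps)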
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
From above, the top 10 important predictors are:
ManufacturingProcess32 41.0
ManufacturingProcess17 38.5
ManufacturingProcess13 36.5
ManufacturingProcess09 31.5
BiologicalMaterial05 20.5
BiologicalMaterial03 19.5
BiologicalMaterial08 15.5
ManufacturingProcess11 13.0
ManufacturingProcess29 13.0
ManufacturingProcess24 12.0
The process variables dominate the list: 7 of the top 10 are manufacturing process variables. The top predictors from the optimal nonlinear model (MARS) were ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess06, and ManufacturingProcess16 - all process variables, overlapping heavily with the top of the Cubist ranking. The optimal linear model shared the same first four predictors but included more biological material predictors in its top 10.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
I tried to reproduce the kind of plot shown in the textbook, but each of these calls returned an error - the tuned Cubist model is a committee of rule-based models, not a single tree, so it cannot be coerced with as.party(). A single CART tree fit to the same data is sketched after the commented-out code:
# library(partykit)
#
# cubistTree <- as.party(cubistTune$finalModel)
# plot(cubistTree)
# plot(cubistTune$finalModel)
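As an alternative, a single CART tree can be fit to the training data and converted with partykit, whose plot shows the distribution of Yield (as boxplots) in the terminal nodes. This is a sketch that assumes the rpart and partykit packages are installed; it is a separate single-tree fit, not the Cubist model:
library(rpart)
library(partykit)

set.seed(100)
# Fit one regression tree on the imputed training rows (Yield is still a column here)
singleTree <- rpart(Yield ~ ., data = chem_train_set_x)
# as.party() turns the rpart fit into a party object whose plot displays the
# terminal-node distributions of Yield
plot(as.party(singleTree))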
Here are scatter plots showing the relationship between some of the top predictors and the yield:
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess32, y=Yield)) + geom_point()
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess17, y=Yield)) + geom_point()
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess13, y=Yield)) + geom_point()
ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess09, y=Yield)) + geom_point()
ManufacturingProcess32 ranges from roughly 150 to 170, with fairly constant variance. Overall, Yield increases as ManufacturingProcess32 increases, although around 165 the trend appears to level off or turn down.
ManufacturingProcess17 ranges from about 32.5 to a little above 35, with a few outliers around 40 and fairly constant variance. Overall, Yield decreases as ManufacturingProcess17 increases.
ManufacturingProcess13 ranges from about 32 to 36, with a few outliers around 38 and fairly constant variance. Overall, Yield decreases as ManufacturingProcess13 increases.
ManufacturingProcess09 ranges from about 42.5 to 48, with a few outliers around 40 and fairly constant variance. Overall, Yield increases as ManufacturingProcess09 increases.
Together these plots suggest which operating ranges of the key process predictors are associated with higher yield; the correlations below quantify the direction of each relationship.
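To complement the scatter plots, the linear correlations between these predictors and Yield summarize the direction of each relationship (a sketch, computed on the imputed data):
# Correlation of the four top process predictors with Yield
top_preds <- c("ManufacturingProcess32", "ManufacturingProcess17",
               "ManufacturingProcess13", "ManufacturingProcess09")
cor(ChemicalManufacturingProcessCopy[, top_preds],
    ChemicalManufacturingProcessCopy$Yield)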