HW 9

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

library(AppliedPredictiveModeling)

8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench) set.seed(200) simulated <- mlbench.friedman1(200, sd = 1) simulated <- cbind(simulated$x, simulated$y) simulated <- as.data.frame(simulated) colnames(simulated)[ncol(simulated)] <- "y"

library(mlbench)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
8.1a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest) library(caret) model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000) rfImp1 <- varImp(model1, scale = FALSE)

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |>
  sort_by(rfImp1$Overall)
##         Overall
## V8  -0.11724317
## V9  -0.10344797
## V7   0.02927888
## V10  0.04312556
## V6   0.12395003
## V3   0.72305459
## V5   2.13575650
## V2   6.27437240
## V4   7.50258584
## V1   8.62743275
plot(rfImp1)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No, the random forest model did not. V6 – V10 have the lowest importance scores.

8.1b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 cor(simulated$duplicate1, simulated$V1)

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

set.seed(96)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
  sort_by(rfImp2$Overall)
##                Overall
## V8         -0.12333601
## V9         -0.05916537
## V7         -0.01873194
## V10         0.01692588
## V6          0.23016625
## V3          0.61572773
## V5          2.11481994
## duplicate1  3.33526493
## V1          6.18726213
## V2          6.49136015
## V4          7.06359432
plot(rfImp2)

Yes, adding a correlated predictor did change the importance score of V1 - in the output above it dropped from about 8.63 to about 6.19. V1 is also now slightly less important than V2.

If we add another highly correlated predictor with V1:

set.seed(97)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9372469
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
  sort_by(rfImp3$Overall)
##                Overall
## V8         -0.05666557
## V10         0.02602534
## V7          0.03781915
## V9          0.06286232
## V6          0.14070443
## V3          0.50344026
## duplicate1  1.91648994
## V5          2.22788259
## duplicate2  2.92265084
## V1          5.74694217
## V2          6.87763546
## V4          7.37160256
plot(rfImp3)

Adding another highly correlated predictor lowered the importance score of V1 even further - from about 8.63 originally to about 5.75 here. V1 is also now less important than both V2 and V4.

Note: since we added variables to the data frame, the reported numbers may differ slightly in the RPubs rendering (when the code is run again).

8.1c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

I decided to recreate the simulated data so that the “duplicate” variables added above are excluded:

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
set.seed(301)

# Recreate the data
simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
x <- as.matrix(simulated2[, -which(names(simulated2) == "y")])

### Fit a conditional inference forest
model4 <- cforest(y ~ ., data = simulated2)
model4
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  500 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10 
## Number of observations:  200
rfImp4 <- varImp(model4)
rfImp4 |>
  sort_by(rfImp4$Overall)
##          Overall
## V9  -0.054383229
## V7  -0.032265145
## V8  -0.005465832
## V10  0.017008383
## V6   0.032425746
## V3   0.183586689
## V5   1.012205621
## V1   4.926082392
## V2   6.814068106
## V4  14.219625841
plot(rfImp4)

The same predictors (V1 - V5) are the most important, although the ordering shifts a bit (V4 now stands out the most), and V6 - V10 (the uninformative predictors) still have the least importance in the model.
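For completeness, the conditional importance measure mentioned in the exercise could also be computed directly with party's varimp function. A minimal sketch (the conditional permutation importance can take a while to run on 500 trees; tradImp and condImp are just illustrative names):

# Compare the traditional permutation importance with the conditional version
# described in Strobl et al. (2007).
tradImp <- party::varimp(model4, conditional = FALSE)
condImp <- party::varimp(model4, conditional = TRUE)
sort(tradImp)
sort(condImp)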

8.1d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted:

# Boosted
set.seed(100)
indx <- createFolds(simulated2$y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(x = x, y = simulated2$y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = ctrl,
                 verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.01       1                   100     4.058318  0.5924175  3.310401
##   0.01       1                   150     3.770109  0.6489209  3.081699
##   0.01       1                   200     3.530493  0.6794660  2.875616
##   0.01       1                   250     3.333894  0.7031430  2.719584
##   0.01       1                   300     3.164019  0.7223481  2.585931
##   0.01       1                   350     3.018967  0.7380079  2.458820
##   0.01       1                   400     2.896962  0.7517797  2.361692
##   0.01       1                   450     2.794121  0.7633783  2.275418
##   0.01       1                   500     2.698804  0.7742894  2.193324
##   0.01       1                   550     2.614342  0.7842219  2.117630
##   0.01       1                   600     2.537068  0.7932969  2.055217
##   0.01       1                   650     2.465999  0.8010597  1.997521
##   0.01       1                   700     2.405243  0.8080453  1.951048
##   0.01       1                   750     2.351374  0.8134648  1.907801
##   0.01       1                   800     2.300874  0.8184132  1.867947
##   0.01       1                   850     2.256131  0.8239474  1.832917
##   0.01       1                   900     2.216133  0.8279178  1.798277
##   0.01       1                   950     2.179846  0.8308984  1.769397
##   0.01       1                  1000     2.146775  0.8335465  1.745185
##   0.01       3                   100     3.506148  0.7129205  2.828383
##   0.01       3                   150     3.126204  0.7404025  2.517388
##   0.01       3                   200     2.859800  0.7631779  2.295266
##   0.01       3                   250     2.651148  0.7842937  2.122757
##   0.01       3                   300     2.481946  0.8026236  1.980995
##   0.01       3                   350     2.344854  0.8159933  1.865088
##   0.01       3                   400     2.237835  0.8262581  1.774919
##   0.01       3                   450     2.158864  0.8335579  1.713537
##   0.01       3                   500     2.091503  0.8394954  1.662698
##   0.01       3                   550     2.036915  0.8449022  1.623011
##   0.01       3                   600     1.999018  0.8483345  1.595316
##   0.01       3                   650     1.965885  0.8510384  1.572702
##   0.01       3                   700     1.937539  0.8536563  1.547325
##   0.01       3                   750     1.915125  0.8556375  1.528509
##   0.01       3                   800     1.897566  0.8571803  1.512269
##   0.01       3                   850     1.881740  0.8586823  1.496784
##   0.01       3                   900     1.869342  0.8602120  1.485079
##   0.01       3                   950     1.860725  0.8613150  1.475414
##   0.01       3                  1000     1.849139  0.8625925  1.464275
##   0.01       5                   100     3.334653  0.7287282  2.683727
##   0.01       5                   150     2.953547  0.7545739  2.371161
##   0.01       5                   200     2.690003  0.7776444  2.143782
##   0.01       5                   250     2.490851  0.7969697  1.966661
##   0.01       5                   300     2.340446  0.8115355  1.847823
##   0.01       5                   350     2.226874  0.8228967  1.766393
##   0.01       5                   400     2.142324  0.8306664  1.700991
##   0.01       5                   450     2.077926  0.8365060  1.648810
##   0.01       5                   500     2.036130  0.8399066  1.612299
##   0.01       5                   550     1.998080  0.8437432  1.583024
##   0.01       5                   600     1.969221  0.8467325  1.556868
##   0.01       5                   650     1.947325  0.8490996  1.537581
##   0.01       5                   700     1.932640  0.8505316  1.521751
##   0.01       5                   750     1.917808  0.8522819  1.507764
##   0.01       5                   800     1.906614  0.8535401  1.494749
##   0.01       5                   850     1.894959  0.8549981  1.483169
##   0.01       5                   900     1.892677  0.8553683  1.480204
##   0.01       5                   950     1.886431  0.8562238  1.473047
##   0.01       5                  1000     1.880637  0.8569135  1.464939
##   0.01       7                   100     3.309677  0.7334957  2.646904
##   0.01       7                   150     2.917530  0.7633590  2.330623
##   0.01       7                   200     2.643672  0.7867698  2.094781
##   0.01       7                   250     2.447267  0.8025480  1.920383
##   0.01       7                   300     2.301073  0.8169173  1.808000
##   0.01       7                   350     2.196660  0.8265607  1.730428
##   0.01       7                   400     2.117430  0.8338818  1.668192
##   0.01       7                   450     2.064661  0.8384898  1.626317
##   0.01       7                   500     2.027496  0.8417270  1.594787
##   0.01       7                   550     1.994993  0.8446573  1.567457
##   0.01       7                   600     1.970914  0.8469070  1.547581
##   0.01       7                   650     1.953145  0.8486043  1.533542
##   0.01       7                   700     1.940703  0.8497479  1.522346
##   0.01       7                   750     1.930428  0.8510641  1.511353
##   0.01       7                   800     1.923945  0.8516247  1.505294
##   0.01       7                   850     1.919080  0.8519884  1.499366
##   0.01       7                   900     1.914038  0.8526189  1.495252
##   0.01       7                   950     1.906083  0.8536098  1.490161
##   0.01       7                  1000     1.902139  0.8540804  1.485346
##   0.10       1                   100     2.134127  0.8325122  1.727405
##   0.10       1                   150     2.007291  0.8423442  1.648037
##   0.10       1                   200     1.940182  0.8489443  1.587157
##   0.10       1                   250     1.906457  0.8525617  1.547285
##   0.10       1                   300     1.916332  0.8501194  1.556286
##   0.10       1                   350     1.900535  0.8519577  1.548767
##   0.10       1                   400     1.906887  0.8502586  1.560407
##   0.10       1                   450     1.918735  0.8479802  1.560396
##   0.10       1                   500     1.923622  0.8475754  1.570869
##   0.10       1                   550     1.924362  0.8468278  1.565978
##   0.10       1                   600     1.924163  0.8473865  1.559245
##   0.10       1                   650     1.921287  0.8474421  1.559607
##   0.10       1                   700     1.921758  0.8478190  1.561406
##   0.10       1                   750     1.928198  0.8461037  1.565294
##   0.10       1                   800     1.934524  0.8456940  1.578426
##   0.10       1                   850     1.929941  0.8459980  1.580258
##   0.10       1                   900     1.927487  0.8468707  1.578319
##   0.10       1                   950     1.914378  0.8490282  1.571451
##   0.10       1                  1000     1.930940  0.8460777  1.580815
##   0.10       3                   100     1.959947  0.8455464  1.532718
##   0.10       3                   150     1.915036  0.8509078  1.504204
##   0.10       3                   200     1.907346  0.8509073  1.492786
##   0.10       3                   250     1.918260  0.8494940  1.500916
##   0.10       3                   300     1.926056  0.8478941  1.511760
##   0.10       3                   350     1.926462  0.8477982  1.513494
##   0.10       3                   400     1.925101  0.8479021  1.513746
##   0.10       3                   450     1.936142  0.8465022  1.518243
##   0.10       3                   500     1.941133  0.8461583  1.525287
##   0.10       3                   550     1.936736  0.8471680  1.521190
##   0.10       3                   600     1.939495  0.8467436  1.525985
##   0.10       3                   650     1.937868  0.8466814  1.524997
##   0.10       3                   700     1.941183  0.8464456  1.526388
##   0.10       3                   750     1.941095  0.8464800  1.527839
##   0.10       3                   800     1.940002  0.8467537  1.526638
##   0.10       3                   850     1.942700  0.8464657  1.528499
##   0.10       3                   900     1.942200  0.8465262  1.527372
##   0.10       3                   950     1.944581  0.8460697  1.529178
##   0.10       3                  1000     1.944101  0.8461030  1.528629
##   0.10       5                   100     1.938922  0.8464963  1.543350
##   0.10       5                   150     1.897555  0.8506336  1.508761
##   0.10       5                   200     1.898881  0.8507215  1.510617
##   0.10       5                   250     1.902102  0.8491688  1.511537
##   0.10       5                   300     1.908680  0.8482477  1.513541
##   0.10       5                   350     1.905923  0.8486424  1.512943
##   0.10       5                   400     1.910236  0.8482689  1.513470
##   0.10       5                   450     1.917145  0.8470926  1.516738
##   0.10       5                   500     1.916255  0.8472018  1.514901
##   0.10       5                   550     1.917552  0.8470937  1.514390
##   0.10       5                   600     1.917671  0.8471799  1.515314
##   0.10       5                   650     1.917617  0.8473505  1.515308
##   0.10       5                   700     1.918753  0.8471808  1.515736
##   0.10       5                   750     1.918110  0.8472724  1.515217
##   0.10       5                   800     1.918515  0.8472319  1.515510
##   0.10       5                   850     1.918912  0.8472072  1.515827
##   0.10       5                   900     1.918991  0.8471846  1.515720
##   0.10       5                   950     1.919123  0.8471737  1.515937
##   0.10       5                  1000     1.919023  0.8471928  1.515936
##   0.10       7                   100     2.025192  0.8324992  1.587500
##   0.10       7                   150     2.013766  0.8320011  1.568014
##   0.10       7                   200     2.009108  0.8333405  1.577025
##   0.10       7                   250     2.004631  0.8343764  1.574972
##   0.10       7                   300     2.007876  0.8345229  1.577478
##   0.10       7                   350     2.010494  0.8342342  1.581563
##   0.10       7                   400     2.008746  0.8347960  1.578158
##   0.10       7                   450     2.011629  0.8346025  1.581348
##   0.10       7                   500     2.010311  0.8347182  1.581307
##   0.10       7                   550     2.010985  0.8346036  1.581474
##   0.10       7                   600     2.010909  0.8347145  1.582414
##   0.10       7                   650     2.011737  0.8345464  1.582626
##   0.10       7                   700     2.012785  0.8343701  1.583422
##   0.10       7                   750     2.013036  0.8343306  1.583656
##   0.10       7                   800     2.013004  0.8343629  1.583545
##   0.10       7                   850     2.012843  0.8343711  1.583382
##   0.10       7                   900     2.012821  0.8343597  1.583525
##   0.10       7                   950     2.012731  0.8343750  1.583666
##   0.10       7                  1000     2.012658  0.8343713  1.583659
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))

gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
## gbm variable importance
## 
##     Overall
## V4  63068.6
## V2  33875.9
## V1  27952.7
## V5  12875.8
## V3  12690.7
## V6   2204.7
## V7   1548.4
## V10  1235.9
## V9   1207.3
## V8    951.2

Boosting shows the same pattern - the informative predictors (V1 - V5) dominate and the uninformative predictors (V6 - V10) rank lowest - although the raw importance values are on a much larger scale.

Cubist:

# Cubist
set.seed(98)
cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100), 
                      neighbors = c(0, 1, 5, 9))

set.seed(100)
cubistTune <- train(x = x, y = simulated2$y,
                    "cubist",
                    tuneGrid = cbGrid,
                    trControl = ctrl)
cubistTune
## Cubist 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE     
##     1         0          2.486110  0.7496162  1.980196
##     1         1          3.008986  0.6479950  2.301460
##     1         5          2.347183  0.7695990  1.815079
##     1         9          2.281047  0.7814085  1.772161
##     2         0          2.352679  0.7700249  1.876807
##     2         1          2.763001  0.6833978  2.125779
##     2         5          2.245669  0.7852810  1.751135
##     2         9          2.185548  0.7953453  1.716768
##     3         0          2.325927  0.7739165  1.857013
##     3         1          2.737924  0.6900825  2.073857
##     3         5          2.231359  0.7869371  1.738960
##     3         9          2.173200  0.7967481  1.703917
##     4         0          2.343977  0.7694343  1.868838
##     4         1          2.715398  0.6928613  2.083206
##     4         5          2.236297  0.7860729  1.750201
##     4         9          2.185947  0.7944130  1.717669
##     5         0          2.342287  0.7712170  1.867755
##     5         1          2.725670  0.6924045  2.071722
##     5         5          2.236653  0.7867860  1.748587
##     5         9          2.187133  0.7949951  1.716766
##     6         0          2.336457  0.7702762  1.862026
##     6         1          2.711196  0.6933901  2.078510
##     6         5          2.236092  0.7861548  1.750726
##     6         9          2.184520  0.7945777  1.718319
##     7         0          2.327275  0.7729738  1.851675
##     7         1          2.714289  0.6942385  2.070512
##     7         5          2.225451  0.7883570  1.738434
##     7         9          2.175444  0.7965496  1.707181
##     8         0          2.324069  0.7709461  1.858714
##     8         1          2.727350  0.6916902  2.094393
##     8         5          2.228022  0.7870347  1.751677
##     8         9          2.176313  0.7953469  1.718839
##     9         0          2.321950  0.7731831  1.849893
##     9         1          2.731753  0.6922418  2.089778
##     9         5          2.226359  0.7884181  1.742640
##     9         9          2.174664  0.7966778  1.712592
##    10         0          2.313559  0.7737514  1.846239
##    10         1          2.730948  0.6922633  2.092004
##    10         5          2.223155  0.7879975  1.740175
##    10         9          2.170332  0.7967802  1.708415
##    20         0          2.315033  0.7768608  1.836828
##    20         1          2.756712  0.6898296  2.111097
##    20         5          2.223160  0.7899620  1.726474
##    20         9          2.170726  0.7984768  1.689206
##    50         0          2.275530  0.7851917  1.819671
##    50         1          2.758171  0.6883308  2.111673
##    50         5          2.196736  0.7934268  1.714041
##    50         9          2.129414  0.8041390  1.664779
##    75         0          2.266783  0.7861628  1.800135
##    75         1          2.746444  0.6891327  2.092617
##    75         5          2.189819  0.7936866  1.698632
##    75         9          2.120822  0.8046725  1.649021
##   100         0          2.263316  0.7876348  1.797635
##   100         1          2.743127  0.6900732  2.079693
##   100         5          2.192029  0.7940890  1.689447
##   100         9          2.122296  0.8054282  1.642360
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 75 and neighbors = 9.
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))

cbImp <- varImp(cubistTune, scale = FALSE)
cbImp
## cubist variable importance
## 
##     Overall
## V2     57.5
## V4     57.5
## V1     54.0
## V3     30.0
## V5     26.0
## V7      3.0
## V10     3.0
## V9      1.0
## V6      0.5
## V8      0.0

Cubist shows the same pattern as well - the top predictors are the same informative variables. The Cubist importance values are on a smaller scale than boosting's, with no negative values.

8.2

Use a simulation to show tree bias with different granularities.

The textbook states that “predictors with a higher number of distinct values are favored over more granular predictors” and that “there is a high probability that the noise variables will be chosen to split the top nodes of the tree.” So let's add predictors that are copies of informative predictors with a small amount of random noise added:

set.seed(99)
simulated3 <- simulated
nrows <- nrow(simulated3)
rands <- 1e-3 * runif(nrows)

# Add noise predictors
simulated3$R1 <- simulated3$V1 + rands
simulated3$R2 <- simulated3$V2 + rands

model_gran_1 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
rfImp_gran_1 <- varImp(model_gran_1, scale = FALSE)
rfImp_gran_1 |>
  sort_by(rfImp_gran_1$Overall)
##                Overall
## V8         -0.08590449
## V9         -0.03539313
## V10        -0.02639602
## V7          0.05484661
## V6          0.19828435
## V3          0.35486249
## duplicate1  1.21086204
## V5          1.64869393
## duplicate2  1.87416397
## V1          3.66979186
## R1          4.03126666
## V2          4.06406114
## R2          4.27598991
## V4          6.61618188
plot(rfImp_gran_1)

The model gives the noise-perturbed copies (R1, R2) more importance than the duplicate variables, and R2 even edges ahead of V2 itself.

Let’s try adding 2 more noise predictors:

set.seed(100)
# Add noise predictors
simulated3$R3 <- simulated3$V3 + rands
simulated3$R4 <- simulated3$V4 + rands

model_gran_2 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
rfImp_gran_2 <- varImp(model_gran_2, scale = FALSE)
rfImp_gran_2 |>
  sort_by(rfImp_gran_2$Overall)
##                Overall
## V8         -0.09651461
## V7         -0.03406034
## V9         -0.02245271
## V6          0.05992412
## V10         0.06913218
## V3          0.37034446
## R3          0.38376969
## duplicate1  1.09500925
## V5          1.65254570
## duplicate2  1.68789327
## V1          3.51416840
## R2          3.75925216
## R1          3.83141753
## V2          4.08727341
## V4          4.12691674
## R4          4.58042947
plot(rfImp_gran_2)

This model again favors the noise-perturbed copies - R4 now ranks above V4 itself - showing that trees suffer from selection bias.
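A more direct way to show the granularity bias is to give a single tree two pure-noise predictors that differ only in their number of distinct values and count which one gets picked for the first split. A small sketch (assuming the rpart package is installed; x_fine, x_coarse, and picks are illustrative names):

library(rpart)
set.seed(102)
# Both predictors are unrelated to y; they differ only in granularity.
picks <- replicate(200, {
  n <- 200
  dat <- data.frame(
    x_fine   = runif(n),                        # continuous: ~n distinct values
    x_coarse = sample(1:4, n, replace = TRUE),  # only 4 distinct values
    y        = rnorm(n)                         # response is pure noise
  )
  stump <- rpart(y ~ x_fine + x_coarse, data = dat,
                 control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  as.character(stump$frame$var[1])              # variable used at the root split
})
table(picks)  # expected: x_fine is chosen for the first split far more often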

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

8.3a

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

With the bagging fraction and learning rate both set to 0.1, each tree is fit to only a small random subsample of the data and contributes only a small step toward the final fit, so the trees in the ensemble are diverse and many different predictors end up being used; importance is therefore spread across more predictors. With both parameters set to 0.9, each tree sees almost all of the data and each iteration takes a large step, so the trees are much more alike and the model leans heavily on the few strongest predictors, concentrating importance on just the first few.
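Figure 8.24 uses the solubility data, which is not loaded here, but the same effect can be sketched on the simulated data from 8.1 (a rough illustration rather than a reproduction of the figure; gbm_low and gbm_high are illustrative names, and n.minobsinnode is lowered so the tiny 10% bags can still split):

set.seed(200)
gbm_low  <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
summary(gbm_low,  plotit = FALSE)  # expected: importance spread across more predictors
summary(gbm_high, plotit = FALSE)  # expected: importance concentrated in the top few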

8.3b

Which model do you think would be more predictive of other samples?

The left-hand model (both parameters at 0.1) should be more predictive of other samples: a lower learning rate generally generalizes better (at the cost of needing more iterations), and spreading importance across more predictors makes the model less dependent on a handful of variables that may not hold up in new data.

8.3c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would flatten the slope of the predictor importance plot for either model: deeper trees split on more predictors, so more predictors pick up importance and the value for the top predictor decreases relative to the rest.
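As a quick check of that claim on the same simulated data (again just a sketch with default bagging fraction; gbm_d1 and gbm_d7 are illustrative names):

set.seed(200)
gbm_d1 <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
              n.trees = 500, shrinkage = 0.1, interaction.depth = 1)
gbm_d7 <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
              n.trees = 500, shrinkage = 0.1, interaction.depth = 7)
summary(gbm_d1, plotit = FALSE)  # steeper drop-off after the top predictors
summary(gbm_d7, plotit = FALSE)  # expected: flatter importance profile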

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Get the data ready:

data(ChemicalManufacturingProcess)
ChemicalManufacturingProcessCopy <- ChemicalManufacturingProcess

# Imputation - use mean
for(i in 1:ncol(ChemicalManufacturingProcessCopy)) {
  ChemicalManufacturingProcessCopy[is.na(ChemicalManufacturingProcessCopy[,i]), i] <- mean(ChemicalManufacturingProcessCopy[,i], na.rm = TRUE)
}

# Split the data
chem_train_set_x_df <- as.data.frame(ChemicalManufacturingProcessCopy[1:110,])
chem_train_set_x <- ChemicalManufacturingProcessCopy[1:110,]

chem_test_set_x <- ChemicalManufacturingProcessCopy[111:nrow(ChemicalManufacturingProcessCopy),]

chem_train_set_y_df <- as.data.frame(ChemicalManufacturingProcess[1:110,])
chem_train_set_y <- ChemicalManufacturingProcess[1:110,]

# choose response variable
y <- chem_train_set_x$Yield
# all the predictors need to be put into a matrix
x <- as.matrix(chem_train_set_x[, -which(names(chem_train_set_x) == "Yield")])

Let’s try Bagged Trees:

# Bagged Trees
set.seed(101)
indx <- createFolds(y, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)

treebagTune <- train(x = x, y = y,
                     method = "treebag",
                     nbagg = 50,
                     trControl = ctrl)

treebagTune
## Bagged CART 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.026206  0.7308844  0.7696015
treebagTune$finalModel
## 
## Bagging regression trees with 50 bootstrap replications
chem_test_y <- chem_test_set_x[, which(names(chem_test_set_x) == "Yield")]
chem_test_x <- as.matrix(chem_test_set_x[, -which(names(chem_test_set_x) == "Yield")])
treebagTunePred <- predict(treebagTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = treebagTunePred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 1.5280023 0.2198921 1.2271630

For Bagged CART, the cross-validated RMSE is 1.026206 and Rsquared is 0.7308844 - a reasonably good Rsquared, but not outstanding.

The test set performance is much worse: RMSE of 1.5280023 and Rsquared of 0.2198921, so the model does not generalize well to the held-out samples.

Now let’s try Random Forests:

# Random Forests
mtryGrid <- data.frame(mtry = floor(seq(10, ncol(x), length = 10)))


### Tune the model using cross-validation ###
set.seed(100)
rfTune <- train(x = x, y = y,
                method = "rf",
                tuneGrid = mtryGrid,
                ntree = 1000,
                importance = TRUE,
                trControl = ctrl)
rfTune
## Random Forest 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##   10    0.9256904  0.8256316  0.7206621
##   15    0.9162436  0.8202740  0.7023533
##   20    0.9054228  0.8162479  0.6954435
##   25    0.9150617  0.8077850  0.6937333
##   30    0.9171069  0.8002505  0.6942974
##   36    0.9185005  0.7972996  0.6924881
##   41    0.9271291  0.7868590  0.6950346
##   46    0.9213070  0.7909617  0.6903651
##   51    0.9333129  0.7793265  0.7017964
##   57    0.9373766  0.7745866  0.7064238
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
rfTune$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 20
## 
##           Mean of squared residuals: 0.8859545
##                     % Var explained: 75.3
plot(rfTune)

rfImp <- varImp(rfTune, scale = FALSE)
rfImp
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  26.090
## ManufacturingProcess13  21.594
## ManufacturingProcess09  14.339
## BiologicalMaterial03    14.166
## ManufacturingProcess17  14.107
## BiologicalMaterial08    12.627
## ManufacturingProcess31  11.300
## BiologicalMaterial02    11.255
## BiologicalMaterial12    10.438
## BiologicalMaterial11     9.910
## BiologicalMaterial06     9.102
## ManufacturingProcess28   9.074
## ManufacturingProcess06   8.122
## BiologicalMaterial01     7.559
## ManufacturingProcess36   7.067
## ManufacturingProcess33   6.616
## ManufacturingProcess01   6.600
## ManufacturingProcess11   6.461
## BiologicalMaterial04     6.130
## ManufacturingProcess24   6.047
rfTunePred <- predict(rfTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = rfTunePred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 1.5161315 0.1618175 1.2556979
### Tune the model using the OOB estimates ###
ctrlOOB <- trainControl(method = "oob")
set.seed(100)
rfTuneOOB <- train(x = x, y = y,
                   method = "rf",
                   tuneGrid = mtryGrid,
                   ntree = 1000,
                   importance = TRUE,
                   trControl = ctrlOOB)
rfTuneOOB
## Random Forest 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##   10    0.9494792  0.7487014
##   15    0.9349442  0.7563365
##   20    0.9290242  0.7594125
##   25    0.9378709  0.7548086
##   30    0.9325077  0.7576049
##   36    0.9269330  0.7604944
##   41    0.9375693  0.7549663
##   46    0.9436898  0.7517567
##   51    0.9328969  0.7574025
##   57    0.9470804  0.7499696
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 36.
rfTuneOOB$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 36
## 
##           Mean of squared residuals: 0.8798012
##                     % Var explained: 75.48
plot(rfTuneOOB)

rfTuneOOBPred <- predict(rfTuneOOB, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = rfTuneOOBPred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 1.5321968 0.1638437 1.2591023
### Tune the conditional inference forests ####
set.seed(100)
condrfTune <- train(x = x, y = y,
                    method = "cforest",
                    tuneGrid = mtryGrid,
                    controls = cforest_unbiased(ntree = 1000),
                    trControl = ctrl)
condrfTune
## Conditional Inference Random Forest 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##   10    1.168003  0.7133461  0.9102069
##   15    1.149683  0.7017961  0.8961622
##   20    1.141419  0.6922299  0.8858195
##   25    1.126813  0.6923026  0.8690717
##   30    1.128123  0.6869251  0.8664302
##   36    1.121336  0.6850380  0.8565380
##   41    1.117175  0.6848067  0.8525516
##   46    1.117668  0.6823346  0.8503776
##   51    1.122439  0.6758730  0.8520768
##   57    1.123381  0.6746645  0.8511820
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 41.
condrfTune$finalModel
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  .outcome 
## Inputs:  BiologicalMaterial01, BiologicalMaterial02, BiologicalMaterial03, BiologicalMaterial04, BiologicalMaterial05, BiologicalMaterial06, BiologicalMaterial07, BiologicalMaterial08, BiologicalMaterial09, BiologicalMaterial10, BiologicalMaterial11, BiologicalMaterial12, ManufacturingProcess01, ManufacturingProcess02, ManufacturingProcess03, ManufacturingProcess04, ManufacturingProcess05, ManufacturingProcess06, ManufacturingProcess07, ManufacturingProcess08, ManufacturingProcess09, ManufacturingProcess10, ManufacturingProcess11, ManufacturingProcess12, ManufacturingProcess13, ManufacturingProcess14, ManufacturingProcess15, ManufacturingProcess16, ManufacturingProcess17, ManufacturingProcess18, ManufacturingProcess19, ManufacturingProcess20, ManufacturingProcess21, ManufacturingProcess22, ManufacturingProcess23, ManufacturingProcess24, ManufacturingProcess25, ManufacturingProcess26, ManufacturingProcess27, ManufacturingProcess28, ManufacturingProcess29, ManufacturingProcess30, ManufacturingProcess31, ManufacturingProcess32, ManufacturingProcess33, ManufacturingProcess34, ManufacturingProcess35, ManufacturingProcess36, ManufacturingProcess37, ManufacturingProcess38, ManufacturingProcess39, ManufacturingProcess40, ManufacturingProcess41, ManufacturingProcess42, ManufacturingProcess43, ManufacturingProcess44, ManufacturingProcess45 
## Number of observations:  110
plot(condrfTune)

condrfTunePred <- predict(condrfTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = condrfTunePred, obs = chem_test_y)
##     RMSE Rsquared      MAE 
## 1.555673 0.111441 1.296947
set.seed(100)
condrfTuneOOB <- train(x = x, y = y,
                       method = "cforest",
                       tuneGrid = mtryGrid,
                       controls = cforest_unbiased(ntree = 1000),
                       trControl = trainControl(method = "oob"))
condrfTuneOOB
## Conditional Inference Random Forest 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##   10    1.172185  0.6977276  0.8991333
##   15    1.143132  0.6929247  0.8738530
##   20    1.133728  0.6868864  0.8626660
##   25    1.126456  0.6811510  0.8570306
##   30    1.119113  0.6848342  0.8491332
##   36    1.115565  0.6824505  0.8426582
##   41    1.109359  0.6857689  0.8369577
##   46    1.116558  0.6766673  0.8436031
##   51    1.117476  0.6750039  0.8339748
##   57    1.120075  0.6729624  0.8386448
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 41.
plot(condrfTuneOOB)

condrfTuneOOBPred <- predict(condrfTuneOOB, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = condrfTuneOOBPred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 1.5605394 0.1088396 1.3030801

For Random Forest tuned with cross-validation, the best RMSE is 0.9054228 and Rsquared is 0.8077850 (at mtry = 20). This is a good Rsquared - the best so far.

The test set performance is RMSE of 1.5161315 and Rsquared of 0.1618175, so it does not hold up well on the held-out samples.

For Random Forest tuned using the OOB estimates, the best RMSE is 0.9269330 and Rsquared is 0.7604944 (at mtry = 36). This is also a decent Rsquared, but not the best.

The test set performance is RMSE of 1.5321968 and Rsquared of 0.1638437 - again poor on the held-out samples.

For the conditional inference forest tuned with cross-validation, the best RMSE is 1.117175 and Rsquared is 0.6848067 (at mtry = 41). This is not a very good Rsquared.

The test set performance is RMSE of 1.555673 and Rsquared of 0.111441 - again poor.

For the conditional inference forest tuned using the OOB estimates, the best RMSE is 1.109359 and Rsquared is 0.6857689 (at mtry = 41). This is also not a very good Rsquared.

The test set performance is RMSE of 1.5605394 and Rsquared of 0.1088396 - again poor.

Now let’s try Boosting:

### Boosting

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(x = x, y = y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = ctrl,
                 verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.01       1                   100     1.4483379  0.6538321  1.1440673
##   0.01       1                   150     1.3171874  0.6709749  1.0358586
##   0.01       1                   200     1.2197950  0.6871786  0.9513279
##   0.01       1                   250     1.1543350  0.6967779  0.8891422
##   0.01       1                   300     1.1071761  0.7036007  0.8478779
##   0.01       1                   350     1.0794501  0.7072811  0.8265490
##   0.01       1                   400     1.0569652  0.7127157  0.8135310
##   0.01       1                   450     1.0388624  0.7195574  0.8032316
##   0.01       1                   500     1.0273715  0.7225962  0.7964284
##   0.01       1                   550     1.0164232  0.7256218  0.7904974
##   0.01       1                   600     1.0058794  0.7312265  0.7815788
##   0.01       1                   650     0.9972844  0.7351424  0.7762622
##   0.01       1                   700     0.9891540  0.7388222  0.7684108
##   0.01       1                   750     0.9819805  0.7426713  0.7630539
##   0.01       1                   800     0.9753691  0.7461526  0.7569755
##   0.01       1                   850     0.9688896  0.7505879  0.7513533
##   0.01       1                   900     0.9654662  0.7521768  0.7482775
##   0.01       1                   950     0.9621877  0.7542609  0.7454933
##   0.01       1                  1000     0.9572211  0.7564531  0.7407071
##   0.01       3                   100     1.3180069  0.6883527  1.0318341
##   0.01       3                   150     1.1834390  0.7102265  0.9078393
##   0.01       3                   200     1.1096197  0.7210433  0.8398467
##   0.01       3                   250     1.0617475  0.7303726  0.8024690
##   0.01       3                   300     1.0232679  0.7426427  0.7726892
##   0.01       3                   350     0.9974040  0.7518259  0.7544995
##   0.01       3                   400     0.9795083  0.7586085  0.7442238
##   0.01       3                   450     0.9664389  0.7645341  0.7346622
##   0.01       3                   500     0.9573812  0.7670590  0.7304341
##   0.01       3                   550     0.9449853  0.7722196  0.7204149
##   0.01       3                   600     0.9365183  0.7751392  0.7143587
##   0.01       3                   650     0.9287649  0.7788827  0.7067020
##   0.01       3                   700     0.9226300  0.7814043  0.7017734
##   0.01       3                   750     0.9170305  0.7832468  0.6991913
##   0.01       3                   800     0.9104650  0.7868112  0.6930390
##   0.01       3                   850     0.9066445  0.7883897  0.6905453
##   0.01       3                   900     0.9016557  0.7904849  0.6859483
##   0.01       3                   950     0.8979714  0.7923676  0.6835404
##   0.01       3                  1000     0.8962644  0.7932470  0.6819073
##   0.01       5                   100     1.3092433  0.6989733  1.0114638
##   0.01       5                   150     1.1779116  0.7119220  0.9024741
##   0.01       5                   200     1.0947939  0.7286074  0.8330031
##   0.01       5                   250     1.0384050  0.7452473  0.7910733
##   0.01       5                   300     1.0026726  0.7573724  0.7623054
##   0.01       5                   350     0.9769106  0.7639812  0.7429002
##   0.01       5                   400     0.9558003  0.7714597  0.7251429
##   0.01       5                   450     0.9404104  0.7778427  0.7175250
##   0.01       5                   500     0.9296583  0.7814004  0.7100643
##   0.01       5                   550     0.9174227  0.7869714  0.6997580
##   0.01       5                   600     0.9071742  0.7913948  0.6916150
##   0.01       5                   650     0.9010271  0.7932504  0.6870746
##   0.01       5                   700     0.8910694  0.7974917  0.6783659
##   0.01       5                   750     0.8835269  0.8008954  0.6752252
##   0.01       5                   800     0.8787979  0.8026178  0.6715871
##   0.01       5                   850     0.8752738  0.8040099  0.6679490
##   0.01       5                   900     0.8705713  0.8056741  0.6648670
##   0.01       5                   950     0.8686084  0.8061536  0.6647348
##   0.01       5                  1000     0.8653855  0.8068112  0.6622400
##   0.01       7                   100     1.3059642  0.6987248  1.0142280
##   0.01       7                   150     1.1697321  0.7182879  0.8992400
##   0.01       7                   200     1.0989604  0.7254342  0.8349470
##   0.01       7                   250     1.0520240  0.7347828  0.7963163
##   0.01       7                   300     1.0215718  0.7446286  0.7732809
##   0.01       7                   350     0.9991260  0.7509128  0.7560098
##   0.01       7                   400     0.9842015  0.7556447  0.7451752
##   0.01       7                   450     0.9691273  0.7609056  0.7350178
##   0.01       7                   500     0.9566394  0.7664677  0.7266722
##   0.01       7                   550     0.9417267  0.7729195  0.7174703
##   0.01       7                   600     0.9320880  0.7768330  0.7124776
##   0.01       7                   650     0.9250302  0.7797262  0.7091338
##   0.01       7                   700     0.9150232  0.7838578  0.7040437
##   0.01       7                   750     0.9103502  0.7862041  0.7002014
##   0.01       7                   800     0.9057571  0.7887962  0.6961219
##   0.01       7                   850     0.8984976  0.7921459  0.6900663
##   0.01       7                   900     0.8946902  0.7935503  0.6870229
##   0.01       7                   950     0.8921999  0.7944583  0.6854384
##   0.01       7                  1000     0.8874656  0.7963946  0.6818204
##   0.10       1                   100     1.0242459  0.7181196  0.7979675
##   0.10       1                   150     1.0063041  0.7287274  0.7795413
##   0.10       1                   200     0.9832385  0.7391242  0.7705161
##   0.10       1                   250     0.9632324  0.7493378  0.7643176
##   0.10       1                   300     0.9541549  0.7505433  0.7478313
##   0.10       1                   350     0.9472063  0.7534527  0.7438994
##   0.10       1                   400     0.9569712  0.7473762  0.7520343
##   0.10       1                   450     0.9522323  0.7499997  0.7512065
##   0.10       1                   500     0.9574789  0.7457934  0.7523744
##   0.10       1                   550     0.9543862  0.7471706  0.7536521
##   0.10       1                   600     0.9524283  0.7473309  0.7480448
##   0.10       1                   650     0.9525672  0.7480298  0.7480545
##   0.10       1                   700     0.9550885  0.7463772  0.7503086
##   0.10       1                   750     0.9534650  0.7472071  0.7503689
##   0.10       1                   800     0.9618848  0.7433281  0.7583346
##   0.10       1                   850     0.9527264  0.7476301  0.7493534
##   0.10       1                   900     0.9534093  0.7474830  0.7490165
##   0.10       1                   950     0.9550800  0.7459267  0.7488849
##   0.10       1                  1000     0.9549024  0.7461965  0.7482082
##   0.10       3                   100     0.8672828  0.7997504  0.6724252
##   0.10       3                   150     0.8587408  0.8022805  0.6684485
##   0.10       3                   200     0.8491344  0.8051998  0.6655315
##   0.10       3                   250     0.8438715  0.8077207  0.6636223
##   0.10       3                   300     0.8364416  0.8099339  0.6573446
##   0.10       3                   350     0.8361030  0.8095964  0.6570500
##   0.10       3                   400     0.8359125  0.8095954  0.6576636
##   0.10       3                   450     0.8347453  0.8101226  0.6579651
##   0.10       3                   500     0.8336613  0.8104941  0.6579158
##   0.10       3                   550     0.8332927  0.8105735  0.6581181
##   0.10       3                   600     0.8326773  0.8107482  0.6574003
##   0.10       3                   650     0.8326588  0.8107899  0.6574405
##   0.10       3                   700     0.8329870  0.8105758  0.6581416
##   0.10       3                   750     0.8320001  0.8110068  0.6569590
##   0.10       3                   800     0.8321554  0.8108708  0.6571576
##   0.10       3                   850     0.8320772  0.8109064  0.6572266
##   0.10       3                   900     0.8321044  0.8108823  0.6573080
##   0.10       3                   950     0.8320347  0.8109105  0.6573072
##   0.10       3                  1000     0.8318342  0.8109653  0.6571798
##   0.10       5                   100     0.9593756  0.7549443  0.7419556
##   0.10       5                   150     0.9357796  0.7648825  0.7267111
##   0.10       5                   200     0.9208813  0.7720514  0.7165350
##   0.10       5                   250     0.9083948  0.7773926  0.7018404
##   0.10       5                   300     0.9008779  0.7803813  0.6932426
##   0.10       5                   350     0.9024242  0.7788952  0.6943978
##   0.10       5                   400     0.9017370  0.7788301  0.6936603
##   0.10       5                   450     0.8997799  0.7796099  0.6918977
##   0.10       5                   500     0.9006895  0.7789687  0.6936504
##   0.10       5                   550     0.8985394  0.7799777  0.6912532
##   0.10       5                   600     0.8981671  0.7802889  0.6908763
##   0.10       5                   650     0.8965494  0.7807599  0.6900992
##   0.10       5                   700     0.8956180  0.7811850  0.6889932
##   0.10       5                   750     0.8950937  0.7813501  0.6887358
##   0.10       5                   800     0.8945658  0.7814615  0.6886371
##   0.10       5                   850     0.8940807  0.7816393  0.6880690
##   0.10       5                   900     0.8940101  0.7816174  0.6879583
##   0.10       5                   950     0.8936770  0.7817423  0.6876149
##   0.10       5                  1000     0.8934289  0.7818461  0.6872670
##   0.10       7                   100     0.9050152  0.7910557  0.7086485
##   0.10       7                   150     0.8759106  0.8036110  0.6886651
##   0.10       7                   200     0.8706129  0.8053617  0.6906914
##   0.10       7                   250     0.8610847  0.8083093  0.6854024
##   0.10       7                   300     0.8635373  0.8071970  0.6890273
##   0.10       7                   350     0.8609643  0.8081134  0.6881393
##   0.10       7                   400     0.8578950  0.8093314  0.6858182
##   0.10       7                   450     0.8562474  0.8099959  0.6850984
##   0.10       7                   500     0.8559575  0.8097251  0.6842265
##   0.10       7                   550     0.8550067  0.8098136  0.6839183
##   0.10       7                   600     0.8541177  0.8100802  0.6827144
##   0.10       7                   650     0.8538638  0.8100574  0.6827483
##   0.10       7                   700     0.8533798  0.8101361  0.6825868
##   0.10       7                   750     0.8533797  0.8101053  0.6828099
##   0.10       7                   800     0.8537345  0.8098217  0.6833863
##   0.10       7                   850     0.8537310  0.8097881  0.6834586
##   0.10       7                   900     0.8540211  0.8095763  0.6836386
##   0.10       7                   950     0.8538608  0.8096288  0.6833454
##   0.10       7                  1000     0.8539068  0.8096231  0.6834146
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
gbmTune$finalModel
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 57 predictors of which 56 had non-zero influence.
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))

gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  333.74
## ManufacturingProcess13  170.02
## ManufacturingProcess06  157.71
## ManufacturingProcess09   72.72
## BiologicalMaterial03     55.38
## ManufacturingProcess31   55.23
## BiologicalMaterial11     51.71
## BiologicalMaterial08     49.84
## ManufacturingProcess17   41.96
## BiologicalMaterial12     37.05
## BiologicalMaterial09     26.68
## ManufacturingProcess15   25.61
## ManufacturingProcess05   24.82
## ManufacturingProcess42   24.30
## ManufacturingProcess01   23.88
## ManufacturingProcess04   23.29
## BiologicalMaterial04     19.70
## ManufacturingProcess20   17.57
## BiologicalMaterial02     16.44
## ManufacturingProcess03   15.67
gbmTunePred <- predict(gbmTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = gbmTunePred, obs = chem_test_y)
##      RMSE  Rsquared       MAE 
## 1.6278311 0.1257552 1.2691626

For Stochastic Gradient Boosting, the best cross-validated RMSE is 0.8318342 and Rsquared is 0.8109653. This is the best Rsquared so far.

The test set performance is RMSE of 1.6278311 and Rsquared of 0.1257552, so it also does not hold up well on the held-out samples.

Now let’s try Cubist:

### Cubist

cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100), 
                      neighbors = c(0, 1, 5, 9))

set.seed(100)
cubistTune <- train(x, y,
                    "cubist",
                    tuneGrid = cbGrid,
                    trControl = ctrl)
cubistTune
## Cubist 
## 
## 110 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 98, 99, 99, 101, 99, 99, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##     1         0          0.9760591  0.7239697  0.7866821
##     1         1          0.8989662  0.8078457  0.6304803
##     1         5          0.8694968  0.7881871  0.6746366
##     1         9          0.8939106  0.7704492  0.7073263
##     2         0          0.9683191  0.7615857  0.7374496
##     2         1          0.8941032  0.8067911  0.6502860
##     2         5          0.8985366  0.7990245  0.6453925
##     2         9          0.9305241  0.7821177  0.6816569
##     3         0          0.8565233  0.8029734  0.6620314
##     3         1          0.8120012  0.8328901  0.5808218
##     3         5          0.8032756  0.8260645  0.5904685
##     3         9          0.8170974  0.8219461  0.6041171
##     4         0          0.8801530  0.7952121  0.6852915
##     4         1          0.8106978  0.8429572  0.5693896
##     4         5          0.8170935  0.8300469  0.5854132
##     4         9          0.8478985  0.8177320  0.6272331
##     5         0          0.8708639  0.8035874  0.6652981
##     5         1          0.7599308  0.8523177  0.5419491
##     5         5          0.8105990  0.8310991  0.5925108
##     5         9          0.8378839  0.8203181  0.6255494
##     6         0          0.8650573  0.8128972  0.6686134
##     6         1          0.7616806  0.8594168  0.5399942
##     6         5          0.7958720  0.8420401  0.5805012
##     6         9          0.8277821  0.8299176  0.6207085
##     7         0          0.8397481  0.8202117  0.6522682
##     7         1          0.7444919  0.8622388  0.5309700
##     7         5          0.7860351  0.8409138  0.5795684
##     7         9          0.8094116  0.8320257  0.6107343
##     8         0          0.8258270  0.8269115  0.6414097
##     8         1          0.7387420  0.8668747  0.5230493
##     8         5          0.7706005  0.8509960  0.5684454
##     8         9          0.8012115  0.8396694  0.6104625
##     9         0          0.8173384  0.8305986  0.6334590
##     9         1          0.7293402  0.8703782  0.5231017
##     9         5          0.7579300  0.8538050  0.5606776
##     9         9          0.7842962  0.8438592  0.5977199
##    10         0          0.8316213  0.8236442  0.6441661
##    10         1          0.7505436  0.8632493  0.5354345
##    10         5          0.7733745  0.8477313  0.5705181
##    10         9          0.8001195  0.8374964  0.6065137
##    20         0          0.8767542  0.8107228  0.6900870
##    20         1          0.7828961  0.8485103  0.5767046
##    20         5          0.8263284  0.8304575  0.6070923
##    20         9          0.8488508  0.8218695  0.6434328
##    50         0          0.8427499  0.8116329  0.6720406
##    50         1          0.7576651  0.8517342  0.5593009
##    50         5          0.7869744  0.8364552  0.5990439
##    50         9          0.8118871  0.8258637  0.6364337
##    75         0          0.8306681  0.8136498  0.6645124
##    75         1          0.7529566  0.8543419  0.5579873
##    75         5          0.7759688  0.8398680  0.5978006
##    75         9          0.7985683  0.8300644  0.6297697
##   100         0          0.8225066  0.8131985  0.6561630
##   100         1          0.7473080  0.8562931  0.5534851
##   100         5          0.7629617  0.8420478  0.5897026
##   100         9          0.7846826  0.8325806  0.6212690
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 9 and neighbors = 1.
cubistTune$finalModel
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## Number of samples: 110 
## Number of predictors: 57 
## 
## Number of committees: 9 
## Number of rules per committee: 2, 2, 2, 1, 3, 3, 4, 1, 4
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))

cbImp <- varImp(cubistTune, scale = FALSE)
cbImp
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32    41.0
## ManufacturingProcess17    38.5
## ManufacturingProcess13    36.5
## ManufacturingProcess09    31.5
## BiologicalMaterial05      20.5
## BiologicalMaterial03      19.5
## BiologicalMaterial08      15.5
## ManufacturingProcess11    13.0
## ManufacturingProcess29    13.0
## ManufacturingProcess24    12.0
## BiologicalMaterial06      12.0
## BiologicalMaterial10      11.0
## ManufacturingProcess22    10.0
## ManufacturingProcess04    10.0
## ManufacturingProcess27    10.0
## BiologicalMaterial09       9.5
## ManufacturingProcess14     8.5
## BiologicalMaterial02       8.5
## BiologicalMaterial12       8.0
## BiologicalMaterial01       8.0
plot(cbImp)

cubistTunePred <- predict(cubistTune, newdata = chem_test_x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = cubistTunePred, obs = chem_test_y)

For Cubist, the best cross-validated RMSE is 0.7293402 and Rsquared is 0.8703782 (committees = 9, neighbors = 1). This is the best Rsquared yet and a good proportion of the variance accounted for.

Like the other models, though, its test-set performance (from the chunk above) is much weaker than the resampling results.

8.7a

Which tree-based regression model gives the optimal resampling and test set performance?

Cubist performed the best. Like the other models, it did not have great test set performance, but it has the highest cross-validated Rsquared.
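A convenient way to put the cross-validated results side by side is caret's resamples() function. A sketch (it assumes the tuned objects above are still in memory and that they all used the same folds, which is the case for the models fit with ctrl; resamps is an illustrative name):

# Collect the per-fold results from the models that share the same CV folds.
resamps <- resamples(list(BaggedCART = treebagTune,
                          RF         = rfTune,
                          GBM        = gbmTune,
                          Cubist     = cubistTune))
summary(resamps)
dotplot(resamps, metric = "Rsquared")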

8.7b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

From above, the top 10 important predictors are:

ManufacturingProcess32 41.0
ManufacturingProcess17 38.5
ManufacturingProcess13 36.5
ManufacturingProcess09 31.5
BiologicalMaterial05 20.5
BiologicalMaterial03 19.5
BiologicalMaterial08 15.5
ManufacturingProcess11 13.0
ManufacturingProcess29 13.0
ManufacturingProcess24 12.0

The process variables dominate the list, with 7 of the top 10 being process variables. For comparison, the top predictors from the optimal nonlinear model (MARS) were ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess06, and ManufacturingProcess16 - all process variables. The Cubist list is also very similar to the optimal linear model's, which included both process and biological predictors: the first four predictors are the same, but the linear model's top 10 contained more biological variables.

8.7c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

I tried plotting the final model the way the textbook does, but these calls all returned errors for me - most likely because the final Cubist model is a collection of rules rather than a single tree, so it cannot be converted with as.party():

# library(partykit)
# 
# cubistTree <- as.party(cubistTune$finalModel)
# plot(cubistTree)

# plot(cubistTune$finalModel)
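Since the exercise asks for a single tree, one option is to fit a stand-alone CART tree on the training data and convert that to a party object, which plots the distribution of Yield in each terminal node. A sketch (assuming the rpart and partykit packages are installed; singleTree is an illustrative name, and this is a new single tree rather than the Cubist model above):

library(rpart)
library(partykit)
set.seed(100)
# Fit one regression tree on the imputed training data (Yield is the response).
singleTree <- rpart(Yield ~ ., data = chem_train_set_x_df)
# plot.party draws boxplots of Yield in the terminal nodes.
plot(as.party(singleTree), gp = gpar(fontsize = 7))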

Here are scatter plots showing the relationship between some of the top predictors and the yield:

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess32, y=Yield)) + geom_point()

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess17, y=Yield)) + geom_point()

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess13, y=Yield)) + geom_point()

ggplot(data=ChemicalManufacturingProcessCopy, aes(x=ManufacturingProcess09, y=Yield)) + geom_point()

ManufacturingProcess32's approximate range is 150-170. The variance seems fairly consistent. Overall, as ManufacturingProcess32 increases, Yield increases as well; around a value of 165, Yield seems to level off or decrease.

ManufacturingProcess17's approximate range is 32.5 to a little above 35, with a few outliers around 40. The variance seems fairly consistent. Overall, as ManufacturingProcess17 increases, Yield decreases.

ManufacturingProcess13's approximate range is 32-36, with a few outliers around 38. The variance seems fairly consistent. Overall, as ManufacturingProcess13 increases, Yield decreases.

ManufacturingProcess09's approximate range is 42.5-48, with a few outliers around 40. The variance seems fairly consistent. Overall, as ManufacturingProcess09 increases, Yield increases.

All of this information tells us which values of these predictors are associated with a higher yield.