Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative
predictors (V6 – V10)?
The random forest model did not make meaningful use of the uninformative
predictors (V6 – V10): their importance scores in the Overall
column below are near zero or negative.
library(randomForest)
library(caret)
set.seed(101)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 58.0853885
## V2 47.2605089
## V3 9.0331887
## V4 52.1388218
## V5 23.0427026
## V6 3.6874906
## V7 0.4329092
## V8 -1.0585958
## V9 -0.6603554
## V10 -1.5066634
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9353036
Fit another random forest model to these data. Did the importance
score for V1 change?
Yes. The importance score for V1 dropped from about 58.1 to about 30.0,
while the added predictor duplicate1 received a comparable score (about 29.3),
so the importance of V1 is effectively split between the two correlated predictors.
set.seed(102)
model2 <- randomForest(y ~ ., data = simulated2,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 29.9870844
## V2 47.8046607
## V3 11.1389833
## V4 52.3435183
## V5 20.9576967
## V6 2.6738456
## V7 0.4829991
## V8 -4.3763485
## V9 -2.3718826
## V10 0.1649363
## duplicate1 29.3057231
What happens when you add another predictor that is also highly
correlated with V1?
simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
cor(simulated3$duplicate2, simulated3$V1)
## [1] 0.9373468
The importance of V1 diminishes even further, to about 26.3, when a second
highly correlated predictor is added, with duplicate1 and duplicate2 also
sharing the credit; a side-by-side comparison of the three fits is sketched after the output below.
set.seed(103)
model3 <- randomForest(y ~ ., data = simulated3,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 26.32386207
## V2 50.21462836
## V3 8.80903311
## V4 53.56697296
## V5 24.88382742
## V6 3.98633526
## V7 0.07252339
## V8 -0.57023026
## V9 0.29724693
## V10 -0.07867575
## duplicate1 24.78568683
## duplicate2 16.36582229
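To make the dilution of V1's importance easier to see, the three importance data frames computed above can be lined up in one table (a minimal sketch; models without the duplicate predictors show NA for those rows):
# Compare importance scores across the three random forest fits
imp_all <- data.frame(
  NoDuplicate   = rfImp1[rownames(rfImp3), "Overall"],
  OneDuplicate  = rfImp2[rownames(rfImp3), "Overall"],
  TwoDuplicates = rfImp3[rownames(rfImp3), "Overall"],
  row.names     = rownames(rfImp3)
)
imp_all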
Use the cforest function in the party
package to fit a random forest model using conditional inference trees.
The party package function varimp can
calculate predictor importance. The conditional argument of
that function toggles between the traditional importance measure and the
modified version described in Strobl et al. (2007). Do these importances
show the same pattern as the traditional random forest model?
The overall pattern matches the traditional random forest: under both settings
V1, V2 and V4 receive the largest importances and the uninformative predictors
V6 – V10 are essentially zero. The magnitudes differ considerably, however: the
conditional importances (conditional = TRUE, the adjustment of Strobl et al.)
are much smaller than the unconditional ones (conditional = FALSE), even though
the relative ordering of informative versus uninformative predictors is similar.
library(party)

set.seed(104)
party_model <- cforest(y ~ ., data = simulated,
control = cforest_control(ntree = 50)
)
varimp(party_model, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 2.500366690 4.147284845 0.117796542 5.562326154 0.584068656 0.000427757
## V7 V8 V9 V10
## 0.059419902 0.010785555 0.018520247 0.006618613
varimp(party_model, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 9.62224831 7.58082173 0.13620059 9.32598070 2.46159687 -0.02286604
## V7 V8 V9 V10
## 0.12888545 -0.05064172 0.17145243 -0.04055390
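For a direct side-by-side view, the two importance vectors can be collected into one table (a sketch; varimp is based on random permutations, so re-running it gives slightly different numbers than those printed above):
# Side-by-side comparison of conditional vs. unconditional cforest importances
cond   <- varimp(party_model, conditional = TRUE)
uncond <- varimp(party_model, conditional = FALSE)
round(data.frame(Conditional = cond, Unconditional = uncond[names(cond)]), 3)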
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
First, a boosted tree, also known as a gradient boosting machine (GBM), is tuned over a grid of parameters.
gbmGrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
n.trees = seq(100, 500, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 3)
set.seed(105)
gbmTune <- train(y ~ ., data = simulated,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 3.995805 0.5776166 3.276980
## 0.01 1 150 3.695844 0.6309434 3.036789
## 0.01 1 200 3.455314 0.6644869 2.839095
## 0.01 1 250 3.255560 0.6899540 2.675536
## 0.01 1 300 3.089233 0.7132649 2.538255
## 0.01 1 350 2.944228 0.7328316 2.417203
## 0.01 1 400 2.818036 0.7494924 2.308735
## 0.01 1 450 2.710302 0.7630493 2.215265
## 0.01 1 500 2.615433 0.7757461 2.132033
## 0.01 3 100 3.458882 0.6929036 2.837407
## 0.01 3 150 3.067551 0.7374512 2.516892
## 0.01 3 200 2.786976 0.7677814 2.286402
## 0.01 3 250 2.573320 0.7900604 2.106486
## 0.01 3 300 2.405242 0.8088867 1.962927
## 0.01 3 350 2.277189 0.8225163 1.855329
## 0.01 3 400 2.175040 0.8327997 1.769704
## 0.01 3 450 2.098931 0.8402706 1.705148
## 0.01 3 500 2.038406 0.8460102 1.652771
## 0.01 5 100 3.265013 0.7259168 2.671374
## 0.01 5 150 2.869117 0.7624994 2.349975
## 0.01 5 200 2.597553 0.7877082 2.127605
## 0.01 5 250 2.407750 0.8057135 1.965158
## 0.01 5 300 2.267417 0.8196619 1.843744
## 0.01 5 350 2.162169 0.8304044 1.753772
## 0.01 5 400 2.085212 0.8381233 1.687814
## 0.01 5 450 2.029971 0.8435372 1.639708
## 0.01 5 500 1.987071 0.8475863 1.603178
## 0.01 7 100 3.192407 0.7389184 2.614071
## 0.01 7 150 2.797255 0.7702049 2.294571
## 0.01 7 200 2.535636 0.7919994 2.076689
## 0.01 7 250 2.355060 0.8095292 1.920288
## 0.01 7 300 2.229086 0.8214984 1.809667
## 0.01 7 350 2.140799 0.8298060 1.733536
## 0.01 7 400 2.076318 0.8362659 1.677716
## 0.01 7 450 2.030884 0.8407237 1.638867
## 0.01 7 500 1.996613 0.8442391 1.608230
## 0.10 1 100 2.101089 0.8300680 1.678809
## 0.10 1 150 1.926538 0.8476812 1.533507
## 0.10 1 200 1.883671 0.8511487 1.498819
## 0.10 1 250 1.862376 0.8530836 1.477936
## 0.10 1 300 1.861325 0.8526384 1.477800
## 0.10 1 350 1.860576 0.8524635 1.476150
## 0.10 1 400 1.865615 0.8514076 1.482673
## 0.10 1 450 1.870605 0.8506402 1.484873
## 0.10 1 500 1.872349 0.8501779 1.487219
## 0.10 3 100 1.901682 0.8502609 1.528354
## 0.10 3 150 1.876583 0.8527489 1.507168
## 0.10 3 200 1.865754 0.8539075 1.495764
## 0.10 3 250 1.866369 0.8535496 1.495927
## 0.10 3 300 1.865559 0.8534300 1.494770
## 0.10 3 350 1.865188 0.8533620 1.494106
## 0.10 3 400 1.866862 0.8530133 1.495071
## 0.10 3 450 1.868375 0.8527078 1.496683
## 0.10 3 500 1.868415 0.8526416 1.496915
## 0.10 5 100 1.926286 0.8452157 1.553391
## 0.10 5 150 1.911456 0.8465165 1.542253
## 0.10 5 200 1.909069 0.8464344 1.540653
## 0.10 5 250 1.907747 0.8465461 1.539543
## 0.10 5 300 1.907215 0.8465451 1.538830
## 0.10 5 350 1.907541 0.8464275 1.538986
## 0.10 5 400 1.907407 0.8464421 1.538735
## 0.10 5 450 1.907413 0.8464240 1.538811
## 0.10 5 500 1.907375 0.8464273 1.538839
## 0.10 7 100 1.971915 0.8375622 1.579696
## 0.10 7 150 1.962138 0.8386602 1.574179
## 0.10 7 200 1.959228 0.8389521 1.572481
## 0.10 7 250 1.958169 0.8390927 1.571734
## 0.10 7 300 1.957822 0.8391242 1.571549
## 0.10 7 350 1.957700 0.8391249 1.571607
## 0.10 7 400 1.957695 0.8391237 1.571621
## 0.10 7 450 1.957682 0.8391245 1.571635
## 0.10 7 500 1.957647 0.8391276 1.571623
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 350, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 3.
The best model shows n.trees = 350, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 3.
gbmTune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 42 350 1 0.1 3
The variable importance in the final model for GBM shows a similar
pattern to that in the random forest model, with the first five
predictors showing the most importance and predictors
V6 - V10 showing comparatively much smaller importance.
gbmImp <- varImp(gbmTune$finalModel, scale = FALSE)
gbmImp
## Overall
## V1 4454.0170
## V2 4062.7178
## V3 1658.4146
## V4 4801.0918
## V5 1833.8697
## V6 343.9860
## V7 236.3253
## V8 103.4074
## V9 104.0551
## V10 122.7416
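The same ranking can also be viewed graphically through caret's wrapper for the tuned object (a sketch):
# Dotplot of GBM variable importance from the caret train object
plot(varImp(gbmTune))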
Next, a Cubist model is tuned over the number of committees and neighbors.
CubistGrid <- expand.grid(committees = seq(1, 10, by = 1),
neighbors = seq(1, 9, by=1)
)
set.seed(106)
CubistTune <- train(y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = CubistGrid,
verbose = FALSE)
CubistTune
## Cubist
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 1 2.711925 0.7336853 2.207277
## 1 2 2.326001 0.7897266 1.872233
## 1 3 2.228632 0.8071637 1.811793
## 1 4 2.231424 0.8085515 1.811411
## 1 5 2.248003 0.8073025 1.836636
## 1 6 2.223493 0.8134148 1.820904
## 1 7 2.229699 0.8130288 1.825180
## 1 8 2.215426 0.8163242 1.814618
## 1 9 2.221566 0.8158777 1.812802
## 2 1 2.340070 0.7964211 1.885731
## 2 2 2.175029 0.8148730 1.683995
## 2 3 2.129579 0.8191719 1.654108
## 2 4 2.116291 0.8212608 1.652221
## 2 5 2.118139 0.8216806 1.672199
## 2 6 2.087094 0.8278590 1.641830
## 2 7 2.077668 0.8299909 1.633237
## 2 8 2.066039 0.8327102 1.621080
## 2 9 2.070462 0.8333704 1.623758
## 3 1 2.294832 0.8011217 1.883218
## 3 2 2.098351 0.8250880 1.673916
## 3 3 2.045167 0.8315533 1.620922
## 3 4 2.034948 0.8340246 1.617090
## 3 5 2.043261 0.8335112 1.632887
## 3 6 2.013996 0.8395177 1.606264
## 3 7 2.013288 0.8401834 1.605875
## 3 8 1.999939 0.8431075 1.597307
## 3 9 2.001361 0.8440620 1.597319
## 4 1 2.258175 0.8074714 1.822501
## 4 2 2.105873 0.8241487 1.634083
## 4 3 2.061229 0.8283244 1.608464
## 4 4 2.052213 0.8302035 1.605971
## 4 5 2.056035 0.8300225 1.626713
## 4 6 2.019646 0.8364804 1.594174
## 4 7 2.008063 0.8383537 1.587219
## 4 8 1.995898 0.8410126 1.573795
## 4 9 2.000587 0.8414608 1.574768
## 5 1 2.210676 0.8128564 1.812824
## 5 2 2.045633 0.8320052 1.624804
## 5 3 2.001494 0.8369599 1.589742
## 5 4 1.993808 0.8390432 1.587721
## 5 5 1.999740 0.8386806 1.604159
## 5 6 1.967075 0.8448665 1.573362
## 5 7 1.962752 0.8456880 1.565428
## 5 8 1.950617 0.8482419 1.555006
## 5 9 1.954331 0.8487846 1.555836
## 6 1 2.198884 0.8157285 1.793028
## 6 2 2.069039 0.8285744 1.621563
## 6 3 2.028433 0.8327044 1.603352
## 6 4 2.018928 0.8349532 1.597658
## 6 5 2.022005 0.8350214 1.616614
## 6 6 1.984937 0.8417400 1.583331
## 6 7 1.975503 0.8433117 1.572371
## 6 8 1.964102 0.8458160 1.557703
## 6 9 1.970063 0.8460593 1.558274
## 7 1 2.191346 0.8171894 1.798636
## 7 2 2.042870 0.8334824 1.625910
## 7 3 2.000818 0.8379953 1.598749
## 7 4 1.994664 0.8401334 1.591285
## 7 5 2.000308 0.8398234 1.606783
## 7 6 1.966283 0.8463090 1.571500
## 7 7 1.961806 0.8472306 1.559848
## 7 8 1.949725 0.8498320 1.546689
## 7 9 1.954954 0.8501811 1.545975
## 8 1 2.153382 0.8231194 1.766590
## 8 2 2.027899 0.8363316 1.612666
## 8 3 1.997273 0.8390926 1.590581
## 8 4 1.993695 0.8408109 1.583381
## 8 5 1.994718 0.8412745 1.597227
## 8 6 1.957605 0.8481896 1.561220
## 8 7 1.950050 0.8495089 1.544936
## 8 8 1.939104 0.8518786 1.529702
## 8 9 1.945596 0.8520786 1.530434
## 9 1 2.167254 0.8211178 1.775970
## 9 2 2.018143 0.8374644 1.606543
## 9 3 1.981078 0.8411872 1.586101
## 9 4 1.977655 0.8430467 1.581607
## 9 5 1.979508 0.8433312 1.595858
## 9 6 1.941694 0.8503175 1.557291
## 9 7 1.935124 0.8513861 1.539788
## 9 8 1.923756 0.8538002 1.523752
## 9 9 1.929766 0.8540458 1.523193
## 10 1 2.162744 0.8212122 1.778427
## 10 2 2.034639 0.8342419 1.615806
## 10 3 2.002803 0.8375004 1.596337
## 10 4 1.999543 0.8392458 1.585589
## 10 5 1.999465 0.8397810 1.598078
## 10 6 1.961211 0.8468361 1.563843
## 10 7 1.953291 0.8481327 1.549044
## 10 8 1.942347 0.8505878 1.535451
## 10 9 1.948663 0.8508138 1.536372
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 9 and neighbors = 8.
The best model shows committees = 9 and neighbors = 8.
CubistTune$bestTune
## committees neighbors
## 80 9 8
The variable importance in the final Cubist model again shows a
similar pattern to that in the random forest model and the GBM model,
with the first five predictors showing the most importance and
predictors V6 - V10 showing comparatively much smaller or
no importance.
CubistImp <- varImp(CubistTune$finalModel, scale = FALSE)
CubistImp
## Overall
## V1 68.0
## V3 37.0
## V2 54.0
## V4 50.0
## V5 48.5
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
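Since the question is whether the same pattern occurs, one way to check is to line up the importance rankings of V1 – V10 from the three models fit to the original simulated data (a sketch using the rfImp1, gbmImp and CubistImp objects computed above; rank 1 is most important):
# Compare importance ranks across random forest, GBM and Cubist
preds <- paste0("V", 1:10)
rank_tbl <- data.frame(
  RF     = rank(-rfImp1[preds, "Overall"]),
  GBM    = rank(-gbmImp[preds, "Overall"]),
  Cubist = rank(-CubistImp[preds, "Overall"]),
  row.names = preds
)
rank_tbl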
Use a simulation to show tree bias with different granularities.
A data set is created in which variable \(x1\) takes only two distinct values (1 and 2) and is the true driver of the outcome \(y\), while variable \(x2\) is an independent, continuous ("granular") random variable that is unrelated to \(y\). Consistent with the text's statement that selection bias favors predictors with more distinct values, the simulation below ranks \(x2\) as the most important predictor in 176 of 200 replications, roughly 7:1 over \(x1\), even though \(x1\) is the informative predictor.
library(rpart)   # recursive partitioning trees
library(dplyr)   # provides the pipe and slice_max()
library(tibble)

set.seed(107)
vars <- tibble()
for (i in 1:200){
# generate a data set for the simulation: x1 has only two distinct values and
# drives y, while x2 is independent noise with many distinct values
sim <- tibble(x1 = rep(c(1,2), times = 100))
sim$x2 <- rnorm(200, mean = 5, sd = 2)
sim$y <- sim$x1 + rnorm(200, mean = 0, sd = 1)
sim.mod <- rpart(y ~ ., data = sim)
# keep only the single most important predictor (slice_max defaults to n = 1)
top10 <- varImp(sim.mod) %>%
slice_max(order_by = Overall) %>%
row.names()
vars <- rbind(vars, top10)
}
table(vars)
## vars
## x1 x2
## 24 176
The variable importance from the final iteration's model confirms the pattern:
the uninformative but more granular x2 receives a higher
importance score than x1, illustrating the selection bias toward
predictors with many distinct values. A cross-check that tallies the
root-split variable is sketched after the output below.
varImp(sim.mod)
## Overall
## x1 0.2379800
## x2 0.6724913
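As a cross-check (a sketch that is not part of the original analysis; the seed 108 is arbitrary), the variable chosen for the very first split can be tallied across repeated simulations in the same way:
# Count which predictor is chosen at the root split across 200 simulations
set.seed(108)
root_vars <- replicate(200, {
  sim <- tibble(x1 = rep(c(1, 2), times = 100),
                x2 = rnorm(200, mean = 5, sd = 2))
  sim$y <- sim$x1 + rnorm(200)
  as.character(rpart(y ~ ., data = sim)$frame$var[1])  # root split variable
})
table(root_vars)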
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction is the proportion of the training data randomly sampled to build each tree, and the learning rate (shrinkage) is the fraction of each new tree's predictions added to the running model. With both set to 0.9, every tree sees nearly the same data and each tree contributes a large step, so the few strongest predictors are found early and reinforced repeatedly; importance therefore concentrates on just the first few predictors, and the model is more prone to overfitting. With both set to 0.1, each tree is built on a different 10% sample and contributes only a small step, so many more, more varied trees are needed and importance is spread across more predictors. According to the text, "Friedman suggests using a bagging fraction of around 0.5." (p. 206) A rough sketch of the two extreme settings follows.
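For reference, here is a rough sketch of how the two extreme settings could be reproduced on the solubility data with the gbm package. This is an assumption about the setup, not the book's code for Fig. 8.24; the interaction depth and n.trees = 100 are arbitrary choices to keep the sketch small.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

# Left-hand panel of Fig. 8.24: small bagging fraction and learning rate
gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
# Right-hand panel: both parameters set to 0.9
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence: spread out for the low settings, concentrated for the high
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)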
Which model do you think would be more predictive of other samples?
Given the answer to 8.3.a above, I think that the model on the left would be more predictive of other samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree split on more predictors, so importance would be spread across more of them; the slope of the predictor importance curve would therefore flatten for both models in Fig. 8.24.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
First acquire the data.
data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter.
Transform (Box-Cox), impute missing values with k-nearest neighbors, and center and scale the data; then check for near-zero-variance predictors and remove them. Only one predictor is removed.
set.seed(404)
preProc <- preProcess(chem_mfg, method = c("BoxCox","knnImpute","center","scale"))
chem_pred <- predict(preProc,chem_mfg)
# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_pred)
# remove the above set from the data and store the result in a dataframe
chem_pred <- chem_pred[,-lowFreqPredictors]
# randomly assign 75% of rows to training (coded 0) and 25% to test (coded 1)
trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_pred)),
rep(1, 0.25 * nrow(chem_pred))))
chem_train <- chem_pred[trainingRows == 0,]
chem_test <- chem_pred[trainingRows == 1,]
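As an aside, the same 75/25 split could be done with caret's createDataPartition, which stratifies on the outcome. This is a sketch of a hypothetical alternative; the analysis below keeps the manual split above.
# Stratified 75/25 split on Yield (alternative to the manual sampling above)
inTrain        <- createDataPartition(chem_pred$Yield, p = 0.75, list = FALSE)
chem_train_alt <- chem_pred[inTrain, ]
chem_test_alt  <- chem_pred[-inTrain, ]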
With the data split we can train each of the model types from the chapter: single tree, random forest, gradient boosting and Cubist. Let’s train all 4 types in that order.
The single tree (CART) model tunes over the complexity parameter cp, and the final model selected cp = 0.08378936, the value with the lowest resampled RMSE.
trainX <- chem_train %>% select(-Yield)
trainY <- chem_train$Yield
testX <- chem_test %>% select(-Yield)
testY <- chem_test$Yield
set.seed(405)
rpartMod <- train(x = trainX,
y = trainY,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth = 10L)
)
rpartMod
## CART
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01354873 0.8049641 0.3665461 0.6404770
## 0.01765407 0.8003728 0.3676270 0.6377386
## 0.01804999 0.7996917 0.3689222 0.6376828
## 0.01862487 0.7985018 0.3702319 0.6349426
## 0.03301136 0.7958219 0.3658563 0.6323289
## 0.04135500 0.7849224 0.3694403 0.6277843
## 0.05329571 0.7762629 0.3743849 0.6210082
## 0.07829230 0.7766703 0.3670912 0.6215433
## 0.08378936 0.7700079 0.3734328 0.6154686
## 0.41761517 0.8477627 0.3291581 0.7054969
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.08378936.
plot(rpartMod)
The random forest model tunes over mtry, the number of predictors randomly sampled as split candidates at each node, and it selected the model with the lowest resampled RMSE: mtry = 26 with RMSE = 0.6282372.
set.seed(406)
RF_Mod <- train(x = trainX,
y = trainY,
method = "rf",
tuneLength = 10L)
RF_Mod
## Random Forest
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6748048 0.5495652 0.5557324
## 8 0.6417031 0.5698311 0.5269982
## 14 0.6327102 0.5763653 0.5178999
## 20 0.6290145 0.5773244 0.5134953
## 26 0.6282372 0.5754574 0.5107357
## 32 0.6286607 0.5720628 0.5116401
## 38 0.6312262 0.5663146 0.5118749
## 44 0.6325253 0.5643301 0.5112899
## 50 0.6371669 0.5573378 0.5134995
## 56 0.6394104 0.5537738 0.5142292
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
plot(RF_Mod)
The gradient boosting model tunes over the number of trees, interaction depth, shrinkage rate and minimum observations in the nodes, and it selected n.trees = 500, interaction.depth = 11, shrinkage = 0.01 and n.minobsinnode = 5, where RMSE = 0.6192271.
set.seed(407)
gbmGrid2 <- expand.grid(interaction.depth = seq(1,15, by = 5),
n.trees = seq(100, 500, by = 100),
shrinkage = c(0.01, 0.1, 0.5),
n.minobsinnode = c(5, 10, 15)
)
gbmMod <- train(x = trainX,
y = trainY,
method = "gbm",
tuneGrid = gbmGrid2,
verbose = FALSE)
gbmMod
## Stochastic Gradient Boosting
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 100 0.7370571 0.4651511
## 0.01 1 5 200 0.6817666 0.4855145
## 0.01 1 5 300 0.6620301 0.4972607
## 0.01 1 5 400 0.6521094 0.5056299
## 0.01 1 5 500 0.6460158 0.5129683
## 0.01 1 10 100 0.7373940 0.4631862
## 0.01 1 10 200 0.6802126 0.4877830
## 0.01 1 10 300 0.6587881 0.5016686
## 0.01 1 10 400 0.6492666 0.5095714
## 0.01 1 10 500 0.6444236 0.5148808
## 0.01 1 15 100 0.7342956 0.4677778
## 0.01 1 15 200 0.6792548 0.4901899
## 0.01 1 15 300 0.6582292 0.5023378
## 0.01 1 15 400 0.6494848 0.5096260
## 0.01 1 15 500 0.6449915 0.5141947
## 0.01 6 5 100 0.6882403 0.5114744
## 0.01 6 5 200 0.6423044 0.5257229
## 0.01 6 5 300 0.6297544 0.5360787
## 0.01 6 5 400 0.6255573 0.5416497
## 0.01 6 5 500 0.6232940 0.5451644
## 0.01 6 10 100 0.6935400 0.5069938
## 0.01 6 10 200 0.6440425 0.5258504
## 0.01 6 10 300 0.6314331 0.5341283
## 0.01 6 10 400 0.6274953 0.5383298
## 0.01 6 10 500 0.6239103 0.5436938
## 0.01 6 15 100 0.7062681 0.4939890
## 0.01 6 15 200 0.6547579 0.5125137
## 0.01 6 15 300 0.6401304 0.5222972
## 0.01 6 15 400 0.6346034 0.5287070
## 0.01 6 15 500 0.6324812 0.5321224
## 0.01 11 5 100 0.6818832 0.5258197
## 0.01 11 5 200 0.6350153 0.5385125
## 0.01 11 5 300 0.6236665 0.5461688
## 0.01 11 5 400 0.6211743 0.5488601
## 0.01 11 5 500 0.6192271 0.5513971
## 0.01 11 10 100 0.6929863 0.5092920
## 0.01 11 10 200 0.6440467 0.5254312
## 0.01 11 10 300 0.6314916 0.5343700
## 0.01 11 10 400 0.6275585 0.5381355
## 0.01 11 10 500 0.6243187 0.5431088
## 0.01 11 15 100 0.7015178 0.5056586
## 0.01 11 15 200 0.6516962 0.5194749
## 0.01 11 15 300 0.6377504 0.5270697
## 0.01 11 15 400 0.6331312 0.5319376
## 0.01 11 15 500 0.6307983 0.5352547
## 0.10 1 5 100 0.6535684 0.5066982
## 0.10 1 5 200 0.6542160 0.5111406
## 0.10 1 5 300 0.6534049 0.5146111
## 0.10 1 5 400 0.6534964 0.5161386
## 0.10 1 5 500 0.6526969 0.5186241
## 0.10 1 10 100 0.6495670 0.5097828
## 0.10 1 10 200 0.6510404 0.5126691
## 0.10 1 10 300 0.6535248 0.5122275
## 0.10 1 10 400 0.6535588 0.5145005
## 0.10 1 10 500 0.6537423 0.5152774
## 0.10 1 15 100 0.6479456 0.5134762
## 0.10 1 15 200 0.6543280 0.5107582
## 0.10 1 15 300 0.6604964 0.5060703
## 0.10 1 15 400 0.6618215 0.5068907
## 0.10 1 15 500 0.6643671 0.5059078
## 0.10 6 5 100 0.6317202 0.5365384
## 0.10 6 5 200 0.6293862 0.5403359
## 0.10 6 5 300 0.6291360 0.5409466
## 0.10 6 5 400 0.6289738 0.5412793
## 0.10 6 5 500 0.6289158 0.5413583
## 0.10 6 10 100 0.6262132 0.5417056
## 0.10 6 10 200 0.6265911 0.5423066
## 0.10 6 10 300 0.6263218 0.5433232
## 0.10 6 10 400 0.6261116 0.5440541
## 0.10 6 10 500 0.6260856 0.5442395
## 0.10 6 15 100 0.6446185 0.5220179
## 0.10 6 15 200 0.6394123 0.5315022
## 0.10 6 15 300 0.6405357 0.5308435
## 0.10 6 15 400 0.6399070 0.5322242
## 0.10 6 15 500 0.6402852 0.5318261
## 0.10 11 5 100 0.6327798 0.5367099
## 0.10 11 5 200 0.6325548 0.5375226
## 0.10 11 5 300 0.6332232 0.5366667
## 0.10 11 5 400 0.6334804 0.5363242
## 0.10 11 5 500 0.6336118 0.5361509
## 0.10 11 10 100 0.6376831 0.5283100
## 0.10 11 10 200 0.6360064 0.5313663
## 0.10 11 10 300 0.6353634 0.5324191
## 0.10 11 10 400 0.6354255 0.5324948
## 0.10 11 10 500 0.6352444 0.5328152
## 0.10 11 15 100 0.6393094 0.5245047
## 0.10 11 15 200 0.6372500 0.5308381
## 0.10 11 15 300 0.6367424 0.5330525
## 0.10 11 15 400 0.6377779 0.5320433
## 0.10 11 15 500 0.6382583 0.5318266
## 0.50 1 5 100 0.7356817 0.4343120
## 0.50 1 5 200 0.7393572 0.4366845
## 0.50 1 5 300 0.7399861 0.4375522
## 0.50 1 5 400 0.7399606 0.4380613
## 0.50 1 5 500 0.7402697 0.4378759
## 0.50 1 10 100 0.7157237 0.4578811
## 0.50 1 10 200 0.7124490 0.4675811
## 0.50 1 10 300 0.7114750 0.4700329
## 0.50 1 10 400 0.7112105 0.4717345
## 0.50 1 10 500 0.7117262 0.4713326
## 0.50 1 15 100 0.7372281 0.4286385
## 0.50 1 15 200 0.7401058 0.4307126
## 0.50 1 15 300 0.7407246 0.4319868
## 0.50 1 15 400 0.7417462 0.4316124
## 0.50 1 15 500 0.7424766 0.4316444
## 0.50 6 5 100 0.7531406 0.4176720
## 0.50 6 5 200 0.7531123 0.4179072
## 0.50 6 5 300 0.7531181 0.4179096
## 0.50 6 5 400 0.7531183 0.4179093
## 0.50 6 5 500 0.7531183 0.4179093
## 0.50 6 10 100 0.7235516 0.4395822
## 0.50 6 10 200 0.7239805 0.4391569
## 0.50 6 10 300 0.7240493 0.4390820
## 0.50 6 10 400 0.7240439 0.4390852
## 0.50 6 10 500 0.7240436 0.4390841
## 0.50 6 15 100 0.7350882 0.4360436
## 0.50 6 15 200 0.7345366 0.4372262
## 0.50 6 15 300 0.7351651 0.4366227
## 0.50 6 15 400 0.7352929 0.4365388
## 0.50 6 15 500 0.7353305 0.4365297
## 0.50 11 5 100 0.7699983 0.3963567
## 0.50 11 5 200 0.7700935 0.3962006
## 0.50 11 5 300 0.7700950 0.3962001
## 0.50 11 5 400 0.7700951 0.3962000
## 0.50 11 5 500 0.7700951 0.3962000
## 0.50 11 10 100 0.7325500 0.4386459
## 0.50 11 10 200 0.7326953 0.4389810
## 0.50 11 10 300 0.7327284 0.4389848
## 0.50 11 10 400 0.7327735 0.4389360
## 0.50 11 10 500 0.7327752 0.4389344
## 0.50 11 15 100 0.7399964 0.4175177
## 0.50 11 15 200 0.7400771 0.4187559
## 0.50 11 15 300 0.7400960 0.4185902
## 0.50 11 15 400 0.7400815 0.4186186
## 0.50 11 15 500 0.7400508 0.4186448
## MAE
## 0.6109017
## 0.5594687
## 0.5371287
## 0.5230742
## 0.5140743
## 0.6105525
## 0.5579403
## 0.5341115
## 0.5210249
## 0.5138463
## 0.6084873
## 0.5568716
## 0.5335857
## 0.5212273
## 0.5138350
## 0.5677251
## 0.5191257
## 0.5007855
## 0.4935333
## 0.4887714
## 0.5740408
## 0.5228503
## 0.5053318
## 0.4977029
## 0.4923044
## 0.5842379
## 0.5331885
## 0.5141605
## 0.5063208
## 0.5016688
## 0.5612477
## 0.5116663
## 0.4947859
## 0.4885967
## 0.4848843
## 0.5714143
## 0.5210350
## 0.5026809
## 0.4951250
## 0.4904996
## 0.5807685
## 0.5303081
## 0.5119817
## 0.5041687
## 0.4995841
## 0.5102499
## 0.5064829
## 0.5037199
## 0.5044915
## 0.5037609
## 0.5092468
## 0.5087937
## 0.5120220
## 0.5121893
## 0.5113029
## 0.5091321
## 0.5118625
## 0.5167432
## 0.5182556
## 0.5205944
## 0.4933413
## 0.4899733
## 0.4898202
## 0.4896770
## 0.4896421
## 0.4890498
## 0.4889993
## 0.4885858
## 0.4884049
## 0.4883658
## 0.5056983
## 0.5028355
## 0.5035118
## 0.5034374
## 0.5038183
## 0.4945771
## 0.4926192
## 0.4927379
## 0.4928335
## 0.4928172
## 0.5025158
## 0.5001694
## 0.4995555
## 0.4993328
## 0.4990487
## 0.5054999
## 0.5036062
## 0.5043610
## 0.5051281
## 0.5057808
## 0.5764945
## 0.5797314
## 0.5806527
## 0.5807530
## 0.5811245
## 0.5630980
## 0.5597806
## 0.5593762
## 0.5594897
## 0.5599122
## 0.5845247
## 0.5892297
## 0.5910257
## 0.5923801
## 0.5926532
## 0.5938101
## 0.5938105
## 0.5938214
## 0.5938218
## 0.5938218
## 0.5658401
## 0.5663251
## 0.5664126
## 0.5664145
## 0.5664165
## 0.5827455
## 0.5820158
## 0.5826048
## 0.5827317
## 0.5827866
## 0.6007775
## 0.6008768
## 0.6008796
## 0.6008796
## 0.6008796
## 0.5750730
## 0.5752676
## 0.5752905
## 0.5753407
## 0.5753450
## 0.5783508
## 0.5778540
## 0.5780692
## 0.5780739
## 0.5780611
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 11, shrinkage = 0.01 and n.minobsinnode = 5.
plot(gbmMod)
The Cubist model tunes over the number of committees and the number of nearest neighbors used to adjust its predictions, and it selected committees = 20 and neighbors = 5, with RMSE = 0.6241594.
set.seed(408)
CubistMod <- train(x = trainX,
y = trainY,
method = "cubist")
CubistMod
## Cubist
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.9097795 0.3261565 0.6657959
## 1 5 0.8948935 0.3483801 0.6471097
## 1 9 0.8976531 0.3423711 0.6527329
## 10 0 0.6655149 0.5073520 0.5157291
## 10 5 0.6499137 0.5295174 0.5012022
## 10 9 0.6569764 0.5200360 0.5081046
## 20 0 0.6405293 0.5311136 0.4992244
## 20 5 0.6241594 0.5542839 0.4824151
## 20 9 0.6312454 0.5448120 0.4892632
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
plot(CubistMod)
Which tree-based regression model gives the optimal resampling and test set performance?
Each model was trained with "RMSE was used to select the optimal model using the smallest value," so we can compare the resampling distributions and pick the model with the consistently smallest RMSE. Gradient boosting has the smallest RMSE at the 1st quartile, mean, 3rd quartile and maximum; Cubist has a slightly lower median. Overall, gradient boosting shows the most consistently small RMSE values.
optimalTree <- resamples(list(SingleTree=rpartMod, RandomForest=RF_Mod, GradientBoosting=gbmMod, Cubist=CubistMod))
summary(optimalTree)
##
## Call:
## summary.resamples(object = optimalTree)
##
## Models: SingleTree, RandomForest, GradientBoosting, Cubist
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.4859186 0.5820133 0.6129884 0.6154686 0.6472591 0.7220190
## RandomForest 0.3836628 0.4853523 0.5150236 0.5107357 0.5341481 0.6699730
## GradientBoosting 0.3879447 0.4669835 0.4908541 0.4848843 0.5038450 0.5531224
## Cubist 0.3821551 0.4545234 0.4754450 0.4824151 0.5021344 0.6463159
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.5966413 0.7065075 0.7787488 0.7700079 0.8300308 0.9086841
## RandomForest 0.4737784 0.5972830 0.6433030 0.6282372 0.6616887 0.8285726
## GradientBoosting 0.5206890 0.5821177 0.6264100 0.6192271 0.6487761 0.7411194
## Cubist 0.4782438 0.5861085 0.6186330 0.6241594 0.6504664 0.8070523
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.1486274 0.3183827 0.3638089 0.3734328 0.4262367 0.5547497
## RandomForest 0.4068542 0.5251647 0.5672361 0.5754574 0.6270524 0.7879016
## GradientBoosting 0.4241430 0.4964715 0.5480393 0.5513971 0.6045590 0.6651260
## Cubist 0.2944683 0.4987805 0.5567139 0.5542839 0.6205018 0.7291482
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
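The resampling distributions can also be compared visually with the bwplot method for resamples objects (a sketch):
# Box-and-whisker plot of resampled RMSE for the four tree-based models
bwplot(optimalTree, metric = "RMSE")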
The NA values for RMSE and MAE in the output below are a symptom of a problem rather than a property of the test set: predict() was called without newdata, so each model returned its 132 training-set predictions, which do not line up with the 44 test-set outcomes in testY. The Rsquared values shown are therefore not a reliable basis for model selection (a corrected call is sketched after the output). Given the resampling results, and the small held-out set of only 44 observations, I will stick with gradient boosting as the best of the four tree-based models.
# NOTE: predict() is called here without newdata, so these are training-set
# predictions and do not align with the 44 test-set outcomes in testY
rbind(
"SingleTree" = postResample(pred = predict(rpartMod), obs = testY),
"RandomForest" = postResample(pred = predict(RF_Mod), obs = testY),
"GradientBoosting" = postResample(pred = predict(gbmMod), obs = testY),
"Cubist" = postResample(pred = predict(CubistMod), obs = testY)
)
## RMSE Rsquared MAE
## SingleTree NA 0.02188217 NA
## RandomForest NA 0.02357147 NA
## GradientBoosting NA 0.02578373 NA
## Cubist NA 0.03007913 NA
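A corrected test-set comparison would pass the held-out predictors explicitly; a sketch (not re-run here, so no output is shown):
# Evaluate each model on the 44 held-out observations
rbind(
  "SingleTree"       = postResample(pred = predict(rpartMod,  newdata = testX), obs = testY),
  "RandomForest"     = postResample(pred = predict(RF_Mod,    newdata = testX), obs = testY),
  "GradientBoosting" = postResample(pred = predict(gbmMod,    newdata = testX), obs = testY),
  "Cubist"           = postResample(pred = predict(CubistMod, newdata = testX), obs = testY)
)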
Let’s look one more time at the results of the trained GBM model.
plot(gbmMod)
Which predictors are most important in the optimal tree-based regression model?
Do either the biological or process variables dominate the list?
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The top 10 predictors from the gradient boosting model are listed below in descending order of Overall importance. Eight of the top 10 are manufacturing process predictors and only two are biological material predictors, so the manufacturing process variables clearly dominate the list (a quick tally is sketched after the output).
varImp(gbmMod)
## gbm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 16.838
## ManufacturingProcess06 12.351
## ManufacturingProcess17 11.963
## ManufacturingProcess13 11.455
## ManufacturingProcess09 10.906
## ManufacturingProcess31 9.533
## BiologicalMaterial03 7.867
## ManufacturingProcess11 6.178
## ManufacturingProcess21 5.848
## ManufacturingProcess27 5.478
## ManufacturingProcess43 5.462
## ManufacturingProcess20 5.159
## ManufacturingProcess04 5.050
## BiologicalMaterial02 4.884
## BiologicalMaterial11 4.772
## ManufacturingProcess05 4.553
## ManufacturingProcess39 4.502
## ManufacturingProcess24 4.354
## BiologicalMaterial09 4.216
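A quick tally of predictor types among the top 10 GBM importances backs up the count given above (a sketch):
# Count Biological vs. Manufacturing predictors among the top 10
gbm_imp <- varImp(gbmMod)$importance
top10   <- rownames(gbm_imp)[order(-gbm_imp$Overall)][1:10]
table(ifelse(grepl("^Biological", top10), "BiologicalMaterial", "ManufacturingProcess"))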
The top 10 predictors from the optimal linear model (Partial Least Squares) were:
ManufacturingProcess32 100.00000
ManufacturingProcess13 84.69518
ManufacturingProcess17 84.27932
ManufacturingProcess36 83.77365
ManufacturingProcess09 79.59390
BiologicalMaterial02 56.37639
ManufacturingProcess06 54.10142
ManufacturingProcess12 53.66660
BiologicalMaterial06 53.17872
ManufacturingProcess11 52.92723
and from the optimal non-linear model (Neural Network) they were:
ManufacturingProcess23 100.00000
ManufacturingProcess32 96.42698
ManufacturingProcess34 92.81134
ManufacturingProcess33 84.75033
BiologicalMaterial09 82.03159
ManufacturingProcess01 80.48583
ManufacturingProcess03 80.03863
BiologicalMaterial11 78.72629
ManufacturingProcess28 76.02541
ManufacturingProcess45 73.55811
ManufacturingProcess32 was the top predictor in both the linear and the tree-based models and was second most important in the non-linear model. Overall there is more overlap between the linear and tree-based models, with predictors 32, 06, 17, 13, 09 and 11 appearing in the top 10 of both; however, importance drops off very quickly in the tree-based model, declines more gradually in the linear model, and declines most slowly in the non-linear model.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
The plot of the optimal single tree (below) is revealing, keeping in mind that the percentages printed in the nodes are shares of the training samples falling into each node, not shares of yield. The root split is on ManufacturingProcess32, covering 100% of the training samples, and the two resulting branches, holding roughly 51% and 49% of the samples, split next on BiologicalMaterial12 and ManufacturingProcess06 respectively. On the BiologicalMaterial12 side, the subsequent splits use only manufacturing predictors (ManufacturingProcess29, 17 and 27), even though ManufacturingProcess29 is not listed among the top 20 most important predictors, and BiologicalMaterial12 reappears deeper in the tree at a higher threshold after ManufacturingProcess17. On the ManufacturingProcess06 side there are no biological predictors at all, only ManufacturingProcesses 17, 23 and 02. It is also interesting that ManufacturingProcess17 appears at the next level on both sides of the root split, but with different thresholds.

At the terminal nodes, the leaf reached through BiologicalMaterial12 contains about
18% of the training samples, while the leaves reached through
ManufacturingProcesses 27, 39, 02, 23 and 17 contain about
22%, 6%, 18%, 11% and 25% of the samples, respectively. These
percentages describe how many samples end up in each leaf rather than how much
each predictor contributes to yield, so it is not surprising that they do not
line up with the varImp() rankings used above to rank the predictors in the
tree-based model. A small summary of the terminal nodes is sketched after the plot.
library(rpart.plot)

rpartTree <- rpart(Yield ~ ., method = "anova", data = chem_train)
rpart.plot(rpartTree)
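To back up the reading of the node percentages as sample shares, the terminal nodes can be summarized directly from the rpart frame (a sketch; Yield here is on the preprocessed, centered and scaled scale):
# Mean (preprocessed) Yield and share of training samples in each terminal node
leaves <- rpartTree$frame[rpartTree$frame$var == "<leaf>", c("n", "yval")]
leaves$pct_samples <- round(100 * leaves$n / sum(leaves$n), 1)
leaves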