Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative
predictors (V6 – V10)?
The random forest model did not make meaningful use of the uninformative
predictors (V6 – V10): their importance scores in the Overall
column below are near zero or negative.
library(randomForest)
library(caret)
set.seed(101)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 58.0853885
## V2 47.2605089
## V3 9.0331887
## V4 52.1388218
## V5 23.0427026
## V6 3.6874906
## V7 0.4329092
## V8 -1.0585958
## V9 -0.6603554
## V10 -1.5066634
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9353036
Fit another random forest model to these data. Did the importance
score for V1 change?
Yes. The importance score for V1 dropped from about 58.1 to about 30.0,
while the added predictor duplicate1 received a comparable score (about 29.3),
so the importance of V1 is effectively split between the two correlated predictors.
set.seed(102)
model2 <- randomForest(y ~ ., data = simulated2,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 29.9870844
## V2 47.8046607
## V3 11.1389833
## V4 52.3435183
## V5 20.9576967
## V6 2.6738456
## V7 0.4829991
## V8 -4.3763485
## V9 -2.3718826
## V10 0.1649363
## duplicate1 29.3057231
What happens when you add another predictor that is also highly
correlated with V1?
simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
cor(simulated3$duplicate2, simulated3$V1)
## [1] 0.9373468
The importance of V1 diminishes even further, to about 26.3, when a second
highly correlated predictor is added, with duplicate1 and duplicate2 also
sharing the credit; a side-by-side comparison of the three fits is sketched after the output below.
set.seed(103)
model3 <- randomForest(y ~ ., data = simulated3,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 26.32386207
## V2 50.21462836
## V3 8.80903311
## V4 53.56697296
## V5 24.88382742
## V6 3.98633526
## V7 0.07252339
## V8 -0.57023026
## V9 0.29724693
## V10 -0.07867575
## duplicate1 24.78568683
## duplicate2 16.36582229
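To make the dilution of V1's importance easier to see, the three importance data frames computed above can be lined up in one table (a minimal sketch; models without the duplicate predictors show NA for those rows):
# Compare importance scores across the three random forest fits
imp_all <- data.frame(
  NoDuplicate   = rfImp1[rownames(rfImp3), "Overall"],
  OneDuplicate  = rfImp2[rownames(rfImp3), "Overall"],
  TwoDuplicates = rfImp3[rownames(rfImp3), "Overall"],
  row.names     = rownames(rfImp3)
)
imp_all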
Use the cforest function in the party
package to fit a random forest model using conditional inference trees.
The party package function varimp can
calculate predictor importance. The conditional argument of
that function toggles between the traditional importance measure and the
modified version described in Strobl et al. (2007). Do these importances
show the same pattern as the traditional random forest model?
The overall pattern matches the traditional random forest: under both settings
V1, V2 and V4 receive the largest importances and the uninformative predictors
V6 – V10 are essentially zero. The magnitudes differ considerably, however: the
conditional importances (conditional = TRUE, the adjustment of Strobl et al.)
are much smaller than the unconditional ones (conditional = FALSE), even though
the relative ordering of informative versus uninformative predictors is similar.
library(party)

set.seed(104)
party_model <- cforest(y ~ ., data = simulated,
control = cforest_control(ntree = 50)
)
varimp(party_model, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 2.500366690 4.147284845 0.117796542 5.562326154 0.584068656 0.000427757
## V7 V8 V9 V10
## 0.059419902 0.010785555 0.018520247 0.006618613
varimp(party_model, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 9.62224831 7.58082173 0.13620059 9.32598070 2.46159687 -0.02286604
## V7 V8 V9 V10
## 0.12888545 -0.05064172 0.17145243 -0.04055390
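For a direct side-by-side view, the two importance vectors can be collected into one table (a sketch; varimp is based on random permutations, so re-running it gives slightly different numbers than those printed above):
# Side-by-side comparison of conditional vs. unconditional cforest importances
cond   <- varimp(party_model, conditional = TRUE)
uncond <- varimp(party_model, conditional = FALSE)
round(data.frame(Conditional = cond, Unconditional = uncond[names(cond)]), 3)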
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
First, a boosted tree, also known as a gradient boosting machine (GBM), is tuned over a grid of parameters.
gbmGrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
n.trees = seq(100, 500, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 3)
set.seed(105)
gbmTune <- train(y ~ ., data = simulated,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 3.995805 0.5776166 3.276980
## 0.01 1 150 3.695844 0.6309434 3.036789
## 0.01 1 200 3.455314 0.6644869 2.839095
## 0.01 1 250 3.255560 0.6899540 2.675536
## 0.01 1 300 3.089233 0.7132649 2.538255
## 0.01 1 350 2.944228 0.7328316 2.417203
## 0.01 1 400 2.818036 0.7494924 2.308735
## 0.01 1 450 2.710302 0.7630493 2.215265
## 0.01 1 500 2.615433 0.7757461 2.132033
## 0.01 3 100 3.458882 0.6929036 2.837407
## 0.01 3 150 3.067551 0.7374512 2.516892
## 0.01 3 200 2.786976 0.7677814 2.286402
## 0.01 3 250 2.573320 0.7900604 2.106486
## 0.01 3 300 2.405242 0.8088867 1.962927
## 0.01 3 350 2.277189 0.8225163 1.855329
## 0.01 3 400 2.175040 0.8327997 1.769704
## 0.01 3 450 2.098931 0.8402706 1.705148
## 0.01 3 500 2.038406 0.8460102 1.652771
## 0.01 5 100 3.265013 0.7259168 2.671374
## 0.01 5 150 2.869117 0.7624994 2.349975
## 0.01 5 200 2.597553 0.7877082 2.127605
## 0.01 5 250 2.407750 0.8057135 1.965158
## 0.01 5 300 2.267417 0.8196619 1.843744
## 0.01 5 350 2.162169 0.8304044 1.753772
## 0.01 5 400 2.085212 0.8381233 1.687814
## 0.01 5 450 2.029971 0.8435372 1.639708
## 0.01 5 500 1.987071 0.8475863 1.603178
## 0.01 7 100 3.192407 0.7389184 2.614071
## 0.01 7 150 2.797255 0.7702049 2.294571
## 0.01 7 200 2.535636 0.7919994 2.076689
## 0.01 7 250 2.355060 0.8095292 1.920288
## 0.01 7 300 2.229086 0.8214984 1.809667
## 0.01 7 350 2.140799 0.8298060 1.733536
## 0.01 7 400 2.076318 0.8362659 1.677716
## 0.01 7 450 2.030884 0.8407237 1.638867
## 0.01 7 500 1.996613 0.8442391 1.608230
## 0.10 1 100 2.101089 0.8300680 1.678809
## 0.10 1 150 1.926538 0.8476812 1.533507
## 0.10 1 200 1.883671 0.8511487 1.498819
## 0.10 1 250 1.862376 0.8530836 1.477936
## 0.10 1 300 1.861325 0.8526384 1.477800
## 0.10 1 350 1.860576 0.8524635 1.476150
## 0.10 1 400 1.865615 0.8514076 1.482673
## 0.10 1 450 1.870605 0.8506402 1.484873
## 0.10 1 500 1.872349 0.8501779 1.487219
## 0.10 3 100 1.901682 0.8502609 1.528354
## 0.10 3 150 1.876583 0.8527489 1.507168
## 0.10 3 200 1.865754 0.8539075 1.495764
## 0.10 3 250 1.866369 0.8535496 1.495927
## 0.10 3 300 1.865559 0.8534300 1.494770
## 0.10 3 350 1.865188 0.8533620 1.494106
## 0.10 3 400 1.866862 0.8530133 1.495071
## 0.10 3 450 1.868375 0.8527078 1.496683
## 0.10 3 500 1.868415 0.8526416 1.496915
## 0.10 5 100 1.926286 0.8452157 1.553391
## 0.10 5 150 1.911456 0.8465165 1.542253
## 0.10 5 200 1.909069 0.8464344 1.540653
## 0.10 5 250 1.907747 0.8465461 1.539543
## 0.10 5 300 1.907215 0.8465451 1.538830
## 0.10 5 350 1.907541 0.8464275 1.538986
## 0.10 5 400 1.907407 0.8464421 1.538735
## 0.10 5 450 1.907413 0.8464240 1.538811
## 0.10 5 500 1.907375 0.8464273 1.538839
## 0.10 7 100 1.971915 0.8375622 1.579696
## 0.10 7 150 1.962138 0.8386602 1.574179
## 0.10 7 200 1.959228 0.8389521 1.572481
## 0.10 7 250 1.958169 0.8390927 1.571734
## 0.10 7 300 1.957822 0.8391242 1.571549
## 0.10 7 350 1.957700 0.8391249 1.571607
## 0.10 7 400 1.957695 0.8391237 1.571621
## 0.10 7 450 1.957682 0.8391245 1.571635
## 0.10 7 500 1.957647 0.8391276 1.571623
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 350, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 3.
The best model shows n.trees = 350, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 3.
gbmTune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 42 350 1 0.1 3
The variable importance in the final model for GBM shows a similar
pattern to that in the random forest model, with the first five
predictors showing the most importance and predictors
V6 - V10 showing comparatively much smaller importance.
gbmImp <- varImp(gbmTune$finalModel, scale = FALSE)
gbmImp
## Overall
## V1 4454.0170
## V2 4062.7178
## V3 1658.4146
## V4 4801.0918
## V5 1833.8697
## V6 343.9860
## V7 236.3253
## V8 103.4074
## V9 104.0551
## V10 122.7416
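The same ranking can also be viewed graphically through caret's wrapper for the tuned object (a sketch):
# Dotplot of GBM variable importance from the caret train object
plot(varImp(gbmTune))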
Next, a Cubist model is tuned over the number of committees and neighbors.
CubistGrid <- expand.grid(committees = seq(1, 10, by = 1),
neighbors = seq(1, 9, by=1)
)
set.seed(106)
CubistTune <- train(y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = CubistGrid,
verbose = FALSE)
CubistTune
## Cubist
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 1 2.711925 0.7336853 2.207277
## 1 2 2.326001 0.7897266 1.872233
## 1 3 2.228632 0.8071637 1.811793
## 1 4 2.231424 0.8085515 1.811411
## 1 5 2.248003 0.8073025 1.836636
## 1 6 2.223493 0.8134148 1.820904
## 1 7 2.229699 0.8130288 1.825180
## 1 8 2.215426 0.8163242 1.814618
## 1 9 2.221566 0.8158777 1.812802
## 2 1 2.340070 0.7964211 1.885731
## 2 2 2.175029 0.8148730 1.683995
## 2 3 2.129579 0.8191719 1.654108
## 2 4 2.116291 0.8212608 1.652221
## 2 5 2.118139 0.8216806 1.672199
## 2 6 2.087094 0.8278590 1.641830
## 2 7 2.077668 0.8299909 1.633237
## 2 8 2.066039 0.8327102 1.621080
## 2 9 2.070462 0.8333704 1.623758
## 3 1 2.294832 0.8011217 1.883218
## 3 2 2.098351 0.8250880 1.673916
## 3 3 2.045167 0.8315533 1.620922
## 3 4 2.034948 0.8340246 1.617090
## 3 5 2.043261 0.8335112 1.632887
## 3 6 2.013996 0.8395177 1.606264
## 3 7 2.013288 0.8401834 1.605875
## 3 8 1.999939 0.8431075 1.597307
## 3 9 2.001361 0.8440620 1.597319
## 4 1 2.258175 0.8074714 1.822501
## 4 2 2.105873 0.8241487 1.634083
## 4 3 2.061229 0.8283244 1.608464
## 4 4 2.052213 0.8302035 1.605971
## 4 5 2.056035 0.8300225 1.626713
## 4 6 2.019646 0.8364804 1.594174
## 4 7 2.008063 0.8383537 1.587219
## 4 8 1.995898 0.8410126 1.573795
## 4 9 2.000587 0.8414608 1.574768
## 5 1 2.210676 0.8128564 1.812824
## 5 2 2.045633 0.8320052 1.624804
## 5 3 2.001494 0.8369599 1.589742
## 5 4 1.993808 0.8390432 1.587721
## 5 5 1.999740 0.8386806 1.604159
## 5 6 1.967075 0.8448665 1.573362
## 5 7 1.962752 0.8456880 1.565428
## 5 8 1.950617 0.8482419 1.555006
## 5 9 1.954331 0.8487846 1.555836
## 6 1 2.198884 0.8157285 1.793028
## 6 2 2.069039 0.8285744 1.621563
## 6 3 2.028433 0.8327044 1.603352
## 6 4 2.018928 0.8349532 1.597658
## 6 5 2.022005 0.8350214 1.616614
## 6 6 1.984937 0.8417400 1.583331
## 6 7 1.975503 0.8433117 1.572371
## 6 8 1.964102 0.8458160 1.557703
## 6 9 1.970063 0.8460593 1.558274
## 7 1 2.191346 0.8171894 1.798636
## 7 2 2.042870 0.8334824 1.625910
## 7 3 2.000818 0.8379953 1.598749
## 7 4 1.994664 0.8401334 1.591285
## 7 5 2.000308 0.8398234 1.606783
## 7 6 1.966283 0.8463090 1.571500
## 7 7 1.961806 0.8472306 1.559848
## 7 8 1.949725 0.8498320 1.546689
## 7 9 1.954954 0.8501811 1.545975
## 8 1 2.153382 0.8231194 1.766590
## 8 2 2.027899 0.8363316 1.612666
## 8 3 1.997273 0.8390926 1.590581
## 8 4 1.993695 0.8408109 1.583381
## 8 5 1.994718 0.8412745 1.597227
## 8 6 1.957605 0.8481896 1.561220
## 8 7 1.950050 0.8495089 1.544936
## 8 8 1.939104 0.8518786 1.529702
## 8 9 1.945596 0.8520786 1.530434
## 9 1 2.167254 0.8211178 1.775970
## 9 2 2.018143 0.8374644 1.606543
## 9 3 1.981078 0.8411872 1.586101
## 9 4 1.977655 0.8430467 1.581607
## 9 5 1.979508 0.8433312 1.595858
## 9 6 1.941694 0.8503175 1.557291
## 9 7 1.935124 0.8513861 1.539788
## 9 8 1.923756 0.8538002 1.523752
## 9 9 1.929766 0.8540458 1.523193
## 10 1 2.162744 0.8212122 1.778427
## 10 2 2.034639 0.8342419 1.615806
## 10 3 2.002803 0.8375004 1.596337
## 10 4 1.999543 0.8392458 1.585589
## 10 5 1.999465 0.8397810 1.598078
## 10 6 1.961211 0.8468361 1.563843
## 10 7 1.953291 0.8481327 1.549044
## 10 8 1.942347 0.8505878 1.535451
## 10 9 1.948663 0.8508138 1.536372
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 9 and neighbors = 8.
The best model shows committees = 9 and neighbors = 8.
CubistTune$bestTune
## committees neighbors
## 80 9 8
The variable importance in the final Cubist model again shows a
similar pattern to that in the random forest model and the GBM model,
with the first five predictors showing the most importance and
predictors V6 - V10 showing comparatively much smaller or
no importance.
CubistImp <- varImp(CubistTune$finalModel, scale = FALSE)
CubistImp
## Overall
## V1 68.0
## V3 37.0
## V2 54.0
## V4 50.0
## V5 48.5
## V6 13.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
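Since the question is whether the same pattern occurs, one way to check is to line up the importance rankings of V1 – V10 from the three models fit to the original simulated data (a sketch using the rfImp1, gbmImp and CubistImp objects computed above; rank 1 is most important):
# Compare importance ranks across random forest, GBM and Cubist
preds <- paste0("V", 1:10)
rank_tbl <- data.frame(
  RF     = rank(-rfImp1[preds, "Overall"]),
  GBM    = rank(-gbmImp[preds, "Overall"]),
  Cubist = rank(-CubistImp[preds, "Overall"]),
  row.names = preds
)
rank_tbl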
Use a simulation to show tree bias with different granularities.
A data set is created in which variable \(x1\) takes only two distinct values (1 and 2) and is the true driver of the outcome \(y\), while variable \(x2\) is an independent, continuous ("granular") random variable that is unrelated to \(y\). Consistent with the text's statement that selection bias favors predictors with more distinct values, the simulation below ranks \(x2\) as the most important predictor in 176 of 200 replications, roughly 7:1 over \(x1\), even though \(x1\) is the informative predictor.
library(rpart)   # recursive partitioning trees
library(dplyr)   # provides the pipe and slice_max()
library(tibble)

set.seed(107)
vars <- tibble()
for (i in 1:200){
# generate a data set for the simulation: x1 has only two distinct values and
# drives y, while x2 is independent noise with many distinct values
sim <- tibble(x1 = rep(c(1,2), times = 100))
sim$x2 <- rnorm(200, mean = 5, sd = 2)
sim$y <- sim$x1 + rnorm(200, mean = 0, sd = 1)
sim.mod <- rpart(y ~ ., data = sim)
# keep only the single most important predictor (slice_max defaults to n = 1)
top10 <- varImp(sim.mod) %>%
slice_max(order_by = Overall) %>%
row.names()
vars <- rbind(vars, top10)
}
table(vars)
## vars
## x1 x2
## 24 176
The variable importance from the final iteration's model confirms the pattern:
the uninformative but more granular x2 receives a higher
importance score than x1, illustrating the selection bias toward
predictors with many distinct values. A cross-check that tallies the
root-split variable is sketched after the output below.
varImp(sim.mod)
## Overall
## x1 0.2379800
## x2 0.6724913
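As a cross-check (a sketch that is not part of the original analysis; the seed 108 is arbitrary), the variable chosen for the very first split can be tallied across repeated simulations in the same way:
# Count which predictor is chosen at the root split across 200 simulations
set.seed(108)
root_vars <- replicate(200, {
  sim <- tibble(x1 = rep(c(1, 2), times = 100),
                x2 = rnorm(200, mean = 5, sd = 2))
  sim$y <- sim$x1 + rnorm(200)
  as.character(rpart(y ~ ., data = sim)$frame$var[1])  # root split variable
})
table(root_vars)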
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction is the proportion of the training data randomly sampled to build each tree, and the learning rate (shrinkage) is the fraction of each new tree's predictions added to the running model. With both set to 0.9, every tree sees nearly the same data and each tree contributes a large step, so the few strongest predictors are found early and reinforced repeatedly; importance therefore concentrates on just the first few predictors, and the model is more prone to overfitting. With both set to 0.1, each tree is built on a different 10% sample and contributes only a small step, so many more, more varied trees are needed and importance is spread across more predictors. According to the text, "Friedman suggests using a bagging fraction of around 0.5." (p. 206) A rough sketch of the two extreme settings follows.
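For reference, here is a rough sketch of how the two extreme settings could be reproduced on the solubility data with the gbm package. This is an assumption about the setup, not the book's code for Fig. 8.24; the interaction depth and n.trees = 100 are arbitrary choices to keep the sketch small.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

# Left-hand panel of Fig. 8.24: small bagging fraction and learning rate
gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
# Right-hand panel: both parameters set to 0.9
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence: spread out for the low settings, concentrated for the high
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)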
Which model do you think would be more predictive of other samples?
Given the answer to 8.3.a above, I think that the model on the left would be more predictive of other samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree split on more predictors, so importance would be spread across more of them; the slope of the predictor importance curve would therefore flatten for both models in Fig. 8.24.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
First acquire the data.
data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter.
Transform (Box-Cox), impute missing values with k-nearest neighbors, and center and scale the data; then check for near-zero-variance predictors and remove them. Only one predictor is removed.
set.seed(404)
preProc <- preProcess(chem_mfg, method = c("BoxCox","knnImpute","center","scale"))
chem_pred <- predict(preProc,chem_mfg)
# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_pred)
# remove the above set from the data and store the result in a dataframe
chem_pred <- chem_pred[,-lowFreqPredictors]
# randomly assign 75% of rows to training (coded 0) and 25% to test (coded 1)
trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_pred)),
rep(1, 0.25 * nrow(chem_pred))))
chem_train <- chem_pred[trainingRows == 0,]
chem_test <- chem_pred[trainingRows == 1,]
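As an aside, the same 75/25 split could be done with caret's createDataPartition, which stratifies on the outcome. This is a sketch of a hypothetical alternative; the analysis below keeps the manual split above.
# Stratified 75/25 split on Yield (alternative to the manual sampling above)
inTrain        <- createDataPartition(chem_pred$Yield, p = 0.75, list = FALSE)
chem_train_alt <- chem_pred[inTrain, ]
chem_test_alt  <- chem_pred[-inTrain, ]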
With the data split we can train each of the model types from the chapter: single tree, random forest, gradient boosting and Cubist. Let’s train all 4 types in that order.
The single tree (CART) model tunes over the complexity parameter cp, and the final model selected cp = 0.08378936, the value with the lowest resampled RMSE.
trainX <- chem_train %>% select(-Yield)
trainY <- chem_train$Yield
testX <- chem_test %>% select(-Yield)
testY <- chem_test$Yield
set.seed(405)
rpartMod <- train(x = trainX,
y = trainY,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth = 10L)
)
rpartMod
## CART
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01354873 0.8049641 0.3665461 0.6404770
## 0.01765407 0.8003728 0.3676270 0.6377386
## 0.01804999 0.7996917 0.3689222 0.6376828
## 0.01862487 0.7985018 0.3702319 0.6349426
## 0.03301136 0.7958219 0.3658563 0.6323289
## 0.04135500 0.7849224 0.3694403 0.6277843
## 0.05329571 0.7762629 0.3743849 0.6210082
## 0.07829230 0.7766703 0.3670912 0.6215433
## 0.08378936 0.7700079 0.3734328 0.6154686
## 0.41761517 0.8477627 0.3291581 0.7054969
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.08378936.
plot(rpartMod)
The random forest model tunes over mtry, the number of predictors randomly sampled as split candidates at each node, and it selected the model with the lowest resampled RMSE: mtry = 26 with RMSE = 0.6282372.
set.seed(406)
RF_Mod <- train(x = trainX,
y = trainY,
method = "rf",
tuneLength = 10L)
RF_Mod
## Random Forest
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6748048 0.5495652 0.5557324
## 8 0.6417031 0.5698311 0.5269982
## 14 0.6327102 0.5763653 0.5178999
## 20 0.6290145 0.5773244 0.5134953
## 26 0.6282372 0.5754574 0.5107357
## 32 0.6286607 0.5720628 0.5116401
## 38 0.6312262 0.5663146 0.5118749
## 44 0.6325253 0.5643301 0.5112899
## 50 0.6371669 0.5573378 0.5134995
## 56 0.6394104 0.5537738 0.5142292
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
plot(RF_Mod)
The gradient boosting model tunes over the number of trees, interaction depth, shrinkage rate and minimum observations in the nodes, and it selected n.trees = 500, interaction.depth = 11, shrinkage = 0.01 and n.minobsinnode = 5, where RMSE = 0.6192271.
set.seed(407)
gbmGrid2 <- expand.grid(interaction.depth = seq(1,15, by = 5),
n.trees = seq(100, 500, by = 100),
shrinkage = c(0.01, 0.1, 0.5),
n.minobsinnode = c(5, 10, 15)
)
gbmMod <- train(x = trainX,
y = trainY,
method = "gbm",
tuneGrid = gbmGrid2,
verbose = FALSE)
gbmMod
## Stochastic Gradient Boosting
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 0.01 1 5 100 0.7370571 0.4651511
## 0.01 1 5 200 0.6817666 0.4855145
## 0.01 1 5 300 0.6620301 0.4972607
## 0.01 1 5 400 0.6521094 0.5056299
## 0.01 1 5 500 0.6460158 0.5129683
## 0.01 1 10 100 0.7373940 0.4631862
## 0.01 1 10 200 0.6802126 0.4877830
## 0.01 1 10 300 0.6587881 0.5016686
## 0.01 1 10 400 0.6492666 0.5095714
## 0.01 1 10 500 0.6444236 0.5148808
## 0.01 1 15 100 0.7342956 0.4677778
## 0.01 1 15 200 0.6792548 0.4901899
## 0.01 1 15 300 0.6582292 0.5023378
## 0.01 1 15 400 0.6494848 0.5096260
## 0.01 1 15 500 0.6449915 0.5141947
## 0.01 6 5 100 0.6882403 0.5114744
## 0.01 6 5 200 0.6423044 0.5257229
## 0.01 6 5 300 0.6297544 0.5360787
## 0.01 6 5 400 0.6255573 0.5416497
## 0.01 6 5 500 0.6232940 0.5451644
## 0.01 6 10 100 0.6935400 0.5069938
## 0.01 6 10 200 0.6440425 0.5258504
## 0.01 6 10 300 0.6314331 0.5341283
## 0.01 6 10 400 0.6274953 0.5383298
## 0.01 6 10 500 0.6239103 0.5436938
## 0.01 6 15 100 0.7062681 0.4939890
## 0.01 6 15 200 0.6547579 0.5125137
## 0.01 6 15 300 0.6401304 0.5222972
## 0.01 6 15 400 0.6346034 0.5287070
## 0.01 6 15 500 0.6324812 0.5321224
## 0.01 11 5 100 0.6818832 0.5258197
## 0.01 11 5 200 0.6350153 0.5385125
## 0.01 11 5 300 0.6236665 0.5461688
## 0.01 11 5 400 0.6211743 0.5488601
## 0.01 11 5 500 0.6192271 0.5513971
## 0.01 11 10 100 0.6929863 0.5092920
## 0.01 11 10 200 0.6440467 0.5254312
## 0.01 11 10 300 0.6314916 0.5343700
## 0.01 11 10 400 0.6275585 0.5381355
## 0.01 11 10 500 0.6243187 0.5431088
## 0.01 11 15 100 0.7015178 0.5056586
## 0.01 11 15 200 0.6516962 0.5194749
## 0.01 11 15 300 0.6377504 0.5270697
## 0.01 11 15 400 0.6331312 0.5319376
## 0.01 11 15 500 0.6307983 0.5352547
## 0.10 1 5 100 0.6535684 0.5066982
## 0.10 1 5 200 0.6542160 0.5111406
## 0.10 1 5 300 0.6534049 0.5146111
## 0.10 1 5 400 0.6534964 0.5161386
## 0.10 1 5 500 0.6526969 0.5186241
## 0.10 1 10 100 0.6495670 0.5097828
## 0.10 1 10 200 0.6510404 0.5126691
## 0.10 1 10 300 0.6535248 0.5122275
## 0.10 1 10 400 0.6535588 0.5145005
## 0.10 1 10 500 0.6537423 0.5152774
## 0.10 1 15 100 0.6479456 0.5134762
## 0.10 1 15 200 0.6543280 0.5107582
## 0.10 1 15 300 0.6604964 0.5060703
## 0.10 1 15 400 0.6618215 0.5068907
## 0.10 1 15 500 0.6643671 0.5059078
## 0.10 6 5 100 0.6317202 0.5365384
## 0.10 6 5 200 0.6293862 0.5403359
## 0.10 6 5 300 0.6291360 0.5409466
## 0.10 6 5 400 0.6289738 0.5412793
## 0.10 6 5 500 0.6289158 0.5413583
## 0.10 6 10 100 0.6262132 0.5417056
## 0.10 6 10 200 0.6265911 0.5423066
## 0.10 6 10 300 0.6263218 0.5433232
## 0.10 6 10 400 0.6261116 0.5440541
## 0.10 6 10 500 0.6260856 0.5442395
## 0.10 6 15 100 0.6446185 0.5220179
## 0.10 6 15 200 0.6394123 0.5315022
## 0.10 6 15 300 0.6405357 0.5308435
## 0.10 6 15 400 0.6399070 0.5322242
## 0.10 6 15 500 0.6402852 0.5318261
## 0.10 11 5 100 0.6327798 0.5367099
## 0.10 11 5 200 0.6325548 0.5375226
## 0.10 11 5 300 0.6332232 0.5366667
## 0.10 11 5 400 0.6334804 0.5363242
## 0.10 11 5 500 0.6336118 0.5361509
## 0.10 11 10 100 0.6376831 0.5283100
## 0.10 11 10 200 0.6360064 0.5313663
## 0.10 11 10 300 0.6353634 0.5324191
## 0.10 11 10 400 0.6354255 0.5324948
## 0.10 11 10 500 0.6352444 0.5328152
## 0.10 11 15 100 0.6393094 0.5245047
## 0.10 11 15 200 0.6372500 0.5308381
## 0.10 11 15 300 0.6367424 0.5330525
## 0.10 11 15 400 0.6377779 0.5320433
## 0.10 11 15 500 0.6382583 0.5318266
## 0.50 1 5 100 0.7356817 0.4343120
## 0.50 1 5 200 0.7393572 0.4366845
## 0.50 1 5 300 0.7399861 0.4375522
## 0.50 1 5 400 0.7399606 0.4380613
## 0.50 1 5 500 0.7402697 0.4378759
## 0.50 1 10 100 0.7157237 0.4578811
## 0.50 1 10 200 0.7124490 0.4675811
## 0.50 1 10 300 0.7114750 0.4700329
## 0.50 1 10 400 0.7112105 0.4717345
## 0.50 1 10 500 0.7117262 0.4713326
## 0.50 1 15 100 0.7372281 0.4286385
## 0.50 1 15 200 0.7401058 0.4307126
## 0.50 1 15 300 0.7407246 0.4319868
## 0.50 1 15 400 0.7417462 0.4316124
## 0.50 1 15 500 0.7424766 0.4316444
## 0.50 6 5 100 0.7531406 0.4176720
## 0.50 6 5 200 0.7531123 0.4179072
## 0.50 6 5 300 0.7531181 0.4179096
## 0.50 6 5 400 0.7531183 0.4179093
## 0.50 6 5 500 0.7531183 0.4179093
## 0.50 6 10 100 0.7235516 0.4395822
## 0.50 6 10 200 0.7239805 0.4391569
## 0.50 6 10 300 0.7240493 0.4390820
## 0.50 6 10 400 0.7240439 0.4390852
## 0.50 6 10 500 0.7240436 0.4390841
## 0.50 6 15 100 0.7350882 0.4360436
## 0.50 6 15 200 0.7345366 0.4372262
## 0.50 6 15 300 0.7351651 0.4366227
## 0.50 6 15 400 0.7352929 0.4365388
## 0.50 6 15 500 0.7353305 0.4365297
## 0.50 11 5 100 0.7699983 0.3963567
## 0.50 11 5 200 0.7700935 0.3962006
## 0.50 11 5 300 0.7700950 0.3962001
## 0.50 11 5 400 0.7700951 0.3962000
## 0.50 11 5 500 0.7700951 0.3962000
## 0.50 11 10 100 0.7325500 0.4386459
## 0.50 11 10 200 0.7326953 0.4389810
## 0.50 11 10 300 0.7327284 0.4389848
## 0.50 11 10 400 0.7327735 0.4389360
## 0.50 11 10 500 0.7327752 0.4389344
## 0.50 11 15 100 0.7399964 0.4175177
## 0.50 11 15 200 0.7400771 0.4187559
## 0.50 11 15 300 0.7400960 0.4185902
## 0.50 11 15 400 0.7400815 0.4186186
## 0.50 11 15 500 0.7400508 0.4186448
## MAE
## 0.6109017
## 0.5594687
## 0.5371287
## 0.5230742
## 0.5140743
## 0.6105525
## 0.5579403
## 0.5341115
## 0.5210249
## 0.5138463
## 0.6084873
## 0.5568716
## 0.5335857
## 0.5212273
## 0.5138350
## 0.5677251
## 0.5191257
## 0.5007855
## 0.4935333
## 0.4887714
## 0.5740408
## 0.5228503
## 0.5053318
## 0.4977029
## 0.4923044
## 0.5842379
## 0.5331885
## 0.5141605
## 0.5063208
## 0.5016688
## 0.5612477
## 0.5116663
## 0.4947859
## 0.4885967
## 0.4848843
## 0.5714143
## 0.5210350
## 0.5026809
## 0.4951250
## 0.4904996
## 0.5807685
## 0.5303081
## 0.5119817
## 0.5041687
## 0.4995841
## 0.5102499
## 0.5064829
## 0.5037199
## 0.5044915
## 0.5037609
## 0.5092468
## 0.5087937
## 0.5120220
## 0.5121893
## 0.5113029
## 0.5091321
## 0.5118625
## 0.5167432
## 0.5182556
## 0.5205944
## 0.4933413
## 0.4899733
## 0.4898202
## 0.4896770
## 0.4896421
## 0.4890498
## 0.4889993
## 0.4885858
## 0.4884049
## 0.4883658
## 0.5056983
## 0.5028355
## 0.5035118
## 0.5034374
## 0.5038183
## 0.4945771
## 0.4926192
## 0.4927379
## 0.4928335
## 0.4928172
## 0.5025158
## 0.5001694
## 0.4995555
## 0.4993328
## 0.4990487
## 0.5054999
## 0.5036062
## 0.5043610
## 0.5051281
## 0.5057808
## 0.5764945
## 0.5797314
## 0.5806527
## 0.5807530
## 0.5811245
## 0.5630980
## 0.5597806
## 0.5593762
## 0.5594897
## 0.5599122
## 0.5845247
## 0.5892297
## 0.5910257
## 0.5923801
## 0.5926532
## 0.5938101
## 0.5938105
## 0.5938214
## 0.5938218
## 0.5938218
## 0.5658401
## 0.5663251
## 0.5664126
## 0.5664145
## 0.5664165
## 0.5827455
## 0.5820158
## 0.5826048
## 0.5827317
## 0.5827866
## 0.6007775
## 0.6008768
## 0.6008796
## 0.6008796
## 0.6008796
## 0.5750730
## 0.5752676
## 0.5752905
## 0.5753407
## 0.5753450
## 0.5783508
## 0.5778540
## 0.5780692
## 0.5780739
## 0.5780611
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 11, shrinkage = 0.01 and n.minobsinnode = 5.
plot(gbmMod)
The Cubist model tunes over the number of committees and the number of nearest neighbors used to adjust its predictions, and it selected committees = 20 and neighbors = 5, with RMSE = 0.6241594.
set.seed(408)
CubistMod <- train(x = trainX,
y = trainY,
method = "cubist")
CubistMod
## Cubist
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.9097795 0.3261565 0.6657959
## 1 5 0.8948935 0.3483801 0.6471097
## 1 9 0.8976531 0.3423711 0.6527329
## 10 0 0.6655149 0.5073520 0.5157291
## 10 5 0.6499137 0.5295174 0.5012022
## 10 9 0.6569764 0.5200360 0.5081046
## 20 0 0.6405293 0.5311136 0.4992244
## 20 5 0.6241594 0.5542839 0.4824151
## 20 9 0.6312454 0.5448120 0.4892632
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
plot(CubistMod)
Which tree-based regression model gives the optimal resampling and test set performance?
Each model was trained with "RMSE was used to select the optimal model using the smallest value," so we can compare the resampling distributions and pick the model with the consistently smallest RMSE. Gradient boosting has the smallest RMSE at the 1st quartile, mean, 3rd quartile and maximum; Cubist has a slightly lower median. Overall, gradient boosting shows the most consistently small RMSE values.
optimalTree <- resamples(list(SingleTree=rpartMod, RandomForest=RF_Mod, GradientBoosting=gbmMod, Cubist=CubistMod))
summary(optimalTree)
##
## Call:
## summary.resamples(object = optimalTree)
##
## Models: SingleTree, RandomForest, GradientBoosting, Cubist
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.4859186 0.5820133 0.6129884 0.6154686 0.6472591 0.7220190
## RandomForest 0.3836628 0.4853523 0.5150236 0.5107357 0.5341481 0.6699730
## GradientBoosting 0.3879447 0.4669835 0.4908541 0.4848843 0.5038450 0.5531224
## Cubist 0.3821551 0.4545234 0.4754450 0.4824151 0.5021344 0.6463159
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.5966413 0.7065075 0.7787488 0.7700079 0.8300308 0.9086841
## RandomForest 0.4737784 0.5972830 0.6433030 0.6282372 0.6616887 0.8285726
## GradientBoosting 0.5206890 0.5821177 0.6264100 0.6192271 0.6487761 0.7411194
## Cubist 0.4782438 0.5861085 0.6186330 0.6241594 0.6504664 0.8070523
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.1486274 0.3183827 0.3638089 0.3734328 0.4262367 0.5547497
## RandomForest 0.4068542 0.5251647 0.5672361 0.5754574 0.6270524 0.7879016
## GradientBoosting 0.4241430 0.4964715 0.5480393 0.5513971 0.6045590 0.6651260
## Cubist 0.2944683 0.4987805 0.5567139 0.5542839 0.6205018 0.7291482
## NA's
## SingleTree 0
## RandomForest 0
## GradientBoosting 0
## Cubist 0
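The resampling distributions can also be compared visually with the bwplot method for resamples objects (a sketch):
# Box-and-whisker plot of resampled RMSE for the four tree-based models
bwplot(optimalTree, metric = "RMSE")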
The NA values for RMSE and MAE in the output below are a symptom of a problem rather than a property of the test set: predict() was called without newdata, so each model returned its 132 training-set predictions, which do not line up with the 44 test-set outcomes in testY. The Rsquared values shown are therefore not a reliable basis for model selection (a corrected call is sketched after the output). Given the resampling results, and the small held-out set of only 44 observations, I will stick with gradient boosting as the best of the four tree-based models.
# NOTE: predict() is called here without newdata, so these are training-set
# predictions and do not align with the 44 test-set outcomes in testY
rbind(
"SingleTree" = postResample(pred = predict(rpartMod), obs = testY),
"RandomForest" = postResample(pred = predict(RF_Mod), obs = testY),
"GradientBoosting" = postResample(pred = predict(gbmMod), obs = testY),
"Cubist" = postResample(pred = predict(CubistMod), obs = testY)
)
## RMSE Rsquared MAE
## SingleTree NA 0.02188217 NA
## RandomForest NA 0.02357147 NA
## GradientBoosting NA 0.02578373 NA
## Cubist NA 0.03007913 NA
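A corrected test-set comparison would pass the held-out predictors explicitly; a sketch (not re-run here, so no output is shown):
# Evaluate each model on the 44 held-out observations
rbind(
  "SingleTree"       = postResample(pred = predict(rpartMod,  newdata = testX), obs = testY),
  "RandomForest"     = postResample(pred = predict(RF_Mod,    newdata = testX), obs = testY),
  "GradientBoosting" = postResample(pred = predict(gbmMod,    newdata = testX), obs = testY),
  "Cubist"           = postResample(pred = predict(CubistMod, newdata = testX), obs = testY)
)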
Let’s look one more time at the results of the trained GBM model.
plot(gbmMod)
Which predictors are most important in the optimal tree-based regression model?
Do either the biological or process variables dominate the list?
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The top 10 predictors from the gradient boosting model are listed below in descending order of Overall importance. Eight of the top 10 are manufacturing process predictors and only two are biological material predictors, so the manufacturing process variables clearly dominate the list (a quick tally is sketched after the output).
varImp(gbmMod)
## gbm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial12 16.838
## ManufacturingProcess06 12.351
## ManufacturingProcess17 11.963
## ManufacturingProcess13 11.455
## ManufacturingProcess09 10.906
## ManufacturingProcess31 9.533
## BiologicalMaterial03 7.867
## ManufacturingProcess11 6.178
## ManufacturingProcess21 5.848
## ManufacturingProcess27 5.478
## ManufacturingProcess43 5.462
## ManufacturingProcess20 5.159
## ManufacturingProcess04 5.050
## BiologicalMaterial02 4.884
## BiologicalMaterial11 4.772
## ManufacturingProcess05 4.553
## ManufacturingProcess39 4.502
## ManufacturingProcess24 4.354
## BiologicalMaterial09 4.216
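A quick tally of predictor types among the top 10 GBM importances backs up the count given above (a sketch):
# Count Biological vs. Manufacturing predictors among the top 10
gbm_imp <- varImp(gbmMod)$importance
top10   <- rownames(gbm_imp)[order(-gbm_imp$Overall)][1:10]
table(ifelse(grepl("^Biological", top10), "BiologicalMaterial", "ManufacturingProcess"))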
The top 10 predictors from the optimal linear model (Partial Least Squares) were:
ManufacturingProcess32 100.00000
ManufacturingProcess13 84.69518
ManufacturingProcess17 84.27932
ManufacturingProcess36 83.77365
ManufacturingProcess09 79.59390
BiologicalMaterial02 56.37639
ManufacturingProcess06 54.10142
ManufacturingProcess12 53.66660
BiologicalMaterial06 53.17872
ManufacturingProcess11 52.92723
and from the optimal non-linear model (Neural Network) they were:
ManufacturingProcess23 100.00000
ManufacturingProcess32 96.42698
ManufacturingProcess34 92.81134
ManufacturingProcess33 84.75033
BiologicalMaterial09 82.03159
ManufacturingProcess01 80.48583
ManufacturingProcess03 80.03863
BiologicalMaterial11 78.72629
ManufacturingProcess28 76.02541
ManufacturingProcess45 73.55811
ManufacturingProcess32 was the top predictor in both the linear and the tree-based models and was second most important in the non-linear model. Overall there is more overlap between the linear and tree-based models, with predictors 32, 06, 17, 13, 09 and 11 appearing in the top 10 of both; however, importance drops off very quickly in the tree-based model, declines more gradually in the linear model, and declines most slowly in the non-linear model.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
The plot of the optimal single tree (below) is revealing, keeping in mind that the percentages printed in the nodes are shares of the training samples falling into each node, not shares of yield. The root split is on ManufacturingProcess32, covering 100% of the training samples, and the two resulting branches, holding roughly 51% and 49% of the samples, split next on BiologicalMaterial12 and ManufacturingProcess06 respectively. On the BiologicalMaterial12 side, the subsequent splits use only manufacturing predictors (ManufacturingProcess29, 17 and 27), even though ManufacturingProcess29 is not listed among the top 20 most important predictors, and BiologicalMaterial12 reappears deeper in the tree at a higher threshold after ManufacturingProcess17. On the ManufacturingProcess06 side there are no biological predictors at all, only ManufacturingProcesses 17, 23 and 02. It is also interesting that ManufacturingProcess17 appears at the next level on both sides of the root split, but with different thresholds.

At the terminal nodes, the leaf reached through BiologicalMaterial12 contains about
18% of the training samples, while the leaves reached through
ManufacturingProcesses 27, 39, 02, 23 and 17 contain about
22%, 6%, 18%, 11% and 25% of the samples, respectively. These
percentages describe how many samples end up in each leaf rather than how much
each predictor contributes to yield, so it is not surprising that they do not
line up with the varImp() rankings used above to rank the predictors in the
tree-based model. A small summary of the terminal nodes is sketched after the plot.
library(rpart.plot)

rpartTree <- rpart(Yield ~ ., method = "anova", data = chem_train)
rpart.plot(rpartTree)
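To back up the reading of the node percentages as sample shares, the terminal nodes can be summarized directly from the rpart frame (a sketch; Yield here is on the preprocessed, centered and scaled scale):
# Mean (preprocessed) Yield and share of training samples in each terminal node
leaves <- rpartTree$frame[rpartTree$frame$var == "<leaf>", c("n", "yval")]
leaves$pct_samples <- round(100 * leaves$n / sum(leaves$n), 1)
leaves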