library(tidyverse)
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(AppliedPredictiveModeling)
library(RWeka)
Recreate the simulated data from Exercise 7.2
set.seed(624)
simulated <- mlbench.friedman1(200, sd = 1) #200 observations
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
set.seed(624)
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
varImp(model1, scale = FALSE)
## Overall
## V1 7.06138525
## V2 4.76962217
## V3 1.01126851
## V4 9.88171245
## V5 2.05197889
## V6 0.09245359
## V7 -0.05564489
## V8 -0.07717705
## V9 0.03891580
## V10 -0.06427946
No, the uninformative predictors (V6-V10) are not important to this model.
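The conditional-importance output below also scores two duplicate predictors (duplicate1 and duplicate2) whose creation is not shown in the chunks above, even though a later chunk removes them. A minimal sketch of how they were presumably added, assuming they follow the exercise's construction of predictors highly correlated with V1; the noise scales are assumptions, not the original code.
# Hypothetical reconstruction -- the chunk that created these columns is not shown above.
# The exercise adds predictors highly correlated with V1; the noise multipliers are guesses.
set.seed(624)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .2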
set.seed(624)
model4 <- cforest(y ~ ., data = simulated)
varimp(model4, conditional = TRUE)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 duplicate1 duplicate2
## 4.811084012 2.263260969 0.117301895 4.235852150 1.307494550 0.019542410 -0.001326794 -0.026687589 0.024102142 0.025047531 1.008603284 0.066807840
The use of conditional inference trees results in a model that places less importance on each of the informative predictors but still places little to no importance on the uninformative ones. V1 remains the most important, but with a lower score of approximately 4.81. The importance scores of V2, V4, and the first duplicate each decrease by roughly two points relative to their counterparts in the prior model.
Boosted Trees
set.seed(624)
simulated <- simulated %>% select(-c(duplicate1, duplicate2))
boosted_model1 <- gbm(y ~ .,
data = simulated,
distribution = "gaussian")
summary.gbm(boosted_model1)
## var rel.inf
## V4 V4 36.466310
## V1 V1 25.198851
## V2 V2 17.098097
## V5 V5 12.021480
## V3 V3 9.215263
## V6 V6 0.000000
## V7 V7 0.000000
## V8 V8 0.000000
## V9 V9 0.000000
## V10 V10 0.000000
A boosted trees model focuses on the informative predictors V4, V1, V2, V5, and V3, in descending order of relative importance.
set.seed(624)
simulated$duplicate1 <- simulated$V4 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V4)
## [1] 0.9439399
boosted_model2 <- gbm(y ~ .,
data = simulated,
distribution = "gaussian")
summary.gbm(boosted_model2)
## var rel.inf
## V4 V4 34.9746629
## V1 V1 25.4830902
## V2 V2 16.6363163
## V5 V5 11.8536331
## V3 V3 8.5933492
## duplicate1 duplicate1 1.9040810
## V10 V10 0.2088589
## V9 V9 0.1948116
## V8 V8 0.1511967
## V6 V6 0.0000000
## V7 V7 0.0000000
Adding a predictor (duplicate1) highly correlated with V4 results in no change in the ordering of the informative predictors, though the duplicate ranks above the uninformative predictors. As with the random forest model, the correlated predictor siphons off a little of the importance of the predictor it duplicates: V4's relative influence drops slightly, and most of the other informative predictors shift down as well.
Cubist
set.seed(624)
simulated <- simulated %>% select(-c(duplicate1))
cubist_model1 <- cubist(simulated[1:10], simulated$y)
summary(cubist_model1)
##
## Call:
## cubist.default(x = simulated[1:10], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Wed Nov 18 18:41:29 2020
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (11 attributes) from undefined.data
##
## Model:
##
## Rule 1: [53 cases, mean 10.425206, range 1.93532 to 17.54472, est err 1.407606]
##
## if
## V1 <= 0.2136574
## V4 <= 0.934455
## then
## outcome = 2.358711 + 13.5 V1 + 9.7 V4 + 5 V5 - 2 V3 + 1.1 V2
##
## Rule 2: [55 cases, mean 12.290461, range 4.176228 to 22.66149, est err 1.248912]
##
## if
## V1 > 0.2136574
## V2 <= 0.4577003
## V3 > 0.07063606
## V4 <= 0.934455
## then
## outcome = -1.676048 + 18.6 V2 + 9 V4 + 6.5 V1 + 4 V5 - 0.7 V3
##
## Rule 3: [20 cases, mean 14.479267, range 10.15909 to 21.36569, est err 1.369244]
##
## if
## V1 > 0.2136574
## V1 <= 0.5269587
## V2 > 0.4577003
## V4 <= 0.934455
## V5 <= 0.5558813
## then
## outcome = 10.405348 + 8.7 V4 - 2.8 V9 - 1.9 V7 + 1.7 V10 + 1.4 V5
## + 1.1 V2 + 0.6 V1 - 0.3 V3
##
## Rule 4: [20 cases, mean 16.767288, range 10.33381 to 23.71413, est err 1.739599]
##
## if
## V1 > 0.5269587
## V2 > 0.4577003
## V4 <= 0.934455
## V5 <= 0.5558813
## then
## outcome = 31.8521 - 12.9 V1 - 12.4 V2 + 8.7 V4 + 0.5 V5
##
## Rule 5: [10 cases, mean 18.189348, range 10.8877 to 22.12831, est err 0.689396]
##
## if
## V1 > 0.2136574
## V2 <= 0.4577003
## V3 <= 0.07063606
## then
## outcome = 2.862449 + 16.9 V2 + 7.9 V4 + 6.6 V1 + 3.5 V5 - 2.8 V3
##
## Rule 6: [32 cases, mean 18.451132, range 10.53267 to 23.19134, est err 1.462592]
##
## if
## V1 > 0.2136574
## V2 > 0.4577003
## V4 <= 0.934455
## V5 > 0.5558813
## then
## outcome = 10.412436 + 7.9 V4 + 2.6 V2 + 2.4 V5 + 1.6 V1 - 0.8 V3
##
## Rule 7: [10 cases, mean 19.617697, range 15.73031 to 24.72045, est err 0.778461]
##
## if
## V4 > 0.934455
## then
## outcome = 125.80106 - 110.9 V4 + 2.9 V2 + 0.2 V5 + 0.2 V1
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 1.355614
## Relative |error| 0.34
## Correlation coefficient 0.94
##
##
## Attribute usage:
## Conds Model
##
## 95% 100% V1
## 95% 100% V4
## 68% 100% V2
## 36% 100% V5
## 32% 85% V3
## 10% V7
## 10% V9
## 10% V10
##
##
## Time: 0.0 secs
A Cubist model focuses on the informative predictors V1, V4, V2, V5, and V3, in descending order of attribute usage; V1 and V4 swap places relative to the boosted trees models. A few uninformative predictors (V7, V9, and V10) do appear in the rule models, but with only 10% usage each.
set.seed(624)
simulated$duplicate1 <- simulated$V4 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V4)
## [1] 0.9439399
cubist_model2 <- cubist(simulated[-11], simulated$y)
summary(cubist_model2)
##
## Call:
## cubist.default(x = simulated[-11], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Wed Nov 18 18:41:29 2020
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (12 attributes) from undefined.data
##
## Model:
##
## Rule 1: [53 cases, mean 10.425206, range 1.93532 to 17.54472, est err 1.407606]
##
## if
## V1 <= 0.2136574
## V4 <= 0.934455
## then
## outcome = 2.358711 + 13.5 V1 + 9.7 V4 + 5 V5 - 2 V3 + 1.1 V2
##
## Rule 2: [65 cases, mean 13.197982, range 4.176228 to 22.66149, est err 1.376401]
##
## if
## V1 > 0.2136574
## V2 <= 0.4577003
## V4 <= 0.934455
## then
## outcome = -1.180513 + 19.2 V2 + 9 V4 + 7.5 V1 + 4 V5 - 3.1 V3
##
## Rule 3: [72 cases, mean 16.880102, range 10.15909 to 23.71413, est err 1.842202]
##
## if
## V1 > 0.2136574
## V2 > 0.4577003
## V4 <= 0.934455
## then
## outcome = 7.40266 + 8.8 V4 + 5.5 V5 + 2.6 V2 + 1.8 V1 - 0.8 V3
##
## Rule 4: [10 cases, mean 19.617697, range 15.73031 to 24.72045, est err 0.778461]
##
## if
## V4 > 0.934455
## then
## outcome = 125.80106 - 110.9 V4 + 2.9 V2 + 0.2 V5 + 0.2 V1
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 1.678127
## Relative |error| 0.42
## Correlation coefficient 0.90
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% V4
## 95% 100% V1
## 68% 100% V2
## 100% V5
## 95% V3
##
##
## Time: 0.0 secs
Adding a predictor (duplicate1) highly correlated with V4 results in a new model that uses only the informative predictors; neither the duplicate nor any of the uninformative predictors appears. More specifically, the new model places the greatest usage on V4, V1, and V2.
Use a simulation to show tree bias with different granularities.
set.seed(624)
x1 <- seq(0.01, 1, 0.01)
x2 <- rep(1:2, each = 50)
x3 <- sample(c(seq(1, 50, 1), rep(1:2, each = 25)))
y <- x2 + rnorm(100)
varImp(rpart(y ~ x1 + x2 + x3))
## Overall
## x1 0.5268857
## x2 0.2619461
## x3 0.3989453
The simulation includes three predictors: x1, with 100 unique values; x2, with 2 unique values; and x3, with 50 unique values but a skewed frequency distribution (the values 1 and 2 are heavily repeated). The response y is created from x2 plus random noise. Using CART to model y on the three predictors produces a model that places the greatest importance on x1, which has the most distinct values, and the least importance on x2, which generated y but has the fewest distinct values. This tendency to favor splits on more granular predictors, even uninformative ones, illustrates the selection bias of trees.
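To make the bias more visible than a single fit allows, the data generation can be repeated and the predictor chosen for the first split tallied across replicates. This sketch extends the simulation above (it is not part of the original assignment); x1 should tend to win the root split disproportionately often even though only x2 informs y.
# Repeat the simulation and record which predictor rpart chooses for the root split.
# "<leaf>" marks replicates where no split was made.
set.seed(624)
first_split <- replicate(200, {
  dat <- data.frame(x1 = seq(0.01, 1, 0.01),
                    x2 = rep(1:2, each = 50),
                    x3 = sample(c(seq(1, 50, 1), rep(1:2, each = 25))))
  dat$y <- dat$x2 + rnorm(100)
  fit <- rpart(y ~ ., data = dat)
  as.character(fit$frame$var[1])
})
table(first_split)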
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
The model on the right uses extremely high values for both the bagging fraction--the proportion of the training data sampled to fit each iteration's tree--and the learning rate--the fraction of each new tree's prediction that is added to the running model. With both set to 0.9, every tree sees nearly the same data and each tree contributes heavily, so the model commits quickly to the structure found in the earliest iterations. Because boosting greedily selects the best split at each stage, those early, dominant trees concentrate on the same few strong predictors, and their importance is reinforced in subsequent stages.
By comparison, the left-hand model uses a much smaller bagging fraction and a much smaller (and typically more effective) learning rate. A lower proportion of the training data is used at each stage, and each stage contributes only a small correction to the previous predictions. As a result, importance is spread across a more diverse set of predictors, because different stages see different subsets of the data and no single tree dominates. A rough illustration of the two settings appears in the sketch below.
I expect the model with both parameters set to 0.1 to be more predictive of other samples: it will have been trained on more diverse subsamples and should generalize better to new data. The text suggests that lower values of the learning rate perform better, though such a low bagging fraction may be sub-optimal and could yield an overly simple model. The right-hand model, being greedier, could be prone to over-fitting. Regardless, I would choose between the two based on validation performance.
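The sketch below is illustrative only: the solubility data behind Figure 8.24 is not loaded here, so the two extreme settings are applied to the Friedman simulation from above instead (with duplicate1 dropped). n.minobsinnode is lowered to 5 so that the 0.1 bagging fraction still leaves enough observations to split on.
# Two boosted models on the simulated data: both tuning parameters low vs. both high.
set.seed(624)
gbm_low <- gbm(y ~ ., data = simulated %>% select(-duplicate1),
               distribution = "gaussian", n.trees = 100,
               shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated %>% select(-duplicate1),
                distribution = "gaussian", n.trees = 100,
                shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5)
# Compare how concentrated the relative influence is under each setting.
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)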
Increasing the interaction depth would flatten the (still decreasing) slope of predictor importance. Greater depth means more splits per tree and therefore more interactions between split predictors, which in turn gives additional predictors the opportunity to enter the model and accrue importance. Allocating importance across a larger set of predictors lessens the reliance on a small set of particularly important ones.
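As a companion sketch, under the same assumptions as the block above, the effect of depth can be inspected by comparing relative influence at interaction depths of 1 and 7.
# Shallow vs. deep trees on the simulated data; deeper trees should spread
# relative influence across more predictors.
set.seed(624)
gbm_shallow <- gbm(y ~ ., data = simulated %>% select(-duplicate1),
                   distribution = "gaussian", n.trees = 100, interaction.depth = 1)
gbm_deep <- gbm(y ~ ., data = simulated %>% select(-duplicate1),
                distribution = "gaussian", n.trees = 100, interaction.depth = 7)
summary(gbm_shallow, plotit = FALSE)
summary(gbm_deep, plotit = FALSE)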
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.
The code and descriptions below are adapted directly from the Week 10 Homework assignment.
data("ChemicalManufacturingProcess")
imputed <- impute::impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 624)
cmp <- as.data.frame(imputed$data)
sum(is.na(cmp))
## [1] 0
The data set contains 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for 176 manufacturing runs. The response, Yield, contains the percent yield for each run.
The data are assumed to be missing at random for the purposes of this exercise, and KNN imputation is used to estimate the 106 missing values in the set. The imputation uses 10 neighbors, as made explicit in the sketch below.
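The call above relies on the impute.knn default of 10 neighbors; an equivalent call with the neighbor count stated explicitly is shown only to document that assumption.
# Equivalent to the imputation above; k = 10 is the impute.knn default.
imputed <- impute::impute.knn(as.matrix(ChemicalManufacturingProcess),
                              k = 10, rng.seed = 624)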
set.seed(624)
index <- createDataPartition(cmp$Yield, p = .80, list = FALSE)
cmp_train <- cmp[index,] # 144 observations
cmp_test <- cmp[-index,] # 32 observations
An 80/20 split is used to create a training set of 144 runs and a test set of 32 runs.
ex <- nearZeroVar(cmp_train[-1], saveMetrics = TRUE)
ex %>% arrange(-freqRatio, percentUnique, -nzv) %>% head()
## freqRatio percentUnique zeroVar nzv
## BiologicalMaterial07 71.000000 1.388889 FALSE TRUE
## ManufacturingProcess41 6.500000 2.777778 FALSE FALSE
## ManufacturingProcess28 5.400000 14.583333 FALSE FALSE
## ManufacturingProcess12 4.760000 1.388889 FALSE FALSE
## ManufacturingProcess34 4.636364 6.250000 FALSE FALSE
## ManufacturingProcess40 4.333333 1.388889 FALSE FALSE
sum(ex$nzv)
## [1] 1
A check for near-zero variance predictors returns just one: BiologicalMaterial07, with a frequency ratio of approximately 71.
corr <- cor((cmp_train %>% select(-c("Yield","BiologicalMaterial07"))), method = "spearman")
corrplot::corrplot(corr)
hicorr <- findCorrelation(corr)
set.seed(624)
cmp_train_slim <- cmp_train %>% select(-c(Yield, BiologicalMaterial07)) %>% select(-all_of(hicorr))
cmp_train_transform <- cmp_train_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(cmp_train_slim) %>% cbind(cmp_train$Yield) %>% rename(Yield = "cmp_train$Yield")
cmp_test_slim <- cmp_test %>% select(-c(Yield, BiologicalMaterial07)) %>% select(-all_of(hicorr))
cmp_test_transform <- cmp_test_slim %>% preProcess(method = c("BoxCox", "center", "scale")) %>% predict(cmp_test_slim) %>% cbind(cmp_test$Yield) %>% rename(Yield = "cmp_test$Yield")
In pre-processing, predictors with near-zero variance or high correlations (using Spearman's \(\rho\)) are removed, and remaining predictors undergo a Box-Cox transformation as well as centering and scaling. The resulting training and test sets feature 51 predictors.
Single Tree
set.seed(624)
grid <- expand.grid(maxdepth = seq(1, 10, by = 1))
(single_cmp <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "rpart2",
tuneGrid = grid,
trControl = trainControl(method = "repeatedcv", repeats = 5)))
## CART
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.461753 0.3740462 1.167401
## 2 1.448095 0.3822608 1.161703
## 3 1.496788 0.3572653 1.198263
## 4 1.517058 0.3647183 1.202607
## 5 1.556054 0.3461312 1.221953
## 6 1.545414 0.3534624 1.202002
## 7 1.531284 0.3669317 1.174808
## 8 1.528409 0.3701504 1.164245
## 9 1.535316 0.3677410 1.169591
## 10 1.531337 0.3706694 1.161769
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 2.
singlepred_cmp <- predict(single_cmp, newdata = cmp_test_transform)
postResample(pred = singlepred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.4593369 0.5400498 1.1604447
Minimizing RMSE (~1.45) under repeated 10-fold cross-validation selects a maximum depth of 2. Evaluating on the test set returns an RMSE of approximately 1.46.
Random Forest
set.seed(624)
(rf_cmp <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "rf",
trControl = trainControl(method = "repeatedcv", repeats = 5),
importance = TRUE))
## Random Forest
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.220578 0.6430496 0.9980242
## 26 1.138212 0.6445025 0.8997997
## 51 1.157702 0.6243623 0.9124376
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
rfpred_cmp <- predict(rf_cmp, newdata = cmp_test_transform)
postResample(pred = rfpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.2659820 0.7610683 0.9470121
Minimizing RMSE (~1.14) under repeated 10-fold cross-validation selects mtry = 26 predictors randomly sampled at each split. Evaluating on the test set returns an RMSE of approximately 1.27.
Boosted Trees
set.seed(624)
boosted_grid <- expand.grid(.interaction.depth = seq(1, 7, by =2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = c(5, 10))
(boosted_cmp <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "gbm",
tuneGrid = boosted_grid,
trControl = trainControl(method = "repeatedcv", repeats = 5),
verbose = FALSE,
distribution = "gaussian"))
## Stochastic Gradient Boosting
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared MAE
## 0.01 1 5 100 1.427721 0.5372102 1.1649823
## 0.01 1 5 150 1.346093 0.5562413 1.0949981
## 0.01 1 5 200 1.291922 0.5689057 1.0438736
## 0.01 1 5 250 1.260184 0.5751826 1.0100025
## 0.01 1 5 300 1.239968 0.5795328 0.9869272
## 0.01 1 5 350 1.227185 0.5827024 0.9731314
## 0.01 1 5 400 1.216606 0.5868720 0.9617909
## 0.01 1 5 450 1.207433 0.5908080 0.9528824
## 0.01 1 5 500 1.199722 0.5940348 0.9461525
## 0.01 1 5 550 1.194491 0.5971129 0.9409722
## 0.01 1 5 600 1.189609 0.5994562 0.9353368
## 0.01 1 5 650 1.184690 0.6022769 0.9305125
## 0.01 1 5 700 1.180421 0.6042636 0.9259541
## 0.01 1 5 750 1.177870 0.6054348 0.9230573
## 0.01 1 5 800 1.174100 0.6070573 0.9190059
## 0.01 1 5 850 1.170615 0.6092221 0.9152021
## 0.01 1 5 900 1.167679 0.6110072 0.9124152
## 0.01 1 5 950 1.165259 0.6124905 0.9093425
## 0.01 1 5 1000 1.162631 0.6137581 0.9066604
## 0.01 1 10 100 1.430718 0.5351788 1.1697965
## 0.01 1 10 150 1.347425 0.5543554 1.0981603
## 0.01 1 10 200 1.293514 0.5672715 1.0469999
## 0.01 1 10 250 1.261516 0.5734375 1.0133415
## 0.01 1 10 300 1.242343 0.5776798 0.9898924
## 0.01 1 10 350 1.226934 0.5823603 0.9701843
## 0.01 1 10 400 1.218723 0.5846213 0.9588219
## 0.01 1 10 450 1.211388 0.5882841 0.9498190
## 0.01 1 10 500 1.206770 0.5904363 0.9443450
## 0.01 1 10 550 1.201310 0.5935291 0.9396121
## 0.01 1 10 600 1.197199 0.5950782 0.9357632
## 0.01 1 10 650 1.193007 0.5969303 0.9317260
## 0.01 1 10 700 1.189875 0.5983347 0.9290368
## 0.01 1 10 750 1.186269 0.6004149 0.9252790
## 0.01 1 10 800 1.183274 0.6020390 0.9214548
## 0.01 1 10 850 1.180326 0.6032591 0.9185015
## 0.01 1 10 900 1.178213 0.6048093 0.9157945
## 0.01 1 10 950 1.177492 0.6048333 0.9147729
## 0.01 1 10 1000 1.173987 0.6067682 0.9120401
## 0.01 3 5 100 1.341189 0.5781599 1.0951408
## 0.01 3 5 150 1.261639 0.5922409 1.0229614
## 0.01 3 5 200 1.215986 0.6054455 0.9766663
## 0.01 3 5 250 1.188927 0.6143525 0.9492845
## 0.01 3 5 300 1.169674 0.6228305 0.9308577
## 0.01 3 5 350 1.156187 0.6287682 0.9170908
## 0.01 3 5 400 1.146540 0.6328776 0.9070298
## 0.01 3 5 450 1.137410 0.6366442 0.8984058
## 0.01 3 5 500 1.130171 0.6395350 0.8906868
## 0.01 3 5 550 1.123249 0.6432943 0.8828832
## 0.01 3 5 600 1.118381 0.6458489 0.8771387
## 0.01 3 5 650 1.112654 0.6489951 0.8704841
## 0.01 3 5 700 1.108581 0.6513915 0.8658082
## 0.01 3 5 750 1.104946 0.6537154 0.8618190
## 0.01 3 5 800 1.101683 0.6556584 0.8585060
## 0.01 3 5 850 1.097812 0.6575985 0.8546428
## 0.01 3 5 900 1.095353 0.6588568 0.8517996
## 0.01 3 5 950 1.092516 0.6604139 0.8495586
## 0.01 3 5 1000 1.090565 0.6614816 0.8477378
## 0.01 3 10 100 1.344530 0.5709046 1.0954226
## 0.01 3 10 150 1.269268 0.5832928 1.0247813
## 0.01 3 10 200 1.227559 0.5925271 0.9817101
## 0.01 3 10 250 1.204203 0.5992447 0.9563339
## 0.01 3 10 300 1.187879 0.6054613 0.9393864
## 0.01 3 10 350 1.178072 0.6088769 0.9282799
## 0.01 3 10 400 1.168352 0.6138010 0.9181443
## 0.01 3 10 450 1.160359 0.6178563 0.9099285
## 0.01 3 10 500 1.153195 0.6218698 0.9027709
## 0.01 3 10 550 1.147271 0.6246241 0.8966618
## 0.01 3 10 600 1.143126 0.6267419 0.8932371
## 0.01 3 10 650 1.139550 0.6289213 0.8901652
## 0.01 3 10 700 1.135354 0.6312293 0.8863081
## 0.01 3 10 750 1.131851 0.6335354 0.8830729
## 0.01 3 10 800 1.129948 0.6349748 0.8813000
## 0.01 3 10 850 1.127933 0.6361663 0.8796885
## 0.01 3 10 900 1.126984 0.6368407 0.8789730
## 0.01 3 10 950 1.124901 0.6383726 0.8781243
## 0.01 3 10 1000 1.122757 0.6397135 0.8758066
## 0.01 5 5 100 1.320854 0.5893272 1.0787699
## 0.01 5 5 150 1.236864 0.6096302 1.0022836
## 0.01 5 5 200 1.190980 0.6213442 0.9586855
## 0.01 5 5 250 1.162328 0.6320830 0.9313011
## 0.01 5 5 300 1.141233 0.6410624 0.9113742
## 0.01 5 5 350 1.128032 0.6464326 0.8970834
## 0.01 5 5 400 1.117900 0.6507642 0.8866143
## 0.01 5 5 450 1.109422 0.6551295 0.8769114
## 0.01 5 5 500 1.100296 0.6604405 0.8671636
## 0.01 5 5 550 1.094523 0.6631980 0.8598450
## 0.01 5 5 600 1.089303 0.6655582 0.8541454
## 0.01 5 5 650 1.085266 0.6677532 0.8491781
## 0.01 5 5 700 1.081199 0.6700302 0.8444533
## 0.01 5 5 750 1.077784 0.6721817 0.8407880
## 0.01 5 5 800 1.074649 0.6737578 0.8373817
## 0.01 5 5 850 1.071930 0.6754401 0.8353929
## 0.01 5 5 900 1.069330 0.6768532 0.8337019
## 0.01 5 5 950 1.067298 0.6780592 0.8320116
## 0.01 5 5 1000 1.064959 0.6794012 0.8302676
## 0.01 5 10 100 1.334725 0.5774794 1.0872102
## 0.01 5 10 150 1.256511 0.5912033 1.0161543
## 0.01 5 10 200 1.212597 0.6031082 0.9704546
## 0.01 5 10 250 1.187118 0.6108277 0.9415820
## 0.01 5 10 300 1.168823 0.6189689 0.9227376
## 0.01 5 10 350 1.158372 0.6235842 0.9114831
## 0.01 5 10 400 1.149189 0.6279380 0.9016161
## 0.01 5 10 450 1.141560 0.6316506 0.8928099
## 0.01 5 10 500 1.135468 0.6346432 0.8863226
## 0.01 5 10 550 1.130429 0.6371292 0.8816861
## 0.01 5 10 600 1.127230 0.6388135 0.8785189
## 0.01 5 10 650 1.123189 0.6409727 0.8757199
## 0.01 5 10 700 1.118759 0.6436225 0.8718750
## 0.01 5 10 750 1.116078 0.6451100 0.8693085
## 0.01 5 10 800 1.114160 0.6461159 0.8669856
## 0.01 5 10 850 1.112224 0.6473042 0.8653907
## 0.01 5 10 900 1.110956 0.6482396 0.8648234
## 0.01 5 10 950 1.109271 0.6494480 0.8640239
## 0.01 5 10 1000 1.107787 0.6502579 0.8632946
## 0.01 7 5 100 1.307003 0.6055447 1.0678095
## 0.01 7 5 150 1.222831 0.6211639 0.9897895
## 0.01 7 5 200 1.175944 0.6333512 0.9413445
## 0.01 7 5 250 1.145687 0.6439610 0.9114432
## 0.01 7 5 300 1.126863 0.6503178 0.8920416
## 0.01 7 5 350 1.112537 0.6571885 0.8777051
## 0.01 7 5 400 1.101134 0.6625862 0.8665131
## 0.01 7 5 450 1.091912 0.6667622 0.8566489
## 0.01 7 5 500 1.084818 0.6702307 0.8491598
## 0.01 7 5 550 1.079717 0.6726313 0.8430410
## 0.01 7 5 600 1.075421 0.6746464 0.8384735
## 0.01 7 5 650 1.071004 0.6770645 0.8334904
## 0.01 7 5 700 1.067025 0.6793239 0.8296821
## 0.01 7 5 750 1.063902 0.6810934 0.8267113
## 0.01 7 5 800 1.061204 0.6824111 0.8241104
## 0.01 7 5 850 1.058741 0.6837976 0.8219455
## 0.01 7 5 900 1.056899 0.6848372 0.8203424
## 0.01 7 5 950 1.055008 0.6859554 0.8189743
## 0.01 7 5 1000 1.052993 0.6870680 0.8179537
## 0.01 7 10 100 1.331869 0.5804672 1.0863914
## 0.01 7 10 150 1.256325 0.5917392 1.0150732
## 0.01 7 10 200 1.213167 0.6009102 0.9714404
## 0.01 7 10 250 1.186659 0.6103508 0.9439939
## 0.01 7 10 300 1.170533 0.6164179 0.9269136
## 0.01 7 10 350 1.157858 0.6225617 0.9137971
## 0.01 7 10 400 1.147356 0.6275769 0.9031168
## 0.01 7 10 450 1.141269 0.6298271 0.8969305
## 0.01 7 10 500 1.134919 0.6332701 0.8905073
## 0.01 7 10 550 1.129641 0.6360193 0.8856572
## 0.01 7 10 600 1.124837 0.6389644 0.8803235
## 0.01 7 10 650 1.121352 0.6409815 0.8776442
## 0.01 7 10 700 1.118717 0.6425308 0.8748260
## 0.01 7 10 750 1.116387 0.6441264 0.8723908
## 0.01 7 10 800 1.114156 0.6455379 0.8700583
## 0.01 7 10 850 1.112080 0.6468245 0.8680341
## 0.01 7 10 900 1.109234 0.6488245 0.8658905
## 0.01 7 10 950 1.107222 0.6499643 0.8654522
## 0.01 7 10 1000 1.105364 0.6511462 0.8645189
## 0.10 1 5 100 1.189021 0.5956498 0.9344818
## 0.10 1 5 150 1.178125 0.6032055 0.9186806
## 0.10 1 5 200 1.173761 0.6043892 0.9100564
## 0.10 1 5 250 1.169437 0.6089336 0.9037489
## 0.10 1 5 300 1.174292 0.6068745 0.9059815
## 0.10 1 5 350 1.176658 0.6056341 0.9093323
## 0.10 1 5 400 1.170998 0.6094985 0.9017819
## 0.10 1 5 450 1.166908 0.6132904 0.8993088
## 0.10 1 5 500 1.164777 0.6147483 0.8984574
## 0.10 1 5 550 1.164095 0.6155460 0.8997818
## 0.10 1 5 600 1.162590 0.6169697 0.8989849
## 0.10 1 5 650 1.162716 0.6169379 0.8974344
## 0.10 1 5 700 1.162256 0.6184752 0.8974466
## 0.10 1 5 750 1.159087 0.6208312 0.8953293
## 0.10 1 5 800 1.160477 0.6202221 0.8963830
## 0.10 1 5 850 1.157762 0.6220146 0.8938432
## 0.10 1 5 900 1.156933 0.6230451 0.8929510
## 0.10 1 5 950 1.156874 0.6233522 0.8939382
## 0.10 1 5 1000 1.155715 0.6244395 0.8937127
## 0.10 1 10 100 1.196562 0.5890876 0.9321203
## 0.10 1 10 150 1.191884 0.5907804 0.9264165
## 0.10 1 10 200 1.187808 0.5915673 0.9210896
## 0.10 1 10 250 1.185724 0.5933218 0.9210656
## 0.10 1 10 300 1.187728 0.5940101 0.9196915
## 0.10 1 10 350 1.187453 0.5924768 0.9190276
## 0.10 1 10 400 1.185182 0.5972276 0.9203548
## 0.10 1 10 450 1.186503 0.5970572 0.9203066
## 0.10 1 10 500 1.182477 0.6000346 0.9172044
## 0.10 1 10 550 1.182385 0.6019572 0.9210270
## 0.10 1 10 600 1.181680 0.6034325 0.9206430
## 0.10 1 10 650 1.181371 0.6035280 0.9201149
## 0.10 1 10 700 1.183981 0.6023284 0.9227204
## 0.10 1 10 750 1.183313 0.6036948 0.9232864
## 0.10 1 10 800 1.186230 0.6034987 0.9266954
## 0.10 1 10 850 1.183598 0.6052658 0.9249682
## 0.10 1 10 900 1.182754 0.6062730 0.9252733
## 0.10 1 10 950 1.183869 0.6056868 0.9256990
## 0.10 1 10 1000 1.186154 0.6051804 0.9280313
## 0.10 3 5 100 1.126673 0.6309413 0.8778412
## 0.10 3 5 150 1.118608 0.6344207 0.8711840
## 0.10 3 5 200 1.111321 0.6396734 0.8679516
## 0.10 3 5 250 1.106877 0.6418787 0.8657773
## 0.10 3 5 300 1.103539 0.6440274 0.8630819
## 0.10 3 5 350 1.100967 0.6457262 0.8618612
## 0.10 3 5 400 1.100733 0.6459224 0.8624044
## 0.10 3 5 450 1.099570 0.6467314 0.8616397
## 0.10 3 5 500 1.098866 0.6472124 0.8611487
## 0.10 3 5 550 1.098451 0.6474745 0.8610017
## 0.10 3 5 600 1.098140 0.6476500 0.8608511
## 0.10 3 5 650 1.098080 0.6476705 0.8609299
## 0.10 3 5 700 1.098131 0.6476232 0.8609636
## 0.10 3 5 750 1.098025 0.6476836 0.8609171
## 0.10 3 5 800 1.097970 0.6477143 0.8609243
## 0.10 3 5 850 1.097941 0.6477330 0.8609185
## 0.10 3 5 900 1.097907 0.6477625 0.8609117
## 0.10 3 5 950 1.097911 0.6477644 0.8609318
## 0.10 3 5 1000 1.097905 0.6477734 0.8609294
## 0.10 3 10 100 1.142661 0.6272673 0.8970376
## 0.10 3 10 150 1.135790 0.6315620 0.8974571
## 0.10 3 10 200 1.130363 0.6364418 0.8931886
## 0.10 3 10 250 1.123547 0.6414206 0.8886820
## 0.10 3 10 300 1.118739 0.6440784 0.8866396
## 0.10 3 10 350 1.114966 0.6470856 0.8838787
## 0.10 3 10 400 1.113234 0.6482029 0.8835835
## 0.10 3 10 450 1.111906 0.6494541 0.8832368
## 0.10 3 10 500 1.110307 0.6504913 0.8821800
## 0.10 3 10 550 1.109801 0.6508576 0.8820233
## 0.10 3 10 600 1.109214 0.6512106 0.8817648
## 0.10 3 10 650 1.108676 0.6514447 0.8814342
## 0.10 3 10 700 1.108292 0.6517145 0.8812605
## 0.10 3 10 750 1.107870 0.6519886 0.8810405
## 0.10 3 10 800 1.107656 0.6521021 0.8809519
## 0.10 3 10 850 1.107521 0.6522160 0.8808688
## 0.10 3 10 900 1.107300 0.6523223 0.8807278
## 0.10 3 10 950 1.107137 0.6524157 0.8806744
## 0.10 3 10 1000 1.107003 0.6525077 0.8806093
## 0.10 5 5 100 1.110228 0.6464728 0.8706931
## 0.10 5 5 150 1.094429 0.6559326 0.8591453
## 0.10 5 5 200 1.089129 0.6594170 0.8563424
## 0.10 5 5 250 1.084335 0.6616264 0.8532406
## 0.10 5 5 300 1.081299 0.6631954 0.8512959
## 0.10 5 5 350 1.080151 0.6637837 0.8504221
## 0.10 5 5 400 1.079125 0.6643194 0.8498183
## 0.10 5 5 450 1.078372 0.6646770 0.8493239
## 0.10 5 5 500 1.077883 0.6649037 0.8490162
## 0.10 5 5 550 1.077596 0.6650377 0.8488405
## 0.10 5 5 600 1.077334 0.6651680 0.8486450
## 0.10 5 5 650 1.077171 0.6652489 0.8485479
## 0.10 5 5 700 1.077088 0.6652842 0.8485129
## 0.10 5 5 750 1.077005 0.6653221 0.8484477
## 0.10 5 5 800 1.076952 0.6653440 0.8484174
## 0.10 5 5 850 1.076912 0.6653634 0.8483940
## 0.10 5 5 900 1.076897 0.6653722 0.8483815
## 0.10 5 5 950 1.076872 0.6653834 0.8483668
## 0.10 5 5 1000 1.076853 0.6653936 0.8483529
## 0.10 5 10 100 1.143387 0.6244895 0.9082911
## 0.10 5 10 150 1.130882 0.6342497 0.9002799
## 0.10 5 10 200 1.128319 0.6363941 0.8966440
## 0.10 5 10 250 1.123485 0.6396760 0.8951774
## 0.10 5 10 300 1.119522 0.6424423 0.8935532
## 0.10 5 10 350 1.116497 0.6443116 0.8910050
## 0.10 5 10 400 1.115291 0.6450172 0.8905278
## 0.10 5 10 450 1.113397 0.6461334 0.8890954
## 0.10 5 10 500 1.112395 0.6467030 0.8886366
## 0.10 5 10 550 1.111445 0.6472506 0.8880727
## 0.10 5 10 600 1.110833 0.6474857 0.8876282
## 0.10 5 10 650 1.110356 0.6477130 0.8871843
## 0.10 5 10 700 1.110071 0.6478199 0.8871712
## 0.10 5 10 750 1.109561 0.6481382 0.8868893
## 0.10 5 10 800 1.109290 0.6482998 0.8866922
## 0.10 5 10 850 1.109141 0.6483626 0.8866334
## 0.10 5 10 900 1.108886 0.6485098 0.8864779
## 0.10 5 10 950 1.108744 0.6485841 0.8863830
## 0.10 5 10 1000 1.108594 0.6486610 0.8862972
## 0.10 7 5 100 1.089405 0.6620128 0.8535273
## 0.10 7 5 150 1.078684 0.6694235 0.8446272
## 0.10 7 5 200 1.072416 0.6725484 0.8392623
## 0.10 7 5 250 1.069538 0.6738973 0.8367786
## 0.10 7 5 300 1.067820 0.6748603 0.8357188
## 0.10 7 5 350 1.066777 0.6754413 0.8350811
## 0.10 7 5 400 1.065936 0.6758591 0.8344585
## 0.10 7 5 450 1.065447 0.6760288 0.8341605
## 0.10 7 5 500 1.065171 0.6761686 0.8339107
## 0.10 7 5 550 1.064816 0.6763610 0.8336768
## 0.10 7 5 600 1.064618 0.6764600 0.8335541
## 0.10 7 5 650 1.064479 0.6765316 0.8334642
## 0.10 7 5 700 1.064373 0.6765950 0.8333745
## 0.10 7 5 750 1.064329 0.6766160 0.8333385
## 0.10 7 5 800 1.064270 0.6766512 0.8332976
## 0.10 7 5 850 1.064235 0.6766676 0.8332654
## 0.10 7 5 900 1.064217 0.6766774 0.8332537
## 0.10 7 5 950 1.064197 0.6766883 0.8332455
## 0.10 7 5 1000 1.064194 0.6766887 0.8332439
## 0.10 7 10 100 1.161177 0.6089217 0.9167722
## 0.10 7 10 150 1.148935 0.6188672 0.9096452
## 0.10 7 10 200 1.144118 0.6235342 0.9071570
## 0.10 7 10 250 1.139228 0.6266509 0.9042905
## 0.10 7 10 300 1.137585 0.6277145 0.9029185
## 0.10 7 10 350 1.135365 0.6289793 0.9018301
## 0.10 7 10 400 1.133647 0.6300888 0.9009775
## 0.10 7 10 450 1.131108 0.6314978 0.8997108
## 0.10 7 10 500 1.130178 0.6322563 0.8991446
## 0.10 7 10 550 1.129738 0.6324890 0.8990676
## 0.10 7 10 600 1.129200 0.6327620 0.8988156
## 0.10 7 10 650 1.128651 0.6330775 0.8985912
## 0.10 7 10 700 1.128076 0.6333568 0.8982969
## 0.10 7 10 750 1.127922 0.6333845 0.8981867
## 0.10 7 10 800 1.127631 0.6335750 0.8980244
## 0.10 7 10 850 1.127460 0.6336513 0.8979988
## 0.10 7 10 900 1.127215 0.6337674 0.8978673
## 0.10 7 10 950 1.127072 0.6338214 0.8977727
## 0.10 7 10 1000 1.127024 0.6338374 0.8977656
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth = 7, shrinkage = 0.01 and n.minobsinnode = 5.
boostedpred_cmp <- predict(boosted_cmp, newdata = cmp_test_transform)
postResample(pred = boostedpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 1.1057507 0.7661067 0.8521719
Minimizing RMSE (~1.05) under repeated 10-fold cross-validation selects 1000 trees, an interaction depth of 7, a shrinkage of 0.01, and a minimum of 5 observations per terminal node. Evaluating on the test set returns an RMSE of approximately 1.11.
Cubist
set.seed(624)
(cubist_cmp <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "cubist",
trControl = trainControl(method = "repeatedcv", repeats = 5)))
## Cubist
##
## 144 samples
## 51 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 130, 130, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.2718257 0.5613352 1.0059226
## 1 5 1.0997059 0.6641557 0.8461829
## 1 9 1.1979266 0.6046555 0.9266776
## 10 0 1.1581101 0.6015906 0.9204779
## 10 5 1.0131135 0.6930819 0.7922289
## 10 9 1.1057110 0.6384356 0.8669104
## 20 0 1.1399322 0.6118317 0.9079318
## 20 5 0.9893243 0.7036421 0.7778457
## 20 9 1.0853852 0.6482333 0.8578460
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
cubistpred_cmp <- predict(cubist_cmp, newdata = cmp_test_transform)
postResample(pred = cubistpred_cmp, obs = cmp_test_transform$Yield)
## RMSE Rsquared MAE
## 0.9300742 0.8138656 0.6736065
Minimizing RMSE (~0.99) under repeated 10-fold cross-validation selects 20 committees and 5 neighbors. Evaluating on the test set returns an RMSE of approximately 0.93, unusually lower than its cross-validated counterpart; experimenting with other random seeds suggests the gap is attributable to randomness, so the initial seed (624) is retained.
Of the four models, the Cubist model performs the best on the test set. Its test set RMSE is approximately 0.93.
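For a side-by-side view, the test-set metrics already computed above can be collected into a single table; this is a small convenience sketch rather than part of the original output.
# Gather the test-set performance of the four tree-based models into one matrix.
rbind(
  single_tree   = postResample(pred = singlepred_cmp,  obs = cmp_test_transform$Yield),
  random_forest = postResample(pred = rfpred_cmp,      obs = cmp_test_transform$Yield),
  boosted_trees = postResample(pred = boostedpred_cmp, obs = cmp_test_transform$Yield),
  cubist        = postResample(pred = cubistpred_cmp,  obs = cmp_test_transform$Yield)
)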
varImp(cubist_cmp)$importance %>%
arrange(-Overall) %>%
rownames_to_column("predictor") %>%
top_n(10) %>%
ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
geom_col() +
ggtitle("Top ten predictors of product yield, by importance in Cubist model") +
xlab(NULL) +
ylab("Importance") +
coord_flip()
set.seed(624)
svm_cmp <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "svmRadial",
tuneLength = 14,
trControl = trainControl(method = "repeatedcv", repeats = 5))
varImp(svm_cmp)$importance %>%
arrange(-Overall) %>%
rownames_to_column("predictor") %>%
top_n(10) %>%
ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
geom_col() +
ggtitle("Top ten predictors of product yield, by importance in SVM model") +
xlab(NULL) +
ylab("Importance") +
coord_flip()
set.seed(624)
cmp_pls <- caret::train(Yield ~ .,
data = cmp_train_transform,
method = "pls",
tuneLength = 20,
trControl = trainControl(method = "repeatedcv", repeats = 5)
)
varImp(cmp_pls)$importance %>%
arrange(-Overall) %>%
rownames_to_column("predictor") %>%
top_n(10) %>%
ggplot(aes(x = reorder(predictor, Overall), y = Overall)) +
geom_col() +
ggtitle("Top ten predictors of product yield, by importance in PLS model") +
xlab(NULL) +
ylab("Importance") +
coord_flip()
ManufacturingProcess32 and ManufacturingProcess09 are by far the most important predictors in the Cubist model. The process predictors still dominate, though less than they do in the linear partial least squares (PLS) model and more than they do in the non-linear support vector machine (SVM) model. Relative to the other optimal models, the Cubist model places importance on different process predictors--ManufacturingProcess33, ManufacturingProcess28, and ManufacturingProcess25--though each of these carries relatively low importance within the model. The biological predictors are generally the same across models.
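To make that comparison easier to scan, the top-ten importances of the three models can be stacked into one long table; a convenience sketch, assuming (as holds for these model types) that each importance data frame has an Overall column.
# Top ten predictors per model, combined for side-by-side inspection.
list(Cubist = cubist_cmp, SVM = svm_cmp, PLS = cmp_pls) %>%
  purrr::map_dfr(~ varImp(.x)$importance %>%
                   rownames_to_column("predictor") %>%
                   slice_max(Overall, n = 10),
                 .id = "model")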
single_cmp$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 453.46990 40.15799
## 2) ManufacturingProcess32< 0.1888699 83 152.30240 39.23084
## 4) BiologicalMaterial12< -0.5895659 31 36.09462 38.32839 *
## 5) BiologicalMaterial12>=-0.5895659 52 75.90933 39.76885 *
## 3) ManufacturingProcess32>=0.1888699 61 132.74350 41.41951
## 6) BiologicalMaterial03< 1.089406 48 92.19183 41.08688 *
## 7) BiologicalMaterial03>=1.089406 13 15.63103 42.64769 *
plot(single_cmp$finalModel)
text(single_cmp$finalModel)
Plotting offers a more visual and intuitive depiction of the tree. The first split is on the process predictor ManufacturingProcess32 at approximately 0.19, and biological predictors then determine the terminal nodes on either side. Below the process split, BiologicalMaterial12 splits at approximately -0.59, leading to predicted yields of approximately 38.33 (below the split) and 39.77 (above it). Above the process split, BiologicalMaterial03 splits at approximately 1.09, leading to predicted yields of approximately 41.09 (below) and 42.65 (above). The terminal-node values are the mean yields of the observations in each partition.
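Those terminal-node means can also be read directly off the fitted rpart object; a small sketch, not part of the original analysis.
# Terminal nodes of the final single tree: node id, observation count, and mean yield.
single_cmp$finalModel$frame %>%
  rownames_to_column("node") %>%
  filter(var == "<leaf>") %>%
  select(node, n, mean_yield = yval)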
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer. doi:10.1007/978-1-4614-6849-3