library(tidyverse)
library(caret)
library(AppliedPredictiveModeling)
data(permeability)
fingerprints <- as.data.frame(fingerprints)
caret's nearZeroVar function uses the standard rules of thumb for flagging variables to remove, and we have no reason to adjust them at this point.
zero_vars <- nearZeroVar(fingerprints)
length(zero_vars)
## [1] 719
fingerprints_var <- fingerprints[, -zero_vars]
Nearly two-thirds of the variables in the dataset are removed.
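For reference, the rule of thumb is controlled by nearZeroVar's freqCut and uniqueCut arguments; a looser screen, if we ever wanted one, might look like this sketch (not run here):
# Sketch: a looser screen that keeps more sparse fingerprints.
# freqCut caps the ratio of the most common value to the second most common;
# uniqueCut is the minimum percentage of distinct values.
loose_zero_vars <- nearZeroVar(fingerprints, freqCut = 99/1, uniqueCut = 5)
length(loose_zero_vars)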
We split the data into training and test sets, holding out 20% of the observations. With so few observations, our priority is keeping as much data as possible in the training set, which will be used for repeated cross-validation.
In practice, I would probably not use a holdout set with this few observations, but that is what the book asked for.
set.seed(7)
samples <- createDataPartition(permeability, p = .8, list = F)
fp_train <- fingerprints_var[samples, ]
fp_test <- fingerprints_var[-samples, ]
perm_train <- permeability[samples, ]
perm_test <- permeability[-samples, ]
PLS should overcome collinearity on its own, but we will try giving it some help by removing highly correlated variables. The tradeoff is that we shrink our feature space.
too_high <- findCorrelation(cor(fp_train), .9)
fp_train_cor <- fp_train[, -too_high]
fp_test_cor <- fp_test[, -too_high]
We train a PLS model and seek the optimal number of components.
pls_test <- train(x = fp_train,
y = perm_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- pls_test$results
ggplot(rslts) + geom_line(aes(x = ncomp, y = RMSE))
library(kableExtra)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 13.12196 | 0.3153016 | 10.074064 | 2.635170 | 0.2117838 | 1.923137 |
| 2 | 11.47877 | 0.4827818 | 8.182957 | 2.893810 | 0.2190565 | 1.982311 |
| 3 | 11.20152 | 0.5168873 | 8.280295 | 2.539717 | 0.1871408 | 1.662693 |
| 4 | 11.36578 | 0.5089077 | 8.653261 | 2.521500 | 0.1779809 | 1.773574 |
| 5 | 11.27327 | 0.5245455 | 8.455785 | 2.681686 | 0.1879479 | 1.868318 |
| 6 | 11.24334 | 0.5277537 | 8.098340 | 2.571194 | 0.1844890 | 1.727003 |
| 7 | 11.42653 | 0.5187461 | 8.281599 | 2.340755 | 0.1788926 | 1.642486 |
| 8 | 11.67791 | 0.4981780 | 8.738919 | 2.123011 | 0.1689353 | 1.585042 |
| 9 | 11.98842 | 0.4794563 | 8.851377 | 2.229950 | 0.1783094 | 1.639184 |
| 10 | 12.16037 | 0.4749514 | 8.936781 | 2.414712 | 0.1820555 | 1.664207 |
The best cross-validated RMSE occurs at 3 components (R squared of about .52), though RMSE and R squared are fairly similar anywhere from 2 to 9 components.
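caret selects the final ncomp by lowest cross-validated RMSE, which we can confirm directly; if we preferred a more parsimonious fit, the one-standard-error rule is available through trainControl (a sketch, not used in the models below):
pls_test$bestTune # the ncomp caret chose (lowest cross-validated RMSE)
# Sketch: a control object that would pick the simplest model within one
# standard error of the best RMSE if passed to train().
ctrl_1se <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
selectionFunction = 'oneSE')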
We’ll try a model with the highly correlated predictors removed just for due diligence.
pls_test <- train(x = fp_train_cor,
y = perm_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- pls_test$results
ggplot(rslts) + geom_line(aes(x = ncomp, y = RMSE))
library(kableExtra)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 11.93364 | 0.4301069 | 8.822951 | 3.049226 | 0.2280876 | 2.202543 |
| 2 | 11.19312 | 0.4921207 | 8.264361 | 2.639869 | 0.2166940 | 1.740735 |
| 3 | 11.19351 | 0.4960587 | 8.419027 | 2.069418 | 0.1936079 | 1.493946 |
| 4 | 11.03624 | 0.5152311 | 8.241503 | 2.138118 | 0.1926534 | 1.610490 |
| 5 | 10.82605 | 0.5350954 | 8.137202 | 2.247647 | 0.2051616 | 1.598093 |
| 6 | 10.92083 | 0.5373558 | 8.130074 | 2.248815 | 0.2003508 | 1.706117 |
| 7 | 11.33000 | 0.5189290 | 8.300017 | 2.403725 | 0.2075739 | 1.789440 |
| 8 | 11.66446 | 0.4948652 | 8.662352 | 2.416649 | 0.2121677 | 1.913779 |
| 9 | 11.78705 | 0.4912127 | 8.840528 | 2.507220 | 0.2110571 | 2.046892 |
| 10 | 12.01933 | 0.4815073 | 8.912092 | 2.743909 | 0.2172953 | 2.233449 |
Since performance is about the same and it trims our feature set, we'll use the reduced data for the final PLS fit.
We choose the number of components based on RMSE.
library(pls)
mod1 <- plsr(perm_train ~ ., data = fp_train_cor, ncomp = 5)
preds <- predict(mod1, fp_test_cor)
df <- data.frame(obs = perm_test, pred = preds[,1,5])
defaultSummary(df)
## RMSE Rsquared MAE
## 12.2281388 0.4419219 9.0983921
The model performs somewhat worse on the test set, although given that the holdout set contains only about 30 observations, the uncertainty around these RMSE and R squared estimates is likely large.
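To put a rough number on that uncertainty, we could bootstrap the holdout predictions; a minimal sketch, reusing the df data frame built above:
# Sketch: bootstrap the holdout rows to gauge the spread of the test RMSE.
set.seed(7)
boot_rmse <- replicate(1000, {
idx <- sample(nrow(df), replace = TRUE)
caret::RMSE(df$pred[idx], df$obs[idx])
})
quantile(boot_rmse, c(.025, .975))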
ENET
We continue with the reduced dataset. The lasso penalty can eliminate variables entirely, but starting with roughly 130 predictors instead of over 300 makes its job a little easier without losing much information.
fp_tm <- as.matrix(fp_train_cor)
fp_tt <- as.matrix(fp_test_cor)
grid <- expand.grid(.lambda = c(0, .01, .1), .fraction = seq(.1,.9,.1))
enet_test <- train(x = fp_tm,
y = perm_train,
method = 'enet',
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
tuneGrid = grid
)
rslts <- enet_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| lambda | fraction | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.4 | 10.97565 | 0.5454836 | 8.103199 | 2.083525 | 0.1710363 | 1.700495 |
| 0.10 | 0.5 | 11.03603 | 0.5467752 | 8.133713 | 2.169483 | 0.1732243 | 1.751047 |
| 0.01 | 0.2 | 11.11475 | 0.5280797 | 8.195236 | 2.101289 | 0.1773798 | 1.722511 |
| 0.10 | 0.3 | 11.15948 | 0.5292068 | 8.198842 | 2.030621 | 0.1700960 | 1.639116 |
| 0.10 | 0.6 | 11.22027 | 0.5397668 | 8.266250 | 2.233676 | 0.1744998 | 1.741469 |
| 0.01 | 0.3 | 11.31193 | 0.5213694 | 8.330346 | 2.341844 | 0.1812038 | 1.850853 |
| 0.10 | 0.7 | 11.35702 | 0.5338386 | 8.381972 | 2.282186 | 0.1747442 | 1.733163 |
| 0.01 | 0.1 | 11.36865 | 0.5132990 | 8.161978 | 2.061692 | 0.1770246 | 1.636241 |
| 0.10 | 0.2 | 11.39648 | 0.5082092 | 8.224155 | 2.064574 | 0.1762908 | 1.605596 |
| 0.10 | 0.8 | 11.44244 | 0.5311521 | 8.455546 | 2.335632 | 0.1755044 | 1.722711 |
The best RMSEs nearly all occur with a lambda of .1, indicating a mixture of lasso and ridge penalties. The best fraction (the ratio of the L1 norm of the elastic net coefficients to that of the full least-squares fit) is between .1 and .7, depending on lambda. This suggests a large number of low-information, but not zero-information, variables are present.
library(elasticnet)
fp_tt <- as.matrix(fp_test_cor)
mod2 <- enet(x = fp_tm, y = perm_train, lambda = .1)
preds <- predict(mod2, s= .4, newx = fp_tt, mode = 'fraction', type = 'fit')
df1 <- data.frame(obs = perm_test, pred = preds$fit)
defaultSummary(df1)
## RMSE Rsquared MAE
## 11.6333462 0.4916524 8.2352733
RMSE on the test set is in line with, though slightly higher than, the cross-validated training estimate.
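As a quick check on the "many low-information variables" reading above, we can count how many coefficients remain nonzero at the chosen fraction; a sketch using elasticnet's predict method (type = 'coefficients' is assumed to return the coefficient vector, as in lars):
# Sketch: number of predictors with nonzero coefficients at s = .4.
coefs <- predict(mod2, s = .4, mode = 'fraction', type = 'coefficients')$coefficients
sum(coefs != 0)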
We'll try principal components regression, which should be a slightly worse version of PLS since its components are chosen without reference to the response.
PCR
pca_fit <- preProcess(fp_train, method = 'pca') # not strictly needed: method = 'pcr' below handles the PCA step internally
pca_train <- predict(pca_fit, fp_train)
lm_test <- train(x = fp_train,
y = perm_train,
method = 'pcr',
tuneGrid = data.frame(ncomp = seq(1, 80, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5)
)
rslts <- lm_test$results
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 14.75050 | 0.1121710 | 11.481926 | 2.703301 | 0.1068101 | 2.003189 |
| 2 | 14.78931 | 0.1050584 | 11.652487 | 2.652901 | 0.0963392 | 1.962427 |
| 3 | 13.38908 | 0.3066499 | 9.971547 | 2.693651 | 0.2273224 | 2.076134 |
| 4 | 11.66162 | 0.4642896 | 8.443879 | 2.588202 | 0.2401876 | 1.744123 |
| 5 | 11.66126 | 0.4627821 | 8.356554 | 2.609220 | 0.2392260 | 1.763022 |
| 6 | 11.71055 | 0.4558462 | 8.419351 | 2.524749 | 0.2409457 | 1.807729 |
| 7 | 11.74415 | 0.4499319 | 8.517642 | 2.510217 | 0.2370372 | 1.855262 |
| 8 | 11.80431 | 0.4448628 | 8.571406 | 2.485347 | 0.2343003 | 1.880409 |
| 9 | 11.88208 | 0.4413982 | 8.615894 | 2.531507 | 0.2357723 | 1.901882 |
| 10 | 11.79592 | 0.4511278 | 8.611095 | 2.623503 | 0.2399310 | 2.013587 |
With this simpler model, we still get within roughly 5-8% of the cross-validated RMSE of PLS and ENET.
Gradient Boosting
We'll try a more modern approach, gradient boosting, using the xgboost package. Since the fingerprint predictors are all binary, interactions are likely, and for the linear models they would have to be uncovered manually and added as new features; gradient boosting finds these interactions automatically.
grid<- expand.grid(eta = c(.005, .01, 0.025, 0.05), nrounds = c(200,500), max_depth = c(1,2,3), gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
gb_test <- train(x = fp_tm,
y = perm_train,
method = 'xgbTree',
tuneGrid = grid,
trControl = trainControl(method = 'cv', number = 10),
verbose = F
)
rslts <- gb_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.050 | 2 | 0 | 1 | 1 | 1 | 500 | 10.39426 | 0.5777854 | 7.308519 | 2.801858 | 0.2340349 | 2.040100 |
| 0.010 | 2 | 0 | 1 | 1 | 1 | 500 | 10.46240 | 0.5585730 | 7.265872 | 2.639403 | 0.2371244 | 1.852221 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 200 | 10.47386 | 0.5580647 | 7.284401 | 2.622707 | 0.2364705 | 1.829595 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 500 | 10.56748 | 0.5574036 | 7.356155 | 2.707002 | 0.2305747 | 1.945410 |
| 0.050 | 2 | 0 | 1 | 1 | 1 | 200 | 10.58906 | 0.5530681 | 7.295085 | 2.623186 | 0.2285004 | 1.869272 |
| 0.005 | 2 | 0 | 1 | 1 | 1 | 500 | 10.76921 | 0.5402104 | 7.306012 | 2.330568 | 0.2481567 | 1.592276 |
| 0.010 | 3 | 0 | 1 | 1 | 1 | 500 | 10.98923 | 0.5233653 | 7.388219 | 2.884528 | 0.2610558 | 2.021502 |
| 0.050 | 1 | 0 | 1 | 1 | 1 | 500 | 10.99310 | 0.5511968 | 7.970474 | 1.757212 | 0.1687718 | 1.520256 |
| 0.010 | 2 | 0 | 1 | 1 | 1 | 200 | 11.00459 | 0.5333358 | 7.344646 | 2.160730 | 0.2510869 | 1.486226 |
| 0.005 | 3 | 0 | 1 | 1 | 1 | 500 | 11.01229 | 0.5164521 | 7.397330 | 2.759216 | 0.2711515 | 1.768182 |
We get about a 5% improvement in cross-validated RMSE versus the linear models.
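For completeness, the tuned boosting model could be scored on the same holdout set as the linear models; a sketch (results not shown here):
# Sketch: holdout performance of the tuned xgboost model.
gb_preds <- predict(gb_test, newdata = fp_tt)
defaultSummary(data.frame(obs = perm_test, pred = gb_preds))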
Out of these models, I would choose ENET because it preserves the original variables, making the model easier to present. ENET slightly outperformed the PLS model on the holdout set, but even if it hadn't predicted quite as well, it would remain my choice.
We'll examine observed versus predicted values for the ENET fit, along with its most important variables.
preds <- predict(mod2, s= .4, newx = fp_tm, mode = 'fraction', type = 'fit')
train_df <- data.frame(obs = perm_train, pred = preds$fit)
ggplot(train_df) + geom_point(aes(x = pred, y = obs)) + geom_smooth(aes(x = pred, y = obs))
imp <- varImp(enet_test)
plot(imp, top = 20)
Recommendation
I would recommend the ENET model for prescreening compounds before they are given the laboratory permeability assay. This would cut costs, provided the screening cutoff is chosen carefully so that we neither miss many permeable compounds nor send too many impermeable ones on to the assay.
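As a concrete illustration, the prescreen would flag compounds whose predicted permeability clears some cutoff; a sketch of that tradeoff on the holdout set, using a purely illustrative cutoff of 10:
# Sketch: confusion counts for a hypothetical permeability cutoff of 10
# (the cutoff would need to be set with the lab, not from this data alone).
cutoff <- 10
table(predicted_permeable = df1$pred > cutoff,
actually_permeable = df1$obs > cutoff)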
data(ChemicalManufacturingProcess)
First, we remove any near-zero-variance variables; in a small dataset this can occur even in continuous predictors.
zero_vars <- nearZeroVar(ChemicalManufacturingProcess)
length(zero_vars)
## [1] 1
ChemicalManufacturingProcess_new <- ChemicalManufacturingProcess[, -zero_vars]
We then use the mice package to impute missing values with random forest models. We choose random forests because they don't require much tuning and can handle data with more variables than observations.
library(naniar)
library(mice)
gg_miss_upset(ChemicalManufacturingProcess_new, nsets = 5, nintersects = 5)
gg_miss_case(ChemicalManufacturingProcess_new)
library(mice)
# Note: ManufacturingProcess36 is dropped here and is not used in the rest of the analysis.
pred <- dplyr::select(ChemicalManufacturingProcess_new, -ManufacturingProcess36) %>%
mice(m = 3, method = 'rf', maxit = 2)
imputed <- complete(pred)
# A second pass over the completed data has nothing left to impute
# (note the empty variable column in the output below).
pred <- mice(data = imputed, m = 3, method = 'rf', maxit = 2)
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 2 1
## 2 2
## 2 3
chem_imputed <- complete(pred)
Next, we get a sense of the correlation structure to see how much of a problem multicollinearity is. This will inform whether we lean on elastic net or PLS.
library(corrplot)
cor_mat <- cor(chem_imputed) %>%
corrplot(order = 'hclust',tl.pos='n')
Collinearity appears to be a moderate problem, so we start with PLS.
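To put a rough number on "moderate", we can count how many predictors caret's findCorrelation would flag at a .9 cutoff; a quick diagnostic sketch only, since we don't drop anything here:
# Sketch: how many predictors would a .9 pairwise-correlation filter remove?
length(findCorrelation(cor(dplyr::select(chem_imputed, -Yield)), cutoff = .9))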
We'll also look for skewed variables to see whether a Box-Cox transformation is necessary.
library(psych)
ins_desc <- chem_imputed %>%
keep(is.numeric) %>%
describe() %>%
dplyr::select(-c(vars,n,range, mad, trimmed)) %>%
arrange(desc(abs(skew))) %>%
head(10)
rns <- rownames(ins_desc)
round_2 <- function(col) round(col,2)
ins_desc2 <- ins_desc %>%
mutate_all(round_2)
rownames(ins_desc2) <- rns
ins_desc2 %>%
kable %>%
kable_styling
| mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|
| 6016.36 | 458.21 | 6047.00 | 0 | 6161.0 | -12.86 | 165.88 | 34.54 |
| 4828.28 | 368.12 | 4854.00 | 0 | 4990.0 | -12.82 | 165.15 | 27.75 |
| 4809.68 | 367.48 | 4835.00 | 0 | 4971.0 | -12.74 | 163.74 | 27.70 |
| 4562.88 | 348.95 | 4586.00 | 0 | 4710.0 | -12.70 | 163.12 | 26.30 |
| 4556.46 | 349.01 | 4582.00 | 0 | 4759.0 | -12.64 | 162.07 | 26.31 |
| 4565.80 | 351.70 | 4588.00 | 0 | 4852.0 | -12.42 | 158.40 | 26.51 |
| 70.20 | 5.48 | 70.80 | 0 | 72.5 | -12.00 | 150.39 | 0.41 |
| 20.02 | 1.64 | 19.95 | 0 | 22.0 | -10.20 | 122.44 | 0.12 |
| 0.91 | 0.87 | 0.80 | 0 | 11.0 | 9.05 | 101.03 | 0.07 |
| 11.21 | 1.94 | 11.60 | 0 | 12.1 | -5.45 | 28.53 | 0.15 |
With 10 variables showing an absolute skew of at least 5, a Box-Cox transformation appears advisable.
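One caveat: the Box-Cox transformation requires strictly positive values, and caret's preProcess skips predictors it cannot estimate a lambda for, so it is worth checking how many columns contain zeros; a quick sketch:
# Sketch: count predictors with non-positive values, which BoxCox will leave untransformed.
sum(sapply(dplyr::select(chem_imputed, -Yield), function(x) any(x <= 0)))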
yield <- chem_imputed$Yield
all_x <- select(chem_imputed, - Yield)
samples <- createDataPartition(yield, p = .8, list = F)
cm_train <- all_x[samples, ]
cm_test <- all_x[-samples, ]
yield_train <- yield[samples]
yield_test<- yield[-samples]
PLS
bc_trans <- preProcess(cm_train, method = 'BoxCox')
cm_train_bc <- predict(bc_trans, cm_train)
pls_test <- train(x = cm_train_bc,
y = yield_train,
method = 'pls',
tuneGrid = data.frame(ncomp = seq(1, 12, 1)),
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
preProcess = c('center', 'scale')
)
rslts <- pls_test$results
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 1.479064 | 0.4581292 | 1.163280 | 0.4062373 | 0.1938449 | 0.2566277 |
| 2 | 1.911005 | 0.4586269 | 1.231900 | 1.3224469 | 0.2699577 | 0.4995333 |
| 3 | 1.591002 | 0.5186443 | 1.137000 | 0.9940874 | 0.2159254 | 0.3566610 |
| 4 | 1.959623 | 0.5244660 | 1.225299 | 2.0875400 | 0.2185787 | 0.6260072 |
| 5 | 2.281869 | 0.5120782 | 1.314499 | 2.8707004 | 0.2322308 | 0.8474590 |
| 6 | 2.475436 | 0.5008570 | 1.379368 | 3.2514094 | 0.2363079 | 0.9367181 |
| 7 | 2.870926 | 0.4976666 | 1.489454 | 4.3670259 | 0.2393440 | 1.2370356 |
| 8 | 3.200531 | 0.4854751 | 1.599387 | 5.1668915 | 0.2431496 | 1.4526766 |
| 9 | 3.555623 | 0.4848262 | 1.704430 | 6.1139923 | 0.2416043 | 1.6943596 |
| 10 | 3.788721 | 0.4821295 | 1.766936 | 6.6883097 | 0.2396317 | 1.8436993 |
ggplot(rslts) + geom_point(aes(x = ncomp, y = RMSE))
ElasticNet
cm_train_m <- as.matrix(cm_train_bc)
grid <- expand.grid(.lambda = c(0,.01, .1), .fraction = seq(.1,.9,.1))
enet_test <- train(x = cm_train_m,
y = yield_train,
method = 'enet',
trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 5),
tuneGrid = grid,
preProcess = c('center', 'scale')
)
rslts <- enet_test$results %>%
arrange(RMSE) %>%
head(10)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| lambda | fraction | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.2 | 1.245793 | 0.6103088 | 1.0307728 | 0.2213538 | 0.1535328 | 0.1849439 |
| 0.00 | 0.1 | 1.246954 | 0.5983720 | 0.9697345 | 0.3709042 | 0.1854045 | 0.2044358 |
| 0.10 | 0.3 | 1.254930 | 0.5851604 | 0.9907774 | 0.3126607 | 0.1830733 | 0.1989275 |
| 0.01 | 0.2 | 1.279426 | 0.5885692 | 0.9875599 | 0.4166481 | 0.1894413 | 0.2111994 |
| 0.01 | 0.1 | 1.282634 | 0.6053662 | 1.0584336 | 0.2258827 | 0.1502298 | 0.1848459 |
| 0.01 | 0.3 | 1.325078 | 0.5936338 | 0.9909098 | 0.5913607 | 0.1976089 | 0.2508474 |
| 0.10 | 0.4 | 1.329781 | 0.5871102 | 0.9920491 | 0.6230484 | 0.1996715 | 0.2541581 |
| 0.00 | 0.2 | 1.409977 | 0.5813539 | 1.0316194 | 0.7260290 | 0.2029704 | 0.2893751 |
| 0.01 | 0.4 | 1.427944 | 0.5955169 | 1.0201733 | 0.9905075 | 0.2047506 | 0.3793446 |
| 0.10 | 0.5 | 1.464479 | 0.5841312 | 1.0278469 | 1.0353664 | 0.2147077 | 0.3513381 |
Elastic net appears to perform about 15% better than the PLS regression by cross-validated RMSE. It is also notable that the PLS regression keeps only 1 component, though the best ENET fractions are similarly low (.1 to .2). We'll continue with the ENET model.
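If the two train() calls shared the same resampling folds (for example via a common set.seed before each call, or an explicit index in trainControl), caret's resamples() would give a cleaner paired comparison than eyeballing the tables; a sketch under that assumption:
# Sketch: paired comparison of cross-validated performance
# (only meaningful if both models were fit on identical resampling folds).
comp <- resamples(list(PLS = pls_test, ENET = enet_test))
summary(comp)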
preproc <- preProcess(cm_train, method = c('BoxCox', 'center', 'scale'))
cm_train_proc <- predict(preproc, cm_train) %>% as.matrix()
cm_test_proc <- predict(preproc, cm_test) %>% as.matrix()
enet_model_cm <- enet(x = cm_train_proc , y = yield_train, lambda = 0)
predicted <-predict(enet_model_cm, s= .1, newx = cm_train_proc, mode = 'fraction', type = 'fit') %>% as.data.frame()
preds_df <- data.frame(predicted = predicted$fit, actual = yield_train)
ggplot(preds_df) + geom_point(aes(x = predicted, y = actual)) + geom_smooth(aes(x = predicted, y = actual))
Our fit is fairly linear. If a considerable amount of nonlinearity remained after the Box-Cox transformation, we would be hard-pressed to find a fix through predictor transformations with this many variables to worry about, and the classic diagnostics for nonlinear relationships, like marginal model plots, are less effective here. We would have two main options, one traditional and one I've had some success with on smaller problems; both are sketched below.
# Brute-force option: append a squared version of every predictor.
square_it <- function(x) x^2
squared <- mutate_all(cm_train, square_it) %>%
rename_with(~ paste0(.x, '_sq')) # avoid duplicate column names in the cbind below
all_train <- cbind(cm_train_bc, squared) %>% as.matrix()
This brute-force approach to handling nonlinearity is pretty suboptimal; this is where nonlinear methods have their advantage.
grid<- expand.grid(eta = c(.01, 0.025, 0.1), nrounds = c(200, 500), max_depth = c(1,2,3), gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
gb_test <- train(x = cm_train_m,
y = yield_train,
method = 'xgbTree',
tuneGrid = grid,
trControl = trainControl(method = 'cv', number = 10),
verbose = F
)
rslts <- gb_test$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
| eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.025 | 3 | 0 | 1 | 1 | 1 | 500 | 1.029006 | 0.7030420 | 0.7869483 | 0.1539676 | 0.1313352 | 0.1051852 |
| 0.100 | 3 | 0 | 1 | 1 | 1 | 200 | 1.035521 | 0.6931972 | 0.7895210 | 0.1891608 | 0.1530216 | 0.1568017 |
| 0.100 | 3 | 0 | 1 | 1 | 1 | 500 | 1.037953 | 0.6914289 | 0.7890969 | 0.1876309 | 0.1518748 | 0.1574077 |
| 0.025 | 2 | 0 | 1 | 1 | 1 | 500 | 1.053282 | 0.6871506 | 0.8054012 | 0.1629783 | 0.1504761 | 0.1346482 |
| 0.100 | 2 | 0 | 1 | 1 | 1 | 200 | 1.055557 | 0.6803353 | 0.8061325 | 0.1737833 | 0.1440501 | 0.1474035 |
| 0.100 | 2 | 0 | 1 | 1 | 1 | 500 | 1.061814 | 0.6769401 | 0.8048745 | 0.1841962 | 0.1403903 | 0.1509016 |
| 0.025 | 3 | 0 | 1 | 1 | 1 | 200 | 1.102290 | 0.6929702 | 0.8370630 | 0.1201180 | 0.1241807 | 0.0776431 |
| 0.010 | 3 | 0 | 1 | 1 | 1 | 500 | 1.106824 | 0.6928414 | 0.8409648 | 0.1141102 | 0.1201056 | 0.0823961 |
| 0.100 | 1 | 0 | 1 | 1 | 1 | 200 | 1.126361 | 0.6429998 | 0.8733945 | 0.1743432 | 0.1631040 | 0.1311516 |
| 0.025 | 1 | 0 | 1 | 1 | 1 | 500 | 1.136768 | 0.6362508 | 0.8708829 | 0.1589034 | 0.1533987 | 0.1054712 |
The GB model improves on elastic net by roughly 17% in cross-validated RMSE.
# predict() returned infinite values with the original lambda = 0 fit, so we
# use lambda = .1, which was also the best cross-validated value.
enet_model_cm <- enet(x = cm_train_proc, y = yield_train, lambda = .1)
test_preds <- predict(enet_model_cm, s = .2, newx = cm_test_proc, mode = 'fraction', type = 'fit') %>% as.data.frame()
df1 <- data.frame(obs = yield_test, pred = test_preds$fit)
defaultSummary(df1)
## RMSE Rsquared MAE
## 1.2675071 0.5206818 1.0310448
The test-set metrics are quite similar to the cross-validated training estimates. We expect a small amount of degradation, and in this case it is under 10% by RMSE.
ggplot(df1) + geom_point(aes(x = pred, y = obs)) + geom_smooth(aes(x = pred, y = obs))
The nonlinearity in this plot is somewhat concerning, but with only about 32 test observations, we shouldn't put too much stock in it. In practice, though, I would probably consider a different model.
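Before switching, the boosted model's holdout performance could be checked the same way; it was trained on the Box-Cox-transformed matrix, so the test set needs the same transformation. A sketch (results not shown):
# Sketch: holdout performance of the xgboost model.
cm_test_bc <- predict(bc_trans, cm_test) # same Box-Cox step used for the training matrix
gb_preds <- predict(gb_test, newdata = as.matrix(cm_test_bc))
defaultSummary(data.frame(obs = yield_test, pred = gb_preds))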
imp <- varImp(enet_test)
plot(imp, top = 55)
imp <- varImp(enet_test)
plot(imp, top = 20)
Observations
We will single out the top variables and examine their relationships with the target. This is usually one of the first phases of modeling, but with so many variables, we postponed it until the list could be narrowed down.
importants <- imp$importance %>%
rownames_to_column() %>%
arrange(desc(Overall)) %>%
head(12)
only_importants <- chem_imputed[,importants$rowname]
only_importants$yield <- yield
library(gridExtra)
plot_it <- function(col){
ggplot(only_importants) + geom_point(aes_string(x = col, y = 'yield'), color = 'black') +
geom_smooth(aes_string(x = col, y = 'yield'))
}
# Plot every important predictor against yield (drop the yield column itself).
plots <- lapply(head(colnames(only_importants), -1), plot_it)
grid.arrange(grobs = plots, ncol = 4, nrow = 3)
We add a smooth to each plot to summarize the relationship.
Modeling should be an iterative process, and at this point we would use this information to improve our model building, noting the nonlinearity in several of the most important variables.
Recommendation
The manufacturing process settings can be optimized using this model if it is applied carefully. The relationship plots shown above are helpful background, but they do not account for the presence of the other variables; to help decide which levers to turn, added-variable plots would be more appropriate. The final decision should be made by plugging candidate predictor settings into the model, rather than by reading the plots alone.
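As a concrete illustration of plugging candidate settings into the model, a single hypothetical change to one process variable could be scored like this (a sketch; ManufacturingProcess32 and the 5% bump are purely illustrative, not a recommendation):
# Sketch: predicted yield for a hypothetical tweak to one process setting.
new_run <- cm_train[1, ] # start from an observed production run
new_run$ManufacturingProcess32 <- new_run$ManufacturingProcess32 * 1.05 # hypothetical +5%
new_run_proc <- predict(preproc, new_run) %>% as.matrix() # same preprocessing as training
predict(enet_model_cm, s = .2, newx = new_run_proc, mode = 'fraction', type = 'fit')$fit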