library(gridExtra)
library(ggplot2)
library(cowplot)
options(scipen=10000)
library(mlbench)
library(tidyverse)
library(corrplot)
library(AppliedPredictiveModeling)
library(caret)
library(DataExplorer)
library(kableExtra)
library(mice)
library(pls)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nearZero <- nearZeroVar(fingerprints)
fingerprintsC <- fingerprints[,-nearZero]
Number of variables left:
ncol(fingerprintsC)
## [1] 388
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2 ?
set.seed(2425) # For reproducibility
trainIndex <- sample(1:nrow(fingerprintsC), size = 0.8 * nrow(fingerprintsC))
fingerprintsMerged <- cbind(fingerprintsC, permeability) |> as_data_frame()
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
## tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#
fingerprintsTrain <- fingerprintsMerged[trainIndex,]
fingerprintsTest <- fingerprintsMerged[-trainIndex,]
# str(fingerprintsTrain)
fingerprintsTrainNoPr <- fingerprintsTrain |> select(-permeability)
fingerCorr <- cor(fingerprintsTrainNoPr)
# findCorrelation(fingerCorr,0.75)
set.seed(2425) # For reproducibility
trControl <- trainControl(method = "cv",number = 10)
plsModel <- train(fingerprintsTrainNoPr,fingerprintsTrain$permeability ,method = "pls", trControl = trControl, preProcess = c("center","scale"), tuneLength = 10)
plsModel
## Partial Least Squares
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.93575 0.3717458 9.255064
## 2 10.47966 0.5543316 7.489034
## 3 10.27631 0.5781111 7.668499
## 4 10.53127 0.5186387 7.818789
## 5 10.77483 0.4969523 7.691391
## 6 11.10669 0.4814662 8.064078
## 7 11.22050 0.4755242 8.150355
## 8 11.40249 0.4715207 8.341430
## 9 11.51949 0.4559837 8.413719
## 10 11.62321 0.4554849 8.342049
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(plsModel)
The optimal number of latent variables is 3 and the R2 stimate is 0.5781111, which is they highest among the number of components tried.
(d) Predict the response for the test set. What is the test set estimate of R2 ?
# plsModel <- pls::plsr(permeability ~.,data=fingerprintsTrain)
modelPred <- predict(plsModel,newdata = fingerprintsTest)
postResample(pred = modelPred, obs = fingerprintsTest$permeability)
## RMSE Rsquared MAE
## 14.9302821 0.2709311 11.5245778
The test set stimated R2 is 0.2709311.
(e) Try building other models discussed in this chapter. Do any have better predictive performance?
Ridge Regression, parameter tuned: lambda (from 0 to 1 by 0.1)
set.seed(2425) # For reproducibility
ridgeFit <- train(
x=fingerprintsTrainNoPr,
y=fingerprintsTrain$permeability,
method='ridge',
metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
ridgeFit
## Ridge Regression
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 802367780.01868 0.09887969 214441826.400764
## 0.1 12.20228 0.43842501 8.505063
## 0.2 11.94297 0.47885042 8.461151
## 0.3 12.05871 0.49833946 8.619401
## 0.4 12.32543 0.51121382 8.895922
## 0.5 12.70289 0.52049743 9.236827
## 0.6 13.15710 0.52781833 9.640106
## 0.7 13.66675 0.53397421 10.108291
## 0.8 14.23771 0.53898576 10.631019
## 0.9 14.84155 0.54340174 11.168300
## 1.0 15.47789 0.54721903 11.711016
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
Lasso Regression, parameter tuned: fraction (from 0 to 0.5 by 0.05)
set.seed(2425) # For reproducibility
lassoFit <- train(
x=fingerprintsTrainNoPr,
y=fingerprintsTrain$permeability,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(0.05, 0.5, by=0.05)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
lassoFit
## The lasso
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.05 40041544 0.1738815 10701561
## 0.10 80163975 0.1486392 21424732
## 0.15 120286404 0.1486392 32147902
## 0.20 160408834 0.1486392 42871073
## 0.25 200531264 0.1486392 53594243
## 0.30 240653694 0.1486392 64317414
## 0.35 280776124 0.1572672 75040584
## 0.40 320898555 0.1338068 85763755
## 0.45 361020989 0.1129875 96486927
## 0.50 401143424 0.1042411 107210099
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
Elastic net, parameters tuned: fraction and lambda (2-D grid with each D from 0 to 1 by 0.1)
set.seed(2425) # For reproducibility
enetFit <- train(
x=fingerprintsTrainNoPr,
y=fingerprintsTrain$permeability,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
enetFit
## Elasticnet
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.260780 NaN 12.122307
## 0.0 0.1 80163974.510689 0.14863919 21424731.912977
## 0.0 0.2 160408834.393874 0.14863919 42871072.822539
## 0.0 0.3 240653694.277060 0.14863919 64317413.732102
## 0.0 0.4 320898555.359768 0.13380678 85763755.272878
## 0.0 0.5 401143424.142742 0.10424107 107210099.435151
## 0.0 0.6 481388295.101290 0.09868425 128656444.939307
## 0.0 0.7 561633166.764311 0.09818927 150102790.679683
## 0.0 0.8 641878037.909563 0.09839624 171549135.978401
## 0.0 0.9 722122909.115455 0.09866210 192995481.277120
## 0.0 1.0 802367780.018678 0.09887969 214441826.400764
## 0.1 0.0 14.622760 NaN 11.559912
## 0.1 0.1 9.991123 0.63324472 7.189238
## 0.1 0.2 10.240873 0.56809715 7.248915
## 0.1 0.3 10.586953 0.53481599 7.557206
## 0.1 0.4 10.838509 0.51881105 7.763374
## 0.1 0.5 11.058863 0.50424115 7.885955
## 0.1 0.6 11.265256 0.49078354 8.012734
## 0.1 0.7 11.504053 0.47651658 8.141108
## 0.1 0.8 11.778516 0.46146610 8.289844
## 0.1 0.9 12.019729 0.44831371 8.425988
## 0.1 1.0 12.202285 0.43842501 8.505063
## 0.2 0.0 14.622760 NaN 11.559912
## 0.2 0.1 10.198479 0.61203674 7.356762
## 0.2 0.2 10.070647 0.60636155 6.998864
## 0.2 0.3 10.515833 0.55988190 7.418731
## 0.2 0.4 10.756930 0.54437979 7.698070
## 0.2 0.5 10.984986 0.53200746 7.903420
## 0.2 0.6 11.185500 0.52025561 8.023412
## 0.2 0.7 11.399790 0.50863017 8.132020
## 0.2 0.8 11.612777 0.49645256 8.253235
## 0.2 0.9 11.801939 0.48573133 8.372382
## 0.2 1.0 11.942973 0.47885042 8.461151
## 0.3 0.0 14.622760 NaN 11.559912
## 0.3 0.1 10.266780 0.60650807 7.442365
## 0.3 0.2 10.016172 0.62652825 6.812753
## 0.3 0.3 10.523341 0.57870120 7.306924
## 0.3 0.4 10.850755 0.55849905 7.710320
## 0.3 0.5 11.097343 0.54797229 7.970797
## 0.3 0.6 11.321812 0.53562933 8.146556
## 0.3 0.7 11.550657 0.52419889 8.298958
## 0.3 0.8 11.749800 0.51413129 8.429755
## 0.3 0.9 11.925599 0.50490435 8.540425
## 0.3 1.0 12.058710 0.49833946 8.619401
## 0.4 0.0 14.622760 NaN 11.559912
## 0.4 0.1 10.302765 0.60627823 7.480312
## 0.4 0.2 10.050769 0.63429937 6.699186
## 0.4 0.3 10.645382 0.59216014 7.314368
## 0.4 0.4 11.033873 0.56950921 7.782967
## 0.4 0.5 11.300823 0.55860473 8.072870
## 0.4 0.6 11.564261 0.54655473 8.305049
## 0.4 0.7 11.806510 0.53495112 8.501374
## 0.4 0.8 12.015222 0.52579416 8.673454
## 0.4 0.9 12.182226 0.51775128 8.801622
## 0.4 1.0 12.325429 0.51121382 8.895922
## 0.5 0.0 14.622760 NaN 11.559912
## 0.5 0.1 10.348790 0.60490572 7.509109
## 0.5 0.2 10.133114 0.63761629 6.620225
## 0.5 0.3 10.778051 0.60456240 7.343278
## 0.5 0.4 11.273302 0.57867838 7.876785
## 0.5 0.5 11.595033 0.56650571 8.246186
## 0.5 0.6 11.884916 0.55477573 8.549504
## 0.5 0.7 12.156357 0.54322314 8.785597
## 0.5 0.8 12.377407 0.53419776 8.973409
## 0.5 0.9 12.548489 0.52678779 9.123832
## 0.5 1.0 12.702892 0.52049743 9.236827
## 0.6 0.0 14.622760 NaN 11.559912
## 0.6 0.1 10.383723 0.60273088 7.517420
## 0.6 0.2 10.242833 0.63848585 6.589084
## 0.6 0.3 10.960323 0.61225872 7.427103
## 0.6 0.4 11.563837 0.58565197 8.089024
## 0.6 0.5 11.959282 0.57157519 8.532573
## 0.6 0.6 12.298621 0.55995393 8.877815
## 0.6 0.7 12.572811 0.54997502 9.133085
## 0.6 0.8 12.806454 0.54125399 9.345945
## 0.6 0.9 12.989779 0.53394625 9.511248
## 0.6 1.0 13.157102 0.52781833 9.640106
## 0.7 0.0 14.622760 NaN 11.559912
## 0.7 0.1 10.417098 0.60087368 7.515903
## 0.7 0.2 10.383610 0.63694858 6.559416
## 0.7 0.3 11.170144 0.61992591 7.563298
## 0.7 0.4 11.901411 0.59151317 8.381715
## 0.7 0.5 12.373171 0.57591359 8.902985
## 0.7 0.6 12.737873 0.56500173 9.283757
## 0.7 0.7 13.040165 0.55557176 9.562922
## 0.7 0.8 13.283924 0.54735982 9.780256
## 0.7 0.9 13.485332 0.54003335 9.956210
## 0.7 1.0 13.666753 0.53397421 10.108291
## 0.8 0.0 14.622760 NaN 11.559912
## 0.8 0.1 10.433672 0.59905991 7.493092
## 0.8 0.2 10.534394 0.63585535 6.531095
## 0.8 0.3 11.421544 0.62447326 7.780023
## 0.8 0.4 12.274933 0.59636197 8.731893
## 0.8 0.5 12.812030 0.58032449 9.319917
## 0.8 0.6 13.222684 0.56931871 9.733258
## 0.8 0.7 13.548507 0.56039507 10.013897
## 0.8 0.8 13.815509 0.55242248 10.251490
## 0.8 0.9 14.038219 0.54509352 10.452206
## 0.8 1.0 14.237709 0.53898576 10.631019
## 0.9 0.0 14.622760 NaN 11.559912
## 0.9 0.1 10.461544 0.59764715 7.483664
## 0.9 0.2 10.727567 0.63327253 6.523331
## 0.9 0.3 11.705305 0.62728783 8.021158
## 0.9 0.4 12.684318 0.60044864 9.104595
## 0.9 0.5 13.285359 0.58421467 9.735267
## 0.9 0.6 13.740778 0.57301853 10.187436
## 0.9 0.7 14.090141 0.56456704 10.481302
## 0.9 0.8 14.382995 0.55671392 10.733571
## 0.9 0.9 14.619484 0.54963698 10.961512
## 0.9 1.0 14.841545 0.54340174 11.168300
## 1.0 0.0 14.622760 NaN 11.559912
## 1.0 0.1 10.486694 0.59646586 7.469183
## 1.0 0.2 10.953379 0.62956321 6.545608
## 1.0 0.3 12.019663 0.62888116 8.267122
## 1.0 0.4 13.114747 0.60392930 9.479820
## 1.0 0.5 13.792191 0.58785747 10.159859
## 1.0 0.6 14.289489 0.57625919 10.650704
## 1.0 0.7 14.667637 0.56789661 10.964553
## 1.0 0.8 14.987442 0.56021606 11.247447
## 1.0 0.9 15.232649 0.55359752 11.486376
## 1.0 1.0 15.477888 0.54721903 11.711016
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.6.
resamp <- resamples(list(PLS=plsModel, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
resamp
##
## Call:
## resamples.default(x = list(PLS = plsModel, Ridge = ridgeFit, Lasso =
## lassoFit, enet = enetFit))
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
## Performance metrics: MAE, RMSE, Rsquared
## Time estimates for: everything, final model fit
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: PLS, Ridge, Lasso, enet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu.
## PLS 4.419257 5.534504 7.818659 7.668499 9.553810
## Ridge 7.004522 9.234092 11.648974 11.711016 14.915608
## Lasso 11.202097 11.574198 11.946299 10701561.102331 16052336.052448
## enet 2.656191 4.651612 6.985758 6.589084 8.330346
## Max. NA's
## PLS 10.98701 0
## Ridge 16.13178 0
## Lasso 32104660.15860 7
## enet 10.45220 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu.
## PLS 5.208639 7.147750 9.673697 10.27631 12.86269
## Ridge 9.009285 11.310016 15.037590 15.47789 18.80961
## Lasso 16.055141 16.274349 16.493558 40041543.73132 60062307.56941
## enet 3.815536 7.532664 11.020318 10.24283 12.97799
## Max. NA's
## PLS 16.24362 0
## Ridge 23.22932 0
## Lasso 120124598.64527 7
## enet 16.44571 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.13651651 0.3544622 0.6815658 0.5781111 0.7840040 0.8851359 0
## Ridge 0.17950431 0.3143576 0.6232016 0.5472190 0.7157092 0.8876797 0
## Lasso 0.02852087 0.1278148 0.2271087 0.1738815 0.2465618 0.2660149 7
## enet 0.19313929 0.3871844 0.7883984 0.6384858 0.8268568 0.9357691 0
ridgeFitPred <- predict(ridgeFit,newdata = fingerprintsTest)
lassoFitPred <- predict(lassoFit,newdata = fingerprintsTest)
enetFitPred <- predict(enetFit,newdata = fingerprintsTest)
Ridge Results
ridgeRes <- postResample(pred = ridgeFitPred, obs = fingerprintsTest$permeability)
ridgeRes
## RMSE Rsquared MAE
## 19.1487094 0.3122149 14.8732077
Lasso Results
lassoRes <- postResample(pred = lassoFitPred, obs = fingerprintsTest$permeability)
lassoRes
## RMSE Rsquared MAE
## 1095.97427689 0.03469573 507.51889060
Elastic Net Results
enetRes <- postResample(pred = enetFitPred, obs = fingerprintsTest$permeability)
enetRes
## RMSE Rsquared MAE
## 16.7444104 0.2166588 11.3697266
When evaluating both training and test data metrics, it’s essential to identify a model that balances predictive accuracy and generalizability. Elastic Net (enet) showed strong performance on the training data with the lowest mean MAE (6.59), a competitive mean RMSE (10.24), and the highest mean R-squared (0.64), indicating an excellent fit on the training set. However, its test data performance reveals some weaknesses. While it maintains a decent test RMSE (16.74) and a competitive MAE (11.37), its test R-squared drops to 0.22, the lowest among the models, suggesting potential overfitting. This drop implies that enet may capture the training data well but struggles to generalize to new, unseen data.
Partial Least Squares (PLS) provides a better balance between training and test performance. On the training data, it showed a mean MAE of 7.67, a mean RMSE of 10.28, and a mean R-squared of 0.58, indicating reliable predictive power. More importantly, PLS performs best on the test data with the lowest RMSE (14.93) and a competitive MAE (11.52). Although its test R-squared (0.27) is not the highest, it still represents a reasonable level of variance explanation and suggests that PLS generalizes better than enet. This makes PLS a strong candidate for reliable performance across different data sets.
Ridge regression, while showing moderate performance in training with higher mean MAE (11.71) and RMSE (15.48), surprises with the highest test R-squared (0.31). This indicates that Ridge explains more variance on the test data than the other models. However, its higher test RMSE (19.15) and MAE (14.87) suggest that its predictions are less accurate compared to PLS and enet, indicating that while it captures variance well, it sacrifices predictive accuracy.
Lasso is consistently the worst performer across both training and test data, with extremely high errors and a very low R-squared, indicating instability and poor predictive ability. It is not a viable option due to these significant shortcomings.
In conclusion, PLS stands out as the best model when considering both training and test data. It maintains a strong balance with the lowest test RMSE, competitive MAE, and reasonable R-squared, indicating that it generalizes effectively while maintaining good predictive accuracy. Elastic Net can be a secondary choice due to its competitive test errors, but its low test R-squared signals potential overfitting. Ridge, despite its high test R-squared, is less preferred due to higher errors, making PLS the optimal model for robust and balanced performance.
(f) Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend replacing the permeability laboratory experiment entirely with any of the models due to limitations in accuracy and reliability. While models like Partial Least Squares (PLS) and Elastic Net (enet) demonstrated solid performance and could serve as valuable tools for augmenting lab work by providing preliminary predictions, they fall short of the high level of precision needed to fully replace experimental methods. Data-driven models rely heavily on the quality and representativeness of the training data and may not account for the full complexity and variability of real-world conditions. Without rigorous validation on independent datasets, there is a risk that these models may not consistently meet the standards required for reliable permeability assessments. Therefore, while these models could complement and optimize laboratory work by screening and prioritizing samples, they should not be used as replacements for the proven accuracy of laboratory experiments.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
data("ChemicalManufacturingProcess")
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
Identify variables in the ChemicalManufacturingProcess dataset that have near-zero variance. These variables are characterized by very little variation across observations and can contribute little to predictive models, potentially adding noise or redundancy.
cherNearZero <- nearZeroVar(ChemicalManufacturingProcess)
chermical <- ChemicalManufacturingProcess[,-cherNearZero]
Impute values using The mice() function from the mice package, which is used to perform multiple imputations on the chermical dataset to fill in missing values. The method used is predictive mean matching (method = ‘pmm’), which ensures that imputed values are plausible and within the range of observed data.
set.seed(2425) # For reproducibility
chermical_imp <- mice(chermical, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)
# chermical_imp <- mice(ChemicalManufacturingProcess, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)
chermical_comp <- complete(chermical_imp,1)
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Creating Training and Test Data
cherTrainIndex <- sample(1:nrow(chermical_comp), size = 0.8 * nrow(chermical_comp))
chermicalTrain <- chermical_comp[cherTrainIndex,]
chermicalTest <- chermical_comp[-cherTrainIndex,]
Training an Elastic Net regression model on the chermicalTrain dataset to predict Yield. A grid search (tuneGrid) explores combinations of .fraction (mixing parameter) and .lambda (regularization strength) from 0 to 1 in 0.1 increments.
set.seed(2425) # For reproducibility
enetFit2 <- train(
x=chermicalTrain,
y=chermicalTrain$Yield,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
# preProcess=c('center','scale')
# preProcess=c('center','scale')
)
enetFit2
## Elasticnet
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 125, 127, 125, 127, 126, 126, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 1.785732296510209 NaN 1.489257066900703
## 0.0 0.1 1.434847103043961 1.0000000 1.196849759796647
## 0.0 0.2 1.275419646602540 1.0000000 1.063866452692593
## 0.0 0.3 1.115992190161119 1.0000000 0.930883145588539
## 0.0 0.4 0.956564733719698 1.0000000 0.797899838484485
## 0.0 0.5 0.797137277278277 1.0000000 0.664916531380431
## 0.0 0.6 0.637709820836857 1.0000000 0.531933224276378
## 0.0 0.7 0.478282364395436 1.0000000 0.398949917172325
## 0.0 0.8 0.318854907954016 1.0000000 0.265966610068271
## 0.0 0.9 0.159427451512596 1.0000000 0.132983302964218
## 0.0 1.0 0.000000001221557 1.0000000 0.000000000375774
## 0.1 0.0 1.785732296510209 NaN 1.489257066900703
## 0.1 0.1 1.447824832343437 1.0000000 1.207546180644274
## 0.1 0.2 1.109917368176664 1.0000000 0.925835294387845
## 0.1 0.3 0.772009904009893 1.0000000 0.644124408131415
## 0.1 0.4 0.434102439843120 1.0000000 0.362413521874986
## 0.1 0.5 0.154259036745992 0.9988143 0.125908333572697
## 0.1 0.6 0.179187099610413 0.9908308 0.143286280349247
## 0.1 0.7 0.231297930560163 0.9846266 0.184287547313417
## 0.1 0.8 0.314242147526389 0.9711641 0.231315890855729
## 0.1 0.9 0.424290659216675 0.9452044 0.276429006480236
## 0.1 1.0 0.515588982646662 0.9198164 0.309671227098255
## 0.2 0.0 1.785732296510209 NaN 1.489257066900703
## 0.2 0.1 1.397163060717235 1.0000000 1.165282894586124
## 0.2 0.2 1.008593824924262 1.0000000 0.841308722271545
## 0.2 0.3 0.620024589131288 1.0000000 0.517334549956966
## 0.2 0.4 0.282324619804769 0.9981682 0.230649091204894
## 0.2 0.5 0.223629717441934 0.9863330 0.176153490255319
## 0.2 0.6 0.322998368405664 0.9717213 0.253875720531182
## 0.2 0.7 0.378710284827910 0.9614894 0.296160611386199
## 0.2 0.8 0.471670023451252 0.9417344 0.349735284322026
## 0.2 0.9 0.587008383421088 0.9123469 0.399035921758594
## 0.2 1.0 0.698633103038860 0.8819868 0.441577685831880
## 0.3 0.0 1.785732296510209 NaN 1.489257066900703
## 0.3 0.1 1.361149849710224 1.0000000 1.135265321367309
## 0.3 0.2 0.936567402910239 1.0000000 0.781273575833915
## 0.3 0.3 0.520182945621010 0.9998984 0.432841363889942
## 0.3 0.4 0.270331569367831 0.9914229 0.220504628598872
## 0.3 0.5 0.324271640129705 0.9705069 0.256122103555501
## 0.3 0.6 0.427340940372378 0.9543289 0.332358973426881
## 0.3 0.7 0.484526179689154 0.9440740 0.377042568356188
## 0.3 0.8 0.575984423522267 0.9237542 0.434344973545167
## 0.3 0.9 0.713851577894005 0.8885063 0.492878983091895
## 0.3 1.0 0.820317441851651 0.8622926 0.536441823848129
## 0.4 0.0 1.785732296510209 NaN 1.489257066900703
## 0.4 0.1 1.332190998722839 1.0000000 1.111139588004123
## 0.4 0.2 0.878649700935469 1.0000000 0.733022109107543
## 0.4 0.3 0.466508306678030 0.9975145 0.383691332944335
## 0.4 0.4 0.290688344586185 0.9828824 0.231566016057706
## 0.4 0.5 0.400916890518639 0.9572896 0.315708257634870
## 0.4 0.6 0.513305995934263 0.9398840 0.399766731080206
## 0.4 0.7 0.576971161863790 0.9295168 0.450868984394657
## 0.4 0.8 0.681554824590548 0.9059296 0.516052096958842
## 0.4 0.9 0.822728116433971 0.8715705 0.579486812748633
## 0.4 1.0 0.924287278048619 0.8492866 0.622839625249375
## 0.5 0.0 1.785732296510209 NaN 1.489257066900703
## 0.5 0.1 1.307295916235547 1.0000000 1.090398794100918
## 0.5 0.2 0.828859535960884 1.0000000 0.691540521301132
## 0.5 0.3 0.437487100147201 0.9932304 0.354738272543427
## 0.5 0.4 0.320461617298899 0.9739978 0.249984433619203
## 0.5 0.5 0.466763788351665 0.9456905 0.366311663318744
## 0.5 0.6 0.589867113627702 0.9275577 0.463400113754935
## 0.5 0.7 0.666434708615125 0.9154333 0.525148855884162
## 0.5 0.8 0.783182665152827 0.8897631 0.591425060986056
## 0.5 0.9 0.921145199580711 0.8589521 0.658059284185828
## 0.5 1.0 1.024004691811390 0.8387311 0.703912611939428
## 0.6 0.0 1.785732296510209 NaN 1.489257066900703
## 0.6 0.1 1.285068814331364 1.0000000 1.071879095677130
## 0.6 0.2 0.784632958700645 0.9999999 0.654674622138400
## 0.6 0.3 0.422572487209625 0.9890410 0.343072563388642
## 0.6 0.4 0.352822870501357 0.9655535 0.272215242103743
## 0.6 0.5 0.526077211240813 0.9353977 0.414033145475914
## 0.6 0.6 0.660243079321853 0.9169021 0.523089765335876
## 0.6 0.7 0.754469471467317 0.9018727 0.599244843184092
## 0.6 0.8 0.885523015866271 0.8740853 0.669538730438320
## 0.6 0.9 1.019906541083494 0.8469550 0.732460043336070
## 0.6 1.0 1.124888589701265 0.8286762 0.781574995727976
## 0.7 0.0 1.785732296510209 NaN 1.489257066900703
## 0.7 0.1 1.264587813549223 1.0000000 1.054810510525454
## 0.7 0.2 0.752153059386862 0.9998357 0.626409330302607
## 0.7 0.3 0.415882578490783 0.9848519 0.338196548179069
## 0.7 0.4 0.386503705908352 0.9575260 0.296907785067535
## 0.7 0.5 0.581268663216794 0.9261360 0.460010128102817
## 0.7 0.6 0.727196936757065 0.9074271 0.580542454374291
## 0.7 0.7 0.842657294035419 0.8885020 0.671511097430215
## 0.7 0.8 0.988449981312779 0.8588146 0.750133723943875
## 0.7 0.9 1.121968958099376 0.8345992 0.812212790811561
## 0.7 1.0 1.228512543903121 0.8183702 0.858126104529331
## 0.8 0.0 1.785732296510209 NaN 1.489257066900703
## 0.8 0.1 1.245437377268442 1.0000000 1.038850024585222
## 0.8 0.2 0.726568833481090 0.9988945 0.603137472527159
## 0.8 0.3 0.412874078878135 0.9802189 0.336394213144416
## 0.8 0.4 0.420516546450795 0.9499402 0.325023307083475
## 0.8 0.5 0.633650308584103 0.9176319 0.504185001917922
## 0.8 0.6 0.791762651504268 0.8987763 0.635320241858850
## 0.8 0.7 0.921355897532670 0.8777488 0.736794620503935
## 0.8 0.8 1.090085035874687 0.8443305 0.828412512667163
## 0.8 0.9 1.224443872642379 0.8223830 0.893286574323227
## 0.8 1.0 1.334166652904727 0.8077551 0.941787513356787
## 0.9 0.0 1.785732296510209 NaN 1.489257066900703
## 0.9 0.1 1.227425003454661 1.0000000 1.023838592652305
## 0.9 0.2 0.702882415474459 0.9972472 0.581892728420243
## 0.9 0.3 0.411371226022985 0.9755281 0.334246194708562
## 0.9 0.4 0.454082569547592 0.9428660 0.352736973938333
## 0.9 0.5 0.683146786422000 0.9099057 0.545698063824204
## 0.9 0.6 0.853799524220365 0.8908578 0.687471307155524
## 0.9 0.7 0.996554487181447 0.8681131 0.799081951036427
## 0.9 0.8 1.189150196088896 0.8310386 0.904666273874281
## 0.9 0.9 1.325023031376493 0.8108461 0.971115013351703
## 0.9 1.0 1.440425659865060 0.7970872 1.023250088543972
## 1.0 0.0 1.785732296510209 NaN 1.489257066900703
## 1.0 0.1 1.210289350758559 1.0000000 1.009563363604153
## 1.0 0.2 0.680633224212471 0.9952134 0.561386888244268
## 1.0 0.3 0.411609201381028 0.9707856 0.332525386644738
## 1.0 0.4 0.487563237782785 0.9361413 0.379937853755743
## 1.0 0.5 0.730474343324566 0.9028558 0.585302034727792
## 1.0 0.6 0.913478830557609 0.8837013 0.737527305884699
## 1.0 0.7 1.069335556432365 0.8594749 0.859241735565904
## 1.0 0.8 1.282378448905773 0.8194544 0.977475632840046
## 1.0 0.9 1.425225328986141 0.7997574 1.048299162833104
## 1.0 1.0 1.546098715823611 0.7866793 1.102573979073727
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.1 and lambda = 1.
plot(enetFit2)
The optimal performance metrics are fraction = 0.1 and lambda = 1.
The model metrics are as follows: RMSE:
0.515588982646662
Rsquared: 0.9198164
MAE: 0.309671227098255
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
enet2Pred <- predict(enetFit2,newdata = chermicalTest)
postResample(pred = enet2Pred, obs = chermicalTest$Yield)
## RMSE Rsquared MAE
## 1.359568 1.000000 1.078993
The test performance metrics indicate that the model achieved an impressive result, with an R-squared of 1.000, suggesting it perfectly explained the variance in the test data. This is an improvement over the training R-squared of 0.9198, which, while already high, was not perfect. The RMSE and MAE for the test set were 1.3596 and 1.0790, respectively, showing an expected increase in error when compared to the training RMSE of 0.5156 and MAE of 0.3097. This difference reflects the typical scenario where models perform better on training data due to being directly fit to it.
The comparison between training and test metrics shows that the model maintained strong predictive performance on unseen data, with only a modest increase in error. The higher RMSE and MAE on the test set are within a reasonable range, demonstrating that the model generalizes well beyond its training set. The perfect R-squared on the test set is notable, but it could suggest an unusually favorable test set or potential overfitting that should be carefully reviewed.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The Overall score in the varImp() output from the caret package represents the relative importance of each variable in the model. This score quantifies how much a variable contributes to the model’s predictive performance.
predictorImportance <- varImp(enetFit2)$importance
predictorImportance$Name <- rownames(predictorImportance)
predictorImportance <- predictorImportance[order(-predictorImportance$Overall),]
predictorImportance <- predictorImportance[2:nrow(predictorImportance),]
rownames(predictorImportance) <- NULL
predictorImportance |> head(20) |> kable() |> kable_styling() |> kable_classic(full_width = F)
| Overall | Name |
|---|---|
| 42.08617 | ManufacturingProcess32 |
| 37.87971 | BiologicalMaterial06 |
| 30.91456 | BiologicalMaterial03 |
| 30.15037 | ManufacturingProcess13 |
| 28.96515 | ManufacturingProcess36 |
| 28.49918 | BiologicalMaterial02 |
| 26.31269 | ManufacturingProcess31 |
| 25.75545 | ManufacturingProcess17 |
| 24.45221 | ManufacturingProcess09 |
| 24.13562 | BiologicalMaterial12 |
| 23.57180 | BiologicalMaterial04 |
| 22.55198 | ManufacturingProcess33 |
| 20.19579 | ManufacturingProcess06 |
| 19.89309 | BiologicalMaterial01 |
| 18.25433 | ManufacturingProcess29 |
| 17.76085 | BiologicalMaterial11 |
| 15.34550 | BiologicalMaterial08 |
| 13.28445 | ManufacturingProcess27 |
| 12.68791 | ManufacturingProcess11 |
| 11.32475 | ManufacturingProcess30 |
Score of contribution from each Variable Class:
bio <- predictorImportance |> filter(grepl("Biological",Name)) |> summarise(Total_Score=sum(Overall))
manu <- predictorImportance |> filter(grepl("Manufacturing",Name)) |> summarise(Total_Score=sum(Overall))
bio$VariableType <- "Biological"
manu$VariableType <- "Manufacturing"
bind_rows(
bio,
manu
) |> kable() |> kable_styling() |> kable_classic(full_width = F)
| Total_Score | VariableType |
|---|---|
| 223.3147 | Biological |
| 410.0964 | Manufacturing |
These aggregated scores provide a clear understanding of the relative contribution of different types of variables to the model’s overall performance. Higher scores indicate that a specific type of variable is more influential in predicting the target outcome. In this case, “Manufacturing” variables appear to contribute more substantially than “Biological” ones, highlighting their greater predictive importance.
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Exploring the relationships between key predictors and the response variable is essential for understanding how to improve yield in future manufacturing processes.
top20Predictors <- predictorImportance[1:10,]$Name
top20Data <- chermical_comp |> select(all_of(top20Predictors))
correlations <- cor(top20Data)
corrplot::corrplot(correlations, order = "hclust")
BiologicalMaterial06 and BiologicalMaterial02, for example, show a strong positive correlation (0.954), indicating that these materials may have complementary effects on the yield. Additionally, BiologicalMaterial03 is highly correlated with BiologicalMaterial06 (0.872), suggesting that their combined influence could be leveraged to enhance output. These insights imply that fine-tuning the levels or conditions related to these biological inputs could be a strategic approach to improving yield.
ManufacturingProcess32 shows moderate positive correlations with both BiologicalMaterial06 (0.600) and BiologicalMaterial02 (0.629), pointing to a potential combined effect that could be optimized for better yield results. This relationship suggests that ManufacturingProcess32 may interact effectively with these biological materials and should be a focus for further refinement. Conversely, ManufacturingProcess36 has a strong negative correlation with ManufacturingProcess32 (-0.774) and moderate negative correlations with BiologicalMaterial06 (-0.510) and BiologicalMaterial03 (-0.464). These negative relationships indicate that ManufacturingProcess36 might counteract the positive impacts of other processes, and changes to it should be managed carefully to avoid decreasing yield.
Notably, ManufacturingProcess13 displays a significant negative correlation with ManufacturingProcess09 (-0.791), suggesting potential interference between these processes. This could imply that balancing or redesigning these processes to minimize adverse interactions could enhance overall efficiency and yield. On the other hand, processes like ManufacturingProcess31 and ManufacturingProcess17 show weaker correlations with the main variables, indicating a smaller direct influence on yield but suggesting possible secondary effects when combined with more impactful variables (e.g., a positive correlation between ManufacturingProcess17 and ManufacturingProcess13 at 0.782).
Prioritizing the optimization of BiologicalMaterial06, BiologicalMaterial02, and ManufacturingProcess32 could significantly impact the yield positively. At the same time, managing or revising ManufacturingProcess36 and mitigating interactions like those between ManufacturingProcess13 and ManufacturingProcess09 could prevent negative effects. The relationships identified here can guide process engineers to conduct targeted experiments and adjustments, leading to more consistent and higher-yielding manufacturing outcomes.