Imports

library(gridExtra)
library(ggplot2)
library(cowplot)
options(scipen=10000)
library(mlbench)
library(tidyverse)
library(corrplot)
library(AppliedPredictiveModeling)
library(caret)
library(DataExplorer)
library(kableExtra)
library(mice)
library(pls)

6.2 Permeability

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nearZero <- nearZeroVar(fingerprints)
fingerprintsC <- fingerprints[, -nearZero]

Number of variables left:

ncol(fingerprintsC)
## [1] 388
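
To see why predictors were flagged, nearZeroVar() can also return its underlying metrics: by default it flags a predictor when the ratio of the most common to the second most common value exceeds 95/5 and the percentage of unique values is below 10. A quick, optional check along these lines:

nzvMetrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
table(nzvMetrics$nzv)                # flagged vs. retained predictor counts
head(nzvMetrics[nzvMetrics$nzv, ])   # frequency ratios of a few flagged predictors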

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2 ?

set.seed(2425)  # For reproducibility
trainIndex <- sample(1:nrow(fingerprintsC), size = 0.8 * nrow(fingerprintsC))
fingerprintsMerged <- cbind(fingerprintsC, permeability) |> as_tibble()
fingerprintsTrain <- fingerprintsMerged[trainIndex,]
fingerprintsTest <- fingerprintsMerged[-trainIndex,]
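# An alternative is caret's createDataPartition(), which stratifies the split
# on the outcome and can matter for a skewed response like permeability (sketch):
# inTrain <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)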
# str(fingerprintsTrain)
fingerprintsTrainNoPr <- fingerprintsTrain |> select(-permeability)

fingerCorr <- cor(fingerprintsTrainNoPr)

# findCorrelation(fingerCorr,0.75)
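# If a correlation filter were desired, findCorrelation() returns the columns to
# drop; PLS tolerates correlated predictors, so this is left as exploration only:
# highCorr <- findCorrelation(fingerCorr, cutoff = 0.75)
# length(highCorr)  # how many predictors a 0.75 cutoff would remove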
set.seed(2425)  # For reproducibility

trControl <- trainControl(method = "cv", number = 10)
plsModel <- train(fingerprintsTrainNoPr, fingerprintsTrain$permeability,
                  method = "pls", trControl = trControl,
                  preProcess = c("center", "scale"), tuneLength = 10)
plsModel
## Partial Least Squares 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     11.93575  0.3717458  9.255064
##    2     10.47966  0.5543316  7.489034
##    3     10.27631  0.5781111  7.668499
##    4     10.53127  0.5186387  7.818789
##    5     10.77483  0.4969523  7.691391
##    6     11.10669  0.4814662  8.064078
##    7     11.22050  0.4755242  8.150355
##    8     11.40249  0.4715207  8.341430
##    9     11.51949  0.4559837  8.413719
##   10     11.62321  0.4554849  8.342049
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(plsModel)

The optimal number of latent variables is 3, and the corresponding resampled R2 estimate is 0.5781111, the highest among the numbers of components tried.

(d) Predict the response for the test set. What is the test set estimate of R2 ?


modelPred <- predict(plsModel,newdata = fingerprintsTest)
postResample(pred = modelPred, obs = fingerprintsTest$permeability)
##       RMSE   Rsquared        MAE 
## 14.9302821  0.2709311 11.5245778

The test set estimate of R2 is 0.2709311.

(e) Try building other models discussed in this chapter. Do any have better predictive performance?

Ridge Regression, parameter tuned: lambda (from 0 to 1 by 0.1)

set.seed(2425)  # For reproducibility

ridgeFit <- train(
                  x=fingerprintsTrainNoPr,
                  y=fingerprintsTrain$permeability,
                  method='ridge',
                  metric='Rsquared',
                  tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
ridgeFit
## Ridge Regression 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE             Rsquared    MAE             
##   0.0     802367780.01868  0.09887969  214441826.400764
##   0.1            12.20228  0.43842501          8.505063
##   0.2            11.94297  0.47885042          8.461151
##   0.3            12.05871  0.49833946          8.619401
##   0.4            12.32543  0.51121382          8.895922
##   0.5            12.70289  0.52049743          9.236827
##   0.6            13.15710  0.52781833          9.640106
##   0.7            13.66675  0.53397421         10.108291
##   0.8            14.23771  0.53898576         10.631019
##   0.9            14.84155  0.54340174         11.168300
##   1.0            15.47789  0.54721903         11.711016
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.

Lasso Regression, parameter tuned: fraction (from 0.05 to 0.5 by 0.05)

set.seed(2425)  # For reproducibility

lassoFit <- train(
                  x=fingerprintsTrainNoPr,
                  y=fingerprintsTrain$permeability,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0.05, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
lassoFit
## The lasso 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE       Rsquared   MAE      
##   0.05       40041544  0.1738815   10701561
##   0.10       80163975  0.1486392   21424732
##   0.15      120286404  0.1486392   32147902
##   0.20      160408834  0.1486392   42871073
##   0.25      200531264  0.1486392   53594243
##   0.30      240653694  0.1486392   64317414
##   0.35      280776124  0.1572672   75040584
##   0.40      320898555  0.1338068   85763755
##   0.45      361020989  0.1129875   96486927
##   0.50      401143424  0.1042411  107210099
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.

Elastic net, parameters tuned: fraction and lambda (a 2-D grid with each dimension from 0 to 1 by 0.1)

set.seed(2425)  # For reproducibility

enetFit <- train(
                  x=fingerprintsTrainNoPr,
                  y=fingerprintsTrain$permeability,
                 method='enet',
                 metric='Rsquared',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
                 trControl=trainControl(method='cv'),
                 preProcess=c('center','scale')
                  )
enetFit
## Elasticnet 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 118, 118, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE              Rsquared    MAE             
##   0.0     0.0              15.260780         NaN         12.122307
##   0.0     0.1        80163974.510689  0.14863919   21424731.912977
##   0.0     0.2       160408834.393874  0.14863919   42871072.822539
##   0.0     0.3       240653694.277060  0.14863919   64317413.732102
##   0.0     0.4       320898555.359768  0.13380678   85763755.272878
##   0.0     0.5       401143424.142742  0.10424107  107210099.435151
##   0.0     0.6       481388295.101290  0.09868425  128656444.939307
##   0.0     0.7       561633166.764311  0.09818927  150102790.679683
##   0.0     0.8       641878037.909563  0.09839624  171549135.978401
##   0.0     0.9       722122909.115455  0.09866210  192995481.277120
##   0.0     1.0       802367780.018678  0.09887969  214441826.400764
##   0.1     0.0              14.622760         NaN         11.559912
##   0.1     0.1               9.991123  0.63324472          7.189238
##   0.1     0.2              10.240873  0.56809715          7.248915
##   0.1     0.3              10.586953  0.53481599          7.557206
##   0.1     0.4              10.838509  0.51881105          7.763374
##   0.1     0.5              11.058863  0.50424115          7.885955
##   0.1     0.6              11.265256  0.49078354          8.012734
##   0.1     0.7              11.504053  0.47651658          8.141108
##   0.1     0.8              11.778516  0.46146610          8.289844
##   0.1     0.9              12.019729  0.44831371          8.425988
##   0.1     1.0              12.202285  0.43842501          8.505063
##   0.2     0.0              14.622760         NaN         11.559912
##   0.2     0.1              10.198479  0.61203674          7.356762
##   0.2     0.2              10.070647  0.60636155          6.998864
##   0.2     0.3              10.515833  0.55988190          7.418731
##   0.2     0.4              10.756930  0.54437979          7.698070
##   0.2     0.5              10.984986  0.53200746          7.903420
##   0.2     0.6              11.185500  0.52025561          8.023412
##   0.2     0.7              11.399790  0.50863017          8.132020
##   0.2     0.8              11.612777  0.49645256          8.253235
##   0.2     0.9              11.801939  0.48573133          8.372382
##   0.2     1.0              11.942973  0.47885042          8.461151
##   0.3     0.0              14.622760         NaN         11.559912
##   0.3     0.1              10.266780  0.60650807          7.442365
##   0.3     0.2              10.016172  0.62652825          6.812753
##   0.3     0.3              10.523341  0.57870120          7.306924
##   0.3     0.4              10.850755  0.55849905          7.710320
##   0.3     0.5              11.097343  0.54797229          7.970797
##   0.3     0.6              11.321812  0.53562933          8.146556
##   0.3     0.7              11.550657  0.52419889          8.298958
##   0.3     0.8              11.749800  0.51413129          8.429755
##   0.3     0.9              11.925599  0.50490435          8.540425
##   0.3     1.0              12.058710  0.49833946          8.619401
##   0.4     0.0              14.622760         NaN         11.559912
##   0.4     0.1              10.302765  0.60627823          7.480312
##   0.4     0.2              10.050769  0.63429937          6.699186
##   0.4     0.3              10.645382  0.59216014          7.314368
##   0.4     0.4              11.033873  0.56950921          7.782967
##   0.4     0.5              11.300823  0.55860473          8.072870
##   0.4     0.6              11.564261  0.54655473          8.305049
##   0.4     0.7              11.806510  0.53495112          8.501374
##   0.4     0.8              12.015222  0.52579416          8.673454
##   0.4     0.9              12.182226  0.51775128          8.801622
##   0.4     1.0              12.325429  0.51121382          8.895922
##   0.5     0.0              14.622760         NaN         11.559912
##   0.5     0.1              10.348790  0.60490572          7.509109
##   0.5     0.2              10.133114  0.63761629          6.620225
##   0.5     0.3              10.778051  0.60456240          7.343278
##   0.5     0.4              11.273302  0.57867838          7.876785
##   0.5     0.5              11.595033  0.56650571          8.246186
##   0.5     0.6              11.884916  0.55477573          8.549504
##   0.5     0.7              12.156357  0.54322314          8.785597
##   0.5     0.8              12.377407  0.53419776          8.973409
##   0.5     0.9              12.548489  0.52678779          9.123832
##   0.5     1.0              12.702892  0.52049743          9.236827
##   0.6     0.0              14.622760         NaN         11.559912
##   0.6     0.1              10.383723  0.60273088          7.517420
##   0.6     0.2              10.242833  0.63848585          6.589084
##   0.6     0.3              10.960323  0.61225872          7.427103
##   0.6     0.4              11.563837  0.58565197          8.089024
##   0.6     0.5              11.959282  0.57157519          8.532573
##   0.6     0.6              12.298621  0.55995393          8.877815
##   0.6     0.7              12.572811  0.54997502          9.133085
##   0.6     0.8              12.806454  0.54125399          9.345945
##   0.6     0.9              12.989779  0.53394625          9.511248
##   0.6     1.0              13.157102  0.52781833          9.640106
##   0.7     0.0              14.622760         NaN         11.559912
##   0.7     0.1              10.417098  0.60087368          7.515903
##   0.7     0.2              10.383610  0.63694858          6.559416
##   0.7     0.3              11.170144  0.61992591          7.563298
##   0.7     0.4              11.901411  0.59151317          8.381715
##   0.7     0.5              12.373171  0.57591359          8.902985
##   0.7     0.6              12.737873  0.56500173          9.283757
##   0.7     0.7              13.040165  0.55557176          9.562922
##   0.7     0.8              13.283924  0.54735982          9.780256
##   0.7     0.9              13.485332  0.54003335          9.956210
##   0.7     1.0              13.666753  0.53397421         10.108291
##   0.8     0.0              14.622760         NaN         11.559912
##   0.8     0.1              10.433672  0.59905991          7.493092
##   0.8     0.2              10.534394  0.63585535          6.531095
##   0.8     0.3              11.421544  0.62447326          7.780023
##   0.8     0.4              12.274933  0.59636197          8.731893
##   0.8     0.5              12.812030  0.58032449          9.319917
##   0.8     0.6              13.222684  0.56931871          9.733258
##   0.8     0.7              13.548507  0.56039507         10.013897
##   0.8     0.8              13.815509  0.55242248         10.251490
##   0.8     0.9              14.038219  0.54509352         10.452206
##   0.8     1.0              14.237709  0.53898576         10.631019
##   0.9     0.0              14.622760         NaN         11.559912
##   0.9     0.1              10.461544  0.59764715          7.483664
##   0.9     0.2              10.727567  0.63327253          6.523331
##   0.9     0.3              11.705305  0.62728783          8.021158
##   0.9     0.4              12.684318  0.60044864          9.104595
##   0.9     0.5              13.285359  0.58421467          9.735267
##   0.9     0.6              13.740778  0.57301853         10.187436
##   0.9     0.7              14.090141  0.56456704         10.481302
##   0.9     0.8              14.382995  0.55671392         10.733571
##   0.9     0.9              14.619484  0.54963698         10.961512
##   0.9     1.0              14.841545  0.54340174         11.168300
##   1.0     0.0              14.622760         NaN         11.559912
##   1.0     0.1              10.486694  0.59646586          7.469183
##   1.0     0.2              10.953379  0.62956321          6.545608
##   1.0     0.3              12.019663  0.62888116          8.267122
##   1.0     0.4              13.114747  0.60392930          9.479820
##   1.0     0.5              13.792191  0.58785747         10.159859
##   1.0     0.6              14.289489  0.57625919         10.650704
##   1.0     0.7              14.667637  0.56789661         10.964553
##   1.0     0.8              14.987442  0.56021606         11.247447
##   1.0     0.9              15.232649  0.55359752         11.486376
##   1.0     1.0              15.477888  0.54721903         11.711016
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.6.
resamp <- resamples(list(PLS=plsModel, Ridge=ridgeFit, Lasso=lassoFit, enet=enetFit))
resamp
## 
## Call:
## resamples.default(x = list(PLS = plsModel, Ridge = ridgeFit, Lasso =
##  lassoFit, enet = enetFit))
## 
## Models: PLS, Ridge, Lasso, enet 
## Number of resamples: 10 
## Performance metrics: MAE, RMSE, Rsquared 
## Time estimates for: everything, final model fit
summary(resamp)
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: PLS, Ridge, Lasso, enet 
## Number of resamples: 10 
## 
## MAE 
##            Min.   1st Qu.    Median            Mean         3rd Qu.
## PLS    4.419257  5.534504  7.818659        7.668499        9.553810
## Ridge  7.004522  9.234092 11.648974       11.711016       14.915608
## Lasso 11.202097 11.574198 11.946299 10701561.102331 16052336.052448
## enet   2.656191  4.651612  6.985758        6.589084        8.330346
##                 Max. NA's
## PLS         10.98701    0
## Ridge       16.13178    0
## Lasso 32104660.15860    7
## enet        10.45220    0
## 
## RMSE 
##            Min.   1st Qu.    Median           Mean        3rd Qu.
## PLS    5.208639  7.147750  9.673697       10.27631       12.86269
## Ridge  9.009285 11.310016 15.037590       15.47789       18.80961
## Lasso 16.055141 16.274349 16.493558 40041543.73132 60062307.56941
## enet   3.815536  7.532664 11.020318       10.24283       12.97799
##                  Max. NA's
## PLS          16.24362    0
## Ridge        23.22932    0
## Lasso 120124598.64527    7
## enet         16.44571    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS   0.13651651 0.3544622 0.6815658 0.5781111 0.7840040 0.8851359    0
## Ridge 0.17950431 0.3143576 0.6232016 0.5472190 0.7157092 0.8876797    0
## Lasso 0.02852087 0.1278148 0.2271087 0.1738815 0.2465618 0.2660149    7
## enet  0.19313929 0.3871844 0.7883984 0.6384858 0.8268568 0.9357691    0
ridgeFitPred <- predict(ridgeFit,newdata = fingerprintsTest)
lassoFitPred <- predict(lassoFit,newdata = fingerprintsTest)
enetFitPred <- predict(enetFit,newdata = fingerprintsTest)

Ridge Results

ridgeRes <- postResample(pred = ridgeFitPred, obs = fingerprintsTest$permeability) 
ridgeRes
##       RMSE   Rsquared        MAE 
## 19.1487094  0.3122149 14.8732077

Lasso Results

lassoRes <- postResample(pred = lassoFitPred, obs = fingerprintsTest$permeability) 
lassoRes
##          RMSE      Rsquared           MAE 
## 1095.97427689    0.03469573  507.51889060

Elastic Net Results

enetRes <- postResample(pred = enetFitPred, obs = fingerprintsTest$permeability) 
enetRes
##       RMSE   Rsquared        MAE 
## 16.7444104  0.2166588 11.3697266
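
To line the test-set metrics up side by side, the postResample() results can be bound into one table. A small sketch (plsRes recomputes the PLS test metrics from part (d)):

plsRes <- postResample(pred = modelPred, obs = fingerprintsTest$permeability)
rbind(PLS = plsRes, Ridge = ridgeRes, Lasso = lassoRes, enet = enetRes)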

When evaluating both training and test data metrics, it’s essential to identify a model that balances predictive accuracy and generalizability. Elastic Net (enet) showed strong performance on the training data with the lowest mean MAE (6.59), a competitive mean RMSE (10.24), and the highest mean R-squared (0.64), indicating an excellent fit on the training set. However, its test data performance reveals some weaknesses. While it maintains a decent test RMSE (16.74) and a competitive MAE (11.37), its test R-squared drops to 0.22, lower than both PLS and Ridge, suggesting potential overfitting. This drop implies that enet captures the training data well but struggles to generalize to new, unseen data.

Partial Least Squares (PLS) provides a better balance between training and test performance. On the training data, it showed a mean MAE of 7.67, a mean RMSE of 10.28, and a mean R-squared of 0.58, indicating reliable predictive power. More importantly, PLS performs best on the test data with the lowest RMSE (14.93) and a competitive MAE (11.52). Although its test R-squared (0.27) is not the highest, it still represents a reasonable level of variance explanation and suggests that PLS generalizes better than enet. This makes PLS a strong candidate for reliable performance across different data sets.

Ridge regression, while showing moderate performance in training with higher mean MAE (11.71) and RMSE (15.48), surprises with the highest test R-squared (0.31). This indicates that Ridge explains more variance on the test data than the other models. However, its higher test RMSE (19.15) and MAE (14.87) suggest that its predictions are less accurate compared to PLS and enet, indicating that while it captures variance well, it sacrifices predictive accuracy.

Lasso is consistently the worst performer across both training and test data, with extremely high errors and a very low R-squared, indicating instability and poor predictive ability. It is not a viable option due to these significant shortcomings.

In conclusion, PLS stands out as the best model when considering both training and test data. It maintains a strong balance with the lowest test RMSE, competitive MAE, and reasonable R-squared, indicating that it generalizes effectively while maintaining good predictive accuracy. Elastic Net can be a secondary choice due to its competitive test errors, but its low test R-squared signals potential overfitting. Ridge, despite its high test R-squared, is less preferred due to higher errors, making PLS the optimal model for robust and balanced performance.

(f) Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend replacing the permeability laboratory experiment entirely with any of the models due to limitations in accuracy and reliability. While models like Partial Least Squares (PLS) and Elastic Net (enet) demonstrated solid performance and could serve as valuable tools for augmenting lab work by providing preliminary predictions, they fall short of the high level of precision needed to fully replace experimental methods. Data-driven models rely heavily on the quality and representativeness of the training data and may not account for the full complexity and variability of real-world conditions. Without rigorous validation on independent datasets, there is a risk that these models may not consistently meet the standards required for reliable permeability assessments. Therefore, while these models could complement and optimize laboratory work by screening and prioritizing samples, they should not be used as replacements for the proven accuracy of laboratory experiments.

6.3 Chemical Manufacturing Process

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

data("ChemicalManufacturingProcess")

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

Identify variables in the ChemicalManufacturingProcess dataset that have near-zero variance. These variables show very little variation across observations and contribute little to predictive models, potentially adding noise or redundancy.

cherNearZero <- nearZeroVar(ChemicalManufacturingProcess)

chermical <- ChemicalManufacturingProcess[, -cherNearZero]
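
Before imputing, it is worth quantifying how much is actually missing; a quick check along these lines confirms the "small percentage" claim:

mean(is.na(chermical))                                    # overall fraction of missing cells
head(sort(colSums(is.na(chermical)), decreasing = TRUE))  # most affected predictors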

Impute the missing values with the mice() function from the mice package, which performs multiple imputation on the chermical dataset. The method used is predictive mean matching (method = 'pmm'), which keeps imputed values plausible and within the range of the observed data.

set.seed(2425)  # For reproducibility


chermical_imp <- mice(chermical, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)

chermical_comp <-  complete(chermical_imp,1)
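
A one-line sanity check that the completed data set has no remaining gaps:

anyNA(chermical_comp)  # should be FALSE after imputation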

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Creating Training and Test Data

set.seed(2425)  # For reproducibility
cherTrainIndex <- sample(1:nrow(chermical_comp), size = 0.8 * nrow(chermical_comp))

chermicalTrain <- chermical_comp[cherTrainIndex,] 
chermicalTest <- chermical_comp[-cherTrainIndex,]

Training an Elastic Net regression model on the chermicalTrain dataset to predict Yield. A grid search (tuneGrid) explores combinations of .fraction (the fraction of the full L1 solution path) and .lambda (the ridge penalty), each from 0 to 1 in 0.1 increments.

set.seed(2425)  # For reproducibility

enetFit2 <- train(
                  x = chermicalTrain,  # NB: chermicalTrain still contains Yield; see the note after the results
                  y = chermicalTrain$Yield,
                  method = 'enet',
                  metric = 'Rsquared',
                  tuneGrid = expand.grid(.fraction = seq(0, 1, by=0.1),
                                         .lambda = seq(0, 1, by=0.1)),
                  trControl = trainControl(method='cv'),
                  # preProcess = c('center','scale')
                  )
enetFit2
## Elasticnet 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 125, 127, 125, 127, 126, 126, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE               Rsquared   MAE              
##   0.0     0.0       1.785732296510209        NaN  1.489257066900703
##   0.0     0.1       1.434847103043961  1.0000000  1.196849759796647
##   0.0     0.2       1.275419646602540  1.0000000  1.063866452692593
##   0.0     0.3       1.115992190161119  1.0000000  0.930883145588539
##   0.0     0.4       0.956564733719698  1.0000000  0.797899838484485
##   0.0     0.5       0.797137277278277  1.0000000  0.664916531380431
##   0.0     0.6       0.637709820836857  1.0000000  0.531933224276378
##   0.0     0.7       0.478282364395436  1.0000000  0.398949917172325
##   0.0     0.8       0.318854907954016  1.0000000  0.265966610068271
##   0.0     0.9       0.159427451512596  1.0000000  0.132983302964218
##   0.0     1.0       0.000000001221557  1.0000000  0.000000000375774
##   0.1     0.0       1.785732296510209        NaN  1.489257066900703
##   0.1     0.1       1.447824832343437  1.0000000  1.207546180644274
##   0.1     0.2       1.109917368176664  1.0000000  0.925835294387845
##   0.1     0.3       0.772009904009893  1.0000000  0.644124408131415
##   0.1     0.4       0.434102439843120  1.0000000  0.362413521874986
##   0.1     0.5       0.154259036745992  0.9988143  0.125908333572697
##   0.1     0.6       0.179187099610413  0.9908308  0.143286280349247
##   0.1     0.7       0.231297930560163  0.9846266  0.184287547313417
##   0.1     0.8       0.314242147526389  0.9711641  0.231315890855729
##   0.1     0.9       0.424290659216675  0.9452044  0.276429006480236
##   0.1     1.0       0.515588982646662  0.9198164  0.309671227098255
##   0.2     0.0       1.785732296510209        NaN  1.489257066900703
##   0.2     0.1       1.397163060717235  1.0000000  1.165282894586124
##   0.2     0.2       1.008593824924262  1.0000000  0.841308722271545
##   0.2     0.3       0.620024589131288  1.0000000  0.517334549956966
##   0.2     0.4       0.282324619804769  0.9981682  0.230649091204894
##   0.2     0.5       0.223629717441934  0.9863330  0.176153490255319
##   0.2     0.6       0.322998368405664  0.9717213  0.253875720531182
##   0.2     0.7       0.378710284827910  0.9614894  0.296160611386199
##   0.2     0.8       0.471670023451252  0.9417344  0.349735284322026
##   0.2     0.9       0.587008383421088  0.9123469  0.399035921758594
##   0.2     1.0       0.698633103038860  0.8819868  0.441577685831880
##   0.3     0.0       1.785732296510209        NaN  1.489257066900703
##   0.3     0.1       1.361149849710224  1.0000000  1.135265321367309
##   0.3     0.2       0.936567402910239  1.0000000  0.781273575833915
##   0.3     0.3       0.520182945621010  0.9998984  0.432841363889942
##   0.3     0.4       0.270331569367831  0.9914229  0.220504628598872
##   0.3     0.5       0.324271640129705  0.9705069  0.256122103555501
##   0.3     0.6       0.427340940372378  0.9543289  0.332358973426881
##   0.3     0.7       0.484526179689154  0.9440740  0.377042568356188
##   0.3     0.8       0.575984423522267  0.9237542  0.434344973545167
##   0.3     0.9       0.713851577894005  0.8885063  0.492878983091895
##   0.3     1.0       0.820317441851651  0.8622926  0.536441823848129
##   0.4     0.0       1.785732296510209        NaN  1.489257066900703
##   0.4     0.1       1.332190998722839  1.0000000  1.111139588004123
##   0.4     0.2       0.878649700935469  1.0000000  0.733022109107543
##   0.4     0.3       0.466508306678030  0.9975145  0.383691332944335
##   0.4     0.4       0.290688344586185  0.9828824  0.231566016057706
##   0.4     0.5       0.400916890518639  0.9572896  0.315708257634870
##   0.4     0.6       0.513305995934263  0.9398840  0.399766731080206
##   0.4     0.7       0.576971161863790  0.9295168  0.450868984394657
##   0.4     0.8       0.681554824590548  0.9059296  0.516052096958842
##   0.4     0.9       0.822728116433971  0.8715705  0.579486812748633
##   0.4     1.0       0.924287278048619  0.8492866  0.622839625249375
##   0.5     0.0       1.785732296510209        NaN  1.489257066900703
##   0.5     0.1       1.307295916235547  1.0000000  1.090398794100918
##   0.5     0.2       0.828859535960884  1.0000000  0.691540521301132
##   0.5     0.3       0.437487100147201  0.9932304  0.354738272543427
##   0.5     0.4       0.320461617298899  0.9739978  0.249984433619203
##   0.5     0.5       0.466763788351665  0.9456905  0.366311663318744
##   0.5     0.6       0.589867113627702  0.9275577  0.463400113754935
##   0.5     0.7       0.666434708615125  0.9154333  0.525148855884162
##   0.5     0.8       0.783182665152827  0.8897631  0.591425060986056
##   0.5     0.9       0.921145199580711  0.8589521  0.658059284185828
##   0.5     1.0       1.024004691811390  0.8387311  0.703912611939428
##   0.6     0.0       1.785732296510209        NaN  1.489257066900703
##   0.6     0.1       1.285068814331364  1.0000000  1.071879095677130
##   0.6     0.2       0.784632958700645  0.9999999  0.654674622138400
##   0.6     0.3       0.422572487209625  0.9890410  0.343072563388642
##   0.6     0.4       0.352822870501357  0.9655535  0.272215242103743
##   0.6     0.5       0.526077211240813  0.9353977  0.414033145475914
##   0.6     0.6       0.660243079321853  0.9169021  0.523089765335876
##   0.6     0.7       0.754469471467317  0.9018727  0.599244843184092
##   0.6     0.8       0.885523015866271  0.8740853  0.669538730438320
##   0.6     0.9       1.019906541083494  0.8469550  0.732460043336070
##   0.6     1.0       1.124888589701265  0.8286762  0.781574995727976
##   0.7     0.0       1.785732296510209        NaN  1.489257066900703
##   0.7     0.1       1.264587813549223  1.0000000  1.054810510525454
##   0.7     0.2       0.752153059386862  0.9998357  0.626409330302607
##   0.7     0.3       0.415882578490783  0.9848519  0.338196548179069
##   0.7     0.4       0.386503705908352  0.9575260  0.296907785067535
##   0.7     0.5       0.581268663216794  0.9261360  0.460010128102817
##   0.7     0.6       0.727196936757065  0.9074271  0.580542454374291
##   0.7     0.7       0.842657294035419  0.8885020  0.671511097430215
##   0.7     0.8       0.988449981312779  0.8588146  0.750133723943875
##   0.7     0.9       1.121968958099376  0.8345992  0.812212790811561
##   0.7     1.0       1.228512543903121  0.8183702  0.858126104529331
##   0.8     0.0       1.785732296510209        NaN  1.489257066900703
##   0.8     0.1       1.245437377268442  1.0000000  1.038850024585222
##   0.8     0.2       0.726568833481090  0.9988945  0.603137472527159
##   0.8     0.3       0.412874078878135  0.9802189  0.336394213144416
##   0.8     0.4       0.420516546450795  0.9499402  0.325023307083475
##   0.8     0.5       0.633650308584103  0.9176319  0.504185001917922
##   0.8     0.6       0.791762651504268  0.8987763  0.635320241858850
##   0.8     0.7       0.921355897532670  0.8777488  0.736794620503935
##   0.8     0.8       1.090085035874687  0.8443305  0.828412512667163
##   0.8     0.9       1.224443872642379  0.8223830  0.893286574323227
##   0.8     1.0       1.334166652904727  0.8077551  0.941787513356787
##   0.9     0.0       1.785732296510209        NaN  1.489257066900703
##   0.9     0.1       1.227425003454661  1.0000000  1.023838592652305
##   0.9     0.2       0.702882415474459  0.9972472  0.581892728420243
##   0.9     0.3       0.411371226022985  0.9755281  0.334246194708562
##   0.9     0.4       0.454082569547592  0.9428660  0.352736973938333
##   0.9     0.5       0.683146786422000  0.9099057  0.545698063824204
##   0.9     0.6       0.853799524220365  0.8908578  0.687471307155524
##   0.9     0.7       0.996554487181447  0.8681131  0.799081951036427
##   0.9     0.8       1.189150196088896  0.8310386  0.904666273874281
##   0.9     0.9       1.325023031376493  0.8108461  0.971115013351703
##   0.9     1.0       1.440425659865060  0.7970872  1.023250088543972
##   1.0     0.0       1.785732296510209        NaN  1.489257066900703
##   1.0     0.1       1.210289350758559  1.0000000  1.009563363604153
##   1.0     0.2       0.680633224212471  0.9952134  0.561386888244268
##   1.0     0.3       0.411609201381028  0.9707856  0.332525386644738
##   1.0     0.4       0.487563237782785  0.9361413  0.379937853755743
##   1.0     0.5       0.730474343324566  0.9028558  0.585302034727792
##   1.0     0.6       0.913478830557609  0.8837013  0.737527305884699
##   1.0     0.7       1.069335556432365  0.8594749  0.859241735565904
##   1.0     0.8       1.282378448905773  0.8194544  0.977475632840046
##   1.0     0.9       1.425225328986141  0.7997574  1.048299162833104
##   1.0     1.0       1.546098715823611  0.7866793  1.102573979073727
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.1 and lambda = 1.
plot(enetFit2)

The selected tuning parameters are fraction = 0.1 and lambda = 1, with resampled metrics RMSE = 1.2103, Rsquared = 1.0000, and MAE = 1.0096. A resampled R2 of exactly 1 is a warning sign rather than a success: the predictor matrix x = chermicalTrain still contains the Yield column, so the model can recover the response trivially.
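
The fix is to drop the response from the predictor matrix before training. A sketch of the corrected call (rerunning it will produce lower, more realistic resampled metrics than those shown above):

set.seed(2425)  # For reproducibility

enetFit2Fixed <- train(
                  x = chermicalTrain |> select(-Yield),  # response excluded from predictors
                  y = chermicalTrain$Yield,
                  method = 'enet',
                  metric = 'Rsquared',
                  tuneGrid = expand.grid(.fraction = seq(0, 1, by=0.1),
                                         .lambda = seq(0, 1, by=0.1)),
                  trControl = trainControl(method='cv'),
                  preProcess = c('center','scale')
                  )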

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

enet2Pred <- predict(enetFit2,newdata = chermicalTest)
postResample(pred = enet2Pred, obs = chermicalTest$Yield)
##     RMSE Rsquared      MAE 
## 1.359568 1.000000 1.078993

The test metrics repeat the pattern: an R-squared of 1.000 on unseen data does not mean the model perfectly explains the variance; it means the response leaked into the predictors, so the model is effectively reading Yield back out. The test RMSE (1.3596) and MAE (1.0790) are higher than the resampled values (1.2103 and 1.0096), the usual increase when moving from training to test data, but none of these numbers are meaningful estimates of predictive performance.

A proper comparison between the resampled and test metrics requires refitting without Yield in x, as sketched above, and then rerunning predict() and postResample() on chermicalTest.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

The Overall score in the varImp() output from the caret package represents the relative importance of each variable in the model. This score quantifies how much a variable contributes to the model’s predictive performance.

predictorImportance <- varImp(enetFit2)$importance
predictorImportance$Name <- rownames(predictorImportance)

predictorImportance <- predictorImportance[order(-predictorImportance$Overall),]
# Drop the top row: presumably Yield itself, which scores highest only because
# it was left in the predictor matrix.
predictorImportance <- predictorImportance[2:nrow(predictorImportance),]
rownames(predictorImportance) <- NULL

predictorImportance |> head(20) |> kable() |> kable_styling() |> kable_classic(full_width = F)
Overall Name
42.08617 ManufacturingProcess32
37.87971 BiologicalMaterial06
30.91456 BiologicalMaterial03
30.15037 ManufacturingProcess13
28.96515 ManufacturingProcess36
28.49918 BiologicalMaterial02
26.31269 ManufacturingProcess31
25.75545 ManufacturingProcess17
24.45221 ManufacturingProcess09
24.13562 BiologicalMaterial12
23.57180 BiologicalMaterial04
22.55198 ManufacturingProcess33
20.19579 ManufacturingProcess06
19.89309 BiologicalMaterial01
18.25433 ManufacturingProcess29
17.76085 BiologicalMaterial11
15.34550 BiologicalMaterial08
13.28445 ManufacturingProcess27
12.68791 ManufacturingProcess11
11.32475 ManufacturingProcess30

Total importance score contributed by each variable class:

bio <- predictorImportance |> filter(grepl("Biological", Name)) |> summarise(Total_Score = sum(Overall))
manu <- predictorImportance |> filter(grepl("Manufacturing", Name)) |> summarise(Total_Score = sum(Overall))
bio$VariableType <- "Biological"
manu$VariableType <- "Manufacturing"

bind_rows(bio, manu) |> kable() |> kable_styling() |> kable_classic(full_width = F)
Total_Score VariableType
223.3147 Biological
410.0964 Manufacturing

These aggregated scores provide a clear understanding of the relative contribution of different types of variables to the model’s overall performance. Higher scores indicate that a specific type of variable is more influential in predicting the target outcome. In this case, “Manufacturing” variables appear to contribute more substantially than “Biological” ones, highlighting their greater predictive importance.
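
Note that the totals partly reflect how many variables fall in each class (the raw data contain 45 process and 12 biological predictors). A per-variable average is a fairer comparison; a sketch:

predictorImportance |>
  mutate(VariableType = if_else(grepl("Biological", Name), "Biological", "Manufacturing")) |>
  group_by(VariableType) |>
  summarise(Mean_Score = mean(Overall), N_Variables = n())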

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Exploring the relationships between key predictors and the response variable is essential for understanding how to improve yield in future manufacturing processes.

top10Predictors <- predictorImportance[1:10,]$Name

top10Data <- chermical_comp |> select(all_of(top10Predictors))
correlations <- cor(top10Data)
corrplot::corrplot(correlations, order = "hclust")
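
The plot above shows correlations among the top predictors themselves. To relate each one directly to the response, their correlations with Yield can be computed as well; a sketch:

round(cor(top10Data, chermical_comp$Yield), 3)  # each top predictor vs. Yield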

BiologicalMaterial06 and BiologicalMaterial02, for example, show a strong positive correlation (0.954), indicating that these two measurements move together and likely reflect the same underlying raw-material property. Additionally, BiologicalMaterial03 is highly correlated with BiologicalMaterial06 (0.872), suggesting that their influence on yield overlaps rather than adds. These insights imply that fine-tuning the levels or conditions related to this cluster of biological inputs, rather than each input separately, could be a strategic approach to improving yield.

ManufacturingProcess32 shows moderate positive correlations with both BiologicalMaterial06 (0.600) and BiologicalMaterial02 (0.629), pointing to a potential combined effect that could be optimized for better yield results. This relationship suggests that ManufacturingProcess32 may interact effectively with these biological materials and should be a focus for further refinement. Conversely, ManufacturingProcess36 has a strong negative correlation with ManufacturingProcess32 (-0.774) and moderate negative correlations with BiologicalMaterial06 (-0.510) and BiologicalMaterial03 (-0.464). These negative relationships indicate that ManufacturingProcess36 might counteract the positive impacts of other processes, and changes to it should be managed carefully to avoid decreasing yield.

Notably, ManufacturingProcess13 displays a strong negative correlation with ManufacturingProcess09 (-0.791), suggesting potential interference between these processes. This could imply that balancing or redesigning these processes to minimize adverse interactions could enhance overall efficiency and yield. On the other hand, processes like ManufacturingProcess31 and ManufacturingProcess17 show weaker correlations with the main variables, indicating a smaller direct influence on yield but suggesting possible secondary effects when combined with more impactful variables (e.g., a positive correlation between ManufacturingProcess17 and ManufacturingProcess13 at 0.782).

Prioritizing the optimization of BiologicalMaterial06, BiologicalMaterial02, and ManufacturingProcess32 could significantly impact the yield positively. At the same time, managing or revising ManufacturingProcess36 and mitigating interactions like those between ManufacturingProcess13 and ManufacturingProcess09 could prevent negative effects. The relationships identified here can guide process engineers to conduct targeted experiments and adjustments, leading to more consistent and higher-yielding manufacturing outcomes.