# Chapter 6
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
library(AppliedPredictiveModeling)
library(caret)
data(permeability)
# Filter near-zero variance predictors
nzv_cols <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv_cols]
# Check remaining predictors
ncol(filtered_fingerprints)
## [1] 388
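The near-zero variance filter above can be illustrated with a minimal base R sketch. It assumes the rule documented for caret's nearZeroVar (a predictor is flagged when the ratio of its most common to second most common value exceeds freqCut = 95/5 and its percentage of distinct values is at most uniqueCut = 10). The toy matrix and the helper names freq_ratio and pct_unique are made up for illustration; this is not caret's implementation.

```r
# Toy binary matrix (made-up values, not the fingerprints data)
x <- cbind(
  balanced = rep(c(0, 1), times = 50),   # 50 zeros, 50 ones
  rare     = c(rep(0, 98), rep(1, 2)),   # 98 zeros, 2 ones
  constant = rep(0, 100)                 # a single distinct value
)

# Ratio of the most common value's count to the second most common value's count
freq_ratio <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)    # only one distinct value
  as.numeric(counts[1] / counts[2])
}

# Distinct values as a percentage of the number of samples
pct_unique <- function(v) 100 * length(unique(v)) / length(v)

fr <- apply(x, 2, freq_ratio)
pu <- apply(x, 2, pct_unique)

# Thresholds assumed to match caret's documented defaults
flagged <- (fr > 95 / 5) & (pu <= 10)
names(which(flagged))
```

For sparse binary fingerprints, most flagged columns look like the "rare" column here: almost every compound shares the same value, so the column carries little information for a linear model.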
set.seed(123)
# Split Data (80/20 split)
training_rows <- createDataPartition(permeability, p = 0.8, list = FALSE)
train_x <- filtered_fingerprints[training_rows, ]
train_y <- permeability[training_rows]
test_x <- filtered_fingerprints[-training_rows, ]
test_y <- permeability[-training_rows]
# Tune PLS using 10-fold Cross-Validation
ctrl <- trainControl(method = "cv", number = 10)
pls_fit <- train(train_x, train_y,
                 method = "pls",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
print(pls_fit)
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
pls_pred <- predict(pls_fit, test_x)
test_results <- postResample(pred = pls_pred, obs = test_y)
print(test_results)
## RMSE Rsquared MAE
## 12.3486900 0.3244542 8.2881075
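The three numbers reported by postResample can be reproduced by hand. The sketch below uses made-up observed and predicted values (not the permeability test set) and assumes, as in caret's documentation, that Rsquared is the squared correlation between observed and predicted values.

```r
# Made-up values for illustration only
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 37)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
rsq  <- cor(obs, pred)^2             # squared correlation
mae  <- mean(abs(obs - pred))        # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

Because Rsquared is a squared correlation, it measures how well predictions track the observed ordering and spread, while RMSE and MAE measure the size of the errors in the units of the response.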
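# Tune the elastic net over a grid of ridge penalties (lambda) and lasso fractions
enet_grid <- expand.grid(lambda = seq(0, 0.1, length = 10),
                         fraction = seq(0.1, 1, length = 10))
# Note: no seed is set before this call, so the CV folds differ from those used for PLS
enet_fit <- train(train_x, train_y,
                  method = "enet",
                  tuneGrid = enet_grid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))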
# Compare PLS vs Enet
results <- resamples(list(PLS = pls_fit, ENet = enet_fit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: PLS, ENet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 5.940748 7.855195 8.101540 8.658301 8.587837 12.53980 0
## ENet 5.504259 7.317846 8.244954 8.067092 8.951861 11.20909 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 7.479049 10.662269 11.18215 11.53275 11.66271 16.49145 0
## ENet 7.188166 9.663742 11.86149 11.35483 12.47250 15.18330 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.2072535 0.3875424 0.5626977 0.5335956 0.6291758 0.8326590 0
## ENet 0.1557624 0.5202124 0.5401826 0.5579759 0.7192849 0.8020521 0
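I would not recommend this model as a replacement for laboratory experiments. The PLS model's test-set \(R^2\) of 0.32 means it explains only about a third of the variation in permeability for unseen compounds, which is too low for critical drug development decisions.
However, the elastic net model, whose cross-validated performance is slightly better than that of PLS (mean RMSE 11.35 vs. 11.53; mean \(R^2\) 0.56 vs. 0.53), can still be useful for initial screening. It can quickly rank thousands of molecules and identify the most promising ones. This allows the company to focus expensive lab resources on the best candidates, saving significant time and money.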
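The screening idea can be sketched as a simple ranking step. The predicted values below are random placeholders, and the names predicted, top_k, and new_fingerprints are hypothetical. In practice the predictions would come from something like predict(enet_fit, new_fingerprints) on compounds that have not been assayed.

```r
set.seed(1)

# Placeholder predictions for 1,000 hypothetical untested compounds
predicted <- setNames(runif(1000, min = 0, max = 50),
                      paste0("cmpd_", 1:1000))

top_k <- 50                                   # how many compounds the lab can test
ranked <- sort(predicted, decreasing = TRUE)  # highest predicted permeability first
shortlist <- head(ranked, top_k)

names(shortlist)[1:5]                         # first few candidates to send to the lab
```

A ranking like this only needs the model to order compounds roughly correctly, which is a weaker requirement than predicting each permeability value accurately.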
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The data frame ChemicalManufacturingProcess contains the percent yield for each of the 176 manufacturing runs in its first column, followed by the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process).
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
yield <- ChemicalManufacturingProcess[, 1]
predictors <- ChemicalManufacturingProcess[, -1]
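# Impute missing values (k-nearest neighbours), then center and scale.
# Note: the transformation is estimated on all 176 runs, before the train/test split.
preProcValues <- preProcess(predictors, method = c("knnImpute", "center", "scale"))
predictors_imputed <- predict(preProcValues, predictors)
# Drop near-zero variance predictors
nzv <- nearZeroVar(predictors_imputed)
predictors_final <- predictors_imputed[, -nzv]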
set.seed(100)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- predictors_final[trainIndex, ]
trainY <- yield[trainIndex]
testX <- predictors_final[-trainIndex, ]
testY <- yield[-trainIndex]
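The pre-processing above is estimated on every run before the split, so the test rows influence the centering, scaling, and imputation. A common alternative is to estimate these values on the training rows only and then apply them unchanged to the test rows. The base R sketch below shows the idea for centering and scaling on made-up data. The names x, mu, and sigma are illustrative, and imputation is left out for brevity. The permeability model earlier achieves a similar effect within resampling by passing preProc to train().

```r
set.seed(100)

# Made-up numeric data standing in for the process predictors
x <- matrix(rnorm(200 * 3, mean = 50, sd = 10), ncol = 3,
            dimnames = list(NULL, c("p1", "p2", "p3")))

train_idx <- sample(nrow(x), size = 160)
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

# Estimate centering and scaling values from the training rows only
mu    <- colMeans(x_train)
sigma <- apply(x_train, 2, sd)

# Apply the same training-set values to both sets
x_train_std <- scale(x_train, center = mu, scale = sigma)
x_test_std  <- scale(x_test,  center = mu, scale = sigma)

round(colMeans(x_test_std), 2)   # close to, but not exactly, zero
```

The test-set means are not exactly zero after the transformation, which is expected: the test rows are treated as new data that the transformation has never seen.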
# 10-fold Cross-Validation
ctrl <- trainControl(method = "cv", number = 10)
# Tuning Elastic Net
enet_model <- train(trainX, trainY,
                    method = "enet",
                    tuneLength = 10,
                    trControl = ctrl)
print(enet_model)
## Elasticnet
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 130, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0000000000 0.0500000 1.244172 0.6147662 1.0190661
## 0.0000000000 0.1555556 1.378006 0.5908371 1.0251543
## 0.0000000000 0.2611111 1.629785 0.5211871 1.1565962
## 0.0000000000 0.3666667 2.063473 0.4774618 1.3244840
## 0.0000000000 0.4722222 2.160662 0.4750796 1.3898004
## 0.0000000000 0.5777778 2.404458 0.4168164 1.5011728
## 0.0000000000 0.6833333 2.475839 0.3843046 1.5546255
## 0.0000000000 0.7888889 2.500736 0.3658578 1.5821895
## 0.0000000000 0.8944444 2.314532 0.3673867 1.5412759
## 0.0000000000 1.0000000 2.453173 0.3537257 1.5867729
## 0.0001000000 0.0500000 1.282831 0.6216407 1.0540552
## 0.0001000000 0.1555556 1.180879 0.6478936 0.9452422
## 0.0001000000 0.2611111 1.752144 0.5111475 1.1700379
## 0.0001000000 0.3666667 2.015971 0.4860865 1.2905975
## 0.0001000000 0.4722222 2.100301 0.4714937 1.3496796
## 0.0001000000 0.5777778 2.296038 0.4484741 1.4368979
## 0.0001000000 0.6833333 2.282253 0.4537758 1.4590883
## 0.0001000000 0.7888889 2.318584 0.4038397 1.5061445
## 0.0001000000 0.8944444 2.154628 0.3873881 1.4808128
## 0.0001000000 1.0000000 2.217656 0.3707322 1.5068437
## 0.0002371374 0.0500000 1.307395 0.6232738 1.0696962
## 0.0002371374 0.1555556 1.179558 0.6467529 0.9446982
## 0.0002371374 0.2611111 1.719137 0.5156104 1.1572297
## 0.0002371374 0.3666667 2.053713 0.4765594 1.2929202
## 0.0002371374 0.4722222 2.043527 0.4857594 1.3204481
## 0.0002371374 0.5777778 2.250374 0.4602902 1.4066560
## 0.0002371374 0.6833333 2.244924 0.4490510 1.4372834
## 0.0002371374 0.7888889 2.205340 0.4473841 1.4506640
## 0.0002371374 0.8944444 2.071967 0.4099308 1.4433598
## 0.0002371374 1.0000000 2.052717 0.3914470 1.4503995
## 0.0005623413 0.0500000 1.337836 0.6220587 1.0898499
## 0.0005623413 0.1555556 1.178271 0.6454398 0.9428255
## 0.0005623413 0.2611111 1.555914 0.5315350 1.1068567
## 0.0005623413 0.3666667 2.092342 0.4828709 1.2913178
## 0.0005623413 0.4722222 2.028369 0.4720576 1.3074858
## 0.0005623413 0.5777778 2.192522 0.4744410 1.3766541
## 0.0005623413 0.6833333 2.199284 0.4539179 1.4056591
## 0.0005623413 0.7888889 2.138985 0.4533779 1.4149333
## 0.0005623413 0.8944444 1.985111 0.4512411 1.3949056
## 0.0005623413 1.0000000 1.831980 0.4445396 1.3697527
## 0.0013335214 0.0500000 1.373894 0.6183196 1.1167896
## 0.0013335214 0.1555556 1.179770 0.6428216 0.9446678
## 0.0013335214 0.2611111 1.391956 0.5619567 1.0499720
## 0.0013335214 0.3666667 2.096368 0.5007512 1.2766981
## 0.0013335214 0.4722222 2.052396 0.4703423 1.3012134
## 0.0013335214 0.5777778 2.149779 0.4656834 1.3528830
## 0.0013335214 0.6833333 2.111065 0.4706828 1.3608942
## 0.0013335214 0.7888889 2.045666 0.4554102 1.3690642
## 0.0013335214 0.8944444 1.927122 0.4635826 1.3553869
## 0.0013335214 1.0000000 1.656891 0.5268702 1.2835751
## 0.0031622777 0.0500000 1.422443 0.6108503 1.1532527
## 0.0031622777 0.1555556 1.186479 0.6360492 0.9535622
## 0.0031622777 0.2611111 1.236631 0.6215180 0.9842247
## 0.0031622777 0.3666667 1.968068 0.5155178 1.2285808
## 0.0031622777 0.4722222 2.095974 0.4829087 1.2934736
## 0.0031622777 0.5777778 2.111817 0.4662882 1.3247299
## 0.0031622777 0.6833333 2.134380 0.4601318 1.3487188
## 0.0031622777 0.7888889 1.956551 0.4583141 1.3187903
## 0.0031622777 0.8944444 1.854844 0.4587990 1.3093527
## 0.0031622777 1.0000000 1.713267 0.4753231 1.2846475
## 0.0074989421 0.0500000 1.486661 0.5965208 1.2019673
## 0.0074989421 0.1555556 1.192979 0.6320828 0.9635728
## 0.0074989421 0.2611111 1.188122 0.6423965 0.9546659
## 0.0074989421 0.3666667 1.613137 0.5387629 1.1172995
## 0.0074989421 0.4722222 2.074451 0.5078931 1.2664775
## 0.0074989421 0.5777778 2.082881 0.4824161 1.2936573
## 0.0074989421 0.6833333 2.165292 0.4680606 1.3331812
## 0.0074989421 0.7888889 2.019454 0.4612526 1.3058598
## 0.0074989421 0.8944444 1.917719 0.4550955 1.2924390
## 0.0074989421 1.0000000 1.857229 0.4515885 1.2900188
## 0.0177827941 0.0500000 1.560915 0.5726787 1.2553997
## 0.0177827941 0.1555556 1.196480 0.6346837 0.9777529
## 0.0177827941 0.2611111 1.186003 0.6372318 0.9505554
## 0.0177827941 0.3666667 1.293343 0.5963794 1.0067460
## 0.0177827941 0.4722222 1.774615 0.5249655 1.1678427
## 0.0177827941 0.5777778 2.047195 0.5075470 1.2608838
## 0.0177827941 0.6833333 2.107800 0.4914367 1.2955035
## 0.0177827941 0.7888889 2.146565 0.4801898 1.3156829
## 0.0177827941 0.8944444 2.043031 0.4746369 1.2953451
## 0.0177827941 1.0000000 2.002625 0.4689325 1.2930889
## 0.0421696503 0.0500000 1.626717 0.5441282 1.3055139
## 0.0421696503 0.1555556 1.251640 0.6248573 1.0325093
## 0.0421696503 0.2611111 1.191162 0.6322101 0.9617146
## 0.0421696503 0.3666667 1.202450 0.6284709 0.9668562
## 0.0421696503 0.4722222 1.499547 0.5528325 1.0845055
## 0.0421696503 0.5777778 1.843969 0.5156925 1.1957482
## 0.0421696503 0.6833333 1.987115 0.5088552 1.2452922
## 0.0421696503 0.7888889 2.044293 0.5009766 1.2723793
## 0.0421696503 0.8944444 2.106824 0.4928918 1.2981538
## 0.0421696503 1.0000000 2.098588 0.4884785 1.3005768
## 0.1000000000 0.0500000 1.674633 0.5173042 1.3423952
## 0.1000000000 0.1555556 1.331865 0.6184366 1.0863401
## 0.1000000000 0.2611111 1.195738 0.6292752 0.9752124
## 0.1000000000 0.3666667 1.200488 0.6270371 0.9656296
## 0.1000000000 0.4722222 1.343687 0.5756409 1.0237337
## 0.1000000000 0.5777778 1.633287 0.5344339 1.1358659
## 0.1000000000 0.6833333 1.829592 0.5146197 1.2044586
## 0.1000000000 0.7888889 1.919991 0.5090075 1.2367938
## 0.1000000000 0.8944444 1.999040 0.5036788 1.2637416
## 0.1000000000 1.0000000 2.097514 0.4982134 1.2948349
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1555556 and lambda
## = 0.0005623413.
enet_pred <- predict(enet_model, testX)
postResample(pred = enet_pred, obs = testY)
## RMSE Rsquared MAE
## 0.9969012 0.6350374 0.7999766
# Calculate and plot importance
importance <- varImp(enet_model)
plot(importance, top = 20)
The importance plot shows that ManufacturingProcess32 and ManufacturingProcess13 rank highest. Because these are controllable process settings, the company can try to raise yield, and therefore revenue, by tuning them, keeping in mind that variable importance indicates association rather than a proven causal effect.
- Process Optimization: The company should move ManufacturingProcess32 toward its best-performing level. If its relationship with yield is positive, increasing the setting should raise yield, worth roughly $100,000 per batch for each 1% gained.
- Early Warning: For fixed inputs such as BiologicalMaterial06, the model can act as an early warning system. If the raw material is predicted to give a low yield, engineers can adjust the manufacturing settings in advance to compensate and avoid a costly low-yield batch.
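The revenue arithmetic behind these recommendations can be written down directly. The sketch below assumes the figure quoted in the problem statement (about $100,000 per batch for each 1% of yield) and treats it as linear. The function name revenue_gain and its arguments are hypothetical.

```r
# Assumption from the problem statement: each 1% of yield is worth
# approximately $100,000 per batch
revenue_gain <- function(yield_gain_pct, value_per_pct = 1e5) {
  yield_gain_pct * value_per_pct
}

revenue_gain(0.5)   # half a percentage point of extra yield
revenue_gain(1)     # one full percentage point
```

On the same scale, the test-set RMSE of roughly 1.0 percentage point corresponds to about $100,000 per batch of prediction uncertainty, so small predicted improvements should be confirmed with trial runs before changing the process.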