6.2. Developing a model to predict permeability (see Sect. 1.4) could save sig- nificant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a su!cient permeability to become a drug:
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(permeability)
x <- fingerprints
y <- permeability
nzv <- nearZeroVar(x)
x_filtered <- x[, -nzv]
ncol(x_filtered)
## [1] 388
388 predictors remain
set.seed(123)
trainIndex <- createDataPartition(y, p = .8, list = FALSE)
x_train <- x_filtered[trainIndex, ]
x_test <- x_filtered[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(
x_train, y_train,
method = "pls",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
The optimal number of latent variables is 6, selected based on the lowest RMSE.
The corresponding resampled estimate of R² = 0.5336 (approximately 0.534).
pred <- predict(pls_model, x_test)
postResample(pred, y_test)
## RMSE Rsquared MAE
## 12.3486900 0.3244542 8.2881075
The response was predicted for the test set using the tuned PLS model.
The test set estimate of R² = 0.324.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-10
ridge_model <- train(
x_train, y_train,
method = "glmnet",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 1, length = 20)),
trControl = ctrl
)
lasso_model <- train(
x_train, y_train,
method = "glmnet",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 20)),
trControl = ctrl
)
ridge_model
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100000 11.33669 0.5731857 8.54889
## 0.05357895 11.33669 0.5731857 8.54889
## 0.10615789 11.33669 0.5731857 8.54889
## 0.15873684 11.33669 0.5731857 8.54889
## 0.21131579 11.33669 0.5731857 8.54889
## 0.26389474 11.33669 0.5731857 8.54889
## 0.31647368 11.33669 0.5731857 8.54889
## 0.36905263 11.33669 0.5731857 8.54889
## 0.42163158 11.33669 0.5731857 8.54889
## 0.47421053 11.33669 0.5731857 8.54889
## 0.52678947 11.33669 0.5731857 8.54889
## 0.57936842 11.33669 0.5731857 8.54889
## 0.63194737 11.33669 0.5731857 8.54889
## 0.68452632 11.33669 0.5731857 8.54889
## 0.73710526 11.33669 0.5731857 8.54889
## 0.78968421 11.33669 0.5731857 8.54889
## 0.84226316 11.33669 0.5731857 8.54889
## 0.89484211 11.33669 0.5731857 8.54889
## 0.94742105 11.33669 0.5731857 8.54889
## 1.00000000 11.33669 0.5731857 8.54889
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
lasso_model
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 120, 119, 118, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100000 12.74117 0.4723781 9.176338
## 0.05357895 12.74117 0.4723781 9.176338
## 0.10615789 12.69336 0.4741610 9.150030
## 0.15873684 12.09979 0.5050067 8.816790
## 0.21131579 11.76073 0.5248985 8.680782
## 0.26389474 11.59842 0.5329640 8.614591
## 0.31647368 11.52846 0.5358785 8.577002
## 0.36905263 11.48823 0.5381836 8.545427
## 0.42163158 11.45519 0.5399699 8.508475
## 0.47421053 11.41375 0.5428282 8.466997
## 0.52678947 11.36201 0.5462513 8.427929
## 0.57936842 11.36583 0.5459328 8.442941
## 0.63194737 11.39464 0.5439284 8.474881
## 0.68452632 11.43031 0.5417073 8.498274
## 0.73710526 11.47169 0.5388543 8.509875
## 0.78968421 11.50303 0.5370308 8.503900
## 0.84226316 11.52698 0.5356889 8.486556
## 0.89484211 11.53477 0.5347036 8.459214
## 0.94742105 11.52859 0.5347867 8.427965
## 1.00000000 11.52381 0.5349053 8.402687
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5267895.
Additional models were built using Ridge and Lasso regression.
Results:
Answer:
Yes. Ridge regression performed better than the PLS
model, with the highest
resampled R² = 0.573. Lasso also performed slightly
better than PLS but worse than Ridge.
postResample(predict(ridge_model, x_test), y_test)
## RMSE Rsquared MAE
## 11.0203929 0.3266984 7.6511575
postResample(predict(lasso_model, x_test), y_test)
## RMSE Rsquared MAE
## 10.9899991 0.3902119 7.3778673
Test set performance:
Although the Lasso model performed best, the test set R² values are
relatively low,
indicating limited predictive accuracy.
No. The predictive performance is not strong enough to replace the
permeability
laboratory experiment. The models may be useful for screening compounds,
but not
as a full replacement for laboratory measurements.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the re- lationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro- cess. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
x <- ChemicalManufacturingProcess[, -1]
y <- ChemicalManufacturingProcess$Yield
preProc <- preProcess(x, method = "medianImpute")
x_imputed <- predict(preProc, x)
set.seed(123)
trainIndex <- createDataPartition(y, p = .8, list = FALSE)
x_train <- x_imputed[trainIndex, ]
x_test <- x_imputed[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(
x_train, y_train,
method = "pls",
preProcess = c("center","scale"),
tuneLength = 20,
trControl = ctrl
)
pls_model
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.469135 0.4413084 1.1554181
## 2 2.041088 0.5084869 1.2301350
## 3 1.267233 0.5914708 0.9759364
## 4 1.307502 0.6017702 1.0043410
## 5 1.751768 0.5622009 1.1346052
## 6 1.783366 0.5694847 1.1446377
## 7 1.906613 0.5578632 1.1838996
## 8 1.966814 0.5397329 1.2163865
## 9 2.161597 0.5137488 1.2932272
## 10 2.497343 0.4954145 1.3993903
## 11 2.831450 0.4626421 1.5298526
## 12 3.363236 0.4204766 1.6920564
## 13 3.767530 0.3965149 1.8190061
## 14 3.934273 0.3878164 1.8747674
## 15 4.039276 0.3867160 1.8974451
## 16 4.106236 0.3838261 1.9225634
## 17 4.104890 0.3833394 1.9193643
## 18 4.118901 0.3826933 1.9284710
## 19 4.124366 0.3823171 1.9367371
## 20 4.235183 0.3831376 1.9693438
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
A Partial Least Squares (PLS) model was trained using 10-fold cross-validation.
The optimal number of components was 3, selected based on the lowest RMSE.
The corresponding performance metrics were:
The optimal value of the performance metric occurs at 3 latent variables, with a cross-validated R² = 0.591.
pred <- predict(pls_model, x_test)
postResample(pred, y_test)
## RMSE Rsquared MAE
## 1.3836482 0.4658466 1.1737971
The test set performance metrics are:
The test set performance is worse than the resampled training performance:
The test set R² is 0.466, which is lower than the resampled training R² of 0.591, indicating a decrease in predictive performance on new data.
print("The most important predictors are:")
## [1] "The most important predictors are:"
varImp(pls_model)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 88.56
## ManufacturingProcess13 88.43
## ManufacturingProcess09 88.21
## ManufacturingProcess36 87.39
## ManufacturingProcess06 70.42
## BiologicalMaterial06 62.98
## ManufacturingProcess33 62.79
## BiologicalMaterial03 61.77
## BiologicalMaterial02 61.50
## BiologicalMaterial08 60.52
## ManufacturingProcess11 57.13
## BiologicalMaterial11 55.73
## BiologicalMaterial12 55.64
## BiologicalMaterial01 53.52
## BiologicalMaterial04 51.96
## ManufacturingProcess28 49.54
## ManufacturingProcess12 46.99
## ManufacturingProcess37 45.75
## BiologicalMaterial10 42.94
Most of the top predictors are manufacturing process variables, with only a few biological material variables appearing in the top list.
The manufacturing process predictors dominate the list of important variables, suggesting that adjusting process conditions may have the greatest impact on improving yield.
top_vars <- varImp(pls_model)$importance
top_vars$Predictor <- rownames(top_vars)
top_vars <- top_vars[order(-top_vars$Overall), ]
# top 10 names
top_names <- top_vars$Predictor[1:10]
top_vars[1:10, ]
## Overall Predictor
## ManufacturingProcess32 100.00000 ManufacturingProcess32
## ManufacturingProcess17 88.55696 ManufacturingProcess17
## ManufacturingProcess13 88.42528 ManufacturingProcess13
## ManufacturingProcess09 88.21006 ManufacturingProcess09
## ManufacturingProcess36 87.38600 ManufacturingProcess36
## ManufacturingProcess06 70.42444 ManufacturingProcess06
## BiologicalMaterial06 62.97962 BiologicalMaterial06
## ManufacturingProcess33 62.78626 ManufacturingProcess33
## BiologicalMaterial03 61.76596 BiologicalMaterial03
## BiologicalMaterial02 61.50390 BiologicalMaterial02
# scatter plots
par(mfrow = c(2, 5))
for (v in top_names) {
plot(x_train[, v], y_train, xlab = v, ylab = "Yield", main = v)
abline(lm(y_train ~ x_train[, v]), col = "red")
}
The plots show that several manufacturing process variables have clear linear relationships with yield. Some predictors such as ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess06 show positive relationships, indicating that increasing these values tends to increase yield. Others such as ManufacturingProcess17 and ManufacturingProcess13 show negative relationships, suggesting that higher values may decrease yield.
Biological material variables such as BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial02 also show moderate positive relationships with yield, indicating that raw material quality affects final output.
This information can help improve yield by: