(see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug.
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.3
data(permeability)
The matrix fingerprints contains the 1,107 binary
molecular predictors for the 165 compounds, while
permeability contains the response.
The fingerprint predictors indicate the presence or absence of
substructures of a molecule and are often sparse. Filter out predictors
that have low frequencies using the nearZeroVar function
from the caret package.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Identify near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
# Remove them
filtered <- fingerprints[, -nzv]
# Number of predictors left
ncol(filtered)
## [1] 388
Question: How many predictors are left for modeling?
After filtering out the near-zero variance predictors, 388 predictors remain for modeling.
Split the data into training and test sets
Pre-process the data
Tune a Partial Least Squares (PLS) model
# 1. Remove near-zero variance predictors
nzv <- nearZeroVar(fingerprints)
x <- fingerprints[, -nzv]
y <- permeability
# 2. Train/Test Split (80/20)
set.seed(123)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[trainIndex, ]
x_test <- x[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
# 3. Train Control (Cross-validation)
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3
)
# 4. Train PLS Model
set.seed(123)
pls_model <- train(
x = x_train,
y = y_train,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
# 5. View Results
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.41180 0.3378787 10.274379
## 2 11.62287 0.5290809 8.372257
## 3 11.66671 0.5359195 8.953309
## 4 11.93005 0.5130401 9.148765
## 5 11.84554 0.5216169 8.929995
## 6 11.76362 0.5223236 8.709542
## 7 11.72222 0.5237452 9.033260
## 8 11.78689 0.5267913 9.116499
## 9 11.97219 0.5264258 9.202549
## 10 12.33420 0.5049110 9.380367
## 11 12.46814 0.4910466 9.524606
## 12 12.66410 0.4792936 9.705490
## 13 12.68945 0.4736903 9.653963
## 14 12.89769 0.4613160 9.796857
## 15 13.05553 0.4563152 9.991041
## 16 13.16973 0.4484492 10.035515
## 17 13.39867 0.4385930 10.155347
## 18 13.58537 0.4324732 10.335175
## 19 13.60721 0.4309564 10.336521
## 20 13.69992 0.4292043 10.425633
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pls_model)
Question:
How many latent variables are optimal?
Two latent variables are optimal.
What is the corresponding resampled estimate of R²?
The corresponding resampled estimate of R² is 0.529.
Using more than 2 components worsens RMSE. A resampled R² of roughly 0.53 indicates moderate predictive strength: the first few latent variables capture the useful structure, while additional components mostly add noise and invite overfitting.
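The chosen model and its resampled metrics can also be pulled out programmatically rather than read off the table; a small sketch using standard caret accessors:
# Extract the selected number of components and its resampled metrics
best_ncomp <- pls_model$bestTune$ncomp
subset(pls_model$results, ncomp == best_ncomp)[, c("ncomp", "RMSE", "Rsquared")]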
Predict the response for the test set.
# Predict on test set
pred <- predict(pls_model, newdata = x_test)
# Compute R^2
R2(pred, y_test)
## [1] 0.2960035
Question: What is the test set estimate of R²?
The test set estimate of R² is 0.296.
This is considerably weaker than the resampled training estimate: only about 29.6% of the variability in permeability is explained on new data.
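For a fuller picture of test-set performance, caret's postResample reports RMSE, R², and MAE in one call (output not shown here):
# Full test-set metrics: RMSE, R^2, and MAE
postResample(pred, y_test)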
Try building other models discussed in this chapter.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.3
## Loading required package: Matrix
## Loaded glmnet 4.1-10
set.seed(123)
# Train control
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3
)
# Grid for alpha (0 = Ridge, 1 = Lasso)
grid <- expand.grid(
alpha = seq(0, 1, length = 5),
lambda = seq(0.0001, 1, length = 20)
)
# Train Elastic Net (includes Ridge & Lasso)
enet_model <- train(
x = x_train,
y = y_train,
method = "glmnet",
preProcess = c("center", "scale"),
tuneGrid = grid,
trControl = ctrl
)
enet_model
## glmnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.00 0.00010000 11.50094 0.5550307 8.611059
## 0.00 0.05272632 11.50094 0.5550307 8.611059
## 0.00 0.10535263 11.50094 0.5550307 8.611059
## 0.00 0.15797895 11.50094 0.5550307 8.611059
## 0.00 0.21060526 11.50094 0.5550307 8.611059
## 0.00 0.26323158 11.50094 0.5550307 8.611059
## 0.00 0.31585789 11.50094 0.5550307 8.611059
## 0.00 0.36848421 11.50094 0.5550307 8.611059
## 0.00 0.42111053 11.50094 0.5550307 8.611059
## 0.00 0.47373684 11.50094 0.5550307 8.611059
## 0.00 0.52636316 11.50094 0.5550307 8.611059
## 0.00 0.57898947 11.50094 0.5550307 8.611059
## 0.00 0.63161579 11.50094 0.5550307 8.611059
## 0.00 0.68424211 11.50094 0.5550307 8.611059
## 0.00 0.73686842 11.50094 0.5550307 8.611059
## 0.00 0.78949474 11.50094 0.5550307 8.611059
## 0.00 0.84212105 11.50094 0.5550307 8.611059
## 0.00 0.89474737 11.50094 0.5550307 8.611059
## 0.00 0.94737368 11.50094 0.5550307 8.611059
## 0.00 1.00000000 11.50094 0.5550307 8.611059
## 0.25 0.00010000 12.44093 0.4697850 9.335061
## 0.25 0.05272632 12.44093 0.4697850 9.335061
## 0.25 0.10535263 12.44093 0.4697850 9.335061
## 0.25 0.15797895 12.44093 0.4697850 9.335061
## 0.25 0.21060526 12.44093 0.4697850 9.335061
## 0.25 0.26323158 12.44093 0.4697850 9.335061
## 0.25 0.31585789 12.44093 0.4697850 9.335061
## 0.25 0.36848421 12.44093 0.4697850 9.335061
## 0.25 0.42111053 12.43650 0.4702610 9.339461
## 0.25 0.47373684 12.32850 0.4760365 9.274947
## 0.25 0.52636316 12.21407 0.4826385 9.194501
## 0.25 0.57898947 12.11411 0.4886311 9.120716
## 0.25 0.63161579 12.02817 0.4940929 9.058284
## 0.25 0.68424211 11.94918 0.4994033 9.003817
## 0.25 0.73686842 11.89003 0.5035086 8.956694
## 0.25 0.78949474 11.83954 0.5068923 8.916022
## 0.25 0.84212105 11.79356 0.5100062 8.876727
## 0.25 0.89474737 11.75597 0.5126325 8.847553
## 0.25 0.94737368 11.72721 0.5147858 8.827178
## 0.25 1.00000000 11.70638 0.5164935 8.812772
## 0.50 0.00010000 12.54719 0.4669426 9.336994
## 0.50 0.05272632 12.54719 0.4669426 9.336994
## 0.50 0.10535263 12.54719 0.4669426 9.336994
## 0.50 0.15797895 12.54719 0.4669426 9.336994
## 0.50 0.21060526 12.54617 0.4670586 9.344031
## 0.50 0.26323158 12.36518 0.4763093 9.267754
## 0.50 0.31585789 12.19017 0.4860831 9.145771
## 0.50 0.36848421 12.07411 0.4934721 9.057372
## 0.50 0.42111053 11.99143 0.4982830 9.005146
## 0.50 0.47373684 11.91146 0.5036371 8.967438
## 0.50 0.52636316 11.85673 0.5073443 8.946073
## 0.50 0.57898947 11.82243 0.5098871 8.929842
## 0.50 0.63161579 11.78759 0.5126383 8.903944
## 0.50 0.68424211 11.75266 0.5153872 8.874579
## 0.50 0.73686842 11.72536 0.5176981 8.838931
## 0.50 0.78949474 11.69937 0.5197057 8.806773
## 0.50 0.84212105 11.68309 0.5208465 8.775873
## 0.50 0.89474737 11.66503 0.5223515 8.740908
## 0.50 0.94737368 11.65091 0.5235038 8.706791
## 0.50 1.00000000 11.64168 0.5242318 8.672372
## 0.75 0.00010000 12.57926 0.4665146 9.313249
## 0.75 0.05272632 12.57926 0.4665146 9.313249
## 0.75 0.10535263 12.57926 0.4665146 9.313249
## 0.75 0.15797895 12.50040 0.4697638 9.296534
## 0.75 0.21060526 12.24796 0.4834699 9.159914
## 0.75 0.26323158 12.11191 0.4912944 9.081064
## 0.75 0.31585789 11.99829 0.4986558 9.030641
## 0.75 0.36848421 11.91253 0.5039134 8.999413
## 0.75 0.42111053 11.85163 0.5084025 8.968858
## 0.75 0.47373684 11.79874 0.5125141 8.921969
## 0.75 0.52636316 11.74972 0.5165935 8.869784
## 0.75 0.57898947 11.72802 0.5183373 8.823424
## 0.75 0.63161579 11.70867 0.5201094 8.769415
## 0.75 0.68424211 11.70053 0.5206494 8.723391
## 0.75 0.73686842 11.69927 0.5207255 8.680964
## 0.75 0.78949474 11.71297 0.5193522 8.651382
## 0.75 0.84212105 11.72929 0.5177082 8.628225
## 0.75 0.89474737 11.75141 0.5155972 8.615455
## 0.75 0.94737368 11.77555 0.5135301 8.607816
## 0.75 1.00000000 11.80346 0.5115089 8.603532
## 1.00 0.00010000 12.70805 0.4628577 9.379954
## 1.00 0.05272632 12.70805 0.4628577 9.379954
## 1.00 0.10535263 12.69960 0.4632510 9.382373
## 1.00 0.15797895 12.34936 0.4814712 9.231227
## 1.00 0.21060526 12.15841 0.4921499 9.159221
## 1.00 0.26323158 11.99932 0.5017959 9.097477
## 1.00 0.31585789 11.88860 0.5084398 9.049415
## 1.00 0.36848421 11.82242 0.5128029 8.975038
## 1.00 0.42111053 11.77697 0.5161822 8.901115
## 1.00 0.47373684 11.76328 0.5173399 8.838166
## 1.00 0.52636316 11.76176 0.5169860 8.781735
## 1.00 0.57898947 11.77816 0.5147976 8.741698
## 1.00 0.63161579 11.81643 0.5108173 8.722938
## 1.00 0.68424211 11.85678 0.5068551 8.715327
## 1.00 0.73686842 11.90351 0.5027353 8.715672
## 1.00 0.78949474 11.95475 0.4987754 8.718805
## 1.00 0.84212105 11.98542 0.4965526 8.720737
## 1.00 0.89474737 11.99268 0.4960834 8.716083
## 1.00 0.94737368 11.99313 0.4962729 8.708970
## 1.00 1.00000000 11.98820 0.4971179 8.697711
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
plot(enet_model)
# Extract the best tuning parameters
enet_model$bestTune
## alpha lambda
## 20 0 1
# Test set performance
# Predictions
pred_enet <- predict(enet_model, newdata = x_test)
# Performance
postResample(pred_enet, y_test)
## RMSE Rsquared MAE
## 11.0203929 0.3266984 7.6511575
Question: Do any models have better predictive performance?
Penalized regression models were applied to improve upon the PLS results. Cross-validation identified ridge regression (alpha = 0) as the best-performing model, suggesting that the response is influenced by many predictors with small, distributed effects rather than a few dominant ones. The model achieved a resampled R² of about 0.56 and a test set R² of roughly 0.33, a modest improvement over the PLS model (~0.30). While regularization helped reduce overfitting and improved generalization, the overall predictive performance remains limited. Note also that the selected lambda = 1 sits at the upper boundary of the tuning grid, so the ridge penalty may not have been fully explored.
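Since lambda = 1 lies on the grid boundary, a follow-up fit over a wider range would confirm whether more shrinkage helps. A sketch (the lambda range here is illustrative, not tuned):
# Re-tune ridge (alpha = 0) over a wider, illustrative lambda range
ridge_grid <- expand.grid(alpha = 0, lambda = seq(0.1, 20, length = 40))
set.seed(123)
ridge_wide <- train(
  x = x_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = ridge_grid,
  trControl = ctrl
)
ridge_wide$bestTune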
Would you recommend any of your models to replace the permeability laboratory experiment?
I would not recommend replacing the permeability laboratory experiment with the models developed in this analysis. Although ridge regression demonstrated a modest improvement over the PLS model, the test set R² of approximately 0.33 indicates limited predictive accuracy and insufficient explanatory power. Such performance is inadequate for high-stakes decision-making in pharmaceutical development, where precise and reliable measurements are essential. Nevertheless, these models may still serve a valuable role as preliminary screening tools to prioritize candidate compounds, thereby improving efficiency while maintaining the necessity of experimental validation.
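Purely as an illustration of the screening idea (hypothetical usage, not part of the original analysis), compounds can be ranked by predicted permeability so that only the top candidates proceed to the laboratory:
# Rank held-out compounds by predicted permeability, highest first
screen <- data.frame(predicted = as.numeric(pred_enet),
                     observed = as.numeric(y_test))
head(screen[order(-screen$predicted), ])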
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. The objective is to understand the relationship between:
Biological measurements of raw materials (predictors)
Manufacturing process measurements (predictors)
Product yield (response)
Biological predictors cannot be changed, but process predictors can. Improving product yield by 1% increases revenue by approximately $100,000 per batch.
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains 57 predictors (12
biological + 45 process variables) for 176 runs. The variable
yield contains the percent yield.
A small percentage of predictor values are missing.
# Extract predictors
processPredictors <- ChemicalManufacturingProcess[, -1] # remove yield column
yield <- ChemicalManufacturingProcess$Yield
# Check missing values
sum(is.na(processPredictors))
## [1] 106
# Apply kNN imputation
preProcValues <- preProcess(processPredictors, method = "knnImpute")
# Transform the data
processPredictors_imputed <- predict(preProcValues, processPredictors)
# Verify no missing values remain
sum(is.na(processPredictors_imputed))
## [1] 0
Task: Use an imputation method to fill in missing values (see Sect. 3.8).
The predictor dataset initially contained 106 missing values. These were addressed using k-nearest neighbors (kNN) imputation, which replaces missing entries based on similar observations in the dataset. After preprocessing, no missing values remained. This approach preserves all observations while maintaining the underlying data structure, making the dataset suitable for subsequent predictive modeling.
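One caveat worth noting: to my understanding, caret's knnImpute computes neighbor distances on standardized data, so centering and scaling are added automatically and the imputed matrix is returned on the standardized scale. This is easy to verify:
# The method list should show knnImpute plus the auto-added center/scale
preProcValues$method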
Split the data into training and test sets
Pre-process the data
Train and tune a model of your choice
library(caret)
library(glmnet)
set.seed(123)
# 1. Train/Test Split
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
x_train <- processPredictors_imputed[trainIndex, ]
x_test <- processPredictors_imputed[-trainIndex, ]
y_train <- yield[trainIndex]
y_test <- yield[-trainIndex]
# 2. Pre-process (center & scale)
preProc <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(preProc, x_train)
x_test <- predict(preProc, x_test)
# 3. Train & Tune Model (Elastic Net)
ctrl <- trainControl(method = "cv", number = 10)
model <- train(
x = x_train,
y = y_train,
method = "glmnet",
trControl = ctrl,
tuneLength = 10
)
# 4. Extract results
best <- model$results[
model$results$alpha == model$bestTune$alpha &
model$results$lambda == model$bestTune$lambda, ]
# 5. Print
print(model$bestTune)
## alpha lambda
## 58 0.6 0.1821943
print(best[, c("RMSE", "Rsquared")])
## RMSE Rsquared
## 58 1.13129 0.6244953
Question: What is the optimal value of the performance metric?
The optimal model produced a cross-validated RMSE of 1.131 and an R² of 0.624, with tuning parameters α = 0.6 and λ = 0.182, indicating an elastic net specification. These results suggest the model achieves a favorable trade-off between bias and variance and accounts for a meaningful proportion of the variability in yield.
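Mirroring the PLS workflow above, the tuning profile can be plotted to see how cross-validated RMSE varies across the lambda path for each alpha:
# Visualize cross-validated RMSE across the tuning grid
plot(model)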
Predict the response for the test set.
Questions:
What is the test set performance?
How does it compare to the resampled training performance?
The model produced a test set RMSE of 1.202 and an R² of 0.577. These results are slightly inferior to the cross-validated training performance (RMSE = 1.131, R² = 0.624), indicating minor overfitting. Nevertheless, the relatively small decline in performance suggests that the model maintains good generalization to unseen data.
# 1. Predict on test set
test_pred <- predict(model, newdata = x_test)
# 2. Compute performance metrics
test_RMSE <- RMSE(test_pred, y_test)
test_R2 <- R2(test_pred, y_test)
# 3. Show results
test_RMSE
## [1] 1.201836
test_R2
## [1] 0.5771784
# 4. Compare with training (CV) performance
train_best <- model$results[
model$results$alpha == model$bestTune$alpha &
model$results$lambda == model$bestTune$lambda, ]
train_best[, c("RMSE", "Rsquared")]
## RMSE Rsquared
## 58 1.13129 0.6244953
Questions:
varImp(model)
## glmnet variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.0000
## ManufacturingProcess09 61.1354
## ManufacturingProcess17 47.7024
## BiologicalMaterial06 27.1593
## ManufacturingProcess06 26.5485
## ManufacturingProcess37 22.0026
## ManufacturingProcess45 18.3213
## BiologicalMaterial05 18.1548
## ManufacturingProcess13 14.2742
## ManufacturingProcess36 14.0917
## ManufacturingProcess15 9.1167
## ManufacturingProcess44 8.7566
## ManufacturingProcess34 7.9396
## ManufacturingProcess39 3.3326
## ManufacturingProcess43 3.3286
## BiologicalMaterial07 1.5281
## ManufacturingProcess04 0.1659
## ManufacturingProcess40 0.0000
## ManufacturingProcess41 0.0000
## ManufacturingProcess38 0.0000
importance <- varImp(model, scale = TRUE)
plot(importance, top = 20)
Which predictors are most important?
The most important predictors are ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17, followed by BiologicalMaterial06, ManufacturingProcess06, and ManufacturingProcess37. Among these, ManufacturingProcess32 stands out with a substantially higher importance score than all other variables, indicating that it has the strongest influence on predicting yield.
Do biological or process predictors dominate?
Process predictors clearly dominate the model. The majority of the top-ranked variables are manufacturing process variables, with only a few biological predictors (such as BiologicalMaterial06 and BiologicalMaterial05) appearing among the most important. This suggests that yield is driven more by manufacturing conditions than by biological material characteristics.
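To put a number on that dominance claim, the top 20 predictors can be tallied by type; a small sketch using the varImp scores above:
# Count process vs. biological predictors among the 20 most important
imp <- varImp(model)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))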
Explore the relationships between top predictors and yield.
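A sketch of one way to explore these relationships with caret's featurePlot; the three predictor names are taken from the varImp ranking above:
# Scatter plots of the top three process predictors against yield
top_vars <- c("ManufacturingProcess32", "ManufacturingProcess09",
              "ManufacturingProcess17")
featurePlot(x = x_train[, top_vars], y = y_train,
            plot = "scatter", layout = c(3, 1))
# Correlations of these predictors with yield
cor(x_train[, top_vars], y_train)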
Question: How can this information help improve yield in future manufacturing runs?
The variable importance results show that certain manufacturing steps, especially ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17, have a strong effect on yield. Plotting these predictors against yield, as sketched above, shows which operating ranges are associated with better output.
This information can help improve yield by focusing control and adjustment on these key process variables. Since process predictors, unlike biological ones, can be changed, keeping them at levels associated with higher yield can make production more consistent and efficient, leading to better overall performance.