6.2 Permeability Prediction Developing a model to predict permeability

(see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have sufficient permeability to become a drug.

(a) Load the Data Start R and use these commands to load the data

library(AppliedPredictiveModeling) 
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.3
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the response.

(b) Feature Filtering

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse. Filter out predictors that have low frequencies using the nearZeroVar function from the caret package.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Identify near-zero variance predictors
nzv <- nearZeroVar(fingerprints)

# Remove them
filtered <- fingerprints[, -nzv]

# Number of predictors left
ncol(filtered)
## [1] 388

Question: How many predictors are left for modeling?

388 predictors were left after modelling.

(c) Model Training (PLS)

  • Split the data into training and test sets

  • Pre-process the data

  • Tune a Partial Least Squares (PLS) model

    # 1. Remove near-zero variance predictors
    nzv <- nearZeroVar(fingerprints)
    x <- fingerprints[, -nzv]
    y <- permeability
    
    # 2. Train/Test Split (80/20)
    set.seed(123)
    trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
    
    x_train <- x[trainIndex, ]
    x_test  <- x[-trainIndex, ]
    y_train <- y[trainIndex]
    y_test  <- y[-trainIndex]
    
    # 3. Train Control (Cross-validation)
    ctrl <- trainControl(
      method = "repeatedcv",
      number = 10,
      repeats = 3
    )
    
    # 4. Train PLS Model
    set.seed(123)
    pls_model <- train(
      x = x_train,
      y = y_train,
      method = "pls",
      preProcess = c("center", "scale"),
      tuneLength = 20,
      trControl = ctrl
    )
    
    # 5. View Results
    pls_model
    ## Partial Least Squares 
    ## 
    ## 133 samples
    ## 388 predictors
    ## 
    ## Pre-processing: centered (388), scaled (388) 
    ## Resampling: Cross-Validated (10 fold, repeated 3 times) 
    ## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   ncomp  RMSE      Rsquared   MAE      
    ##    1     13.41180  0.3378787  10.274379
    ##    2     11.62287  0.5290809   8.372257
    ##    3     11.66671  0.5359195   8.953309
    ##    4     11.93005  0.5130401   9.148765
    ##    5     11.84554  0.5216169   8.929995
    ##    6     11.76362  0.5223236   8.709542
    ##    7     11.72222  0.5237452   9.033260
    ##    8     11.78689  0.5267913   9.116499
    ##    9     11.97219  0.5264258   9.202549
    ##   10     12.33420  0.5049110   9.380367
    ##   11     12.46814  0.4910466   9.524606
    ##   12     12.66410  0.4792936   9.705490
    ##   13     12.68945  0.4736903   9.653963
    ##   14     12.89769  0.4613160   9.796857
    ##   15     13.05553  0.4563152   9.991041
    ##   16     13.16973  0.4484492  10.035515
    ##   17     13.39867  0.4385930  10.155347
    ##   18     13.58537  0.4324732  10.335175
    ##   19     13.60721  0.4309564  10.336521
    ##   20     13.69992  0.4292043  10.425633
    ## 
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was ncomp = 2.
    plot(pls_model)

Question:

  • How many latent variables are optimal?

    2 variables are optimal

  • What is the corresponding resampled estimate of R2R^2R2?

    Corresponding resampled estimate is 0.529.

Using more than 2 components worsens RMSE. The shows moderate predictive strength of around 53%, which suggests that it is useful for the first few latent variables are captured properly but anymore will add noise of overfitting.

(d) Test Set Evaluation

Predict the response for the test set.

# Predict on test set
pred <- predict(pls_model, newdata = x_test)

# Compute R^2
R2(pred, y_test)
## [1] 0.2960035

Question: What is the test set estimate of R2R^2R2?

0.296

This is much weaker than the previous predictor with only around a 29.6% of variability in permeability is explained on new data.

(e) Try Other Models

Try building other models discussed in this chapter.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.5.3
## Loading required package: Matrix
## Loaded glmnet 4.1-10
set.seed(123)

# Train control
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3
)

# Grid for alpha (0 = Ridge, 1 = Lasso)
grid <- expand.grid(
  alpha = seq(0, 1, length = 5),
  lambda = seq(0.0001, 1, length = 20)
)

# Train Elastic Net (includes Ridge & Lasso)
enet_model <- train(
  x = x_train,
  y = y_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = grid,
  trControl = ctrl
)
 
enet_model
## glmnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   MAE     
##   0.00   0.00010000  11.50094  0.5550307  8.611059
##   0.00   0.05272632  11.50094  0.5550307  8.611059
##   0.00   0.10535263  11.50094  0.5550307  8.611059
##   0.00   0.15797895  11.50094  0.5550307  8.611059
##   0.00   0.21060526  11.50094  0.5550307  8.611059
##   0.00   0.26323158  11.50094  0.5550307  8.611059
##   0.00   0.31585789  11.50094  0.5550307  8.611059
##   0.00   0.36848421  11.50094  0.5550307  8.611059
##   0.00   0.42111053  11.50094  0.5550307  8.611059
##   0.00   0.47373684  11.50094  0.5550307  8.611059
##   0.00   0.52636316  11.50094  0.5550307  8.611059
##   0.00   0.57898947  11.50094  0.5550307  8.611059
##   0.00   0.63161579  11.50094  0.5550307  8.611059
##   0.00   0.68424211  11.50094  0.5550307  8.611059
##   0.00   0.73686842  11.50094  0.5550307  8.611059
##   0.00   0.78949474  11.50094  0.5550307  8.611059
##   0.00   0.84212105  11.50094  0.5550307  8.611059
##   0.00   0.89474737  11.50094  0.5550307  8.611059
##   0.00   0.94737368  11.50094  0.5550307  8.611059
##   0.00   1.00000000  11.50094  0.5550307  8.611059
##   0.25   0.00010000  12.44093  0.4697850  9.335061
##   0.25   0.05272632  12.44093  0.4697850  9.335061
##   0.25   0.10535263  12.44093  0.4697850  9.335061
##   0.25   0.15797895  12.44093  0.4697850  9.335061
##   0.25   0.21060526  12.44093  0.4697850  9.335061
##   0.25   0.26323158  12.44093  0.4697850  9.335061
##   0.25   0.31585789  12.44093  0.4697850  9.335061
##   0.25   0.36848421  12.44093  0.4697850  9.335061
##   0.25   0.42111053  12.43650  0.4702610  9.339461
##   0.25   0.47373684  12.32850  0.4760365  9.274947
##   0.25   0.52636316  12.21407  0.4826385  9.194501
##   0.25   0.57898947  12.11411  0.4886311  9.120716
##   0.25   0.63161579  12.02817  0.4940929  9.058284
##   0.25   0.68424211  11.94918  0.4994033  9.003817
##   0.25   0.73686842  11.89003  0.5035086  8.956694
##   0.25   0.78949474  11.83954  0.5068923  8.916022
##   0.25   0.84212105  11.79356  0.5100062  8.876727
##   0.25   0.89474737  11.75597  0.5126325  8.847553
##   0.25   0.94737368  11.72721  0.5147858  8.827178
##   0.25   1.00000000  11.70638  0.5164935  8.812772
##   0.50   0.00010000  12.54719  0.4669426  9.336994
##   0.50   0.05272632  12.54719  0.4669426  9.336994
##   0.50   0.10535263  12.54719  0.4669426  9.336994
##   0.50   0.15797895  12.54719  0.4669426  9.336994
##   0.50   0.21060526  12.54617  0.4670586  9.344031
##   0.50   0.26323158  12.36518  0.4763093  9.267754
##   0.50   0.31585789  12.19017  0.4860831  9.145771
##   0.50   0.36848421  12.07411  0.4934721  9.057372
##   0.50   0.42111053  11.99143  0.4982830  9.005146
##   0.50   0.47373684  11.91146  0.5036371  8.967438
##   0.50   0.52636316  11.85673  0.5073443  8.946073
##   0.50   0.57898947  11.82243  0.5098871  8.929842
##   0.50   0.63161579  11.78759  0.5126383  8.903944
##   0.50   0.68424211  11.75266  0.5153872  8.874579
##   0.50   0.73686842  11.72536  0.5176981  8.838931
##   0.50   0.78949474  11.69937  0.5197057  8.806773
##   0.50   0.84212105  11.68309  0.5208465  8.775873
##   0.50   0.89474737  11.66503  0.5223515  8.740908
##   0.50   0.94737368  11.65091  0.5235038  8.706791
##   0.50   1.00000000  11.64168  0.5242318  8.672372
##   0.75   0.00010000  12.57926  0.4665146  9.313249
##   0.75   0.05272632  12.57926  0.4665146  9.313249
##   0.75   0.10535263  12.57926  0.4665146  9.313249
##   0.75   0.15797895  12.50040  0.4697638  9.296534
##   0.75   0.21060526  12.24796  0.4834699  9.159914
##   0.75   0.26323158  12.11191  0.4912944  9.081064
##   0.75   0.31585789  11.99829  0.4986558  9.030641
##   0.75   0.36848421  11.91253  0.5039134  8.999413
##   0.75   0.42111053  11.85163  0.5084025  8.968858
##   0.75   0.47373684  11.79874  0.5125141  8.921969
##   0.75   0.52636316  11.74972  0.5165935  8.869784
##   0.75   0.57898947  11.72802  0.5183373  8.823424
##   0.75   0.63161579  11.70867  0.5201094  8.769415
##   0.75   0.68424211  11.70053  0.5206494  8.723391
##   0.75   0.73686842  11.69927  0.5207255  8.680964
##   0.75   0.78949474  11.71297  0.5193522  8.651382
##   0.75   0.84212105  11.72929  0.5177082  8.628225
##   0.75   0.89474737  11.75141  0.5155972  8.615455
##   0.75   0.94737368  11.77555  0.5135301  8.607816
##   0.75   1.00000000  11.80346  0.5115089  8.603532
##   1.00   0.00010000  12.70805  0.4628577  9.379954
##   1.00   0.05272632  12.70805  0.4628577  9.379954
##   1.00   0.10535263  12.69960  0.4632510  9.382373
##   1.00   0.15797895  12.34936  0.4814712  9.231227
##   1.00   0.21060526  12.15841  0.4921499  9.159221
##   1.00   0.26323158  11.99932  0.5017959  9.097477
##   1.00   0.31585789  11.88860  0.5084398  9.049415
##   1.00   0.36848421  11.82242  0.5128029  8.975038
##   1.00   0.42111053  11.77697  0.5161822  8.901115
##   1.00   0.47373684  11.76328  0.5173399  8.838166
##   1.00   0.52636316  11.76176  0.5169860  8.781735
##   1.00   0.57898947  11.77816  0.5147976  8.741698
##   1.00   0.63161579  11.81643  0.5108173  8.722938
##   1.00   0.68424211  11.85678  0.5068551  8.715327
##   1.00   0.73686842  11.90351  0.5027353  8.715672
##   1.00   0.78949474  11.95475  0.4987754  8.718805
##   1.00   0.84212105  11.98542  0.4965526  8.720737
##   1.00   0.89474737  11.99268  0.4960834  8.716083
##   1.00   0.94737368  11.99313  0.4962729  8.708970
##   1.00   1.00000000  11.98820  0.4971179  8.697711
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
plot(enet_model)

#Extract Best Base Model
enet_model$bestTune
##    alpha lambda
## 20     0      1
#Test Set Performance 
# Predictions
pred_enet <- predict(enet_model, newdata = x_test)

# Performance
postResample(pred_enet, y_test)
##       RMSE   Rsquared        MAE 
## 11.0203929  0.3266984  7.6511575

Question: Do any models have better predictive performance?

Penalized regression models were applied to improve upon the PLS results. Cross-validation identified Ridge regression (alpha = 0) as the best-performing model, suggesting that the response is influenced by many predictors with small, distributed effects rather than a few dominant ones. The model achieved a resampled R2R^2R2 of about 0.56 and a test set R2R^2R2 of roughly 0.33, representing a modest improvement over the PLS model (~0.30). While regularization helped reduce overfitting and improved generalization, the overall predictive performance remains limited.

(f) Recommendation

Would you recommend any of your models to replace the permeability laboratory experiment?

I would not recommend replacing the permeability laboratory experiment with the models developed in this analysis. Although Ridge regression demonstrated a modest improvement over the PLS model, the test set R2R^2R2 of approximately 0.33 indicates limited predictive accuracy and insufficient explanatory power. Such performance is inadequate for high-stakes decision-making in pharmaceutical development, where precise and reliable measurements are essential. Nevertheless, these models may still serve a valuable role as preliminary screening tools to prioritize candidate compounds, thereby improving efficiency while maintaining the necessity of experimental validation.

6.3 Chemical Manufacturing Process

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. The objective is to understand the relationship between:

(a) Load the Data

library(AppliedPredictiveModeling) 
library(caret)
data(ChemicalManufacturingProcess)

The matrix processPredictors contains 57 predictors (12 biological + 45 process variables) for 176 runs. The variable yield contains the percent yield.

(b) Handle Missing Values

A small percentage of predictor values are missing.

# Extract predictors
processPredictors <- ChemicalManufacturingProcess[, -1]  # remove yield column
yield <- ChemicalManufacturingProcess$Yield

# Check missing values
sum(is.na(processPredictors))
## [1] 106
# Apply kNN imputation
preProcValues <- preProcess(processPredictors, method = "knnImpute")

# Transform the data
processPredictors_imputed <- predict(preProcValues, processPredictors)

# Verify no missing values remain
sum(is.na(processPredictors_imputed))
## [1] 0

Task: Use an imputation method to fill in missing values (see Sect. 3.8).

The predictor dataset initially contained 106 missing values. These were addressed using k-nearest neighbors (kNN) imputation, which replaces missing entries based on similar observations in the dataset. After preprocessing, no missing values remained. This approach preserves all observations while maintaining the underlying data structure, making the dataset suitable for subsequent predictive modeling.

(c) Model Training

  • Split the data into training and test sets

  • Pre-process the data

  • Train and tune a model of your choice

    library(caret)
    library(glmnet)
    
    set.seed(123)
    
    # 1. Train/Test Split
    trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
    
    x_train <- processPredictors_imputed[trainIndex, ]
    x_test  <- processPredictors_imputed[-trainIndex, ]
    
    y_train <- yield[trainIndex]
    y_test  <- yield[-trainIndex]
    
    # 2. Pre-process (center & scale)
    preProc <- preProcess(x_train, method = c("center", "scale"))
    
    x_train <- predict(preProc, x_train)
    x_test  <- predict(preProc, x_test)
    
    # 3. Train & Tune Model (Elastic Net)
    ctrl <- trainControl(method = "cv", number = 10)
    
    model <- train(
      x = x_train,
      y = y_train,
      method = "glmnet",
      trControl = ctrl,
      tuneLength = 10
    )
    
    # 4. Extract results
    best <- model$results[
      model$results$alpha == model$bestTune$alpha &
      model$results$lambda == model$bestTune$lambda, ]
    
    # 5. Print 
    print(model$bestTune)
    ##    alpha    lambda
    ## 58   0.6 0.1821943
    print(best[, c("RMSE", "Rsquared")])
    ##       RMSE  Rsquared
    ## 58 1.13129 0.6244953

Question: What is the optimal value of the performance metric?

The optimal model produced a cross-validated RMSE of 1.131 and an R² of 0.624, with tuning parameters α = 0.6 and λ = 0.182, indicating an elastic net specification. These results suggest the model achieves a favorable trade-off between bias and variance and accounts for a meaningful proportion of the variability in yield.

(d) Test Set Evaluation

Predict the response for the test set.

Questions:

  • What is the test set performance?

  • How does it compare to the resampled training performance?

    The model produced a test set RMSE of 1.202 and an R² of 0.577. These results are slightly inferior to the cross-validated training performance (RMSE = 1.131, R² = 0.624), indicating minor overfitting. Nevertheless, the relatively small decline in performance suggests that the model maintains good generalization to unseen data.

# 1. Predict on test set
test_pred <- predict(model, newdata = x_test)

# 2. Compute performance metrics
test_RMSE <- RMSE(test_pred, y_test)
test_R2   <- R2(test_pred, y_test)

# 3. Show results
test_RMSE
## [1] 1.201836
test_R2
## [1] 0.5771784
# 4. Compare with training (CV) performance
train_best <- model$results[
  model$results$alpha == model$bestTune$alpha &
  model$results$lambda == model$bestTune$lambda, ]

train_best[, c("RMSE", "Rsquared")]
##       RMSE  Rsquared
## 58 1.13129 0.6244953

(e) Variable Importance

Questions:

varImp(model)
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                         Overall
## ManufacturingProcess32 100.0000
## ManufacturingProcess09  61.1354
## ManufacturingProcess17  47.7024
## BiologicalMaterial06    27.1593
## ManufacturingProcess06  26.5485
## ManufacturingProcess37  22.0026
## ManufacturingProcess45  18.3213
## BiologicalMaterial05    18.1548
## ManufacturingProcess13  14.2742
## ManufacturingProcess36  14.0917
## ManufacturingProcess15   9.1167
## ManufacturingProcess44   8.7566
## ManufacturingProcess34   7.9396
## ManufacturingProcess39   3.3326
## ManufacturingProcess43   3.3286
## BiologicalMaterial07     1.5281
## ManufacturingProcess04   0.1659
## ManufacturingProcess40   0.0000
## ManufacturingProcess41   0.0000
## ManufacturingProcess38   0.0000
importance <- varImp(model, scale = TRUE)
plot(importance, top = 20)

  • Which predictors are most important?

    The most important predictors are ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17, followed by BiologicalMaterial06, ManufacturingProcess06, and ManufacturingProcess37. Among these, ManufacturingProcess32 stands out with a substantially higher importance score than all other variables, indicating that it has the strongest influence on predicting yield.

  • Do biological or process predictors dominate?

    Process predictors clearly dominate the model. The majority of the top-ranked variables are manufacturing process variables, with only a few biological predictors (such as BiologicalMaterial06 and BiologicalMaterial05) appearing among the most important. This suggests that yield is driven more by manufacturing conditions than by biological material characteristics.

(f) Interpretation

Explore the relationships between top predictors and yield.

Question: How can this information help improve yield in future manufacturing runs?

The most important predictors show that certain manufacturing steps, especially variables like ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17, have a strong effect on yield. By looking at how changes in these variables affect the output, we can see which conditions lead to better results.

This information can help improve yield by focusing on controlling and adjusting these key process variables. Keeping them at levels that are associated with higher yield can make production more consistent and efficient, leading to better overall performance.