library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(tidyverse)

Question 6.2 — Permeability Modeling

This analysis explores molecular permeability data using predictive modeling techniques. The goal is to determine whether machine learning models can accurately predict permeability and potentially reduce the need for expensive laboratory testing.

data(permeability)

dim(fingerprints)
## [1]  165 1107
length(permeability)
## [1] 165

The dataset contains 1,107 molecular fingerprint predictors and a permeability response for 165 compounds. Each fingerprint is a binary indicator for the presence or absence of a particular molecular substructure.

nzv <- nearZeroVar(fingerprints)

filtered_fingerprints <- fingerprints[, -nzv]

dim(filtered_fingerprints)
## [1] 165 388

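The nearZeroVar function flagged 719 of the 1,107 fingerprints as sparse predictors with little variation across observations, leaving 388 for modeling. Such predictors carry little information, and with only 165 compounds a resampling fold can easily contain a single value for them, so removing them tends to stabilize model fitting.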
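
As a rough illustration of what this filter screens for, the base-R sketch below computes the two statistics described in the caret documentation (the ratio of the most common value to the second most common, and the percentage of distinct values) on a made-up matrix. The helper nzv_flag, the toy columns, and the way constant columns are handled are assumptions for illustration only; the cutoffs 95/5 and 10 are caret's documented defaults.

```r
# Toy stand-in for a fingerprint matrix: 100 compounds, 3 binary columns.
toy <- cbind(
  rare     = c(rep(0, 98), rep(1, 2)),   # substructure present in 2% of compounds
  balanced = rep(c(0, 1), times = 50),   # present in half of them
  constant = rep(0, 100)                 # never present
)

# Hypothetical helper mirroring the documented near-zero-variance criteria.
nzv_flag <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)            # zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common / second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as % of rows
  freq_ratio > freq_cut && pct_unique < unique_cut
}

flags <- apply(toy, 2, nzv_flag)
print(flags)
```

Here the rare and constant columns are flagged and the balanced one is kept, which is the behavior one would expect from the filter on sparse fingerprints.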

set.seed(123)

trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)

trainX <- filtered_fingerprints[trainIndex, ]
testX <- filtered_fingerprints[-trainIndex, ]

trainY <- permeability[trainIndex]
testY <- permeability[-trainIndex]

# 10-fold cross-validation; this control object is reused for every model below
ctrl <- trainControl(method = "cv", number = 10)

pls_model <- train(
  trainX,
  trainY,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 20,
  trControl = ctrl
)

pls_model
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.31894  0.3442124  10.254018
##    2     11.78898  0.4830504   8.534741
##    3     11.98818  0.4792649   9.219285
##    4     12.04349  0.4923322   9.448926
##    5     11.79823  0.5193195   9.049121
##    6     11.53275  0.5335956   8.658301
##    7     11.64053  0.5229621   8.878265
##    8     11.86459  0.5144801   9.265252
##    9     11.98385  0.5188205   9.218594
##   10     12.55634  0.4808614   9.610747
##   11     12.69674  0.4758068   9.702325
##   12     13.01534  0.4538906   9.956623
##   13     13.12637  0.4367362   9.878017
##   14     13.44865  0.4140715  10.065088
##   15     13.60135  0.4034269  10.188150
##   16     13.79361  0.3943904  10.247160
##   17     14.00756  0.3845119  10.412776
##   18     14.18113  0.3711378  10.587027
##   19     14.25674  0.3703610  10.575726
##   20     14.33121  0.3723176  10.679764
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.

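A Partial Least Squares (PLS) model was trained with 10-fold cross-validation, with predictors centered and scaled before fitting. Cross-validation was used to choose the number of latent variables: RMSE was lowest at 6 components (RMSE 11.53, R-squared 0.53), although the 2-component model came close (RMSE 11.79), and performance degraded steadily beyond 9 components.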
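
The selection step can be reproduced from the resampling table alone. The sketch below copies the cross-validated RMSE values printed above, applies the smallest-RMSE rule that caret reported using, and then, purely as an illustration, looks for the simplest model within 3% of the best. The 3% threshold and the variable names are assumptions and were not part of the original analysis; caret offers a similar idea through the selectionFunction argument of trainControl.

```r
# Cross-validated RMSE for ncomp = 1..20, copied from the caret output above.
cv_rmse <- c(13.31894, 11.78898, 11.98818, 12.04349, 11.79823,
             11.53275, 11.64053, 11.86459, 11.98385, 12.55634,
             12.69674, 13.01534, 13.12637, 13.44865, 13.60135,
             13.79361, 14.00756, 14.18113, 14.25674, 14.33121)

# Rule reported in the output: pick the smallest RMSE.
best_ncomp <- which.min(cv_rmse)

# Illustrative alternative (an assumption, not what was run): the simplest
# model whose RMSE is within 3% of the best.
pct_worse <- 100 * (cv_rmse - min(cv_rmse)) / min(cv_rmse)
simplest_ncomp <- min(which(pct_worse <= 3))

print(c(best = best_ncomp, simplest_within_3pct = simplest_ncomp))
```

With these values the smallest-RMSE rule gives 6 components, while the 3% rule would settle on 2, which suggests the cross-validation curve is fairly flat between 2 and 8 components.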

plot(pls_model)

pls_predictions <- predict(pls_model, testX)

postResample(pls_predictions, testY)
##       RMSE   Rsquared        MAE 
## 12.3486900  0.3244542  8.2881075

# Random forest on the same training split; mtry is tuned over caret's default grid
rf_model <- train(
  trainX,
  trainY,
  method = "rf",
  trControl = ctrl,
  importance = TRUE
)

rf_model
## Random Forest 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##     2   11.93122  0.5589109  9.202839
##   195   11.07286  0.5842520  7.755212
##   388   11.05443  0.5851016  7.620368
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 388.

# Evaluate the random forest on the held-out compounds
rf_predictions <- predict(rf_model, testX)

postResample(rf_predictions, testY)
##     RMSE Rsquared      MAE 
## 9.915651 0.523969 5.940383

Conclusion

Both models estimated molecular permeability with moderate accuracy. On the held-out test set the Random Forest model outperformed PLS (RMSE 9.92 vs. 12.35; R-squared 0.52 vs. 0.32), which is consistent with nonlinear relationships in the molecular fingerprint data. The test set holds only 32 compounds, however, so the size of this gap is estimated imprecisely. With roughly half of the variation in permeability explained, the models may be useful for screening compounds, but additional validation would be required before they could replace laboratory permeability experiments.
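
For reference, the three test-set metrics reported by postResample can be computed by hand. The sketch below does so in base R on made-up numbers; the obs and pred vectors are hypothetical and unrelated to the permeability data.

```r
# Hypothetical observed and predicted values, for illustration only.
obs  <- c(2, 10, 25, 40)
pred <- c(4, 8, 30, 36)

err <- pred - obs
metrics <- c(
  RMSE     = sqrt(mean(err^2)),   # root mean squared error
  Rsquared = cor(pred, obs)^2,    # squared Pearson correlation
  MAE      = mean(abs(err))       # mean absolute error
)
print(metrics)
```

Because R-squared is a squared correlation here, it does not penalize systematically biased predictions, which is one reason to read it alongside RMSE and MAE rather than on its own.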

Question 6.3 — Chemical Manufacturing Process

This analysis investigates how biological and manufacturing process variables influence pharmaceutical product yield. Predictive modeling is used to identify important predictors and evaluate whether manufacturing outcomes can be improved.

data(ChemicalManufacturingProcess)

str(ChemicalManufacturingProcess)
## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.7 14.7 14.7 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
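
# Median-impute missing predictor values.
# Note: the medians are computed from all 176 rows, before the train/test split.
preProc <- preProcess(
  ChemicalManufacturingProcess[, -1],
  method = "medianImpute"
)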

predictors_imputed <- predict(
  preProc,
  ChemicalManufacturingProcess[, -1]
)

yield <- ChemicalManufacturingProcess$Yield

# Reset the seed before the new split
set.seed(123)

trainIndex2 <- createDataPartition(yield, p = 0.8, list = FALSE)

trainX2 <- predictors_imputed[trainIndex2, ]
testX2 <- predictors_imputed[-trainIndex2, ]

trainY2 <- yield[trainIndex2]
testY2 <- yield[-trainIndex2]
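
In the code above the imputation medians were computed from all 176 rows before the data were split, so the test rows contributed slightly to the preprocessing. A stricter variant learns the medians from the training rows only and applies them to both sets. The base-R sketch below shows the idea on made-up data; the toy data frame and the impute_median helper are hypothetical. With caret, the equivalent would presumably be to call preProcess on the training predictors and then predict on each set.

```r
# Made-up data with missing values; rows 1-4 play the role of the training set.
toy <- data.frame(
  a = c(1, NA, 3, 100, 5, 6),
  b = c(NA, 2, 2, 4, 8, NA)
)
train_rows <- 1:4

# Medians are learned from the training rows only.
train_medians <- vapply(toy[train_rows, ], median, numeric(1), na.rm = TRUE)

# Hypothetical helper: fill NAs in each column with the supplied median.
impute_median <- function(df, medians) {
  for (nm in names(medians)) {
    df[[nm]][is.na(df[[nm]])] <- medians[[nm]]
  }
  df
}

train_imputed <- impute_median(toy[train_rows, ], train_medians)
test_imputed  <- impute_median(toy[-train_rows, ], train_medians)
print(test_imputed)
```

The missing test value in column b is filled with the training median rather than a median that includes the test rows.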

# Random forest for yield, reusing the 10-fold CV control defined earlier
rf_model2 <- train(
  trainX2,
  trainY2,
  method = "rf",
  trControl = ctrl,
  importance = TRUE
)

rf_model2
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.230320  0.6567950  0.9798169
##   29    1.149462  0.6445841  0.9010466
##   57    1.149977  0.6325385  0.8794988
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.

# Evaluate on the held-out rows
rf_predictions2 <- predict(rf_model2, testX2)

postResample(rf_predictions2, testY2)
##      RMSE  Rsquared       MAE 
## 1.2682253 0.5399667 0.9642597

# Variable importance from the fitted forest, scaled to 0-100
importance <- varImp(rf_model2)

importance
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess17   49.00
## BiologicalMaterial06     40.46
## ManufacturingProcess31   39.02
## BiologicalMaterial03     38.61
## BiologicalMaterial12     33.06
## ManufacturingProcess09   30.00
## BiologicalMaterial04     29.59
## BiologicalMaterial11     28.76
## ManufacturingProcess36   28.10
## ManufacturingProcess39   27.20
## BiologicalMaterial05     23.96
## ManufacturingProcess01   23.65
## ManufacturingProcess30   23.44
## ManufacturingProcess27   23.38
## ManufacturingProcess20   23.16
## BiologicalMaterial02     22.72
## ManufacturingProcess11   22.54
## ManufacturingProcess06   19.98
## ManufacturingProcess28   19.88
plot(importance)

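The variable importance analysis ranked ManufacturingProcess32 far ahead of every other predictor, with about twice the scaled importance of the runner-up, ManufacturingProcess17 (100 vs. 49). Thirteen of the top 20 predictors are manufacturing process variables, but the dataset contains 45 process predictors and only 12 biological ones, so biological variables are also well represented (7 of the 12 appear in the top 20). Because process settings are the variables that can be adjusted during production, the highly ranked process predictors are natural candidates for improving yield. Importance scores measure predictive association rather than causal effect, so any such adjustment would need to be confirmed experimentally.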
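
The split between predictor families in the top 20 can be tallied directly from the importance listing. The sketch below copies the 20 names from the output above and compares each family's count with the number of predictors available in that family (12 biological, 45 process, per the str output). Variable names such as top20 and available are introduced here for illustration.

```r
# Top-20 predictors, copied in order from the varImp output above.
top20 <- c(
  "ManufacturingProcess32", "ManufacturingProcess17", "BiologicalMaterial06",
  "ManufacturingProcess31", "BiologicalMaterial03", "BiologicalMaterial12",
  "ManufacturingProcess09", "BiologicalMaterial04", "BiologicalMaterial11",
  "ManufacturingProcess36", "ManufacturingProcess39", "BiologicalMaterial05",
  "ManufacturingProcess01", "ManufacturingProcess30", "ManufacturingProcess27",
  "ManufacturingProcess20", "BiologicalMaterial02", "ManufacturingProcess11",
  "ManufacturingProcess06", "ManufacturingProcess28"
)

# Family = predictor name with the trailing numeric index removed.
family <- sub("[0-9]+$", "", top20)
in_top20 <- vapply(split(top20, family), length, integer(1))

# Number of predictors per family in the full dataset (from the str() output).
available <- c(BiologicalMaterial = 12, ManufacturingProcess = 45)

# Percentage of each family that made it into the top 20.
share <- round(100 * in_top20[names(available)] / available, 1)
print(in_top20)
print(share)
```

Counted this way, process variables hold more of the top-20 slots, while a larger fraction of the biological predictors makes the list.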

Final Conclusion

Predictive modeling identified useful relationships in both pharmaceutical datasets. The models explained roughly half of the test-set variation in permeability (Random Forest R-squared 0.52) and in manufacturing yield (Random Forest R-squared 0.54, RMSE 1.27), which suggests they can support compound screening and process monitoring, potentially reducing costs. For permeability, Random Forest outperformed PLS on the test set, pointing to the value of nonlinear methods for that problem; for manufacturing yield only a Random Forest was fit, so no model comparison is available there.