library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(tidyverse)
This analysis explores molecular permeability data using predictive modeling techniques. The goal is to determine whether machine learning models can accurately predict permeability and potentially reduce the need for expensive laboratory testing.
data(permeability)
dim(fingerprints)
## [1] 165 1107
length(permeability)
## [1] 165
The dataset contains molecular fingerprint predictors and a permeability response variable. The fingerprint variables are binary indicators representing the presence or absence of molecular substructures.
nzv <- nearZeroVar(fingerprints)
filtered_fingerprints <- fingerprints[, -nzv]
dim(filtered_fingerprints)
## [1] 165 388
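The nearZeroVar function was used to remove sparse predictors that show little variation across observations, reducing the predictor set from 1,107 to 388 fingerprints. Dropping these near-constant columns removes predictors that carry little information and avoids problems when such columns are centered and scaled, although filtering alone does not guarantee better model performance.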
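As a rough illustration of what the filter measures, the sketch below recomputes the two statistics that nearZeroVar uses (the ratio of the most common value to the second most common, and the percentage of unique values) on two made-up binary columns. The helper nzv_metrics and the toy vectors are hypothetical; the cutoffs are assumed to match caret's defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper: recomputes the statistics behind caret::nearZeroVar()
# for a single predictor. Cutoffs are assumed to match caret's defaults.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the second most frequent value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the sample size
  percent_unique <- 100 * length(counts) / length(x)
  flagged <- length(counts) == 1 ||
    (freq_ratio > freq_cut && percent_unique < unique_cut)
  list(freq_ratio = freq_ratio, percent_unique = percent_unique, flagged = flagged)
}

# Made-up fingerprint columns for 165 compounds (not taken from the real data)
sparse_bit   <- c(rep(0, 160), rep(1, 5))   # substructure present in 5 compounds
balanced_bit <- c(rep(0, 90), rep(1, 75))   # substructure present in 75 compounds

nzv_metrics(sparse_bit)$flagged    # would be removed by the filter
nzv_metrics(balanced_bit)$flagged  # would be kept
```

Both conditions must hold for a column to be flagged, which is why a balanced binary column survives even though it has only two unique values.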
set.seed(123)
trainIndex <- createDataPartition(permeability, p = 0.8, list = FALSE)
trainX <- filtered_fingerprints[trainIndex, ]
testX <- filtered_fingerprints[-trainIndex, ]
trainY <- permeability[trainIndex]
testY <- permeability[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
pls_model <- train(
trainX,
trainY,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = ctrl
)
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
A Partial Least Squares (PLS) model was trained using 10-fold cross-validation, with predictors centered and scaled before modeling. Cross-validation was used to choose the number of latent variables: resampled RMSE was lowest at six components (RMSE 11.53, R-squared 0.53), so ncomp = 6 was used for the final model.
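The selection rule can be reproduced by hand from the resampling table above. The sketch below copies the RMSE values for the first ten components from that output, picks the minimum, and then shows an alternative rule that prefers the simplest model within a tolerance of the best. The 3% tolerance is an assumption chosen for illustration, not something the analysis used.

```r
# Cross-validated RMSE for ncomp = 1..10, copied from the output above
cv_results <- data.frame(
  ncomp = 1:10,
  RMSE = c(13.31894, 11.78898, 11.98818, 12.04349, 11.79823,
           11.53275, 11.64053, 11.86459, 11.98385, 12.55634)
)

# Rule used in this analysis: smallest cross-validated RMSE
best <- cv_results$ncomp[which.min(cv_results$RMSE)]

# Alternative (assumed 3% tolerance): fewest components whose RMSE is
# within 3% of the best value
tol_pct <- 3
within_tol <- cv_results$RMSE <= min(cv_results$RMSE) * (1 + tol_pct / 100)
simplest <- min(cv_results$ncomp[within_tol])

best
simplest
```

With only 133 training samples the resampled RMSE curve is noisy, so a more parsimonious rule may be worth considering; caret exposes similar rules through the selectionFunction argument of trainControl ("oneSE" and "tolerance").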
plot(pls_model)
pls_predictions <- predict(pls_model, testX)
postResample(pls_predictions, testY)
## RMSE Rsquared MAE
## 12.3486900 0.3244542 8.2881075
rf_model <- train(
trainX,
trainY,
method = "rf",
trControl = ctrl,
importance = TRUE
)
rf_model
## Random Forest
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 121, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 11.93122 0.5589109 9.202839
## 195 11.07286 0.5842520 7.755212
## 388 11.05443 0.5851016 7.620368
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 388.
rf_predictions <- predict(rf_model, testX)
postResample(rf_predictions, testY)
## RMSE Rsquared MAE
## 9.915651 0.523969 5.940383
Both models estimated molecular permeability with moderate accuracy. On the held-out test set (32 compounds), the Random Forest model achieved an RMSE of 9.92 and an R-squared of 0.52, compared with an RMSE of 12.35 and an R-squared of 0.32 for PLS. One possible explanation is that the fingerprint data contain nonlinear relationships that a linear latent-variable model cannot capture, although a single small test split is not enough to confirm this. The models may be useful for screening compounds, but additional validation would be required before they could replace laboratory permeability experiments.
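For reference, the three test-set metrics reported by postResample can be computed directly. The sketch below uses made-up observed and predicted values (not the permeability data); note that caret reports R-squared as the squared correlation between predictions and observations.

```r
# Made-up values for illustration only
observed  <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4.5)

# Hypothetical helper mirroring the three numbers postResample() reports
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared Pearson correlation
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

metrics <- regression_metrics(predicted, observed)
metrics
```

Because R-squared here is a squared correlation, it measures how well predictions track the ordering and spread of the observations and is insensitive to systematic bias, which is why RMSE and MAE are reported alongside it.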
This analysis investigates how biological and manufacturing process variables influence pharmaceutical product yield. Predictive modeling is used to identify important predictors and evaluate whether manufacturing outcomes can be improved.
data(ChemicalManufacturingProcess)
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.7 14.7 14.7 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
preProc <- preProcess(
ChemicalManufacturingProcess[, -1],
method = c("medianImpute")
)
predictors_imputed <- predict(
preProc,
ChemicalManufacturingProcess[, -1]
)
yield <- ChemicalManufacturingProcess$Yield
set.seed(123)
trainIndex2 <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX2 <- predictors_imputed[trainIndex2, ]
testX2 <- predictors_imputed[-trainIndex2, ]
trainY2 <- yield[trainIndex2]
testY2 <- yield[-trainIndex2]
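Note that the imputation above estimates the medians from all 176 rows before the split, so the test rows contribute to the values used to fill the training data. A stricter variant estimates the medians on the training rows only and applies them to both sets. The sketch below shows the idea in base R on a made-up data frame; toy, train_rows, and impute_with are hypothetical names, and the values are not from ChemicalManufacturingProcess.

```r
# Made-up data with missing values (illustration only)
toy <- data.frame(
  a = c(1, NA, 3, 100, NA),
  b = c(NA, 2, 4, 6, 8)
)
train_rows <- c(1, 2, 3)  # assumed training indices

# Medians estimated from the training rows only
train_medians <- vapply(toy[train_rows, ], median, numeric(1), na.rm = TRUE)

# Hypothetical helper: fill missing values using precomputed medians
impute_with <- function(df, medians) {
  for (col in names(medians)) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- medians[[col]]
  }
  df
}

train_imputed <- impute_with(toy[train_rows, ], train_medians)
test_imputed  <- impute_with(toy[-train_rows, ], train_medians)
```

In this toy example the missing test value in column a is filled with the training median rather than a median influenced by the extreme test value 100. With caret, a comparable effect can likely be obtained by passing preProcess = "medianImpute" to train(), which should re-estimate the medians inside each resample.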
rf_model2 <- train(
trainX2,
trainY2,
method = "rf",
trControl = ctrl,
importance = TRUE
)
rf_model2
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 130, 130, 129, 131, 129, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.230320 0.6567950 0.9798169
## 29 1.149462 0.6445841 0.9010466
## 57 1.149977 0.6325385 0.8794988
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.
rf_predictions2 <- predict(rf_model2, testX2)
postResample(rf_predictions2, testY2)
## RMSE Rsquared MAE
## 1.2682253 0.5399667 0.9642597
importance <- varImp(rf_model2)
importance
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 49.00
## BiologicalMaterial06 40.46
## ManufacturingProcess31 39.02
## BiologicalMaterial03 38.61
## BiologicalMaterial12 33.06
## ManufacturingProcess09 30.00
## BiologicalMaterial04 29.59
## BiologicalMaterial11 28.76
## ManufacturingProcess36 28.10
## ManufacturingProcess39 27.20
## BiologicalMaterial05 23.96
## ManufacturingProcess01 23.65
## ManufacturingProcess30 23.44
## ManufacturingProcess27 23.38
## ManufacturingProcess20 23.16
## BiologicalMaterial02 22.72
## ManufacturingProcess11 22.54
## ManufacturingProcess06 19.98
## ManufacturingProcess28 19.88
plot(importance)
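The variable importance analysis ranked ManufacturingProcess32 as the
most influential predictor of yield by a wide margin (scaled importance
of 100, versus 49 for the runner-up, ManufacturingProcess17). Process
variables account for 13 of the 20 top-ranked predictors and biological
variables for 7, so both groups contribute to the model. Because process
settings are the ones that can be adjusted during manufacturing, they
are the more actionable targets for improving yield, although importance
scores reflect predictive association rather than a causal effect.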
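The importance scores above come from a random forest fitted with importance = TRUE, which is based on how much prediction error grows when a predictor's values are permuted. The sketch below illustrates that general idea on simulated data with a linear model; it is a simplified stand-in, not the out-of-bag procedure that randomForest actually runs, and all names and values are made up.

```r
set.seed(1)

# Simulated data: only x1 drives the response, x2 is pure noise
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 * sim$x1 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = sim)

# Mean squared error of a fitted model on a data set
mse <- function(model, data) mean((data$y - predict(model, data))^2)
baseline <- mse(fit, sim)

# Permutation importance: increase in MSE after shuffling one predictor
perm_imp <- sapply(c("x1", "x2"), function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - baseline
})

perm_imp
```

Shuffling x1 breaks its link to y and inflates the error, while shuffling x2 changes almost nothing. Correlated predictors can share or mask importance under this kind of measure, which is one reason to read the ranking of closely related process variables with some caution.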
Predictive modeling identified useful structure in both pharmaceutical datasets. The models show that machine learning can assist in predicting permeability and manufacturing yield, which could help reduce costs and improve operational efficiency. On the permeability data, Random Forest outperformed PLS on the test set, which points to the value of nonlinear modeling approaches for this kind of problem. For manufacturing yield only a Random Forest was fitted (test RMSE 1.27, R-squared 0.54), so no model comparison is available for that dataset.