Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
dim(fingerprints)
## [1] 165 1107
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
dim(fingerprints)
## [1] 165 388
There are 388 predictors left for modeling.
set.seed(123)
# index for training
index <- createDataPartition(permeability, p = .8, list = FALSE)
# train
train_perm <- permeability[index, ]
train_fp <- fingerprints[index, ]
# test
test_perm <- permeability[-index, ]
test_fp <- fingerprints [-index, ]
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(train_fp, train_perm, method = "pls", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
# Plotting plsTune
plot(plsTune)
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 121, 118, 119, 119, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.31894 0.3442124 10.254018
## 2 11.78898 0.4830504 8.534741
## 3 11.98818 0.4792649 9.219285
## 4 12.04349 0.4923322 9.448926
## 5 11.79823 0.5193195 9.049121
## 6 11.53275 0.5335956 8.658301
## 7 11.64053 0.5229621 8.878265
## 8 11.86459 0.5144801 9.265252
## 9 11.98385 0.5188205 9.218594
## 10 12.55634 0.4808614 9.610747
## 11 12.69674 0.4758068 9.702325
## 12 13.01534 0.4538906 9.956623
## 13 13.12637 0.4367362 9.878017
## 14 13.44865 0.4140715 10.065088
## 15 13.60135 0.4034269 10.188150
## 16 13.79361 0.3943904 10.247160
## 17 14.00756 0.3845119 10.412776
## 18 14.18113 0.3711378 10.587027
## 19 14.25674 0.3703610 10.575726
## 20 14.33121 0.3723176 10.679764
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 6.
The latent variables are 6 and the corresponding resampled estimate of \(R^{2}\) is 0.5335956.
fp_predict <- predict(plsTune, test_fp)
postResample(fp_predict, test_perm)
## RMSE Rsquared MAE
## 12.3486900 0.3244542 8.2881075
The test set estimate of \(R^{2}\) is 0.3244542.
set.seed(123)
# grid of penalties
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# tuning penalized regression model
enetTune <- train(train_fp, train_perm, method = "enet",
tuneGrid = enetGrid, trControl = ctrl, preProc = c("center", "scale"))
plot(enetTune)
enetTune
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 12.97787 0.4301959 9.877733
## 0.00 0.10 11.80785 0.4931852 9.107986
## 0.00 0.15 11.66393 0.4930871 8.897928
## 0.00 0.20 11.86820 0.4793179 9.030145
## 0.00 0.25 12.10947 0.4728912 9.253825
## 0.00 0.30 12.34485 0.4658690 9.393513
## 0.00 0.35 12.61952 0.4590835 9.612879
## 0.00 0.40 12.75663 0.4601371 9.686790
## 0.00 0.45 12.96171 0.4539654 9.774599
## 0.00 0.50 13.22572 0.4468568 9.900789
## 0.00 0.55 13.44341 0.4427435 9.994948
## 0.00 0.60 13.62322 0.4378770 10.071188
## 0.00 0.65 13.86382 0.4299241 10.153834
## 0.00 0.70 14.18080 0.4200728 10.374171
## 0.00 0.75 14.48298 0.4084205 10.465235
## 0.00 0.80 14.79949 0.3962019 10.559980
## 0.00 0.85 15.11740 0.3847540 10.634954
## 0.00 0.90 15.47863 0.3758087 10.864504
## 0.00 0.95 15.82572 0.3675399 11.039900
## 0.00 1.00 16.20635 0.3601753 11.220501
## 0.01 0.05 19.70369 0.5419408 13.658257
## 0.01 0.10 29.20414 0.5517070 18.690554
## 0.01 0.15 39.34872 0.5464403 24.503744
## 0.01 0.20 49.58867 0.5300114 30.253422
## 0.01 0.25 59.75378 0.5135458 35.825185
## 0.01 0.30 69.86658 0.5025866 41.376372
## 0.01 0.35 80.04013 0.4907761 46.959417
## 0.01 0.40 90.16964 0.4820232 52.431432
## 0.01 0.45 100.22907 0.4769870 57.896985
## 0.01 0.50 110.20374 0.4733163 63.339107
## 0.01 0.55 120.15283 0.4689652 68.748714
## 0.01 0.60 130.10820 0.4634918 74.138916
## 0.01 0.65 140.04834 0.4574078 79.532014
## 0.01 0.70 149.98967 0.4508075 84.943512
## 0.01 0.75 159.92961 0.4435334 90.354687
## 0.01 0.80 169.84077 0.4381988 95.756154
## 0.01 0.85 179.74379 0.4319106 101.148939
## 0.01 0.90 189.66445 0.4256649 106.576223
## 0.01 0.95 199.59443 0.4201282 112.076637
## 0.01 1.00 209.53137 0.4142206 117.588754
## 0.10 0.05 12.48366 0.5107258 9.539035
## 0.10 0.10 11.53534 0.5261893 8.482600
## 0.10 0.15 11.27266 0.5429020 8.204349
## 0.10 0.20 11.27554 0.5488762 8.346509
## 0.10 0.25 11.30648 0.5527622 8.491996
## 0.10 0.30 11.39070 0.5510568 8.624229
## 0.10 0.35 11.39403 0.5533536 8.686883
## 0.10 0.40 11.39420 0.5565505 8.705934
## 0.10 0.45 11.50017 0.5532039 8.805005
## 0.10 0.50 11.62477 0.5502964 8.891474
## 0.10 0.55 11.75005 0.5467109 8.971633
## 0.10 0.60 11.85638 0.5433706 9.039529
## 0.10 0.65 11.92754 0.5414808 9.064147
## 0.10 0.70 11.97002 0.5408816 9.061763
## 0.10 0.75 12.00539 0.5408888 9.064088
## 0.10 0.80 12.02698 0.5416069 9.059826
## 0.10 0.85 12.03704 0.5428213 9.068134
## 0.10 0.90 12.04706 0.5438642 9.086377
## 0.10 0.95 12.05637 0.5446054 9.098177
## 0.10 1.00 12.06264 0.5453869 9.100052
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.15 and lambda = 0.1.
enet_predict <- predict(enetTune, test_fp)
postResample(enet_predict, test_perm)
## RMSE Rsquared MAE
## 11.5871592 0.3519218 7.2484765
The test set estimate of \(R^{2}\) is 0.3519218, which is higher than PLS model. The Elastic Net model has a lower RMSE than the PLS model, indicating that, on average, its predictions are closer to the observed values. The Elastic Net model has better predictive performance than the Partial Least Squares model across all three evaluated metrics.
I would recommend the Elastic Net regression model as it produced better statistics. It had a higher R2 and lower RMSE and MAE.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield contains the
percent yield for each run.
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
missings <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missings, ChemicalManufacturingProcess)
sum(is.na(Chemical))
## [1] 0
There were 106 missing values in ChemicalManufacturingProcess. Bagged trees were used to impute the data. Bagged trees are made using all the other variables.
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(123)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_chem <- Chemical[index, ]
# test
test_chem <- Chemical[-index, ]
Tuning a model of my choice. Here I choose PLS model to process the data.
set.seed(123)
plsTune <- train(Yield ~ ., Chemical , method = "pls",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
plot(plsTune)
plsTune
## Partial Least Squares
##
## 176 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 158, 158, 159, 160, 160, 157, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.488542 0.4318901 1.179867
## 2 1.907582 0.4397944 1.203172
## 3 1.383700 0.5302843 1.055955
## 4 1.575912 0.5121870 1.119931
## 5 1.803918 0.5225095 1.176135
## 6 1.911910 0.5100661 1.195775
## 7 2.009377 0.5065711 1.224336
## 8 2.020122 0.5086911 1.234196
## 9 2.142552 0.4894267 1.281860
## 10 2.214721 0.4826316 1.301670
## 11 2.255229 0.4735616 1.329375
## 12 2.194734 0.4779426 1.311638
## 13 2.105780 0.4798980 1.292012
## 14 2.089038 0.4773860 1.288809
## 15 2.108072 0.4687723 1.301333
## 16 2.127078 0.4641491 1.303687
## 17 2.132216 0.4579375 1.306510
## 18 2.175989 0.4530179 1.311188
## 19 2.258117 0.4456591 1.332741
## 20 2.290990 0.4381405 1.344932
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The optimal PLS model uses 3 components (ncomp = 3), as it results in the smallest RMSE value among all the tested models. The \(R^2\) is 0.5302843.
set.seed(123)
larsTune <- train(Yield ~ ., Chemical , method = "lars", metric = "Rsquared",
tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))
lars_predict <- predict(larsTune, test_chem[ ,-1])
postResample(lars_predict, test_chem[ ,1])
## RMSE Rsquared MAE
## 1.0867580 0.6596484 0.8315831
Ordinary linear model had the highest \(R^2\), but it comes with consequences. Therefore, the lars method was chosen as it had the highest \(R^2\) The \(R^2\) is 0.6596484, which is higher than the training set.
varImp(larsTune)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 90.02
## BiologicalMaterial06 84.56
## ManufacturingProcess36 75.52
## ManufacturingProcess17 74.88
## BiologicalMaterial03 73.53
## ManufacturingProcess09 70.37
## BiologicalMaterial12 67.97
## BiologicalMaterial02 65.32
## ManufacturingProcess31 60.45
## ManufacturingProcess06 57.36
## ManufacturingProcess33 49.53
## BiologicalMaterial11 48.11
## BiologicalMaterial04 47.12
## BiologicalMaterial08 41.87
## ManufacturingProcess11 41.75
## BiologicalMaterial01 39.13
## ManufacturingProcess12 32.97
## ManufacturingProcess30 32.87
## BiologicalMaterial09 32.41
The 5 most important variables used in the modeling are
ManufacturingProcess32,
ManufacturingProcess13, BiologicalMaterial06,
ManufacturingProcess36, and
ManufacturingProcess17. The process predictors dominate the
list. The ratio of process to biological predictors is 11:9.
top10 <- varImp(larsTune)$importance %>%
arrange(-Overall) %>%
head(10)
Chemical %>%
select(c("Yield", row.names(top10))) %>%
cor() %>%
corrplot()
ManufacturingProcess32 has a significant positive
relationship with Yield, as demonstrated by the correlation
plot. Yield and three of the top ten variables have a
negative correlation. Given that these were the variables influencing
the Yield, this information can be helpful in subsequent
manufacturing process runs. They should enhance their biological
measures of the raw materials and production process measurements if
they hope to increase or optimize their Yield.