6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
zeroVarFingerprints<-fingerprints %>% nearZeroVar( names = T)
ourFingerprints<-fingerprints[,-nearZeroVar(fingerprints)]
ncol(ourFingerprints)
## [1] 388
set.seed(11032021)
trainIndex<-caret::createDataPartition(permeability, list=F)
preProc<-preProcess(ourFingerprints)
trainFingerprints<-predict(preProc, ourFingerprints[trainIndex,])
testFingerprints<-predict(preProc, ourFingerprints[-trainIndex,])
trainPermeability<-permeability[trainIndex]
testPermeability<-permeability[-trainIndex,]
cvControl <- trainControl("method"="cv", number=5)
plsFingerprints <- train(trainFingerprints, trainPermeability,method="pls",
trControl=cvControl)
plsFingerprints
## Partial Least Squares
##
## 84 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 66, 68, 67, 68, 67
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 11.67629 0.4237458 9.032390
## 2 10.77439 0.5084283 7.540053
## 3 10.59683 0.5290366 7.572073
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plot(plsFingerprints)
Train data Rsquared was .529
fingerprintPrediction <- predict(plsFingerprints,newdata =testFingerprints)
postResample(fingerprintPrediction,testPermeability)
## RMSE Rsquared MAE
## 13.5500390 0.3167951 9.2881724
Actual observeed Rsquared was ~ .317 which isn’t bad
cvControl <- trainControl("method"="cv", number=5)
enetFingerprints <- train(trainFingerprints, trainPermeability,method="enet",
trControl=cvControl)
enetFingerprints
## Elasticnet
##
## 84 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 66, 68, 67, 68, 67
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0e+00 0.050 23.42620 0.3732354 15.800541
## 0e+00 0.525 23.29393 0.3692813 15.215349
## 0e+00 1.000 27.75722 0.2777446 19.909774
## 1e-04 0.050 23.93159 0.5959581 16.543719
## 1e-04 0.525 213.97001 0.2036016 128.354774
## 1e-04 1.000 382.21248 0.1156696 217.640312
## 1e-01 0.050 11.39131 0.5886936 8.942639
## 1e-01 0.525 13.09809 0.4198533 9.559618
## 1e-01 1.000 15.41443 0.3108042 11.490501
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.1.
plot(enetFingerprints)
cvControl <- trainControl("method"="cv", number=5)
lassoFingerprints <- train(trainFingerprints, trainPermeability,method="lasso",
trControl=cvControl)
lassoFingerprints
## The lasso
##
## 84 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 68, 68, 68, 66, 66
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.1 10.92118 0.6297616 8.696591
## 0.5 11.99590 0.4932956 8.295701
## 0.9 14.16194 0.4211613 10.129707
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
plot(lassoFingerprints)
fingerprintPrediction <- predict(enetFingerprints,newdata =testFingerprints)
postResample(fingerprintPrediction,testPermeability)
## RMSE Rsquared MAE
## 13.5981176 0.2569714 10.3351569
fingerprintPrediction <- predict(lassoFingerprints,newdata =testFingerprints)
postResample(fingerprintPrediction,testPermeability)
## RMSE Rsquared MAE
## 12.779143 0.347982 8.950053
Lasso and enet flip their train performances with lasso doing better.
Not at this present time.
permeability%>% as.data.frame()%>% ggplot()+geom_histogram(aes(permeability))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at our response variable, one suspects that given our mean absolute error we just don’t have a good enough model. Obviously this is case dependent.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), 6.5 Computing 139 measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
imp<-mice(ChemicalManufacturingProcess)
imputedProcess<-mice::complete(imp)
We impute with the MICE package.
set.seed(11032021)
preProc<-preProcess(ourFingerprints, method="YeoJohnson")
trainIndex<-caret::createDataPartition(imputedProcess$Yield, list=F)
trainProcess<-predict(preProc,imputedProcess[trainIndex,])
testProcess<-predict(preProc,imputedProcess[-trainIndex,])
processModel <- train(Yield ~., data= trainProcess,method="ridge",
trControl=cvControl)
processModel
## Ridge Regression
##
## 88 samples
## 57 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 71, 70, 71, 71, 69
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0e+00 2.335818 0.3192472 1.832441
## 1e-04 2.210594 0.3404945 1.743630
## 1e-01 1.333795 0.6275322 1.081136
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
plot(processModel)
processPrediction <- predict(processModel,newdata =subset(testProcess,select=-c(Yield)))
postResample(processPrediction,testProcess$Yield)
## RMSE Rsquared MAE
## 2.6219141 0.2195833 1.2977042
Rsquared is .218 which is not great
varImp((processModel))
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess13 94.59
## ManufacturingProcess32 89.46
## BiologicalMaterial06 69.99
## ManufacturingProcess09 64.47
## BiologicalMaterial03 62.84
## BiologicalMaterial12 54.32
## ManufacturingProcess06 49.48
## ManufacturingProcess20 47.97
## ManufacturingProcess27 47.89
## BiologicalMaterial02 47.29
## ManufacturingProcess30 42.25
## BiologicalMaterial11 42.08
## ManufacturingProcess31 41.94
## ManufacturingProcess11 40.13
## ManufacturingProcess12 39.26
## ManufacturingProcess36 39.03
## ManufacturingProcess21 33.90
## ManufacturingProcess18 32.92
## BiologicalMaterial01 31.87
Manufacturing Process generally seems more important
plot(ManufacturingProcess17~Yield, imputedProcess)
plot(ManufacturingProcess13~Yield, imputedProcess)
plot(ManufacturingProcess32~Yield, imputedProcess)
plot(BiologicalMaterial06~Yield, imputedProcess)
Some like ManufacturingProces32 should fairly obviously be maximised. Others like ManufacturingProces17 are negatively associated. Obviously we can look at taking those processes further in the direction that might improve the results.