Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(permeability) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.4.4
library(knitr)
data(permeability)
dim(permeability)
## [1] 165 1
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
388 Predictors rare left
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2020d.
## 1.0/zoneinfo/America/New_York'
#str(fingerprints)
low_freq = nearZeroVar(fingerprints)
head(low_freq,20)
## [1] 7 8 9 10 13 14 17 18 19 22 23 24 30 31 32 33 34 45 77 81
dim(fingerprints)
## [1] 165 1107
fingerprints_mod <- fingerprints[, -low_freq]
dim(fingerprints_mod)
## [1] 165 388
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Optimal latent variable = 6 R2 = 0.4134156
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
split <- createDataPartition(permeability, p = .8, list = FALSE)
x_train <- fingerprints_mod[split, ]
x_test <- fingerprints_mod[-split, ]
y_train <- permeability[split, ]
y_test <- permeability[-split, ]
plsModel <- train(x = x_train, y = y_train,
method = "pls",
metric='Rsquared',
#tuneGrid = expand.grid(ncomp = 1:20),
tuneLength = 25,
preProcess=c('center', 'scale'),
trControl = trainControl(method = "cv")
)
plsModel$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 13.31831 0.3510139 10.056137 3.048949 0.3080094 2.340990
## 2 2 12.11637 0.4681143 8.683817 3.865193 0.3427076 2.733703
## 3 3 12.27746 0.4287951 9.286901 3.381632 0.3290341 2.399297
## 4 4 12.29808 0.4056076 9.572933 3.373612 0.3158981 2.105239
## 5 5 12.45136 0.4101553 9.560444 3.518779 0.3107306 2.166172
## 6 6 12.28314 0.4176184 9.335474 3.345252 0.2868268 2.274462
## 7 7 12.29332 0.4207393 9.396583 3.363152 0.2871234 2.552239
## 8 8 12.48948 0.4278259 9.634706 3.364080 0.2616385 2.575149
## 9 9 12.43078 0.4382165 9.641899 3.647312 0.2802843 2.815822
## 10 10 12.71323 0.4283256 9.873832 3.609001 0.2715464 2.727247
## 11 11 12.94289 0.4203995 9.772981 3.615176 0.2632383 2.623715
## 12 12 13.02096 0.4153005 9.887467 3.576483 0.2647847 2.739921
## 13 13 13.13881 0.4160452 10.007913 3.911930 0.2796092 2.826050
## 14 14 13.12442 0.4173972 10.028739 3.911703 0.2704674 2.750586
## 15 15 13.18062 0.4136883 10.060093 4.040629 0.2699734 2.971478
## 16 16 13.34378 0.4112354 10.226867 4.089526 0.2651834 2.932879
## 17 17 13.43831 0.4130573 10.254919 4.312595 0.2659651 3.029609
## 18 18 13.37039 0.4156534 10.166224 3.967276 0.2531463 2.790825
## 19 19 13.47148 0.4156267 10.272169 3.948411 0.2529685 2.764084
## 20 20 13.72332 0.3979978 10.334129 3.886329 0.2536122 2.792335
## 21 21 13.97621 0.3827927 10.454817 3.821791 0.2480859 2.709723
## 22 22 14.19810 0.3647156 10.609059 3.463789 0.2346457 2.551320
## 23 23 14.39272 0.3555223 10.654922 3.335432 0.2264955 2.477234
## 24 24 14.49462 0.3536899 10.788655 3.170498 0.2177835 2.286006
## 25 25 14.55363 0.3561616 10.890812 3.256004 0.2160580 2.311935
plot(plsModel)
Predict the response for the test set. What is the test set estimate of R2?
# Predict using pls
pls_Predict <- predict(plsModel, newdata=x_test)
# Finding R squared for prediction
postResample(pred=pls_Predict, obs=y_test)
## RMSE Rsquared MAE
## 9.600764 0.566260 7.060612
Try building other models discussed in this chapter. Do any have better predictive performance?
ridgeModel <- train(x=x_train,
y=y_train,
method='ridge',
metric='Rsquared',
tuneGrid=data.frame(.lambda = seq(0, 1, by=0.1)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold06: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
ridgeModel
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 118, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 13.69443 0.4174436 10.223431
## 0.1 12.74328 0.4451309 9.698201
## 0.2 12.76773 0.4673060 9.764792
## 0.3 13.02276 0.4770641 10.002382
## 0.4 13.38561 0.4834519 10.336600
## 0.5 13.81463 0.4872222 10.687869
## 0.6 14.30204 0.4898365 11.079783
## 0.7 14.84287 0.4918793 11.528626
## 0.8 15.41394 0.4929513 11.994964
## 0.9 16.02262 0.4937894 12.504361
## 1.0 16.66100 0.4943765 13.054912
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
plot(ridgeModel)
EnetModel <- train(x=x_train,
y=y_train,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2),
.lambda = seq(0, 1, by=0.2)),
trControl=trainControl(method='cv',number = 10),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold01: lambda=0.0, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = x_train, y = y_train, method = "enet", metric
## = "Rsquared", : missing values found in aggregated results
EnetModel
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 15.33200 NaN 12.264531
## 0.0 0.2 181.61093 0.5006206 79.232433
## 0.0 0.4 377.34007 0.4562549 157.821727
## 0.0 0.6 570.25913 0.4281670 233.569898
## 0.0 0.8 765.24697 0.3914445 308.991946
## 0.0 1.0 953.08771 0.3486669 385.634422
## 0.2 0.0 15.61342 NaN 12.519571
## 0.2 0.2 11.48961 0.5133987 8.392898
## 0.2 0.4 11.70541 0.5210781 8.912043
## 0.2 0.6 12.03273 0.5165124 9.195574
## 0.2 0.8 12.31190 0.5083282 9.536378
## 0.2 1.0 12.68389 0.4932131 9.842629
## 0.4 0.0 15.61342 NaN 12.519571
## 0.4 0.2 11.65580 0.5066712 8.246826
## 0.4 0.4 12.08351 0.5240873 9.061063
## 0.4 0.6 12.44097 0.5237892 9.483765
## 0.4 0.8 12.88938 0.5162779 10.008867
## 0.4 1.0 13.30904 0.5046888 10.413628
## 0.6 0.0 15.61342 NaN 12.519571
## 0.6 0.2 11.91737 0.4970200 8.250184
## 0.6 0.4 12.66908 0.5219582 9.420102
## 0.6 0.6 13.23198 0.5204417 10.044822
## 0.6 0.8 13.75836 0.5176946 10.614714
## 0.6 1.0 14.24591 0.5086338 11.120193
## 0.8 0.0 15.61342 NaN 12.519571
## 0.8 0.2 12.21003 0.4908036 8.267310
## 0.8 0.4 13.38426 0.5192736 9.900780
## 0.8 0.6 14.22172 0.5165188 10.775442
## 0.8 0.8 14.85172 0.5169858 11.434564
## 0.8 1.0 15.40918 0.5097863 12.050619
## 1.0 0.0 15.61342 NaN 12.519571
## 1.0 0.2 12.54519 0.4862380 8.338754
## 1.0 0.4 14.23547 0.5157809 10.555123
## 1.0 0.6 15.33993 0.5132102 11.698039
## 1.0 0.8 16.09025 0.5150767 12.482319
## 1.0 1.0 16.71524 0.5098920 13.148758
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.4.
plot(EnetModel)
lassoModel <- train(x=x_train,
y=y_train,
method='lasso',
metric='Rsquared',
tuneGrid=data.frame(.fraction = seq(.01, .5, .02)),
trControl=trainControl(method='cv'),
preProcess=c('center','scale')
)
## Warning: model fit failed for Fold02: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold03: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold04: fraction=0.49 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
lassoModel
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 121, 121, 120, 119, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.01 14.45928 0.3924549 11.506603
## 0.03 13.16883 0.4321263 10.358607
## 0.05 12.53464 0.4566478 9.676124
## 0.07 12.33564 0.4484233 9.233382
## 0.09 12.24431 0.4361742 8.943501
## 0.11 12.23273 0.4229047 8.780282
## 0.13 12.28223 0.4090724 8.758722
## 0.15 12.39036 0.3980402 8.790822
## 0.17 12.47679 0.3913770 8.885504
## 0.19 12.56915 0.3868593 9.000616
## 0.21 12.59689 0.3877780 9.057625
## 0.23 12.63076 0.3900039 9.083509
## 0.25 12.67865 0.3922171 9.117568
## 0.27 12.73338 0.3940052 9.164527
## 0.29 12.82377 0.3938140 9.254206
## 0.31 12.92441 0.3934767 9.348935
## 0.33 13.02625 0.3934781 9.443282
## 0.35 13.13278 0.3932617 9.556290
## 0.37 13.23222 0.3932985 9.665364
## 0.39 13.31589 0.3929378 9.728507
## 0.41 13.40832 0.3922120 9.785678
## 0.43 13.49990 0.3909769 9.851165
## 0.45 13.59580 0.3895129 9.914293
## 0.47 13.68749 0.3874632 9.967952
## 0.49 13.77272 0.3857790 10.007907
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.05.
plot(lassoModel)
Would you recommend any of your models to replace the permeability laboratory experiment?
The PLS model seems to perform the best with R2 0f 0.62. If I had to recommend a model it would be PLS, however an R2 of .62 is not very encouraging and would continue to look for additional models that would produce an r2 of at least .70
# The PLS model seems to perform the best with R2 0f 0.62. If I had to recommend a model it would be PLS, however an R2 of .62 is not very encouring and would continue to look for additional models that would produce an r2 of at least .70
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch: (a) Start R and use these commands to load the data: > library(AppliedPredictiveModeling) > data(chemicalManufacturing) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# remove near zero variables
chem_low_freq <- nearZeroVar(ChemicalManufacturingProcess)
chemManProcess_mod <- ChemicalManufacturingProcess[, -chem_low_freq]
dim(chemManProcess_mod)
## [1] 176 57
# Checking for missing values and impute.
#impute using knn
table(is.na(chemManProcess_mod))
##
## FALSE TRUE
## 9926 106
chemManProcess_mod_Impute <- preProcess(chemManProcess_mod[,-c(1)], method=c('knnImpute'))
ChemManProcess_Final <- predict(chemManProcess_mod_Impute, ChemicalManufacturingProcess[,-c(1)])
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
cmp_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
x_train <- ChemManProcess_Final[cmp_split, ]
y_train <- ChemicalManufacturingProcess$Yield[cmp_split]
x_test <- ChemManProcess_Final[-cmp_split, ]
y_test <- ChemicalManufacturingProcess$Yield[-cmp_split]
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
EnetModel <- train(x=x_train,
y=y_train,
method='enet',
metric='Rsquared',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2),
.lambda = seq(0, 1, by=0.2)),
trControl=trainControl(method='cv',number = 10),
preProcess=c('center','scale')
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Warning in train.default(x = x_train, y = y_train, method = "enet", metric
## = "Rsquared", : missing values found in aggregated results
EnetModel
## Elasticnet
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 131, 130, 130, 129, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.0 0.0 1.807180 NaN 1.4855211
## 0.0 0.2 1.963422 0.4877692 1.1891600
## 0.0 0.4 3.491484 0.4408354 1.6554474
## 0.0 0.6 3.462514 0.4253402 1.6987124
## 0.0 0.8 3.039518 0.4125657 1.6204022
## 0.0 1.0 2.462089 0.4047948 1.4957632
## 0.2 0.0 1.807180 NaN 1.4855211
## 0.2 0.2 1.277240 0.5836335 1.0590494
## 0.2 0.4 1.157380 0.5926030 0.9463185
## 0.2 0.6 1.493290 0.5701047 1.0482312
## 0.2 0.8 1.762857 0.5330229 1.1453829
## 0.2 1.0 1.942509 0.4966184 1.2132904
## 0.4 0.0 1.807180 NaN 1.4855211
## 0.4 0.2 1.293645 0.5811742 1.0738433
## 0.4 0.4 1.172937 0.5857248 0.9485363
## 0.4 0.6 1.458645 0.5648819 1.0539227
## 0.4 0.8 1.729016 0.5383721 1.1513518
## 0.4 1.0 1.849290 0.5009173 1.2135353
## 0.6 0.0 1.807180 NaN 1.4855211
## 0.6 0.2 1.289571 0.5801150 1.0697302
## 0.6 0.4 1.186378 0.5832192 0.9538751
## 0.6 0.6 1.435633 0.5636273 1.0673757
## 0.6 0.8 1.684293 0.5400377 1.1698086
## 0.6 1.0 1.861634 0.5052551 1.2605026
## 0.8 0.0 1.807180 NaN 1.4855211
## 0.8 0.2 1.282603 0.5782920 1.0632733
## 0.8 0.4 1.198524 0.5825707 0.9604851
## 0.8 0.6 1.434967 0.5632098 1.0911774
## 0.8 0.8 1.692416 0.5420618 1.2197898
## 0.8 1.0 1.917090 0.5121017 1.3260410
## 1.0 0.0 1.807180 NaN 1.4855211
## 1.0 0.2 1.274588 0.5765192 1.0552914
## 1.0 0.4 1.213783 0.5810262 0.9682351
## 1.0 0.6 1.460624 0.5630218 1.1305573
## 1.0 0.8 1.728334 0.5483696 1.2795496
## 1.0 1.0 2.002102 0.5219743 1.4040937
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.4 and lambda = 0.2.
plot(EnetModel)
# Predict using pls
Enet_Predict <- predict(EnetModel, newdata=x_test)
# Finding R squared for prediction
postResample(pred=Enet_Predict, obs=y_test)
## RMSE Rsquared MAE
## 1.1208794 0.6463099 0.9020061
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
The process predictors ManfaturingProcess32, BiologicalMaterial06,ManfaturingProcess13 are on top of this list.
EnetImp <- varImp(EnetModel, scale = FALSE)
plot(EnetImp, top=20, scales = list(y = list(cex = 0.8)))
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
These top 3 predictors are showing very good relationship to the response. If you know your top predictors, you can continue to develop the process and upgrade the material to produce better yields.
viporder <- order(abs(EnetImp$importance),decreasing=TRUE)
topPred = rownames(EnetImp$importance)[viporder[c(1:3)]]
featurePlot(x_train[, topPred],
y_train,
plot = "scatter",
between = list(x = 1, y = 1),
type = c("g", "p", "smooth"),
layout = c(3,1),
labels = rep("", 2))