Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
library(caret)
x <- nearZeroVar(data.frame(fingerprints))
fp <- fingerprints[,-x]
paste("Dimensions of Fingerprints:")
## [1] "Dimensions of Fingerprints:"
dim(fingerprints)
## [1] 165 1107
paste("Dimensions of Fingerprints without low frequencies:")
## [1] "Dimensions of Fingerprints without low frequencies:"
dim(fp)
## [1] 165 388
As you can see from the dimensions, the fingerprint dataset with low frequency predictors filtered has 388 predictors remaining.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
#load PLS package
library(pls,warn.conflicts = FALSE)
#set data as a dataframe
fp <- as.data.frame(fp)
#split the data
set.seed(12)
trainingRows <- createDataPartition(permeability, p=.80, list = FALSE)
fp.train <- fp[trainingRows,]
fp.test <- fp[-trainingRows,]
p.train <- permeability[trainingRows,]
p.test <- permeability[-trainingRows,]
Prior to performing PLS, the predictors should be centered and scaled, especially if the predictors are on scales of differing magnitude.
#PLS Model
set.seed(13)
p.pls <- train(fp.train,
p.train,
method = 'pls',
metric = 'Rsquared',
tuneLength = 10,
preProcess = c('center','scale'),
trControl = trainControl(method = 'repeatedcv',repeats = 5))
plot(p.pls)
p.pls$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 12.72462 0.3651445 10.047501 2.631606 0.2234657 1.940926
## 2 2 11.26119 0.5079742 8.002020 3.187312 0.2568022 1.992533
## 3 3 11.29178 0.5127242 8.389938 2.848592 0.2341558 1.780225
## 4 4 11.75818 0.4879665 8.843052 2.689239 0.2169731 1.917375
## 5 5 11.68253 0.4996245 8.519151 2.929627 0.2188679 1.947097
## 6 6 11.64158 0.4994828 8.358427 2.899145 0.2209665 1.927925
## 7 7 11.52283 0.5071795 8.362761 2.781527 0.2079151 1.978294
## 8 8 11.65642 0.4988509 8.612291 2.619304 0.2015002 1.907011
## 9 9 11.89904 0.4866130 8.909316 2.784417 0.2140803 1.861240
## 10 10 12.16455 0.4752184 9.094156 2.775450 0.2099491 1.888054
Based on our results, ncomp3 has the lowest RMSE(11.49) and the highest rsquared value (0.502) indicating that it is the strongest model.
Predict the response for the test set. What is the test set estimate of R2?
p.pred <- predict(p.pls, newdata=fp.test)
postResample(pred = p.pred, obs = p.test)
## RMSE Rsquared MAE
## 13.8155969 0.3141046 9.8310535
Try building other models discussed in this chapter. Do any have better predictive performance? ####Ridget
ctrl <- trainControl(method = "cv", number = 10)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(fp.train,p.train,
method = "ridge",
## Fir the model over many penalty values
tuneGrid = ridgeGrid,
trControl = ctrl,
## put the predictors on the same scale
preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 19.06358 0.3197872 12.238583
## 0.007142857 15.72473 0.3306688 11.653505
## 0.014285714 14.75215 0.3600286 11.071735
## 0.021428571 152.32779 0.3869760 87.526855
## 0.028571429 13.65908 0.4008254 10.368303
## 0.035714286 13.36385 0.4139591 10.161625
## 0.042857143 13.13638 0.4250001 9.995798
## 0.050000000 12.96552 0.4339068 9.865352
## 0.057142857 12.81701 0.4422135 9.750288
## 0.064285714 12.80165 0.4434259 9.724615
## 0.071428571 12.60919 0.4550532 9.573428
## 0.078571429 12.52866 0.4607153 9.503284
## 0.085714286 12.47022 0.4652653 9.454446
## 0.092857143 12.39962 0.4702314 9.398336
## 0.100000000 12.35983 0.4738991 9.363776
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 15))
set.seed(100)
enetTune <- train(fp.train,p.train,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
enetTune
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 120, 119, 121, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.0500000 10.97653 0.5477488 8.029136
## 0.00 0.1178571 10.76896 0.5495103 7.434702
## 0.00 0.1857143 10.97972 0.5386672 7.708130
## 0.00 0.2535714 11.22744 0.5168881 8.102942
## 0.00 0.3214286 11.53964 0.4929626 8.414226
## 0.00 0.3892857 12.14795 0.4532626 8.952964
## 0.00 0.4571429 12.93934 0.4142961 9.452206
## 0.00 0.5250000 13.39868 0.3947928 9.819215
## 0.00 0.5928571 14.09072 0.3737067 10.167608
## 0.00 0.6607143 14.66681 0.3611112 10.422076
## 0.00 0.7285714 15.32796 0.3529545 10.703084
## 0.00 0.7964286 16.27133 0.3467691 11.068167
## 0.00 0.8642857 17.19271 0.3373228 11.456761
## 0.00 0.9321429 18.12846 0.3275404 11.848944
## 0.00 1.0000000 19.06358 0.3197872 12.238583
## 0.01 0.0500000 10.55477 0.5830915 7.469326
## 0.01 0.1178571 10.82296 0.5546742 7.672218
## 0.01 0.1857143 11.21445 0.5119579 8.390975
## 0.01 0.2535714 11.80771 0.4682625 8.961662
## 0.01 0.3214286 12.30230 0.4393598 9.388717
## 0.01 0.3892857 12.58326 0.4292055 9.634180
## 0.01 0.4571429 12.85809 0.4188562 9.864308
## 0.01 0.5250000 13.06542 0.4109832 10.009855
## 0.01 0.5928571 13.23831 0.4071533 10.134343
## 0.01 0.6607143 13.54514 0.3981232 10.321681
## 0.01 0.7285714 13.92872 0.3855687 10.566047
## 0.01 0.7964286 14.34889 0.3738676 10.833367
## 0.01 0.8642857 14.74801 0.3627486 11.073660
## 0.01 0.9321429 15.05710 0.3536017 11.255867
## 0.01 1.0000000 15.36628 0.3435692 11.425724
## 0.10 0.0500000 11.85352 0.5366824 8.994292
## 0.10 0.1178571 10.51394 0.5814344 7.315997
## 0.10 0.1857143 10.55873 0.5825807 7.279600
## 0.10 0.2535714 10.62470 0.5770193 7.477477
## 0.10 0.3214286 10.75754 0.5637897 7.758130
## 0.10 0.3892857 10.93363 0.5503113 7.981392
## 0.10 0.4571429 11.12765 0.5393982 8.247257
## 0.10 0.5250000 11.34912 0.5248881 8.516625
## 0.10 0.5928571 11.53072 0.5137987 8.701154
## 0.10 0.6607143 11.65658 0.5068645 8.805328
## 0.10 0.7285714 11.80452 0.4987927 8.926092
## 0.10 0.7964286 11.95940 0.4908311 9.052030
## 0.10 0.8642857 12.11341 0.4834499 9.173058
## 0.10 0.9321429 12.24253 0.4782647 9.273371
## 0.10 1.0000000 12.35983 0.4738991 9.363776
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1178571 and lambda = 0.1.
Interesting to find that the lambda of 1 has the strongest Rsquared. According to the booklambda of 0, fits the lasso model. Since the lambda of 0.1 has the best fit, the elastnic net has the best prediction.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch: ###A
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
library(DataExplorer)
plot_missing(ChemicalManufacturingProcess)
It looks like 3 manufacturing process has over 3% of the values missing.
library("caret")
library("rnn")
PP <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))
cmp <- predict(PP, ChemicalManufacturingProcess)
library(DataExplorer)
plot_missing(cmp)
###C Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
y <- ChemicalManufacturingProcess$Yield
cmp <- cmp %>% select(-Yield)
trainingrows <- createDataPartition(y, p=.80, list = FALSE)
#Traing and test rows for complete cases
x.train <- cmp[trainingrows,]
x.test <- cmp[-trainingrows,]
y.train <- y[trainingrows]
y.test <- y[-trainingrows]
#PLS Model
set.seed(13)
p.pls <- train(x.train,
y.train,
method = 'pls',
metric = 'Rsquared',
tuneLength = 10,
preProcess = c('center','scale'),
trControl = trainControl(method = 'repeatedcv',repeats = 5))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
plot(p.pls)
p.pls$results
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 1.423043 0.4394708 1.134340 0.2969004 0.1632513 0.2061634
## 2 2 1.801038 0.4841826 1.207533 1.3081317 0.2324248 0.4915311
## 3 3 1.334308 0.5648355 1.048120 0.4926768 0.1709773 0.2637704
## 4 4 1.504981 0.5416215 1.109215 0.8501903 0.2022379 0.3579844
## 5 5 1.849288 0.5100506 1.226166 1.4966719 0.2321061 0.5405090
## 6 6 1.906421 0.5147433 1.229845 1.7317057 0.2433972 0.6177219
## 7 7 2.083584 0.5062454 1.284964 2.0213285 0.2475283 0.7295149
## 8 8 2.252770 0.4982705 1.350616 2.4463451 0.2498341 0.8837983
## 9 9 2.593900 0.4831696 1.455706 3.0048306 0.2569507 1.0569759
## 10 10 2.839170 0.4699322 1.525045 3.3996646 0.2598900 1.1768309
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 15))
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
enetTune <- train(x.train,y.train,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
enetTune
## Elasticnet
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.0500000 1.194023 0.6012936 0.9635360
## 0.00 0.1178571 1.122666 0.6287469 0.8992816
## 0.00 0.1857143 1.424786 0.5362659 1.0374346
## 0.00 0.2535714 2.253200 0.5023912 1.2875214
## 0.00 0.3214286 2.982601 0.4613981 1.5048995
## 0.00 0.3892857 3.296479 0.4547221 1.6110154
## 0.00 0.4571429 3.293536 0.4583287 1.6272604
## 0.00 0.5250000 3.147376 0.5070916 1.5923946
## 0.00 0.5928571 3.306887 0.4688437 1.6594489
## 0.00 0.6607143 3.631948 0.4195279 1.7760754
## 0.00 0.7285714 3.766038 0.4085364 1.8319052
## 0.00 0.7964286 3.710676 0.4060130 1.8314146
## 0.00 0.8642857 3.728410 0.4055051 1.8494402
## 0.00 0.9321429 3.969112 0.3880442 1.9302811
## 0.00 1.0000000 4.311444 0.3761358 2.0335878
## 0.01 0.0500000 1.460847 0.5842655 1.1983025
## 0.01 0.1178571 1.205917 0.6007919 0.9736601
## 0.01 0.1857143 1.160208 0.6150736 0.9447116
## 0.01 0.2535714 1.174283 0.6084475 0.9520994
## 0.01 0.3214286 1.134544 0.6176079 0.9187663
## 0.01 0.3892857 1.363851 0.5497650 1.0047982
## 0.01 0.4571429 1.717500 0.5392663 1.1057608
## 0.01 0.5250000 2.039045 0.5212217 1.2103792
## 0.01 0.5928571 2.336098 0.5209717 1.3031113
## 0.01 0.6607143 2.682101 0.4870149 1.4203374
## 0.01 0.7285714 2.973478 0.4743905 1.5128186
## 0.01 0.7964286 3.159862 0.4688631 1.5733159
## 0.01 0.8642857 3.277601 0.4640925 1.6143278
## 0.01 0.9321429 3.374909 0.4600189 1.6483681
## 0.01 1.0000000 3.439166 0.4564505 1.6736043
## 0.10 0.0500000 1.599563 0.4888145 1.3127413
## 0.10 0.1178571 1.394000 0.5919333 1.1437660
## 0.10 0.1857143 1.251720 0.5991533 1.0183158
## 0.10 0.2535714 1.181089 0.6027306 0.9547308
## 0.10 0.3214286 1.177239 0.6034753 0.9505441
## 0.10 0.3892857 1.171457 0.6103310 0.9519020
## 0.10 0.4571429 1.163312 0.6113846 0.9492304
## 0.10 0.5250000 1.190275 0.5907541 0.9638741
## 0.10 0.5928571 1.296361 0.5561948 1.0035676
## 0.10 0.6607143 1.421959 0.5451032 1.0487662
## 0.10 0.7285714 1.550349 0.5347742 1.0999093
## 0.10 0.7964286 1.721472 0.5251099 1.1567160
## 0.10 0.8642857 1.854713 0.5186700 1.2001285
## 0.10 0.9321429 1.945102 0.5159651 1.2303014
## 0.10 1.0000000 2.006981 0.5141554 1.2535822
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1178571 and lambda = 0.
The Lasso model has the highest R squared value and the loest RMSE.
Predict the response for the test set. What is the value of the performancemetric and how does this compare with the resampled performance metric on the training set?
cmp_pred <- predict(enetTune,x.test)
cmp.eval <- data.frame(obs = y.test, pred = cmp_pred)
defaultSummary(cmp.eval)
## RMSE Rsquared MAE
## 1.1451611 0.6837186 0.8799826
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
important <- varImp(enetTune, scale = FALSE)
plot(important,top=10, scales = list(y = list(cex = 0.8)))
Manufacturing Process32 has the strongest impact on the model, while Manufacturing Process 13 is a close second. The top 3 most important predictores are process predictors.
The top 3 most important predictors are examined in the relationship with the response. In the manufacturing process, if we can control these predictors, they can make a bigger impact to yield than the rest of variables. Only concern here is that ManufacturingProcess43 has an outlier, which causes the strong relationship with the response. This outlier should be investigated, and it might affect the relationship with yield.
top <- order(abs(important$importance),decreasing=TRUE)
top3 = rownames(important$importance)[top[c(1:3)]]
featurePlot(x.train[, top3],
y.train,
plot = "scatter",
type = c("g", "p", "smooth"))