Exercises 6.2 & 6.3 from Kuhn & Johnson's Applied Predictive Modeling.
#clear the workspace
rm(list = ls())
#load required packages
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(knitr)
library(ggplot2)
library(kableExtra)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
#drop predictors with near-zero variance
f <- fingerprints[, -nearZeroVar(fingerprints)]
Filtering with nearZeroVar reduced the number of predictors from 1107 to 388, a drop of 719.
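As a sanity check, these counts can be read straight off the objects defined above:
#confirm the reduction in predictor count
ncol(fingerprints) #1107 before filtering
ncol(f) #388 after filtering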
First we'll split the data:
set.seed(11)
#add perm data to f before we split
f <- cbind(data.frame(permeability),f)
#split 75/25
n <- floor(0.75 * nrow(f))
idx <- sample(seq_len(nrow(f)), size = n)
train <- f[idx, ]
test <- f[-idx, ]
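As an aside, caret's createDataPartition would produce a split stratified on the response; a minimal sketch of that alternative (idx.strat is a new name used only for illustration, and the rest of this write-up keeps the simple random split above):
#alternative: stratified split on the response
idx.strat <- createDataPartition(f$permeability, p = 0.75, list = FALSE)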
Next we fit the PLS model, tuning the number of components (up to 10) with 10-fold cross-validation:
#train the pls model
pls.model <- train(x = train[,-1],
                   y = train$permeability,
                   method = "pls",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
#output results
kable(pls.model$results)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 13.89503 | 0.3472600 | 10.827980 | 2.441099 | 0.2056161 | 2.068817 |
| 2 | 11.35498 | 0.5233864 | 8.301820 | 3.385000 | 0.3173944 | 1.997800 |
| 3 | 11.05793 | 0.5524960 | 8.455204 | 2.857272 | 0.2561434 | 2.338631 |
| 4 | 11.43852 | 0.5210680 | 8.948124 | 2.296126 | 0.1918366 | 1.876007 |
| 5 | 11.00379 | 0.5577330 | 8.426772 | 2.488035 | 0.1970978 | 1.704400 |
| 6 | 11.30268 | 0.5293902 | 8.458922 | 2.362089 | 0.2049334 | 1.380478 |
| 7 | 11.29453 | 0.5262470 | 8.716203 | 2.629013 | 0.2213593 | 1.810636 |
| 8 | 11.17101 | 0.5357901 | 8.479021 | 2.884467 | 0.2320330 | 2.033405 |
| 9 | 11.36656 | 0.5238562 | 8.540063 | 3.344882 | 0.2381457 | 2.414318 |
| 10 | 11.72035 | 0.5138794 | 8.789012 | 3.410256 | 0.2328869 | 2.724826 |
plot(pls.model$results$ncomp,
     pls.model$results$Rsquared,
     type = "b",
     xlab = "ncomp",
     ylab = "Rsquared")
The optimal number of components is 5, with a corresponding cross-validated \(R^2\) estimate of 0.5577.
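caret stores the selected tuning value on the fitted object, so the choice can be confirmed directly (this should echo ncomp = 5):
#confirm the selected number of components
pls.model$bestTune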
Next we evaluate the model on the held-out test set. Note that caret's predict.train always predicts with the final model at the optimal tuning (ncomp = 5 here), so no ncomp argument is needed:
#predict on the test set with the optimal ncomp
output <- predict(pls.model, newdata = test)
postResample(pred = output, obs = test$permeability)
## RMSE Rsquared MAE
## 13.3229538 0.1817181 9.6233479
And we can see that the test-set \(R^2\) (0.18) falls to roughly a third of the cross-validated estimate (0.56), a substantial drop.
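A quick observed-versus-predicted plot complements these summary numbers; a minimal sketch using the objects above, with the dashed line marking perfect prediction:
#observed vs predicted permeability on the test set
plot(test$permeability, output,
     xlab = "Observed permeability",
     ylab = "Predicted permeability")
abline(0, 1, lty = 2) #45-degree reference line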
We'll use an elastic net as a secondary choice here:
#train the model
enet.model <- train(x = train[,-1],
                    y = train$permeability,
                    method = 'enet',
                    metric = 'RMSE',
                    tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.2),
                                           .lambda = seq(0, 1, by = 0.2)),
                    trControl = trainControl(method = 'cv', number = 10),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
## fraction lambda
## 9 0.4 0.2
#perf of best params
getTrainPerf(enet.model)
## TrainRMSE TrainRsquared TrainMAE method
## 1 10.80887 0.5809875 7.902054 enet
The optimal parameters were fraction = 0.4 and lambda = 0.2, with a cross-validated RMSE of 10.809 and an R-squared of 0.581.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$permeability)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 13.1317500 |
| Rsquared | 0.2721501 |
| MAE | 9.3181206 |
In this case the drop from training to test performance is much less severe, and the out-of-sample performance of the elastic net (\(R^2\) ≈ 0.27) is better than that of the PLS model (\(R^2\) ≈ 0.18).
The elastic net model looks promising at this stage, given its ability to retain more of its predictive power out-of-sample. Further improvements could likely be made with this method, as sketched below.
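For instance, one cheap refinement would be a finer tuning grid around the chosen values; a sketch under that assumption (the grid bounds and the names fine.grid and enet.fine are illustrative, not tuned):
#refit with a finer grid near fraction = 0.4, lambda = 0.2
fine.grid <- expand.grid(.fraction = seq(0.2, 0.6, by = 0.05),
                         .lambda = seq(0.1, 0.3, by = 0.05))
enet.fine <- train(x = train[,-1],
                   y = train$permeability,
                   method = 'enet',
                   metric = 'RMSE',
                   tuneGrid = fine.grid,
                   trControl = trainControl(method = 'cv', number = 10),
                   preProcess = c('center', 'scale'))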
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
head(chem)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
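A quick dimension check against that description (it should report 176 rows and 58 columns: Yield plus the 57 predictors):
#confirm dimensions of the chem data
dim(chem)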
For the missing values I used caret::preProcess, as mentioned in the book. The predictors are fairly opaque and leave little room for intuition about the "best" imputation method, so I went with KNN.
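Before imputing, it's worth quantifying how much is actually missing; a minimal sketch:
#count missing values overall and in the worst columns
sum(is.na(chem))
head(sort(colSums(is.na(chem)), decreasing = TRUE))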
#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"
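Note that when knnImpute is requested, caret also centers and scales the predictors as part of the preprocessing, which is why the predictor values shown later in this write-up are standardized rather than on their original scales.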
First we split the data into training and test sets. I've gone with a 70/30 split, which is somewhat arbitrary.
set.seed(11)
#split 70/30
n <- floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]
I used caret to fit an elastic-net model, tuning both lambda and fraction over a grid from 0 to 1 in steps of 0.05 (21 values each). 5-fold cross-validation was used during training.
#train the model
enet.model <- train(x = train[,-1],
                    y = train$Yield,
                    method = 'enet',
                    metric = 'RMSE',
                    tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.05),
                                           .lambda = seq(0, 1, by = 0.05)),
                    trControl = trainControl(method = 'cv', number = 5),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
## fraction lambda
## 27 0.25 0.05
#perf of best params
getTrainPerf(enet.model)
## TrainRMSE TrainRsquared TrainMAE method
## 1 1.200123 0.6250487 0.966174 enet
The optimal parameters were fraction = 0.25 and lambda = 0.05, with a cross-validated RMSE of 1.200 and an R-squared of 0.625.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$Yield)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 1.1815463 |
| Rsquared | 0.4847856 |
| MAE | 0.9376099 |
Interestingly, the test-set RMSE (1.18) is actually slightly lower than the cross-validated training RMSE (1.20), but the \(R^2\) drops from 0.63 to 0.48. Some drop-off is expected, as out-of-sample performance is rarely as good as in-sample.
#get all the coefficients
enet.model.coeff <- predict.enet(enet.model$finalModel,
                                 s = enet.model$bestTune[1, "fraction"],
                                 type = "coef",
                                 mode = "fraction")$coefficients
#drop the zeros and format
enet.model.coeff <- data.frame(sort(enet.model.coeff[enet.model.coeff != 0 ]))
colnames(enet.model.coeff) <- c("Coefficient")
#output
kable(enet.model.coeff)
| Predictor | Coefficient |
|---|---|
| ManufacturingProcess17 | -0.2378672 |
| ManufacturingProcess36 | -0.1314797 |
| ManufacturingProcess13 | -0.0490453 |
| ManufacturingProcess37 | -0.0414170 |
| ManufacturingProcess24 | -0.0029082 |
| ManufacturingProcess42 | 0.0104839 |
| ManufacturingProcess06 | 0.0215609 |
| BiologicalMaterial03 | 0.0245692 |
| ManufacturingProcess30 | 0.0254836 |
| ManufacturingProcess44 | 0.0637420 |
| ManufacturingProcess45 | 0.0727243 |
| ManufacturingProcess29 | 0.0872005 |
| ManufacturingProcess09 | 0.5552636 |
| ManufacturingProcess32 | 0.8976209 |
Manufacturing processes dominate the list, with process 32 showing the strongest positive relationship and process 17 the strongest negative one. Of the 14 non-zero coefficients, 13 belong to manufacturing processes and only 1 (BiologicalMaterial03) is biological, and the five largest coefficients in absolute value are all manufacturing-related.
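As a cross-check, caret's varImp gives an importance ranking that should broadly agree with the coefficient magnitudes; a minimal sketch:
#variable importance as a cross-check on the coefficients
plot(varImp(enet.model), top = 10)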
We'll focus on five of the strongest predictors for this part.
#extract five of the strongest predictors (named top5 to avoid masking base::exp)
top5 <- chem[,c("Yield",
                "ManufacturingProcess32",
                "ManufacturingProcess13",
                "ManufacturingProcess36",
                "ManufacturingProcess17",
                "ManufacturingProcess09")]
kable(head(top5, 5))
| Yield | ManufacturingProcess32 | ManufacturingProcess13 | ManufacturingProcess36 | ManufacturingProcess17 | ManufacturingProcess09 |
|---|---|---|---|---|---|
| 38.00 | -0.4568829 | 0.9771151 | -0.6557774 | 0.9263296 | -1.7201524 |
| 42.44 | 1.9517531 | -0.5003098 | -0.6557774 | -0.2753953 | 0.5883746 |
| 42.03 | 2.6928719 | 0.2876502 | -1.8000420 | 0.3655246 | -0.3815947 |
| 41.42 | 2.3223125 | 0.2876502 | -1.8000420 | 0.3655246 | -0.4785917 |
| 42.49 | 2.3223125 | 0.0906602 | -2.9443066 | -0.3555103 | -0.4527258 |
pairs(top5)
I've taken the five predictors above and plotted them in a pairs plot to examine their relationships. A few things stand out: