Exercises 6.2 & 6.3 from the K&J book (Applied Predictive Modeling). The RPubs version of this work can be found here, and the source/data can be found on GitHub here.
#clear the workspace
rm(list = ls())
#load required packages
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(knitr)
library(ggplot2)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
#drop everything with near zero variability
f <- fingerprints[, -nearZeroVar(fingerprints)]
Filtering with nearZeroVar reduced the number of predictors from 1,107 to 388, a decrease of 719.
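As a quick sanity check on the filtering step, the before/after predictor counts can be compared directly:
#confirm how many predictors were removed by the near-zero-variance filter
ncol(fingerprints)           #1107 predictors before filtering
ncol(f)                      #388 predictors after filtering
ncol(fingerprints) - ncol(f) #719 predictors dropped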
First we’ll split the data:
set.seed(11)
#add perm data to f before we split
f <- cbind(data.frame(permeability),f)
#split 75/25
n <- floor(0.75 * nrow(f))
idx <- sample(seq_len(nrow(f)), size = n)
train <- f[idx, ]
test <- f[-idx, ]
Next we train the PLS model:
#train the pls model
pls.model <- train(train[,-1],
train$permeability,
method = "pls",
tuneLength = 10,
trControl = trainControl(method = "cv"))
#output results
kable(pls.model$results)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 13.89503 | 0.3472600 | 10.827980 | 2.441099 | 0.2056161 | 2.068817 |
| 2 | 11.35498 | 0.5233864 | 8.301820 | 3.385000 | 0.3173944 | 1.997800 |
| 3 | 11.05793 | 0.5524960 | 8.455204 | 2.857272 | 0.2561434 | 2.338631 |
| 4 | 11.43852 | 0.5210680 | 8.948124 | 2.296126 | 0.1918366 | 1.876007 |
| 5 | 11.00379 | 0.5577330 | 8.426772 | 2.488035 | 0.1970978 | 1.704400 |
| 6 | 11.30268 | 0.5293902 | 8.458922 | 2.362089 | 0.2049334 | 1.380478 |
| 7 | 11.29453 | 0.5262470 | 8.716203 | 2.629013 | 0.2213593 | 1.810636 |
| 8 | 11.17101 | 0.5357901 | 8.479021 | 2.884467 | 0.2320330 | 2.033405 |
| 9 | 11.36656 | 0.5238562 | 8.540063 | 3.344882 | 0.2381457 | 2.414318 |
| 10 | 11.72035 | 0.5138794 | 8.789012 | 3.410256 | 0.2328869 | 2.724826 |
plot(pls.model$results$Rsquared,
xlab = "ncomp",
ylab = "Rsquared"
)
The optimal number of components is 5, with a corresponding \(R^2\) estimate of 0.557733.
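Rather than reading the optimum off the plot, it can also be pulled straight from the model object:
#caret stores the selected tuning parameter(s)
pls.model$bestTune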
Next we apply the tuned model to the test set:
#predict on the held-out test set (caret uses the optimal ncomp selected during tuning)
output <- predict(pls.model, test)
postResample(pred = output, obs = test$permeability)
## RMSE Rsquared MAE
## 13.3229538 0.1817181 9.6233479
We can see that the out-of-sample \(R^2\) (0.18) is roughly a third of the cross-validated estimate (0.56).
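To see where the test-set fit breaks down, a quick predicted-vs-observed sketch can help (using the ggplot2 package loaded above; plot.df is just an illustrative name):
#sketch: compare PLS predictions to observed permeability on the test set
plot.df <- data.frame(observed = test$permeability, predicted = output)
ggplot(plot.df, aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "PLS: predicted vs observed (test set)")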
We’ll use an elastic net as a secondary choice here:
#train the model
enet.model <- train(x=train[,-1],
y=train$permeability,
method='enet',
metric='RMSE',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2),
.lambda = seq(0, 1, by=0.2)),
trControl=trainControl(method='cv',number=10),
preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
#perf of best params
getTrainPerf(enet.model)
The optimal parameters were fraction = 0.4 and lambda = 0.2, with an RMSE of 10.808873 and an \(R^2\) of 0.5809875.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$permeability)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 13.1317500 |
| Rsquared | 0.2721501 |
| MAE | 9.3181206 |
In the above case we can see that the degradation is much less significant, and the out-of-sample performance of the elastic net is higher than that of the PLS model.
The elastic net model looks promising at this stage, given its ability to retain more of its \(R^2\) out-of-sample than the PLS model. Further improvements could likely be made with this method.
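As a sketch of one possible refinement (not run here; the grid bounds are only illustrative), the model could be re-tuned over a finer grid centred on the selected fraction and lambda:
#sketch: a finer tuning grid around the chosen fraction/lambda values
fine.grid <- expand.grid(.fraction = seq(0.2, 0.6, by = 0.05),
                         .lambda   = seq(0.0, 0.4, by = 0.05))
enet.fine <- train(x = train[,-1],
                   y = train$permeability,
                   method = 'enet',
                   metric = 'RMSE',
                   tuneGrid = fine.grid,
                   trControl = trainControl(method = 'cv', number = 10),
                   preProcess = c('center', 'scale'))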
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
head(chem)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
For this I used caret::preProcess, as mentioned in the book. The data are relatively obscure and leave little room for intuition about the “best” imputation method, so I went with KNN.
#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"
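A quick sanity check that the imputation filled every gap before splitting:
#count missing values before and after knn imputation
sum(is.na(ChemicalManufacturingProcess)) #NAs in the raw data
sum(is.na(chem))                         #should be 0 after imputation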
First we split the data into test and training sets. I’ve gone with 70/30, which is somewhat arbitrary.
set.seed(11)
#split 70/30
n <- floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]
I used caret to train an elastic-net model with a tuning grid running from 0 to 1 in steps of 0.05 (21 values) for both lambda and fraction. Five-fold cross-validation was also used as part of the training process.
#train the model
enet.model <- train(x=train[,-1],
y=train$Yield,
method='enet',
metric='RMSE',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.05),
.lambda = seq(0, 1, by=0.05)),
trControl=trainControl(method='cv',number=5),
preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
#perf of best params
getTrainPerf(enet.model)
The optimal parameters were fraction = 0.25 and lambda = 0.05, with an RMSE of 1.2001234 and an \(R^2\) of 0.6250487.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$Yield)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 1.1815463 |
| Rsquared | 0.4847856 |
| MAE | 0.9376099 |
The test-set RMSE (1.18) is actually slightly lower than the cross-validated RMSE (1.20), but the \(R^2\) drops from 0.63 to 0.48. Some drop-off is expected, as out-of-sample performance is almost never as good as in-sample.
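To make that comparison concrete, a small sketch that lines up the cross-validated and test-set metrics (using objects created above; comparison is just an illustrative name):
#sketch: cross-validated vs test-set performance side by side
cv.perf <- getTrainPerf(enet.model)
comparison <- data.frame(RMSE     = c(cv.perf$TrainRMSE, output["RMSE"]),
                         Rsquared = c(cv.perf$TrainRsquared, output["Rsquared"]),
                         row.names = c("cross-validated", "test set"))
kable(comparison)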
#get all the coefficients
enet.model.coeff <- predict.enet(enet.model$finalModel, s=enet.model$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
#drop the zeros and format
enet.model.coeff <- data.frame(sort(enet.model.coeff[enet.model.coeff != 0 ]))
colnames(enet.model.coeff) <- c("Coefficient")
#output
kable(enet.model.coeff)
| Predictor | Coefficient |
|---|---|
| ManufacturingProcess17 | -0.2378672 |
| ManufacturingProcess36 | -0.1314797 |
| ManufacturingProcess13 | -0.0490453 |
| ManufacturingProcess37 | -0.0414170 |
| ManufacturingProcess24 | -0.0029082 |
| ManufacturingProcess42 | 0.0104839 |
| ManufacturingProcess06 | 0.0215609 |
| BiologicalMaterial03 | 0.0245692 |
| ManufacturingProcess30 | 0.0254836 |
| ManufacturingProcess44 | 0.0637420 |
| ManufacturingProcess45 | 0.0727243 |
| ManufacturingProcess29 | 0.0872005 |
| ManufacturingProcess09 | 0.5552636 |
| ManufacturingProcess32 | 0.8976209 |
Manufacturing process variables dominate the list, with process 32 showing the strongest positive relationship and process 17 the strongest negative one. Of the 14 non-zero coefficients, 13 are manufacturing-related and only 1 (BiologicalMaterial03) is biological. The top 5 coefficients by magnitude are all manufacturing rather than biological predictors.
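Since the relative magnitudes are the main story, the same coefficients can also be shown as a bar chart (a sketch using ggplot2; coef.df is just an illustrative name):
#sketch: visualise the non-zero elastic net coefficients
coef.df <- data.frame(predictor   = rownames(enet.model.coeff),
                      coefficient = enet.model.coeff$Coefficient)
ggplot(coef.df, aes(x = reorder(predictor, coefficient), y = coefficient)) +
  geom_col() +
  coord_flip() +
  labs(x = "Predictor", y = "Coefficient")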
We’ll focus on the top 5 predictors for this part.
#extract the top 5
exp <- chem[,c("Yield",
"ManufacturingProcess32",
"ManufacturingProcess13",
"ManufacturingProcess36",
"ManufacturingProcess17",
"ManufacturingProcess09")]
kable(head(exp,5))
| Yield | ManufacturingProcess32 | ManufacturingProcess13 | ManufacturingProcess36 | ManufacturingProcess17 | ManufacturingProcess09 |
|---|---|---|---|---|---|
| 38.00 | -0.4568829 | 0.9771151 | -0.6557774 | 0.9263296 | -1.7201524 |
| 42.44 | 1.9517531 | -0.5003098 | -0.6557774 | -0.2753953 | 0.5883746 |
| 42.03 | 2.6928719 | 0.2876502 | -1.8000420 | 0.3655246 | -0.3815947 |
| 41.42 | 2.3223125 | 0.2876502 | -1.8000420 | 0.3655246 | -0.4785917 |
| 42.49 | 2.3223125 | 0.0906602 | -2.9443066 | -0.3555103 | -0.4527258 |
pairs(exp)
I’ve taken the top 5 predictors and plotted them as a “pairs” plot to examine the relationships. A few things stand out: