Exercises 6.2 & 6.3 from the K&J book. The rpubs version of this work can be found here, and source/data can be found on github here.

#clear the workspace
rm(list = ls())

#load required packages
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(knitr)
library(ggplot2)

Question 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a) Start R and use these commands to load the data:

data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
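As a quick sanity check (assuming the data have been loaded as above), the stated dimensions can be confirmed directly:

#confirm the stated dimensions
dim(fingerprints)   #165 compounds x 1,107 binary predictors
dim(permeability)   #one permeability value per compound (165 x 1)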

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

#drop everything with near zero variability 
f  <- fingerprints[, -nearZeroVar(fingerprints)]

Filtering with nearZeroVar reduced the number of predictors from 1,107 to 388, a decrease of 719.
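Those counts are easy to verify against the objects above:

#verify the before/after predictor counts
ncol(fingerprints)           #1107
ncol(f)                      #388
ncol(fingerprints) - ncol(f) #719 dropped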

c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

First we’ll split the data:

set.seed(11)

#add perm data to f before we split
f <- cbind(data.frame(permeability),f)

#split 75/25
n <-  floor(0.75 * nrow(f))
idx <- sample(seq_len(nrow(f)), size = n)
train <- f[idx, ]
test <- f[-idx, ]

Next we tune the PLS model with 10-fold cross-validation:

#train the pls model
pls.model <- train(train[,-1],
                   train$permeability,
                   method = "pls", 
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))

#output results
kable(pls.model$results)
| ncomp |     RMSE |  Rsquared |       MAE |   RMSESD | RsquaredSD |    MAESD |
|------:|---------:|----------:|----------:|---------:|-----------:|---------:|
|     1 | 13.89503 | 0.3472600 | 10.827980 | 2.441099 |  0.2056161 | 2.068817 |
|     2 | 11.35498 | 0.5233864 |  8.301820 | 3.385000 |  0.3173944 | 1.997800 |
|     3 | 11.05793 | 0.5524960 |  8.455204 | 2.857272 |  0.2561434 | 2.338631 |
|     4 | 11.43852 | 0.5210680 |  8.948124 | 2.296126 |  0.1918366 | 1.876007 |
|     5 | 11.00379 | 0.5577330 |  8.426772 | 2.488035 |  0.1970978 | 1.704400 |
|     6 | 11.30268 | 0.5293902 |  8.458922 | 2.362089 |  0.2049334 | 1.380478 |
|     7 | 11.29453 | 0.5262470 |  8.716203 | 2.629013 |  0.2213593 | 1.810636 |
|     8 | 11.17101 | 0.5357901 |  8.479021 | 2.884467 |  0.2320330 | 2.033405 |
|     9 | 11.36656 | 0.5238562 |  8.540063 | 3.344882 |  0.2381457 | 2.414318 |
|    10 | 11.72035 | 0.5138794 |  8.789012 | 3.410256 |  0.2328869 | 2.724826 |
plot(pls.model$results$Rsquared,
     xlab = "ncomp",
     ylab = "Rsquared"
     )

The optimal number of latent variables is 5, with a corresponding resampled estimate of \(R^2\) of 0.5577.
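Rather than eyeballing the plot, the optimum can also be pulled straight from the trained object:

#ncomp selected by cross-validation
pls.model$bestTune
#resampled performance at the selected setting
getTrainPerf(pls.model)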

d) Predict the response for the test set. What is the test set estimate of R2?

Next we predict on the test set (caret’s predict automatically uses the optimal ncomp found during tuning):

#predict using the tuned PLS model (drop the response column)
output <- predict(pls.model, test[,-1])
postResample(pred = output, obs = test$permeability)
##       RMSE   Rsquared        MAE 
## 13.3229538  0.1817181  9.6233479

And we can see that the test-set \(R^2\) (0.18) is only about a third of the resampled estimate (0.56).

e) Try building other models discussed in this chapter. Do any have better predictive performance?

We’ll use an elastic net as a secondary choice here:

#train the model
enet.model <-  train(x=train[,-1],
                 y=train$permeability,
                 method='enet',
                 metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.2), 
                                      .lambda = seq(0, 1, by=0.2)),
                 trControl=trainControl(method='cv',number=10),
                 preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)

#best params
enet.model$bestTune
#perf of best params
getTrainPerf(enet.model)

The optimal parameters were fraction = 0.4 and lambda = 0.2, with an RMSE of 10.808873 and an \(R^2\) of 0.5809875.

#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$permeability)
kable(output)
| Metric   |      Value |
|----------|-----------:|
| RMSE     | 13.1317500 |
| Rsquared |  0.2721501 |
| MAE      |  9.3181206 |

In the above case the degradation from training to test is much less severe, AND the out-of-sample performance of the elastic net (test \(R^2\) of 0.27) is higher than that of the PLS model (0.18).
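For a side-by-side view, the two sets of test metrics can be stacked into one table. This is just a sketch: `pls.test` is a hypothetical name that re-creates the postResample output shown in part (d), since that result wasn’t saved to a variable.

#hypothetical: pls.test holds the part (d) postResample values shown above
pls.test <- c(RMSE = 13.3229538, Rsquared = 0.1817181, MAE = 9.6233479)
#stack the PLS and elastic net test metrics for comparison
kable(rbind(PLS = pls.test, ElasticNet = output))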

f) Would you recommend any of your models to replace the permeability laboratory experiment?

The elastic net model looks promising at this stage given its apparent ability to retain more of its \(R^2\) out-of-sample. Further improvements could likely be made with this method as well.

Question 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a) Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

chem <- ChemicalManufacturingProcess

head(chem)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
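A quick dimension check (Yield plus the 57 predictors across 176 runs):

#176 manufacturing runs; Yield + 57 predictors = 58 columns
dim(chem)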

b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

For this I used caret::preProcess, as mentioned in the book. The data are relatively obscure and leave little room for intuition as to the “best” method for imputation, so I went with KNN.
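Before imputing, we can confirm that only a small fraction of cells are actually missing (a quick check, not from the book):

#count and proportion of missing cells
sum(is.na(chem))    #number of missing cells
mean(is.na(chem))   #proportion of all cells that are missing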

#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"

c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First we split the data into test and training sets. I’ve gone with 70/30, which is somewhat arbitrary.

set.seed(11)

#split 70/30
n <-  floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]

I used caret to tune an elastic-net model over a grid from 0 to 1 in steps of 0.05 (21 values) for both lambda and fraction. 5-fold cross-validation was used as part of the training process.

#train the model
enet.model <-  train(x=train[,-1],
                 y=train$Yield,
                 method='enet',
                 metric='RMSE',
                 tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.05), 
                                      .lambda = seq(0, 1, by=0.05)),
                 trControl=trainControl(method='cv',number=5),
                 preProcess=c('center','scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)

#best params
enet.model$bestTune
#perf of best params
getTrainPerf(enet.model)

The optimal parameters were fraction = 0.25 and lambda = 0.05, with an RMSE of 1.2001234 and an \(R^2\) of 0.6250487.

d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$Yield)
kable(output)
| Metric   |     Value |
|----------|----------:|
| RMSE     | 1.1815463 |
| Rsquared | 0.4847856 |
| MAE      | 0.9376099 |

Interestingly, the test-set RMSE (1.18) is actually slightly lower than the resampled training RMSE (1.20), though the \(R^2\) drops from 0.63 to 0.48. Some degradation is expected, as out-of-sample performance is almost never as good as in-sample.
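To make the comparison explicit, the resampled and test metrics can be lined up side by side (a sketch using objects already in the workspace):

#resampled (training) vs held-out (test) performance
train.perf <- getTrainPerf(enet.model)
kable(data.frame(RMSE = c(train.perf$TrainRMSE, output["RMSE"]),
                 Rsquared = c(train.perf$TrainRsquared, output["Rsquared"]),
                 row.names = c("Train (resampled)", "Test")))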

e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

#get all the coefficients at the chosen fraction
enet.model.coeff <- predict.enet(enet.model$finalModel,
                                 s = enet.model$bestTune[1, "fraction"],
                                 type = "coef",
                                 mode = "fraction")$coefficients

#drop the zeros and format
enet.model.coeff <- data.frame(sort(enet.model.coeff[enet.model.coeff != 0 ]))
colnames(enet.model.coeff) <- c("Coefficient")

#output
kable(enet.model.coeff)
| Predictor              | Coefficient |
|------------------------|------------:|
| ManufacturingProcess17 |  -0.2378672 |
| ManufacturingProcess36 |  -0.1314797 |
| ManufacturingProcess13 |  -0.0490453 |
| ManufacturingProcess37 |  -0.0414170 |
| ManufacturingProcess24 |  -0.0029082 |
| ManufacturingProcess42 |   0.0104839 |
| ManufacturingProcess06 |   0.0215609 |
| BiologicalMaterial03   |   0.0245692 |
| ManufacturingProcess30 |   0.0254836 |
| ManufacturingProcess44 |   0.0637420 |
| ManufacturingProcess45 |   0.0727243 |
| ManufacturingProcess29 |   0.0872005 |
| ManufacturingProcess09 |   0.5552636 |
| ManufacturingProcess32 |   0.8976209 |

It appears as though manufacturing processes dominate the list, with process 32 (strongly positive) and process 17 (the largest negative) having the biggest coefficients. Of the 14 non-zero coefficients, 10 are manufacturing-related and 4 are biological, and the top 5 by magnitude are all manufacturing processes.
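The split can also be tallied programmatically from the coefficient table above:

#count non-zero coefficients by predictor type
table(ifelse(grepl("^Manufacturing", rownames(enet.model.coeff)),
             "Process", "Biological"))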

f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

We’ll focus on the top 5 predictors for this part.

#extract the top 5
exp <- chem[,c("Yield",
               "ManufacturingProcess32",
               "ManufacturingProcess13",
               "ManufacturingProcess36",
               "ManufacturingProcess17",
               "ManufacturingProcess09")]


kable(head(exp,5))
| Yield | ManufacturingProcess32 | ManufacturingProcess13 | ManufacturingProcess36 | ManufacturingProcess17 | ManufacturingProcess09 |
|------:|-----------------------:|-----------------------:|-----------------------:|-----------------------:|-----------------------:|
| 38.00 |             -0.4568829 |              0.9771151 |             -0.6557774 |              0.9263296 |             -1.7201524 |
| 42.44 |              1.9517531 |             -0.5003098 |             -0.6557774 |             -0.2753953 |              0.5883746 |
| 42.03 |              2.6928719 |              0.2876502 |             -1.8000420 |              0.3655246 |             -0.3815947 |
| 41.42 |              2.3223125 |              0.2876502 |             -1.8000420 |              0.3655246 |             -0.4785917 |
| 42.49 |              2.3223125 |              0.0906602 |             -2.9443066 |             -0.3555103 |             -0.4527258 |
pairs(exp)

I’ve taken the top 5 predictors and plotted them as a “pairs” plot in order to examine the relationships. A few observations (the correlations are also computed after this list):

  • Ultimately, our “real world” goal here is to enhance yield as opposed to just building a good model to predict yield. In that regard, we may want to reduce exposure to, or eliminate, processes with a negative correlation to yield (processes 13 and 17… likely 36 also).
  • At a minimum, 13 and 17 appear to be correlated with one another. Perhaps we could drop one of them from the manufacturing process entirely.
  • Process 32, and to a lesser extent process 9, appear to have a positive correlation with Yield. I’d want to amplify my exposure to those processes to the extent possible.
  • Generally, I’d like to figure out WHY the above relationships exist and build an intuitive understanding of how to make improvements.
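The correlations behind those observations can be checked numerically as a supplement to the pairs plot:

#correlation of each top predictor with Yield
round(cor(exp)["Yield", -1], 3)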