Exercises 6.2 & 6.3 from Kuhn & Johnson's Applied Predictive Modeling.
#clear the workspace
rm(list = ls())
#load required packages
library(AppliedPredictiveModeling)
library(caret)
library(elasticnet)
library(knitr)
library(ggplot2)
library(kableExtra)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
#drop predictors with near-zero variance
f <- fingerprints[, -nearZeroVar(fingerprints)]
Filtering with nearZeroVar reduced the number of predictors from 1107 to 388, a drop of 719.
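As a sanity check, these counts can be read straight off the objects defined above:
#confirm the reduction in predictor count
ncol(fingerprints) #1107 before filtering
ncol(f) #388 after filtering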
First we'll split the data:
set.seed(11)
#add perm data to f before we split
f <- cbind(data.frame(permeability),f)
#split 75/25
n <- floor(0.75 * nrow(f))
idx <- sample(seq_len(nrow(f)), size = n)
train <- f[idx, ]
test <- f[-idx, ]
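As an aside, caret's createDataPartition would produce a split stratified on the response; a minimal sketch of that alternative (idx.strat is a new name used only for illustration, and the rest of this write-up keeps the simple random split above):
#alternative: stratified split on the response
idx.strat <- createDataPartition(f$permeability, p = 0.75, list = FALSE)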
Next we fit the PLS model, tuning the number of components (up to 10) with 10-fold cross-validation:
#train the pls model
pls.model <- train(x = train[,-1],
                   y = train$permeability,
                   method = "pls",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
#output results
kable(pls.model$results)
| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 1 | 13.89503 | 0.3472600 | 10.827980 | 2.441099 | 0.2056161 | 2.068817 |
| 2 | 11.35498 | 0.5233864 | 8.301820 | 3.385000 | 0.3173944 | 1.997800 |
| 3 | 11.05793 | 0.5524960 | 8.455204 | 2.857272 | 0.2561434 | 2.338631 |
| 4 | 11.43852 | 0.5210680 | 8.948124 | 2.296126 | 0.1918366 | 1.876007 |
| 5 | 11.00379 | 0.5577330 | 8.426772 | 2.488035 | 0.1970978 | 1.704400 |
| 6 | 11.30268 | 0.5293902 | 8.458922 | 2.362089 | 0.2049334 | 1.380478 |
| 7 | 11.29453 | 0.5262470 | 8.716203 | 2.629013 | 0.2213593 | 1.810636 |
| 8 | 11.17101 | 0.5357901 | 8.479021 | 2.884467 | 0.2320330 | 2.033405 |
| 9 | 11.36656 | 0.5238562 | 8.540063 | 3.344882 | 0.2381457 | 2.414318 |
| 10 | 11.72035 | 0.5138794 | 8.789012 | 3.410256 | 0.2328869 | 2.724826 |
plot(pls.model$results$ncomp,
     pls.model$results$Rsquared,
     type = "b",
     xlab = "ncomp",
     ylab = "Rsquared")
The optimal number of components is 5, with a corresponding cross-validated \(R^2\) estimate of 0.5577.
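caret stores the selected tuning value on the fitted object, so the choice can be confirmed directly (this should echo ncomp = 5):
#confirm the selected number of components
pls.model$bestTune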
Next we evaluate the model on the held-out test set. Note that caret's predict.train always predicts with the final model at the optimal tuning (ncomp = 5 here), so no ncomp argument is needed:
#predict on the test set with the optimal ncomp
output <- predict(pls.model, newdata = test)
postResample(pred = output, obs = test$permeability)
## RMSE Rsquared MAE
## 13.3229538 0.1817181 9.6233479
And we can see that the test-set \(R^2\) (0.18) falls to roughly a third of the cross-validated estimate (0.56), a substantial drop.
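A quick observed-versus-predicted plot complements these summary numbers; a minimal sketch using the objects above, with the dashed line marking perfect prediction:
#observed vs predicted permeability on the test set
plot(test$permeability, output,
     xlab = "Observed permeability",
     ylab = "Predicted permeability")
abline(0, 1, lty = 2) #45-degree reference line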
We'll use an elastic net as a secondary choice here:
#train the model
enet.model <- train(x = train[,-1],
                    y = train$permeability,
                    method = 'enet',
                    metric = 'RMSE',
                    tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.2),
                                           .lambda = seq(0, 1, by = 0.2)),
                    trControl = trainControl(method = 'cv', number = 10),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
## fraction lambda
## 9 0.4 0.2
#perf of best params
getTrainPerf(enet.model)
## TrainRMSE TrainRsquared TrainMAE method
## 1 10.80887 0.5809875 7.902054 enet
The optimal parameters were fraction = 0.4 and lambda = 0.2, with a cross-validated RMSE of 10.809 and an R-squared of 0.581.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$permeability)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 13.1317500 |
| Rsquared | 0.2721501 |
| MAE | 9.3181206 |
In this case the drop from training to test performance is much less severe, and the out-of-sample performance of the elastic net (\(R^2\) ≈ 0.27) is better than that of the PLS model (\(R^2\) ≈ 0.18).
The elastic net model looks promising at this stage, given its ability to retain more of its predictive power out-of-sample. Further improvements could likely be made with this method, as sketched below.
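For instance, one cheap refinement would be a finer tuning grid around the chosen values; a sketch under that assumption (the grid bounds and the names fine.grid and enet.fine are illustrative, not tuned):
#refit with a finer grid near fraction = 0.4, lambda = 0.2
fine.grid <- expand.grid(.fraction = seq(0.2, 0.6, by = 0.05),
                         .lambda = seq(0.1, 0.3, by = 0.05))
enet.fine <- train(x = train[,-1],
                   y = train$permeability,
                   method = 'enet',
                   metric = 'RMSE',
                   tuneGrid = fine.grid,
                   trControl = trainControl(method = 'cv', number = 10),
                   preProcess = c('center', 'scale'))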
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
head(chem)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
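A quick dimension check against that description (it should report 176 rows and 58 columns: Yield plus the 57 predictors):
#confirm dimensions of the chem data
dim(chem)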
For the missing values I used caret::preProcess, as mentioned in the book. The predictors are fairly opaque and leave little room for intuition about the "best" imputation method, so I went with KNN.
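Before imputing, it's worth quantifying how much is actually missing; a minimal sketch:
#count missing values overall and in the worst columns
sum(is.na(chem))
head(sort(colSums(is.na(chem)), decreasing = TRUE))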
#impute using knn
chem.imp <- preProcess(chem[,2:ncol(chem)], method=c('knnImpute'))
chem <- cbind(chem$Yield,predict(chem.imp, chem[,2:ncol(chem)]))
colnames(chem)[1] <- "Yield"
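Note that when knnImpute is requested, caret also centers and scales the predictors as part of the preprocessing, which is why the predictor values shown later in this write-up are standardized rather than on their original scales.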
First we split the data into training and test sets. I've gone with a 70/30 split, which is somewhat arbitrary.
set.seed(11)
#split 70/30
n <- floor(0.70 * nrow(chem))
idx <- sample(seq_len(nrow(chem)), size = n)
train <- chem[idx, ]
test <- chem[-idx, ]
I used caret to fit an elastic-net model, tuning both lambda and fraction over a grid from 0 to 1 in steps of 0.05 (21 values each). 5-fold cross-validation was used during training.
#train the model
enet.model <- train(x = train[,-1],
                    y = train$Yield,
                    method = 'enet',
                    metric = 'RMSE',
                    tuneGrid = expand.grid(.fraction = seq(0, 1, by = 0.05),
                                           .lambda = seq(0, 1, by = 0.05)),
                    trControl = trainControl(method = 'cv', number = 5),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
plot(enet.model)
#best params
enet.model$bestTune
## fraction lambda
## 27 0.25 0.05
#perf of best params
getTrainPerf(enet.model)
## TrainRMSE TrainRsquared TrainMAE method
## 1 1.200123 0.6250487 0.966174 enet
The optimal parameters were fraction = 0.25 and lambda = 0.05, with a cross-validated RMSE of 1.200 and an R-squared of 0.625.
#apply the trained model
enet.pred <- predict(enet.model, newdata=test[,-1])
output <- postResample(pred=enet.pred, obs=test$Yield)
kable(output)
| Metric | Value |
|---|---|
| RMSE | 1.1815463 |
| Rsquared | 0.4847856 |
| MAE | 0.9376099 |
Interestingly, the test-set RMSE (1.18) is actually slightly lower than the cross-validated training RMSE (1.20), but the \(R^2\) drops from 0.63 to 0.48. Some drop-off is expected, as out-of-sample performance is rarely as good as in-sample.
#get all the coefficients
enet.model.coeff <- predict.enet(enet.model$finalModel,
                                 s = enet.model$bestTune[1, "fraction"],
                                 type = "coef",
                                 mode = "fraction")$coefficients
#drop the zeros and format
enet.model.coeff <- data.frame(sort(enet.model.coeff[enet.model.coeff != 0 ]))
colnames(enet.model.coeff) <- c("Coefficient")
#output
kable(enet.model.coeff)
| Predictor | Coefficient |
|---|---|
| ManufacturingProcess17 | -0.2378672 |
| ManufacturingProcess36 | -0.1314797 |
| ManufacturingProcess13 | -0.0490453 |
| ManufacturingProcess37 | -0.0414170 |
| ManufacturingProcess24 | -0.0029082 |
| ManufacturingProcess42 | 0.0104839 |
| ManufacturingProcess06 | 0.0215609 |
| BiologicalMaterial03 | 0.0245692 |
| ManufacturingProcess30 | 0.0254836 |
| ManufacturingProcess44 | 0.0637420 |
| ManufacturingProcess45 | 0.0727243 |
| ManufacturingProcess29 | 0.0872005 |
| ManufacturingProcess09 | 0.5552636 |
| ManufacturingProcess32 | 0.8976209 |
Manufacturing processes dominate the list, with process 32 showing the strongest positive relationship and process 17 the strongest negative one. Of the 14 non-zero coefficients, 13 belong to manufacturing processes and only 1 (BiologicalMaterial03) is biological, and the five largest coefficients in absolute value are all manufacturing-related.
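As a cross-check, caret's varImp gives an importance ranking that should broadly agree with the coefficient magnitudes; a minimal sketch:
#variable importance as a cross-check on the coefficients
plot(varImp(enet.model), top = 10)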
We'll focus on five of the strongest predictors for this part.
#extract five of the strongest predictors (named top5 to avoid masking base::exp)
top5 <- chem[,c("Yield",
                "ManufacturingProcess32",
                "ManufacturingProcess13",
                "ManufacturingProcess36",
                "ManufacturingProcess17",
                "ManufacturingProcess09")]
kable(head(top5, 5))
| Yield | ManufacturingProcess32 | ManufacturingProcess13 | ManufacturingProcess36 | ManufacturingProcess17 | ManufacturingProcess09 |
|---|---|---|---|---|---|
| 38.00 | -0.4568829 | 0.9771151 | -0.6557774 | 0.9263296 | -1.7201524 |
| 42.44 | 1.9517531 | -0.5003098 | -0.6557774 | -0.2753953 | 0.5883746 |
| 42.03 | 2.6928719 | 0.2876502 | -1.8000420 | 0.3655246 | -0.3815947 |
| 41.42 | 2.3223125 | 0.2876502 | -1.8000420 | 0.3655246 | -0.4785917 |
| 42.49 | 2.3223125 | 0.0906602 | -2.9443066 | -0.3555103 | -0.4527258 |
pairs(top5)
I've taken the five predictors above and plotted them in a pairs plot to examine their relationships. A few things stand out: