Assignment: Exercises 6.2 and 6.3 from the KJ textbook

library(caTools)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
## 
##     R2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

  1. Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

  1. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

fingerprints originally has 1,107 predictors, and applying nearZeroVar leaves us with 388 predictors. We also convert the fingerprints matrix into a data frame and append the permeability response variable to it.

ncol(fingerprints)
## [1] 1107
f2 <- as.data.frame(fingerprints[, -nearZeroVar(fingerprints)])
ncol(f2)
## [1] 388
f2$permeability <- permeability
  1. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

We do an 80/20 split on our dataset for training and testing of our model. The resampled estimate of R^2 for the optimal model is 0.385.

set.seed(101) 
sample2 = sample.split(f2$X1, SplitRatio = .8)
train2 = subset(f2, sample2 == TRUE)
test2  = subset(f2, sample2 == FALSE)

train2.X = subset(train2, select = -permeability)
train2.y = as.double(train2[,389])
set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)

pls = train(train2.X, train2.y,
  method = "pls",
  trControl = cv_5,
  tuneLength = 20,
  preProc = c("center", "scale")
)

pls
## Partial Least Squares 
## 
## 132 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 106, 106, 105, 106, 105 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.11285  0.3023623  10.202646
##    2     12.34325  0.3857531   8.782778
##    3     12.87472  0.3257454   9.762851
##    4     13.46234  0.2861761  10.196898
##    5     13.15464  0.3269827   9.863930
##    6     12.61933  0.3674958   9.483667
##    7     12.38542  0.3880415   9.574450
##    8     12.38560  0.3918038   9.452560
##    9     13.26281  0.3313283  10.222422
##   10     13.80736  0.2930231  10.623743
##   11     13.89955  0.2912486  10.633719
##   12     14.14672  0.2775490  10.710961
##   13     14.78525  0.2409274  11.125034
##   14     14.89994  0.2336614  11.178475
##   15     15.46838  0.1951437  11.555663
##   16     15.89752  0.1847873  11.934445
##   17     16.21158  0.1724270  12.238368
##   18     16.40462  0.1655988  12.336353
##   19     16.54381  0.1525942  12.373425
##   20     16.75260  0.1498705  12.464773
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
  1. Predict the response for the test set. What is the test set estimate of R2?

The test set estimate of R^2 is 0.676 which is more than the 0.38 that we got from training the model.

test2.y <- as.double(test2[,389])
results2 <- predict(pls, test2)
postResample(results2, test2.y)
##       RMSE   Rsquared        MAE 
## 11.6102518  0.6761989  8.6112739
  1. Try building other models discussed in this chapter. Do any have better predictive performance?

We also attempt an elastic net model and tuning it resulted in a model that was about equal parts Ridge and Lasso with an R^2 of 0.436 instead of the 0.38 that we got for the training of the PLS. The R^2 of our test set is slightly lower than what we saw with PLS at 0.611 versus 0.676.

fenet = train(
  permeability~ ., data = train2,
  method = "glmnet",
  trControl = cv_5
)

fenet
## glmnet 
## 
## 132 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 106, 107, 105, 105, 105 
## Resampling results across tuning parameters:
## 
##   alpha  lambda     RMSE      Rsquared   MAE      
##   0.10   0.5675973  12.03826  0.4014247   8.875618
##   0.10   1.7949004  11.58749  0.4232663   8.490708
##   0.10   5.6759734  11.45569  0.4372767   8.023007
##   0.55   0.5675973  12.04758  0.3842069   8.992382
##   0.55   1.7949004  11.42972  0.4356371   7.962093
##   0.55   5.6759734  11.90777  0.4390952   8.632116
##   1.00   0.5675973  11.69898  0.4088869   8.557344
##   1.00   1.7949004  11.43925  0.4393444   7.900447
##   1.00   5.6759734  13.09008  0.4259580  10.027339
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda = 1.7949.
results3 <- predict(fenet, test2)
postResample(results3, test2.y)
##       RMSE   Rsquared        MAE 
## 12.0358969  0.6105601  8.6806142
  1. Would you recommend any of your models to replace the permeability laboratory experiment?

I would recommend either of these models but there would need to be a deeper dive and evaluation to decide which would be a better fit.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  1. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

We use KNN imputation which imputes a missing value with the average weighted value of observations near/similar to it. We perform this imputation for missing values in all variables except for the response variable, yield.

cmp <- knnImputation(ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "yield"])
anyNA(cmp)
## [1] FALSE
  1. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we split our data, using 80% of it to train the model and holding out the rest for our test set.

set.seed(101) 
sample = sample.split(cmp$BiologicalMaterial01, SplitRatio = .8)
train = subset(cmp, sample == TRUE)
test  = subset(cmp, sample == FALSE)

Next, we use cross validation with 5 folds and tune an Elastic Net model on the training set. Alpha controls the mix between the ridge and lasso penalties and lambda controls the amount of penalization. The optimal value of the performance metric is the minimized RMSE value (1.24) and R^2 of 0.59 when alpha = 1 and lamda = 0.22. Since alpha = 1, this optimal model is actually a lasso regression.

set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)

elnet = train(
  Yield ~ ., data = train,
  method = "glmnet",
  trControl = cv_5
)

elnet
## glmnet 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 112, 112, 112, 112, 112 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE      Rsquared   MAE     
##   0.10   0.002214191  6.090344  0.3463700  2.029516
##   0.10   0.022141914  4.294448  0.4891488  1.571158
##   0.10   0.221419144  2.388345  0.5062283  1.174329
##   0.55   0.002214191  5.253761  0.3646390  1.846019
##   0.55   0.022141914  4.013140  0.4989420  1.483398
##   0.55   0.221419144  1.492739  0.5313145  1.045691
##   1.00   0.002214191  4.829122  0.3904685  1.745341
##   1.00   0.022141914  3.872670  0.5016430  1.446879
##   1.00   0.221419144  1.243313  0.5961836  1.015103
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2214191.
  1. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The RMSE for the test set is 1.319 which is slightly higher, but not by much, when compared to the 1.24 RMSE that we have with the training set. The R^2 is 0.397 which is less than the 0.59 that we got from training the model.

test.y <- test[,1]
results <- predict(elnet, s = 0.22, test)
postResample(results, test.y)
##     RMSE Rsquared      MAE 
## 1.319225 0.397440 1.047190
  1. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

There are 3 biological and 7 process predictors in our model. The strongest predictors are manufacturing processes, specifically 9, 17 and 32.

coef(elnet$finalModel, elnet$finalModel$lambdaOpt)
## 58 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)             1.846801e+01
## BiologicalMaterial01    .           
## BiologicalMaterial02    1.389335e-03
## BiologicalMaterial03    .           
## BiologicalMaterial04    .           
## BiologicalMaterial05    2.122102e-03
## BiologicalMaterial06    .           
## BiologicalMaterial07    .           
## BiologicalMaterial08    4.478227e-02
## BiologicalMaterial09    .           
## BiologicalMaterial10    .           
## BiologicalMaterial11    .           
## BiologicalMaterial12    .           
## ManufacturingProcess01  .           
## ManufacturingProcess02  .           
## ManufacturingProcess03  .           
## ManufacturingProcess04  .           
## ManufacturingProcess05  .           
## ManufacturingProcess06  1.960422e-02
## ManufacturingProcess07  .           
## ManufacturingProcess08  .           
## ManufacturingProcess09  3.001602e-01
## ManufacturingProcess10  .           
## ManufacturingProcess11  .           
## ManufacturingProcess12  .           
## ManufacturingProcess13 -9.270081e-02
## ManufacturingProcess14  .           
## ManufacturingProcess15  .           
## ManufacturingProcess16  .           
## ManufacturingProcess17 -2.200624e-01
## ManufacturingProcess18  .           
## ManufacturingProcess19  .           
## ManufacturingProcess20  .           
## ManufacturingProcess21  .           
## ManufacturingProcess22  .           
## ManufacturingProcess23  .           
## ManufacturingProcess24 -3.283492e-03
## ManufacturingProcess25  .           
## ManufacturingProcess26  .           
## ManufacturingProcess27  .           
## ManufacturingProcess28  .           
## ManufacturingProcess29  .           
## ManufacturingProcess30  .           
## ManufacturingProcess31  .           
## ManufacturingProcess32  1.160302e-01
## ManufacturingProcess33  .           
## ManufacturingProcess34  .           
## ManufacturingProcess35  .           
## ManufacturingProcess36 -2.307180e+02
## ManufacturingProcess37  .           
## ManufacturingProcess38  .           
## ManufacturingProcess39  .           
## ManufacturingProcess40  .           
## ManufacturingProcess41  .           
## ManufacturingProcess42  .           
## ManufacturingProcess43  .           
## ManufacturingProcess44  .           
## ManufacturingProcess45  .
  1. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

Like we previously mentioned, the strongest predictors for yield are manufacturing processes 9, 17 and 32, but process 17 has a negative effect on yield. It would help to implement more processes similar to 9 and 32 in the production of this pharmaceutical product. Some biological materials also had relatively strong effects on the yield, but those predictors cannot be changed.