# Import required R libraries
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(pls)
library(elasticnet)
library(corrplot)

Exercise 6.2

Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

Section a

Start R and use these commands to load the data:

# Note: data(permeability) loads both the fingerprints and permeability matrices
data(permeability)

perm_data <- as.data.frame(permeability)
fp_data <- as.data.frame(fingerprints)

#str(perm_data)
#str(fp_data)

#summary(fp_data)
#head(fp_data)
dim(fp_data)
## [1]  165 1107
#summary(perm_data)
#head(perm_data)
dim(perm_data)
## [1] 165   1

The fingerprints data set has 165 observations and 1,107 features. The permeability data set has the matching 165 observations and a single column, the permeability response.

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

Section b

The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

nzv_cols <- nearZeroVar(fp_data)
length(nzv_cols)
## [1] 719
# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) fp_data <- fp_data[, -nzv_cols]

dim(fp_data)
## [1] 165 388
# Convert the binary 0/1 fingerprint predictors to factors; caret's formula
# interface re-expands them to dummy variables during training
fp_data <- fp_data %>%
  mutate_if(is.numeric, as.factor)

# Add the outcome variable to the predictors dataset
fp_data$Permeability <- perm_data$permeability

dim(fp_data)
## [1] 165 389
#str(fp_data)

Answer: 719 features have near-zero variance, leaving 388 predictors for modeling.
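To see why individual predictors were flagged, nearZeroVar can also return its screening metrics rather than column indices. A minimal sketch, run against the raw fingerprints matrix since fp_data has already been filtered at this point (nzv_metrics is my name):

nzv_metrics <- nearZeroVar(as.data.frame(fingerprints), saveMetrics = TRUE)

# freqRatio and percentUnique are the metrics behind the nzv flag
head(nzv_metrics[nzv_metrics$nzv, ])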

Section c

Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?

# From textbook: Prior to performing PLS, the predictors should be centered and scaled,
# PLS has one tuning parameter: the number of components to retain.

# Set the random number seed so we can reproduce the results
set.seed(8675309)

#summary(fp_data)

# Use 80% for training
trainingRows <- createDataPartition(fp_data$Permeability, p = .80, list=FALSE)

# Training set
training_data <- fp_data[trainingRows, ]

# Test set
test_data <- fp_data[-trainingRows, ]

# Initial model following example from book
plsFit <- plsr(Permeability ~ ., data=training_data)
plsFit
## Partial least squares regression , fitted with the kernel algorithm.
## Call:
## plsr(formula = Permeability ~ ., data = training_data)

Admittedly, this initial model was not helpful: plsr with default arguments prints only the call and fitting method, with no performance estimates to compare. On to tuning the model.
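As a hedged aside, one way to get more out of the raw plsr fit is its built-in cross-validation; plsFit_cv and the 20-component cap below are my choices, not part of the book example:

# Refit with internal cross-validation and plot the estimated RMSEP
# by number of components (validationplot comes from the pls package)
plsFit_cv <- plsr(Permeability ~ ., data = training_data,
                  ncomp = 20, validation = "CV")
validationplot(plsFit_cv, val.type = "RMSEP")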

# Training pls model based on book example
ctrl <- trainControl(method = "cv", number = 10)

# NOTE: without a metric argument, train() defaults to RMSE; here we select on Rsquared
plsTune <- train(Permeability ~ .,
                data = training_data,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 20,
                trControl = ctrl,
                preProc = c("center", "scale")) 

plsTune
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 121, 120, 121, 118, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.06698  0.3799681  10.014244
##    2     11.62076  0.4842259   8.219313
##    3     11.42978  0.4907288   8.779714
##    4     11.33345  0.4947285   8.789783
##    5     11.22369  0.5277212   8.859795
##    6     10.96079  0.5457347   8.515748
##    7     10.91604  0.5481675   8.483617
##    8     10.74673  0.5674378   8.309220
##    9     10.91521  0.5665909   8.379533
##   10     11.04581  0.5637790   8.443369
##   11     11.32753  0.5485861   8.561184
##   12     11.29806  0.5466893   8.594283
##   13     11.63459  0.5235809   8.912699
##   14     11.80198  0.5108357   8.944397
##   15     12.19740  0.4911028   9.196299
##   16     12.32189  0.4840448   9.255872
##   17     12.50701  0.4725653   9.416088
##   18     12.81096  0.4536203   9.577273
##   19     13.04490  0.4465773   9.810423
##   20     13.28740  0.4394109  10.018533
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.
ggplot(plsTune) + labs(title="PLS Model Component Tuning")

First, the data was split: 80% for training and 20% for the test set. The predictors were centered and scaled via the preProc argument of the train function. The PLS model was tuned with 10-fold cross-validation, evaluating up to 20 latent variables (tuneLength = 20).

Answer: Running the chunks interactively in RStudio and knitting the .rmd file produce slightly different results, because the random number stream is in a different state in the two workflows. The knitted results above show an optimal latent variable count of 8, with a corresponding \(R^2\) of 0.5674378; the best RMSE above, 10.74673, also occurs at 8 components. When the chunks are run in RStudio, the optimal number of latent variables is 9 based on its \(R^2\) of 0.5322824, while the best RMSE in that run, 10.75441, again occurs at 8 components.
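The selected component count and its resampled metrics can also be pulled programmatically from the train object, which avoids reading them off the printed table:

# Best tuning parameter chosen by train()
plsTune$bestTune

# Resampled metrics for that component count
plsTune$results[plsTune$results$ncomp == plsTune$bestTune$ncomp, ]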

Section d

Predict the response for the test set. What is the test set estimate of \(R^2\)?

preds_test <- predict(plsTune, test_data) 
postResample(pred=preds_test, obs=test_data$Permeability)
##      RMSE  Rsquared       MAE 
## 13.088606  0.398143  9.532294

Answer: For the test set predictions, the estimate of \(R^2\) is 0.398143.
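As a visual follow-up (my own addition, not required by the exercise), an observed-versus-predicted plot for the test set; the closer the points sit to the dashed identity line, the better the predictions:

data.frame(observed = test_data$Permeability, predicted = preds_test) %>%
  ggplot(aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(linetype = "dashed") +
  labs(title = "PLS Test Set: Observed vs. Predicted")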

Section e

Try building other models discussed in this chapter. Do any have better predictive performance?

I attempted the following model types:

  • Ridge Regression

  • Lasso

  • Elastic Net

Ridge Regression Model

# Ridge regression
# From book: Page 135
# Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, 1, by = 0.1)) 

ridgeRegFit <- train(Permeability ~ .,
                     data = training_data,
                     method = "ridge",
                     metric = "Rsquared",
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     preProc = c("center", "scale"))

ridgeRegFit
## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 121, 118, 120, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE      
##   0.0     27.25144  0.3394782  18.220341
##   0.1     12.31169  0.5402269   9.474454
##   0.2     12.12525  0.5640031   9.373737
##   0.3     12.31625  0.5704226   9.588307
##   0.4     12.65255  0.5725739   9.954747
##   0.5     13.08302  0.5729303  10.378427
##   0.6     13.58010  0.5725119  10.836059
##   0.7     14.12918  0.5717114  11.315484
##   0.8     14.72067  0.5707226  11.801494
##   0.9     15.34666  0.5696357  12.299337
##   1.0     16.00192  0.5685118  12.823477
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.5.
ggplot(ridgeRegFit) + labs(title="Ridge Regression Model With Tuning")

preds_test_rr <- predict(ridgeRegFit, test_data) 
postResample(pred=preds_test_rr, obs=test_data$Permeability)
##       RMSE   Rsquared        MAE 
## 14.7337233  0.4319239 11.2870756

The best-performing ridge regression model (selected on \(R^2\)) used a lambda value of 0.5, which resulted in:

  • \(R^2\): 0.5729303

  • RMSE: 13.08302

The lowest RMSE, 12.12525, occurred at lambda = 0.2.

On the test set, the \(R^2\) is 0.4319239.

Lasso Model

lassoGrid <- data.frame(.fraction = seq(0, 0.5, by=0.05))

lassoFit <- train(Permeability ~ .,
                  data = training_data,
                  method = 'lasso',
                  metric = 'Rsquared',
                  tuneGrid = lassoGrid,
                  trControl = ctrl,
                  preProcess = c('center','scale'))

lassoFit
## The lasso 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 120, 121, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE      
##   0.00      15.25173        NaN  12.164769
##   0.05      11.73790  0.4615889   8.631120
##   0.10      11.46013  0.4600279   8.227656
##   0.15      11.16597  0.4795671   8.054270
##   0.20      10.91302  0.4994958   7.986236
##   0.25      10.78169  0.5121227   7.945473
##   0.30      10.67609  0.5184317   7.883757
##   0.35      10.61055  0.5244119   7.842006
##   0.40      10.63854  0.5286524   7.837140
##   0.45      10.81814  0.5230483   7.970015
##   0.50      11.03395  0.5135500   8.148108
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.4.
ggplot(lassoFit) + labs(title="Lasso Model With Tuning")

preds_test_ls <- predict(lassoFit, test_data) 
postResample(pred=preds_test_ls, obs=test_data$Permeability)
##       RMSE   Rsquared        MAE 
## 15.3916216  0.2644105 10.8037744

The best-performing lasso model (selected on \(R^2\)) used a fraction value of 0.4, which resulted in:

  • \(R^2\): 0.5286524

  • RMSE: 10.63854

The lowest RMSE, 10.61055, occurred at a fraction value of 0.35.

On the test set, the \(R^2\) is 0.2644105.

Elastic Net Model

# Elastic Net
# From book: Page 136
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))

enetFit <- train(Permeability ~ .,
                 data = training_data,
                 method = 'enet',
                 metric = 'Rsquared',
                 tuneGrid = enetGrid,
                 trControl = ctrl,
                 preProcess = c('center','scale'))
enetFit
## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 120, 118, 119, 121, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      12.13174  0.4264804   8.890428
##   0.00    0.10      12.04037  0.3964501   8.651442
##   0.00    0.15      11.79216  0.4160285   8.578331
##   0.00    0.20      11.44426  0.4431953   8.271882
##   0.00    0.25      11.27123  0.4662284   8.178241
##   0.00    0.30      11.29218  0.4817690   8.193417
##   0.00    0.35      11.35585  0.4893024   8.184000
##   0.00    0.40      11.36832  0.5011068   8.163387
##   0.00    0.45      11.35545  0.5107693   8.101684
##   0.00    0.50      11.37484  0.5150926   8.064096
##   0.00    0.55      11.43404  0.5145448   8.085164
##   0.00    0.60      11.60826  0.5044263   8.172674
##   0.00    0.65      11.82871  0.4922947   8.312996
##   0.00    0.70      12.04321  0.4826022   8.453167
##   0.00    0.75      12.24838  0.4730391   8.602791
##   0.00    0.80      12.42744  0.4664210   8.762566
##   0.00    0.85      12.60865  0.4600349   8.939787
##   0.00    0.90      12.76428  0.4549951   9.069096
##   0.00    0.95      12.94244  0.4479978   9.200452
##   0.00    1.00      13.23542  0.4368884   9.414313
##   0.01    0.05      12.29772  0.3957658   8.625945
##   0.01    0.10      12.08015  0.3973304   8.630781
##   0.01    0.15      11.79673  0.4237458   8.552979
##   0.01    0.20      11.47915  0.4580001   8.334006
##   0.01    0.25      11.23882  0.4837031   8.143624
##   0.01    0.30      11.16564  0.4931745   8.051794
##   0.01    0.35      11.27716  0.4921913   8.084016
##   0.01    0.40      11.62539  0.4772361   8.321765
##   0.01    0.45      11.91671  0.4676895   8.519016
##   0.01    0.50      12.26456  0.4539864   8.776543
##   0.01    0.55      12.56161  0.4428986   8.958501
##   0.01    0.60      12.87320  0.4311745   9.145669
##   0.01    0.65      13.18229  0.4204418   9.346808
##   0.01    0.70      13.44654  0.4108620   9.515121
##   0.01    0.75      13.69427  0.4015330   9.668477
##   0.01    0.80      13.95244  0.3923449   9.826005
##   0.01    0.85      14.17345  0.3867823   9.969775
##   0.01    0.90      14.36549  0.3836568  10.097894
##   0.01    0.95      14.49952  0.3812745  10.182624
##   0.01    1.00      14.62202  0.3786416  10.256410
##   0.10    0.05      12.56087  0.4038016   9.378175
##   0.10    0.10      12.23035  0.3965524   8.510535
##   0.10    0.15      12.10772  0.3983033   8.463308
##   0.10    0.20      11.92445  0.4109733   8.548923
##   0.10    0.25      11.70282  0.4303017   8.483137
##   0.10    0.30      11.51357  0.4493951   8.408995
##   0.10    0.35      11.35203  0.4645799   8.306644
##   0.10    0.40      11.28311  0.4714924   8.234107
##   0.10    0.45      11.24892  0.4761932   8.166846
##   0.10    0.50      11.23508  0.4801107   8.157513
##   0.10    0.55      11.24363  0.4821687   8.157992
##   0.10    0.60      11.30687  0.4813663   8.232800
##   0.10    0.65      11.39429  0.4802165   8.338524
##   0.10    0.70      11.49710  0.4780400   8.454955
##   0.10    0.75      11.59164  0.4765505   8.545441
##   0.10    0.80      11.68219  0.4751533   8.636094
##   0.10    0.85      11.76112  0.4739984   8.711089
##   0.10    0.90      11.84121  0.4725046   8.780124
##   0.10    0.95      11.93251  0.4700194   8.853243
##   0.10    1.00      12.02882  0.4670917   8.928275
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.
ggplot(enetFit) + labs(title="Elastic Net Model With Tuning")

preds_test_en <- predict(enetFit, test_data) 
postResample(pred=preds_test_en, obs=test_data$Permeability)
##       RMSE   Rsquared        MAE 
## 15.7516342  0.2410341 10.6574078

The best-performing elastic net model (selected on \(R^2\)) used fraction = 0.5 and lambda = 0, which resulted in:

  • \(R^2\): 0.5150926

  • RMSE: 11.37484

The lowest RMSE, 11.16564, occurred at fraction = 0.30 and lambda = 0.01.

On the test set, the \(R^2\) is 0.2410341.

Section f

Would you recommend any of your models to replace the permeability laboratory experiment?

Based solely on the test set estimates of \(R^2\), the ridge regression model performed best with \(R^2\) = 0.4319239, compared to \(R^2\) = 0.398143 for the PLS model. That said, these results indicate that the models account for less than half of the variation in the response. The permeability assay is expensive and labor intensive according to the book, but I don’t think any of the above models explain enough of the variation in permeability to be a reliable replacement for the laboratory experiment.
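To complement the single train/test split, caret's resamples() can collect the fold-level cross-validation metrics from all four fits for a side-by-side summary. A rough sketch, assuming the fits above are still in memory; the folds were drawn independently in each train() call, so the comparison is indicative rather than exact:

resamps <- resamples(list(PLS = plsTune, Ridge = ridgeRegFit,
                          Lasso = lassoFit, ENet = enetFit))

# Distribution of resampled R^2 across the 10 folds for each model
summary(resamps)$statistics$Rsquared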


Exercise 6.3

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

Section a

Start R and use these commands to load the data:

data(ChemicalManufacturingProcess)

#str(ChemicalManufacturingProcess)
#summary(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176  58
#head(ChemicalManufacturingProcess)
cmp_data <- as.data.frame(ChemicalManufacturingProcess)

#head(cmp_data)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.

Note: processPredictors is the name used in the book’s text, but it is not an object loaded by the AppliedPredictiveModeling library. Here, the ChemicalManufacturingProcess data frame contains both the 57 predictors and the Yield response.
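As a quick check on that description, the two predictor groups can be counted from the column-name prefixes (the grepl patterns are mine):

# Should be 12 biological and 45 process predictors per the description above
sum(grepl("^BiologicalMaterial", names(cmp_data)))
sum(grepl("^ManufacturingProcess", names(cmp_data)))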

Section b

A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

# Let's identify the missing values first using vis_miss
library(naniar)
library(RANN)
vis_miss(cmp_data)

# Impute using the preProcess function, making sure not to transform
# the 'Yield' column, which is the response
cmp_preprocess_data <- preProcess(cmp_data[, -c(1)], method="knnImpute")

cmp_full_data <- predict(cmp_preprocess_data, cmp_data[, -c(1)])
cmp_full_data$Yield <- cmp_data$Yield
vis_miss(cmp_full_data)

# Note: By using knnImpute, all the data has been centered and scaled

The initial plot from vis_miss indicates missing values.

The second plot above confirms no missing values in the predictor columns after applying the kNN imputation approach based on the preProcess function.
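For a numeric complement to vis_miss, the missing values can also be counted per column in the original data; na_counts is my name:

# List the columns with the most missing values before imputation
na_counts <- colSums(is.na(cmp_data))
head(sort(na_counts[na_counts > 0], decreasing = TRUE), 10)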

Section c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

# Set the random number seed so we can reproduce the results
# Jenny, I got your number
set.seed(8675309)

nzv_cols <- nearZeroVar(cmp_full_data)
length(nzv_cols)
## [1] 1
# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) cmp_full_data <- cmp_full_data[, -nzv_cols]
dim(cmp_full_data)
## [1] 176  57
trainingRows <- createDataPartition(cmp_full_data$Yield, p = .80, list=FALSE)

# Training set
training_data <- cmp_full_data[trainingRows, ]

# Test set
test_data <- cmp_full_data[-trainingRows, ]

### Start with PLS model as performed in Exercise 6.2

# Training PLS model based on book example
ctrl <- trainControl(method = "cv", number = 10)

# NOTE: without a metric argument, train() defaults to RMSE; here we select on Rsquared
plsTune <- train(Yield ~ .,
                data = training_data,
                method = "pls",
                metric = "Rsquared",
                tuneLength = 20,
                trControl = ctrl,
                preProc = c("center", "scale")) 

plsTune
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.548929  0.3664572  1.2101762
##    2     2.194603  0.4552950  1.3026148
##    3     1.574481  0.5045231  1.1219921
##    4     1.238694  0.5619734  0.9918974
##    5     1.369146  0.5547723  1.0372528
##    6     1.533527  0.5481599  1.0937005
##    7     1.912386  0.5036053  1.2141069
##    8     2.181657  0.4596084  1.2999653
##    9     2.256578  0.4494974  1.3370226
##   10     2.624882  0.4340313  1.4352155
##   11     2.801974  0.4298023  1.4872009
##   12     2.879798  0.4331459  1.5021256
##   13     3.088889  0.4351606  1.5554874
##   14     3.366239  0.4383872  1.6248632
##   15     3.590371  0.4359457  1.6940757
##   16     3.742172  0.4389701  1.7259043
##   17     3.709196  0.4421182  1.7150210
##   18     3.630277  0.4443288  1.6977077
##   19     3.607280  0.4487621  1.6920686
##   20     3.649520  0.4497827  1.7140831
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 4.
ggplot(plsTune) + labs(title="PLS Model With Component Tuning")

Based on the \(R^2\) performance metric, the optimal latent variable count is 4, with a resampled \(R^2\) of 0.5619734.

Section d

Predict the response for the test set. What is the value of the performance metric and how does this compare with the re-sampled performance metric on the training set?

preds_test_pls <- predict(plsTune, test_data) 
postResample(pred=preds_test_pls, obs=test_data$Yield)
##      RMSE  Rsquared       MAE 
## 1.3131556 0.5463996 1.0473280

For the test set, the \(R^2\) result is 0.5463996, which is quite close to the resampled training estimate of 0.5619734.
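To line the two estimates up directly, a small sketch assembling the resampled and test-set metrics into one table (train_metrics and test_metrics are my names):

train_metrics <- plsTune$results[plsTune$results$ncomp == plsTune$bestTune$ncomp,
                                 c("RMSE", "Rsquared", "MAE")]
test_metrics <- postResample(pred = preds_test_pls, obs = test_data$Yield)

# One row per evaluation approach, same three metrics
rbind(resampled = unlist(train_metrics), test = test_metrics)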

Section e

Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

var_imp <- varImp(plsTune)

var_imp
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   85.19
## ManufacturingProcess36   84.84
## ManufacturingProcess13   79.27
## ManufacturingProcess17   77.75
## ManufacturingProcess06   64.16
## ManufacturingProcess11   60.24
## ManufacturingProcess33   54.36
## BiologicalMaterial02     53.90
## BiologicalMaterial08     53.64
## BiologicalMaterial06     53.62
## BiologicalMaterial03     51.08
## BiologicalMaterial12     49.72
## BiologicalMaterial11     49.71
## ManufacturingProcess34   48.99
## BiologicalMaterial01     47.22
## BiologicalMaterial04     44.56
## ManufacturingProcess12   44.09
## ManufacturingProcess28   41.83
## ManufacturingProcess04   39.77

As the output above indicates, the top eight predictors are all manufacturing process variables, and process predictors account for 12 of the top 20, leaving only eight biological predictors. BiologicalMaterial02, the highest-scoring biological predictor, has a variable importance of only 53.90, while the top five manufacturing process predictors all score above 75. The process predictors clearly dominate the list.
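To tally the two groups in the top 20 programmatically rather than eyeballing the printed list, a small sketch reusing var_imp (top20_preds is my name):

top20_preds <- var_imp$importance %>%
  as.data.frame() %>%
  arrange(desc(Overall)) %>%
  head(20) %>%
  rownames()

# Count process vs. biological predictors among the top 20
table(ifelse(grepl("^Manufacturing", top20_preds), "Process", "Biological"))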

Section f

Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

top10_preds <- var_imp$importance %>%
  as.data.frame() %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  rownames()

vars <- c(top10_preds, "Yield")

# all_of() makes the external character vector explicit for tidyselect
corr <- cmp_full_data %>%
  select(all_of(vars)) %>%
  cor()

corrplot(corr, method="number")

To assess the relationships between the top predictors and the response, I selected the top 10 predictors and generated a correlation plot that includes the response variable Yield. Not surprisingly, the predictor most highly correlated with the response is ManufacturingProcess32. Interestingly, ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17 are negatively correlated with Yield. If the goal is to improve Yield, the positively correlated predictors, such as ManufacturingProcess32, ManufacturingProcess09, BiologicalMaterial02, and ManufacturingProcess33, would be the variables to focus on.

Since the premise of the question states that “manufacturing process predictors can be changed in the manufacturing process,” the team should focus on increasing the positively correlated predictors ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess33 and lowering the negatively correlated predictors ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17.
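Correlations capture only the linear association, so as a final hedged sketch, scatter plots of a few of the top predictors against Yield show the shape of each relationship (the six-predictor selection and plot settings are my choices; pivot_longer comes from tidyr, loaded with the tidyverse):

cmp_full_data %>%
  select(all_of(c(head(top10_preds, 6), "Yield"))) %>%
  pivot_longer(-Yield, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(title = "Top Predictors vs. Yield")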