# Load required R libraries
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(pls)
library(elasticnet)
library(corrplot)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
data(permeability)
data(fingerprints)
fp_data <- as.data.frame(fingerprints)
perm_data <- as.data.frame(permeability)
#str(perm_data)
#str(fp_data)
#summary(fp_data)
#head(fp_data)
dim(fp_data)
## [1] 165 1107
#summary(perm_data)
#head(perm_data)
dim(perm_data)
## [1] 165 1
The fingerprints dataset has 165 observations and 1,107 features; the permeability dataset has the same 165 observations and a single column, the response.
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
nzv_cols <- nearZeroVar(fp_data)
length(nzv_cols)
## [1] 719
# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) fp_data <- fp_data[, -nzv_cols]
dim(fp_data)
## [1] 165 388
# Convert the binary 0/1 predictors to factors
fp_data <- fp_data %>%
  mutate_if(is.numeric, as.factor)
# Add the outcome variable to the predictors dataset
fp_data$Permeability <- perm_data$permeability
dim(fp_data)
## [1] 165 389
#str(fp_data)
Answer: 719 features have near-zero variance, leaving 388 features for modeling.
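As a side note, nearZeroVar flags columns using two default thresholds (freqCut = 95/5 and uniqueCut = 10). A minimal sketch, assuming those defaults, that surfaces the underlying metrics rather than just the column indices:
# Inspect why columns were flagged: frequency ratio and percent unique values
nzv_metrics <- nearZeroVar(as.data.frame(fingerprints), saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])
# The flagged count should match the 719 reported above
sum(nzv_metrics$nzv)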
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding re-sampled estimate of \(R^2\)?
# From textbook: Prior to performing PLS, the predictors should be centered and scaled,
# PLS has one tuning parameter: the number of components to retain.
# Set the random number seed so we can reproduce the results
set.seed(8675309)
#summary(fp_data)
# Use 80% for training
trainingRows <- createDataPartition(fp_data$Permeability, p = .80, list=FALSE)
# Training set
training_data <- fp_data[trainingRows, ]
# Test set
test_data <- fp_data[-trainingRows, ]
# Initial model following example from book
plsFit <- plsr(Permeability ~ ., data=training_data)
plsFit
## Partial least squares regression , fitted with the kernel algorithm.
## Call:
## plsr(formula = Permeability ~ ., data = training_data)
I’ll admit, this initial model was not helpful. On to the tuning of the model.
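As an aside, the pls package can cross-validate on its own, which would have made the initial fit more informative. A minimal sketch, where ncomp = 20, scale = TRUE, and the "onesigma" heuristic are my assumptions rather than the book's prescription:
# Refit with built-in cross-validation
plsFitCV <- plsr(Permeability ~ ., data = training_data,
                 ncomp = 20, scale = TRUE, validation = "CV")
# Plot CV R^2 by component count and pick a parsimonious model
validationplot(plsFitCV, val.type = "R2")
selectNcomp(plsFitCV, method = "onesigma")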
# Training pls model based on book example
ctrl <- trainControl(method = "cv", number = 10)
# NOTE: train() defaults to RMSE when no metric is given; here we optimize Rsquared
plsTune <- train(Permeability ~ .,
                 data = training_data,
                 method = "pls",
                 metric = "Rsquared",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
plsTune
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 120, 121, 118, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.06698 0.3799681 10.014244
## 2 11.62076 0.4842259 8.219313
## 3 11.42978 0.4907288 8.779714
## 4 11.33345 0.4947285 8.789783
## 5 11.22369 0.5277212 8.859795
## 6 10.96079 0.5457347 8.515748
## 7 10.91604 0.5481675 8.483617
## 8 10.74673 0.5674378 8.309220
## 9 10.91521 0.5665909 8.379533
## 10 11.04581 0.5637790 8.443369
## 11 11.32753 0.5485861 8.561184
## 12 11.29806 0.5466893 8.594283
## 13 11.63459 0.5235809 8.912699
## 14 11.80198 0.5108357 8.944397
## 15 12.19740 0.4911028 9.196299
## 16 12.32189 0.4840448 9.255872
## 17 12.50701 0.4725653 9.416088
## 18 12.81096 0.4536203 9.577273
## 19 13.04490 0.4465773 9.810423
## 20 13.28740 0.4394109 10.018533
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 8.
ggplot(plsTune) + labs(title="PLS Model Component Tuning")
First, the data was split: 80% for training and 20% for the test set. The data was pre-processed using the preProcess parameter of the train function to center and scale the predictors. The PLS model was then tuned with 10-fold cross-validation over a tune length of 20.
Answer: Apparently, running the chunks in RStudio and knitting the .Rmd file produce slightly different results, since the resampling draws differ between runs. The knitted results above show an optimal latent-variable count of 8 with a corresponding \(R^2\) of 0.5674378; the best RMSE, 10.74673, also occurs at 8 components. When running the chunks in RStudio, the optimal number of latent variables was 9, with a corresponding \(R^2\) of 0.5322824, while the best RMSE, 10.75441, again came from 8 components.
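One way to make the interactive and knitted runs agree is to pin the resampling seeds in trainControl. A minimal sketch, assuming 10 folds and tuneLength = 20 as above; the seed values themselves are arbitrary:
# One seed vector per resample (one seed per candidate ncomp), plus a final seed
set.seed(8675309)
cv_seeds <- c(lapply(1:10, function(i) sample.int(10000, 20)),
              list(sample.int(10000, 1)))
ctrl_fixed <- trainControl(method = "cv", number = 10, seeds = cv_seeds)
# Passing ctrl_fixed to train() should give identical resamples in both settings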
Predict the response for the test set. What is the test set estimate of \(R^2\)?
preds_test <- predict(plsTune, test_data)
postResample(pred=preds_test, obs=test_data$Permeability)
## RMSE Rsquared MAE
## 13.088606 0.398143 9.532294
Answer: For the test set predictions, the estimate of \(R^2\) is 0.398143.
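Since caret's postResample computes \(R^2\) as the squared Pearson correlation between predictions and observations, a one-line sanity check is possible:
# Should match the Rsquared reported by postResample above
cor(preds_test, test_data$Permeability)^2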
Try building other models discussed in this chapter. Do any have better predictive performance?
I have attempted three other model types from this chapter:
Ridge Regression
Lasso
Elastic Net
# Ridge regression
# From book: Page 135
# Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, 1, by = 0.1))
ridgeRegFit <- train(Permeability ~ .,
                     data = training_data,
                     method = "ridge",
                     metric = "Rsquared",
                     tuneGrid = ridgeGrid,
                     trControl = ctrl,
                     preProc = c("center", "scale"))
ridgeRegFit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 121, 118, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 27.25144 0.3394782 18.220341
## 0.1 12.31169 0.5402269 9.474454
## 0.2 12.12525 0.5640031 9.373737
## 0.3 12.31625 0.5704226 9.588307
## 0.4 12.65255 0.5725739 9.954747
## 0.5 13.08302 0.5729303 10.378427
## 0.6 13.58010 0.5725119 10.836059
## 0.7 14.12918 0.5717114 11.315484
## 0.8 14.72067 0.5707226 11.801494
## 0.9 15.34666 0.5696357 12.299337
## 1.0 16.00192 0.5685118 12.823477
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.5.
ggplot(ridgeRegFit) + labs(title="Ridge Regression Model With Tuning")
preds_test_rr <- predict(ridgeRegFit, test_data)
postResample(pred=preds_test_rr, obs=test_data$Permeability)
## RMSE Rsquared MAE
## 14.7337233 0.4319239 11.2870756
The best-performing ridge regression model used a lambda value of 0.5, which resulted in:
\(R^2\): 0.5729303
RMSE: 13.08302
The best RMSE is 12.12525, at a lambda of 0.2.
On the test set, the \(R^2\) is 0.4319239.
lassoGrid <- data.frame(.fraction = seq(0, 0.5, by=0.05))
lassoFit <- train(Permeability ~ .,
                  data = training_data,
                  method = 'lasso',
                  metric = 'Rsquared',
                  tuneGrid = lassoGrid,
                  trControl = ctrl,
                  preProcess = c('center','scale'))
lassoFit
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 120, 121, 120, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.00 15.25173 NaN 12.164769
## 0.05 11.73790 0.4615889 8.631120
## 0.10 11.46013 0.4600279 8.227656
## 0.15 11.16597 0.4795671 8.054270
## 0.20 10.91302 0.4994958 7.986236
## 0.25 10.78169 0.5121227 7.945473
## 0.30 10.67609 0.5184317 7.883757
## 0.35 10.61055 0.5244119 7.842006
## 0.40 10.63854 0.5286524 7.837140
## 0.45 10.81814 0.5230483 7.970015
## 0.50 11.03395 0.5135500 8.148108
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.4.
ggplot(lassoFit) + labs(title="Lasso Model With Tuning")
preds_test_ls <- predict(lassoFit, test_data)
postResample(pred=preds_test_ls, obs=test_data$Permeability)
## RMSE Rsquared MAE
## 15.3916216 0.2644105 10.8037744
The best-performing lasso model used a fraction value of 0.4, which resulted in:
\(R^2\): 0.5286524
RMSE: 10.63854
The best RMSE is 10.61055, at a fraction of 0.35.
On the test set, the \(R^2\) is 0.2644105.
# Elastic Net
# From book: Page 136
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
                        .fraction = seq(.05, 1, length = 20))
enetFit <- train(Permeability ~ .,
                 data = training_data,
                 method = 'enet',
                 metric = 'Rsquared',
                 tuneGrid = enetGrid,
                 trControl = ctrl,
                 preProcess = c('center','scale'))
enetFit
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 120, 120, 118, 119, 121, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 12.13174 0.4264804 8.890428
## 0.00 0.10 12.04037 0.3964501 8.651442
## 0.00 0.15 11.79216 0.4160285 8.578331
## 0.00 0.20 11.44426 0.4431953 8.271882
## 0.00 0.25 11.27123 0.4662284 8.178241
## 0.00 0.30 11.29218 0.4817690 8.193417
## 0.00 0.35 11.35585 0.4893024 8.184000
## 0.00 0.40 11.36832 0.5011068 8.163387
## 0.00 0.45 11.35545 0.5107693 8.101684
## 0.00 0.50 11.37484 0.5150926 8.064096
## 0.00 0.55 11.43404 0.5145448 8.085164
## 0.00 0.60 11.60826 0.5044263 8.172674
## 0.00 0.65 11.82871 0.4922947 8.312996
## 0.00 0.70 12.04321 0.4826022 8.453167
## 0.00 0.75 12.24838 0.4730391 8.602791
## 0.00 0.80 12.42744 0.4664210 8.762566
## 0.00 0.85 12.60865 0.4600349 8.939787
## 0.00 0.90 12.76428 0.4549951 9.069096
## 0.00 0.95 12.94244 0.4479978 9.200452
## 0.00 1.00 13.23542 0.4368884 9.414313
## 0.01 0.05 12.29772 0.3957658 8.625945
## 0.01 0.10 12.08015 0.3973304 8.630781
## 0.01 0.15 11.79673 0.4237458 8.552979
## 0.01 0.20 11.47915 0.4580001 8.334006
## 0.01 0.25 11.23882 0.4837031 8.143624
## 0.01 0.30 11.16564 0.4931745 8.051794
## 0.01 0.35 11.27716 0.4921913 8.084016
## 0.01 0.40 11.62539 0.4772361 8.321765
## 0.01 0.45 11.91671 0.4676895 8.519016
## 0.01 0.50 12.26456 0.4539864 8.776543
## 0.01 0.55 12.56161 0.4428986 8.958501
## 0.01 0.60 12.87320 0.4311745 9.145669
## 0.01 0.65 13.18229 0.4204418 9.346808
## 0.01 0.70 13.44654 0.4108620 9.515121
## 0.01 0.75 13.69427 0.4015330 9.668477
## 0.01 0.80 13.95244 0.3923449 9.826005
## 0.01 0.85 14.17345 0.3867823 9.969775
## 0.01 0.90 14.36549 0.3836568 10.097894
## 0.01 0.95 14.49952 0.3812745 10.182624
## 0.01 1.00 14.62202 0.3786416 10.256410
## 0.10 0.05 12.56087 0.4038016 9.378175
## 0.10 0.10 12.23035 0.3965524 8.510535
## 0.10 0.15 12.10772 0.3983033 8.463308
## 0.10 0.20 11.92445 0.4109733 8.548923
## 0.10 0.25 11.70282 0.4303017 8.483137
## 0.10 0.30 11.51357 0.4493951 8.408995
## 0.10 0.35 11.35203 0.4645799 8.306644
## 0.10 0.40 11.28311 0.4714924 8.234107
## 0.10 0.45 11.24892 0.4761932 8.166846
## 0.10 0.50 11.23508 0.4801107 8.157513
## 0.10 0.55 11.24363 0.4821687 8.157992
## 0.10 0.60 11.30687 0.4813663 8.232800
## 0.10 0.65 11.39429 0.4802165 8.338524
## 0.10 0.70 11.49710 0.4780400 8.454955
## 0.10 0.75 11.59164 0.4765505 8.545441
## 0.10 0.80 11.68219 0.4751533 8.636094
## 0.10 0.85 11.76112 0.4739984 8.711089
## 0.10 0.90 11.84121 0.4725046 8.780124
## 0.10 0.95 11.93251 0.4700194 8.853243
## 0.10 1.00 12.02882 0.4670917 8.928275
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were fraction = 0.5 and lambda = 0.
ggplot(enetFit) + labs(title="Elastic Net Model With Tuning")
preds_test_en <- predict(enetFit, test_data)
postResample(pred=preds_test_en, obs=test_data$Permeability)
## RMSE Rsquared MAE
## 15.7516342 0.2410341 10.6574078
The best-performing elastic net model used a fraction value of 0.5 and lambda = 0, which resulted in:
\(R^2\): 0.5150926
RMSE: 11.37484
The best RMSE is 11.16564, at a fraction of 0.30 and lambda = 0.01.
On the test set, the \(R^2\) is 0.2410341.
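Before answering the next question, it may help to line the four models up on their resampled \(R^2\). A minimal sketch using caret's resamples(); note that the fold assignments were not synchronized across the four train() calls above, so this comparison is indicative rather than exact:
# Collect the cross-validation results from all four tuned models
model_comp <- resamples(list(PLS = plsTune, Ridge = ridgeRegFit,
                             Lasso = lassoFit, ENet = enetFit))
summary(model_comp)$statistics$Rsquared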
Would you recommend any of your models to replace the permeability laboratory experiment?
Based solely on the test set estimates of \(R^2\), the ridge regression model performed the best with \(R^2\) 0.4319239, compared to \(R^2\) 0.398143 from the PLS model. That said, these results indicate that the models account for less than half of the variation in the response. According to the book, the permeability assay is expensive and labor intensive, but I don’t think any of the above models explain enough of the variation in the response to be reliable replacements for the laboratory experiment.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
#str(ChemicalManufacturingProcess)
#summary(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)
## [1] 176 58
#head(ChemicalManufacturingProcess)
cmp_data <- as.data.frame(ChemicalManufacturingProcess)
#head(cmp_data)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. Yield contains the percent yield for each run.

Not sure what processPredictors actually refers to, but it’s not an object from the AppliedPredictiveModeling library.
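My best guess is that the exercise text means the predictor portion of ChemicalManufacturingProcess, i.e., everything except the first column. A minimal sketch of that reading (the processPredictors name here just mirrors the exercise text):
# Split the loaded data frame into predictors and response
processPredictors <- ChemicalManufacturingProcess[, -1]
yield <- ChemicalManufacturingProcess$Yield
dim(processPredictors) # expect 176 x 57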
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# Let's identify the missing values first using vis_miss
library(naniar)
library(RANN)
vis_miss(cmp_data)
# Let's try to impute using the preProcess function,
# making sure not to transform the 'Yield' column, which is the response
cmp_preprocess_data <- preProcess(cmp_data[, -c(1)], method="knnImpute")
cmp_full_data <- predict(cmp_preprocess_data, cmp_data[, -c(1)])
cmp_full_data$Yield <- cmp_data$Yield
vis_miss(cmp_full_data)
# Note: By using knnImpute, all the data has been centered and scaled
The initial plot from vis_miss indicates missing values. The second plot above confirms no missing values in the predictor columns after applying the kNN imputation approach based on the preProcess function.
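For a numeric cross-check of the two plots (not just a visual one), the NA counts can be tallied directly:
# Columns with at least one missing value before imputation
sum(colSums(is.na(cmp_data)) > 0)
# Total missing values after imputation; should be 0
sum(is.na(cmp_full_data))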
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# Set the random number seed so we can reproduce the results
# Jenny, I got your number
set.seed(8675309)
nzv_cols <- nearZeroVar(cmp_full_data)
length(nzv_cols)
## [1] 1
# From: https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret
if(length(nzv_cols) > 0) cmp_full_data <- cmp_full_data[, -nzv_cols]
dim(cmp_full_data)
## [1] 176 57
trainingRows <- createDataPartition(cmp_full_data$Yield, p = .80, list=FALSE)
# Training set
training_data <- cmp_full_data[trainingRows, ]
# Test set
test_data <- cmp_full_data[-trainingRows, ]
### Start with PLS model as performed in Exercise 6.2
# Training PLS model based on book example
ctrl <- trainControl(method = "cv", number = 10)
# NOTE: train() defaults to RMSE when no metric is given; here we optimize Rsquared
plsTune <- train(Yield ~ .,
                 data = training_data,
                 method = "pls",
                 metric = "Rsquared",
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c("center", "scale"))
plsTune
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.548929 0.3664572 1.2101762
## 2 2.194603 0.4552950 1.3026148
## 3 1.574481 0.5045231 1.1219921
## 4 1.238694 0.5619734 0.9918974
## 5 1.369146 0.5547723 1.0372528
## 6 1.533527 0.5481599 1.0937005
## 7 1.912386 0.5036053 1.2141069
## 8 2.181657 0.4596084 1.2999653
## 9 2.256578 0.4494974 1.3370226
## 10 2.624882 0.4340313 1.4352155
## 11 2.801974 0.4298023 1.4872009
## 12 2.879798 0.4331459 1.5021256
## 13 3.088889 0.4351606 1.5554874
## 14 3.366239 0.4383872 1.6248632
## 15 3.590371 0.4359457 1.6940757
## 16 3.742172 0.4389701 1.7259043
## 17 3.709196 0.4421182 1.7150210
## 18 3.630277 0.4443288 1.6977077
## 19 3.607280 0.4487621 1.6920686
## 20 3.649520 0.4497827 1.7140831
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 4.
ggplot(plsTune) + labs(title="PLS Model With Component Tuning")
Based on the \(R^2\) performance metric, the optimal latent-variable count is 4, with a resampled \(R^2\) of 0.5619734.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the re-sampled performance metric on the training set?
preds_test_pls <- predict(plsTune, test_data)
postResample(pred=preds_test_pls, obs=test_data$Yield)
## RMSE Rsquared MAE
## 1.3131556 0.5463996 1.0473280
For the test set, the \(R^2\) result is 0.5463996, which is quite close to the 0.5619734 value from the training set.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
var_imp <- varImp(plsTune)
var_imp
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 85.19
## ManufacturingProcess36 84.84
## ManufacturingProcess13 79.27
## ManufacturingProcess17 77.75
## ManufacturingProcess06 64.16
## ManufacturingProcess11 60.24
## ManufacturingProcess33 54.36
## BiologicalMaterial02 53.90
## BiologicalMaterial08 53.64
## BiologicalMaterial06 53.62
## BiologicalMaterial03 51.08
## BiologicalMaterial12 49.72
## BiologicalMaterial11 49.71
## ManufacturingProcess34 48.99
## BiologicalMaterial01 47.22
## BiologicalMaterial04 44.56
## ManufacturingProcess12 44.09
## ManufacturingProcess28 41.83
## ManufacturingProcess04 39.77
As the output above indicates, the process predictors dominate: the top 8 entries are all ManufacturingProcess variables, and process predictors account for 12 of the top 20, with biological predictors making up the remaining 8. BiologicalMaterial02, the highest-scoring biological predictor, only reaches a variable importance of 53.90, while the top 5 ManufacturingProcess predictors all score above 75.
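The split can also be tallied programmatically rather than by eye. A minimal sketch, assuming the Biological*/ManufacturingProcess* naming convention used in this dataset:
# Count predictor types among the top 20 by importance
imp_df <- var_imp$importance %>%
  rownames_to_column("Predictor") %>%
  arrange(desc(Overall)) %>%
  head(20)
table(ifelse(grepl("^Biological", imp_df$Predictor), "Biological", "Process"))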
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
# Select the 10 most important predictors
top10_preds <- var_imp$importance %>%
  as.data.frame() %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  rownames()
vars <- c(top10_preds, "Yield")
# Correlation matrix of the top predictors and the response
corr <- cmp_full_data %>%
  select(all_of(vars)) %>%
  cor()
corrplot(corr, method="number")
To assess the relationships between the top predictors and the response, I selected the top 10 predictors and generated a correlation plot that includes the response variable Yield. Not surprisingly, ManufacturingProcess32 has the highest correlation with the response. Interestingly, ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17 have negative correlations with Yield. If the goal is to improve Yield, then the predictors with positive correlations would be the variables to focus on, such as ManufacturingProcess32, ManufacturingProcess09, BiologicalMaterial02, and ManufacturingProcess33.
Since the premise of the question states that “manufacturing process predictors can be changed in the manufacturing process,” the team should focus on increasing ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess33 and on lowering ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17.
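A scatterplot of the single most important predictor against the response makes the direction of that relationship concrete; a minimal sketch (the linear trend line is just a visual aid, not a fitted model from the analysis above):
# Yield vs. the top predictor, with a linear trend
ggplot(cmp_full_data, aes(x = ManufacturingProcess32, y = Yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Yield vs ManufacturingProcess32")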