Developing a model to predict permeability (See Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The fingerprints predictors indicate the presense or absense of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many are left for modeling?
## [1] 719
Split the data into training and test sets, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
I will split it using a 80:20 split. I will center the data.
df <- as.data.frame(fingerprints[, nearZeroVar(fingerprints)]) %>%
mutate(y = permeability)
# Make this reproducible
set.seed(42)
ind_train <- createDataPartition(df$y, times = 1, p = 0.8, list = FALSE)
traind_df <- df[ind_train, ]
testd_df <- df[-ind_train, ]
pls_model <- train(
y ~ ., data = traind_df, method = "pls",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
# Plot model RMSE vs different values of components
title <- paste("Training Set RMSE Minimized at",
pls_model$bestTune$ncomp,
"Components")
plot(pls_model, main = title)| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| 5 | 14.29229 | 0.263486 | 10.77974 | 3.389964 | 0.2388523 | 2.7213 |
The optimal number of components in the PLS model is 5. This captures 26% of the variation in the permeability.
Predict the response of the test set. What is the test set estimate of R2 ?
# Make predictions
pls_predictions <- predict(pls_model, testd_df)
# Model performance metrics
results <- data.frame(Model = "PLS",
RMSE = caret::RMSE(pls_predictions, testd_df$y),
Rsquared = caret::R2(pls_predictions, testd_df$y))
results %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | |
|---|---|---|---|
| permeability | PLS | 13.23639 | 0.3016825 |
The R2 is 0.3.
Try building other models discussed in this chapter. Do any have better predictive performance? ### PCR Model
pcr_model <- train(
y ~ ., data = traind_df, method = "pcr",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
title <- paste("Training Set RMSE Minimized at",
pcr_model$bestTune,
"Components")
plot(pcr_model, main = title)# Make predictions
pcr_predictions <- predict(pcr_model, testd_df)
# Model performance metrics
pcr_results <- data.frame(Model = "PCR",
RMSE = caret::RMSE(pcr_predictions, testd_df$y),
Rsquared = caret::R2(pcr_predictions, testd_df$y))
pcr_results %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | |
|---|---|---|---|
| permeability | PCR | 15.90717 | 0.0344595 |
x <- model.matrix(y ~ ., data = traind_df)
x_test <- model.matrix(y ~ ., data = testd_df)
rr_cv <- cv.glmnet(x, traind_df$y, alpha = 0)
rr_model <- glmnet(x, traind_df$y, alpha = 0, lambda = rr_cv$lambda.min)
rr_predictions <- as.vector(predict(rr_model, x_test))
rr_results <- data.frame(Model = "Ridge Regression",
RMSE = caret::RMSE(rr_predictions, testd_df$y),
Rsquared = caret::R2(rr_predictions, testd_df$y))
rr_results %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | |
|---|---|---|---|
| permeability | Ridge Regression | 14.59042 | 0.1426659 |
lr_cv <- cv.glmnet(x, traind_df$y, alpha = 1)
lr_model <- glmnet(x, traind_df$y, alpha = 1, lambda = lr_cv$lambda.min)
lr_predictions <- as.vector(predict(lr_model, x_test))
lr_results <- data.frame(Model = "Lasso Regression",
RMSE = caret::RMSE(lr_predictions, testd_df$y),
Rsquared = caret::R2(lr_predictions, testd_df$y))
lr_results %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | |
|---|---|---|---|
| permeability | Lasso Regression | 13.98552 | 0.2417885 |
en_model <- train(
y ~ ., data = traind_df, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## alpha lambda
## 65 0.7 1.333236
en_predictions <- en_model %>% predict(x_test)
# Model performance metrics
en_results <- data.frame(Model = "Elastic Net Regression",
RMSE = caret::RMSE(en_predictions, testd_df$y),
Rsquared = caret::R2(en_predictions, testd_df$y))
en_results %>%
kable() %>%
kable_styling()| Model | RMSE | Rsquared | |
|---|---|---|---|
| permeability | Elastic Net Regression | 14.08594 | 0.205574 |
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
mutate("Model" = "PLS") %>%
as.data.frame() %>%
bind_rows(pcr_results) %>%
bind_rows(rr_results) %>%
bind_rows(lr_results) %>%
bind_rows(en_results) %>%
arrange(desc(Rsquared)) %>%
kable() %>%
kable_styling()| ncomp | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD | Model | |
|---|---|---|---|---|---|---|---|---|
| …1 | 5 | 14.29229 | 0.2634860 | 10.77974 | 3.389964 | 0.2388523 | 2.7213 | PLS |
| permeability…2 | NA | 13.98552 | 0.2417885 | NA | NA | NA | NA | Lasso Regression |
| permeability…3 | NA | 14.08594 | 0.2055740 | NA | NA | NA | NA | Elastic Net Regression |
| permeability…4 | NA | 14.59042 | 0.1426659 | NA | NA | NA | NA | Ridge Regression |
| permeability…5 | NA | 15.90717 | 0.0344595 | NA | NA | NA | NA | PCR |
Elasticnet model has the lowest R2.
A chemical manufacturing process for a pharmaceutical produce was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurement of the raw materials (predictors), measurements of the manufacutring process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw materials before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boot revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
use KNN to impute values
# Make this reproducible
(CHEM_knn_impute <- preProcess(ChemicalManufacturingProcess, method=c('knnImpute')))## Created from 152 samples and 58 variables
##
## Pre-processing:
## - centered (58)
## - ignored (0)
## - 5 nearest neighbor imputation (58)
## - scaled (58)
## Yield BiologicalMaterial01 BiologicalMaterial02
## Min. :-2.6692 Min. :-2.5653 Min. :-2.1858
## 1st Qu.:-0.7716 1st Qu.:-0.6078 1st Qu.:-0.7457
## Median :-0.1119 Median :-0.1491 Median :-0.1484
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7035 3rd Qu.: 0.6423 3rd Qu.: 0.7557
## Max. : 3.3394 Max. : 3.3597 Max. : 2.2459
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## Min. :-2.6830 Min. :-1.6731 Min. :-2.90576
## 1st Qu.:-0.6811 1st Qu.:-0.6222 1st Qu.:-0.73944
## Median :-0.1212 Median :-0.1405 Median :-0.05891
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6804 3rd Qu.: 0.4907 3rd Qu.: 0.70568
## Max. : 2.6355 Max. : 6.0523 Max. : 3.38985
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## Min. :-2.2184 Min. :-0.1313 Min. :-2.38535
## 1st Qu.:-0.7622 1st Qu.:-0.1313 1st Qu.:-0.64225
## Median :-0.1202 Median :-0.1313 Median : 0.02249
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6499 3rd Qu.:-0.1313 3rd Qu.: 0.56906
## Max. : 2.7948 Max. : 7.5723 Max. : 2.43034
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## Min. :-3.39629 Min. :-1.7202 Min. :-2.3116
## 1st Qu.:-0.59627 1st Qu.:-0.5685 1st Qu.:-0.6505
## Median :-0.03627 Median :-0.1513 Median :-0.1811
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67428 3rd Qu.: 0.3161 3rd Qu.: 0.5491
## Max. : 2.96246 Max. : 6.7920 Max. : 2.4431
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## Min. :-2.3914 Min. :-6.149703 Min. :-1.969253
## 1st Qu.:-0.6074 1st Qu.:-0.223563 1st Qu.: 0.308956
## Median :-0.1033 Median : 0.105667 Median : 0.509627
## Mean : 0.0000 Mean : 0.001224 Mean : 0.009518
## 3rd Qu.: 0.7112 3rd Qu.: 0.503487 3rd Qu.: 0.568648
## Max. : 2.5986 Max. : 1.587202 Max. : 0.686690
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## Min. :-3.10582 Min. :-3.323233 Min. :-2.577803
## 1st Qu.:-0.42705 1st Qu.:-0.613828 1st Qu.:-0.487046
## Median : 0.37658 Median : 0.342432 Median :-0.086583
## Mean : 0.04123 Mean : 0.003213 Mean :-0.002534
## 3rd Qu.: 0.46587 3rd Qu.: 0.661186 3rd Qu.: 0.230347
## Max. : 2.69818 Max. : 2.254953 Max. : 5.686954
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## Min. :-1.630631 Min. :-0.9580199 Min. :-1.111973
## 1st Qu.:-0.630408 1st Qu.:-0.9580199 1st Qu.:-1.111973
## Median :-0.222910 Median :-0.9580199 Median : 0.894164
## Mean :-0.006574 Mean :-0.0009072 Mean :-0.001759
## 3rd Qu.: 0.480950 3rd Qu.: 1.0378549 3rd Qu.: 0.894164
## Max. : 7.408415 Max. : 1.0378549 Max. : 0.894164
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## Min. :-4.37787 Min. :-2.18999 Min. :-2.63442
## 1st Qu.:-0.49799 1st Qu.:-0.62482 1st Qu.:-0.53867
## Median : 0.04519 Median :-0.10310 Median : 0.02020
## Mean : 0.00000 Mean : 0.02156 Mean : 0.03163
## 3rd Qu.: 0.55281 3rd Qu.: 0.54906 3rd Qu.: 0.71878
## Max. : 2.39252 Max. : 3.15768 Max. : 2.95425
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## Min. :-0.480694 Min. :-2.37172 Min. :-2.803712
## 1st Qu.:-0.480694 1st Qu.:-0.59881 1st Qu.:-0.488202
## Median :-0.480694 Median : 0.09066 Median : 0.029921
## Mean :-0.002731 Mean : 0.00000 Mean :-0.004071
## 3rd Qu.:-0.480694 3rd Qu.: 0.68163 3rd Qu.: 0.520534
## Max. : 2.068439 Max. : 4.03046 Max. : 3.688885
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## Min. :-2.3137 Min. :-12.98219 Min. :-2.43850
## 1st Qu.:-0.4960 1st Qu.: -0.01436 1st Qu.:-0.67597
## Median :-0.1273 Median : 0.06312 Median : 0.04507
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.3786 3rd Qu.: 0.15126 3rd Qu.: 0.60587
## Max. : 3.3283 Max. : 0.81376 Max. : 4.53150
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## Min. :-13.08836 Min. :-3.0321 Min. :-13.05542
## 1st Qu.: 0.00903 1st Qu.:-0.6022 1st Qu.: -0.01063
## Median : 0.06890 Median :-0.1360 Median : 0.07318
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.14237 3rd Qu.: 0.4838 3rd Qu.: 0.15197
## Max. : 0.43899 Max. : 2.5846 Max. : 0.58033
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## Min. :-2.1018 Min. :-1.6230324 Min. :-1.814768
## 1st Qu.:-0.5599 1st Qu.:-0.7223009 1st Qu.:-0.611797
## Median :-0.1745 Median :-0.1218132 Median :-0.010311
## Mean : 0.0000 Mean : 0.0003314 Mean : 0.004726
## 3rd Qu.: 0.2110 3rd Qu.: 0.7789183 3rd Qu.: 0.591175
## Max. : 4.8365 Max. : 1.9798937 Max. : 1.794146
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## Min. :-1.523304 Min. :-12.927496 Min. :-12.940454
## 1st Qu.:-0.833580 1st Qu.: 0.002208 1st Qu.: 0.008505
## Median :-0.143857 Median : 0.069146 Median : 0.063251
## Mean : 0.005061 Mean : 0.000105 Mean : 0.000404
## 3rd Qu.: 0.890729 3rd Qu.: 0.128720 3rd Qu.: 0.115417
## Max. : 2.442608 Max. : 0.433287 Max. : 0.312785
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## Min. :-12.888994 Min. :-1.25583 Min. :-12.026718
## 1st Qu.: 0.000681 1st Qu.:-1.25583 1st Qu.: -0.186978
## Median : 0.066362 Median : 0.72551 Median : -0.066778
## Mean : 0.001121 Mean :-0.02868 Mean : -0.003263
## 3rd Qu.: 0.131337 3rd Qu.: 0.78266 3rd Qu.: 0.233723
## Max. : 0.416660 Max. : 0.93507 Max. : 1.195326
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## Min. :-9.38589 Min. :-12.632749 Min. :-2.86552
## 1st Qu.:-0.37026 1st Qu.: -0.015263 1st Qu.:-0.64216
## Median : 0.03954 Median : 0.110732 Median :-0.08632
## Mean : 0.01114 Mean : -0.001722 Mean : 0.00000
## 3rd Qu.: 0.55179 3rd Qu.: 0.218728 3rd Qu.: 0.65480
## Max. : 2.08855 Max. : 0.416720 Max. : 2.69287
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## Min. :-3.037737 Min. :-3.558813 Min. :-3.01270
## 1st Qu.:-0.621676 1st Qu.: 0.118269 1st Qu.:-0.51725
## Median : 0.183677 Median : 0.118269 Median :-0.05513
## Mean :-0.005764 Mean :-0.009176 Mean :-0.02362
## 3rd Qu.: 0.586354 3rd Qu.: 0.118269 3rd Qu.: 0.49941
## Max. : 2.599738 Max. : 1.956810 Max. : 2.44032
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## Min. :-2.944307 Min. :-2.27741 Min. :-3.9024
## 1st Qu.:-0.655777 1st Qu.:-0.70467 1st Qu.:-0.8225
## Median :-0.083645 Median :-0.03064 Median : 0.7175
## Mean :-0.008228 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.488487 3rd Qu.: 0.64339 3rd Qu.: 0.7175
## Max. : 2.777017 Max. : 2.89017 Max. : 0.7175
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## Min. :-4.5508 Min. :-0.4626528 Min. :-0.440588
## 1st Qu.: 0.1653 1st Qu.:-0.4626528 1st Qu.:-0.440588
## Median : 0.2317 Median :-0.4626528 Median :-0.440588
## Mean : 0.0000 Mean : 0.0003392 Mean :-0.000392
## 3rd Qu.: 0.2982 3rd Qu.:-0.4626528 3rd Qu.:-0.440588
## Max. : 0.4310 Max. : 2.1490969 Max. : 3.275213
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## Min. :-5.77163 Min. :-1.0506 Min. :-5.60583
## 1st Qu.: 0.09979 1st Qu.:-0.3594 1st Qu.:-0.01588
## Median : 0.20280 Median :-0.1290 Median : 0.29467
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.25430 3rd Qu.: 0.1303 3rd Qu.: 0.29467
## Max. : 0.46031 Max. :11.6224 Max. : 0.91578
## ManufacturingProcess45
## Min. :-5.25447
## 1st Qu.:-0.09356
## Median : 0.15220
## Mean : 0.00000
## 3rd Qu.: 0.39796
## Max. : 1.13523
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric? Pre-process the data removing the zero variance variables prior to splitting the data.
We will split the data into 80% train and 20% test.
## [1] 176 58
Data will be split into training and test set using an 80:20 split.
## [1] 176 57
set.seed(555)
select_train <- createDataPartition(CHEM_dataframe2$Yield, times = 1, p = .80, list = FALSE)
train_x2 <- CHEM_dataframe2[select_train, ][, -c(1)]
test_x2 <- CHEM_dataframe2[-select_train, ][, -c(1)]
train_y2 <- CHEM_dataframe2[select_train, ]$Yield
test_y2 <- CHEM_dataframe2[-select_train, ]$Yield
(P_fit2 <- train(x = train_x2, y = train_y2,
method = "pls",
metric = "Rsquared",
tuneLength = 25,
trControl = trainControl(method = "cv", number=10),
preProcess = c('center', 'scale')
))## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 130, 130, 129, 130, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.7338982 0.4317348 0.6175189
## 2 0.7722558 0.4684310 0.5812528
## 3 0.8134501 0.5194408 0.5823906
## 4 0.9540440 0.5193093 0.6360902
## 5 1.0721588 0.5241915 0.6733250
## 6 1.1436786 0.5185259 0.6912503
## 7 1.2458840 0.5085213 0.7365470
## 8 1.3875640 0.4976078 0.7926451
## 9 1.4661458 0.4849316 0.8268420
## 10 1.5879129 0.4644183 0.8643309
## 11 1.6495018 0.4598224 0.8756271
## 12 1.7627131 0.4554407 0.8976769
## 13 1.7487823 0.4650528 0.8931955
## 14 1.6777520 0.4656269 0.8684202
## 15 1.6377169 0.4705341 0.8508950
## 16 1.6194606 0.4693782 0.8400870
## 17 1.5915410 0.4743053 0.8205206
## 18 1.5932854 0.4752533 0.8122477
## 19 1.6520158 0.4751787 0.8387607
## 20 1.6905995 0.4728315 0.8552220
## 21 1.7800104 0.4698293 0.8932960
## 22 1.8218081 0.4683798 0.9101214
## 23 1.9167962 0.4630918 0.9435827
## 24 1.9756004 0.4611352 0.9652468
## 25 2.0143881 0.4592907 0.9780208
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 5.
pre-process by centering and scaling the predictors.
The optimal value for the performance metric yields a RMSE of 1.07 and
R2 of 0.524.
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
## RMSE Rsquared MAE
## 0.6344353 0.6718573 0.5241329
The predictions on the test set yield a RMSE of 0.634, which is better than the one on the training set and R2 of 0.524, which is slightly lower than the one from the training set. So the model has performed very well on the test set.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list??
In this model, the most important predictor at the top of the list is “ManufacturingProcess32”. We have both biological and process predictors similarly as important in the list.
## Warning: package 'pls' was built under R version 4.3.2
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
The RMSE is about 0.62 and a R2 of roughly 0.68 on the test set.
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
corr_vals <- CHEM_dataframe2 %>%
dplyr::select('Yield', 'ManufacturingProcess32','ManufacturingProcess36',
'BiologicalMaterial06','ManufacturingProcess13',
'BiologicalMaterial03','ManufacturingProcess17',
'BiologicalMaterial02','BiologicalMaterial12',
'ManufacturingProcess09','ManufacturingProcess31')
corr_plot_vals <- cor(corr_vals)
corrplot.mixed(corr_plot_vals, tl.col = 'black', tl.pos = 'lt',
upper = "number", lower="circle")