DATA 624 - HOMEWORK 7
library(AppliedPredictiveModeling)
library(tidyverse)
library(caret)
library(pls)
library(missForest)
library(elasticnet)
library(corrplot)
Question 6.2
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a)
Start R and use these commands to load the data:
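The data come from the AppliedPredictiveModeling package loaded above; data(permeability) creates the fingerprints matrix and the permeability response used below.
data(permeability)   # provides fingerprints and permeability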
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
(b)
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Answer: 388 predictors are left for modeling after removing predictors with near zero variance.
col_nearzero <- nearZeroVar(fingerprints)
fp_filtered <- fingerprints[,-col_nearzero]
ncol(fp_filtered)
## [1] 388
(c)
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Train-test-split
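The split below is a sketch, assuming roughly an 80/20 partition drawn with caret::createDataPartition; the resampling summaries further down show 133 training compounds, consistent with this ratio.
set.seed(0)
# hold out roughly 20% of the compounds for testing (assumed split ratio)
train_idx <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
data_train_X <- fp_filtered[train_idx, ]
data_train_Y <- permeability[train_idx, 1]
data_test_X  <- fp_filtered[-train_idx, ]
data_test_Y  <- permeability[-train_idx, 1]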
Tune PLS model
Answer: 10 latent variables are optimal; the corresponding resampled estimate of R2 is 0.8433209.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- train(data_train_X, data_train_Y,
                 method = 'pls',
                 tuneLength = 20,
                 trControl = ctrl,
                 preProc = c('center', 'scale'))
plsTune$bestTune                      # ncomp = 10
resampled_pred <- predict(plsTune, data_train_X)
postResample(resampled_pred, data_train_Y)
## RMSE Rsquared MAE
## 6.1373639 0.8433209 4.3027112
Variable Importance Evaluation
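As a sketch, caret's varImp can be used to score and plot the most influential fingerprints for the tuned PLS model; the choice of 25 predictors to display is illustrative.
plsImp <- varImp(plsTune)   # PLS-based importance scores from caret
plot(plsImp, top = 25)      # plot the 25 highest-scoring fingerprints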
(d)
Predict the response for the test set. What is the test set estimate of R2?
Answer: The R2 of the test set is 0.3160511.
data_pred <- predict(plsTune, data_test_X)
pls_pred_metrics <- postResample(data_pred, data_test_Y)
pls_pred_metrics
## RMSE Rsquared MAE
## 14.6767717 0.3160511 11.1698324
(e)
Try building other models discussed in this chapter. Do any have better predictive performance?
Answer: Ridge, Lasso, and Elastic Net models are built below. Of the three, the Elastic Net model has the best predictive performance on the test set.
Ridge
The optimal lambda is 0.1473684; the R2 of the test set is 0.3209984.
Train Ridge Model
set.seed(0)
ridgeGrid <- data.frame(.lambda = seq(0, .2, length = 20))
RidgeTune <- train(data_train_X, data_train_Y,
                   method = 'ridge',
                   tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProc = c('center', 'scale'))
## Warning: model fit failed for Fold06: lambda=0.00000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 60.71006 0.4991202 29.193712
## 0.01052632 12.05791 0.4846220 8.590699
## 0.02105263 11.57500 0.5162559 8.258894
## 0.03157895 11.43398 0.5284007 8.157156
## 0.04210526 11.06693 0.5517906 7.886334
## 0.05263158 10.96832 0.5638508 7.862230
## 0.06315789 10.81088 0.5711652 7.745400
## 0.07368421 10.71966 0.5779169 7.708256
## 0.08421053 10.66994 0.5824879 7.699946
## 0.09473684 10.62469 0.5863821 7.681508
## 0.10526316 10.61000 0.5882193 7.686754
## 0.11578947 10.56765 0.5920883 7.675728
## 0.12631579 10.56695 0.5933558 7.677131
## 0.13684211 10.55660 0.5951257 7.685192
## 0.14736842 10.54805 0.5968473 7.694685
## 0.15789474 10.58958 0.5958360 7.727812
## 0.16842105 10.57866 0.5976003 7.738601
## 0.17894737 10.59367 0.5980664 7.760969
## 0.18947368 10.59047 0.5995876 7.770115
## 0.20000000 10.61157 0.5998085 7.793120
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1473684.
Prediction
ridge_pred <- predict(RidgeTune, data_test_X)
ridge_pred_metrics <- postResample(ridge_pred, data_test_Y)
ridge_pred_metrics
## RMSE Rsquared MAE
## 14.8866173 0.3209984 11.2260409
Lasso
Train Lasso Model
The optimal fraction is 0.03368421; the R2 of the test set is 0.3603223.
set.seed(0)
lassoGrid <- data.frame(.fraction = seq(0.01, .1, length = 20))
lassoTune <- train(data_train_X, data_train_Y,
                   method = 'lasso',
                   tuneGrid = lassoGrid,
                   trControl = ctrl,
                   preProc = c('center', 'scale'))
## Warning: model fit failed for Fold06: fraction=0.1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## The lasso
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.01000000 15.01230 0.4220646 11.56372
## 0.01473684 14.90148 0.4228490 11.32596
## 0.01947368 14.73416 0.4249298 11.08808
## 0.02421053 14.62962 0.4349871 10.87726
## 0.02894737 14.58540 0.4411191 10.68893
## 0.03368421 14.56377 0.4450647 10.50608
## 0.03842105 14.59098 0.4516374 10.38933
## 0.04315789 14.65946 0.4575359 10.36153
## 0.04789474 14.78465 0.4576551 10.45548
## 0.05263158 14.95305 0.4560809 10.58151
## 0.05736842 15.15743 0.4545849 10.71754
## 0.06210526 15.34749 0.4538233 10.83315
## 0.06684211 15.50782 0.4554049 10.93703
## 0.07157895 15.70231 0.4560171 11.07033
## 0.07631579 15.90921 0.4562893 11.20673
## 0.08105263 16.10125 0.4569274 11.33376
## 0.08578947 16.24817 0.4584387 11.43141
## 0.09052632 16.39067 0.4606762 11.52125
## 0.09526316 16.54697 0.4625776 11.61881
## 0.10000000 16.71106 0.4646594 11.72435
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.03368421.
Prediction
lasso_pred <- predict(lassoTune, data_test_X)
lasso_pred_metrics <- postResample(lasso_pred, data_test_Y)
lasso_pred_metrics
## RMSE Rsquared MAE
## 13.2576810 0.3603223 10.4641081
Elastic Net
Train Elastic Net Model
The optimal values are fraction = 0.09052632 and lambda = 0.1789474; the R2 of the test set is 0.3978472.
set.seed(0)
# data.frame() pairs the two sequences element-wise, so 20 (lambda, fraction)
# combinations are evaluated rather than the full 20 x 20 grid from expand.grid()
enetGrid <- data.frame(.lambda = seq(0, .2, length = 20),
                       .fraction = seq(0.01, .1, length = 20))
enetTune <- train(data_train_X, data_train_Y,
                  method = 'enet',
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c('center', 'scale'))
## Warning: model fit failed for Fold06: lambda=0.00000, fraction=0.01000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 120, 119, 121, 120, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.01000000 15.01230 0.4220646 11.563724
## 0.01052632 0.01473684 13.08430 0.4724539 10.329082
## 0.02105263 0.01947368 12.95924 0.4690177 10.184203
## 0.03157895 0.02421053 12.54755 0.4763840 9.819180
## 0.04210526 0.02894737 12.47553 0.4869010 9.747347
## 0.05263158 0.03368421 12.20585 0.4972868 9.473268
## 0.06315789 0.03842105 12.08418 0.5001782 9.345039
## 0.07368421 0.04315789 11.93248 0.5041611 9.175376
## 0.08421053 0.04789474 11.80591 0.5061118 9.024356
## 0.09473684 0.05263158 11.68354 0.5055404 8.858137
## 0.10526316 0.05736842 11.60898 0.5062403 8.731185
## 0.11578947 0.06210526 11.49028 0.4990189 8.600943
## 0.12631579 0.06684211 11.44616 0.5045535 8.476450
## 0.13684211 0.07157895 11.38741 0.5042226 8.372496
## 0.14736842 0.07631579 11.34000 0.5037701 8.282322
## 0.15789474 0.08105263 11.30962 0.5002722 8.224958
## 0.16842105 0.08578947 11.28209 0.5001292 8.154987
## 0.17894737 0.09052632 11.27615 0.4978649 8.109554
## 0.18947368 0.09526316 11.27653 0.4964541 8.063385
## 0.20000000 0.10000000 11.29302 0.4932032 8.036270
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.09052632 and lambda
## = 0.1789474.
Prediction
enet_pred <- predict(enetTune, data_test_X)
enet_pred_metrics <- postResample(enet_pred, data_test_Y)
enet_pred_metrics
## RMSE Rsquared MAE
## 12.2606421 0.3978472 8.8156558
(f)
Would you recommend any of your models to replace the permeability laboratory experiment?
Answer: According to the test set prediction metrics below, the Elastic Net model has the lowest RMSE and the highest R2, so I would recommend replacing the PLS model with the Elastic Net model.
rbind(pls_pred_metrics,
      ridge_pred_metrics,
      lasso_pred_metrics,
      enet_pred_metrics) %>%
  data.frame() %>%
  arrange(desc(Rsquared))
Question 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a)
Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)
chem_predictors <- ChemicalManufacturingProcess %>% select(-Yield) %>% as.matrix()
chem_response <- ChemicalManufacturingProcess %>% select(Yield) %>% as.matrix()
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b)
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Answer: Missing values are imputed using the missForest package.
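The call below is a minimal sketch of that imputation, assuming the completed matrix is stored as imp_chem_predictors, the name referenced in part (f).
set.seed(0)
# random-forest imputation of the missing predictor cells
imp_chem_predictors <- missForest(chem_predictors)$ximp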
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
(c)
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Train-test-split
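The split below is a sketch, assuming roughly a 75/25 partition with caret::createDataPartition; the resampling summary below shows 132 training runs, consistent with this ratio.
set.seed(0)
# hold out roughly 25% of the 176 runs for testing (assumed split ratio)
train_idx <- createDataPartition(chem_response[, 1], p = 0.75, list = FALSE)
data_train_X <- imp_chem_predictors[train_idx, ]
data_train_Y <- chem_response[train_idx, 1]
data_test_X  <- imp_chem_predictors[-train_idx, ]
data_test_Y  <- chem_response[-train_idx, 1]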
Build Elastic Net Model
Answer: An Elastic Net model is selected. The optimal values are fraction = 0.2928571 and lambda = 0.8571429, with a cross-validated RMSE of 1.227267 (the optimal value of the performance metric).
set.seed(0)
# as before, data.frame() pairs the sequences element-wise: 50 (lambda, fraction) pairs
enetGrid <- data.frame(.lambda = seq(0, 3, length = 50),
                       .fraction = seq(0.01, 1, length = 50))
enetTune <- train(data_train_X, data_train_Y,
                  method = 'enet',
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c('center', 'scale'))
enetTune
## Elasticnet
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 119, 118, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.01000000 1.636748 0.5573480 1.3397154
## 0.06122449 0.03020408 1.734346 0.5364780 1.4130006
## 0.12244898 0.05040816 1.681241 0.5592217 1.3716636
## 0.18367347 0.07061224 1.629231 0.5743009 1.3312113
## 0.24489796 0.09081633 1.575004 0.5898296 1.2901447
## 0.30612245 0.11102041 1.520786 0.6032852 1.2489968
## 0.36734694 0.13122449 1.469543 0.6111888 1.2105562
## 0.42857143 0.15142857 1.421375 0.6168614 1.1732333
## 0.48979592 0.17163265 1.376023 0.6188092 1.1352478
## 0.55102041 0.19183673 1.335158 0.6175534 1.0984173
## 0.61224490 0.21204082 1.298877 0.6136621 1.0635302
## 0.67346939 0.23224490 1.267583 0.6105534 1.0324245
## 0.73469388 0.25244898 1.244009 0.5993529 1.0143852
## 0.79591837 0.27265306 1.232483 0.5826778 1.0028098
## 0.85714286 0.29285714 1.227267 0.5723389 0.9929597
## 0.91836735 0.31306122 1.230552 0.5636743 0.9856738
## 0.97959184 0.33326531 1.238759 0.5585420 0.9786307
## 1.04081633 0.35346939 1.253779 0.5543045 0.9813659
## 1.10204082 0.37367347 1.276632 0.5505213 0.9917693
## 1.16326531 0.39387755 1.301474 0.5492610 1.0024707
## 1.22448980 0.41408163 1.329210 0.5486168 1.0215773
## 1.28571429 0.43428571 1.378793 0.5429651 1.0510728
## 1.34693878 0.45448980 1.452609 0.5352128 1.0903272
## 1.40816327 0.47469388 1.530811 0.5309380 1.1322185
## 1.46938776 0.49489796 1.612047 0.5283869 1.1774640
## 1.53061224 0.51510204 1.703980 0.5250787 1.2296456
## 1.59183673 0.53530612 1.804569 0.5207456 1.2843487
## 1.65306122 0.55551020 1.931867 0.5152172 1.3463801
## 1.71428571 0.57571429 2.091702 0.5071234 1.4186604
## 1.77551020 0.59591837 2.253616 0.5008438 1.4912837
## 1.83673469 0.61612245 2.412064 0.4963841 1.5679324
## 1.89795918 0.63632653 2.573079 0.4921721 1.6488214
## 1.95918367 0.65653061 2.738963 0.4874789 1.7318588
## 2.02040816 0.67673469 2.909691 0.4825764 1.8187355
## 2.08163265 0.69693878 3.081605 0.4777622 1.9057896
## 2.14285714 0.71714286 3.253474 0.4730963 1.9908441
## 2.20408163 0.73734694 3.427221 0.4684424 2.0752630
## 2.26530612 0.75755102 3.590375 0.4639830 2.1578821
## 2.32653061 0.77775510 3.742582 0.4600084 2.2377147
## 2.38775510 0.79795918 3.892234 0.4563995 2.3157919
## 2.44897959 0.81816327 4.034053 0.4531192 2.3922832
## 2.51020408 0.83836735 4.170594 0.4502717 2.4668428
## 2.57142857 0.85857143 4.305503 0.4475778 2.5404821
## 2.63265306 0.87877551 4.434366 0.4450682 2.6128498
## 2.69387755 0.89897959 4.557350 0.4427987 2.6826484
## 2.75510204 0.91918367 4.677790 0.4406556 2.7508181
## 2.81632653 0.93938776 4.794734 0.4385366 2.8175590
## 2.87755102 0.95959184 4.906069 0.4365229 2.8832045
## 2.93877551 0.97979592 5.013615 0.4345610 2.9471459
## 3.00000000 1.00000000 5.119666 0.4326595 3.0104908
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2928571 and lambda
## = 0.8571429.
(d)
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Answer: The R2 values for the training set and the test set are 0.6384729 and 0.4907965, respectively. As expected, the model performs better on the training set than on the held-out test set.
enet_train_pred <- predict(enetTune, data_train_X)
train_metrics <- postResample(enet_train_pred, data_train_Y)
enet_test_pred <- predict(enetTune, data_test_X)
test_metrics <- postResample(enet_test_pred, data_test_Y)
rbind(train_metrics, test_metrics) %>%
  data.frame()
(e)
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Answer:
The top 20 most important predictors are listed below. The process predictors dominate the list.
enet_vapImp <- varImp(enetTune)
enet_vapImp$importance %>%
  arrange(desc(Overall)) %>%
  top_n(20, Overall)
(f)
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
Answer: As observed from the correlation plot, all biological material (BM) predictors have a positive correlation with the response variable Yield, while the manufacturing process (MP) predictors either have a smaller positive correlation with Yield than the BM predictors or are negatively correlated with Yield. In future runs of the manufacturing process, the individual MP predictors with weak or negative correlations can be analysed further, and improvement actions can be taken on those process steps to increase the yield and boost revenue.
# names of the top-20 predictors by importance
top_pred <- enet_vapImp$importance %>%
  arrange(desc(Overall)) %>%
  top_n(20, Overall) %>%
  rownames_to_column() %>%
  spread(key = rowname, value = Overall)
# correlation matrix of the top predictors plus the response
cor_df <- data.frame(imp_chem_predictors) %>%
  select(names(top_pred)) %>%
  cbind(chem_response)
# shorten predictor names for readability in the plot
names(cor_df) <- names(cor_df) %>%
  str_replace('BiologicalMaterial', 'BM') %>%
  str_replace('ManufacturingProcess', 'MP')
cor_df <- cor(cor_df)
corrplot(cor_df)