Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
#install.packages('AppliedPredictiveModeling')
library(AppliedPredictiveModeling)
library(caret)  # provides nearZeroVar(), used below
data(permeability)
After filtering with the nearZeroVar function, 388 of the original 1,107 fingerprint predictors remain.
dim(fingerprints)
## [1] 165 1107
near_zero = nearZeroVar(fingerprints)
high_freq = fingerprints[, -near_zero]
dim(high_freq)
## [1] 165 388
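To make the filter above less of a black box, here is a minimal base-R sketch of the two statistics that nearZeroVar checks under its default cutoffs (a frequency ratio above 95/5 together with at most 10% unique values). The `toy` matrix and the helper names `freq_ratio` and `pct_unique` are made up for illustration; this is not caret's actual implementation.

```r
# Toy binary "fingerprint" matrix (hypothetical data): 100 samples, 3 columns.
toy <- cbind(
  balanced = rep(c(0, 1), each = 50),     # 50/50 split
  sparse   = c(rep(0, 98), rep(1, 2)),    # 98/2 split
  constant = rep(0, 100)                  # a single value
)

# Ratio of the most common value's count to the second most common value's count.
freq_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)     # only one distinct value
  counts[1] / counts[2]
}

# Percentage of distinct values relative to the number of samples.
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

ratios  <- apply(toy, 2, freq_ratio)
uniques <- apply(toy, 2, pct_unique)

# Flag columns using the same thresholds as caret's documented defaults.
flagged <- ratios > 95 / 5 & uniques <= 10
print(flagged)
```

Here the balanced column is kept, while the sparse and constant columns are flagged, which mirrors why most of the sparse fingerprint columns are dropped.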
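RMSE was used to select the optimal model, with the smallest value preferred. The final model uses ncomp = 9, which has the lowest cross-validated RMSE (11.43) and also the highest R-squared (0.51). Note that ncomp = 2 comes within 0.01 of that RMSE with far fewer components.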
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 119, 121, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.71638 0.3726546 9.803750
## 2 11.43482 0.4812433 8.287554
## 3 11.56869 0.4775294 8.635061
## 4 11.82234 0.4622131 9.000811
## 5 11.96746 0.4722405 9.004607
## 6 11.80342 0.4763650 8.712800
## 7 11.75711 0.4782825 8.794521
## 8 11.57750 0.4943179 8.741823
## 9 11.42685 0.5092455 8.573922
## 10 11.65230 0.5024878 8.601296
## 11 12.19187 0.4755751 8.866223
## 12 12.24185 0.4766699 8.927797
## 13 12.56914 0.4611196 9.122023
## 14 12.87329 0.4497174 9.246870
## 15 13.26387 0.4322739 9.658887
## 16 13.21115 0.4341617 9.699480
## 17 13.45490 0.4270706 9.834210
## 18 13.65659 0.4172002 10.008847
## 19 13.68960 0.4115089 10.210977
## 20 13.50473 0.4179000 10.122616
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
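The code that produced this summary is not shown, so the following is only a sketch of a call that would give a summary of the same shape. The seed and the 80/20 split are assumptions (the split is consistent with the 133 training samples reported), so the resampled values will not necessarily match the output above.

```r
library(AppliedPredictiveModeling)
library(caret)

data(permeability)   # loads `fingerprints` and `permeability`
high_freq <- fingerprints[, -nearZeroVar(fingerprints)]

set.seed(123)        # hypothetical seed, not taken from the source
in_train <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
train_x <- high_freq[in_train, ]
train_y <- permeability[in_train, 1]
test_x  <- high_freq[-in_train, ]
test_y  <- permeability[-in_train, 1]

# Tune the number of PLS components (1 to 20) with 10-fold cross-validation.
pls_fit <- train(
  x = train_x, y = train_y,
  method = "pls",
  tuneLength = 20,
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)

pls_fit$bestTune     # the ncomp value with the smallest resampled RMSE
```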
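R-squared is higher on the held-out test set (0.80) than the cross-validated estimate from the training set (0.51). With only 32 test samples, the test-set estimate is itself quite variable.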
## RMSE Rsquared MAE
## 6.9841682 0.7979941 5.1395247
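The three numbers above have the format returned by caret's postResample. As a rough illustration of what they measure, the base-R sketch below computes the same three quantities on made-up vectors; `obs` and `pred` are hypothetical, and R-squared is taken as the squared correlation between predictions and observations.

```r
# Hypothetical observed and predicted values, for illustration only.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared correlation
mae  <- mean(abs(pred - obs))        # mean absolute error

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

Because R-squared here is a squared correlation, it rewards predictions that track the ordering of the observations even when they are offset, which is one reason it can differ noticeably between resampling and a small test set.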
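PLS gave the lowest cross-validated RMSE (11.43) of the models compared: elastic net reached 11.70, ridge regression 12.04, and ordinary linear regression 31.07. Its R-squared (0.51) is close to the highest value, 0.52 for ridge regression.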
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Linear Regression
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 120, 119, 119, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 31.07298 0.1988006 20.87865
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Warning: model fit failed for Fold08: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 121, 119, 121, 119, 119, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 14.73314 0.4598381 11.137947
## 0.007142857 14.57374 0.3832517 11.033327
## 0.014285714 13.59025 0.4288226 10.348159
## 0.021428571 13.18649 0.4493503 10.074866
## 0.028571429 12.94008 0.4634247 9.896530
## 0.035714286 12.69788 0.4770145 9.725875
## 0.042857143 12.56784 0.4854845 9.629233
## 0.050000000 12.45986 0.4920169 9.549563
## 0.057142857 12.39794 0.4976161 9.488381
## 0.064285714 12.31467 0.5031968 9.409630
## 0.071428571 12.23896 0.5082542 9.345222
## 0.078571429 12.17124 0.5133017 9.281102
## 0.085714286 12.11938 0.5170257 9.235251
## 0.092857143 12.09821 0.5195938 9.210768
## 0.100000000 12.04437 0.5232864 9.159174
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
## Warning: model fit failed for Fold01: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold02: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold06: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Elasticnet
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 121, 120, 119, 119, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 12.05500 0.4972413 8.469496
## 0.00 0.10 11.99768 0.4681608 8.400393
## 0.00 0.15 12.21350 0.4524943 8.778776
## 0.00 0.20 12.39718 0.4533848 9.096279
## 0.00 0.25 12.50963 0.4569811 9.241537
## 0.00 0.30 12.59068 0.4630397 9.345331
## 0.00 0.35 12.73778 0.4641009 9.531024
## 0.00 0.40 12.89001 0.4629444 9.720604
## 0.00 0.45 13.00617 0.4627482 9.923570
## 0.00 0.50 13.31246 0.4490701 10.186072
## 0.00 0.55 13.64578 0.4383642 10.483213
## 0.00 0.60 13.95987 0.4326433 10.766972
## 0.00 0.65 14.33329 0.4273412 11.073196
## 0.00 0.70 14.70119 0.4224003 11.342834
## 0.00 0.75 15.02981 0.4181737 11.556056
## 0.00 0.80 15.34786 0.4135673 11.769453
## 0.00 0.85 15.61921 0.4092783 11.982167
## 0.00 0.90 15.98723 0.4046158 12.239297
## 0.00 0.95 16.21398 0.4012868 12.397839
## 0.00 1.00 16.34694 0.3994760 12.490594
## 0.01 0.05 12.00311 0.4432811 8.541659
## 0.01 0.10 11.73877 0.4577646 8.442520
## 0.01 0.15 12.01307 0.4527905 8.702248
## 0.01 0.20 12.20135 0.4478607 8.840800
## 0.01 0.25 12.19961 0.4498527 8.864502
## 0.01 0.30 12.25606 0.4497001 8.928818
## 0.01 0.35 12.29561 0.4519955 8.961331
## 0.01 0.40 12.37765 0.4535000 9.025607
## 0.01 0.45 12.46253 0.4529305 9.094534
## 0.01 0.50 12.52663 0.4538468 9.202575
## 0.01 0.55 12.61490 0.4533859 9.305514
## 0.01 0.60 12.71660 0.4519459 9.409789
## 0.01 0.65 12.79307 0.4498799 9.448750
## 0.01 0.70 12.84677 0.4484607 9.534262
## 0.01 0.75 12.93775 0.4457568 9.619744
## 0.01 0.80 13.08526 0.4408351 9.727770
## 0.01 0.85 13.28313 0.4328238 9.902945
## 0.01 0.90 13.53842 0.4219862 10.122445
## 0.01 0.95 13.80846 0.4096135 10.349692
## 0.01 1.00 14.05687 0.3982597 10.555394
## 0.10 0.05 12.13615 0.4897295 9.166732
## 0.10 0.10 11.87082 0.4503997 8.353546
## 0.10 0.15 11.77572 0.4521499 8.335728
## 0.10 0.20 11.70048 0.4624965 8.332907
## 0.10 0.25 11.71634 0.4668777 8.387695
## 0.10 0.30 11.77893 0.4676413 8.521179
## 0.10 0.35 11.82911 0.4679777 8.609616
## 0.10 0.40 11.84028 0.4696330 8.659048
## 0.10 0.45 11.88772 0.4693870 8.741836
## 0.10 0.50 11.98615 0.4666372 8.855073
## 0.10 0.55 12.06613 0.4646154 8.914644
## 0.10 0.60 12.10204 0.4647323 8.932619
## 0.10 0.65 12.11097 0.4665819 8.930454
## 0.10 0.70 12.13252 0.4675700 8.930057
## 0.10 0.75 12.17634 0.4665709 8.932311
## 0.10 0.80 12.23772 0.4646346 8.987128
## 0.10 0.85 12.31147 0.4625688 9.060522
## 0.10 0.90 12.39566 0.4598667 9.148148
## 0.10 0.95 12.50449 0.4557333 9.256571
## 0.10 1.00 12.61255 0.4518290 9.359723
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.1.
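The calls behind the three summaries above are not shown. The sketch below is one way they could be reproduced: the tuning grids are read off the printed results (15 ridge penalties from 0 to 0.1, and 3 lambdas by 20 fractions for the elastic net), while the seed and the 80/20 split are assumptions. The ridge and enet methods rely on the elasticnet package being installed.

```r
library(AppliedPredictiveModeling)
library(caret)

data(permeability)
high_freq <- fingerprints[, -nearZeroVar(fingerprints)]

set.seed(123)   # hypothetical seed
in_train <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
train_x <- high_freq[in_train, ]
train_y <- permeability[in_train, 1]

ctrl <- trainControl(method = "cv", number = 10)

# Grids inferred from the printed tuning tables.
ridge_grid <- data.frame(lambda = seq(0, 0.1, length.out = 15))
enet_grid  <- expand.grid(lambda = c(0, 0.01, 0.1),
                          fraction = seq(0.05, 1, length.out = 20))

# Ordinary least squares: expect rank-deficiency warnings (388 predictors, ~133 rows).
lm_fit <- train(x = train_x, y = train_y, method = "lm", trControl = ctrl)

ridge_fit <- train(x = train_x, y = train_y, method = "ridge",
                   tuneGrid = ridge_grid,
                   preProcess = c("center", "scale"), trControl = ctrl)

enet_fit <- train(x = train_x, y = train_y, method = "enet",
                  tuneGrid = enet_grid,
                  preProcess = c("center", "scale"), trControl = ctrl)

# Best resampled RMSE for each model.
cv_rmse <- sapply(list(lm = lm_fit, ridge = ridge_fit, enet = enet_fit),
                  function(fit) min(fit$results$RMSE, na.rm = TRUE))
print(cv_rmse)
```

Some folds may fail at lambda = 0, as in the warnings above; `na.rm = TRUE` keeps those from hiding the remaining results.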
If I had to recommend a model, I’d recommend PLS: it has the lowest cross-validated RMSE (11.43) and a test-set R-squared of 0.80.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.
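The loading and imputation code is not shown in this write-up. As a sketch, the data set ships with the AppliedPredictiveModeling package, and caret's preProcess offers several imputation methods. Median imputation is used here only because it needs no extra packages; the source does not say which method was actually applied.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
cmp <- ChemicalManufacturingProcess

# Separate the response (Yield) from the 57 predictors.
predictors <- cmp[, names(cmp) != "Yield"]
yield <- cmp$Yield

n_missing_before <- sum(is.na(predictors))

# Assumed choice: fill each missing cell with its column median.
impute_model <- preProcess(predictors, method = "medianImpute")
imputed <- predict(impute_model, predictors)

n_missing_after <- sum(is.na(imputed))
print(c(before = n_missing_before, after = n_missing_after))
```

In a full analysis the imputation model would normally be estimated on the training rows only and then applied to the test rows, so that no test-set information leaks into training.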
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
I will use a PLS model, since it performed best in the previous exercise. RMSE was used to select the optimal model, with the smallest value preferred. The final model uses ncomp = 10, which also gives the highest R-squared (0.61).
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 130, 130, 130, 129, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8312122 0.4582919 0.6368723
## 2 0.8784130 0.5065290 0.6044327
## 3 0.7449064 0.5860351 0.5387970
## 4 0.7191004 0.5974733 0.5526837
## 5 0.7404487 0.5932325 0.5659655
## 6 0.7452959 0.5893215 0.5681318
## 7 0.7684710 0.5780940 0.5925164
## 8 0.7513039 0.5867866 0.5912885
## 9 0.7132363 0.6101240 0.5655940
## 10 0.7044069 0.6113084 0.5601250
## 11 0.7084133 0.6090025 0.5526973
## 12 0.7206937 0.6059508 0.5488486
## 13 0.7350137 0.6045316 0.5580296
## 14 0.7261172 0.6036587 0.5626825
## 15 0.7348801 0.5972221 0.5747648
## 16 0.7197914 0.6006316 0.5705944
## 17 0.7221599 0.5943178 0.5710099
## 18 0.7256506 0.5867697 0.5691657
## 19 0.7400848 0.5720982 0.5808337
## 20 0.7650570 0.5470784 0.5935269
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 10.
On the test set, R-squared was higher (0.78 versus 0.61) and RMSE was lower (0.47 versus 0.70) than the cross-validated estimates from the training set.
## RMSE Rsquared MAE
## 0.4702021 0.7831245 0.3666030
ManufacturingProcess32 is the most important predictor in the model. Process predictors account for 13 of the 20 most important variables shown and hold the top five positions. Below those five, the process and biological predictors have fairly similar importance scores (roughly 47 to 60).
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 88.06
## ManufacturingProcess13 86.60
## ManufacturingProcess17 79.38
## ManufacturingProcess36 77.46
## BiologicalMaterial02 59.84
## ManufacturingProcess06 57.66
## BiologicalMaterial08 57.62
## ManufacturingProcess33 57.34
## BiologicalMaterial06 56.95
## BiologicalMaterial11 55.19
## ManufacturingProcess12 54.80
## BiologicalMaterial12 54.05
## ManufacturingProcess11 53.01
## BiologicalMaterial03 50.57
## ManufacturingProcess30 49.55
## ManufacturingProcess31 47.83
## ManufacturingProcess29 47.60
## ManufacturingProcess28 47.23
## BiologicalMaterial04 46.75
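For reference, a ranking like the one above can be obtained from a tuned caret model with varImp. The sketch below strings the steps of this exercise together under stated assumptions: a hypothetical seed, an 80/20 split (consistent with the 144 training samples reported), and median imputation in place of whatever method was actually used. The scores it prints will therefore differ from the table above.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
cmp <- ChemicalManufacturingProcess
x <- cmp[, names(cmp) != "Yield"]
y <- cmp$Yield

set.seed(123)   # hypothetical seed
in_train <- createDataPartition(y, p = 0.8, list = FALSE)

# Estimate the imputation on the training rows, then apply it to both sets.
imp <- preProcess(x[in_train, ], method = "medianImpute")
train_x <- predict(imp, x[in_train, ])
test_x  <- predict(imp, x[-in_train, ])

pls_fit <- train(
  x = train_x, y = y[in_train],
  method = "pls",
  tuneLength = 20,
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)

# Test-set RMSE, R-squared and MAE.
test_perf <- postResample(pred = predict(pls_fit, test_x), obs = y[-in_train])
print(test_perf)

# Importance scores, scaled so that the top predictor is 100.
imp_scores <- varImp(pls_fit)$importance
top20 <- head(imp_scores[order(-imp_scores$Overall), , drop = FALSE], 20)
print(top20)
```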
The most important predictors have a strong positive or negative correlation with the Yield response variable. Knowing how the variables correlate can suggest which ones to remove to improve the model.
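One way to examine these correlations is sketched below: it ranks the predictors by their correlation with Yield and then uses caret's findCorrelation to list predictors that are strongly correlated with other predictors. The 0.9 cutoff is an arbitrary illustrative choice, and missing values are handled by pairwise deletion rather than imputation.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
cmp <- ChemicalManufacturingProcess
x <- cmp[, names(cmp) != "Yield"]

# Correlation of each predictor with Yield, using rows where both are observed.
yield_cor <- cor(x, cmp$Yield, use = "pairwise.complete.obs")[, 1]
print(head(yield_cor[order(-abs(yield_cor))], 10))

# Predictors that are highly correlated with other predictors (candidates for removal).
pred_cor  <- cor(x, use = "pairwise.complete.obs")
redundant <- findCorrelation(pred_cor, cutoff = 0.9)   # assumed cutoff
print(names(x)[redundant])
```

Dropping the columns in `redundant` before refitting is a simple way to test whether removing correlated predictors actually helps; whether it does should be judged by the resampled RMSE.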