Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
The matrix fingerprints
contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
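The output below reports how many sparse fingerprint columns were filtered out before modeling. A minimal sketch of that step, assuming caret's nearZeroVar filter was used (719 removed + 388 kept matches the 1,107 original columns):

```r
library(AppliedPredictiveModeling)
library(caret)

data(permeability)  # loads `fingerprints` and `permeability`

# Drop the sparse, near-zero-variance fingerprints before modeling
nzv <- nearZeroVar(fingerprints)  # indices of near-zero-variance columns
X <- fingerprints[, -nzv]         # 388 predictors remain
paste(length(nzv), "columns are removed.",
      ncol(X), "columns are left for modeling.")
```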
#> [1] "719 columns are removed. 388 columns are left for modeling."
Below, the data is split into training and test sets using the createDataPartition function.
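A sketch of the split; the proportion p = 0.8 is an assumption consistent with the 133 training samples reported below, and the seed value is arbitrary:

```r
set.seed(42)  # assumed seed, for reproducibility
inTrain <- createDataPartition(permeability, p = 0.8, list = FALSE)
X_train <- X[inTrain, ];  y_train <- permeability[inTrain]
X_test  <- X[-inTrain, ]; y_test  <- permeability[-inTrain]
```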
I used the train function to perform pre-processing and tuning together. The function first preprocesses the training set by centering and scaling it, then uses 10-fold cross-validation to tune the ncomp parameter (the number of components, i.e. latent variables) of the PLS model from 1 to 20.
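A minimal sketch of that call (the seed is an assumption; metric = "Rsquared" matches the selection rule shown in the output):

```r
set.seed(42)
plsFit <- train(
  x = X_train, y = y_train,
  method = "pls",
  preProcess = c("center", "scale"),                  # center + scale inside CV
  tuneLength = 20,                                    # try ncomp = 1..20
  trControl = trainControl(method = "cv", number = 10),
  metric = "Rsquared"
)
plsFit
```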
#> Partial Least Squares
#>
#> 133 samples
#> 388 predictors
#>
#> Pre-processing: centered (388), scaled (388)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
#> Resampling results across tuning parameters:
#>
#> ncomp RMSE Rsquared MAE
#> 1 12.66375 0.3461521 9.603300
#> 2 11.66557 0.4884962 8.364131
#> 3 11.86090 0.4462082 9.081715
#> 4 12.15438 0.4254096 9.389547
#> 5 12.00549 0.4492658 9.076973
#> 6 11.70875 0.4618761 8.826006
#> 7 11.42799 0.4775314 8.747634
#> 8 11.25481 0.4925856 8.533191
#> 9 11.03884 0.5066896 8.344461
#> 10 10.88506 0.5187689 8.028416
#> 11 10.80216 0.5271005 8.012505
#> 12 10.81509 0.5247495 8.034195
#> 13 10.74326 0.5245241 8.003891
#> 14 10.70988 0.5238880 7.892343
#> 15 10.87078 0.5198265 8.032440
#> 16 11.25383 0.4999003 8.375209
#> 17 11.52949 0.4886010 8.619614
#> 18 11.46969 0.4907743 8.541789
#> 19 11.46414 0.4906053 8.591222
#> 20 11.38025 0.5003197 8.484932
#>
#> Rsquared was used to select the optimal model using the largest value.
#> The final value used for the model was ncomp = 11.
Using R^2 as the deciding metric, the CV found the optimal ncomp to be 11, with a maximum R^2 of 0.5271005.
The postResample function from the caret package can be used to compute the test-set R^2 for the selected model.
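Assuming the held-out objects from the split above:

```r
# Compare the PLS predictions against the held-out responses
postResample(pred = predict(plsFit, newdata = X_test), obs = y_test)
```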
#> RMSE Rsquared MAE
#> 15.4341377 0.2899935 11.6639869
Next, I fit three more linear models for comparison: ridge regression, the lasso, and the elastic net.
I set the same seed before training each model, so their CV folds are identical. This way, I can use the resamples function to compare all 4 models at once. R^2 is the selection metric in all cases.
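A sketch of the three additional fits and the comparison; the tuning grids are inferred from the outputs below, and the seed value is an assumption:

```r
ctrl <- trainControl(method = "cv", number = 10)

# Same seed before each call => identical CV folds across models
set.seed(42)
ridgeFit <- train(X_train, y_train, method = "ridge",
                  preProcess = c("center", "scale"),
                  tuneGrid = data.frame(lambda = seq(0, 1, by = 0.1)),
                  trControl = ctrl, metric = "Rsquared")

set.seed(42)
lassoFit <- train(X_train, y_train, method = "lasso",
                  preProcess = c("center", "scale"),
                  tuneGrid = data.frame(fraction = seq(0, 0.5, by = 0.05)),
                  trControl = ctrl, metric = "Rsquared")

set.seed(42)
enetFit <- train(X_train, y_train, method = "enet",
                 preProcess = c("center", "scale"),
                 tuneGrid = expand.grid(lambda = seq(0, 1, by = 0.1),
                                        fraction = seq(0, 1, by = 0.1)),
                 trControl = ctrl, metric = "Rsquared")

resamp <- resamples(list(PLS = plsFit, Ridge = ridgeFit,
                         Lasso = lassoFit, enet = enetFit))
summary(resamp)
```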
#> Ridge Regression
#>
#> 133 samples
#> 388 predictors
#>
#> Pre-processing: centered (388), scaled (388)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
#> Resampling results across tuning parameters:
#>
#> lambda RMSE Rsquared MAE
#> 0.0 12.31893 0.4528045 9.294510
#> 0.1 10.79390 0.5292391 7.935830
#> 0.2 11.15821 0.5289694 8.241192
#> 0.3 11.56911 0.5286210 8.612171
#> 0.4 12.09487 0.5259461 9.039822
#> 0.5 12.67474 0.5230682 9.494363
#> 0.6 13.29786 0.5205254 10.029131
#> 0.7 13.95945 0.5179546 10.622090
#> 0.8 14.64851 0.5156990 11.217376
#> 0.9 15.36147 0.5136360 11.828589
#> 1.0 16.09410 0.5117506 12.434283
#>
#> Rsquared was used to select the optimal model using the largest value.
#> The final value used for the model was lambda = 0.1.
#> The lasso
#>
#> 133 samples
#> 388 predictors
#>
#> Pre-processing: centered (388), scaled (388)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
#> Resampling results across tuning parameters:
#>
#> fraction RMSE Rsquared MAE
#> 0.00 14.62719 NaN 11.901845
#> 0.05 13.29247 0.4552939 10.359126
#> 0.10 12.80621 0.4554363 9.552588
#> 0.15 12.90315 0.4235837 9.618665
#> 0.20 12.87604 0.4068843 9.600865
#> 0.25 12.75649 0.4114738 9.556420
#> 0.30 12.66593 0.4169995 9.558130
#> 0.35 12.58089 0.4227670 9.518502
#> 0.40 12.50366 0.4283152 9.458058
#> 0.45 12.44435 0.4366066 9.404528
#> 0.50 12.35249 0.4422982 9.333097
#>
#> Rsquared was used to select the optimal model using the largest value.
#> The final value used for the model was fraction = 0.1.
#> Elasticnet
#>
#> 133 samples
#> 388 predictors
#>
#> Pre-processing: centered (388), scaled (388)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 120, 120, 120, 119, 120, 119, ...
#> Resampling results across tuning parameters:
#>
#> lambda fraction RMSE Rsquared MAE
#> 0.0 0.0 14.62719 NaN 11.901845
#> 0.0 0.1 12.80621 0.4554363 9.552588
#> 0.0 0.2 12.87604 0.4068843 9.600865
#> 0.0 0.3 12.66593 0.4169995 9.558130
#> 0.0 0.4 12.50366 0.4283152 9.458058
#> 0.0 0.5 12.35249 0.4422982 9.333097
#> 0.0 0.6 12.25156 0.4459605 9.303113
#> 0.0 0.7 12.19984 0.4489988 9.311995
#> 0.0 0.8 12.21014 0.4503081 9.280251
#> 0.0 0.9 12.24057 0.4526927 9.271295
#> 0.0 1.0 12.31893 0.4528045 9.294510
#> 0.1 0.0 15.17471 NaN 12.242520
#> 0.1 0.1 11.55434 0.4874903 8.229407
#> 0.1 0.2 11.63397 0.4714981 8.459077
#> 0.1 0.3 11.17731 0.5023828 8.228974
#> 0.1 0.4 10.93531 0.5152918 8.045882
#> 0.1 0.5 10.80908 0.5216687 7.920525
#> 0.1 0.6 10.72893 0.5273809 7.828944
#> 0.1 0.7 10.69309 0.5308887 7.777732
#> 0.1 0.8 10.72480 0.5308325 7.832275
#> 0.1 0.9 10.75986 0.5298811 7.908604
#> 0.1 1.0 10.79390 0.5292391 7.935830
#> 0.2 0.0 15.17471 NaN 12.242520
#> 0.2 0.1 11.48301 0.4945241 8.170117
#> 0.2 0.2 11.84209 0.4696152 8.505668
#> 0.2 0.3 11.43389 0.4958256 8.375297
#> 0.2 0.4 11.19698 0.5114505 8.243848
#> 0.2 0.5 11.09985 0.5182971 8.147914
#> 0.2 0.6 11.00732 0.5263811 8.075977
#> 0.2 0.7 10.99994 0.5305898 8.069433
#> 0.2 0.8 11.05283 0.5299608 8.106566
#> 0.2 0.9 11.10436 0.5299953 8.173625
#> 0.2 1.0 11.15821 0.5289694 8.241192
#> 0.3 0.0 15.17471 NaN 12.242520
#> 0.3 0.1 11.43015 0.4972166 8.101725
#> 0.3 0.2 11.96864 0.4730337 8.473644
#> 0.3 0.3 11.75988 0.4899883 8.585111
#> 0.3 0.4 11.56666 0.5064758 8.533954
#> 0.3 0.5 11.46977 0.5140181 8.420096
#> 0.3 0.6 11.41659 0.5217383 8.390925
#> 0.3 0.7 11.38872 0.5284379 8.402842
#> 0.3 0.8 11.43282 0.5293766 8.443219
#> 0.3 0.9 11.50977 0.5285789 8.537666
#> 0.3 1.0 11.56911 0.5286210 8.612171
#> 0.4 0.0 15.17471 NaN 12.242520
#> 0.4 0.1 11.41234 0.4970136 8.041166
#> 0.4 0.2 12.07118 0.4773956 8.417483
#> 0.4 0.3 12.09220 0.4855603 8.772254
#> 0.4 0.4 11.96135 0.5008874 8.794705
#> 0.4 0.5 11.91102 0.5097316 8.694116
#> 0.4 0.6 11.89320 0.5169703 8.720530
#> 0.4 0.7 11.87520 0.5239754 8.757094
#> 0.4 0.8 11.92508 0.5256967 8.832045
#> 0.4 0.9 12.00938 0.5256961 8.936506
#> 0.4 1.0 12.09487 0.5259461 9.039822
#> 0.5 0.0 15.17471 NaN 12.242520
#> 0.5 0.1 11.40954 0.4963705 7.986894
#> 0.5 0.2 12.20206 0.4801158 8.392077
#> 0.5 0.3 12.46719 0.4806466 8.951887
#> 0.5 0.4 12.40354 0.4958890 9.045601
#> 0.5 0.5 12.39951 0.5050895 8.992832
#> 0.5 0.6 12.42175 0.5118355 9.100795
#> 0.5 0.7 12.42012 0.5187276 9.170141
#> 0.5 0.8 12.47215 0.5220491 9.262259
#> 0.5 0.9 12.57012 0.5224055 9.371348
#> 0.5 1.0 12.67474 0.5230682 9.494363
#> 0.6 0.0 15.17471 NaN 12.242520
#> 0.6 0.1 11.43256 0.4939374 7.946523
#> 0.6 0.2 12.36966 0.4817767 8.395529
#> 0.6 0.3 12.85821 0.4782145 9.124133
#> 0.6 0.4 12.87990 0.4922499 9.308087
#> 0.6 0.5 12.92052 0.5013594 9.353646
#> 0.6 0.6 12.97712 0.5077939 9.498849
#> 0.6 0.7 13.00404 0.5142719 9.613319
#> 0.6 0.8 13.06977 0.5182672 9.741350
#> 0.6 0.9 13.17859 0.5193988 9.891632
#> 0.6 1.0 13.29786 0.5205254 10.029131
#> 0.7 0.0 15.17471 NaN 12.242520
#> 0.7 0.1 11.46619 0.4912343 7.912957
#> 0.7 0.2 12.55303 0.4833110 8.390999
#> 0.7 0.3 13.25892 0.4767863 9.301066
#> 0.7 0.4 13.37747 0.4891867 9.594389
#> 0.7 0.5 13.47327 0.4977598 9.738364
#> 0.7 0.6 13.56088 0.5039024 9.927475
#> 0.7 0.7 13.62075 0.5100817 10.119179
#> 0.7 0.8 13.70839 0.5144473 10.288818
#> 0.7 0.9 13.82941 0.5164615 10.468116
#> 0.7 1.0 13.95945 0.5179546 10.622090
#> 0.8 0.0 15.17471 NaN 12.242520
#> 0.8 0.1 11.50692 0.4886514 7.879098
#> 0.8 0.2 12.76479 0.4839868 8.409995
#> 0.8 0.3 13.66419 0.4762875 9.519340
#> 0.8 0.4 13.89244 0.4866892 9.940965
#> 0.8 0.5 14.04197 0.4946135 10.165559
#> 0.8 0.6 14.16619 0.5006537 10.436159
#> 0.8 0.7 14.25971 0.5068011 10.645189
#> 0.8 0.8 14.36988 0.5114073 10.843979
#> 0.8 0.9 14.50376 0.5140374 11.043532
#> 0.8 1.0 14.64851 0.5156990 11.217376
#> 0.9 0.0 15.17471 NaN 12.242520
#> 0.9 0.1 11.55121 0.4862199 7.848229
#> 0.9 0.2 12.99728 0.4843262 8.472094
#> 0.9 0.3 14.07288 0.4758987 9.760572
#> 0.9 0.4 14.42043 0.4845527 10.309539
#> 0.9 0.5 14.62738 0.4918566 10.622684
#> 0.9 0.6 14.78920 0.4978107 10.969734
#> 0.9 0.7 14.91416 0.5040442 11.192740
#> 0.9 0.8 15.05381 0.5086436 11.412504
#> 0.9 0.9 15.19917 0.5118594 11.638340
#> 0.9 1.0 15.36147 0.5136360 11.828589
#> 1.0 0.0 15.17471 NaN 12.242520
#> 1.0 0.1 11.60369 0.4838849 7.823994
#> 1.0 0.2 13.24717 0.4844231 8.565415
#> 1.0 0.3 14.49421 0.4757489 10.001137
#> 1.0 0.4 14.95867 0.4828266 10.733226
#> 1.0 0.5 15.22359 0.4895516 11.135659
#> 1.0 0.6 15.42999 0.4952250 11.523292
#> 1.0 0.7 15.58606 0.5014637 11.753596
#> 1.0 0.8 15.75800 0.5060329 11.990365
#> 1.0 0.9 15.91616 0.5098193 12.235461
#> 1.0 1.0 16.09410 0.5117506 12.434283
#>
#> Rsquared was used to select the optimal model using the largest value.
#> The final values used for the model were fraction = 0.7 and lambda = 0.1.
#>
#> Call:
#> summary.resamples(object = resamp)
#>
#> Models: PLS, Ridge, Lasso, enet
#> Number of resamples: 10
#>
#> MAE
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> PLS 3.431086 6.279454 8.683473 8.012505 9.946990 10.48475 0
#> Ridge 2.932725 6.407812 8.665880 7.935830 9.924056 10.45745 0
#> Lasso 5.791257 7.693374 7.903535 9.552588 9.632303 21.21770 1
#> enet 2.993593 6.181506 8.157837 7.777732 9.796487 10.79813 0
#>
#> RMSE
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> PLS 4.127860 7.883648 12.72946 10.80216 13.06351 14.98578 0
#> Ridge 3.517458 8.293763 12.22699 10.79390 13.11415 15.70450 0
#> Lasso 7.657579 8.839116 12.01826 12.80621 14.38416 25.37189 1
#> enet 3.616262 8.054933 11.81838 10.69309 13.31569 14.94121 0
#>
#> Rsquared
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> PLS 0.0007971331 0.3795375 0.5019221 0.5271005 0.7601007 0.9084493 0
#> Ridge 0.0047218766 0.3487174 0.5283737 0.5292391 0.7709032 0.9297472 0
#> Lasso 0.0235890619 0.1246479 0.5805556 0.4554363 0.7421118 0.8656899 1
#> enet 0.0042021862 0.3407759 0.5243361 0.5308887 0.7766468 0.9259750 0
The model with the maximum mean R^2 appears to be the elastic net model, with R^2 = 0.5308887.
Below, I also evaluate the models using the test set:
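A sketch of the test-set evaluation, looping postResample over the four fitted models:

```r
models <- list(pls = plsFit, ridge = ridgeFit, lasso = lassoFit, enet = enetFit)
lapply(models, function(fit) postResample(predict(fit, X_test), y_test))
```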
#> $pls
#> RMSE Rsquared MAE
#> 15.4341377 0.2899935 11.6639869
#>
#> $ridge
#> RMSE Rsquared MAE
#> 15.0822363 0.3047575 11.4460367
#>
#> $lasso
#> RMSE Rsquared MAE
#> 12.3663838 0.3954258 9.1764050
#>
#> $enet
#> RMSE Rsquared MAE
#> 14.1795387 0.3415212 10.7570758
The evaluation on the test set suggests that the lasso model is best, with R^2 = 0.3954258. Here we seem to have a dilemma: the 10-fold cross-validation suggests that the elastic net model is best, while the test-set evaluation favors the lasso. I would choose to trust the cross-validation result, because it is a closer approximation to the true distribution than the test set, which is equivalent to just one fold of the whole data.
Nonetheless, the test-set R^2 values for ridge, lasso, and elastic net are all higher (better) than that of PLS.
I would not recommend any of these models to replace the permeability laboratory experiment. The cross-validated MAEs of all the models are roughly between 8 and 10, meaning the predictions are off by 8 to 10 units on average. Looking at the histogram of the target variable permeability:
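A sketch of the plot call (the bin count is arbitrary):

```r
hist(permeability, breaks = 30,
     main = "Distribution of permeability", xlab = "permeability")
```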
We can see that most permeability values are under 10, so an average error of 8 to 10 is far too large: the models' accuracy is not good enough to replace the lab test.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
The preProcess function can be used to impute the missing values. I chose the 'bagImpute' method, which imputes missing values with a bagged tree model.
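A minimal sketch of the imputation; the split that produces `train_rows` is an assumption (the proportion is chosen so that about 152 of the 176 runs land in training, matching the output below):

```r
data(ChemicalManufacturingProcess)  # from AppliedPredictiveModeling

predictors <- ChemicalManufacturingProcess[, -1]  # all columns except Yield

set.seed(42)  # assumed seed
train_rows <- createDataPartition(ChemicalManufacturingProcess$Yield,
                                  p = 0.86, list = FALSE)

# Fit the bagged-tree imputer on the training rows only, then apply to both sets
imputer <- preProcess(predictors[train_rows, ], method = "bagImpute")
train_x <- predict(imputer, predictors[train_rows, ])
test_x  <- predict(imputer, predictors[-train_rows, ])
imputer
```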
#> Created from 152 samples and 57 variables
#>
#> Pre-processing:
#> - bagged tree imputation (57)
#> - ignored (0)
The elastic net model is tuned using 10-fold cross-validation, with lambda ranging from 0 to 1 and fraction ranging from 0 to 1. RMSE is the selection metric.
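A sketch of the tuning call; the grid step of 0.1 is an assumption consistent with the best tuning values reported below:

```r
set.seed(42)
enetChem <- train(
  x = train_x, y = ChemicalManufacturingProcess$Yield[train_rows],
  method = "enet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(lambda = seq(0, 1, by = 0.1),
                         fraction = seq(0, 1, by = 0.1)),
  trControl = trainControl(method = "cv", number = 10),
  metric = "RMSE"
)
enetChem$bestTune
```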
The best parameter combination is fraction = 0.5 and lambda = 0.2, with a cross-validated RMSE of 1.1920333.
#> RMSE Rsquared MAE
#> 1.0611572 0.6019381 0.8536509
The test set RMSE is 1.0611572. This is lower than the cross-validated RMSE (1.1920333) on the training set, so the test-set result appears slightly better than the resampled training estimate.
The coefficients of the best-tuned elastic net model are shown below. We can see that the elastic net zeroes out some of the predictors, due to the lasso penalty.
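A sketch of how the coefficients can be extracted from the final model via the elasticnet package's predict method; note they are on the centered/scaled predictor scale, since preProcess ran inside train:

```r
beta <- predict(enetChem$finalModel,
                s = enetChem$bestTune$fraction,  # selected fraction
                mode = "fraction",
                type = "coefficients")$coefficients
beta
```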
#> BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
#> 0.00000000 0.00000000 0.00000000
#> BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
#> 0.00000000 0.00000000 0.08872051
#> BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
#> 0.00000000 0.00000000 0.00000000
#> BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
#> 0.04275827 0.00000000 0.04748137
#> ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
#> -0.04828912 0.00000000 0.45351080
#> ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
#> -0.21934545 0.00000000 0.06315242
#> ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
#> 0.00000000 -0.22082836 0.00000000
#> ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess22
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
#> 0.00000000 0.00000000 0.00000000
#> ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
#> 0.80779340 0.00000000 0.12023091
#> ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
#> 0.00000000 -0.19025738 -0.12077693
#> ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
#> 0.00000000 0.10800365 0.00000000
#> ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
#> 0.00000000 0.02043653 0.00000000
#> ManufacturingProcess44 ManufacturingProcess45
#> 0.00000000 0.05192492
We can compare the non-zero coefficients by taking their absolute values and sorting them:
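Using the `beta` vector extracted above:

```r
nonzero <- beta[beta != 0]
sort(abs(nonzero), decreasing = TRUE)
```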
#> ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess17
#> 0.80779340 0.45351080 0.22082836
#> ManufacturingProcess13 ManufacturingProcess36 ManufacturingProcess37
#> 0.21934545 0.19025738 0.12077693
#> ManufacturingProcess34 ManufacturingProcess39 BiologicalMaterial06
#> 0.12023091 0.10800365 0.08872051
#> ManufacturingProcess15 ManufacturingProcess45 ManufacturingProcess07
#> 0.06315242 0.05192492 0.04828912
#> ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
#> 0.04748137 0.04275827 0.02043653
We can conclude that the process predictors dominate: 14 of the 15 non-zero coefficients belong to ManufacturingProcess predictors, and only one (BiologicalMaterial06) is a biological measurement. The variable importance ranking below tells a similar story:
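The ranking can be reproduced with caret's model-free importance filter (a sketch; with useModel = FALSE, caret scores each predictor by the R^2 of a loess fit against the outcome, scaled to 0-100):

```r
varImp(enetChem, useModel = FALSE)  # loess R^2 filter importance
```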
#> loess r-squared variable importance
#>
#> only 20 most important variables shown (out of 57)
#>
#> Overall
#> ManufacturingProcess32 100.00
#> ManufacturingProcess13 82.21
#> ManufacturingProcess36 79.47
#> BiologicalMaterial06 75.61
#> BiologicalMaterial03 71.87
#> ManufacturingProcess17 70.62
#> BiologicalMaterial12 66.86
#> ManufacturingProcess09 62.20
#> ManufacturingProcess06 55.45
#> BiologicalMaterial02 53.61
#> ManufacturingProcess31 46.70
#> ManufacturingProcess33 45.66
#> BiologicalMaterial11 42.39
#> BiologicalMaterial04 39.70
#> ManufacturingProcess29 36.88
#> ManufacturingProcess11 36.39
#> ManufacturingProcess12 35.50
#> BiologicalMaterial08 31.86
#> BiologicalMaterial09 30.98
#> BiologicalMaterial01 29.67
Again, 11 of the 20 predictors in the list are ManufacturingProcess predictors, which again suggests that the process measurements matter more than the biological ones.
Elastic net is a linear regression model, so the coefficients directly describe how the predictors affect the target: increasing a predictor with a positive coefficient increases the predicted yield, while increasing one with a negative coefficient decreases it.
For the ManufacturingProcess predictors with positive coefficients, I would alter the process so that those predictor values increase. They are listed below:
#> ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess34
#> 0.80779340 0.45351080 0.12023091
#> ManufacturingProcess39 ManufacturingProcess15 ManufacturingProcess45
#> 0.10800365 0.06315242 0.05192492
#> ManufacturingProcess06 ManufacturingProcess04 ManufacturingProcess42
#> 0.04748137 0.04275827 0.02043653
For the ManufacturingProcess predictors with negative coefficients, I would alter the process so that those predictor values decrease. They are listed below:
#> ManufacturingProcess17 ManufacturingProcess13 ManufacturingProcess36
#> -0.22082836 -0.21934545 -0.19025738
#> ManufacturingProcess37 ManufacturingProcess07
#> -0.12077693 -0.04828912