CUNY DATA624 HW7

Chester Poon

3/13/2020

6.2 Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

a) The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

There are 1,107 predictors to start, as mentioned in part (a) and confirmed below with the ncol function.
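Since the underlying chunks aren’t echoed, here is a minimal sketch of the loading step, assuming the data come from the AppliedPredictiveModeling package as in the textbook:

```r
# Load the permeability data (provides `fingerprints` and `permeability`).
library(AppliedPredictiveModeling)
library(caret)

data(permeability)
ncol(fingerprints)  # number of binary molecular predictors
```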

## [1] 1107

When we use the nearZeroVar function, there are 719 predictors to remove, leaving us with 388 predictors for modeling (below).

## [1] 388

We’ll save the matrix with the relevant predictors for future use.
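A sketch of the filtering and saving steps, assuming the objects from the loading chunk above:

```r
# Flag near-zero-variance fingerprints and drop them.
nzv <- nearZeroVar(fingerprints)
length(nzv)                            # 719 predictors flagged for removal

fingerprints <- fingerprints[, -nzv]   # keep the filtered matrix for modeling
ncol(fingerprints)                     # 388 predictors remain
```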

c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?

Since we’re trying to predict permeability, we use the createDataPartition function to create our training and test sets.
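A sketch of the split, assuming an 80/20 partition (the exact proportion and seed aren’t shown in the output, so these values are illustrative):

```r
set.seed(123)  # arbitrary seed for reproducibility
y <- permeability[, 1]
inTrain <- createDataPartition(y, p = 0.8, list = FALSE)

X_train <- fingerprints[inTrain, ];  y_train <- y[inTrain]
X_test  <- fingerprints[-inTrain, ]; y_test  <- y[-inTrain]
```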

Now we tune our PLS model using 10-fold cross-validation; the output below reports the optimal number of latent variables.
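A sketch of the tuning call that matches the summary printed below (centered and scaled predictors, 10-fold CV, ncomp = 1..20):

```r
set.seed(123)
plsFit <- train(
  x = X_train, y = y_train,
  method = "pls",
  tuneLength = 20,                     # evaluate ncomp = 1..20
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
paste("The optimal number of latent variables =", plsFit$bestTune$ncomp)
```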

## [1] "The optimal number of latent variables = 5"

From the table below, we can see that for ncomp = 5 the corresponding resampled \(R^2 = 0.491\).

## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 120, 120, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.12631  0.3028742  10.029920
##    2     12.11593  0.4426055   8.774476
##    3     11.85556  0.4509721   9.039633
##    4     11.95439  0.4470189   9.005672
##    5     11.32556  0.4905685   8.470917
##    6     11.44232  0.4981471   8.512476
##    7     11.36593  0.5016860   8.579412
##    8     11.43091  0.4951021   8.738870
##    9     11.41819  0.4954867   8.602458
##   10     11.50184  0.4928341   8.669320
##   11     11.65389  0.4880699   8.692119
##   12     11.76185  0.4765640   8.755928
##   13     12.00174  0.4563971   8.992871
##   14     12.41083  0.4338308   9.146298
##   15     12.44980  0.4373552   9.173314
##   16     12.66798  0.4239454   9.278853
##   17     12.91954  0.4038774   9.570894
##   18     13.18323  0.3927971   9.693686
##   19     13.45573  0.3790540   9.875498
##   20     13.52435  0.3803545   9.867799
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

d) Predict the response for the test set. What is the test set estimate of \(R^2\)?

Below we can see that the \(R^2\) estimate for the test set is 0.286.
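A sketch of the test-set evaluation, using caret’s postResample helper:

```r
# Predict the held-out compounds and compute RMSE, R-squared, and MAE.
postResample(pred = predict(plsFit, newdata = X_test), obs = y_test)
```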

##       RMSE   Rsquared        MAE 
## 14.3499136  0.2855738 11.1917749

e) Try building other models discussed in this chapter. Do any have better predictive performance?

Below, I tried both ridge and lasso regression. Ridge performs roughly on par with the PLS model, while the lasso achieves a slightly lower RMSE both on resampling (10.42 vs. 11.33) and on the test set (13.54 vs. 14.35); none of the three models explains more than about a third of the test-set variance, however.

Ridge Regression
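A sketch of the ridge fit; the lambda grid mirrors the 15 values in the summary below (caret’s "ridge" method wraps the elasticnet package):

```r
set.seed(123)
ridgeFit <- train(
  x = X_train, y = y_train,
  method = "ridge",
  tuneGrid = data.frame(lambda = seq(0, 0.1, length.out = 15)),
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
# Test-set performance of the tuned ridge model.
postResample(pred = predict(ridgeFit, X_test), obs = y_test)
```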

## Ridge Regression 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 120, 120, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE       
##   0.000000000   24.66318  0.2757693   18.096185
##   0.007142857   15.52673  0.3245990   11.494276
##   0.014285714  139.18240  0.3409874  108.414446
##   0.021428571   13.61766  0.3868480    9.994901
##   0.028571429   13.19319  0.4045492    9.658507
##   0.035714286   12.96935  0.4155138    9.455701
##   0.042857143   12.71796  0.4268010    9.272863
##   0.050000000   12.51029  0.4380482    9.121286
##   0.057142857   12.37443  0.4450249    9.023325
##   0.064285714   12.27362  0.4514116    8.950980
##   0.071428571   12.23244  0.4550147    8.933299
##   0.078571429   12.12523  0.4613185    8.855556
##   0.085714286   12.08238  0.4642299    8.828290
##   0.092857143   12.09525  0.4654539    8.862810
##   0.100000000   12.02008  0.4703192    8.819515
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
##       RMSE   Rsquared        MAE 
## 14.2806823  0.3177461 10.5800037

Lasso Regression
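A sketch of the lasso fit via caret’s "enet" method; the grid reproduces the lambda/fraction combinations below, where the lambda = 0 rows correspond to the pure lasso path:

```r
set.seed(123)
lassoFit <- train(
  x = X_train, y = y_train,
  method = "enet",
  tuneGrid = expand.grid(lambda   = c(0, 0.01, 0.1),
                         fraction = seq(0.05, 1, by = 0.05)),
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10)
)
# Test-set performance of the tuned lasso model.
postResample(pred = predict(lassoFit, X_test), obs = y_test)
```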

## Elasticnet 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 121, 120, 120, 118, 119, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      11.43782  0.4992148   8.847282
##   0.00    0.10      11.25501  0.5129622   8.533042
##   0.00    0.15      12.03110  0.4771312   8.774326
##   0.00    0.20      12.76238  0.4533056   9.028350
##   0.00    0.25      13.27042  0.4477735   9.428899
##   0.00    0.30      13.87833  0.4289005   9.958050
##   0.00    0.35      14.62201  0.4133093  10.570820
##   0.00    0.40      15.50348  0.4012624  11.299566
##   0.00    0.45      16.49710  0.3864137  12.093590
##   0.00    0.50      17.64229  0.3704828  12.940777
##   0.00    0.55      18.64877  0.3582644  13.678525
##   0.00    0.60      19.48843  0.3507808  14.256270
##   0.00    0.65      20.14874  0.3405494  14.739689
##   0.00    0.70      20.78344  0.3311630  15.237609
##   0.00    0.75      21.40064  0.3225830  15.686945
##   0.00    0.80      22.05634  0.3133142  16.189308
##   0.00    0.85      22.75321  0.3032592  16.722482
##   0.00    0.90      23.34474  0.2943714  17.164903
##   0.00    0.95      24.00617  0.2838052  17.660912
##   0.00    1.00      24.66318  0.2757693  18.096185
##   0.01    0.05      11.30772  0.4942483   8.093209
##   0.01    0.10      10.49617  0.5453325   7.476541
##   0.01    0.15      10.57148  0.5437003   7.812613
##   0.01    0.20      10.69096  0.5345984   8.045101
##   0.01    0.25      11.15396  0.5016138   8.348004
##   0.01    0.30      11.61816  0.4704004   8.601899
##   0.01    0.35      12.00717  0.4460259   8.858209
##   0.01    0.40      12.31704  0.4303885   9.109076
##   0.01    0.45      12.54078  0.4221279   9.299680
##   0.01    0.50      12.78854  0.4134597   9.483983
##   0.01    0.55      13.06522  0.4040495   9.725846
##   0.01    0.60      13.33346  0.3943784   9.959184
##   0.01    0.65      13.60415  0.3846914  10.150502
##   0.01    0.70      13.86627  0.3759544  10.296316
##   0.01    0.75      14.13677  0.3660339  10.466103
##   0.01    0.80      14.38367  0.3562923  10.651182
##   0.01    0.85      14.60850  0.3482641  10.819588
##   0.01    0.90      14.80564  0.3415264  10.963649
##   0.01    0.95      14.97014  0.3369076  11.076676
##   0.01    1.00      15.13126  0.3326039  11.171433
##   0.10    0.05      12.54680  0.4242606   9.427604
##   0.10    0.10      11.20048  0.4958875   8.037751
##   0.10    0.15      10.63530  0.5319108   7.517095
##   0.10    0.20      10.50390  0.5461102   7.544521
##   0.10    0.25      10.49472  0.5509441   7.702546
##   0.10    0.30      10.41925  0.5587123   7.759176
##   0.10    0.35      10.52712  0.5542362   7.920797
##   0.10    0.40      10.75165  0.5400271   8.099932
##   0.10    0.45      10.94363  0.5263853   8.246305
##   0.10    0.50      11.09861  0.5155610   8.331537
##   0.10    0.55      11.20544  0.5086556   8.375126
##   0.10    0.60      11.29184  0.5037099   8.416906
##   0.10    0.65      11.36553  0.5005304   8.468361
##   0.10    0.70      11.45314  0.4967980   8.534663
##   0.10    0.75      11.54955  0.4926189   8.599881
##   0.10    0.80      11.65627  0.4876138   8.658895
##   0.10    0.85      11.75883  0.4826742   8.707970
##   0.10    0.90      11.85808  0.4779313   8.752449
##   0.10    0.95      11.94750  0.4736723   8.793308
##   0.10    1.00      12.02008  0.4703192   8.819515
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.3 and lambda = 0.1.
##       RMSE   Rsquared        MAE 
## 13.5445765  0.2975322 11.1324739

f) Would you recommend any of your models to replace the permeability laboratory experiment?

Strictly by the numbers, the lasso was the best of the three models, but I’m not sure what could be considered an acceptable \(R^2\) for this study. An acceptable \(R^2\) depends on how much of the variability simply cannot be explained, and because my knowledge of biology and pharmaceuticals is severely lacking, I cannot say whether a test-set \(R^2\) of roughly 0.3 would be sufficient to recommend replacing the laboratory experiment.

6.3 A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

a) The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

As in previous assignments, a kNN imputation method will be used to impute the missing values.
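A sketch of the imputation, assuming the data frame from the AppliedPredictiveModeling package; note that caret’s knnImpute also centers and scales the columns it imputes:

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
yield      <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]   # drop the Yield column

# Fill missing predictor cells by k-nearest-neighbor averaging (k = 5 default).
imp        <- preProcess(predictors, method = "knnImpute")
predictors <- predict(imp, predictors)
```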

c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

First, we’ll apply the nearZeroVar function used earlier and then do a train/test split on the data (a sketch of the preparation follows). Once the data have been prepared, I’ll run them through several of the models introduced in the chapter and report the optimal tuning value for each.
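The preparation sketched below again assumes an 80/20 split; each model is then tuned with the same train() pattern used in 6.2, so only the linear regression call is shown:

```r
# Drop near-zero-variance columns (56 of the 57 predictors remain).
predictors <- predictors[, -nearZeroVar(predictors)]

set.seed(123)
inTrain <- createDataPartition(yield, p = 0.8, list = FALSE)
ctrl    <- trainControl(method = "cv", number = 10)

# Plain linear regression; no pre-processing, matching the summary below.
lmFit <- train(x = predictors[inTrain, ], y = yield[inTrain],
               method = "lm", trControl = ctrl)
```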

Linear Regression

## Linear Regression 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   7.462568  0.383472  2.948277
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

For linear regression there is no real tuning parameter; as stated above, the intercept was simply held constant at a value of TRUE.

PLS

## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.554941  0.4483190  1.193826
##    2     2.055282  0.4678454  1.301299
##    3     1.596828  0.5187550  1.155859
##    4     1.767286  0.5178720  1.217920
##    5     2.054411  0.5106762  1.309666
##    6     2.182990  0.5052719  1.353758
##    7     2.525829  0.4942351  1.476046
##    8     2.662967  0.4957308  1.518107
##    9     2.919284  0.4774127  1.634768
##   10     3.193686  0.4711553  1.720623
##   11     3.501326  0.4606444  1.809975
##   12     3.647455  0.4501590  1.857916
##   13     3.633360  0.4554329  1.840247
##   14     3.658999  0.4474023  1.851775
##   15     3.620189  0.4470239  1.833028
##   16     3.482063  0.4487003  1.778911
##   17     3.341519  0.4502064  1.734531
##   18     3.244179  0.4552209  1.701815
##   19     3.097408  0.4575057  1.648129
##   20     3.098581  0.4601752  1.636867
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

For PLS, the optimal number of latent variables selected by RMSE is 1, although the resampled \(R^2\) peaks at ncomp = 3.

Ridge Regression

## Ridge Regression 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   lambda     RMSE      Rsquared   MAE     
##   0.3000000  2.171350  0.5215160  1.369490
##   0.3105263  2.165935  0.5217617  1.369538
##   0.3210526  2.161043  0.5219942  1.369735
##   0.3315789  2.156640  0.5222149  1.370085
##   0.3421053  2.152695  0.5224252  1.370611
##   0.3526316  2.149180  0.5226263  1.371462
##   0.3631579  2.146068  0.5228191  1.372406
##   0.3736842  2.143337  0.5230047  1.373563
##   0.3842105  2.140964  0.5231839  1.374892
##   0.3947368  2.138930  0.5233574  1.376320
##   0.4052632  2.137216  0.5235260  1.377904
##   0.4157895  2.135805  0.5236901  1.379553
##   0.4263158  2.134683  0.5238504  1.381261
##   0.4368421  2.133833  0.5240073  1.383056
##   0.4473684  2.133244  0.5241613  1.385372
##   0.4578947  2.132902  0.5243127  1.387750
##   0.4684211  2.132796  0.5244619  1.390172
##   0.4789474  2.132916  0.5246092  1.392766
##   0.4894737  2.133250  0.5247549  1.395413
##   0.5000000  2.133790  0.5248992  1.398096
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.4684211.

The optimal ridge model had a lambda value of 0.4684211.

Lasso

## Elasticnet 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 132, 129, 128, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE      Rsquared   MAE      
##   0.00    0.05      1.215442  0.6120655  1.0061326
##   0.00    0.10      1.278747  0.5842531  1.0038815
##   0.00    0.15      1.531577  0.5613581  1.0492453
##   0.00    0.20      1.585318  0.5570600  1.0665089
##   0.00    0.25      1.454973  0.5546742  1.0548253
##   0.00    0.30      1.517125  0.5331071  1.1233330
##   0.00    0.35      1.750319  0.5037711  1.2209377
##   0.00    0.40      1.939982  0.4869322  1.2956932
##   0.00    0.45      1.917775  0.4804730  1.3056247
##   0.00    0.50      2.374497  0.4615946  1.4353601
##   0.00    0.55      3.348818  0.4446601  1.6992079
##   0.00    0.60      4.308300  0.4322957  1.9514471
##   0.00    0.65      5.182976  0.4243114  2.1817724
##   0.00    0.70      6.022617  0.4163797  2.4142604
##   0.00    0.75      6.597587  0.4094283  2.5891565
##   0.00    0.80      6.770507  0.4030721  2.6653127
##   0.00    0.85      6.939576  0.3970876  2.7390958
##   0.00    0.90      7.057579  0.3918822  2.7987088
##   0.00    0.95      7.241703  0.3873771  2.8706171
##   0.00    1.00      7.462568  0.3834720  2.9482766
##   0.01    0.05      1.502638  0.5751176  1.2346498
##   0.01    0.10      1.270991  0.6026829  1.0587577
##   0.01    0.15      1.265598  0.5737832  1.0180169
##   0.01    0.20      1.364192  0.5645903  1.0235409
##   0.01    0.25      1.600102  0.5454874  1.0869972
##   0.01    0.30      1.775445  0.5366448  1.1460554
##   0.01    0.35      1.784454  0.5334887  1.1543766
##   0.01    0.40      1.755650  0.5366938  1.1507519
##   0.01    0.45      1.723746  0.5422050  1.1442372
##   0.01    0.50      1.708300  0.5476298  1.1461731
##   0.01    0.55      1.799308  0.5396728  1.1909662
##   0.01    0.60      1.924147  0.5285191  1.2403883
##   0.01    0.65      2.021887  0.5192113  1.2828267
##   0.01    0.70      2.252929  0.5090508  1.3569929
##   0.01    0.75      2.467770  0.5013379  1.4252814
##   0.01    0.80      2.605230  0.4945886  1.4677347
##   0.01    0.85      2.739536  0.4882602  1.5121611
##   0.01    0.90      2.988542  0.4812531  1.5878352
##   0.01    0.95      3.235669  0.4753663  1.6611628
##   0.01    1.00      3.470767  0.4701015  1.7296431
##   0.10    0.05      1.621044  0.5365171  1.3299694
##   0.10    0.10      1.445478  0.5885097  1.1920142
##   0.10    0.15      1.311479  0.5981651  1.0910337
##   0.10    0.20      1.234110  0.5940398  1.0341257
##   0.10    0.25      1.235512  0.5844133  0.9986076
##   0.10    0.30      1.365497  0.5690259  1.0147910
##   0.10    0.35      1.500354  0.5622917  1.0485666
##   0.10    0.40      1.604483  0.5552707  1.0796564
##   0.10    0.45      1.682237  0.5514107  1.1108060
##   0.10    0.50      1.702451  0.5503190  1.1289963
##   0.10    0.55      1.743950  0.5500092  1.1499204
##   0.10    0.60      1.789034  0.5478768  1.1730868
##   0.10    0.65      1.834246  0.5446151  1.1955993
##   0.10    0.70      1.886039  0.5380923  1.2259580
##   0.10    0.75      1.950352  0.5312070  1.2591265
##   0.10    0.80      2.053357  0.5249776  1.2993386
##   0.10    0.85      2.179490  0.5205151  1.3439724
##   0.10    0.90      2.293311  0.5165944  1.3830474
##   0.10    0.95      2.387531  0.5128940  1.4157880
##   0.10    1.00      2.471390  0.5093985  1.4458673
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.05 and lambda = 0.

The optimal lasso model had fraction = 0.05 and lambda = 0.

d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

We’ll create a function to evaluate how a model performs against the test set so we can reuse it for each model below.
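A minimal sketch of such a helper (the name evalTest is hypothetical):

```r
# Predict on held-out data and return caret's standard regression metrics.
evalTest <- function(fit, x, y) {
  postResample(pred = predict(fit, newdata = x), obs = y)
}

# Example: evalTest(lmFit, predictors[-inTrain, ], yield[-inTrain])
```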

Below is a comparison of the performance metrics between the test set and the training set: all four models achieved a lower RMSE on the test set than in resampling, and all but PLS also improved their \(R^2\).

Linear Regression

Train

##       RMSE Rsquared      MAE
## 1 7.462568 0.383472 2.948277

Test

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
##      RMSE  Rsquared       MAE 
## 1.1785319 0.5659899 0.9414049

PLS

Train

##       RMSE Rsquared      MAE
## 3 1.596828 0.518755 1.155859

Test

##      RMSE  Rsquared       MAE 
## 1.3581984 0.4232806 1.0821023

Ridge

Train

##       RMSE  Rsquared      MAE
## 2 2.165935 0.5217617 1.369538

Test

##      RMSE  Rsquared       MAE 
## 1.1339754 0.6156612 0.9406349

Lasso

Train

##        RMSE  Rsquared     MAE
## 21 1.502638 0.5751176 1.23465

Test

##      RMSE  Rsquared       MAE 
## 1.2836058 0.6360101 1.0379887
e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Of the models above, the lasso had the strongest test-set \(R^2\), so we’ll evaluate its most important predictors.

Below, we keep only the predictors whose coefficients are nonzero.
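A sketch of how the nonzero coefficients could be extracted from the final enet object, assuming the tuned lasso from part (c) is stored as lassoFit:

```r
# Coefficients along the lasso path, evaluated at the tuned fraction.
coefs <- predict(lassoFit$finalModel,
                 s    = lassoFit$bestTune$fraction,
                 mode = "fraction",
                 type = "coefficients")$coefficients
coefs[coefs != 0]   # keep only the predictors the lasso retained
```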

## ManufacturingProcess09 ManufacturingProcess13 ManufacturingProcess32 
##              0.2034032             -0.1959582              0.1132021

From the above, only five predictors carry nonzero coefficients, and all of them are manufacturing process predictors.

f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

The assumed goal is that the company/factory wants to increase the yield. Based on the coefficient signs above, manufacturing processes 9 and 32 should be enhanced/increased, while processes 13, 17, and 36 should be reduced or eliminated. A quick correlation check against the response (sketched below) could confirm these directions before changing the process.
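A sketch of that check, using the imputed data from part (b) and the predictor names identified above:

```r
# Correlate each top process predictor with yield; the signs should
# agree with the lasso coefficients.
top <- c("ManufacturingProcess09", "ManufacturingProcess13",
         "ManufacturingProcess17", "ManufacturingProcess32",
         "ManufacturingProcess36")
cor(predictors[, top], yield)
```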
