6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling)

data(permeability)

## [1]  165 1108

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.

(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?

Most of the predictors show a low frequency in that contain mostly zeroes. The following missmap is coded to show the positions of those observations containign zero values. As we can see below, most of the cells and predictors contain zeroes.

This is the patter of zero values in the data set after the columns with near zero variance were removed.

Using the nearZeroVar function we reduced the number of predictor variables from 1107 to 388 or 35.05% of the initial set of predictors.

(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

  • First let's joing the target variable permeability to the set of predictor variables.

  • After removing predictors using the nearZeroVar function we are left with a zeries of binary observations (1 or 0) and with no missing values. Pre-processing would not accomplish much so we continue to the patition of the filtered data set into training and testing sets.

  • Then we can split the joint data (165 total observations) set into a Train and Test set with the 0.8 ratio.

  • Then we will tune the model using a Partial Least Squares (PLS) model:

##   ncomp
## 5     5
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 121, 120, 119, 120, 120, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     12.63198  0.3409357  9.560486
##    2     11.30146  0.4996715  7.946391
##    3     11.01391  0.5329650  8.313197
##    4     10.84232  0.5609288  8.199199
##    5     10.55942  0.5865098  7.908282
##    6     10.62644  0.5660936  8.013765
##    7     10.66660  0.5632970  8.151901
##    8     10.77245  0.5542260  8.200827
##    9     10.69334  0.5731825  7.902640
##   10     10.79171  0.5637022  8.010114
##   11     11.15479  0.5347263  8.243973
##   12     11.27761  0.5238127  8.319206
##   13     11.39654  0.5122514  8.535408
##   14     11.41730  0.5116199  8.510090
##   15     11.56485  0.4986354  8.563046
##   16     11.75286  0.4844587  8.697778
##   17     12.10311  0.4664784  8.914445
##   18     12.22208  0.4538516  9.006182
##   19     12.17944  0.4539222  8.958104
##   20     12.17942  0.4581364  8.952671
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

We see how the optimum number of latent variables is 5. This optimum number will output an accuracy of calculated RMSE = 10.56 , Rsquared = 0.59 and MAE = 7.91.

## Data:    X dimension: 133 388 
##  Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           21.76    34.47    40.26    45.62    51.31
## .outcome    35.59    54.38    61.01    66.41    72.38

4 Data Splitting

https://topepo.github.io/caret/data-splitting.html

Principal Component and Partial Least Squares Regression Essentials

http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/152-principal-component-and-partial-least-squares-regression-essentials/

(d) Predict the response for the test set. What is the test set estimate of R2?

Using our PLS model we predict the following using the Test series:

##  [1] 12.2491268  1.4567709  1.3494801 27.6418425 -0.8127830  0.3749145
##  [7]  3.2739626  9.5448963  4.2358243  2.4513993  4.8979749  2.5648366
## [13]  0.7184764  7.4521856 40.8673056  7.9951254  7.5957188  6.7443999
## [19] 21.5073823 -1.2638749  4.2964071 18.4089308 33.0935897 54.0852978
## [25] 39.4376380 37.5416611 29.6477088  1.1711051 16.4236425  5.1636985
## [31] 10.4736203 15.5762160

The prediction accuracy using the PLS model on the Test set is a bit poor as we can see on the plot of Observed Vs. Predicted and basedon the RMSE and Rsquare values found usign the PLS model.

##       Model     RMSE   Rsquare
## 1 PLS Model 12.95938 0.4301335

The calculated Rsquare is not an impressive 0.43.

(e) Try building other models discussed in this chapter. Do any have better

predictive performance?

Simple Linear Regression

## Linear Regression 
## 
## 133 samples
## 388 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE   
##   38.72553  0.07380942  25.277
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Robust Linear Regression

## Robust Linear Model 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: principal component signal extraction (388), centered
##  (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 118, 118, 120, 121, 119, ... 
## Resampling results across tuning parameters:
## 
##   intercept  psi           RMSE      Rsquared   MAE      
##   FALSE      psi.huber     16.52267  0.5029668  13.082375
##   FALSE      psi.hampel    16.51183  0.4886975  13.203347
##   FALSE      psi.bisquare  16.39115  0.4943165  12.957634
##    TRUE      psi.huber     11.16592  0.5074595   8.175870
##    TRUE      psi.hampel    11.18341  0.5193583   8.269696
##    TRUE      psi.bisquare  12.06326  0.4804153   8.097868
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.

Comparing the performance of other models

##                 Model     RMSE    Rsquare
## 1           PLS Model 12.95938 0.43013350
## 2        Linear Model 38.09400 0.02861343
## 3 Robust Linear Model 15.68134 0.30298189

(f) Would you recommend any of your models to replace the permeability

laboratory experiment?

From all the tested model, the Robust Linear Model performed better than either Partial Least Squares or the Simple Linear Model.

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro- cess. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

(a) Start R and use these commands to load the data:

library(AppliedPredictiveModeling) data(chemicalManufacturing)

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).

## [1] "How many missing values in the original data set?   106"

The missing values are 1.04% of the total values. The distribution of ghe missing values is shown in the two following maps.

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##      MP03 0.085227273
##      MP11 0.056818182
##      MP10 0.051136364
##      MP25 0.028409091
##      MP26 0.028409091
##      MP27 0.028409091
##      MP28 0.028409091
##      MP29 0.028409091
##      MP30 0.028409091
##      MP31 0.028409091
##      MP33 0.028409091
##      MP34 0.028409091
##      MP35 0.028409091
##      MP36 0.028409091
##      MP02 0.017045455
##      MP06 0.011363636
##      MP01 0.005681818
##      MP04 0.005681818
##      MP05 0.005681818
##      MP07 0.005681818
##      MP08 0.005681818
##      MP12 0.005681818
##      MP14 0.005681818
##      MP22 0.005681818
##      MP23 0.005681818
##      MP24 0.005681818
##      MP40 0.005681818
##      MP41 0.005681818
##     Yield 0.000000000
##      BM01 0.000000000
##      BM02 0.000000000
##      BM03 0.000000000
##      BM04 0.000000000
##      BM05 0.000000000
##      BM06 0.000000000
##      BM07 0.000000000
##      BM08 0.000000000
##      BM09 0.000000000
##      BM10 0.000000000
##      BM11 0.000000000
##      BM12 0.000000000
##      MP09 0.000000000
##      MP13 0.000000000
##      MP15 0.000000000
##      MP16 0.000000000
##      MP17 0.000000000
##      MP18 0.000000000
##      MP19 0.000000000
##      MP20 0.000000000
##      MP21 0.000000000
##      MP32 0.000000000
##      MP37 0.000000000
##      MP38 0.000000000
##      MP39 0.000000000
##      MP42 0.000000000
##      MP43 0.000000000
##      MP44 0.000000000
##      MP45 0.000000000

We will use the mice package, 5 iterations to impute missing values and the predictive mean matching (pmm) algorithm to do the imputation.

## 
##  iter imp variable
##   1   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   2   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   3   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   4   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41
##   5   1  MP01  MP02  MP03  MP04  MP05  MP06  MP07  MP08  MP10  MP11  MP12  MP14  MP22  MP23  MP24  MP25  MP26  MP27  MP28  MP29  MP30  MP31  MP33  MP34  MP35  MP36  MP40  MP41

Dealing with Missing Data using R

https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17

After imputation we have no more missing values as shown below.

## [1] "How many missing values?   0"

(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

Before fitting a model we will do to pre-processing on the data set as some models do not perform accurately when the distribution is not normal, the mean values is not zero and we have a big range of values across the different predictor variables.

First, let's see through histigrans the distribution of the first 9 predictor variables.

Color of histograms from data frame in R

https://stackoverflow.com/questions/27988918/color-of-histograms-from-data-frame-in-r

Using the caret package, original data set will be pre-processed using a BoxCox transformation, Center to a median of zero and Scaling or normalization.

## Created from 176 samples and 58 variables
## 
## Pre-processing:
##   - Box-Cox transformation (29)
##   - centered (58)
##   - ignored (0)
##   - scaled (58)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -1.7000 -0.9000 -0.2345  1.5000  2.0000

  • Next step is to separate the data set into a test and train set:

Now we can train our model using the following models:

Simple Linear Regression

##   intercept
## 1      TRUE
## Linear Regression 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   3.563485  0.167845  1.332132
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Partial Least Squares

##   ncomp
## 5     5
## Partial Least Squares 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 130, 130, 130, 129, 128, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8190777  0.4086137  0.6526755
##    2     1.1988762  0.4412542  0.7259039
##    3     0.8514252  0.5365336  0.6064737
##    4     0.7530500  0.5469212  0.5752749
##    5     0.7191935  0.5601815  0.5631237
##    6     0.7225099  0.5653804  0.5580460
##    7     0.7580084  0.5593145  0.5656616
##    8     0.9138903  0.5101392  0.6301220
##    9     1.0455884  0.4955110  0.6696908
##   10     1.1785289  0.4739455  0.7140162
##   11     1.2293015  0.4794248  0.7256701
##   12     1.2751032  0.4814158  0.7389385
##   13     1.4163048  0.4748105  0.7892309
##   14     1.6016463  0.4613623  0.8463557
##   15     1.7751551  0.4546681  0.8966684
##   16     1.8873806  0.4453905  0.9339333
##   17     2.0006605  0.4386084  0.9677091
##   18     2.0112539  0.4330453  0.9723878
##   19     2.0053449  0.4255035  0.9726223
##   20     2.0339424  0.4176118  0.9827192
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

Randmom Forest

##   mtry
## 2   29
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6771556  0.6043597  0.5504551
##   29    0.6315683  0.6023452  0.5011480
##   57    0.6585438  0.5578456  0.5152781
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.

Ridge Regression

## Ridge Regression 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 131, 128, 129, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000  4.4140066  0.3975461  1.5491705
##   0.007142857  1.8036585  0.4666767  0.8557589
##   0.014285714  1.4559569  0.4760000  0.7574392
##   0.021428571  1.2864484  0.4839763  0.7086101
##   0.028571429  1.1816507  0.4903150  0.6779173
##   0.035714286  1.1088559  0.4954458  0.6561919
##   0.042857143  1.0546890  0.4997052  0.6399541
##   0.050000000  1.0125098  0.5033285  0.6272535
##   0.057142857  0.9785878  0.5064787  0.6176367
##   0.064285714  0.9506405  0.5092705  0.6097167
##   0.071428571  0.9271807  0.5117853  0.6030313
##   0.078571429  0.9071927  0.5140823  0.5973359
##   0.085714286  0.8899556  0.5162054  0.5925293
##   0.092857143  0.8749427  0.5181880  0.5883222
##   0.100000000  0.8617586  0.5200563  0.5847280
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

##    lambda
## 15    0.1
## Ridge Regression 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 131, 128, 129, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   lambda       RMSE       Rsquared   MAE      
##   0.000000000  4.4140066  0.3975461  1.5491705
##   0.007142857  1.8036585  0.4666767  0.8557589
##   0.014285714  1.4559569  0.4760000  0.7574392
##   0.021428571  1.2864484  0.4839763  0.7086101
##   0.028571429  1.1816507  0.4903150  0.6779173
##   0.035714286  1.1088559  0.4954458  0.6561919
##   0.042857143  1.0546890  0.4997052  0.6399541
##   0.050000000  1.0125098  0.5033285  0.6272535
##   0.057142857  0.9785878  0.5064787  0.6176367
##   0.064285714  0.9506405  0.5092705  0.6097167
##   0.071428571  0.9271807  0.5117853  0.6030313
##   0.078571429  0.9071927  0.5140823  0.5973359
##   0.085714286  0.8899556  0.5162054  0.5925293
##   0.092857143  0.8749427  0.5181880  0.5883222
##   0.100000000  0.8617586  0.5200563  0.5847280
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

##               Model      RMSE   Rsquare
## 1 Linear Regression 1.2886330 0.1204939
## 2               PLS 0.7088808 0.5443235
## 3     Random Forest 0.6289398 0.7263291
## 4  Ridge-regression 0.6916511 0.5679324

We see how from all regression models, the Ridge-regression model has the highest Rsquare metric and lowest RMSE value. Surprisinlgy, Random Forest clearly outperforms all the regression models. Below we can see how the Predicted Vs. Observed Yield of the Test set compare for all the fitted models:

A plot of the residuals show all regression models having a random cloud pattern. However, there is some hint of structure on the residuals of the Random Forest model that is also the best performing model.

(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

From the three plots below, we notice how the importance of the predictors depended on the model used. However, the predictor Manufactoring Processing 32 (MP32) was the top predictor from all three models. Also most the models chose Manufacturing Processing variables over Biological Materials as the important predictor variable able to explain the response variable.

(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

We see from the plot below that the Proces predictors have a higher correlation with the response variable Yield over the Material predictors. Given limited time and resources to increase yield, it is reasonable to say that focusing on the Process could more reliable increase yield than focusing on the materials.

How to Create a Correlation Matrix with Too Many Variables in R

https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57