6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling)
data(permeability)
## [1] 165 1108
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
(b) The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
Most of the predictors show a low frequency in that contain mostly zeroes. The following missmap is coded to show the positions of those observations containign zero values. As we can see below, most of the cells and predictors contain zeroes.
This is the patter of zero values in the data set after the columns with near zero variance were removed.
Using the nearZeroVar function we reduced the number of predictor variables from 1107 to 388 or 35.05% of the initial set of predictors.
(c) Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
First let's joing the target variable permeability to the set of predictor variables.
After removing predictors using the nearZeroVar function we are left with a zeries of binary observations (1 or 0) and with no missing values. Pre-processing would not accomplish much so we continue to the patition of the filtered data set into training and testing sets.
Then we can split the joint data (165 total observations) set into a Train and Test set with the 0.8 ratio.
Then we will tune the model using a Partial Least Squares (PLS) model:
## ncomp
## 5 5
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 120, 119, 120, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 12.63198 0.3409357 9.560486
## 2 11.30146 0.4996715 7.946391
## 3 11.01391 0.5329650 8.313197
## 4 10.84232 0.5609288 8.199199
## 5 10.55942 0.5865098 7.908282
## 6 10.62644 0.5660936 8.013765
## 7 10.66660 0.5632970 8.151901
## 8 10.77245 0.5542260 8.200827
## 9 10.69334 0.5731825 7.902640
## 10 10.79171 0.5637022 8.010114
## 11 11.15479 0.5347263 8.243973
## 12 11.27761 0.5238127 8.319206
## 13 11.39654 0.5122514 8.535408
## 14 11.41730 0.5116199 8.510090
## 15 11.56485 0.4986354 8.563046
## 16 11.75286 0.4844587 8.697778
## 17 12.10311 0.4664784 8.914445
## 18 12.22208 0.4538516 9.006182
## 19 12.17944 0.4539222 8.958104
## 20 12.17942 0.4581364 8.952671
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
We see how the optimum number of latent variables is 5. This optimum number will output an accuracy of calculated RMSE = 10.56 , Rsquared = 0.59 and MAE = 7.91.
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 21.76 34.47 40.26 45.62 51.31
## .outcome 35.59 54.38 61.01 66.41 72.38
4 Data Splitting
https://topepo.github.io/caret/data-splitting.html
Principal Component and Partial Least Squares Regression Essentials
(d) Predict the response for the test set. What is the test set estimate of R2?
Using our PLS model we predict the following using the Test series:
## [1] 12.2491268 1.4567709 1.3494801 27.6418425 -0.8127830 0.3749145
## [7] 3.2739626 9.5448963 4.2358243 2.4513993 4.8979749 2.5648366
## [13] 0.7184764 7.4521856 40.8673056 7.9951254 7.5957188 6.7443999
## [19] 21.5073823 -1.2638749 4.2964071 18.4089308 33.0935897 54.0852978
## [25] 39.4376380 37.5416611 29.6477088 1.1711051 16.4236425 5.1636985
## [31] 10.4736203 15.5762160
The prediction accuracy using the PLS model on the Test set is a bit poor as we can see on the plot of Observed Vs. Predicted and basedon the RMSE and Rsquare values found usign the PLS model.
## Model RMSE Rsquare
## 1 PLS Model 12.95938 0.4301335
The calculated Rsquare is not an impressive 0.43.
(e) Try building other models discussed in this chapter. Do any have better
predictive performance?
Simple Linear Regression
## Linear Regression
##
## 133 samples
## 388 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 38.72553 0.07380942 25.277
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Robust Linear Regression
## Robust Linear Model
##
## 133 samples
## 388 predictors
##
## Pre-processing: principal component signal extraction (388), centered
## (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 118, 118, 120, 121, 119, ...
## Resampling results across tuning parameters:
##
## intercept psi RMSE Rsquared MAE
## FALSE psi.huber 16.52267 0.5029668 13.082375
## FALSE psi.hampel 16.51183 0.4886975 13.203347
## FALSE psi.bisquare 16.39115 0.4943165 12.957634
## TRUE psi.huber 11.16592 0.5074595 8.175870
## TRUE psi.hampel 11.18341 0.5193583 8.269696
## TRUE psi.bisquare 12.06326 0.4804153 8.097868
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were intercept = TRUE and psi = psi.huber.
Comparing the performance of other models
## Model RMSE Rsquare
## 1 PLS Model 12.95938 0.43013350
## 2 Linear Model 38.09400 0.02861343
## 3 Robust Linear Model 15.68134 0.30298189
(f) Would you recommend any of your models to replace the permeability
laboratory experiment?
From all the tested model, the Robust Linear Model performed better than either Partial Least Squares or the Simple Linear Model.
6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect.1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors),measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing pro- cess. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
(a) Start R and use these commands to load the data:
library(AppliedPredictiveModeling) data(chemicalManufacturing)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
(b) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
## [1] "How many missing values in the original data set? 106"
The missing values are 1.04% of the total values. The distribution of ghe missing values is shown in the two following maps.
##
## Variables sorted by number of missings:
## Variable Count
## MP03 0.085227273
## MP11 0.056818182
## MP10 0.051136364
## MP25 0.028409091
## MP26 0.028409091
## MP27 0.028409091
## MP28 0.028409091
## MP29 0.028409091
## MP30 0.028409091
## MP31 0.028409091
## MP33 0.028409091
## MP34 0.028409091
## MP35 0.028409091
## MP36 0.028409091
## MP02 0.017045455
## MP06 0.011363636
## MP01 0.005681818
## MP04 0.005681818
## MP05 0.005681818
## MP07 0.005681818
## MP08 0.005681818
## MP12 0.005681818
## MP14 0.005681818
## MP22 0.005681818
## MP23 0.005681818
## MP24 0.005681818
## MP40 0.005681818
## MP41 0.005681818
## Yield 0.000000000
## BM01 0.000000000
## BM02 0.000000000
## BM03 0.000000000
## BM04 0.000000000
## BM05 0.000000000
## BM06 0.000000000
## BM07 0.000000000
## BM08 0.000000000
## BM09 0.000000000
## BM10 0.000000000
## BM11 0.000000000
## BM12 0.000000000
## MP09 0.000000000
## MP13 0.000000000
## MP15 0.000000000
## MP16 0.000000000
## MP17 0.000000000
## MP18 0.000000000
## MP19 0.000000000
## MP20 0.000000000
## MP21 0.000000000
## MP32 0.000000000
## MP37 0.000000000
## MP38 0.000000000
## MP39 0.000000000
## MP42 0.000000000
## MP43 0.000000000
## MP44 0.000000000
## MP45 0.000000000
We will use the mice package, 5 iterations to impute missing values and the predictive mean matching (pmm) algorithm to do the imputation.
##
## iter imp variable
## 1 1 MP01 MP02 MP03 MP04 MP05 MP06 MP07 MP08 MP10 MP11 MP12 MP14 MP22 MP23 MP24 MP25 MP26 MP27 MP28 MP29 MP30 MP31 MP33 MP34 MP35 MP36 MP40 MP41
## 2 1 MP01 MP02 MP03 MP04 MP05 MP06 MP07 MP08 MP10 MP11 MP12 MP14 MP22 MP23 MP24 MP25 MP26 MP27 MP28 MP29 MP30 MP31 MP33 MP34 MP35 MP36 MP40 MP41
## 3 1 MP01 MP02 MP03 MP04 MP05 MP06 MP07 MP08 MP10 MP11 MP12 MP14 MP22 MP23 MP24 MP25 MP26 MP27 MP28 MP29 MP30 MP31 MP33 MP34 MP35 MP36 MP40 MP41
## 4 1 MP01 MP02 MP03 MP04 MP05 MP06 MP07 MP08 MP10 MP11 MP12 MP14 MP22 MP23 MP24 MP25 MP26 MP27 MP28 MP29 MP30 MP31 MP33 MP34 MP35 MP36 MP40 MP41
## 5 1 MP01 MP02 MP03 MP04 MP05 MP06 MP07 MP08 MP10 MP11 MP12 MP14 MP22 MP23 MP24 MP25 MP26 MP27 MP28 MP29 MP30 MP31 MP33 MP34 MP35 MP36 MP40 MP41
Dealing with Missing Data using R
https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
After imputation we have no more missing values as shown below.
## [1] "How many missing values? 0"
(c) Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Before fitting a model we will do to pre-processing on the data set as some models do not perform accurately when the distribution is not normal, the mean values is not zero and we have a big range of values across the different predictor variables.
First, let's see through histigrans the distribution of the first 9 predictor variables.
Color of histograms from data frame in R
https://stackoverflow.com/questions/27988918/color-of-histograms-from-data-frame-in-r
Using the caret package, original data set will be pre-processed using a BoxCox transformation, Center to a median of zero and Scaling or normalization.
## Created from 176 samples and 58 variables
##
## Pre-processing:
## - Box-Cox transformation (29)
## - centered (58)
## - ignored (0)
## - scaled (58)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 -1.7000 -0.9000 -0.2345 1.5000 2.0000
- Next step is to separate the data set into a test and train set:
Now we can train our model using the following models:
Simple Linear Regression
## intercept
## 1 TRUE
## Linear Regression
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.563485 0.167845 1.332132
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Partial Least Squares
## ncomp
## 5 5
## Partial Least Squares
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 130, 130, 129, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8190777 0.4086137 0.6526755
## 2 1.1988762 0.4412542 0.7259039
## 3 0.8514252 0.5365336 0.6064737
## 4 0.7530500 0.5469212 0.5752749
## 5 0.7191935 0.5601815 0.5631237
## 6 0.7225099 0.5653804 0.5580460
## 7 0.7580084 0.5593145 0.5656616
## 8 0.9138903 0.5101392 0.6301220
## 9 1.0455884 0.4955110 0.6696908
## 10 1.1785289 0.4739455 0.7140162
## 11 1.2293015 0.4794248 0.7256701
## 12 1.2751032 0.4814158 0.7389385
## 13 1.4163048 0.4748105 0.7892309
## 14 1.6016463 0.4613623 0.8463557
## 15 1.7751551 0.4546681 0.8966684
## 16 1.8873806 0.4453905 0.9339333
## 17 2.0006605 0.4386084 0.9677091
## 18 2.0112539 0.4330453 0.9723878
## 19 2.0053449 0.4255035 0.9726223
## 20 2.0339424 0.4176118 0.9827192
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
Randmom Forest
## mtry
## 2 29
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.6771556 0.6043597 0.5504551
## 29 0.6315683 0.6023452 0.5011480
## 57 0.6585438 0.5578456 0.5152781
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 29.
Ridge Regression
## Ridge Regression
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 131, 128, 129, 129, 130, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.4140066 0.3975461 1.5491705
## 0.007142857 1.8036585 0.4666767 0.8557589
## 0.014285714 1.4559569 0.4760000 0.7574392
## 0.021428571 1.2864484 0.4839763 0.7086101
## 0.028571429 1.1816507 0.4903150 0.6779173
## 0.035714286 1.1088559 0.4954458 0.6561919
## 0.042857143 1.0546890 0.4997052 0.6399541
## 0.050000000 1.0125098 0.5033285 0.6272535
## 0.057142857 0.9785878 0.5064787 0.6176367
## 0.064285714 0.9506405 0.5092705 0.6097167
## 0.071428571 0.9271807 0.5117853 0.6030313
## 0.078571429 0.9071927 0.5140823 0.5973359
## 0.085714286 0.8899556 0.5162054 0.5925293
## 0.092857143 0.8749427 0.5181880 0.5883222
## 0.100000000 0.8617586 0.5200563 0.5847280
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
## lambda
## 15 0.1
## Ridge Regression
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 131, 128, 129, 129, 130, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 4.4140066 0.3975461 1.5491705
## 0.007142857 1.8036585 0.4666767 0.8557589
## 0.014285714 1.4559569 0.4760000 0.7574392
## 0.021428571 1.2864484 0.4839763 0.7086101
## 0.028571429 1.1816507 0.4903150 0.6779173
## 0.035714286 1.1088559 0.4954458 0.6561919
## 0.042857143 1.0546890 0.4997052 0.6399541
## 0.050000000 1.0125098 0.5033285 0.6272535
## 0.057142857 0.9785878 0.5064787 0.6176367
## 0.064285714 0.9506405 0.5092705 0.6097167
## 0.071428571 0.9271807 0.5117853 0.6030313
## 0.078571429 0.9071927 0.5140823 0.5973359
## 0.085714286 0.8899556 0.5162054 0.5925293
## 0.092857143 0.8749427 0.5181880 0.5883222
## 0.100000000 0.8617586 0.5200563 0.5847280
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
(d) Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
## Model RMSE Rsquare
## 1 Linear Regression 1.2886330 0.1204939
## 2 PLS 0.7088808 0.5443235
## 3 Random Forest 0.6289398 0.7263291
## 4 Ridge-regression 0.6916511 0.5679324
We see how from all regression models, the Ridge-regression model has the highest Rsquare metric and lowest RMSE value. Surprisinlgy, Random Forest clearly outperforms all the regression models. Below we can see how the Predicted Vs. Observed Yield of the Test set compare for all the fitted models:
A plot of the residuals show all regression models having a random cloud pattern. However, there is some hint of structure on the residuals of the Random Forest model that is also the best performing model.
(e) Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
From the three plots below, we notice how the importance of the predictors depended on the model used. However, the predictor Manufactoring Processing 32 (MP32) was the top predictor from all three models. Also most the models chose Manufacturing Processing variables over Biological Materials as the important predictor variable able to explain the response variable.
(f) Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
We see from the plot below that the Proces predictors have a higher correlation with the response variable Yield over the Material predictors. Given limited time and resources to increase yield, it is reasonable to say that focusing on the Process could more reliable increase yield than focusing on the materials.
How to Create a Correlation Matrix with Too Many Variables in R