Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)
library(elasticnet)
library(RANN)
library(VIM)
library(e1071)
library(corrplot)
library(tidyverse)
library(knitr)
data(permeability)
```
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
The nearZeroVar function flagged 719 of the 1,107 predictors for removal, leaving 388 predictors for modeling.
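The filtering step can be sketched as follows. This is a minimal sketch that assumes caret's default `nearZeroVar` thresholds (`freqCut = 95/5`, `uniqueCut = 10`); the names `nzv_cols` and `fingerprints_filtered` are illustrative and do not come from the original analysis.

```r
library(AppliedPredictiveModeling)
library(caret)

data(permeability)  # loads the matrices `fingerprints` and `permeability`

# Column indices of predictors flagged as near-zero variance (caret defaults)
nzv_cols <- nearZeroVar(fingerprints)

# Guard the empty case: indexing with -integer(0) would drop every column
fingerprints_filtered <- if (length(nzv_cols) > 0) {
  fingerprints[, -nzv_cols, drop = FALSE]
} else {
  fingerprints
}

cat("removed:", length(nzv_cols), "kept:", ncol(fingerprints_filtered), "\n")
```

The counts printed here can be compared with the 719 removed and 388 retained predictors reported in the text.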
The tuning results show that the optimal number of latent variables is 8, with a training \(R^2\) of 0.5378.
The model performed slightly worse on the test set, with an \(R^2\) of 0.5325 compared with the optimal training \(R^2\) of 0.5378.
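One way such a tune-then-test workflow might look is sketched below. The text does not state the split proportion, the seed, the resampling scheme, or the tuning range, so the 80/20 split, `set.seed(100)`, 10-fold cross-validation, and `tuneLength = 20` are all assumptions. The numbers this produces are therefore not expected to match the 0.5378 and 0.5325 reported above.

```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)

data(permeability)

# Drop near-zero-variance fingerprints first
nzv_cols <- nearZeroVar(fingerprints)
x <- if (length(nzv_cols) > 0) fingerprints[, -nzv_cols, drop = FALSE] else fingerprints
y <- permeability[, 1]

# Assumed: stratified 80/20 split with an arbitrary seed
set.seed(100)
in_train <- createDataPartition(y, p = 0.8, list = FALSE)[, 1]
train_x <- x[in_train, ]
test_x  <- x[-in_train, ]
train_y <- y[in_train]
test_y  <- y[-in_train]

# Assumed: 10-fold CV over 1..20 PLS components (caret picks the lowest RMSE)
pls_fit <- train(
  x = train_x, y = train_y,
  method = "pls",
  tuneLength = 20,
  trControl = trainControl(method = "cv", number = 10)
)

# Resampled performance at the selected number of components
best_row <- pls_fit$results[pls_fit$results$ncomp == pls_fit$bestTune$ncomp, ]
print(best_row[, c("ncomp", "RMSE", "Rsquared")])

# Held-out performance: RMSE, Rsquared, MAE
test_perf <- postResample(pred = predict(pls_fit, test_x), obs = test_y)
print(test_perf)
```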
Before analyzing the \(R^2\) for the linear model, we should check that its residuals are well behaved.
The Observed vs. Predicted plot shows a linear relationship with no outliers, suggesting the model is well calibrated. Additionally, the Residual vs. Predicted plot shows a random cloud of points with no discernible pattern. The linear model behaves as expected, so we can compare its \(R^2\) with the other models.
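The two diagnostic plots described here can be drawn with base R. The sketch below uses simulated data purely for illustration; the variables `x`, `observed`, and `fit` are hypothetical stand-ins, not the permeability model from the analysis.

```r
# Hypothetical data: a noisy linear relationship
set.seed(1)
n <- 100
x <- rnorm(n)
observed <- 2 + 3 * x + rnorm(n)

fit <- lm(observed ~ x)
predicted <- fitted(fit)
residual <- observed - predicted

op <- par(mfrow = c(1, 2))

# Points close to the 45-degree line indicate good calibration
plot(predicted, observed, main = "Observed vs. Predicted")
abline(0, 1, lty = 2)

# A patternless cloud around zero suggests no systematic misfit
plot(predicted, residual, main = "Residual vs. Predicted")
abline(h = 0, lty = 2)

par(op)
```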
After removing the first 4 models from the list, whose RMSE was too high to be considered, it becomes clearer why the best Ridge model uses a \(\lambda\) of 0.0929.
When comparing the models, we will focus on the training \(R^2\) and the test-set \(R^2\).
| model | RSquared |
|---|---|
| PLS | 0.5378 |
| Ridge | 0.4533 |
| Linear | 0.2768 |
Comparing the training \(R^2\) values, the PLS model performed best with an \(R^2\) of 0.5378, while the linear model performed worst with an \(R^2\) of 0.2768.
| model | RSquared |
|---|---|
| PLS | 0.5325 |
| Ridge | 0.5185 |
| Linear | 0.1965 |
Comparing the test-set \(R^2\) values, the PLS model comes closest to its training \(R^2\), with a difference of 0.0053, meaning the training estimate was fairly accurate rather than optimistic. The linear regression model, on the other hand, had a test-set \(R^2\) about 0.0803 lower than its training \(R^2\), meaning its training estimate was overly optimistic about predicting permeability.
In both cases, the ridge model fell between the PLS and linear models. Therefore, the best model for predicting permeability is the PLS model, and I would not replace it with any of the other models tested.
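The gaps quoted in this comparison follow directly from the two tables. The short check below recomputes them from the reported values; the column names `train` and `test` are labels chosen here, with the second table treated as the test-set results (0.5325 is the value the text gives for the PLS test set).

```r
# R-squared values copied from the two tables above
r2 <- data.frame(
  model = c("PLS", "Ridge", "Linear"),
  train = c(0.5378, 0.4533, 0.2768),
  test  = c(0.5325, 0.5185, 0.1965)
)

# Positive gap: the training estimate was optimistic
# Negative gap: the model scored higher on the second table
r2$gap <- round(r2$train - r2$test, 4)

print(r2)
```

The PLS and linear gaps come out to 0.0053 and 0.0803, matching the text. The ridge gap is negative (-0.0652), meaning its \(R^2\) in the second table is higher than its training value.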
Given that our best model, the PLS model, was only able to achieve an \(R^2\) of 0.5378, I wouldn't recommend replacing the laboratory experiment with any of these models.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
Some values are missing, apparently at random, so we can use the preProcess function to impute them with the KNN method discussed in Section 3.8.
As we can see, the data have been fully preprocessed and all missing values imputed.
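A minimal sketch of the imputation step is shown below. It assumes the data are loaded as the `ChemicalManufacturingProcess` data frame from the AppliedPredictiveModeling package (with `Yield` as the first column) and that the response is left out of the imputation; neither detail is confirmed by the text. Note that caret's `"knnImpute"` method also centers and scales the predictors as part of the imputation, and that it relies on the RANN package being installed.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

# Assumed layout: first column is the response (Yield), the rest are predictors
predictors <- ChemicalManufacturingProcess[, -1]

n_missing_before <- sum(is.na(predictors))

# k-nearest-neighbour imputation (caret default k = 5);
# this method also centers and scales the predictors
impute_model <- preProcess(predictors, method = "knnImpute")
predictors_imputed <- predict(impute_model, predictors)

n_missing_after <- sum(is.na(predictors_imputed))

cat("missing before:", n_missing_before, "after:", n_missing_after, "\n")
```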
Given that there are 58 columns, we will want to remove some predictors. One way to start is by eliminating any predictors that have zero variance. Any column flagged this way is dropped as a predictor.
We can then reduce the predictors further by removing those that are highly correlated with each other. This brings the total column count down from 58 to 36, a reduction of 22.
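The two filtering steps might be sketched as follows. The text does not give the variance filter settings or the correlation cutoff, so `nearZeroVar` with its defaults and a cutoff of 0.9 are assumptions, and the resulting column count may differ from the 36 reported above. Correlations are computed on pairwise-complete observations so that the sketch does not depend on a prior imputation step.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[, -1]  # drop the response, Yield

# Step 1: drop (near-)zero-variance predictors, guarding the empty case
nzv_cols <- nearZeroVar(predictors)
kept <- if (length(nzv_cols) > 0) {
  predictors[, -nzv_cols, drop = FALSE]
} else {
  predictors
}

# Step 2: drop one predictor from each highly correlated pair
# Assumed cutoff: absolute pairwise correlation above 0.9
cor_mat <- cor(kept, use = "pairwise.complete.obs")
high_cor <- findCorrelation(cor_mat, cutoff = 0.9)
reduced <- if (length(high_cor) > 0) {
  kept[, -high_cor, drop = FALSE]
} else {
  kept
}

cat("start:", ncol(predictors),
    "after variance filter:", ncol(kept),
    "after correlation filter:", ncol(reduced), "\n")
```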
We will then center and scale the data using the preProcess function.
Using a PLS model again, we see that 9 latent variables are optimal, producing a training \(R^2\) of 0.5524.
The model performed much better on the test data than on the training data, with a test-set \(R^2\) of 0.7170, about 0.1645 higher than the training \(R^2\).
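A fit of this kind, together with a variable-importance ranking from caret's `varImp`, could be sketched as below. The 80/20 split, the seed, 10-fold cross-validation, and `tuneLength = 15` are assumptions, and the correlation filter is omitted for brevity, so the output is not expected to reproduce the figures in the text. The preprocessing is learned on the training rows only and then applied to the test rows.

```r
library(AppliedPredictiveModeling)
library(caret)
library(pls)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[, -1]
yield <- ChemicalManufacturingProcess$Yield

# Assumed: 80/20 split with an arbitrary seed
set.seed(100)
in_train <- createDataPartition(yield, p = 0.8, list = FALSE)[, 1]

# Learn the variance filter and imputation (with centering and scaling)
# from the training rows only
pp <- preProcess(predictors[in_train, ], method = c("nzv", "knnImpute"))
train_x <- predict(pp, predictors[in_train, ])
test_x  <- predict(pp, predictors[-in_train, ])

# Assumed: 10-fold CV over 1..15 PLS components
pls_fit <- train(
  x = train_x, y = yield[in_train],
  method = "pls",
  tuneLength = 15,
  trControl = trainControl(method = "cv", number = 10)
)

# Held-out performance: RMSE, Rsquared, MAE
test_perf <- postResample(pred = predict(pls_fit, test_x), obs = yield[-in_train])
print(test_perf)

# Importance scores, scaled so the top predictor scores 100
imp <- varImp(pls_fit)$importance
top10 <- head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10)
print(top10)
```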
| Predictors | Overall |
|---|---|
| BiologicalMaterial11 | 100.00 |
| BiologicalMaterial09 | 84.56 |
| ManufacturingProcess36 | 60.95 |
| ManufacturingProcess33 | 57.75 |
| ManufacturingProcess28 | 45.75 |
| ManufacturingProcess02 | 45.26 |
| BiologicalMaterial10 | 44.35 |
| ManufacturingProcess04 | 37.93 |
| ManufacturingProcess12 | 30.95 |
| ManufacturingProcess19 | 30.25 |
The top 10 list is dominated by process predictors, which make up 70% of the top predictors.
| type | predictors | proportion |
|---|---|---|
| Biological | 3 | 0.3 |
| Process | 7 | 0.7 |
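The tally above can be reproduced from the predictor names alone. The sketch below copies the ten names from the importance table and classifies each by its prefix; the variable names are illustrative.

```r
# Top 10 predictors, copied from the importance table
top10 <- c(
  "BiologicalMaterial11", "BiologicalMaterial09", "ManufacturingProcess36",
  "ManufacturingProcess33", "ManufacturingProcess28", "ManufacturingProcess02",
  "BiologicalMaterial10", "ManufacturingProcess04", "ManufacturingProcess12",
  "ManufacturingProcess19"
)

# Classify each predictor by its name prefix
type <- ifelse(startsWith(top10, "Biological"), "Biological", "Process")

counts <- table(type)
tally <- data.frame(
  type = names(counts),
  predictors = as.integer(counts),
  proportion = as.numeric(counts) / length(top10)
)

print(tally)
```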
| row | predictors | coeff |
|---|---|---|
| 2 | BiologicalMaterial09 | 0.4397 |
| 4 | BiologicalMaterial11 | 0.3132 |
| 27 | ManufacturingProcess33 | 0.2046 |
| 25 | ManufacturingProcess28 | 0.0934 |
| 8 | ManufacturingProcess04 | 0.0888 |
| 15 | ManufacturingProcess12 | 0.0271 |
| 3 | BiologicalMaterial10 | -0.1333 |
| 30 | ManufacturingProcess36 | -0.2916 |
| 18 | ManufacturingProcess19 | -0.3341 |
| 6 | ManufacturingProcess02 | -0.5824 |
Based on the coefficients of the top predictors, BiologicalMaterial09 and BiologicalMaterial11 matter most for increasing yield: they are the two most important predictors and both have positive coefficients. Because biological predictors cannot be changed, they are best used to screen the raw material before processing; within the process itself, the adjustable predictors with positive coefficients are ManufacturingProcess33, ManufacturingProcess28, ManufacturingProcess04, and ManufacturingProcess12.
Conversely, ManufacturingProcess02, ManufacturingProcess19, and ManufacturingProcess36 are among the top 10 predictors and have negative coefficients, so higher values of these predictors are associated with lower yield.
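The grouping of predictors by coefficient sign can be made explicit with a few lines of base R. The coefficients are copied from the table above. Treating only the `ManufacturingProcess` predictors as adjustable follows the problem statement, which says biological predictors cannot be changed; the variable names are illustrative.

```r
# Coefficients of the top 10 predictors, copied from the table above
coefs <- c(
  BiologicalMaterial09   =  0.4397,
  BiologicalMaterial11   =  0.3132,
  ManufacturingProcess33 =  0.2046,
  ManufacturingProcess28 =  0.0934,
  ManufacturingProcess04 =  0.0888,
  ManufacturingProcess12 =  0.0271,
  BiologicalMaterial10   = -0.1333,
  ManufacturingProcess36 = -0.2916,
  ManufacturingProcess19 = -0.3341,
  ManufacturingProcess02 = -0.5824
)

# Only process predictors can be changed during manufacturing
adjustable <- startsWith(names(coefs), "ManufacturingProcess")

# Positive coefficient: higher values are associated with higher yield
positive_process <- names(coefs)[adjustable & coefs > 0]

# Negative coefficient: higher values are associated with lower yield
negative_process <- names(coefs)[adjustable & coefs < 0]

print(positive_process)
print(negative_process)
```

These coefficients describe associations in the fitted model, so any process change suggested by them would still need to be confirmed experimentally.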