library(caret)
library(pls)
library(tidyverse)
library(AppliedPredictiveModeling)
library(corrplot)
Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains permeability response.
nearZeroVar function from the caret package. How many predictors are left for modeling?The nearZeroVar function helps remove any sparse and unbalanced variables using the following criteria:
fingerprints <- as.data.frame(fingerprints)
print(paste('Total predictors:', ncol(fingerprints)))
## [1] "Total predictors: 1107"
print(paste('Non-Sparse predictors:', ncol(fingerprints[, -nearZeroVar(fingerprints)])))
## [1] "Non-Sparse predictors: 388"
#preprocessing
fingerprints <- fingerprints[, -nearZeroVar(fingerprints)]
# test train split
set.seed(0)
smp_size <- floor(0.8 * nrow(fingerprints))
train_ind <- sample(seq_len(nrow(fingerprints)), size = smp_size)
Xtrain <- fingerprints[train_ind, ]
Xtest <- fingerprints[-train_ind, ]
ytrain <- permeability[train_ind, ]
ytest <- permeability[-train_ind, ]
#model
set.seed(0)
plsTune <- train(Xtrain,
ytrain,
method = "pls",
tuneLength = 30,
preProc = c("center", "scale"),
trControl = trainControl(method = 'cv', 10))
plot(plsTune)
lv <- which.min(plsTune$results$RMSE)
paste("According to the scree plot, the ideal amount of latent variables is", lv)
## [1] "According to the scree plot, the ideal amount of latent variables is 7"
print(paste('The R-squared corresponding to', lv, 'latent variables is', plsTune$results[lv,3]))
## [1] "The R-squared corresponding to 7 latent variables is 0.459970644704081"
pls_pred <- predict(plsTune, Xtest)
plot(pls_pred, ytest, main=paste("Predicted vs Observed Permeability, PLS Model with", lv, "Components"), xlab="Predicted", ylab="Actual")
print(paste('R-squared with PLS model for the test set is', cor(ytest, pls_pred) ^ 2))
## [1] "R-squared with PLS model for the test set is 0.53666911598211"
Other models discussed in this chapter include penalized regression models. Ridge regression and elastic net models have been fit to the dataset, however the R-squared values produced by these models are less than the PLS model R-squared.
set.seed(0)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
ridgeTune <- train(Xtrain,
ytrain,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = trainControl(method = 'cv', 10),
preProc = c("center", "scale"))
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridge_pred <- predict(ridgeTune, Xtest)
print(paste('R-squared with Ridge regression for the test set is', cor(ytest, ridge_pred) ^ 2))
## [1] "R-squared with Ridge regression for the test set is 0.419179090783013"
set.seed(0)
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enetTune <- train(Xtrain,
ytrain,
method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = 'cv', 10),
preProc = c("center", "scale"))
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
enet_pred <- predict(enetTune, Xtest)
print(paste('R-squared with Elastic Net regression for the test set is', cor(ytest, enet_pred) ^ 2))
## [1] "R-squared with Elastic Net regression for the test set is 0.486376351174641"
I would not recommend any of these models to replace the permeability laboratory experiment, because the R-squared values indicate that the models explain little of the varaiability of the permeability data.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
Before imputing the data, feature BiologicalMaterial07 will be removed because it has 0 variance.
cmp <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
print(paste('Total predictors:', ncol(ChemicalManufacturingProcess)))
## [1] "Total predictors: 58"
print(paste('Non-Sparse predictors:', ncol(ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)])))
## [1] "Non-Sparse predictors: 57"
kNN imputation will be used to fill the missing values in the dataset.
cmp <- preProcess(as.data.frame(cmp), method = "knnImpute", k = 10)$data
# test train split
set.seed(0)
smp_size <- floor(0.8 * nrow(cmp))
train_ind <- sample(seq_len(nrow(cmp)), size = smp_size)
Xtrain <- cmp[train_ind, -1]
Xtest <- cmp[-train_ind, -1]
ytrain <- cmp[train_ind, 1]
ytest <- cmp[-train_ind, 1]
Looking at the histograms of the features of the training dataset, there are features with skewed distributions. The skewness function confirms the skewness in the features of this dataset. The data will centered and scaled to address this.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(e1071)
multi.hist(Xtrain, main = '', bcol = 'blue')
head(sort(apply(Xtrain, 2, skewness)), 8)
## ManufacturingProcess26 ManufacturingProcess25 ManufacturingProcess27
## -10.635856 -10.595888 -10.555903
## ManufacturingProcess31 ManufacturingProcess29 ManufacturingProcess39
## -10.107105 -9.556297 -7.230829
## ManufacturingProcess30 ManufacturingProcess02
## -4.947736 -2.181194
tail(sort(apply(Xtrain, 2, skewness)), 2)
## ManufacturingProcess06 ManufacturingProcess43
## 3.764174 8.986510
#model
set.seed(0)
plsTune <- train(Xtrain,
ytrain,
method = "pls",
tuneLength = 30,
preProc = c("center", "scale"),
trControl = trainControl(method = 'cv', 10))
plot(plsTune)
lv <- which.min(plsTune$results$RMSE)
paste("According to the scree plot, the optimal value of latent variables is", lv)
## [1] "According to the scree plot, the optimal value of latent variables is 2"
print(paste('Train set R-squared with PLS model with', lv, 'latent variables is', plsTune$results[lv,3]))
## [1] "Train set R-squared with PLS model with 2 latent variables is 0.541133093105143"
The test set r-squared is slightly lower than the performance metric on the training set.
pls_pred <- predict(plsTune, Xtest)
print(paste('Test set R-squared with PLS model is', cor(ytest, pls_pred) ^ 2))
## [1] "Test set R-squared with PLS model is 0.498513885089613"
The most important predictors are ManufacturingProcess09, ManufacturingProcess13, ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess36. The top 20 predictors have a 60-40 ratio of Manufacturing to Biological features. The top five predictors are all process predictors.
plot(varImp(plsTune), top = 10)
Relationships between the top 5 predictors and Yield:
There is also significant correlation between predictors. This information can be helpful in improving yield in future runs, top processes with negative correlations with yield can be reduced and features with positive correlations can be enhanced.
cmp %>%
select(c('ManufacturingProcess09','ManufacturingProcess13','ManufacturingProcess32','ManufacturingProcess17','ManufacturingProcess36',
'BiologicalMaterial02', 'BiologicalMaterial03', 'BiologicalMaterial06', 'ManufacturingProcess06', 'BiologicalMaterial04', 'Yield')) %>%
cor() %>%
corrplot(method = 'circle')