6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
library(AppliedPredictiveModeling)
library(caret)   # nearZeroVar, train, defaultSummary, varImp
library(caTools) # sample.split
library(dplyr)   # %>%, select, arrange, filter
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of
substructures of a molecule and are often sparse, meaning that relatively
few of the molecules contain each substructure. Filter out the
predictors that have low frequencies using the nearZeroVar
function from the caret package. How many predictors are
left for modeling?
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?
Predict the response for the test set. What is the test set estimate of R2?
Try building other models discussed in this chapter. Do any have better predictive performance?
Would you recommend any of your models to replace the permeability laboratory experiment?
According to the R Documentation:
“nearZeroVar diagnoses predictors that have one unique
value (i.e. are zero variance predictors) or predictors that have
both of the following characteristics: they have very few unique values
relative to the number of samples and the ratio of the frequency of the
most common value to the frequency of the second most common value is
large. checkConditionalX looks at the distribution of the
columns of x conditioned on the levels of y and identifies columns of x
that are sparse within groups of y.”
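To see these diagnostics directly, nearZeroVar can return its metrics instead of the column indices; a quick sketch (nzv_metrics is just an illustrative name):
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
# saveMetrics = TRUE returns freqRatio and percentUnique for every
# predictor rather than just the indices of the near-zero-variance ones.
head(nzv_metrics)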
high_freq_predictors <- fingerprints[,-nearZeroVar(fingerprints)]
dim(high_freq_predictors)
## [1] 165 388
After using the nearZeroVar function to filter out the low-frequency predictors, we went from 1,107 predictors to 388, so 388 predictors are left for modeling.
Next, we create the training and test sets below; the dimensions of each are also shown.
set.seed(1)
sample <- sample.split(permeability, SplitRatio = 0.8)
train_X <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(permeability, sample == TRUE) %>% as.vector()
test_y <- subset(permeability, sample == FALSE) %>% as.vector()
dim(train_X)
## [1] 132 388
dim(test_X)
## [1] 33 388
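As an aside, caret's createDataPartition offers a stratified alternative to sample.split; a sketch of what that would look like (shown for comparison only — the analysis above keeps the caTools split):
# Stratified 80/20 split on the response; in_train is an illustrative name.
set.seed(1)
in_train <- createDataPartition(permeability[, 1], p = 0.8, list = FALSE)
# train_X_alt <- high_freq_predictors[in_train, ]
# test_X_alt  <- high_freq_predictors[-in_train, ]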
Now we can fit a PLS model. Page 134 of the text shows an example where the variables are preprocessed by centering and scaling; we apply the same methodology to this dataset.
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
train_X,
train_y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(plsTune)
plsTune
## Partial Least Squares
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 118, 120, 118, 119, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.19418 0.3355888 10.161782
## 2 12.08697 0.4154592 8.381873
## 3 12.02124 0.4257758 8.933843
## 4 12.09577 0.4278739 9.019170
## 5 12.03456 0.4320812 9.117138
## 6 12.02757 0.4359904 8.924423
## 7 11.79439 0.4504540 8.934411
## 8 11.96022 0.4447283 9.249347
## 9 11.99048 0.4590222 9.203734
## 10 12.19598 0.4459319 9.207656
## 11 12.35679 0.4335111 9.265978
## 12 12.55873 0.4201770 9.346663
## 13 12.54423 0.4188488 9.318062
## 14 12.54874 0.4155148 9.356758
## 15 12.80671 0.4068115 9.660144
## 16 13.22448 0.3897108 9.971214
## 17 13.65284 0.3846397 10.294842
## 18 14.37353 0.3591093 10.885709
## 19 14.75681 0.3479585 11.216480
## 20 15.34589 0.3247690 11.519261
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
Based on the output above, the optimal number of latent variables is 7, and the corresponding resampled estimate of \(R^2\) is 0.450.
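We can also pull these numbers programmatically from the tuning results stored in plsTune rather than reading them off the table; a quick sketch:
# Resampled performance at the selected number of components.
plsTune$results %>%
  dplyr::filter(ncomp == plsTune$bestTune$ncomp) %>%
  dplyr::select(ncomp, RMSE, Rsquared)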
plsPred1 <- predict(plsTune, test_X)
plsPred1 %>% head()
## [1] -3.826290 -3.826290 41.249095 -3.026750 4.053989 9.324345
The output above shows us the first 6 predicted values using the test set predictors.
plsValues1 <- data.frame(obs = test_y, pred = plsPred1)
defaultSummary(plsValues1)
## RMSE Rsquared MAE
## 9.0697275 0.6245837 7.1262647
The output above shows us that our test set R-squared came out to 0.625.
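A plot of observed against predicted values gives a visual sense of that fit; a minimal sketch:
# Test-set observed vs. predicted; points near the dashed 45-degree
# line indicate accurate predictions.
plot(test_y, plsPred1, xlab = "Observed", ylab = "Predicted")
abline(0, 1, lty = 2)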
We can try fitting a ridge regression model to the data, using the same preprocessing methodology as in Question 6.2c. We set up a tuning grid of lambda values; train() evaluates each candidate by cross-validation and selects the value of lambda that minimizes the RMSE.
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeRegFit <- train(
train_X,
train_y,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
## Warning: model fit failed for Fold05: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.000000 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
ridgeRegFit
## Ridge Regression
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 118, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000000 14.54432 0.3489843 10.29389
## 0.007142857 127.85644 0.2500175 74.30955
## 0.014285714 448.90916 0.2090538 288.22806
## 0.021428571 15.89967 0.3024838 11.34149
## 0.028571429 15.41820 0.3180304 11.07465
## 0.035714286 15.04636 0.3383254 10.75635
## 0.042857143 14.73696 0.3513392 10.59260
## 0.050000000 14.54815 0.3643678 10.46277
## 0.057142857 14.47153 0.3679564 10.44045
## 0.064285714 14.20348 0.3827026 10.28230
## 0.071428571 14.07164 0.3917995 10.20378
## 0.078571429 13.96364 0.3984985 10.14318
## 0.085714286 13.86500 0.4053690 10.08946
## 0.092857143 13.79269 0.4105608 10.05292
## 0.100000000 13.72585 0.4158997 10.01949
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
We evaluate the best-performing ridge model on the test set below.
ridgeRegPred1 <- predict(ridgeRegFit, test_X)
ridgeRegPred1 %>% head()
## 4 6 7 18 20 21
## -1.236078 -1.236078 41.019991 -5.626989 1.897921 9.382436
ridgeRegValues1 <- data.frame(obs = test_y, pred = ridgeRegPred1)
defaultSummary(ridgeRegValues1)
## RMSE Rsquared MAE
## 10.2383550 0.5607206 8.0225647
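One caveat: the selected lambda (0.1) sits at the upper edge of the search grid, so it may be worth re-tuning over a wider range. A minimal sketch — the upper bound of 0.5 is an assumption, not a value from the original analysis:
# Re-tune ridge over a wider lambda range, since the previous optimum
# (0.1) was the largest value searched. The 0.5 upper bound is an
# assumption chosen only for illustration.
ridgeGridWide <- data.frame(.lambda = seq(0, 0.5, length = 15))
set.seed(100)
ridgeRegFitWide <- train(
train_X,
train_y,
method = "ridge",
tuneGrid = ridgeGridWide,
trControl = ctrl,
preProc = c("center", "scale")
)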
The next model we try is the elastic net. Again we set up a tuning grid, this time over both lambda and fraction, and select the combination that minimizes the RMSE.
enetGrid <- expand.grid(.lambda = c(0, 0.01, .1),
.fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(train_X,
train_y,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale")
)
## Warning: model fit failed for Fold05: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.00, fraction=1 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
enetTune
## Elasticnet
##
## 132 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 119, 118, 120, 119, 118, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 12.02047 0.5189474 9.097294
## 0.00 0.10 11.60474 0.5200844 8.059818
## 0.00 0.15 11.93991 0.5014864 8.226068
## 0.00 0.20 12.06173 0.4843267 8.421519
## 0.00 0.25 12.08863 0.4735817 8.545991
## 0.00 0.30 12.17088 0.4671724 8.751755
## 0.00 0.35 12.27840 0.4586238 8.980117
## 0.00 0.40 12.46660 0.4457948 9.209953
## 0.00 0.45 12.64958 0.4335609 9.360218
## 0.00 0.50 12.86976 0.4211769 9.488794
## 0.00 0.55 13.08509 0.4098843 9.637639
## 0.00 0.60 13.28154 0.4015285 9.760444
## 0.00 0.65 13.48040 0.3940941 9.869420
## 0.00 0.70 13.63104 0.3878474 9.913044
## 0.00 0.75 13.66381 0.3884211 9.904921
## 0.00 0.80 13.72624 0.3879019 9.945552
## 0.00 0.85 13.88461 0.3811717 10.022249
## 0.00 0.90 14.00510 0.3759211 10.070932
## 0.00 0.95 14.23060 0.3650705 10.165363
## 0.00 1.00 14.54432 0.3489843 10.293891
## 0.01 0.05 11.81562 0.4918210 8.429058
## 0.01 0.10 11.65594 0.4783427 8.302625
## 0.01 0.15 11.75026 0.4705944 8.651031
## 0.01 0.20 12.16488 0.4516813 9.083736
## 0.01 0.25 12.48859 0.4416718 9.267566
## 0.01 0.30 12.83083 0.4278192 9.399609
## 0.01 0.35 13.35513 0.4023965 9.689018
## 0.01 0.40 13.88906 0.3769515 9.993169
## 0.01 0.45 14.31740 0.3564402 10.245366
## 0.01 0.50 14.64406 0.3382519 10.431969
## 0.01 0.55 14.96190 0.3216162 10.596338
## 0.01 0.60 15.25618 0.3074271 10.793407
## 0.01 0.65 15.54814 0.2950449 11.004723
## 0.01 0.70 15.83378 0.2849387 11.234173
## 0.01 0.75 16.13159 0.2753088 11.476382
## 0.01 0.80 16.41540 0.2688916 11.683412
## 0.01 0.85 16.67132 0.2645065 11.888190
## 0.01 0.90 16.87614 0.2622441 12.055586
## 0.01 0.95 17.01145 0.2616895 12.156351
## 0.01 1.00 17.13338 0.2610388 12.243228
## 0.10 0.05 12.24076 0.4993314 9.329858
## 0.10 0.10 11.65844 0.5049855 8.301089
## 0.10 0.15 11.64699 0.4992936 8.254943
## 0.10 0.20 11.57167 0.4954713 8.281789
## 0.10 0.25 11.58661 0.4926245 8.391096
## 0.10 0.30 11.74659 0.4860902 8.584354
## 0.10 0.35 11.86992 0.4817331 8.677382
## 0.10 0.40 12.00345 0.4792977 8.746524
## 0.10 0.45 12.13955 0.4762629 8.806233
## 0.10 0.50 12.27736 0.4738214 8.878685
## 0.10 0.55 12.42115 0.4695951 8.982025
## 0.10 0.60 12.55246 0.4656409 9.077755
## 0.10 0.65 12.71429 0.4586517 9.182696
## 0.10 0.70 12.88703 0.4504408 9.313726
## 0.10 0.75 13.02956 0.4438379 9.433124
## 0.10 0.80 13.16818 0.4382233 9.553909
## 0.10 0.85 13.32382 0.4319150 9.692010
## 0.10 0.90 13.46948 0.4261461 9.816405
## 0.10 0.95 13.60155 0.4208544 9.924682
## 0.10 1.00 13.72585 0.4158997 10.019491
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.2 and lambda = 0.1.
We evaluate the best-performing elastic net model on the test set below.
enetTunePred1 <- predict(enetTune, test_X)
enetTunePred1 %>% head()
## 4 6 7 18 20 21
## -3.4935058 -3.4935058 37.9875690 -0.7016277 2.2593745 9.6945005
enetTuneValues1 <- data.frame(obs = test_y, pred = enetTunePred1)
defaultSummary(enetTuneValues1)
## RMSE Rsquared MAE
## 10.4621340 0.5034596 8.1426555
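For easier comparison, the test-set metrics of all three models can be collected into one table; a convenience sketch using the summaries computed above:
# Test-set RMSE, R-squared, and MAE for each model, one row per model.
rbind(
PLS   = defaultSummary(plsValues1),
Ridge = defaultSummary(ridgeRegValues1),
ENet  = defaultSummary(enetTuneValues1)
)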
The test set RMSE and R-squared values for the PLS model are 9.0697 and 0.6246, respectively. Neither the ridge regression nor the elastic net model beats these metrics, so the other two models do not offer better predictive performance.
The PLS model performs the best out of all of the models tested in Question 6.2e, so it is the one I would recommend. That said, with a test set R-squared of only about 0.62, I would use it to screen compounds ahead of the permeability laboratory experiment rather than to replace the experiment outright.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors
(12 describing the input biological material and 45 describing the
process predictors) for the 176 manufacturing runs. yield contains the
percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
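Before imputing, it is worth confirming how sparse the missingness actually is; a quick check:
# Count and percentage of missing cells across the full data set.
sum(is.na(ChemicalManufacturingProcess))
round(100 * mean(is.na(ChemicalManufacturingProcess)), 2)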
Using the methodology explained in Section 3.8, we impute the missing values with k-nearest neighbors in the code chunk below. Note that caret's knnImpute also centers and scales the data as part of the imputation, so all variables, including Yield, end up on a standardized scale.
impute <- ChemicalManufacturingProcess %>%
preProcess("knnImpute")
imputed <- predict(impute, ChemicalManufacturingProcess)
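A quick sanity check confirms that no missing values remain after imputation:
# All NA cells should now be filled in, so this should return 0.
sum(is.na(imputed))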
Let’s use the nearZeroVar function that we used in the previous problem, and split the data into X and y.
X <- imputed %>% select(-Yield)
y <- imputed$Yield
high_freq_predictors <- X[,-nearZeroVar(X)]
dim(high_freq_predictors)
## [1] 176 56
We create the training and test set in the code chunk below.
set.seed(1)
sample <- sample.split(y, SplitRatio = 0.8)
train_X <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(y, sample == TRUE) %>% as.vector()
test_y <- subset(y, sample == FALSE) %>% as.vector()
dim(train_X)
## [1] 140 56
dim(test_X)
## [1] 36 56
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
train_X,
train_y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plot(plsTune)
plsTune
## Partial Least Squares
##
## 140 samples
## 56 predictors
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 125, 125, 125, 126, 127, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8815812 0.4115193 0.6873771
## 2 1.2940157 0.4451745 0.7666037
## 3 0.8434375 0.5089979 0.6138395
## 4 0.7695028 0.5882731 0.5817554
## 5 0.8551909 0.5714961 0.6158262
## 6 0.9170698 0.5524309 0.6363787
## 7 0.9830327 0.5561862 0.6583479
## 8 1.0931773 0.5293409 0.7067873
## 9 1.3519114 0.4881377 0.7895779
## 10 1.5499833 0.4747700 0.8514642
## 11 1.7507091 0.4785277 0.9063816
## 12 1.7846913 0.4849381 0.9095140
## 13 1.7866934 0.4827298 0.9164390
## 14 1.7872606 0.4926111 0.9123233
## 15 1.8224340 0.4956915 0.9212411
## 16 1.8833359 0.4896128 0.9426463
## 17 1.9042100 0.4866516 0.9497834
## 18 1.9484455 0.4873739 0.9646124
## 19 2.0652901 0.4834424 0.9979863
## 20 2.1162865 0.4825031 1.0117810
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 4.
The output above shows that the lowest RMSE, 0.77, occurs at ncomp = 4. At that number of latent variables, the resampled R-squared and MAE are 0.59 and 0.58, respectively.
plsPred1 <- predict(plsTune, test_X)
plsValues1 <- data.frame(obs = test_y, pred = plsPred1)
defaultSummary(plsValues1)
## RMSE Rsquared MAE
## 0.7171429 0.4864144 0.6090410
The test set RMSE (0.72) is close to the cross-validated RMSE (0.77), indicating that the model does a decent job predicting data it has never seen before, although the test set R-squared (0.49) is somewhat lower than the resampled estimate (0.59).
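A side-by-side view makes the comparison explicit; a small sketch using the tuning results stored in plsTune (cv_row is just an illustrative name):
# Resampled (CV) metrics at the chosen ncomp next to the test-set metrics.
cv_row <- subset(plsTune$results, ncomp == plsTune$bestTune$ncomp,
                 select = c(RMSE, Rsquared, MAE))
rbind(CV = unlist(cv_row), Test = defaultSummary(plsValues1))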
The caret package includes a function for calculating variable importance, varImp, which we use to compute and plot the top predictors below.
ggplot(varImp(plsTune), top = 10)
The plot above shows the ten predictors that are most important in the PLS model that was created. The manufacturing process variables dominate the list: the top six predictors are all manufacturing processes.
top_10_predictors_row_names <- varImp(plsTune)$importance %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  row.names()

# Wrapping the name vector in all_of() avoids the tidyselect deprecation
# warning about external vectors in selections.
top_10_predictors_df <- high_freq_predictors %>%
  select(all_of(top_10_predictors_row_names))
top_10_predictors_df$Yield <- y
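As a quick check on the biological-versus-process split, we can tally the top-10 names by prefix (every predictor name begins with either "Biological" or "Manufacturing"):
# Count how many of the top 10 predictors are process vs. biological.
table(ifelse(grepl("^Manufacturing", top_10_predictors_row_names),
             "Process", "Biological"))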
We can explore the relationships between the top predictors and the response with a correlation plot.
corrplot::corrplot(cor(dplyr::select_if(top_10_predictors_df, is.numeric), use = "na.or.complete"),
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 0.75,
tl.cex = 0.75)
Some of the processes above have a negative correlation with yield, while others have a positive correlation. As the plot shows, ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess36 are negatively correlated with yield. This suggests that yield might improve if these three processes were decreased while the positively correlated processes were increased, so resources might need to be reallocated accordingly. That said, a different model type that we did not explore in this question could produce a different ranking, and practical considerations such as budgetary constraints and manufacturing hiccups would also come into play. Based on this plot, though, the three process variables with a negative correlation to yield would be a sensible place to start investigating.
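Beyond the pairwise correlations, plotting each top predictor directly against yield shows the shape of each relationship; a minimal sketch using caret's featurePlot (the 5-by-2 layout is just a presentation choice):
# Scatterplots of each top-10 predictor against (standardized) yield.
featurePlot(x = top_10_predictors_df %>% select(-Yield),
            y = top_10_predictors_df$Yield,
            plot = "scatter",
            layout = c(5, 2))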