Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Start R and use these commands to load the data:
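The loading chunk isn't reproduced here; it was presumably the textbook's commands (assuming the AppliedPredictiveModeling package is installed):
library(AppliedPredictiveModeling)
data(permeability)  # creates the fingerprints matrix and the permeability response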
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package.
fingerprints <- as.data.frame(fingerprints)
# permeability <- as.data.frame(permeability)
near0 <- nearZeroVar(fingerprints)
fp <- fingerprints[,-near0]
How many predictors are left for modeling?
There are 388 predictors out of 1107 left for modeling.
Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
trainingRows <- createDataPartition(permeability, p = .80, list= FALSE)
train_x <- fp[trainingRows, ]
train_y <- permeability[trainingRows]
test_x <- fp[-trainingRows, ]
test_y <- permeability[-trainingRows]
caret package
pls_model <- train(train_x, train_y, method="pls",
tuneLength = 20,
# trControl=trainControl(method="repeatedcv",repeats=5),
preProcess = c("center","scale"))
pls_model
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.84920 0.2709024 10.636429
## 2 12.91205 0.3698073 9.213105
## 3 12.93696 0.3786810 9.671910
## 4 13.41528 0.3513935 10.194430
## 5 13.23003 0.3749945 10.009937
## 6 13.24614 0.3840792 9.949120
## 7 13.08158 0.3949232 9.849193
## 8 13.10461 0.4004220 9.924006
## 9 13.20249 0.4013161 10.010824
## 10 13.43059 0.3965166 10.164667
## 11 13.66669 0.3892159 10.297159
## 12 13.82120 0.3848419 10.417121
## 13 14.05644 0.3767074 10.658454
## 14 14.41308 0.3573801 10.927938
## 15 14.67841 0.3470513 11.154293
## 16 14.92639 0.3337784 11.360811
## 17 15.10416 0.3289769 11.508837
## 18 15.39667 0.3189454 11.758257
## 19 15.66749 0.3125186 11.956640
## 20 15.98925 0.3048928 12.233394
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
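The block that follows looks like the variance-explained summary of caret's final two-component fit; the chunk isn't shown, but it was presumably something like:
summary(pls_model$finalModel)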
## Data: X dimension: 133 388
## Y dimension: 133 1
## Fit method: oscorespls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 21.43 34.77
## .outcome 32.02 48.73
pls package
library(pls)
train <- train_x
train['y'] <- train_y
test <- test_x
test['y'] <- test_y
plsFit <- plsr(y ~ ., data = train, scale = TRUE, center = TRUE,
ncomp=20, validation = "CV")
Both the caret and the pls packages find that 2 components are optimal (although when using cross-validation I can’t always get reproducible results, and sometimes more components are recommended). For the caret model the corresponding \(R^2\) value is 0.3698073. For the pls model the corresponding \(R^2\) value is 0.4872599 (if I calculated that correctly).
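The calculation behind that 0.4872599 isn't shown; one way to get a cross-validated \(R^2\) from the plsr fit is the pls package's R2 function (a sketch):
R2(plsFit, estimate = "CV")            # cross-validated R^2 by number of components
R2(plsFit, estimate = "CV", ncomp = 2) # just the two-component value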
Predict the response for the test set. What is the test set estimate of \(R^2\)?
caret package predictions
## RMSE Rsquared MAE
## 10.2093175 0.5067774 7.4334935
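The chunk that produced the caret numbers above isn't shown; a minimal version, assuming the objects defined earlier, would be:
caret_pred <- predict(pls_model, test_x)
defaultSummary(data.frame(obs = test_y, pred = caret_pred))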
pls package predictions
pls_pred <- predict(plsFit, test_x, ncomp=2)
pls.eval <- data.frame(obs=test_y, pred=pls_pred[,1,1])
defaultSummary(pls.eval)
## RMSE Rsquared MAE
## 10.2093175 0.5067774 7.4334935
I’m a little shocked that my model evaluation statistics all came out exactly the same for both models, but it makes sense: caret’s "pls" method calls the pls package under the hood, so with the same pre-processing (centering and scaling) and the same two components, both fits should produce identical predictions.
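One way to check that caret's "pls" method is backed by the pls package (a small sketch using caret's getModelInfo utility):
getModelInfo("pls", regex = FALSE)[[1]]$library  # returns "pls"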
Try building other models discussed in this chapter. Do any have better predictive performance?
caret
ridge_grid <- data.frame(.lambda = seq(0, .1, length=11))
ctrl <- trainControl(method = "cv", number = 10)
ridge_model <- train(train_x, train_y,
method = "ridge",
tuneGrid = ridge_grid,
trControl = ctrl,
preProcess = c("center","scale"))
ridge_model
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 121, 121, 118, 120, 119, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 14.86420 0.3787189 11.350326
## 0.01 108.35841 0.2926301 77.502509
## 0.02 14.45405 0.3741074 10.628271
## 0.03 13.95151 0.3964276 10.286067
## 0.04 13.77835 0.4078377 10.154959
## 0.05 13.43403 0.4235641 9.900218
## 0.06 13.27553 0.4328217 9.772313
## 0.07 13.16316 0.4398899 9.676881
## 0.08 13.05329 0.4471737 9.600932
## 0.09 12.97369 0.4529305 9.538200
## 0.10 12.90751 0.4579974 9.491182
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
caret model
## RMSE Rsquared MAE
## 10.4890920 0.6111045 8.1409039
elasticnet model
## RMSE Rsquared MAE
## 10.4850440 0.6112443 8.1364082
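The elasticnet-package fit behind these numbers isn't shown; it was presumably along these lines (a sketch; enet with s = 1 and a nonzero \(\lambda\) corresponds to ridge regression):
library(elasticnet)
enet_fit <- enet(x = as.matrix(train_x), y = train_y, lambda = 0.1, normalize = TRUE)
enet_pred <- predict(enet_fit, as.matrix(test_x), s = 1, mode = "fraction", type = "fit")
defaultSummary(data.frame(obs = test_y, pred = enet_pred$fit))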
After tuning \(\lambda\) with the caret train function, a model fit with the elasticnet package using \(\lambda = 0.1\) (effectively ridge regression, since s = 1) gave a noticeably better \(R^2\) (about 0.61 vs. 0.51 for PLS), although since we are comparing two different models this may not be the best measure of performance. The RMSE and MAE are not much different (in fact marginally worse), so on balance I would still say the elastic net model has the better predictive performance.
The model created by the train function in the caret package returned the same results, so I’m not really sure why the textbook has you recreate the model with the elasticnet package.
The plots of actual vs. predicted values above suggest that the elastic net model is the better fit.
The Q-Q plots also show a better overall fit for the elastic net model, even though the two PLS models (built with the caret and pls packages) fit the majority of the data better; the PLS fits have five significant outliers.
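The plots themselves aren't reproduced here; for the PLS model they were along these lines (a sketch using pls.eval from above):
plot(pls.eval$obs, pls.eval$pred, xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2)              # reference line: perfect prediction
res <- pls.eval$obs - pls.eval$pred
qqnorm(res); qqline(res)           # normal Q-Q plot of the residuals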
Would you recommend any of your models to replace the permeability laboratory experiment?
No, based on the plots above I don’t think any of the models would be accurate enough to justify replacing the permeability laboratory experiment.
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:
Start R and use these commands to load the data:
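As before, the loading chunk isn't reproduced here; presumably:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)  # a data frame with Yield plus 57 predictors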
The data frame ChemicalManufacturingProcess contains the 57 predictors (12 describing the input biological material and 45 describing the manufacturing process) for the 176 manufacturing runs, along with the target variable Yield, which contains the percent yield for each run.
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
# missing <- data.frame(t(apply(is.na(ChemicalManufacturingProcess), 2, sum)))
# t(missing[,colSums(missing) > 0])
library(naniar) # for missing data viz
missingness <- ChemicalManufacturingProcess %>%
miss_var_summary()
kable(missingness %>%
filter(pct_miss > 0)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | n_miss | pct_miss |
|---|---|---|
| ManufacturingProcess03 | 15 | 8.5227273 |
| ManufacturingProcess11 | 10 | 5.6818182 |
| ManufacturingProcess10 | 9 | 5.1136364 |
| ManufacturingProcess25 | 5 | 2.8409091 |
| ManufacturingProcess26 | 5 | 2.8409091 |
| ManufacturingProcess27 | 5 | 2.8409091 |
| ManufacturingProcess28 | 5 | 2.8409091 |
| ManufacturingProcess29 | 5 | 2.8409091 |
| ManufacturingProcess30 | 5 | 2.8409091 |
| ManufacturingProcess31 | 5 | 2.8409091 |
| ManufacturingProcess33 | 5 | 2.8409091 |
| ManufacturingProcess34 | 5 | 2.8409091 |
| ManufacturingProcess35 | 5 | 2.8409091 |
| ManufacturingProcess36 | 5 | 2.8409091 |
| ManufacturingProcess02 | 3 | 1.7045455 |
| ManufacturingProcess06 | 2 | 1.1363636 |
| ManufacturingProcess01 | 1 | 0.5681818 |
| ManufacturingProcess04 | 1 | 0.5681818 |
| ManufacturingProcess05 | 1 | 0.5681818 |
| ManufacturingProcess07 | 1 | 0.5681818 |
| ManufacturingProcess08 | 1 | 0.5681818 |
| ManufacturingProcess12 | 1 | 0.5681818 |
| ManufacturingProcess14 | 1 | 0.5681818 |
| ManufacturingProcess22 | 1 | 0.5681818 |
| ManufacturingProcess23 | 1 | 0.5681818 |
| ManufacturingProcess24 | 1 | 0.5681818 |
| ManufacturingProcess40 | 1 | 0.5681818 |
| ManufacturingProcess41 | 1 | 0.5681818 |
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
CMP_imputed <- predict(imputer, ChemicalManufacturingProcess)
missing <- CMP_imputed %>%
miss_var_summary()
kable(missing %>%
filter(pct_miss > 0)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | n_miss | pct_miss |
|---|---|---|

The empty table confirms that no missing values remain after imputation.
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
After removing variables with near-zero variance, 56 of the 57 predictors are left for modeling (57 of the 58 columns, counting Yield).
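The filtering chunk isn't shown above; it was presumably along these lines (a sketch, assuming the imputed data from the previous step):
nzv <- nearZeroVar(CMP_imputed)  # flags the near-zero-variance predictor(s)
cmp <- CMP_imputed[, -nzv]       # 57 columns remain: Yield plus 56 predictors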
trainingRows <- createDataPartition(cmp$Yield, p = .80, list= FALSE)
train <- cmp[trainingRows, ]
test <- cmp[-trainingRows, ]
caret package
CMP_pls_model <- train(train[,-1], train[,"Yield"], method="pls",
tuneLength=20,
# trControl=trainControl(method="repeatedcv",repeats=5),
preProcess=c("center","scale"))
CMP_pls_model
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.023764 0.2662151 0.7024952
## 2 1.517135 0.2556752 0.7678175
## 3 1.144342 0.3256068 0.7054432
## 4 1.051293 0.3594072 0.6832918
## 5 1.134892 0.3501609 0.7039725
## 6 1.186446 0.3432564 0.7301779
## 7 1.269517 0.3291640 0.7605063
## 8 1.409004 0.2999849 0.7995211
## 9 1.527249 0.2787915 0.8366923
## 10 1.635630 0.2667980 0.8673658
## 11 1.728955 0.2603218 0.8882150
## 12 1.783722 0.2508235 0.9134676
## 13 1.869813 0.2296608 0.9384069
## 14 2.043987 0.2143848 0.9771190
## 15 2.197101 0.2077721 1.0105228
## 16 2.316361 0.1920839 1.0408806
## 17 2.471245 0.1893137 1.0761429
## 18 2.591035 0.1916388 1.1029218
## 19 2.688780 0.1857432 1.1276270
## 20 2.815421 0.1805420 1.1566318
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.
I’m not impressed with the results, so let’s try an elastic net…
caret
CMP_ridge_model <- train(train[,-1], train$Yield,
method = "ridge",
tuneGrid = ridge_grid,
trControl = ctrl,
preProcess = c("center","scale"))
CMP_ridge_model
## Ridge Regression
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 131, 128, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00 2.331986 0.3814637 1.1611843
## 0.01 1.159724 0.4470011 0.7405773
## 0.02 1.198792 0.4572407 0.7279004
## 0.03 1.183358 0.4632850 0.7177072
## 0.04 1.160956 0.4674776 0.7106364
## 0.05 1.139159 0.4708060 0.7041047
## 0.06 1.119448 0.4737120 0.6988919
## 0.07 1.101965 0.4764262 0.6939708
## 0.08 1.086531 0.4790872 0.6896225
## 0.09 1.072910 0.4817887 0.6860740
## 0.10 1.060882 0.4846009 0.6832646
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.
The ridge/elastic net model gives noticeably better resampled performance than the PLS model, so let's follow the textbook examples and also fit it directly with the elasticnet package, using the optimal \(\lambda\) found during the caret tuning.
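The chunk defining CMP_ridge_model2 (used below) isn't shown; presumably it was something like this (a sketch, assuming the elasticnet package):
library(elasticnet)
# ridge fit on the training predictors, with the lambda selected by caret
CMP_ridge_model2 <- enet(x = as.matrix(train[,-1]), y = train$Yield, lambda = 0.1, normalize = TRUE)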
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
caret model
## RMSE Rsquared MAE
## 0.5727537 0.7324409 0.4646206
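Again the prediction chunk for the caret model isn't shown; a minimal version would be:
CMP_caret_pred <- predict(CMP_ridge_model, test[,-1])
defaultSummary(data.frame(obs = test$Yield, pred = CMP_caret_pred))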
elasticnet model
CMP_ridge_pred2 <- predict(CMP_ridge_model2, as.matrix(test[,-1]), s = 1,
mode = "fraction",
type = "fit")
## RMSE Rsquared MAE
## 0.5727537 0.7324409 0.4646206
Once again we get almost exactly the same performance from the models trained with the caret package and the elasticnet package. Each has a test-set \(R^2\) of about 0.73, which is significantly higher than the resampled \(R^2\) of roughly 0.48 on the training set.
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
CMP_enet_coef <- predict(CMP_ridge_model2, as.matrix(test[,-1]), s = 1,
mode = "fraction",
type = "coefficients")
top3 <- head(sort(CMP_enet_coef$coefficients, decreasing = TRUE), 3)
bottom3 <- tail(sort(CMP_enet_coef$coefficients, decreasing = TRUE), 3)
top3
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess04
##              0.3898830              0.2392460              0.1989751
bottom3
## ManufacturingProcess36 ManufacturingProcess13 ManufacturingProcess37
##             -0.1442161             -0.1476988             -0.1734662
The 6 most influential predictors (3 of which are positively influential and 3 of which are negatively influential) are all manufacturing process predictors.
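A quick way to check whether process or biological predictors dominate more broadly (a sketch using the coefficients computed above):
# of the 10 largest |coefficients|, how many are process vs. biological predictors?
coefs <- sort(abs(CMP_enet_coef$coefficients), decreasing = TRUE)[1:10]
table(ifelse(grepl("^Manufacturing", names(coefs)), "process", "biological"))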
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?
top_vars <- c('ManufacturingProcess32', 'ManufacturingProcess09', 'ManufacturingProcess04',
'ManufacturingProcess36', 'ManufacturingProcess13', 'ManufacturingProcess37')
mod <- lm(Yield~ManufacturingProcess32+ManufacturingProcess09+ManufacturingProcess04+
ManufacturingProcess36+ManufacturingProcess13+ManufacturingProcess37,
data = CMP_imputed)
summary(mod)
##
## Call:
## lm(formula = Yield ~ ManufacturingProcess32 + ManufacturingProcess09 +
## ManufacturingProcess04 + ManufacturingProcess36 + ManufacturingProcess13 +
## ManufacturingProcess37, data = CMP_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45215 -0.48449 -0.02652 0.37212 1.59264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001134 0.046399 -0.024 0.9805
## ManufacturingProcess32 0.526013 0.082529 6.374 1.69e-09 ***
## ManufacturingProcess09 0.389563 0.078537 4.960 1.70e-06 ***
## ManufacturingProcess04 0.100731 0.053438 1.885 0.0611 .
## ManufacturingProcess36 -0.098533 0.077647 -1.269 0.2062
## ManufacturingProcess13 -0.148247 0.078044 -1.900 0.0592 .
## ManufacturingProcess37 -0.118954 0.049016 -2.427 0.0163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6155 on 169 degrees of freedom
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.6212
## F-statistic: 48.82 on 6 and 169 DF, p-value: < 2.2e-16
A simple linear model using only the top 6 predictors achieves an adjusted \(R^2\) of 0.62, only slightly lower than that of the much more complicated elastic net model. The top two predictors, ManufacturingProcess32 and ManufacturingProcess09, are by far the most significant terms in this model, so they are the first two processes I would examine when trying to optimize the manufacturing process. Since both have a positive effect on yield, it may also be worth looking at the three processes with negative coefficients to see whether improvements there could increase yield.
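As a further check on direction and strength (a small sketch, using the imputed data and the top_vars vector defined above):
# simple correlations between each of the top predictors and Yield
sapply(CMP_imputed[, top_vars], function(x) cor(x, CMP_imputed$Yield))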