library(tidyverse)   # data manipulation plus ggplot2 graphics
library(caret)       # model training, tuning, and resampling
library(elasticnet)  # backend for caret's "ridge" and "enet" methods
library(ggfortify)   # autoplot() diagnostics for lm objects
library(ggplot2)     # already attached by tidyverse; kept for explicitness
library(RANN)        # fast nearest-neighbour search
Prompt: Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
# Provided code
library(AppliedPredictiveModeling)
data(permeability)
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
# Filter fingerprints using nearZeroVar
near_zero_fprint <- nearZeroVar(fingerprints)
filtered_fprint <- fingerprints[, -near_zero_fprint]
# Original count
dim(fingerprints)
## [1] 165 1107
# nearZeroVar count
ncol(filtered_fprint)
## [1] 388
The original fingerprints data has 1,107 predictors. Filtering out the near-zero-variance (low-frequency) predictors leaves 388 predictors for modeling.
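For transparency about what gets dropped, the frequency metrics behind the filter can be inspected directly; this is a quick sketch using nearZeroVar's saveMetrics argument (thresholds shown are caret's defaults):
# Inspect the metrics nearZeroVar uses: freqRatio (most common value vs.
# second most common) and percentUnique; the defaults flag predictors with
# freqRatio > 95/5 and percentUnique < 10
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])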
# Set seed for reproducibility
set.seed(64)
# Create an 80/20 train-test partition
index <- createDataPartition(permeability, p = .8, list = FALSE)
# Split X into training and testing
X_train <- filtered_fprint[index, ]
X_test <- filtered_fprint[-index, ]
# Split y into training and testing
y_train <- permeability[index, ]
y_test <- permeability[-index, ]
# Train the pls model
pls_finger <- train(
x = X_train,
y = y_train,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
# Visualize the tuning profile (RMSE vs. number of components)
plot(pls_finger)
# Show the cross-validated metrics for the best-tuned row
pls_finger$results[pls_finger$results$ncomp == pls_finger$bestTune$ncomp, ]
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 2 2 11.66697 0.4816941 8.311813 3.312448 0.2278563 2.057526
The optimal number of latent variables is 2, based on the plot and the lowest cross-validated RMSE. The corresponding resampled estimate of R^2 is 0.482.
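The same optimum can be confirmed programmatically rather than read off the plot (a small sketch using the fitted train object):
# The row with the minimum cross-validated RMSE matches bestTune
pls_finger$results[which.min(pls_finger$results$RMSE), "ncomp"]
pls_finger$bestTune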
# Predict on the test data
predicted_finger <- predict(pls_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
predicted_finger_y <- data.frame(obs = y_test, pred = predicted_finger)
defaultSummary(predicted_finger_y)
## RMSE Rsquared MAE
## 11.9475566 0.4092106 7.5129098
The test-set predictions give RMSE = 11.95, R^2 = 0.409, and MAE = 7.513. The R^2 of 0.409 means the model explains 40.9% of the variability in the test-set permeability values.
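An observed-versus-predicted plot makes the test-set fit easier to judge than the summary numbers alone (a quick sketch using the data frame built above):
# Points near the dashed 45-degree line indicate accurate predictions
ggplot(predicted_finger_y, aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Predicted permeability", y = "Observed permeability")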
“In addition to ordinary linear regression, these types of models include partial least squares (PLS) and penalized models such as ridge regression, the lasso, and the elastic net.” (KJ 101)
Ordinary Linear Regression
# Find multicollinear columns
combo_info <- findLinearCombos(X_train)
# Remove those from both sets
olr_X_train <- X_train[, -combo_info$remove]
olr_X_test <- X_test[, -combo_info$remove]
# Ordinary Linear Regression model
olr_fit_finger <- train(
x = olr_X_train,
y = y_train,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10)
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
(The warning above repeats once per cross-validation fold. Even after removing exact linear combinations, each fold has roughly as many predictors as training rows, so the least-squares fit is rank-deficient and some coefficients are not estimable.)
# Extract the final lm model from caret
final_model <- olr_fit_finger$finalModel
# ggfortify's autoplot() produces a residuals-vs-fitted diagnostic for lm objects
autoplot(final_model, which = 1)
# Predict on the test data
olr_predicted_finger <- predict(olr_fit_finger, olr_X_test)
# Score the test-set predictions with caret's defaultSummary()
olr_pred_finger_y <- data.frame(obs = y_test, pred = olr_predicted_finger)
defaultSummary(olr_pred_finger_y)
## RMSE Rsquared MAE
## 24.63281224 0.08197337 17.10432209
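As the rank-deficiency warnings suggest, ordinary least squares generalizes poorly here. As a sanity check, the reported metrics can be reproduced by hand (a minimal sketch; defaultSummary computes the same quantities internally):
# RMSE, R^2 (squared correlation), and MAE computed directly
with(olr_pred_finger_y, c(
  RMSE     = sqrt(mean((obs - pred)^2)),
  Rsquared = cor(obs, pred)^2,
  MAE      = mean(abs(obs - pred))
))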
Ridge Regression
# Grid search for the lambda (ridge penalty) minimizing cross-validated RMSE
ridge_grid_finger <- data.frame(.lambda = seq(0, .1, length = 15))
# Train the ridge model
ridge_fit_finger <- train(
X_train,
y_train,
method = "ridge",
tuneGrid = ridge_grid_finger,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
# Visualize the tuning profile (RMSE vs. lambda)
plot(ridge_fit_finger)
# Predict on the test data
ridge_finger <- predict(ridge_fit_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
ridge_pred_finger_y <- data.frame(obs = y_test, pred = ridge_finger)
defaultSummary(ridge_pred_finger_y)
## RMSE Rsquared MAE
## 12.4402796 0.4489676 8.9131521
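The penalty selected by cross-validation can be pulled out of the fitted object (a one-line sketch):
# Lambda value chosen by cross-validation
ridge_fit_finger$bestTune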
Elastic Net
# Grid search over lambda (ridge penalty) and fraction (lasso shrinkage) minimizing cross-validated RMSE
elastic_grid_finger <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))
# Train the elastic net model
elastic_fit_finger <- train(
X_train,
y_train,
method = "enet",
tuneGrid = elastic_grid_finger,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
# Visualize the tuning profile (RMSE vs. fraction, by lambda)
plot(elastic_fit_finger)
# Predict on the test data
elastic_finger <- predict(elastic_fit_finger, X_test)
# Score the test-set predictions with caret's defaultSummary()
elastic_pred_finger_y <- data.frame(obs = y_test, pred = elastic_finger)
defaultSummary(elastic_pred_finger_y)
## RMSE Rsquared MAE
## 10.4792876 0.5469005 7.3754463
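Again, the winning combination of penalty and shrinkage fraction can be read off the fitted object (a quick sketch):
# Best (fraction, lambda) pair chosen by cross-validation
elastic_fit_finger$bestTune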
print("Permeability Laboratory Experiment Results:")
## [1] "Permeability Laboratory Experiment Results:"
defaultSummary(predicted_finger_y)
## RMSE Rsquared MAE
## 11.9475566 0.4092106 7.5129098
print("Ordinary Linear Regression Results:")
## [1] "Ordinary Linear Regression Results:"
defaultSummary(olr_pred_finger_y)
## RMSE Rsquared MAE
## 24.63281224 0.08197337 17.10432209
print("Ridge Regression Results:")
## [1] "Ridge Regression Results:"
defaultSummary(ridge_pred_finger_y)
## RMSE Rsquared MAE
## 12.4402796 0.4489676 8.9131521
print("Elastic Net Regression Results:")
## [1] "Elastic Net Regression Results:"
defaultSummary(elastic_pred_finger_y)
## RMSE Rsquared MAE
## 10.4792876 0.5469005 7.3754463
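The same four summaries can also be stacked into a single table for easier side-by-side comparison (a small sketch reusing the vectors computed above):
# One row per model: test-set RMSE, Rsquared, MAE
rbind(
  PLS        = defaultSummary(predicted_finger_y),
  OLS        = defaultSummary(olr_pred_finger_y),
  Ridge      = defaultSummary(ridge_pred_finger_y),
  ElasticNet = defaultSummary(elastic_pred_finger_y)
)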
The ideal model is the one with the lowest RMSE, highest R^2, and lowest MAE.
The PLS model has the second lowest RMSE (11.95) and the second lowest MAE (7.51), but only the third highest R^2 (0.409).
The Elastic Net model has the lowest RMSE (10.48), the highest R^2 (0.547), and the lowest MAE (7.38).
Since the Elastic Net model leads on all three metrics, I would recommend it over the PLS model.
Prompt: A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
# Predict missing values using KNN imputation function
chem_impute_vals <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
# Fill the predicted values
chem_imputed <- predict(
chem_impute_vals,
ChemicalManufacturingProcess
)
# Check for any remaining missing values
sum(is.na(chem_imputed))
## [1] 0
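Before trusting the imputation it is worth seeing how much was actually missing; a quick sketch on the raw data:
# Total missing cells and the most affected columns in the raw data
sum(is.na(ChemicalManufacturingProcess))
head(sort(colSums(is.na(ChemicalManufacturingProcess)), decreasing = TRUE))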
# As in 6.2(b), filter out near-zero-variance predictors
near_zero_chem <- nearZeroVar(chem_imputed)
filtered_chem <- chem_imputed[, -near_zero_chem]
# Original count
dim(chem_imputed)
## [1] 176 58
# nearZeroVar count
ncol(filtered_chem)
## [1] 57
# Set seed for reproducibility
set.seed(64)
# Create an 80/20 train-test partition
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)
# Split X into training and testing (exclude 'Yield' column)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Split y into training and testing (only the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
# Train the pls model
pls_chem <- train(
x = X_train_chem,
y = y_train_chem,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
# Visualize the tuned data
plot(pls_chem)
# Show the cross-validated metrics for the best-tuned row
pls_chem$results[pls_chem$results$ncomp == pls_chem$bestTune$ncomp, ]
## ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 0.7505754 0.4490387 0.6030918 0.07953122 0.1446725 0.08384717
The optimal number of PLS components is 1: the single-component model produced the lowest cross-validated RMSE, 0.751.
# Predict on the test data
predicted_chem <- predict(pls_chem, X_test_chem)
# Score the test-set predictions with caret's defaultSummary()
predicted_chem_y <- data.frame(obs = y_test_chem, pred = predicted_chem)
defaultSummary(predicted_chem_y)
## RMSE Rsquared MAE
## 1.3147047 0.2071462 0.7344713
The test-set RMSE is 1.315, considerably worse than the resampled training RMSE of 0.751 (roughly 1.75 times higher). A gap of this size between training and test performance suggests the model is overfitting. (Note that knnImpute in preProcess centers and scales every column, including Yield, so both RMSE values are in standardized units.)
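The size of the gap can be quantified directly (a small sketch comparing the cross-validated and test-set RMSE):
# Ratio > 1 means performance degrades on unseen data
cv_rmse   <- min(pls_chem$results$RMSE)
test_rmse <- defaultSummary(predicted_chem_y)[["RMSE"]]
test_rmse / cv_rmse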
# Get the most important predictors with varImp
chem_importance <- varImp(pls_chem, scale = FALSE)$importance |>
arrange(-Overall)
# See the features
head(chem_importance, 10)
## Overall
## ManufacturingProcess32 0.07978184
## ManufacturingProcess09 0.07155332
## ManufacturingProcess36 0.06864897
## ManufacturingProcess13 0.06703950
## BiologicalMaterial02 0.06037569
## BiologicalMaterial06 0.06024218
## BiologicalMaterial03 0.05943981
## ManufacturingProcess17 0.05737441
## ManufacturingProcess33 0.05634232
## ManufacturingProcess12 0.05183224
The four most important predictors, and seven of the top ten, are manufacturing process variables, so process predictors dominate the importance ranking.
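The split between predictor types in the top ten can be tallied directly (a quick sketch on the importance table):
# Count manufacturing vs. biological predictors among the top 10
top10_names <- rownames(head(chem_importance, 10))
table(ifelse(grepl("^Manufacturing", top10_names), "Process", "Biological"))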
# Get the top 10 important predictors
top10 <- head(chem_importance, 10)
# Select Yield and the top 10 predictors
correlation_matrix <- filtered_chem |>
select(Yield, all_of(rownames(top10))) |>
cor(use = "complete.obs")
# Visualize the correlation matrix using corrplot
corrplot::corrplot(correlation_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.cex = 0.8)
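Reading signs and magnitudes off a color map can be imprecise, so the Yield row of the matrix can also be printed directly (a small sketch):
# Correlation of each top-10 predictor with Yield, strongest first
yield_cor <- correlation_matrix["Yield", ]
sort(yield_cor[names(yield_cor) != "Yield"], decreasing = TRUE)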
# Loop through top 10 predictors and create scatter plots
top10_predictors <- rownames(top10)
for (predictor in top10_predictors) {
p <- ggplot(filtered_chem, aes(x = .data[[predictor]], y = Yield)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "salmon") + # Add a linear regression line
labs(x = predictor, y = "Yield", title = paste("Relationship between", predictor, "and Yield")) +
theme_minimal()
print(p)
}
## `geom_smooth()` using formula = 'y ~ x'
(The message above repeats once for each of the ten plots.)
Positive relationships:
ManufacturingProcess32 - Stronger
ManufacturingProcess09 - Stronger
BiologicalMaterial02 - Weaker
BiologicalMaterial06 - Weaker
BiologicalMaterial03 - Weaker
ManufacturingProcess33 - Weaker
Negative relationships:
ManufacturingProcess36 - Stronger
ManufacturingProcess13 - Stronger
ManufacturingProcess17 - Stronger
ManufacturingProcess12 - Weaker; the points do not show a solid linear trend
There are four groups: strong positive, strong negative, weak positive, and weak negative. Within these groups, the stronger relationships will be the most useful for improving yield in future runs of the manufacturing process.
ManufacturingProcess32 and ManufacturingProcess09 have strong positive relationships with Yield, meaning higher values for those processes correlate with higher yield. This is actionable information: increasing their settings within an efficient range could raise yields.
ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17 have strong negative relationships with Yield, meaning higher values for those processes correlate with lower yield. Tighter regulation of these processes to keep their values down could likewise raise yields.