DATA 624 Homework 7
Question 6.2
Developing a model to predict permeability (See Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Part A
Start R and use these commands to load the data:
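The data-loading chunk isn't displayed above; a minimal sketch (assuming the AppliedPredictiveModeling package is installed, along with the caret, tidyverse, kableExtra, and glmnet packages used later in this exercise) would be:
# Load the permeability data and the packages used throughout Question 6.2
# data(permeability) loads both the fingerprints matrix and the permeability response
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(kableExtra)
library(glmnet)
data(permeability)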
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
Part B
The fingerprints predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many are left for modeling?
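A quick way to count the low-frequency fingerprints (a sketch, assuming the objects loaded above):
# Columns flagged as near-zero variance, i.e. substructures present in very few molecules
nzv <- nearZeroVar(fingerprints)
length(nzv)                       # 719 predictors are flagged for removal
ncol(fingerprints) - length(nzv)  # 388 predictors remain for modeling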
The nearZeroVar function flags 719 of the 1,107 fingerprints as near-zero variance. Removing them leaves 388 predictors for modeling.
Part C
Split the data into training and test sets, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
I will split the data using an 80:20 split and center the predictors.
# Drop the near-zero-variance fingerprints and attach the permeability response
df <- as.data.frame(fingerprints[, -nearZeroVar(fingerprints)]) %>%
  mutate(y = as.vector(permeability))
# Make this reproducible
set.seed(42)
in_train <- createDataPartition(df$y, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
y ~ ., data = train_df, method = "pls",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
# Plot model RMSE vs different values of components
title <- paste("Training Set RMSE Minimized at",
pls_model$bestTune$ncomp,
"Components")
plot(pls_model, main = title)
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared) %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared |
---|---|---|
5 | 14.29229 | 0.263486 |
The optimal number of latent variables in the PLS model is 5, with a corresponding resampled \(R^2\) of about 0.26; the model captures roughly 26% of the variation in permeability.
Part D
Predict the response of the test set. What is the test set estimate of \(R^2\)?
# Make predictions
pls_predictions <- predict(pls_model, test_df)
# Model performance metrics
results <- data.frame(Model = "PLS",
RMSE = caret::RMSE(pls_predictions, test_df$y),
Rsquared = caret::R2(pls_predictions, test_df$y))
results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS | 13.23639 | 0.3016825 |
The test set \(R^2\) is about 0.30, slightly better than the resampled estimate of 0.26.
Part E
Try building other models discussed in this chapter. Do any have better predictive performance?
PCR Model
pcr_model <- train(
y ~ ., data = train_df, method = "pcr",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
title <- paste("Training Set RMSE Minimized at",
               pcr_model$bestTune$ncomp,
               "Components")
plot(pcr_model, main = title)
# Make predictions
pcr_predictions <- predict(pcr_model, test_df)
# Model performance metrics
pcr_results <- data.frame(Model = "PCR",
RMSE = caret::RMSE(pcr_predictions, test_df$y),
Rsquared = caret::R2(pcr_predictions, test_df$y))
pcr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PCR | 15.90717 | 0.0344595 |
No improvement here.
Ridge Regression
# Build model matrices for glmnet, dropping the intercept column
x <- model.matrix(y ~ ., data = train_df)[, -1]
x_test <- model.matrix(y ~ ., data = test_df)[, -1]
# Pick lambda by cross-validation, then fit the ridge model (alpha = 0)
rr_cv <- cv.glmnet(x, train_df$y, alpha = 0)
rr_model <- glmnet(x, train_df$y, alpha = 0, lambda = rr_cv$lambda.min)
rr_predictions <- as.vector(predict(rr_model, x_test))
rr_results <- data.frame(Model = "Ridge Regression",
RMSE = caret::RMSE(rr_predictions, test_df$y),
Rsquared = caret::R2(rr_predictions, test_df$y))
rr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Ridge Regression | 14.59042 | 0.1426659 |
PLS performed better.
Lasso Regression
lr_cv <- cv.glmnet(x, train_df$y, alpha = 1)
lr_model <- glmnet(x, train_df$y, alpha = 1, lambda = lr_cv$lambda.min)
lr_predictions <- as.vector(predict(lr_model, x_test))
lr_results <- data.frame(Model = "Lasso Regression",
RMSE = caret::RMSE(lr_predictions, test_df$y),
Rsquared = caret::R2(lr_predictions, test_df$y))
lr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Lasso Regression | 13.98552 | 0.2417885 |
PLS has more explanatory power.
Elastic Net Regression
en_model <- train(
y ~ ., data = train_df, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Best tuning parameters
en_model$bestTune
alpha | lambda |
---|---|
0.7 | 1.333236 |
# Make predictions on the held-out test set
en_predictions <- predict(en_model, test_df)
# Model performance metrics
en_results <- data.frame(Model = "Elastic Net Regression",
RMSE = caret::RMSE(en_predictions, test_df$y),
Rsquared = caret::R2(en_predictions, test_df$y))
en_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Elastic Net Regression | 14.08593 | 0.2055749 |
There isn’t any improvement over PLS.
Summary
# Note: the PLS row reports the resampled CV metrics; the other rows are test-set metrics
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
mutate("Model" = "PLS") %>%
select(Model, RMSE, Rsquared) %>%
as.data.frame() %>%
bind_rows(pcr_results) %>%
bind_rows(rr_results) %>%
bind_rows(lr_results) %>%
bind_rows(en_results) %>%
arrange(desc(Rsquared)) %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS | 14.29229 | 0.2634860 |
Lasso Regression | 13.98552 | 0.2417885 |
Elastic Net Regression | 14.08593 | 0.2055749 |
Ridge Regression | 14.59042 | 0.1426659 |
PCR | 15.90717 | 0.0344595 |
None of the alternative models achieved a higher \(R^2\) than the PLS model, whether judged by the resampled estimate above or by the test-set \(R^2\) (0.30 for PLS versus 0.24 or less for the others).
Part F
Would you recommend any of your models to replace the permeability laboratory experiment?
No. Even the best model (PLS) achieved an \(R^2\) of only about 0.26 (resampled) to 0.30 (test set), so it explains too little of the variation in permeability to replace the laboratory experiment.
Question 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw materials before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Part A
Start R and use these commands to load the data:
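The loading chunk isn't displayed; a minimal sketch (assuming the AppliedPredictiveModeling package, plus the corrplot package used in Part F) would be the following. Note that the package ships the data as a single data frame, ChemicalManufacturingProcess, with Yield in the first column followed by the 57 predictors:
# Load the chemical manufacturing data and the extra package used in this exercise
library(AppliedPredictiveModeling)
library(corrplot)
data(ChemicalManufacturingProcess)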
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, while yield contains the percent yield for each run.
Part B
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
I will use KNN to impute the missing values. The caret package makes this simple:
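The imputation chunk isn't displayed; a minimal sketch, assuming the ChemicalManufacturingProcess data frame loaded above, would be:
# K-nearest-neighbour imputation of the missing values.
# caret's knnImpute also centers and scales every numeric column as part of the
# imputation, so the data frame used below is on the standardized scale.
impute_model <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(impute_model, ChemicalManufacturingProcess)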
Part C
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
I will pre-process the data by removing the near-zero-variance variables prior to splitting it, as sketched below.
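A sketch of that step, assuming the imputed data frame df from Part B:
# Drop any near-zero-variance columns; Yield varies and is therefore retained
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]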
Now the data will be split into training and test sets using an 80:20 split.
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
Given the performance of PLS in the previous exercise, I will use it again. I’m going to further pre-process the data by centering and scaling the predictors.
pls_model <- train(
Yield ~ ., data = train_df, method = "pls",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
# Plot model RMSE vs different values of components
title <- paste("Training Set RMSE Minimized at",
pls_model$bestTune$ncomp,
"Components")
plot(pls_model, main = title)
For the performance metrics I will again look at the RMSE and the \(R^2\). The optimal value for RMSE would be zero, and the optimal \(R^2\) would be 1. Here are the performance metrics for the optimal model on the training set.
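For reference, caret computes these two metrics as the root mean squared error and the squared correlation between the observed and predicted responses:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = \operatorname{cor}\left(y, \hat{y}\right)^2
\]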
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared) %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared |
---|---|---|
3 | 0.7363066 | 0.5244517 |
Part D
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Make predictions
pls_predictions <- predict(pls_model, test_df)
# Model performance metrics
results <- data.frame(RMSE = caret::RMSE(pls_predictions, test_df$Yield),
Rsquared = caret::R2(pls_predictions, test_df$Yield))
results %>%
kable() %>%
kable_styling()
RMSE | Rsquared |
---|---|
0.6192577 | 0.6771122 |
We have an RMSE of about 0.62 and an \(R^2\) of roughly 0.68 on the test set. Both are better than the resampled training estimates (RMSE 0.74, \(R^2\) 0.52), so the model holds up well on the held-out data.
Part E
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
I will treat the variables with a varImp score greater than or equal to 50 as the “important” ones.
pls_importance <- varImp(pls_model)$importance %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(Overall >= 50) %>%
arrange(desc(Overall)) %>%
mutate(importance = row_number())
varImp(pls_model) %>%
plot(., top = max(pls_importance$importance), main = "Important Variables")
pls_importance %>%
mutate(Variable = gsub("[0-9]+", "", Variable)) %>%
group_by(Variable) %>%
tally() %>%
arrange(desc(n)) %>%
kable() %>%
kable_styling()
Variable | n |
---|---|
ManufacturingProcess | 9 |
BiologicalMaterial | 3 |
Only 12 variables meet the threshold, and nine of them are process predictors, so the process predictors dominate the list.
Part F
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future rounds of the manufacturing process?
# Pull Yield together with the important predictors identified above
important_vars <- df %>%
  select_at(vars(Yield, pls_importance$Variable))
# p-values for the pairwise correlations (used to mark non-significant cells in the plot)
important_vars_p <- cor.mtest(important_vars)$p
important_vars %>%
cor() %>%
corrplot(method = "color", type = "lower", order = "hclust",
tl.cex = 0.8, tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
p.mat = important_vars_p, sig.level = 0.05, diag = FALSE)
All of the selected variables are positively correlated with Yield. ManufacturingProcess32 is the most important variable; it is positively correlated with the biological variables and negatively correlated with ManufacturingProcess13.
Because the process predictors dominate the important variables and, unlike the biological measurements, can be adjusted, the manufacturer could focus on tuning the process settings most strongly associated with yield (for example ManufacturingProcess32) in future runs, while using the biological measurements to screen the quality of incoming raw material. Further exploration of these relationships could therefore translate directly into higher yields.