DATA 624 Homework 7
Question 6.2
Developing a model to predict permeability (See Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug:
Part A
Start R and use these commands to load the data:
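The data-loading chunk isn't displayed above; a minimal sketch (assuming the AppliedPredictiveModeling package is installed, along with the caret, tidyverse, kableExtra, and glmnet packages used later in this exercise) would be:
# Load the permeability data and the packages used throughout Question 6.2
# data(permeability) loads both the fingerprints matrix and the permeability response
library(AppliedPredictiveModeling)
library(caret)
library(tidyverse)
library(kableExtra)
library(glmnet)
data(permeability)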
The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.
Part B
The fingerprints predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many are left for modeling?
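A quick way to count the low-frequency fingerprints (a sketch, assuming the objects loaded above):
# Columns flagged as near-zero variance, i.e. substructures present in very few molecules
nzv <- nearZeroVar(fingerprints)
length(nzv)                       # 719 predictors are flagged for removal
ncol(fingerprints) - length(nzv)  # 388 predictors remain for modeling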
The nearZeroVar function flags 719 of the 1,107 fingerprints as near-zero variance. Removing them leaves 388 predictors for modeling.
Part C
Split the data into training and test sets, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of \(R^2\)?
I will split the data using an 80:20 split and center the predictors.
# Drop the near-zero-variance fingerprints and attach the permeability response
df <- as.data.frame(fingerprints[, -nearZeroVar(fingerprints)]) %>%
  mutate(y = as.vector(permeability))
# Make this reproducible
set.seed(42)
in_train <- createDataPartition(df$y, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
y ~ ., data = train_df, method = "pls",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
# Plot model RMSE vs different values of components
title <- paste("Training Set RMSE Minimized at",
pls_model$bestTune$ncomp,
"Components")
plot(pls_model, main = title)
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared) %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared |
---|---|---|
5 | 14.29229 | 0.263486 |
The optimal number of latent variables in the PLS model is 5, with a corresponding resampled \(R^2\) of about 0.26; the model captures roughly 26% of the variation in permeability.
Part D
Predict the response of the test set. What is the test set estimate of \(R^2\)?
# Make predictions
pls_predictions <- predict(pls_model, test_df)
# Model performance metrics
results <- data.frame(Model = "PLS",
RMSE = caret::RMSE(pls_predictions, test_df$y),
Rsquared = caret::R2(pls_predictions, test_df$y))
results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS | 13.23639 | 0.3016825 |
The test set \(R^2\) is about 0.30, slightly better than the resampled estimate of 0.26.
Part E
Try building other models discussed in this chapter. Do any have better predictive performance?
PCR Model
pcr_model <- train(
y ~ ., data = train_df, method = "pcr",
center = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
title <- paste("Training Set RMSE Minimized at",
               pcr_model$bestTune$ncomp,
               "Components")
plot(pcr_model, main = title)
# Make predictions
pcr_predictions <- predict(pcr_model, test_df)
# Model performance metrics
pcr_results <- data.frame(Model = "PCR",
RMSE = caret::RMSE(pcr_predictions, test_df$y),
Rsquared = caret::R2(pcr_predictions, test_df$y))
pcr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PCR | 15.90717 | 0.0344595 |
No improvement here.
Ridge Regression
# Build model matrices for glmnet, dropping the intercept column
x <- model.matrix(y ~ ., data = train_df)[, -1]
x_test <- model.matrix(y ~ ., data = test_df)[, -1]
# Pick lambda by cross-validation, then fit the ridge model (alpha = 0)
rr_cv <- cv.glmnet(x, train_df$y, alpha = 0)
rr_model <- glmnet(x, train_df$y, alpha = 0, lambda = rr_cv$lambda.min)
rr_predictions <- as.vector(predict(rr_model, x_test))
rr_results <- data.frame(Model = "Ridge Regression",
RMSE = caret::RMSE(rr_predictions, test_df$y),
Rsquared = caret::R2(rr_predictions, test_df$y))
rr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Ridge Regression | 14.59042 | 0.1426659 |
PLS performed better.
Lasso Regression
lr_cv <- cv.glmnet(x, train_df$y, alpha = 1)
lr_model <- glmnet(x, train_df$y, alpha = 1, lambda = lr_cv$lambda.min)
lr_predictions <- as.vector(predict(lr_model, x_test))
lr_results <- data.frame(Model = "Lasso Regression",
RMSE = caret::RMSE(lr_predictions, test_df$y),
Rsquared = caret::R2(lr_predictions, test_df$y))
lr_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Lasso Regression | 13.98552 | 0.2417885 |
PLS has more explanatory power.
Elastic Net Regression
en_model <- train(
y ~ ., data = train_df, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Best tuning parameters
en_model$bestTune
alpha | lambda |
---|---|
0.7 | 1.333236 |
# Make predictions on the held-out test set
en_predictions <- predict(en_model, test_df)
# Model performance metrics
en_results <- data.frame(Model = "Elastic Net Regression",
RMSE = caret::RMSE(en_predictions, test_df$y),
Rsquared = caret::R2(en_predictions, test_df$y))
en_results %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
Elastic Net Regression | 14.08593 | 0.2055749 |
There isn’t any improvement over PLS.
Summary
# Note: the PLS row reports the resampled CV metrics; the other rows are test-set metrics
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
mutate("Model" = "PLS") %>%
select(Model, RMSE, Rsquared) %>%
as.data.frame() %>%
bind_rows(pcr_results) %>%
bind_rows(rr_results) %>%
bind_rows(lr_results) %>%
bind_rows(en_results) %>%
arrange(desc(Rsquared)) %>%
kable() %>%
kable_styling()
Model | RMSE | Rsquared |
---|---|---|
PLS | 14.29229 | 0.2634860 |
Lasso Regression | 13.98552 | 0.2417885 |
Elastic Net Regression | 14.08593 | 0.2055749 |
Ridge Regression | 14.59042 | 0.1426659 |
PCR | 15.90717 | 0.0344595 |
None of the alternative models achieved a higher \(R^2\) than the PLS model, whether judged by the resampled estimate above or by the test-set \(R^2\) (0.30 for PLS versus 0.24 or less for the others).
Part F
Would you recommend any of your models to replace the permeability laboratory experiment?
No. Even the best model (PLS) achieved an \(R^2\) of only about 0.26 (resampled) to 0.30 (test set), so it explains too little of the variation in permeability to replace the laboratory experiment.
Question 6.3
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw materials before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:
Part A
Start R and use these commands to load the data:
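The loading chunk isn't displayed; a minimal sketch (assuming the AppliedPredictiveModeling package, plus the corrplot package used in Part F) would be the following. Note that the package ships the data as a single data frame, ChemicalManufacturingProcess, with Yield in the first column followed by the 57 predictors:
# Load the chemical manufacturing data and the extra package used in this exercise
library(AppliedPredictiveModeling)
library(corrplot)
data(ChemicalManufacturingProcess)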
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, while yield contains the percent yield for each run.
Part B
A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
I will use KNN to impute the missing values. The caret package makes this simple:
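The imputation chunk isn't displayed; a minimal sketch, assuming the ChemicalManufacturingProcess data frame loaded above, would be:
# K-nearest-neighbour imputation of the missing values.
# caret's knnImpute also centers and scales every numeric column as part of the
# imputation, so the data frame used below is on the standardized scale.
impute_model <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(impute_model, ChemicalManufacturingProcess)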
Part C
Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
I will pre-process the data by removing the near-zero-variance variables prior to splitting it, as sketched below.
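A sketch of that step, assuming the imputed data frame df from Part B:
# Drop any near-zero-variance columns; Yield varies and is therefore retained
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]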
Now the data will be split into training and test sets using an 80:20 split.
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
Given the performance of PLS in the previous exercise, I will use it again. I’m going to further pre-process the data by centering and scaling the predictors.
pls_model <- train(
Yield ~ ., data = train_df, method = "pls",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
# Plot model RMSE vs different values of components
title <- paste("Training Set RMSE Minimized at",
pls_model$bestTune$ncomp,
"Components")
plot(pls_model, main = title)
For the performance metrics I will again look at the RMSE and the \(R^2\). The optimal value for RMSE would be zero, and the optimal \(R^2\) would be 1. Here are the performance metrics for the optimal model on the training set.
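For reference, caret computes these two metrics as the root mean squared error and the squared correlation between the observed and predicted responses:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = \operatorname{cor}\left(y, \hat{y}\right)^2
\]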
pls_model$results %>%
filter(ncomp == pls_model$bestTune$ncomp) %>%
select(ncomp, RMSE, Rsquared) %>%
kable() %>%
kable_styling()
ncomp | RMSE | Rsquared |
---|---|---|
3 | 0.7363066 | 0.5244517 |
Part D
Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
# Make predictions
pls_predictions <- predict(pls_model, test_df)
# Model performance metrics
results <- data.frame(RMSE = caret::RMSE(pls_predictions, test_df$Yield),
Rsquared = caret::R2(pls_predictions, test_df$Yield))
results %>%
kable() %>%
kable_styling()
RMSE | Rsquared |
---|---|
0.6192577 | 0.6771122 |
We have an RMSE of about 0.62 and an \(R^2\) of roughly 0.68 on the test set. Both are better than the resampled training estimates (RMSE 0.74, \(R^2\) 0.52), so the model holds up well on the held-out data.
Part E
Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
I will treat the variables with a varImp score greater than or equal to 50 as the “important” ones.
pls_importance <- varImp(pls_model)$importance %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
filter(Overall >= 50) %>%
arrange(desc(Overall)) %>%
mutate(importance = row_number())
varImp(pls_model) %>%
plot(., top = max(pls_importance$importance), main = "Important Variables")
pls_importance %>%
mutate(Variable = gsub("[0-9]+", "", Variable)) %>%
group_by(Variable) %>%
tally() %>%
arrange(desc(n)) %>%
kable() %>%
kable_styling()
Variable | n |
---|---|
ManufacturingProcess | 9 |
BiologicalMaterial | 3 |
Only 12 variables meet the threshold, and nine of them are process predictors, so the process predictors dominate the list.
Part F
Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future rounds of the manufacturing process?
# Pull Yield together with the important predictors identified above
important_vars <- df %>%
  select_at(vars(Yield, pls_importance$Variable))
# p-values for the pairwise correlations (used to mark non-significant cells in the plot)
important_vars_p <- cor.mtest(important_vars)$p
important_vars %>%
cor() %>%
corrplot(method = "color", type = "lower", order = "hclust",
tl.cex = 0.8, tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7,
p.mat = important_vars_p, sig.level = 0.05, diag = FALSE)
All of the selected variables are positively correlated with Yield. ManufacturingProcess32 is the most important variable; it is positively correlated with the biological variables and negatively correlated with ManufacturingProcess13.
Because the process predictors dominate the important variables and, unlike the biological measurements, can be adjusted, the manufacturer could focus on tuning the process settings most strongly associated with yield (for example ManufacturingProcess32) in future runs, while using the biological measurements to screen the quality of incoming raw material. Further exploration of these relationships could therefore translate directly into higher yields.