library(tidymodels)
library(caret)
library(mice)
library(dlookr)
library(flextable)
library(tidyverse)
library(plsmod)
library(glmnet)
library(vip)

library(AppliedPredictiveModeling) # datasets

1 Background

This report is a response to problems 6.2 and 6.3 in Applied Predictive Modeling (Kuhn and Johnson). The tidymodels framework will be used for model development and evaluation.

2 Permeability Study

6.2. Developing a model to predict permeability (see Sect. 1.4) could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

  1. Start R and use these commands to load the data:
data(permeability)

The matrix fingerprints contains the 1,107 binary molecular predictors for the 165 compounds, while permeability contains the permeability response.

#Create working matrices

response<-permeability

covariates<-fingerprints
  2. The fingerprint predictors indicate the presence or absence of substructures of a molecule and are often sparse, meaning that relatively few of the molecules contain each substructure. Filter out the predictors that have low frequencies using the nearZeroVar function from the caret package. How many predictors are left for modeling?
#collect near zero variance covariates -- adapted from https://stackoverflow.com/questions/28043393/nearzerovar-function-in-caret

zvar<-nearZeroVar(covariates)

# keep only predictors with non-negligible variance
if(length(zvar) > 0) {
  nozero <- covariates[, -zvar]
} else {
  nozero <- covariates
}

num<-dim(nozero)[2] # 388 predictors remain

str_glue('There are {num} predictors left after removing those with near zero variance. This can also be accomplished as a recipe step (using step_nzv()) in tidymodels.')
## There are 388 predictors left after removing those with near zero variance. This can also be accomplished as a recipe step (using step_nzv()) in tidymodels.
  3. Split the data into a training and a test set, pre-process the data, and tune a PLS model. How many latent variables are optimal and what is the corresponding resampled estimate of R2?

As an initial step, I will split the data and create resamples using 10-fold cross-validation.

# combine response and predictors into single dataset

perm<-cbind(response, covariates)%>%as_tibble()

# create test and train sets

set.seed(10)

# create random data split (default 75/25 proportion)

perm.split <- initial_split(perm)

# create train and test sets from split

perm.train <- training(perm.split) # 123, 1108

perm.test  <- testing(perm.split) # 42, 1108

# create resamples for model tuning

set.seed(20)

cv_folds<-
  vfold_cv(perm.train, v=10) 

Now I specify the PLS regression model, create a preprocessing recipe, and build a workflow.

Note that step_nzv() removes sparse, unbalanced, and near-zero-variance predictors (see code block): 400 covariates remained after applying the recipe to the training set (a quick count appears after the prep/juice step below).

#this codeblock is adapted from: https://stackoverflow.com/questions/64582463/how-do-i-specify-a-pls-model-in-tidy-models



# create preprocessing recipe

model.recipe <- 
  recipe(permeability ~., data=perm.train)%>% 
  step_nzv(all_predictors())%>%
  step_normalize(all_numeric_predictors()) # scale and center

# prep the recipe and extract the processed training data for review

perm.prepped <- prep(model.recipe)

perm.juiced <- juice(perm.prepped)
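# quick check: count the predictors kept by step_nzv() on the training set
ncol(perm.juiced) - 1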


# specify pls regression model

pls.spec <- pls(num_comp = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("mixOmics")
  

# build model workflow and fit training model

pls.wf <- 
  workflow() %>% 
  add_model(pls.spec) %>% 
  add_recipe(model.recipe)


#create grid

pls.grid <- tibble(num_comp = seq(from = 1, to = 20, by = 1)) # evaluates 20 different components

# select fit metrics

pls.metrics<-metric_set(rmse, rsq)

# tune


set.seed(30)

pls.tune <-
  pls.wf%>%
  tune_grid(
    cv_folds,
    grid = pls.grid,
    metrics = pls.metrics
  )

autoplot(pls.tune)+theme_light()+labs(title='Figure 1. Hyperparameter Tuning for PLS Model')


best.pls<-select_best(pls.tune, 'rmse')  # num_comp = 2

The next step is to update the PLS model with the selected hyperparameter (num_comp), chosen on the basis of RMSE. The selected value is num_comp = 2.

# update PLS model 

pls.updated<-finalize_model(pls.spec, select_best(pls.tune, 'rmse')) #with tuning specs
  4. Predict the response for the test set. What is the test set estimate of R2?

The test set RSQ estimate is 0.71.

# fit updated model to test set

pls.testfit<-last_fit(pls.updated,  permeability ~., split = perm.split)

# report fit metrics

collect_metrics(pls.testfit) %>% 
  filter(.metric == "rsq") %>% 
  mutate(model = "Partial Least Squares") %>% 
  dplyr::select(model, .metric, .estimate) %>%
  rename(c(Metric = .metric, Estimate = .estimate))%>%
  flextable()%>%
  set_caption('Table 2. Goodness of Fit (RSQ) for Partial Least Squares')
  5. Try building other models discussed in this chapter. Do any have better predictive performance?

I will fit models for Ridge Regression, Lasso Regression, and Elastic Net Regression.

Comparing test-set RSQ scores (Table 3), it is clear that PLS outperformed the other models for this dataset.

# specify ridge regression model

ridge.spec <- linear_reg(penalty = tune(), mixture = 0) %>%  #note mixture is equivalent to alpha in glmnet, L2 only
  set_mode("regression") %>% 
  set_engine("glmnet")
  
# specify lasso regression model

lasso.spec <- linear_reg(penalty = tune(), mixture = 1) %>%  #note 1 = L1 only
  set_mode("regression") %>% 
  set_engine("glmnet")

# specify elastic net  model

elastic.spec <- linear_reg(penalty = tune(), mixture = .5) %>%  #note combo of L1 & L2
  set_mode("regression") %>% 
  set_engine("glmnet")


# build ridge workflow 

ridge.wf <- 
  workflow() %>% 
  add_model(ridge.spec) %>% 
  add_recipe(model.recipe)

# build lasso workflow

lasso.wf <- 
  workflow() %>% 
  add_model(lasso.spec) %>% 
  add_recipe(model.recipe)


#build elastic net workflow

elastic.wf <- 
  workflow() %>% 
  add_model(elastic.spec) %>% 
  add_recipe(model.recipe)


#create grid

models.grid <- grid_regular(penalty(), levels=5) 

# select fit metrics

models.metrics<-metric_set(rmse, rsq)

# tune the models


set.seed(40)

ridge.tune <-
  ridge.wf%>%
  tune_grid(
    cv_folds,
    grid = models.grid,
    metrics = models.metrics
  )


lasso.tune <-
  lasso.wf%>%
  tune_grid(
    cv_folds,
    grid = models.grid,
    metrics = models.metrics
  )


elastic.tune <-
  elastic.wf%>%
  tune_grid(
    cv_folds,
    grid = models.grid,
    metrics = models.metrics
  )


# collect best CV fits 

best.ridge<-select_best(ridge.tune, 'rmse')

best.lasso<-select_best(lasso.tune, 'rmse')

best.elastic<-select_best(elastic.tune, 'rmse')


# update models with best parameter fit


ridge.updated<-finalize_model(ridge.spec, select_best(ridge.tune, 'rmse')) #with tuning specs

lasso.updated<-finalize_model(lasso.spec, select_best(lasso.tune, 'rmse'))

elastic.updated<-finalize_model(elastic.spec, select_best(elastic.tune, 'rmse'))


# fit updated model to test set

ridge.testfit<-last_fit(ridge.updated,  permeability ~., split = perm.split)

lasso.testfit<-last_fit(lasso.updated,  permeability ~., split = perm.split)

elastic.testfit<-last_fit(elastic.updated,  permeability ~., split = perm.split)

# report fit metrics

collect_metrics(ridge.testfit) %>% 
  bind_rows(collect_metrics(lasso.testfit)) %>%
  bind_rows(collect_metrics(elastic.testfit)) %>% 
  bind_rows(collect_metrics(pls.testfit)) %>% 
  filter(.metric == "rsq") %>% 
  mutate(model = c("Ridge", "Lasso", "Elastic", "PLS")) %>% 
  dplyr::select(model, .metric, .estimate) %>%
  rename(c(Metric = .metric, Estimate = .estimate))%>%
  flextable()%>%
  set_caption('Table 3. Goodness of Fit (RSQ) Comparison Across Models')
  6. Would you recommend any of your models to replace the permeability laboratory experiment?

Of the four models, I would recommend PLS on the basis of its RMSE score and a visual assessment of model fit (Figures 2-5). However, it is not yet clear to me whether a nonlinear model might have more predictive power in this experiment.
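As an illustration only (not fit in this report), here is a minimal sketch of how a nonlinear MARS model could be slotted into the same recipe and cross-validation folds. It assumes the earth package is available, and the object names mars.spec and mars.wf are introduced purely for this sketch.

# sketch: a nonlinear MARS model reusing model.recipe and cv_folds from above

mars.spec <- mars(num_terms = tune(), prod_degree = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("earth")

mars.wf <- 
  workflow() %>% 
  add_model(mars.spec) %>% 
  add_recipe(model.recipe)

# the model could then be tuned with the existing folds, e.g.:
# tune_grid(mars.wf, cv_folds, grid = 20, metrics = pls.metrics)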

# plot scatterplots

ridge.testfit%>%
    collect_predictions() %>%
    ggplot(aes(permeability, .pred)) +
    #geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 2. Ridge Regression Model', subtitle='Permeability', x='Observed', y='Predicted')+
    theme_light()

lasso.testfit%>%
    collect_predictions() %>%
    ggplot(aes(permeability, .pred)) +
    #geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 3. Lasso Regression Model', subtitle='Permeability', x='Observed', y='Predicted')+
    theme_light()

elastic.testfit%>%
    collect_predictions() %>%
    ggplot(aes(permeability, .pred)) +
    #geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 4. Elastic Net Regression Model (50% L1 & L2)', subtitle='Permeability', x='Observed', y='Predicted')+
    theme_light()

pls.testfit%>%
    collect_predictions() %>%
    ggplot(aes(permeability, .pred)) +
    #geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 5. PLS Regression Model', subtitle='Permeability', x='Observed', y='Predicted')+
    theme_light()

3 Chemical Manufacturing Study

6.3. A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1 % will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
data(ChemicalManufacturingProcess)

chem<- ChemicalManufacturingProcess # dim = 176, 58 

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values.

There are 26 covariates with missing data; the greatest percentage of missingness is ~8.5% (Table 4). Given the low percentage of missing observations, I am not overly concerned about nonrandom NA values and will impute using KNN as part of the tidymodels recipe steps (a brief sketch follows).
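For reference, a minimal sketch of that imputation step on its own (neighbors = 5 is the recipes default; the full preprocessing recipe appears later in this section, and impute.rec is a name used only for this sketch):

# sketch: KNN imputation expressed as a recipe step

impute.rec <- recipe(Yield ~., data=chem)%>%
  step_impute_knn(all_predictors(), neighbors = 5)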

#calculate percentage of NA in each column

missing<-chem%>%
  purrr::map(~ 100*mean(is.na(.)))

# create a table for missingness

chem%>%
    diagnose()%>%
    dplyr::select(-unique_count, -unique_rate)%>%
    filter(missing_count>0)%>%
    arrange(desc(missing_count))%>%
    flextable()%>%
    set_caption("Table 4. Missing Data Summary")
# alternative (not used): multiple imputation with mice via predictive mean matching;
# KNN imputation is instead handled as a recipe step below

#temp<-mice(chem,m=5,maxit=50,meth='pmm',seed=500)

#chem.2<-complete(temp,1) # create completed dataframe from the first imputation

# assess numerics

chem%>%
  diagnose_numeric()%>%
  flextable()%>%
  set_caption("Table 5. Univariate Summary for Imputed Dataset")

I will also assess multicollinearity among the covariates. From the correlation coefficients presented in Table 6, it is clear that multicollinearity is a significant issue in this dataset.

# assess pairwise correlation among numeric covariates (|r| > 0.8 shown)

chem%>%
    correlate()%>%
    filter(coef_corr > .8 | coef_corr < -.8)%>% # set thresholds to limit output 
    arrange(desc(coef_corr))%>%
    flextable()%>%
    set_caption("Table 6. Pairwise Correlation: Numerical Covariates")
  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
# create test and train sets

set.seed(100)

# create random data split (default 75/25 proportion)

chem.split <- initial_split(chem)

# create train and test sets from split

chem.train <- training(chem.split) # 132, 58

chem.test  <- testing(chem.split) # 44, 58

# create resamples for model tuning

set.seed(101)

chem_folds<-
  vfold_cv(chem.train, v=10) 

I will evaluate an elastic net model for the purposes of this exercise.

First I build the recipe, workflow, and tuning grid; then I train the model and evaluate it on the test set. Performance metrics for the training set are listed in Table 7.

# create a preprocessing recipe

#based on https://stackoverflow.com/questions/64254295/predictor-importance-for-pls-model-trained-with-tidymodels

chem.recipe <- 
  recipe(Yield ~., data=chem.train)%>% 
  step_impute_knn(all_predictors()) %>%
  step_BoxCox(all_predictors())%>%
  step_normalize(all_predictors())%>% 
  step_nzv(all_predictors()) %>% 
  step_corr(all_predictors()) # Removes variables that have large absolute correlations with other variables


# specify elastic net  model

elastic.chem.spec <- linear_reg(penalty = tune(), mixture = .5) %>%  #note combo of L1 & L2
  set_mode("regression") %>% 
  set_engine("glmnet")

#build elastic net workflow

elastic.chem.wf <- 
  workflow() %>% 
  add_model(elastic.chem.spec) %>% 
  add_recipe(chem.recipe)


#create elastic grid

elastic.grid <- grid_regular(penalty(), levels=5) 

# select fit metrics

models.metrics<-metric_set(rmse, rsq)

# tune the model

set.seed(101)

elastic.chem.tune <-
  elastic.chem.wf%>%
  tune_grid(
    chem_folds,
    grid = elastic.grid,
    metrics = models.metrics
  )

# collect best CV fits 


best.chem.elastic<-select_best(elastic.chem.tune, 'rmse')


# collect training metrics

enet.train.metrics<-collect_metrics(elastic.chem.tune)%>%
  filter(.config %in% 'Preprocessor1_Model5')%>% # the .config corresponding to the selected penalty (best.chem.elastic)
  dplyr::select(c(.metric, mean))%>%
  mutate(model = 'Elastic Net')


enet.train.metrics%>%
  pivot_wider(names_from = .metric, values_from = mean)%>%
  flextable()%>%
  set_caption('Table 7. Performance Metrics for Elastic Net Training Model')

The optimal value of the Elastic Net tuning parameter is as follows:

  • Elastic Net: penalty = 1
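For reference, the selected value can be printed directly from the tuning object created above:

# the penalty selected during tuning (finalized into the model below)
best.chem.elastic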
  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?

The test-set performance metrics for the Elastic Net model are included in Table 8. The RMSE score is lower on the test set than on the training set (see Table 7), while the reverse is true for the RSQ scores. The predicted vs. observed values for the test set are plotted in Figure 6.

# update models with best parameter fit


elastic.chem.updated<-finalize_model(elastic.chem.spec, best.chem.elastic)

# fit updated model to test set

elastic.chem.testfit<-last_fit(elastic.chem.updated,  Yield ~., split = chem.split)


# report fit metrics

m1<-elastic.chem.testfit%>%
  collect_metrics()%>%
  dplyr::select(.metric, .estimate)%>%
  pivot_wider(names_from = .metric, values_from = .estimate)%>%
  mutate(model="Elastic Net")%>%
  relocate(model, .before=rmse)

m1%>%
  flextable()%>%
  set_caption('Table 8. Goodness of Fit for Elastic Net')
#plot obs vs. predicted

elastic.chem.testfit%>%
    collect_predictions() %>%
    ggplot(aes(Yield, .pred)) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 6. Elastic Net Regression Model (50% L1 & L2): Manufacturing Dataset', subtitle='Response Variable = Yield', x='Observed', y='Predicted')+
    theme_light()

  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?

Manufacturing process predictors dominate the list. These include ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess13.

#adapted from https://stackoverflow.com/questions/64254295/predictor-importance-for-pls-model-trained-with-tidymodels

# specify model based on the optimized parameter for penalty

e.spec <- linear_reg(penalty = 1, mixture = .5) %>%  #note combo of L1 & L2
  set_mode("regression") %>% 
  set_engine("glmnet")

#build workflow for e.spec

tidy.wf<-
  workflow()%>%
  add_model(e.spec)%>%
  add_recipe(chem.recipe)

fit.elastic<- fit(tidy.wf, chem.train)

# collect loadings

info <- fit.elastic%>%
  pull_workflow_fit()%>%
  tidy()

# plot variable importance

var.imp<-
  info %>%
  filter(term != "(Intercept)") %>% # drop the intercept row
  dplyr::select(term, estimate)%>%
  arrange(desc(abs(estimate)))%>%
  filter(estimate != 0) # keep only predictors retained at this penalty

var.imp%>%
  ggplot(aes(estimate, fct_reorder(term, estimate))) +
  geom_col(show.legend = FALSE, fill='midnightblue') +
  labs(title='Figure 7. Variable Importance Elastic Net Model', y = NULL)+
  theme_classic()
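As a cross-check on Figure 7, the vip package (loaded at the top of this report but otherwise unused) can compute the same coefficient-based importance scores from the fitted workflow. A brief sketch, assuming the tuned penalty of 1:

# sketch: coefficient-based importance via vip at the tuned penalty

fit.elastic%>%
  pull_workflow_fit()%>%
  vi(lambda = 1)%>% # returns Variable, Importance, Sign
  filter(Importance != 0)%>%
  slice_max(Importance, n = 15)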

  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

From Figure 7, it's clear that ManufacturingProcess32 has a large positive influence on Yield (estimate = 0.36), while the opposite is true for ManufacturingProcess13 (estimate = -0.24). It follows that scaling back processes with negative coefficient estimates could serve to increase Yield, as could investing in processes with positive estimates. That said, it may be worthwhile to investigate any interactions among covariates that could countervail these outcomes. A sketch of one way to visualize these relationships follows.
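This is a minimal sketch only: it plots the three process predictors named above against Yield using the raw chem data (rows with missing values are simply dropped by ggplot with a warning).

# sketch: scatterplots of the top process predictors vs. Yield

chem%>%
  dplyr::select(Yield, ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess13)%>%
  pivot_longer(-Yield, names_to = 'predictor', values_to = 'value')%>%
  ggplot(aes(value, Yield)) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  geom_smooth(method = 'lm', se = FALSE, color = 'gray50') +
  facet_wrap(~predictor, scales = 'free_x') +
  labs(title='Top Process Predictors vs. Yield', x='Predictor value', y='Yield')+
  theme_light()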