Project Premise

This is a role-playing exercise. I am your new boss, in charge of production at ABC Beverage, and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.

Data Exploration

Our data has already been split into a training and testing dataset, so these are loaded in easily.

# import train and test data
train <- read_csv("training_data.csv")
test <- read_csv("testing_data.csv")

Exploring the train dataset, we see there are 2,571 cases and 33 variables, all of which are numeric except for one character variable, Brand Code, which has 4 unique values.

With regard to missingness, no variable has an alarming amount of missing data; MFR has the most, with 212 values missing and a complete-case rate of 91.8%. We’ll consider imputing missing values depending on the types of models we choose to try later. We also observe drastically different scales and ranges among the numeric variables, so scaling and centering will be considered depending on the models selected.

skim(train)
Data summary
Name train
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
character 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆

To take a closer look at our missingness, we check for frequent patterns in the missing data. For 100 cases only the MFR value is missing, and for 91 cases only the Brand Code is missing. These are the top two patterns of missingness, and we feel comfortable that these values are ‘missing completely at random’ rather than the result of some underlying interaction or phenomenon.

# plotting patterns in missingness, top 15 patterns
train %>% 
  plot_na_intersect(only_na = TRUE, typographic = FALSE, n_intersacts = 15)

For our numeric variables with missing values, we first check the shapes of these variables to see whether a mean or median imputation might suffice. From the histograms we see many variables that are not normally distributed and many with outliers. This rules out mean imputation being useful, and while a median imputation could suffice, we’ll choose knn imputation when model-building since, with this smaller dataset, it won’t be computationally challenging.

train %>%
  #histograms only for numeric variables
  subset(select = -`Brand Code`) %>%
  #reshape
  gather() %>% 
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  labs(title = "Checking Distribution of Numeric Predictor Variables")
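
As a quick preview of the knn imputation we just committed to (a minimal sketch using recipes’ step_impute_knn(); the same step is applied formally inside the model recipes later), we can confirm that it leaves no missing values among the predictors:

# illustrative check: knn-impute the predictors and count remaining missing values
impute.check <- recipe(PH ~ ., data = train) %>%
  step_impute_knn(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

colSums(is.na(impute.check)) # predictors show 0; PH, the outcome, keeps its 4 missing values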

To see if any of the numeric variables may be better served as factors, we look at the above histograms and the following output that shows the count of unique values for each variable. We don’t see any numeric variables that appear to be dichotomous or with only a few levels/values, so we will not switch any of the numeric variables to factors.

var_unique <- lapply(train, unique)
lengths(var_unique)
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##                 5               102                93               455 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##               107               124               130                33 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                14               488               141               109 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##               246               208               193                41 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##               289               245                57               482 
##         Carb Flow           Density               MFR           Balling 
##               534                79               588               218 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                16                53               339                12 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                 9                32                54                43 
##       Balling Lvl 
##                83

As it’s hard to visualize all of these variables and their possible correlations, we check numerically for multicollinearity in the training data. We find many variable pairs with correlation coefficients greater than 0.8 in absolute value. Knowing this, we’ll employ the step_corr() step from the recipes package to determine which features should be removed due to collinearity.

train %>%
    #filter_if(is.numeric)%>%
    correlate() %>%
    filter(coef_corr > .8 | coef_corr < -.8)%>% # set thresholds to limit output 
    arrange(desc(abs(coef_corr))) %>%
    flextable() %>%
  set_caption('Table 1. Correlated Variables')

Model Development

Based on findings in our initial exploratory data analysis, we will evaluate a set of machine learning models to predict PH that make no assumptions about the underlying variable distributions, are appropriate for continuous response variables, can fit nonlinear relationships, and are robust to both missing values and outliers.

The following predictive models will be trained and evaluated here:

  • Basic Decision Tree
  • Random Forest
  • Support Vector Machine (SVM)
  • Neural Net (single layer, forward) (NNET)
  • Multivariate Adaptive Regression Splines (MARS)

Model Characteristics

Decision Tree: can capture nonlinear relationships, needs little data preprocessing (no normalizing required), is easy to interpret, and gives a sense of feature importance; however, it is sensitive to collinearity and to changes in the input data, and tends to be less accurate than the other models.

Random Forest: one of the most accurate tree-based models; works well on large datasets, can be used to extract variable importance, and does not require feature engineering (scaling and normalization); however, it tends to overfit noisy data, is hard to interpret, and requires careful tuning.

SVM: handles complex, high-dimensional data with a low risk of overfitting and is memory efficient, but is difficult to interpret.

NNET: works with little data preprocessing, handles nonlinear data with a large number of inputs, and can exploit parallel processing; however, it has low interpretability and is computationally intensive.

MARS: automatic feature selection, good interpretability, little data preprocessing required, and handles both continuous and categorical predictors.

Train and Test Splits

Our data has already been split into a training and testing dataset. To use these data in a Tidymodels modeling process, we will combine the two and reverse engineer our split.

# The following code is adapted from: https://stackoverflow.com/questions/64004132/tidymodels-creating-an-rsplit-object-from-training-and-testing-data



# combine train and test data

data<-rbind(train,test) # combine data

data%<>%
  rename(BrandCode = 'Brand Code')%>%
  filter(!is.na(PH))%>%
  mutate_if(is.character, as.factor)

# calculate proportion of data from train

prop<-nrow(train)/(nrow(train)+nrow(test)) 

# create tidymodels split

data.split<-initial_time_split(data, prop=prop) # split without randomizing

train.split<-training(data.split) # create training set from 'data'

test.split<-testing(data.split) # create test set to graph results of last_fit

# compare train and train.split (they should differ only in the renamed Brand Code column)

all_equal(train, train.split)
## Cols in `y` but not `x`: `BrandCode`.
## Cols in `x` but not `y`: `Brand Code`.

Cross-Validation

We start model development by setting up our resampling scheme for parameter optimization and model selection. Here we will use 10-fold cross-validation, stratified on our response variable PH.

# create resamples for model tuning

set.seed(0501)

cv<-
  vfold_cv(train.split, 
           v=10,
           strata= 'PH')

Tidymodel Recipes

Next, we conduct our data preprocessing steps on the training data using the Recipes package. Two recipes are represented here: one for our MARS model and another for the remaining models. These steps include the following:

  1. MARS: impute missing values, dummy code factor variables
  2. Others: impute missing values, normalize numeric data, dummy-code factor variables, remove multicollinear predictors

We note that multicollinearity can impact tree-based models and that SVM requires normalizing (scaling and centering) prior to model training. We also impute missing values to enable parameter estimation via grid-tuning. Our earlier tuning attempts failed due to missingness.

# build recipe for dtree, svm, nnet, rf models

train.recipe<-
  recipe(PH ~., data=train.split)%>%
  step_impute_knn(all_predictors())%>%
  step_normalize(all_numeric_predictors())%>% # center and scale the predictors (needed for svm and nnet); leave the outcome PH on its original scale
  step_dummy(all_nominal()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.8) # drop highly correlated predictors

# build recipe for mars model

mars.recipe<- 
  recipe(PH ~ ., data = train.split) %>%
  step_impute_knn(all_predictors())%>%
  step_dummy(all_nominal()) 
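
Though not required for the workflows below, we can sanity-check this preprocessing by prepping the recipe on the training split and asking which predictors the correlation filter removes (a quick sketch using the standard prep()/bake()/tidy() helpers from recipes; step number 4 refers to step_corr’s position in train.recipe):

# prep the recipe on the training split and inspect the result
prepped <- prep(train.recipe, training = train.split)

tidy(prepped, number = 4)            # predictors dropped by step_corr
ncol(bake(prepped, new_data = NULL)) # columns remaining after preprocessing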

Model Specification

Our model specifications include selected tuning parameters specific to each model. These parameters will be optimized over a tuning grid during the racing step below.

#Specify Tree

dtree.spec<-decision_tree(
  tree_depth = tune(), # creating tuneable parameters
  min_n = tune(),
  cost_complexity = tune())%>%
    set_engine("rpart")%>% # ctree - tree_depth, min_n
    set_mode("regression")

# Specify SVM

svm.spec<-svm_rbf(
  cost = tune(), 
  margin = tune(),
  rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

# specify random forest

rf.spec <-rand_forest(
    mtry = tune(),
    trees = tune(),
    min_n = tune()) %>%
    set_mode("regression")%>%
    set_engine("ranger")

# Specify Neural Net

nnet.spec<-mlp(
  hidden_units = tune(), 
  penalty = tune(),
  epochs = tune()) %>% 
  set_engine("nnet", MaxNWts = 2600) %>% 
  set_mode("regression")

# a single-layer neural network should have up to 27 hidden units (Kuhn & Johnson)

nnet.param <- 
   nnet.spec %>% 
   extract_parameter_set_dials() %>% 
   update(hidden_units = hidden_units(c(1, 27)))

# Specify MARS

mars.spec <-
  mars(mode = "regression", 
       num_terms = tune(), 
       prod_degree = tune()) %>%
  set_mode("regression")%>%
  set_engine("earth")

Model Workflows

We now construct our model workflows with the workflow_set() function and then combine the two sets with bind_rows() to streamline our training process. The workflows themselves link each model’s recipe and specification.

#build workflow for recipe with normalization

models.wf<-
  workflow_set(
    preproc=list(normalized = train.recipe),
    models = list(Dec.Tree = dtree.spec, SVM.radial = svm.spec, R.Forest = rf.spec, NNet = nnet.spec)
  )

mars.wf<-
  workflow_set(
    preproc=list(simple = mars.recipe),
    models = list(MARS = mars.spec)
  )

#combine workflows 

combined.wf<-
  bind_rows(models.wf, mars.wf)%>%
  mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))
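
One note on the nnet.param object defined earlier: on its own it does not change the tuning ranges. A way to have the tuning step honor the 1-27 hidden-unit bound is to attach the parameter set to the neural net workflow as an option (a sketch assuming the workflowsets option_add() interface; the id matches the workflow id after the renaming above):

# attach the custom parameter set so tuning respects the hidden_units bound
combined.wf <- combined.wf %>%
  option_add(param_info = nnet.param, id = "NNet")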

Model Parameter Tuning

Next, we optimize each model’s parameters over a 20-point grid and fit the training data in one step, using ANOVA-based racing (tune_race_anova() from the finetune package) to drop poorly performing parameter combinations early. The fitting process is still time intensive, so we also save the results to an .RData file that can be reloaded later.

# speed grid estimation

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )

grid.results <-
   combined.wf %>%
   workflow_map(     
      'tune_race_anova', # ANOVA-based racing from the finetune package
      seed = 05402112,
      resamples = cv,
      grid = 20, # setting up grid size of 20
      control = race.ctrl,
      verbose = TRUE
   )

# review and save tuning results

save(grid.results, file='grid.results.Rdata')

#load('grid.results.Rdata')
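
Because the racing step is expensive, on subsequent runs we can skip it whenever a cached result already exists. A minimal sketch, assuming the grid.results.Rdata file written by the save() call above sits in the working directory:

# reuse cached tuning results when available instead of re-fitting
if (file.exists('grid.results.Rdata')) {
  load('grid.results.Rdata')
} else {
  message('No cached tuning results found; run the tuning step above first.')
}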

Training Fit: Performance Metrics

We use RMSE (root mean squared error) to evaluate and compare each model’s performance in fitting the training data.
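
For reference, RMSE is the square root of the mean squared difference between observed and predicted PH. A tiny illustration with made-up values, using yardstick’s rmse_vec():

# RMSE = sqrt(mean((observed - predicted)^2)); each residual here is 0.1, so RMSE = 0.1
rmse_vec(truth = c(8.5, 8.6, 8.4), estimate = c(8.4, 8.7, 8.5))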

From Table 2 and Figure 1, it is clear that the MARS model outperformed the other models, posting the lowest RMSE and the smallest standard error.

# rank results

rank_results(grid.results, rank_metric = "rmse")%>%
  flextable()%>%
  set_caption('Table 2. Model Performance Metrics for Grid Tuning')
# plot rmse tuning results for each model

autoplot(
  grid.results,
  rank_metric = "rmse",  # <- how to order models
  metric = "rmse",       # <- which metric to visualize
  select_best = TRUE) +     # <- one point per workflow
  lims(y = c(0, 0.75)) +
  labs(title='Figure 1. Estimated RMSE for Best Model Configuration', x='')+
  theme_light()

We can also extract and compare the best-performing tuned configuration for each algorithm. These estimates are included in Table 3.

 # list best rmse results by model

collect_metrics(grid.results)%>%
  filter(.metric == c("rmse"))%>%
  dplyr::select(model, .metric, mean, .config) %>%
  group_by(model)%>%
  slice(which.min(mean))%>%
  rename(c(Model = model, Estimate = 'mean', Configuration = .config))%>%
  mutate(Metric = 'RMSE')%>%
  relocate(Metric, .after=Model)%>%
  mutate(Model = ifelse(Model %in% 'rand_forest', 'Random Forest', Model))%>%
  mutate(Model = ifelse(Model %in% 'svm_rbf', 'SVM', Model))%>%
  mutate(Model = ifelse(Model %in% 'mlp', 'NNET', Model))%>%
  mutate(Model = ifelse(Model %in% 'decision_tree', 'Decision Tree', Model))%>%
  mutate(Model = ifelse(Model %in% 'mars', 'MARS', Model))%>%
  #mutate(Model = toupper(Model))%>%
  arrange(Estimate)%>%
  select(!.metric)%>%
  flextable()%>%
  set_caption('Table 3. Selected Models Based on Training Fit')

Model Evaluation

Our final model evaluation process includes the following steps:

  1. Update each model with its highest-performing set of tuned parameters.
  2. Update the workflows to include these finalized models.
  3. Fit each finalized workflow on the full training set and evaluate it on the test set in one step (via last_fit()).
  4. Collect and compare performance metrics for each model on the test data.
  5. Select the highest-performing model and inspect its predictions.

Table 4 shows the results of our model evaluation on the test data. It’s clear, based on RMSE scores, that the MARS model outperformed the others. It is also worth noting that the MARS RMSE on the test data is lower than on the training data, which gives us some assurance that the model is not over-fitting; if it were, we would expect the test error to be noticeably larger than the training error.

# select best model fits

#MARS

mars.best <- 
   grid.results %>% 
   extract_workflow_set_result("MARS") %>% 
   select_best(metric = "rmse")

#Random Forest

rforest.best <- 
   grid.results %>% 
   extract_workflow_set_result("R.Forest") %>% 
   select_best(metric = "rmse")

#Decision Tree

dtree.best <- 
   grid.results %>% 
   extract_workflow_set_result("Dec.Tree") %>% 
   select_best(metric = "rmse")

# SVM

svm.best <- 
   grid.results %>% 
   extract_workflow_set_result("SVM.radial") %>% 
   select_best(metric = "rmse")

#NNET

nnet.best <- 
   grid.results %>% 
   extract_workflow_set_result("NNet") %>% 
   select_best(metric = "rmse")


# Finalize workflows and fit test data with last_fit

mars.test<-
  grid.results%>%
  extract_workflow("MARS") %>% 
  finalize_workflow(mars.best)%>% 
  last_fit(split = data.split)

rforest.test<-
  grid.results%>%
  extract_workflow("R.Forest") %>% 
  finalize_workflow(rforest.best)%>% 
  last_fit(split = data.split)


dtree.test<-
  grid.results%>%
  extract_workflow("Dec.Tree") %>% 
  finalize_workflow(dtree.best)%>% 
  last_fit(split = data.split)

svm.test<-
  grid.results%>%
  extract_workflow("SVM.radial") %>% 
  finalize_workflow(svm.best)%>% 
  last_fit(split = data.split)

nnet.test<-
  grid.results%>%
  extract_workflow("NNet") %>% 
  finalize_workflow(nnet.best)%>% 
  last_fit(split = data.split)

# collect model rmse for fit comparison

metrics1<-mars.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'MARS')%>%
  relocate(Model, .before='Metric')
  
metrics2<-rforest.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Random Forest')%>%
  relocate(Model, .before='Metric')

metrics3<-dtree.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Decision Tree')%>%
  relocate(Model, .before='Metric')

metrics4<-svm.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'SVM')%>%
  relocate(Model, .before='Metric')

metrics5<-nnet.test%>% 
  collect_metrics()%>%
  select(.metric, .estimate)%>%
  filter(.metric == c("rmse"))%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  mutate(Model = 'Neural Network')%>%
  relocate(Model, .before='Metric')

# print table of test fit results
  
rbind(metrics1, metrics2, metrics3, metrics4, metrics5)%>%
   arrange(Estimate)%>%
   flextable()%>%
   set_caption('Table 4. Comparison of Model Fit on Test Data')

Model Interpretation

One advantage of MARS models is their interpretability. Table 5 shows the variables selected for the final model along with their respective coefficients.

The final model form is:

PH = 8.466 + 0.001 h(0.2 - Mnf Flow) - 0.113 BrandCode_C - 0.045 h(Usage cont - 22.14) + 0.028 h(68.2 - Temperature), where h is the hinge function, h(x) = max(0, x).
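
To make the hinge notation concrete, here is an illustrative hand calculation with made-up predictor values (Mnf Flow = -100, Usage cont = 24, Temperature = 66, and a brand other than C):

# hinge function used by MARS
h <- function(x) pmax(0, x)

# plugging the hypothetical values into the model above returns roughly 8.54
8.466 + 0.001*h(0.2 - (-100)) - 0.113*0 - 0.045*h(24 - 22.14) + 0.028*h(68.2 - 66)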

Figure 2 displays the partial dependence between PH and each of the predictors in the MARS model. In this context, each plot describes the effect of a predictor with all other covariates held at their average values.

Each plot also reveals the hinge behavior characteristic of MARS, in which a term’s contribution is flat (zero) beyond its knot:

  • the Mnf Flow term is zero above 0.2
  • the Usage cont term is zero below 22.14
  • the Temperature term is zero above 68.2
  • BrandCode_C enters as a 0/1 dummy rather than a hinge term, so its partial dependence appears as a single step (near 0.45 in the plot)

# Print model
output<-extract_fit_engine(mars.test)%>%
  summary()

output[12]%>%
  data.frame()%>%
  rownames_to_column(var="Model Components")%>%
  rename(Coefficient = ..y)%>%
  flextable%>%
  set_caption('Table 5. Final MARS Model: Features & Coefficients')
# plot partial dependence of features (from plotmo package in Earth -- plot regression surfaces)
plotmo(output, pmethod='partdep', caption='Figure 2. Partial dependence plots for continuous predictors using MARS')
## calculating partdep for Mnf Flow 
## calculating partdep for Temperature 
## calculating partdep for Usage cont 
## calculating partdep for BrandCode_C

To effectively visualize the model results, we can plot our observed PH values against the model’s PH predictions for both the training and test sets. These results are shown in Figure 3 and Table 6. Our errors are lowest around PH ≈ 8.5; on the test data, the model tends to overpredict at lower PH values and underpredict at higher values.

# adapted from https://stackoverflow.com/questions/68124804/tidymodels-get-predictions-and-metrics-on-training-data-using-workflow-recipe

# pull trained workflow

fit.preds <- mars.test$.workflow[[1]] 

# augment on test and training data 

model.predictions <- bind_rows(
  augment(fit.preds, new_data = test.split) %>% 
    mutate(type = "Test Data"),
  augment(fit.preds, new_data = train.split) %>% 
    mutate(type = "Training Data"))

# plot observed vs predicted PH for train and test sets

model.predictions %>%
  ggplot(aes(PH, .pred)) +
  geom_point(aes(color=type), alpha=.95, show.legend = FALSE) +
  scale_color_manual(values = c("Training Data" = "#4DB6D0", "Test Data" = "#BECA55"))+
  labs(x='PH (Observed)', y= 'PH (Predicted)', title='Figure 3. Observed vs Predicted PH Values: MARS Model')+
  theme_minimal() +
  theme(plot.title = element_text(vjust = 10))+ # applied after theme_minimal() so it isn't overridden
  facet_wrap(~type)

# print fit metrics for train and test 
  
model.predictions %>%
  group_by(type) %>%
  metrics(PH, .pred)%>%
  select(!.estimator)%>%
  rename(Metric=.metric, Estimate=.estimate, Data=type)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(!mae)%>%
  flextable()%>%
  set_caption('Table 6. Performance Metrics on Final Training and Test Datasets: MARS')

Finally, we can rank the relative importance of each covariate in terms of its contribution to predicting PH. From Figure 4, we see that Mnf Flow makes the greatest contribution to the model’s predictions. For this dataset, only four covariates were included in the final model.

# Plot variable importance for final model and test data


mars.test%>%
  extract_fit_parsnip()%>%
  vi_model(type='gcv')%>% # we can control importance type using vi_model instead of vip
  filter(Importance > 0)%>%
  arrange(desc(Importance))%>%
  mutate(Variable =  factor(Variable), Variable = fct_reorder(Variable, Importance))%>%
  ggplot(aes(x=Importance, y = Variable))+
  geom_col( fill='midnightblue', alpha=0.7)+
  labs(title='Figure 4. Variable Importance: MARS Model', subtitle='PH')+
  theme_light()

We conclude that any investment to manipulate PH as part of the manufacturing process should focus on adjustments to these four covariates, and we recommend that a cost-benefit analysis be conducted before committing company resources for this purpose.