ABC Beverage: pH Factor Predictive Modeling

Author

Tony Fraser and Seung-min Song

Published

May 12, 2024

library(readxl)
library(dplyr)
library(ggplot2)
library(corrplot)
library(patchwork)
library(shiny)
library(caret)
library(randomForest)
library(tidyr)

Executive Summary

This project aims to establish a robust predictive model for pH levels in the bottling process. We explored several modeling algorithms, a sample of which is included in this report, and in the end chose Random Forest with cross-validation for its accuracy and resilience.

Key Outcomes

  • Our best-performing models are:
    1. After model tuning only: Random Forest with an R Squared of 0.6917606.
    2. After discarding the most distant outliers outside a conservative range: Random Forest with an R Squared of 0.9763720.
  • Approximately 30% of the columns are either irrelevant or collinear, and were removed.
  • Regarding the split: in all tested cases, Random Forest is most stable with an 80/20 train/test split.
  • To predict all 267 records we were asked to score, we implemented a linear regression imputation model to fill in the 675 missing fields.
  • Our outcome file is called ph_predictions.csv.

Recommendations

  1. Stabilize factors Mnf Flow, Usage Cont, Temperature, Filler Level, Oxygen Filler, and Carb Rel. These appear to be the most important factors in keeping pH levels stable.
  2. Continue testing these models with different groups of data, and update them regularly.
  3. Find causes of null fields within records. Either fix the measurement process, or plan how to handle partial records.
  4. Automate a reporting and warning system, perhaps on a weekly basis. Monitoring metrics like R Squared and out-of-bag error will provide good feedback (see the sketch after this list).
  5. Build and socialize a data dictionary. Start with the definitions of each variable. Over time, add behaviors as well. This study would have been more comprehensive and accurate if we, the data scientists, had known more about each individual variable.
  6. Look into outlier records. Within this study we’ve shown we can make considerably more accurate predictions if we can dismiss or smooth outliers.
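
As a sketch of recommendation 4: the helper below is hypothetical (check_model_health and latest_batch are not part of this project), but it illustrates the kind of weekly check we have in mind, scoring a fresh batch of records with a current model such as the tuned Random Forest built later in this report, and warning when R Squared drops below an agreed threshold.

check_model_health <- function(model, new_data, threshold = 0.6) {
  # score the new batch and compare its R Squared against the agreed floor
  preds <- predict(model, newdata = new_data)
  metrics <- postResample(pred = preds, obs = new_data$ph)
  if (metrics["Rsquared"] < threshold) {
    warning(sprintf("R Squared dropped to %.3f (threshold %.2f); review the model.",
                    metrics["Rsquared"], threshold))
  }
  metrics
}
# e.g. check_model_health(rfModel_updated, latest_batch)  # latest_batch: a hypothetical weekly extract with measured ph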

What’s Next or Missing

  • We opted to make predictions with the model that includes the outliers (the ~69% model). This might not be the best prediction strategy, and most industry standards require models performing at 70% or better, but we believe we are making the correct conservative decision. Predicting with our second model requires consensus and industry experience, neither of which we have. We did, however, make it a one-line adjustment to switch between the 69% and 97% models should the assumptions and logic be approved.
  • There is almost certainly feature engineering worth exploring; however, we lack both a data dictionary and industry knowledge. Our strongest suggestion before furthering this research is for all parties to gain a better understanding of these columns.
  • Some fields in the training data are empty, and we chose to exclude those records from model building, i.e., we dropped nulls. Normally the correct course of action is to first find out why they are null and then decide whether they can be dropped. If we continue this assignment, this null question needs to be reviewed and decided upon.

GitHub Project Location

https://github.com/tonythor/abc-beverage

Data Overview

  1. Our training dataset consists of 2,571 entries with 33 columns.
  2. Our prediction dataset consists of 267 records similar to the training data, but with the pH values removed for us to predict.
  3. This dataset is not time-series data; it comprises discrete measurements taken at the time of bottling.
  4. We built two usable Random Forest models for this study. The first model kept all records and was well tuned. The second model was tuned like the first, but we also removed distant outliers based on a conservative threshold set at five times the interquartile range (sketched below). In the end, we removed only 85 outlier records from approximately 1,600 training records.
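
As a sketch of that conservative threshold (the actual filter appears later as filter_outliers), a value is kept only when it lies within five interquartile ranges of the first and third quartiles. The helper below, is_within_bounds, is hypothetical and shown only to illustrate the rule.

is_within_bounds <- function(x, multiplier = 5) {
  # keep x when it falls inside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  x >= (q1 - multiplier * iqr) & x <= (q3 + multiplier * iqr)
}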

Load Data

We moved the data to GitHub and will load it directly from there. Upon loading, we immediately clean and standardize the column names, removing spaces and converting everything to lowercase.

github_url <- "https://github.com/tonythor/abc-beverage/raw/develop/data/"
train_fn <- "StudentData_Training.xlsx"
predict_me_fn <- "StudentEvaluation_Test.xlsx"
train_url <- paste0(github_url, train_fn) 
predict_me_url <- paste0(github_url, predict_me_fn)
download.file(train_url, destfile = train_fn, mode = "wb")
download.file(predict_me_url, destfile = predict_me_fn, mode = "wb")

# read in; clean by removing spaces from column names and lowercasing them,
# and convert brandcode to a numeric code.
# train_raw has 2571 records.
train_raw <- read_excel(train_fn) %>%
  rename_with(~ gsub(" ", "", .x) %>% tolower()) %>%
  mutate(brandcodenumeric = as.numeric(as.factor(brandcode))) %>%
  select(-brandcode) %>%
  mutate(across(where(is.factor), ~ as.numeric(as.factor(.))))  # Convert all factors to numeric

#train has 2038 records
train <- train_raw %>%  filter_all(all_vars(!is.na(.)))

predict_me <- read_excel(predict_me_fn) %>%
  rename_with(~ gsub(" ", "", .x) %>% tolower()) %>%
  mutate(brandcodenumeric = as.numeric(as.factor(brandcode))) %>%
  select(-brandcode) %>% 
  mutate(across(where(is.factor), ~ as.numeric(as.factor(.))))  # Convert all factors to numeric

file.remove(c(train_fn, predict_me_fn))

Basic Visualizations

pH appears to be fairly normally distributed. The correlation chart suggests we may have some collinearity or irrelevant columns to deal with. Based on these two visuals alone, we should test several different models and see which performs best.

numeric_data <- train %>% select_if(is.numeric)

#correlation matrix to df
cor_matrix <- cor(numeric_data, use = "complete.obs")
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Variable1", "Variable2", "Correlation")

# filter to include only higher correlations
threshold <- 0.5
cor_df <- cor_df %>%
  filter(abs(Correlation) >= threshold, Variable1 != Variable2)

cor_plot <- ggplot(cor_df, aes(x = Variable1, y = Variable2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y = element_text(angle = 0)) +
  labs(fill = "Correlation", title = "Filtered Correlation Matrix")

ph_plot <- ggplot(numeric_data, aes(x = ph)) + 
    geom_histogram(bins = 30, fill = "blue", color = "black") +
    ggtitle("Distribution of pH Values") +
    xlab("pH") +
    ylab("Frequency")
  
combined_plot <- ph_plot + cor_plot
print(combined_plot)

Shiny App: Boxplots

You can review box-and-whisker plots for each numeric variable via this Shiny app.

Launch the popup to review individual variable boxplots.
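
The app itself is not embedded in this report. The snippet below is a minimal sketch of what it does, assuming the cleaned train data frame built in the Load Data section; the object name boxplot_app is ours, not part of the project code.

numeric_cols <- names(train)[sapply(train, is.numeric)]

boxplot_app <- shinyApp(
  ui = fluidPage(
    selectInput("var", "Variable", choices = numeric_cols, selected = "ph"),
    plotOutput("box")
  ),
  server = function(input, output, session) {
    output$box <- renderPlot({
      # box-and-whisker plot for whichever numeric variable is selected
      ggplot(train, aes(y = .data[[input$var]])) +
        geom_boxplot(fill = "lightblue") +
        labs(title = paste("Boxplot of", input$var), y = input$var)
    })
  }
)
# runApp(boxplot_app)  # launch the popup interactively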

Review Multiple Models

First, we’ll fit and predict using several common models to evaluate their performance. This analysis demonstrates that Random Forest is the strongest candidate among the models tested. Please note that this is not an exhaustive list of all models we tested.

set.seed(200)  # for reproducibility

train <- train %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  select(where(is.numeric))  # Ensure only numeric data is used

# test and train
set.seed(123)  
indexes <- createDataPartition(train$ph, p = 0.8, list = FALSE)
training_data <- train[indexes, ]
testing_data <- train[-indexes, ]

# prepare predictor and response variables (stored as nested x and y columns; both are later removed by name)
training_data$x <- data.frame(sapply(training_data[, -which(names(training_data) == "ph")], as.numeric))
training_data$y <- training_data$ph
testing_data$x <- data.frame(sapply(testing_data[, -which(names(testing_data) == "ph")], as.numeric))
testing_data$y <- testing_data$ph

# KNN
knnModel <- train(x = training_data$x, y = training_data$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 5)
knnPred <- predict(knnModel, newdata = testing_data$x)
knnMetrics <- postResample(pred = knnPred, obs = testing_data$y)

# MARS
marsModel <- train(x = training_data$x, y = training_data$y,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneLength = 5)
marsPred <- predict(marsModel, newdata = testing_data$x)
marsMetrics <- postResample(pred = marsPred, obs = testing_data$y)

# Neural Net
nnModel <- train(x = training_data$x, y = training_data$y,
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 5, 
                 trace = FALSE)
nnPred <- predict(nnModel, newdata = testing_data$x)
nnMetrics <- postResample(pred = nnPred, obs = testing_data$y)

# SVM
svmModel <- train(x = training_data$x, y = training_data$y,
                  method = "svmRadial",
                  preProcess = c("center", "scale"),
                  tuneLength = 5)
svmPred <- predict(svmModel, newdata = testing_data$x)
svmMetrics <- postResample(pred = svmPred, obs = testing_data$y)

# Random Forest
trainControl <- trainControl(
  method = "cv", # <- this is how you get your model to cross validate!
  number = 3, # <- three fold validation
  verboseIter = FALSE,
  savePredictions = "final",
  returnResamp = "all"
)

tuneGrid <- expand.grid(
  mtry = c(2, floor(sqrt(ncol(training_data$x))))
)

rfModel <- train(
  x = training_data$x,
  y = training_data$y,
  method = "rf",
  trControl = trainControl,
  tuneGrid = tuneGrid
)
rfPred <- predict(rfModel, newdata = testing_data$x)
rfMetrics <- postResample(pred = rfPred, obs = testing_data$y)
importance_measures <- varImp(rfModel, scale = TRUE)

# Linear Regression
lmModel <- train(x = training_data$x, y = training_data$y,
                 method = "lm",
                 preProcess = c("center", "scale"))
lmPred <- predict(lmModel, newdata = testing_data$x)
lmMetrics <- postResample(pred = lmPred, obs = testing_data$y)

# Collecting all model performance metrics into a small df
modelPerformance <- data.frame(
  Model = c("KNN", "MARS", "Neural Network", "SVM", "Random Forest (RF)", "Linear Regression (LR)"),
  RMSE = c(knnMetrics[1], marsMetrics[1], nnMetrics[1], svmMetrics[1], rfMetrics[1], lmMetrics[1]),
  Rsquared = c(knnMetrics[2], marsMetrics[2], nnMetrics[2], svmMetrics[2], rfMetrics[2], lmMetrics[2]),
  MAE = c(knnMetrics[3], marsMetrics[3], nnMetrics[3], svmMetrics[3], rfMetrics[3], lmMetrics[3])
)



print(modelPerformance, row.names = FALSE)
                  Model      RMSE  Rsquared        MAE
                    KNN 0.1140456 0.5338995 0.08391264
                   MARS 0.1298918 0.3880553 0.09911886
         Neural Network 7.5542939        NA 7.55248157
                    SVM 0.1079469 0.5757274 0.07821117
     Random Forest (RF) 0.0995023 0.6682969 0.07138323
 Linear Regression (LR) 0.1328399 0.3584526 0.10245441

RF Model 1 and Setup

Now that we’ve selected a machine learning algorithm, we’ll use this section to set up the Random Forest model. We’ll figure out which columns to keep, how to set our hyperparameters, etc. We will not modify any data for this first model; instead, we will focus on optimizing the infrastructure as effectively as possible.

Feature Importance

First, let’s get a handle on which features seem the most important. We should see some of these less important columns later when we start removing unimportant factors.

importance_measures <- varImp(rfModel, scale = TRUE)
plot(importance_measures, main = "Feature Importance in Random Forest Model")

Build Model and Remove Columns

Some of the less important columns can confuse the model. Before we refit this model, we’ll set an importance threshold to selectively remove those columns that are less impactful. Then, we’ll either refine our existing model or build a completely new one, refit, and evaluate its performance.

As you’ll see in the results, the performance isn’t bad: we’ve achieved an R Squared of over 69%, which would round up to the roughly 70% level often considered good enough.

# clean and type and flatten
clean_data <- function(data, vars_to_remove) {
  data <- data[, !(names(data) %in% vars_to_remove)]
  data <- as.data.frame(data)
  data[] <- lapply(data, function(x) if(is.list(x)) unlist(x) else x)
  return(data)
}

# important features
importance_df <- as.data.frame(importance_measures$importance)
importance_df$Variable <- rownames(importance_df)
importance_df <- importance_df[order(importance_df$Overall), ]

# drop variables below the importance_factor quantile of overall importance
importance_factor <- 0.3
cutoff <- quantile(importance_df$Overall, importance_factor)
variables_to_remove <- importance_df$Variable[importance_df$Overall <= cutoff]

training_data_updated <- clean_data(training_data, c(variables_to_remove, "x", "y", "ph"))
# check the data types of variables after cleaning and removal.

trainControl2 <- trainControl(
  method = "cv", # <- this is how you get your model to cross validate!
  number = 3, # <- three fold validation
  verboseIter = FALSE,
  savePredictions = "final",
  returnResamp = "all"
)

tuneGrid2 <- expand.grid(
  mtry = c(2, floor(sqrt(ncol(training_data$x))))
)

n_tree = 2000
rfModel_updated <- train(
  x = training_data_updated,
  y = training_data$y,
  method = "rf",
  trControl = trainControl2,
  tuneGrid = tuneGrid2,
  ntree = n_tree
)

# prepare and clean prediction data
predictor_data_for_model <- clean_data(testing_data, c(variables_to_remove, "x", "y"))

rfPred_updated <- predict(rfModel_updated, newdata = predictor_data_for_model)
rfMetrics_updated <- postResample(pred = rfPred_updated, obs = testing_data$y)

print(variables_to_remove) #<- columns we removed before retraining 

Model 1 Performance

Let us now take a quick look at this last run and see how much better it performs.

Not bad: removing those columns and adjusting those hyperparameters made a big difference.

updated_model_performance <- data.frame(
  Model = paste0("RF Tuned (if:ntree ", importance_factor,":", n_tree,  ")"),
  RMSE = rfMetrics_updated[1],
  Rsquared = rfMetrics_updated[2],
  MAE = rfMetrics_updated[3]
)

modelPerformance <- rbind(modelPerformance, updated_model_performance)
print(modelPerformance, row.names = FALSE)  #<- a big improvement
                        Model       RMSE  Rsquared        MAE
                          KNN 0.11404559 0.5338995 0.08391264
                         MARS 0.12989181 0.3880553 0.09911886
               Neural Network 7.55429390        NA 7.55248157
                          SVM 0.10794695 0.5757274 0.07821117
           Random Forest (RF) 0.09950230 0.6682969 0.07138323
       Linear Regression (LR) 0.13283985 0.3584526 0.10245441
 RF Tuned (if:ntree 0.3:2000) 0.09509905 0.6917606 0.06705064
print(variables_to_remove) #<- columns we removed before retraining 
 [1] "pscco2"           "pscfill"          "carbtemp"         "psc"             
 [5] "carbpressure"     "hydpressure4"     "fillounces"       "pressuresetpoint"
 [9] "pcvolume"         "hydpressure2"    

Residuals Plot Review

This chart shows the differences between the pH we predicted in model 1 and the actual pH from the dataset. Residual points appear randomly distributed around the zero line, which is ideal. This suggests that the model exhibits homoscedasticity, meaning it does not suffer from non-constant variance. Also, there are no apparent patterns or trends in the residuals, which implies that the linearity between predictors and target is reasonably satisfied.

However, there are some outliers that we need to address.

predictions <- predict(rfModel_updated, newdata = predictor_data_for_model)
residuals <- testing_data$y - predictions
residuals_df <- data.frame(Predicted = predictions, Residuals = residuals)
residuals_df <- na.omit(residuals_df)
ggplot(residuals_df, aes(x = Predicted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals")

RF Model 2, Outliers Removed

Random Forest deals reasonably well with outliers, but let’s kick out just a few here to stabilize our model a little more. We’ll use an IQR strategy to remove only those outliers that are extremely distant, specifically those lying more than five interquartile ranges beyond the first and third quartiles.

Randomly dropping records because they are outliers is not generally a preferred strategy, but we’re already dropping some records because they have null fields. These might be valid records; we simply don’t know. We should look into each of them to figure out why they are extreme, and try to account for that variance in some other way.

From a process perspective, let’s use our previous test/train split with unimportant columns removed. All we’ll do is remove a few of the records furthest from the mean.

As the performance table below shows, this drastically stabilized the model.

iqr_multiplier <- 5
filter_outliers <- function(data) {
  num_cols <- sapply(data, is.numeric)  # Identify numeric columns
  before_count <- nrow(data)
  data <- data %>%
    filter(if_all(where(is.numeric), ~ {
      q1 <- quantile(., 0.25, na.rm = TRUE)
      q3 <- quantile(., 0.75, na.rm = TRUE)
      iqr <- q3 - q1
      # keep values within iqr_multiplier IQRs of the quartiles (Q1 - 5*IQR to Q3 + 5*IQR here)
      lower_bound <- q1 - iqr_multiplier * iqr
      upper_bound <- q3 + iqr_multiplier * iqr
      . >= lower_bound & . <= upper_bound
    }))
  after_count <- nrow(data)
  return(list(data = data, before = before_count, after = after_count))
}

# apply outlier filter
filtered_training_data <- filter_outliers(training_data)
filtered_testing_data <- filter_outliers(testing_data)

#counts
training_data_filtered <- filtered_training_data$data
testing_data_filtered <- filtered_testing_data$data
train_before <- filtered_training_data$before
train_after <- filtered_training_data$after
test_before <- filtered_testing_data$before
test_after <- filtered_testing_data$after

training_data_updated <- clean_data(training_data_filtered, c(variables_to_remove, "x", "y", "ph"))
testing_data_for_model <- clean_data(testing_data_filtered, c(variables_to_remove, "x", "y"))

# re-attach the response (ph) as 'y' after filtering
training_data_updated$y <- training_data_filtered$ph
testing_data_for_model$y <- testing_data_filtered$ph

rfModel_filtered <- train(
  x = training_data_updated,
  y = training_data_updated$y,
  method = "rf",
  trControl = trainControl2,
  tuneGrid = tuneGrid2,
  ntree = n_tree
)

rfPred_filtered <- predict(rfModel_filtered, newdata = testing_data_for_model)
rfMetrics_filtered <- postResample(pred = rfPred_filtered, obs = testing_data_for_model$y)

filtered_model_performance <- data.frame(
  Model = paste0("RF Tuned FilteredOutliers (if:ntree ", importance_factor, ":", n_tree, ")"),
  RMSE = rfMetrics_filtered[1],
  Rsquared = rfMetrics_filtered[2],
  MAE = rfMetrics_filtered[3]
)

Model Performance

As expected, removing some of these records produces a much better fit. This does not mean this model is better; on the contrary, it might be worse. We simply don’t know.

modelPerformance <- rbind(modelPerformance, filtered_model_performance)
print(modelPerformance, row.names = FALSE)
                                         Model       RMSE  Rsquared        MAE
                                           KNN 0.11404559 0.5338995 0.08391264
                                          MARS 0.12989181 0.3880553 0.09911886
                                Neural Network 7.55429390        NA 7.55248157
                                           SVM 0.10794695 0.5757274 0.07821117
                            Random Forest (RF) 0.09950230 0.6682969 0.07138323
                        Linear Regression (LR) 0.13283985 0.3584526 0.10245441
                  RF Tuned (if:ntree 0.3:2000) 0.09509905 0.6917606 0.06705064
 RF Tuned FilteredOutliers (if:ntree 0.3:2000) 0.03016909 0.9763720 0.01757425

Overfitting

The comparison between train and test data indicates that the model generalizes well, maintaining high accuracy on both datasets. The minor drop in R Squared on the test data is expected and does not raise concerns about overfitting, reflecting excellent model performance.

However, despite the model’s strong performance and lack of overfitting issues, we still have reservations about the potential arbitrariness of forcefully dropping records that may or may not be outliers.

evaluate_model_performance <- function(model, training_data, testing_data) {
  calc_performance <- function(data, model) {
    predictions <- predict(model, newdata = data)
    postResample(pred = predictions, obs = data$y)
  }
  
  train_performance <- calc_performance(training_data, model)
  test_performance <- calc_performance(testing_data, model)
  
  performance <- data.frame(
    DataSet = c("Training", "Testing"),
    RMSE = c(train_performance[1], test_performance[1]),
    Rsquared = c(train_performance[2], test_performance[2]),
    MAE = c(train_performance[3], test_performance[3])
  )
  
  return(performance)
}

model_performance <- evaluate_model_performance(rfModel_filtered, training_data_updated, testing_data_for_model)
print(model_performance)
   DataSet       RMSE  Rsquared        MAE
1 Training 0.01345271 0.9962199 0.00787308
2  Testing 0.03016909 0.9763720 0.01757425

Record Removal Visualization

How many outliers do we have to remove to stabilize that model? The more we remove, the more we risk over-fitting. Our goal was to remove as few records as reasonably possible.

counts_data <- data.frame(
  Stage = rep(c("Before", "After"), each = 2),
  Dataset = rep(c("Training", "Testing"), 2),
  Count = c(train_before, test_before, train_after, test_after)
)

counts_data$Stage <- factor(counts_data$Stage, levels = c("Before", "After"))

library(ggplot2)
ggplot(counts_data, aes(x = Stage, y = Count, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(
    title = "Record Counts Before and After Outlier Filtering",
    x = "",
    y = "Record Count",
    fill = "Dataset"
  ) +
  geom_text(aes(label = Count, group = Dataset), position = position_dodge(width = 0.9), vjust = -0.25)

Predictions

We’ll use Model 1 for predictions. Although removing outliers can improve fits and visualizations, it may be too aggressive. This decision should ultimately rest with the stakeholders.

To predict, we will first impute missing data, then proceed to predict pH levels, and finally, write the results to a CSV file.

Imputation

Before predicting, we must address the missing fields in the dataset for which we are to predict pH levels. We’ll use a simple linear regression model to estimate these missing fields.

impute_linear_regression <- function(data) {
    # For each column ... 
    for (col in names(data)) {
        # is numeric and any missing values
        if (is.numeric(data[[col]]) && any(is.na(data[[col]]))) {
            # Identify predictor columns, excluding the current column and any columns with missing values
            predictors <- setdiff(names(data), col)
            predictors <- predictors[!sapply(data[predictors], function(x) any(is.na(x)))]

            # Extract rows where the target column is not missing for training the model
            train_data <- data[!is.na(data[[col]]), predictors, drop = FALSE]
            train_target <- data[[col]][!is.na(data[[col]])]

            # if predictors and sufficient data, fit a linear model
            if (nrow(train_data) > 1 && length(predictors) > 0) {
                lm_model <- lm(train_target ~ ., data = train_data)

                # find indices of missing values in the target column
                na_indices <- which(is.na(data[[col]]))
                if (length(na_indices) > 0) {
                    # predict missing values
                    predictions <- predict(lm_model, newdata = data[na_indices, predictors, drop = FALSE])
                    # Replace missing values with the predictions
                    data[[col]][na_indices] <- predictions
                }
            } else {
                # Use median imputation as a fallback when not enough data is available for regression
                median_value <- median(train_target, na.rm = TRUE)
                data[[col]][is.na(data[[col]])] <- median_value
            }
        }
    }
    return(data)
}

Build predictions

Now that we have an imputation strategy, we can impute and predict.

all_vars_to_remove <- c("ph", variables_to_remove) 

predict_me_filled <- predict_me %>% 
  select(-all_of(all_vars_to_remove)) %>% 
  impute_linear_regression()

predicted_ph <- predict(rfModel_updated, newdata = predict_me_filled) #<- change this line to predict from 97% model.
predict_me_filled$predicted_ph <- round(predicted_ph, 5)

Persist as CSV

As a last step, we’ll save this prediction set as a CSV file. One can expect this spreadsheet to contain 267 records, no null fields in any record (thanks to imputation), and a new column called predicted_ph.

predict_me_filled <- predict_me_filled %>% ## one last cleanup: round all numeric columns
  mutate_if(is.numeric, round, digits = 5)
  
write.csv(predict_me_filled, "ph_predictions.csv", row.names = FALSE)