library(readxl)
library(dplyr)
library(ggplot2)
library(corrplot)
library(patchwork)
library(shiny)
library(caret)
library(randomForest)
library(tidyr)
ABC Beverage: pH Predictive Modeling
Executive Summary
This project aims to establish a robust predictive model for pH levels in the bottling process. We explored several modeling algorithms, a representative subset of which is included here, and ultimately chose a cross-validated Random Forest for its accuracy and resilience.
Key Outcomes
- Our best-performing models are:
- With model tuning only: Random Forest with an R Squared of 0.6917606.
- After also discarding the most distant outliers beyond a conservative range: Random Forest with an R Squared of 0.9763720.
- Approximately 30% of the columns are either irrelevant or collinear, and were removed.
- Across all tested cases, Random Forest was most stable with an 80/20 train/test split.
- To predict all 267 records we were asked to score, we implemented a linear regression imputation model to fill the 675 missing fields.
- Our outcome file is called ph_predictions.csv.
Recommendations
- Stabilize factors Mnf Flow, Usage Cont, Temperature, Filler Level, Oxygen Filler, and Carb Rel. These appear to be the most important factors in keeping pH levels stable.
- Continue testing these models with different groups of data, and update them regularly.
- Find causes of null fields within records. Either fix the measurement process, or plan how to handle partial records.
- Automate a reporting and warning system, perhaps on a weekly basis. Monitoring metrics such as R Squared and out-of-bag error will provide good feedback (see the sketch after this list).
- Build and socialize a data dictionary. Start with the definitions of each variable; over time, add behaviors as well. This study would have been more comprehensive and accurate if we, the data scientists, had known more about each individual variable.
- Look into outlier records. Within this study we’ve shown we can make considerably more accurate predictions if we can dismiss or smooth outliers.
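As a minimal sketch of the monitoring recommendation above, the helper below pulls the cross-validated R Squared and the out-of-bag error from a caret-fit Random Forest like the one built later in this report. The function name, the 0.65 floor, and the alerting behavior are our own illustrative choices, not part of the original workflow.

weekly_check <- function(model, r2_floor = 0.65) {
  cv_r2 <- max(model$results$Rsquared, na.rm = TRUE)  # best cross-validated R Squared
  oob_mse <- tail(model$finalModel$mse, 1)            # out-of-bag MSE at the final tree
  if (cv_r2 < r2_floor) warning("Cross-validated R Squared has dropped below the floor.")
  data.frame(cv_r2 = cv_r2, oob_mse = oob_mse)
}
# weekly_check(rfModel_updated)   # run against the tuned model defined later in this report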
What’s Next or Missing
- We opted to make predictions with the model that retains the outliers (the roughly 69% R Squared model). This might not be the best prediction strategy, and most industry standards expect models performing at 70% or better, but we believe this is the correct conservative decision. Predicting with our second model requires consensus and industry experience, neither of which we have. We did, however, make it a one-line adjustment to switch between the 69% and the 97% model should the assumptions and logic be approved.
- There is almost certainly feature engineering worth exploring; however, we lack both a data dictionary and industry knowledge. Our strongest suggestion before furthering this research is for all parties to gain a better understanding of these columns.
- Some fields in the training data are empty and we chose to exclude them from model building, i.e., we dropped records containing nulls. Normally the correct course of action is to first find out why they are null and then decide whether they can be dropped. If we continue this assignment, this null question needs to be reviewed and decided upon.
GitHub Project Location
Data Overview
- Our training dataset consists of 2,571 entries with 33 columns.
- Our prediction dataset consists of 267 records similar to the training data, but with the pH values removed for us to predict.
- This dataset is not time-series data; it comprises discrete measurements taken at the time of bottling.
- We built two usable Random Forest models for this study. The first model kept all records and was well tuned. The second model was tuned like the first, but we also removed distant outliers based on a conservative threshold set at five times the interquartile range (a toy sketch of this rule follows below). In the end, we removed only 85 outlier records from roughly 1,600 training records.
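As a toy illustration of that rule (illustrative values only, not project data), the conservative bound is built directly from the quartiles:

x <- c(8.1, 8.3, 8.4, 8.5, 8.6, 8.7, 9.9)   # toy values, not project data
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1
c(lower = q1 - 5 * iqr, upper = q3 + 5 * iqr) # anything outside these bounds counts as a distant outlier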
Load Data
We uploaded the data to GitHub and will load it directly from there. Upon loading, we immediately clean and standardize the column names, removing spaces and converting everything to lowercase.
<- "https://github.com/tonythor/abc-beverage/raw/develop/data/"
github_url <- "StudentData_Training.xlsx"
train_fn <- "StudentEvaluation_Test.xlsx"
predict_me_fn <- paste0(github_url, train_fn)
train_url <- paste0(github_url, predict_me_fn)
predict_me_url download.file(train_url, destfile = train_fn, mode = "wb")
download.file(predict_me_url, destfile = predict_me_fn, mode = "wb")
# read in and clean: strip spaces from column names, lower-case them,
# and convert brandcode to a numeric factor.
# train_raw has 2571 records.
train_raw <- read_excel(train_fn) %>%
  rename_with(~ gsub(" ", "", .x) %>% tolower()) %>%
  mutate(brandcodenumeric = as.numeric(as.factor(brandcode))) %>%
  select(-brandcode) %>%
  mutate(across(where(is.factor), ~ as.numeric(as.factor(.)))) # Convert any remaining factors to numeric

# train keeps only records with no missing fields (2038 records).
train <- train_raw %>% filter_all(all_vars(!is.na(.)))

predict_me <- read_excel(predict_me_fn) %>%
  rename_with(~ gsub(" ", "", .x) %>% tolower()) %>%
  mutate(brandcodenumeric = as.numeric(as.factor(brandcode))) %>%
  select(-brandcode) %>%
  mutate(across(where(is.factor), ~ as.numeric(as.factor(.)))) # Convert any remaining factors to numeric

file.remove(c(train_fn, predict_me_fn))
Basic Visualizations
pH seems to be fairly normally distributed. After looking at a correlation chart, we might have some collinearity or irrelevant columns to deal with. With just these two visuals, we should likely test several different models and see which performs best.
numeric_data <- train %>% select_if(is.numeric)

# correlation matrix to df
cor_matrix <- cor(numeric_data, use = "complete.obs")
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Variable1", "Variable2", "Correlation")

# filter to include only higher correlations
threshold <- 0.5
cor_df <- cor_df %>%
  filter(abs(Correlation) >= threshold, Variable1 != Variable2)

cor_plot <- ggplot(cor_df, aes(x = Variable1, y = Variable2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y = element_text(angle = 0)) +
  labs(fill = "Correlation", title = "Filtered Correlation Matrix")

ph_plot <- ggplot(numeric_data, aes(x = ph)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  ggtitle("Distribution of pH Values") +
  xlab("pH") +
  ylab("Frequency")

combined_plot <- ph_plot + cor_plot
print(combined_plot)
Shiny App: Boxplots
You can review all the box and whisker plots for each numeric variable via this shiny app.
Launch the popup to review individual variable boxplots.
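The interactive app itself is not embedded in this static report. The sketch below shows one way such a boxplot explorer can be wired up with the already-loaded shiny and ggplot2 packages; the layout and input names (variable, boxplot) are illustrative choices, not necessarily those of the deployed app.

boxplot_app <- shinyApp(
  ui = fluidPage(
    titlePanel("Variable Boxplots"),
    selectInput("variable", "Numeric variable:", choices = names(select_if(train, is.numeric))),
    plotOutput("boxplot")
  ),
  server = function(input, output) {
    output$boxplot <- renderPlot({
      ggplot(train, aes(y = .data[[input$variable]])) +
        geom_boxplot(fill = "lightblue") +
        labs(title = paste("Boxplot of", input$variable), y = input$variable)
    })
  }
)
# runApp(boxplot_app)  # launch interactively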
Review Multiple Models
First, we’ll fit and predict using several common models to evaluate their performance. This analysis demonstrates that Random Forest is the strongest candidate among the models tested. Please note that this is not an exhaustive list of all models we tested.
set.seed(200) # for reproducibility
train <- train %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  select(where(is.numeric)) # Ensure only numeric data is used

# test and train
set.seed(123)
indexes <- createDataPartition(train$ph, p = 0.8, list = FALSE)
training_data <- train[indexes, ]
testing_data <- train[-indexes, ]

# prepare predictor and response variables correctly
training_data$x <- data.frame(sapply(training_data[, -which(names(training_data) == "ph")], as.numeric))
training_data$y <- training_data$ph
testing_data$x <- data.frame(sapply(testing_data[, -which(names(testing_data) == "ph")], as.numeric))
testing_data$y <- testing_data$ph
# KNN
knnModel <- train(x = training_data$x, y = training_data$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 5)
knnPred <- predict(knnModel, newdata = testing_data$x)
knnMetrics <- postResample(pred = knnPred, obs = testing_data$y)

# MARS
marsModel <- train(x = training_data$x, y = training_data$y,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneLength = 5)
marsPred <- predict(marsModel, newdata = testing_data$x)
marsMetrics <- postResample(pred = marsPred, obs = testing_data$y)

# Neural Net
nnModel <- train(x = training_data$x, y = training_data$y,
                 method = "nnet",
                 preProcess = c("center", "scale"),
                 tuneLength = 5,
                 trace = FALSE)
nnPred <- predict(nnModel, newdata = testing_data$x)
nnMetrics <- postResample(pred = nnPred, obs = testing_data$y)

# SVM
svmModel <- train(x = training_data$x, y = training_data$y,
                  method = "svmRadial",
                  preProcess = c("center", "scale"),
                  tuneLength = 5)
svmPred <- predict(svmModel, newdata = testing_data$x)
svmMetrics <- postResample(pred = svmPred, obs = testing_data$y)
# Random Forest
trainControl <- trainControl(
  method = "cv",            # <- this is how you get your model to cross validate!
  number = 3,               # <- three fold validation
  verboseIter = FALSE,
  savePredictions = "final",
  returnResamp = "all"
)
tuneGrid <- expand.grid(
  mtry = c(2, floor(sqrt(ncol(training_data$x))))
)
rfModel <- train(
  x = training_data$x,
  y = training_data$y,
  method = "rf",
  trControl = trainControl,
  tuneGrid = tuneGrid
)
rfPred <- predict(rfModel, newdata = testing_data$x)
rfMetrics <- postResample(pred = rfPred, obs = testing_data$y)
importance_measures <- varImp(rfModel, scale = TRUE)
# Linear Regression
lmModel <- train(x = training_data$x, y = training_data$y,
                 method = "lm",
                 preProcess = c("center", "scale"))
lmPred <- predict(lmModel, newdata = testing_data$x)
lmMetrics <- postResample(pred = lmPred, obs = testing_data$y)
# Collecting all model performance metrics into a small df
modelPerformance <- data.frame(
  Model = c("KNN", "MARS", "Neural Network", "SVM", "Random Forest (RF)", "Linear Regression (LR)"),
  RMSE = c(knnMetrics[1], marsMetrics[1], nnMetrics[1], svmMetrics[1], rfMetrics[1], lmMetrics[1]),
  Rsquared = c(knnMetrics[2], marsMetrics[2], nnMetrics[2], svmMetrics[2], rfMetrics[2], lmMetrics[2]),
  MAE = c(knnMetrics[3], marsMetrics[3], nnMetrics[3], svmMetrics[3], rfMetrics[3], lmMetrics[3])
)
print(modelPerformance, row.names = FALSE)
                 Model      RMSE  Rsquared        MAE
                   KNN 0.1140456 0.5338995 0.08391264
                  MARS 0.1298918 0.3880553 0.09911886
        Neural Network 7.5542939        NA 7.55248157
                   SVM 0.1079469 0.5757274 0.07821117
    Random Forest (RF) 0.0995023 0.6682969 0.07138323
Linear Regression (LR) 0.1328399 0.3584526 0.10245441
RF Model 1 and Setup
Now that we’ve selected a machine learning algorithm, we’ll use this section to set up the Random Forest model. We’ll figure out which columns to keep, how to set our hyperparameters, etc. We will not modify any data for this first model; instead, we will focus on optimizing the infrastructure as effectively as possible.
Feature Importance
First, let’s get a handle on which features seem the most important. We will see some of these less important columns again later, when we start removing unimportant factors.
importance_measures <- varImp(rfModel, scale = TRUE)
plot(importance_measures, main = "Feature Importance in Random Forest Model")
Build Model and Remove Columns
Some of the less important columns can confuse the model. Before we refit this model, we’ll set an importance threshold to selectively remove those columns that are less impactful. Then, we’ll either refine our existing model or build a completely new one, refit, and evaluate its performance.
As you’ll see in the results, the performance isn’t bad: the R Squared is just over 69%, which rounds up to the 70% threshold often treated as good enough.
# clean and type and flatten
clean_data <- function(data, vars_to_remove) {
  data <- data[, !(names(data) %in% vars_to_remove)]
  data <- as.data.frame(data)
  data[] <- lapply(data, function(x) if (is.list(x)) unlist(x) else x)
  return(data)
}
# important features
importance_df <- as.data.frame(importance_measures$importance)
importance_df$Variable <- rownames(importance_df)
importance_df <- importance_df[order(importance_df$Overall), ]

# cut off at less than importance_factor %
importance_factor <- 0.3
cutoff <- quantile(importance_df$Overall, importance_factor)
variables_to_remove <- importance_df$Variable[importance_df$Overall <= cutoff]

training_data_updated <- clean_data(training_data, c(variables_to_remove, "x", "y", "ph"))
# check the data types of variables after cleaning and removal.

trainControl2 <- trainControl(
  method = "cv",            # <- this is how you get your model to cross validate!
  number = 3,               # <- three fold validation
  verboseIter = FALSE,
  savePredictions = "final",
  returnResamp = "all"
)
tuneGrid2 <- expand.grid(
  mtry = c(2, floor(sqrt(ncol(training_data$x))))
)
n_tree <- 2000

rfModel_updated <- train(
  x = training_data_updated,
  y = training_data$y,
  method = "rf",
  trControl = trainControl2,
  tuneGrid = tuneGrid2,
  ntree = n_tree
)

# prepare and clean prediction data
predictor_data_for_model <- clean_data(testing_data, c(variables_to_remove, "x", "y"))

rfPred_updated <- predict(rfModel_updated, newdata = predictor_data_for_model)
rfMetrics_updated <- postResample(pred = rfPred_updated, obs = testing_data$y)

print(variables_to_remove) #<- columns we removed before retraining
Model 1 Performance
Let us now take a quick look at this last run and see how much it improved.
Not bad: removing those columns and adjusting the hyperparameters made a noticeable difference.
updated_model_performance <- data.frame(
  Model = paste0("RF Tuned (if:ntree ", importance_factor, ":", n_tree, ")"),
  RMSE = rfMetrics_updated[1],
  Rsquared = rfMetrics_updated[2],
  MAE = rfMetrics_updated[3]
)
modelPerformance <- rbind(modelPerformance, updated_model_performance)
print(modelPerformance, row.names = FALSE) #<- a big improvement
                       Model       RMSE  Rsquared        MAE
                         KNN 0.11404559 0.5338995 0.08391264
                        MARS 0.12989181 0.3880553 0.09911886
              Neural Network 7.55429390        NA 7.55248157
                         SVM 0.10794695 0.5757274 0.07821117
          Random Forest (RF) 0.09950230 0.6682969 0.07138323
      Linear Regression (LR) 0.13283985 0.3584526 0.10245441
RF Tuned (if:ntree 0.3:2000) 0.09509905 0.6917606 0.06705064
print(variables_to_remove) #<- columns we removed before retraining
[1] "pscco2" "pscfill" "carbtemp" "psc"
[5] "carbpressure" "hydpressure4" "fillounces" "pressuresetpoint"
[9] "pcvolume" "hydpressure2"
Residuals Plot Review
This chart shows the differences between the pH predicted by Model 1 and the actual pH from the dataset. Residual points appear randomly distributed around the zero line, which is ideal: it suggests the model exhibits homoscedasticity, meaning it does not suffer from non-constant variance. There are also no apparent patterns or trends in the residuals, which implies the relationship between predictors and target is captured reasonably well.
However, there are some outliers that we need to address.
predictions <- predict(rfModel_updated, newdata = predictor_data_for_model)
residuals <- testing_data$y - predictions
residuals_df <- data.frame(Predicted = predictions, Residuals = residuals)
residuals_df <- na.omit(residuals_df)
ggplot(residuals_df, aes(x = Predicted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals")
RF Model 2, Outliers Removed
Random Forest deals reasonably well with outliers, but let’s remove just a few here to stabilize the model a little more. We’ll use an IQR strategy to remove only those outliers that are extremely distant, specifically values more than five times the interquartile range beyond the first or third quartile.
Dropping records simply because they are outliers is not generally a preferred strategy, but we are already dropping some records because they have null fields. These might be valid records; we simply don’t know. We should look into each of them to figure out why, and try to account for that variance in some other way.
From a process perspective, let’s use our previous test/train split with unimportant columns removed. All we’ll do is remove a few of the records furthest from the mean.
And obviously, this drastically stabilized the model, as the performance table below shows.
iqr_multiplier <- 5

filter_outliers <- function(data) {
  num_cols <- sapply(data, is.numeric) # Identify numeric columns
  before_count <- nrow(data)
  data <- data %>%
    filter(if_all(where(is.numeric), ~ {
      q1 <- quantile(., 0.25, na.rm = TRUE)
      q3 <- quantile(., 0.75, na.rm = TRUE)
      iqr <- q3 - q1
      # worked example with multiplier = 1 and quartiles at 25/75 (IQR = 50): lower bound = 25 - 50 = -25
      lower_bound <- q1 - iqr_multiplier * iqr
      upper_bound <- q3 + iqr_multiplier * iqr
      . >= lower_bound & . <= upper_bound
    }))
  after_count <- nrow(data)
  return(list(data = data, before = before_count, after = after_count))
}
# apply outlier filter
filtered_training_data <- filter_outliers(training_data)
filtered_testing_data <- filter_outliers(testing_data)

# counts
training_data_filtered <- filtered_training_data$data
testing_data_filtered <- filtered_testing_data$data
train_before <- filtered_training_data$before
train_after <- filtered_training_data$after
test_before <- filtered_testing_data$before
test_after <- filtered_testing_data$after

training_data_updated <- clean_data(training_data_filtered, c(variables_to_remove, "x", "y", "ph"))
testing_data_for_model <- clean_data(testing_data_filtered, c(variables_to_remove, "x", "y"))

# ensure 'y'
training_data_updated$y <- training_data_filtered$ph
testing_data_for_model$y <- testing_data_filtered$ph

rfModel_filtered <- train(
  x = training_data_updated,
  y = training_data_updated$y,
  method = "rf",
  trControl = trainControl2,
  tuneGrid = tuneGrid2,
  ntree = n_tree
)

rfPred_filtered <- predict(rfModel_filtered, newdata = testing_data_for_model)
rfMetrics_filtered <- postResample(pred = rfPred_filtered, obs = testing_data_for_model$y)

filtered_model_performance <- data.frame(
  Model = paste0("RF Tuned FilteredOutliers (if:ntree ", importance_factor, ":", n_tree, ")"),
  RMSE = rfMetrics_filtered[1],
  Rsquared = rfMetrics_filtered[2],
  MAE = rfMetrics_filtered[3]
)
Model Performance
As expected, removing some of these records produces a much better fit. This does not mean this model is better; on the contrary, it might be worse. We simply don’t know.
modelPerformance <- rbind(modelPerformance, filtered_model_performance)
print(modelPerformance, row.names = FALSE)
                                        Model       RMSE  Rsquared        MAE
                                          KNN 0.11404559 0.5338995 0.08391264
                                         MARS 0.12989181 0.3880553 0.09911886
                               Neural Network 7.55429390        NA 7.55248157
                                          SVM 0.10794695 0.5757274 0.07821117
                           Random Forest (RF) 0.09950230 0.6682969 0.07138323
                       Linear Regression (LR) 0.13283985 0.3584526 0.10245441
                 RF Tuned (if:ntree 0.3:2000) 0.09509905 0.6917606 0.06705064
RF Tuned FilteredOutliers (if:ntree 0.3:2000) 0.03016909 0.9763720 0.01757425
Overfitting
The comparison between train and test data indicates that the model generalizes well, maintaining high accuracy on both datasets. The minor drop in R Squared on the test data is expected and does not raise concerns about overfitting, reflecting excellent model performance.
However, despite the model’s strong performance and lack of overfitting issues, we still have reservations about the potential arbitrariness of forcefully dropping records that may or may not be outliers.
evaluate_model_performance <- function(model, training_data, testing_data) {
  calc_performance <- function(data, model) {
    predictions <- predict(model, newdata = data)
    postResample(pred = predictions, obs = data$y)
  }

  train_performance <- calc_performance(training_data, model)
  test_performance <- calc_performance(testing_data, model)

  performance <- data.frame(
    DataSet = c("Training", "Testing"),
    RMSE = c(train_performance[1], test_performance[1]),
    Rsquared = c(train_performance[2], test_performance[2]),
    MAE = c(train_performance[3], test_performance[3])
  )
  return(performance)
}

model_performance <- evaluate_model_performance(rfModel_filtered, training_data_updated, testing_data_for_model)
print(model_performance)
   DataSet       RMSE  Rsquared        MAE
1 Training 0.01345271 0.9962199 0.00787308
2  Testing 0.03016909 0.9763720 0.01757425
Record Removal Visualization
How many outliers do we have to remove to stabilize that model? The more we remove, the more we risk over-fitting. Our goal was to remove as few records as reasonably possible.
counts_data <- data.frame(
  Stage = rep(c("Before", "After"), each = 2),
  Dataset = rep(c("Training", "Testing"), 2),
  Count = c(train_before, test_before, train_after, test_after)
)
counts_data$Stage <- factor(counts_data$Stage, levels = c("Before", "After"))

ggplot(counts_data, aes(x = Stage, y = Count, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(
    title = "Record Counts Before and After Outlier Filtering",
    x = "",
    y = "Record Count",
    fill = "Dataset"
  ) +
  geom_text(aes(label = Count, group = Dataset), position = position_dodge(width = 0.9), vjust = -0.25)
Predictions
We’ll use Model 1 for predictions. Although removing outliers can improve fits and visualizations, it may be too aggressive. This decision should ultimately rest with the stakeholders.
To predict, we will first impute missing data, then proceed to predict pH levels, and finally, write the results to a CSV file.
Imputation
Before predicting, we must address the missing fields in the dataset for which we are to predict pH levels. We’ll use a simple linear regression model to estimate these missing fields.
impute_linear_regression <- function(data) {
  # For each column ...
  for (col in names(data)) {
    # ... that is numeric and has any missing values
    if (is.numeric(data[[col]]) && any(is.na(data[[col]]))) {
      # Identify predictor columns, excluding the current column and any columns with missing values
      predictors <- setdiff(names(data), col)
      predictors <- predictors[!sapply(data[predictors], function(x) any(is.na(x)))]

      # Extract rows where the target column is not missing for training the model
      train_data <- data[!is.na(data[[col]]), predictors, drop = FALSE]
      train_target <- data[[col]][!is.na(data[[col]])]

      # if predictors and sufficient data, fit a linear model
      if (nrow(train_data) > 1 && length(predictors) > 0) {
        lm_model <- lm(train_target ~ ., data = train_data)

        # find indices of missing values in the target column
        na_indices <- which(is.na(data[[col]]))
        if (length(na_indices) > 0) {
          # predict missing values
          predictions <- predict(lm_model, newdata = data[na_indices, predictors, drop = FALSE])
          # Replace missing values with the predictions
          data[[col]][na_indices] <- predictions
        }
      } else {
        # Use median imputation as a fallback when not enough data is available for regression
        median_value <- median(train_target, na.rm = TRUE)
        data[[col]][is.na(data[[col]])] <- median_value
      }
    }
  }
  return(data)
}
Build Predictions
Now that we have an imputation strategy, we can impute and predict.
<- c("ph", variables_to_remove)
all_vars_to_remove
<- predict_me %>%
predict_me_filled select(-all_of(all_vars_to_remove)) %>%
impute_linear_regression()
<- predict(rfModel_updated, newdata = predict_me_filled) #<- change this line to predict from 97% model.
predicted_ph $predicted_ph <- round(predicted_ph, 5) predict_me_filled
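As the comment above notes, switching to the outlier-filtered (97%) model is a one-line change: only the model object passed to predict() differs. The flag below is our own illustrative way of making that switch explicit, not part of the original workflow; whichever model is chosen must be given the same predictor columns it was trained on.

use_filtered_model <- FALSE   # illustrative switch; flip after stakeholder sign-off
chosen_model <- if (use_filtered_model) rfModel_filtered else rfModel_updated
# predicted_ph <- predict(chosen_model, newdata = predict_me_filled)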
Persist as CSV
As a last step, we’ll save this prediction set as a CSV file. The resulting spreadsheet contains 267 records, no null fields in any record (thanks to imputation), and a new column called predicted_ph.
predict_me_filled <- predict_me_filled %>% ## one last cleanup: persist the rounding before writing.
  mutate_if(is.numeric, round, digits = 5)
write.csv(predict_me_filled, "ph_predictions.csv", row.names = FALSE)
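A quick sanity check on the exported file (a hedged sketch; the expected values simply mirror the claims above) can confirm the record count, the absence of nulls, and the new column:

check <- read.csv("ph_predictions.csv")
nrow(check)                       # expect 267 records
sum(is.na(check))                 # expect 0 missing fields after imputation
"predicted_ph" %in% names(check)  # expect TRUE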