knitr::opts_chunk$set(warning = FALSE, message = FALSE)
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
library(tidyverse)
library(readxl)
library(corrplot)
library(DataExplorer)
library(caret)
library(writexl)
raw <- read_excel("Training Data.xlsx")
# Getting a general sense of the data
glimpse(raw)
## Rows: 2,571
## Columns: 33
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
The training set contains 2,571 observations and 33 variables,
including the target variable PH. Aside from the
Brand Code variable, which is stored as a character type, all of the
production factors are numeric.
Let’s get a better sense of the data by plotting their distributions.
# Distribution of numerical variables
plot_histogram(raw)
There is a mix of normal, skewed, and multimodal distributions across
the production variables. Several predictors, such as
Carb Pressure, Carb Temp,
Carb Volume, Fill Ounces,
PC Volume, and PSC, exhibit a roughly normal, unimodal
distribution. In contrast, variables such as Density,
Carb Rel, Carb Flow, Filler Speed,
Balling, and Balling Lvl show a clear bimodal
or multimodal pattern. Some predictors, most notably
Oxygen Filler, MFR,
Hyd Pressure1, Hyd Pressure2, and
Hyd Pressure3, are heavily right-skewed, with many near-zero
values and a long tail of larger values. These distributions suggest the
presence of potential outliers, and cleaning may be required before
modeling.
# Bar plot for the categorical variable
ggplot(raw, aes(x = `Brand Code`)) +
geom_bar() +
theme_minimal() +
labs(
title = "Distribution of Brand Code",
x = "Brand",
y = "Frequency (Count)"
)
The categorical variable Brand Code is unevenly
distributed: Brand B accounts for almost half of all observations,
while Brands A, C, and D are much smaller. Because of this
imbalance, the usefulness of this variable will need to
be evaluated during modeling, and it may be removed depending on its
predictive contribution.
# Checking pH of each Brand
ggplot(raw, aes(x = `Brand Code`, y = PH)) +
geom_boxplot() +
facet_wrap(~ `Brand Code`, scales = "free") +
theme_minimal()
These boxplots show that all brands have similar PH distributions, centered around 8.5.
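A quick per-brand summary makes that comparison concrete (a minimal sketch; the column name median_PH is ours):
raw %>%
group_by(`Brand Code`) %>%
summarise(n = n(), median_PH = median(PH, na.rm = TRUE))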
Let’s check for missing values in the training data set.
# Count the number of NA values in each column
missing_counts <- colSums(is.na(raw))
missing_counts
## Brand Code Carb Volume Fill Ounces PC Volume
## 120 10 38 39
## Carb Pressure Carb Temp PSC PSC Fill
## 27 26 33 23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 39 2 32 22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 11 15 15 30
## Filler Level Filler Speed Temperature Usage cont
## 20 57 14 5
## Carb Flow Density MFR Balling
## 2 1 212 1
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 0 4 12 2
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 12 0 9 10
## Balling Lvl
## 1
Most of the predictors have a small percentage of missing values
(NA). The predictor MFR has the highest missing-value count
at 212, about 8.2% of the total observations. The target variable,
PH, also has 4 missing values. Before we begin modeling,
these missing values will first need to be addressed.
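The per-column missing share behind these percentages can be computed directly (a sketch):
# Share of missing values per column, as a percentage (highest first)
head(sort(round(colMeans(is.na(raw)) * 100, 1), decreasing = TRUE), 10)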
Let’s start by removing the 4 rows where the
PH value is missing. We remove them rather than
impute them because a model cannot learn from observations that have no
target, and imputing the target itself could introduce bias. Also, the 4 rows
represent less than 1% of the total data, so removing them should not
have any significant impact on the model’s performance.
For the other missing values, we will impute with the median, since many of the variables are skewed or contain extreme values. Because the median is robust to outliers, imputation neither distorts the original distributions of the variables nor biases the model.
# Removing the rows where PH is missing
df <- raw %>%
drop_na(PH)
# Imputing the missing values with the median
df_imputed <- df %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Count the number of NA values in each column
missing_counts <- colSums(is.na(df_imputed))
missing_counts
## Brand Code Carb Volume Fill Ounces PC Volume
## 120 0 0 0
## Carb Pressure Carb Temp PSC PSC Fill
## 0 0 0 0
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 0 0 0 0
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 0 0 0 0
## Filler Level Filler Speed Temperature Usage cont
## 0 0 0 0
## Carb Flow Density MFR Balling
## 0 0 0 0
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 0 0 0 0
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 0 0 0 0
## Balling Lvl
## 0
There are still 120 missing values (~5%) in the categorical variable
Brand Code. Imputing these missing values with the most
common brand would likely distort the data and bias the model. Instead,
we can replace the NA values with a new “Unknown” category.
# Replacing NA values with Unknown
df_imputed <- df_imputed %>%
mutate(`Brand Code` = ifelse(is.na(`Brand Code`), "Unknown", `Brand Code`))
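A quick check confirms the recode took effect:
# Sanity check: "Unknown" should now appear as a level, with no NAs left
table(df_imputed$`Brand Code`, useNA = "ifany")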
Based on the distribution plots of the predictors, several variables appeared to contain potential outliers. These extreme values were visible in the histograms as long tails or wide ranges. To evaluate these observations, the interquartile range (IQR) rule will be applied to each numeric predictor to identify how many outliers each variable contains.
# Select only numeric columns
numeric_vars <- df_imputed %>%
select(where(is.numeric))
# Flag outliers using the IQR rule
flag_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- IQR(x, na.rm = TRUE)
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
(x < lower) | (x > upper)
}
# Apply the function across all numeric columns
outlier_flags <- numeric_vars %>%
mutate(across(everything(), flag_outliers, .names = "outlier_{col}"))
# Count number of outliers per variable and outlier percentage
outlier_summary <- map_dfr(names(numeric_vars), function(var) {
x <- numeric_vars[[var]]
flags <- flag_outliers(x)
outlier_count <- sum(flags, na.rm = TRUE)
outlier_pct <- (outlier_count / sum(!is.na(x))) * 100
tibble(
variable = var,
outlier_count = outlier_count,
outlier_pct = outlier_pct
)
}) %>%
arrange(desc(outlier_count))
outlier_summary
## # A tibble: 32 × 3
## variable outlier_count outlier_pct
## <chr> <int> <dbl>
## 1 Filler Speed 440 17.1
## 2 MFR 279 10.9
## 3 Air Pressurer 220 8.57
## 4 Oxygen Filler 193 7.52
## 5 Pressure Vacuum 125 4.87
## 6 Temperature 122 4.75
## 7 PC Volume 95 3.70
## 8 Hyd Pressure4 82 3.19
## 9 Fill Pressure 79 3.08
## 10 PSC CO2 77 3.00
## # ℹ 22 more rows
The IQR analysis confirms that some predictors contain a large number
of extreme values. Variables such as Filler Speed,
MFR, Air Pressurer, and
Oxygen Filler have the highest outlier counts, ranging from
roughly 7% to 17% of their observations. This is consistent with the
distribution plots, where each of these variables was heavily skewed. At
this stage in the EDA, these outliers will not be removed. For now, we
simply acknowledge that there are extreme values in the data that may
require transformation during modeling.
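If transformations are needed later, one option is caret’s Yeo-Johnson transform, which reduces skew and tolerates the zero-heavy pressure variables. This is shown only as a sketch (the objects pp_yj and transformed_vars are illustrative and not used downstream):
# Sketch only: estimate Yeo-Johnson transforms for the numeric predictors
pp_yj <- preProcess(numeric_vars %>% select(-PH), method = "YeoJohnson")
transformed_vars <- predict(pp_yj, numeric_vars %>% select(-PH))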
To identify the relationship between the numeric variables, we can create a correlation matrix.
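A minimal version of the heatmap code, using the corrplot package loaded earlier (the styling of the original figure may differ):
# Correlation matrix of all numeric variables (numeric_vars from above),
# clustered so related predictors sit together in the heatmap
corr_mat <- cor(numeric_vars)
corrplot(corr_mat, method = "color", type = "lower", order = "hclust",
tl.col = "black", tl.cex = 0.6)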
This correlation heatmap shows that most predictors have a weak to
moderate linear relationship with the target variable, pH.
The strongest predictor for pH is Mnf Flow,
with a correlation coefficient of -0.45, indicating that as
manufacturing flow rate increases, pH tends to
decrease. Bowl Setpoint has the highest positive correlation with pH,
with a coefficient of 0.35.
The heatmap also reveals significant multicollinearity. As shown by
the dark clusters in the heatmap, predictors like Balling
and Balling Lvl, Density and
Alch Rel, and the Hyd Pressure variables have
correlations close to 1.0. This redundancy can be problematic for
linear regression models because it obscures the individual
effects of these predictors. Moreover, since no predictor has a
strong linear correlation (|r| > 0.8) with the target variable, the
relationship between the production factors and pH is likely
complex and non-linear.
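Using the corr_mat computed above, caret’s findCorrelation can flag the redundant predictors programmatically (a sketch; pred_corr simply drops the PH row and column):
# Flag predictor pairs above the |r| = 0.8 threshold noted above
pred_corr <- corr_mat[rownames(corr_mat) != "PH", colnames(corr_mat) != "PH"]
findCorrelation(pred_corr, cutoff = 0.8, names = TRUE)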
The following code cells run experiments with several machine learning algorithms, tuned via grid search, to predict pH. The winning model, selected by lowest RMSE, will then be validated on the test set to assess generalization.
Before running our algorithms, we need to prepare the data to ensure fair comparisons.
Encoding Categorical Variables: Machine learning models (like SVM and XGBoost) require numerical input. We will convert Brand Code into dummy variables (one-hot encoding).
Train/Test Split: We will split the provided historical data into a Training Set (80%) to build the models and a Test Set (20%) to act as an unseen validation set. This allows us to simulate how the model performs on new data.
Preprocessing: We will center and scale the numeric variables, which is essential for models like SVM to function correctly. Unlike XGBoost (an ensemble of decision trees built sequentially), SVM relies on distance metrics, so differing feature scales can inadvertently give some features more weight than others. Even for tree-based models like XGBoost, where scaling is not strictly required, centering and scaling is generally good practice, so we apply the same preprocessing to all models.
set.seed(958) #Ensure reproducibility (958 = 9.58 seconds -> Usain Bolt's 100m world record. Lucky Number 😉)
#One-Hot Encoding for Brand Code
#We use model.matrix to expand factors to dummy variables
model_matrix = model.matrix(PH ~ . - 1, data = df_imputed) %>%
as.data.frame()
#Bind the target variable PH back to the dataset
data_ready = cbind(PH = df_imputed$PH, model_matrix)
#Splitting the data (80% Train, 20% Test)
train_index = createDataPartition(data_ready$PH, p = 0.80, list = FALSE)
train_data = data_ready[train_index, ]
test_data = data_ready[-train_index, ]
cat("Training Set Dimensions:", dim(train_data), "\n")
## Training Set Dimensions: 2055 37
cat("Test Set Dimensions:", dim(test_data))
## Test Set Dimensions: 512 37
We will utilize caret to perform a grid search with Cross-Validation (5-fold). This technique splits the training data into 5 chunks, training on 4 and validating on the 5th, rotating through all chunks. This prevents us from selecting a model that just happens to memorize the training data (overfitting).
We are testing three distinct approaches:
Multiple Linear Regression (Baseline): Assumes a linear relationship.
Support Vector Machine (Radial Basis Function): Good for complex, non-linear boundaries.
XGBoost (Gradient Boosting): An ensemble tree method that is currently the industry standard for high-performance tabular prediction.
#Set up Cross-Validation control
fit_control = trainControl(
method = "cv",
number = 5,
savePredictions = "final",
verboseIter = FALSE
)
#--- Model 1: Multiple Linear Regression (Baseline) ---
set.seed(958)
lm_model = train(
PH ~ .,
data = train_data,
method = "lm",
trControl = fit_control,
preProcess = c("center", "scale")
)
#--- Model 2: Support Vector Machine (SVM) ---
#Grid Search for Cost (C) and Sigma
svm_grid = expand.grid(
sigma = c(0.01, 0.05, 0.1),
C = c(0.1, 1, 10)
)
set.seed(958)
svm_model = train(
PH ~ .,
data = train_data,
method = "svmRadial",
trControl = fit_control,
tuneGrid = svm_grid,
preProcess = c("center", "scale")
)
#--- Model 3: XGBoost (Gradient Boosting) ---
#Grid Search for Tree Depth, Learning Rate (eta), and Iterations
xgb_grid = expand.grid(
nrounds = c(100, 200),
max_depth = c(3, 6),
eta = c(0.05, 0.1, 0.3),
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
set.seed(958)
xgb_model = train(
PH ~ .,
data = train_data,
method = "xgbTree",
trControl = fit_control,
tuneGrid = xgb_grid,
preProcess = c("center", "scale"),
verbosity = 0
)
We evaluate the models primarily on RMSE (Root Mean Squared Error), which measures the typical magnitude of the prediction error; lower values indicate a better fit. We also report MAE (Mean Absolute Error), which is easier to interpret, especially for non-technical business stakeholders, and R-squared.
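For reference, both metrics reduce to one-line formulas. The toy vectors below are made up purely to illustrate the calculation:
# Hand-rolled definitions of the two headline metrics (illustrative values)
obs_toy <- c(8.50, 8.60, 8.40)
pred_toy <- c(8.55, 8.52, 8.47)
sqrt(mean((obs_toy - pred_toy)^2)) # RMSE: penalizes large misses more heavily
mean(abs(obs_toy - pred_toy)) # MAE: average miss in raw pH units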
#Collect results
results = resamples(list(
Linear_Reg = lm_model,
SVM = svm_model,
XGBoost = xgb_model
))
#Summarize the results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: Linear_Reg, SVM, XGBoost
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Linear_Reg 0.09831584 0.10173017 0.10316179 0.10252985 0.10432545 0.10511601
## SVM 0.08218187 0.08289437 0.08316028 0.08424167 0.08555756 0.08741428
## XGBoost 0.07179784 0.07275801 0.07614461 0.07556900 0.07821149 0.07893304
## NA's
## Linear_Reg 0
## SVM 0
## XGBoost 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Linear_Reg 0.12535439 0.1312758 0.1350440 0.1328562 0.1352401 0.1373669 0
## SVM 0.10969862 0.1147114 0.1157187 0.1156254 0.1161199 0.1218785 0
## XGBoost 0.09336151 0.1032859 0.1048307 0.1047195 0.1058988 0.1162205 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Linear_Reg 0.3902414 0.3986441 0.4034004 0.4128602 0.4353893 0.4366256 0
## SVM 0.4987920 0.5596984 0.5702012 0.5579778 0.5750533 0.5861439 0
## XGBoost 0.5460049 0.6405128 0.6465753 0.6342370 0.6516944 0.6863974 0
#Visualize the comparison
bwplot(results, metric = "RMSE", main = "Model Comparison: RMSE (Lower is Better)")
#Extract best RMSE values to a table
#(Caveat: min(RMSE) and max(Rsquared) are pulled independently from the
#tuning results, so they may come from different hyperparameter rows)
model_comparison = data.frame(
Model = c("Linear Regression", "SVM", "XGBoost"),
RMSE = c(min(lm_model$results$RMSE), min(svm_model$results$RMSE), min(xgb_model$results$RMSE)),
Rsquared = c(max(lm_model$results$Rsquared), max(svm_model$results$Rsquared), max(xgb_model$results$Rsquared))
) %>%
arrange(RMSE)
knitr::kable(model_comparison, caption = "Final Model Performance Comparison")
| Model | RMSE | Rsquared |
|---|---|---|
| XGBoost | 0.1047195 | 0.6357002 |
| SVM | 0.1156254 | 0.5579778 |
| Linear Regression | 0.1328562 | 0.4128602 |
Selection Decision: Based on the empirical evidence from our cross-validation, the XGBoost model yielded the lowest RMSE. The Linear Regression baseline performed significantly worse, confirming our hypothesis from the EDA that the relationships between manufacturing factors and pH are non-linear and complex. We will proceed with the winner (lowest RMSE) for final validation.
Now that we have selected our “Winner” model, we must validate it against the 20% holdout Test Set that the model has never seen. This gives us a realistic expectation of performance in the actual production environment.
#Determine the winner automatically
best_model_name = model_comparison$Model[1]
cat("The selected winner model is:", best_model_name, "\n")
## The selected winner model is: XGBoost
#Define the model object based on the winner
final_model = xgb_model
# Make predictions on the Test Set
predictions = predict(final_model, newdata = test_data)
# Calculate Final Metrics
postResample(pred = predictions, obs = test_data$PH)
## RMSE Rsquared MAE
## 0.10293945 0.63635542 0.07788711
To visually assess the quality of our predictions, we will plot the Actual vs. Predicted pH and the Distribution of Residuals (Errors). Ideally, the residuals should be normally distributed around zero.
#Create a dataframe for plotting
val_results = data.frame(
Actual = test_data$PH,
Predicted = predictions
) %>%
mutate(Residuals = Actual - Predicted)
#Plot 1: Actual vs Predicted
p1 = ggplot(val_results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue", alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
theme_minimal() +
labs(
title = paste("Actual vs. Predicted pH (", best_model_name, ")"),
subtitle = "Red dashed line represents perfect prediction",
x = "Actual pH",
y = "Predicted pH"
)
#Plot 2: Residuals Histogram
p2 = ggplot(val_results, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "darkgreen", color = "white", alpha = 0.7) +
geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
theme_minimal() +
labs(
title = "Distribution of Residuals",
subtitle = "Ideally centered at 0",
x = "Residual Error (Actual - Predicted)",
y = "Count"
)
#Display plots
print(p1)
print(p2)
Interpretation: The Actual vs. Predicted plot shows how closely our model tracks the true pH values (besides one outlier). The tighter the points are to the red dashed line, the better the accuracy. The residuals histogram shows us the “error distribution.” The bell curve centered at zero indicates a healthy model that is not systematically over-predicting or under-predicting.
Finally, we will export the predictions to an Excel-readable format. Below, we provide additional analysis of the predictions. Specifically, we will look at feature importance from the XGBoost output, as well as SHAP (SHapley Additive exPlanations) values that evaluate the impact of specific features on each prediction, allowing us to identify trends and, most importantly, actionable insights for the beverage process.
#Add ID column if needed, here we just export the comparison
output_df = val_results %>%
select(Actual_PH = Actual, Predicted_PH = Predicted, Error = Residuals)
write_xlsx(output_df, "PH_Model_Predictions.xlsx")
cat("Predictions have been exported to 'PH_Model_Predictions.xlsx'.")
## Predictions have been exported to 'PH_Model_Predictions.xlsx'.
This section evaluates which production variables have the strongest impact on predicting pH, using an XGBoost model refit on the full imputed dataset for interpretability. We first present global feature importance using XGBoost’s built-in scoring, followed by SHAP (SHapley Additive exPlanations), which explains how each variable influences the model’s predictions.
# XGBoost training for interpretability
library(xgboost)
df2 <- df_imputed
dummies <- model.matrix(PH ~ . - 1, data = df2)
X <- dummies
y <- df2$PH
dmat <- xgb.DMatrix(data = X, label = y)
params <- list(objective="reg:squarederror", eta=0.1, max_depth=6)
xgb_fit2 <- xgb.train(params=params, data=dmat, nrounds=200, verbose=0)
The XGBoost feature importance plot identifies the variables that contribute the most to predicting pH. Features at the top have the highest relative importance and drive the majority of the predictive power.
library(xgboost)
library(SHAPforxgboost)
importance_df <- xgb.importance(model = xgb_fit2)
xgb.plot.importance(
importance_matrix = importance_df,
top_n = 20,
rel_to_first = TRUE,
main = "Top 20 Feature Importances (XGBoost)",
xlab = "Relative Importance"
)
SHAP values measure how much each feature pushes a given prediction higher or lower relative to the baseline (average) pH. This provides a more detailed, per-observation explanation of how the XGBoost model makes decisions.
shap_vals <- shap.values(
xgb_model = xgb_fit2,
X_train = X
)
shap_long <- shap.prep(
xgb_model = xgb_fit2,
X_train = X
)
shap.plot.summary(shap_long)
The XGBoost feature importance results show that variables such as Mnf Flow, Usage cont, Brand Code C, Alch Rel, and Oxygen Filler are the strongest predictors of pH. These features have the highest relative importance scores, indicating that they consistently influence pH across the dataset. The SHAP summary plot goes a step further, showing how individual values of each predictor push the pH prediction higher or lower; a dependence-plot sketch follows.
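As one possible follow-up (a sketch only: the exact one-hot column name should be verified against colnames(X), since non-syntactic names such as "Mnf Flow" may carry backticks after model.matrix), a SHAP dependence plot isolates how a single top driver moves the prediction:
# Dependence plot for one top driver; adjust x to match colnames(X)
shap.plot.dependence(data_long = shap_long, x = "Mnf Flow")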
Together, feature importance and SHAP values show that pH is influenced by a combination of operational factors such as flow, pressure, chemical levels, and brand characteristics. SHAP adds clarity by showing not only which variables matter, but how they affect predictions. This supports operational decision-making by pinpointing which factors have the strongest influence on stabilizing pH during manufacturing.