knitr::opts_chunk$set(warning = FALSE, message = FALSE)

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Exploratory Data Analysis (EDA)

library(tidyverse)
library(readxl)
library(corrplot)
library(DataExplorer)
library(caret)
library(writexl)

Loading the training data

raw <- read_excel("Training Data.xlsx")

# Getting a general sense of the data
glimpse(raw)
## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…

The training set contains 2,571 observations and 33 variables, including the target variable PH. Apart from Brand Code, which is stored as a character (chr) type, all of the production factors are numeric.

Distribution of variables

Let’s get a better sense of the data by plotting the distribution of each variable.

# Distribution of numerical variables
plot_histogram(raw)

There is a mix of normal, skewed, and multimodal distributions across the production variables. Several predictors, such as Carb Pressure, Carb Temp, Carb Volume, Fill Ounces, PC Volume, and PSC, exhibit a roughly normal, unimodal distribution. In contrast, variables such as Density, Carb Rel, Carb Flow, Filler Speed, Balling, and Balling Lvl show a clear bimodal or multimodal pattern. Some predictors, most notably Oxygen Filler, MFR, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3, are heavily right-skewed, with many near-zero values and a long tail of larger values. These distributions suggest the presence of potential outliers, and some cleaning or transformation may be required before modeling.

# Bar plot for the categorical variable
ggplot(raw, aes(x = `Brand Code`)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Distribution of Brand Code",
    x = "Brand",
    y = "Frequency (Count)"
  )

The categorical variable Brand Code is unevenly distributed. Brand B accounts for almost half of all observations, while Brands A, C, and D are significantly smaller. Because of this imbalance, the usefulness of this variable will need to be evaluated during modeling, and it may be removed depending on its predictive contribution.

# Checking pH of each Brand
ggplot(raw, aes(x = `Brand Code`, y = PH)) +
  geom_boxplot() +
  facet_wrap(~ `Brand Code`, scales = "free") +
  theme_minimal()

These boxplots show that all brands have similar PH distributions centered around 8.5.

Missing Values

Let’s check for missing values in the training data set.

# Count the number of NA values in each column
missing_counts <- colSums(is.na(raw))
missing_counts
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##               120                10                38                39 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                27                26                33                23 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                39                 2                32                22 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                11                15                15                30 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                20                57                14                 5 
##         Carb Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 4                12                 2 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                12                 0                 9                10 
##       Balling Lvl 
##                 1

Most of the predictors have a small percentage of missing values (NA). The predictor MFR has the highest missing-value count at 212, about 8.2% of the total observations. The target variable, PH, also has 4 missing values. Before we begin modeling, these missing values will need to be addressed.

Let’s start by removing the 4 rows where the PH value is missing. We are removing them rather than imputing them because a model cannot be trained on rows that have no target to learn from, and imputing the target could introduce bias. Also, these 4 rows represent less than 1% of the total data, so removing them should not have any significant impact on the model’s performance.

The remaining missing values will be imputed with median values, since many of the variables are skewed or contain extreme values. Using the median ensures that the imputation does not distort the original distribution of the variables or bias the model.

# Removing the rows where PH is missing
df <- raw %>%
  drop_na(PH)

# Imputing the missing values with the median
df_imputed <- df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Count the number of NA values in each column
missing_counts <- colSums(is.na(df_imputed))
missing_counts
##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##               120                 0                 0                 0 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                 0                 0                 0                 0 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                 0                 0                 0                 0 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                 0                 0                 0                 0 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                 0                 0                 0                 0 
##         Carb Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 0                 0                 0 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                 0                 0                 0                 0 
##       Balling Lvl 
##                 0

There are still 120 missing values (~5%) in the categorical variable Brand Code. Imputing these with the most common brand would likely distort the data and bias the model. Instead, we replace the NA values with a new “Unknown” category.

# Replacing NA values with Unknown
df_imputed <- df_imputed %>%
  mutate(`Brand Code` = ifelse(is.na(`Brand Code`), "Unknown", `Brand Code`))

Outliers

Based on the distribution plots of the predictors, several variables appeared to contain potential outliers. These extreme values were visible in the histograms as long tails or wide ranges. To evaluate these observations, the interquartile range (IQR) rule will be applied to each numeric predictor to identify how many outliers each variable contains.

# Select only numeric columns
numeric_vars <- df_imputed %>%
  select(where(is.numeric))

# Flag outliers using the IQR rule
flag_outliers <- function(x) {
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- IQR(x, na.rm = TRUE)
  lower <- Q1 - 1.5 * IQR_val
  upper <- Q3 + 1.5 * IQR_val
  (x < lower) | (x > upper)
}

# Count number of outliers per variable and outlier percentage
outlier_summary <- map_dfr(names(numeric_vars), function(var) {
  x <- numeric_vars[[var]]
  flags <- flag_outliers(x)
  outlier_count <- sum(flags, na.rm = TRUE)
  outlier_pct <- (outlier_count / sum(!is.na(x))) * 100
  tibble(
    variable      = var,
    outlier_count = outlier_count,
    outlier_pct   = outlier_pct
  )
}) %>%
  arrange(desc(outlier_count))

outlier_summary
## # A tibble: 32 × 3
##    variable        outlier_count outlier_pct
##    <chr>                   <int>       <dbl>
##  1 Filler Speed              440       17.1 
##  2 MFR                       279       10.9 
##  3 Air Pressurer             220        8.57
##  4 Oxygen Filler             193        7.52
##  5 Pressure Vacuum           125        4.87
##  6 Temperature               122        4.75
##  7 PC Volume                  95        3.70
##  8 Hyd Pressure4              82        3.19
##  9 Fill Pressure              79        3.08
## 10 PSC CO2                    77        3.00
## # ℹ 22 more rows

The IQR analysis confirms that some predictors contain a large number of extreme values. Variables such as Filler Speed, MFR, Air Pressurer, and Oxygen Filler have the highest outlier counts, ranging from roughly 7% to 17% of their observations. This is consistent with the distribution plots, where each of these variables was heavily skewed. At this stage in the EDA, these outliers will not be removed; for now, we simply acknowledge that there are extreme values in the data that may require transformation during modeling.

Correlation plot

To identify the relationship between the numeric variables, we can create a correlation matrix.
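The chunk that produced the heatmap is not shown above, so here is a minimal sketch that reproduces it from the imputed data (the specific corrplot styling options are our own choices, not settings recovered from the original chunk):

# Correlation matrix of the numeric predictors
num_df <- df_imputed %>%
  select(where(is.numeric))
corr_mat <- cor(num_df)

# Heatmap of pairwise correlations
corrplot(corr_mat,
         method = "color",
         type = "lower",
         tl.col = "black",
         tl.cex = 0.6,
         mar = c(0, 0, 1, 0),
         title = "Correlation Heatmap of Numeric Predictors")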

This correlation heatmap shows that most predictors have a weak to moderate linear relationship with the target variable, pH. The strongest predictor of pH is Mnf Flow, with a correlation coefficient of -0.45, indicating that as manufacturing flow rate increases, pH tends to decrease. Bowl Setpoint has the highest positive correlation with pH, at 0.35.

The heatmap also reveals significant multicollinearity. As shown by the dark clusters in the heatmap, predictor pairs like Balling and Balling Lvl, Density and Alch Rel, and the Hyd Pressure variables have correlations close to 1.0. This redundancy can be problematic for linear regression models, as it makes it difficult to isolate the individual effects of these predictors. At the same time, no predictor shows a strong linear correlation (|r| > 0.8) with the target, which suggests that the relationship between the production factors and pH is complex and likely non-linear.
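To make that redundancy concrete, caret’s findCorrelation can list the predictors involved in highly correlated pairs; a quick sketch using the corr_mat computed above (the 0.90 cutoff is an assumption, not a value from the original analysis):

# Flag predictors whose pairwise |r| exceeds 0.90 as candidates for removal
high_corr <- findCorrelation(corr_mat, cutoff = 0.90, names = TRUE)
high_corr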

Machine Learning Prediction Experiments and Validation

The following code cells conduct experiments with several machine learning algorithms, tuned via grid search, to predict pH. The winning model, selected by lowest RMSE, will then be validated on the held-out test set to assess generalization.

Data Preparation for Modeling

Before running our algorithms, we need to prepare the data to ensure fair comparisons.

  1. Encoding Categorical Variables: Machine learning models (like SVM and XGBoost) require numerical input. We will convert Brand Code into dummy variables (one-hot encoding).

  2. Train/Test Split: We will split the provided historical data into a Training Set (80%) to build the models and a Test Set (20%) to act as an unseen validation set. This allows us to simulate how the model performs on new data.

  3. Preprocessing: We will center and scale the numeric variables, which is essential for models like SVM to function correctly. Unlike XGBoost, an ensemble of boosted decision trees that is insensitive to feature scale, SVM relies on distance calculations, so differing scales can inadvertently give some features more weight than others. Centering and scaling is harmless for tree-based models, so we apply it uniformly to keep the pipeline consistent.

set.seed(958) #Ensure reproducibility (958 = 9.58 seconds -> Usain Bolt's 100m world record. Lucky Number 😉)

#One-Hot Encoding for Brand Code
#We use model.matrix to expand factors to dummy variables
model_matrix = model.matrix(PH ~ . - 1, data = df_imputed) %>%
  as.data.frame()

#Bind the target variable PH back to the dataset
data_ready = cbind(PH = df_imputed$PH, model_matrix)

#Splitting the data (80% Train, 20% Test)
train_index = createDataPartition(data_ready$PH, p = 0.80, list = FALSE)
train_data = data_ready[train_index, ]
test_data  = data_ready[-train_index, ]

cat("Training Set Dimensions:", dim(train_data), "\n")
## Training Set Dimensions: 2055 37
cat("Test Set Dimensions:", dim(test_data))
## Test Set Dimensions: 512 37

Model Experiments

We will utilize caret to perform a grid search with Cross-Validation (5-fold). This technique splits the training data into 5 chunks, training on 4 and validating on the 5th, rotating through all chunks. This prevents us from selecting a model that just happens to memorize the training data (overfitting).

We are testing three distinct approaches:

  1. Multiple Linear Regression (Baseline): Assumes a linear relationship.

  2. Support Vector Machine (Radial Basis Function): Good for complex, non-linear boundaries.

  3. XGBoost (Gradient Boosting): An ensemble tree method that is currently the industry standard for high-performance tabular prediction.

#Set up Cross-Validation control
fit_control = trainControl(
  method = "cv",
  number = 5,
  savePredictions = "final",
  verboseIter = FALSE
)

#--- Model 1: Multiple Linear Regression (Baseline) ---
set.seed(958)
lm_model = train(
  PH ~ .,
  data = train_data,
  method = "lm",
  trControl = fit_control,
  preProcess = c("center", "scale")
)

#--- Model 2: Support Vector Machine (SVM) ---
#Grid Search for Cost (C) and Sigma
svm_grid = expand.grid(
  sigma = c(0.01, 0.05, 0.1),
  C = c(0.1, 1, 10)
)

set.seed(958)
svm_model = train(
  PH ~ .,
  data = train_data,
  method = "svmRadial",
  trControl = fit_control,
  tuneGrid = svm_grid,
  preProcess = c("center", "scale")
)

#--- Model 3: XGBoost (Gradient Boosting) ---
#Grid Search for Tree Depth, Learning Rate (eta), and Iterations
xgb_grid = expand.grid(
  nrounds = c(100, 200),
  max_depth = c(3, 6),
  eta = c(0.05, 0.1, 0.3),
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.8
)

set.seed(958)
xgb_model = train(
  PH ~ .,
  data = train_data,
  method = "xgbTree",
  trControl = fit_control,
  tuneGrid = xgb_grid,
  preProcess = c("center", "scale"),
  verbosity = 0
)

Model Selection

We evaluate the models primarily on RMSE (Root Mean Squared Error), which measures the typical magnitude of prediction error; lower values indicate a better fit. We also report MAE (Mean Absolute Error), which is especially easy to interpret for non-technical business stakeholders, and R-squared, the share of variance in pH that the model explains.
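For reference, with $y_i$ the observed pH, $\hat{y}_i$ the model’s prediction, and $n$ the number of observations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$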

#Collect results
results = resamples(list(
  Linear_Reg = lm_model,
  SVM = svm_model,
  XGBoost = xgb_model
))

#Summarize the results
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: Linear_Reg, SVM, XGBoost 
## Number of resamples: 5 
## 
## MAE 
##                  Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## Linear_Reg 0.09831584 0.10173017 0.10316179 0.10252985 0.10432545 0.10511601
## SVM        0.08218187 0.08289437 0.08316028 0.08424167 0.08555756 0.08741428
## XGBoost    0.07179784 0.07275801 0.07614461 0.07556900 0.07821149 0.07893304
##            NA's
## Linear_Reg    0
## SVM           0
## XGBoost       0
## 
## RMSE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Linear_Reg 0.12535439 0.1312758 0.1350440 0.1328562 0.1352401 0.1373669    0
## SVM        0.10969862 0.1147114 0.1157187 0.1156254 0.1161199 0.1218785    0
## XGBoost    0.09336151 0.1032859 0.1048307 0.1047195 0.1058988 0.1162205    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Linear_Reg 0.3902414 0.3986441 0.4034004 0.4128602 0.4353893 0.4366256    0
## SVM        0.4987920 0.5596984 0.5702012 0.5579778 0.5750533 0.5861439    0
## XGBoost    0.5460049 0.6405128 0.6465753 0.6342370 0.6516944 0.6863974    0
#Visualize the comparison
bwplot(results, metric = "RMSE", main = "Model Comparison: RMSE (Lower is Better)")

#Extract best RMSE values to a table
#Note: min(RMSE) and max(Rsquared) are taken over the full tuning grid, so the
#two columns can come from different hyperparameter settings; they are shown
#side by side only as a coarse comparison.
model_comparison = data.frame(
  Model = c("Linear Regression", "SVM", "XGBoost"),
  RMSE = c(min(lm_model$results$RMSE), min(svm_model$results$RMSE), min(xgb_model$results$RMSE)),
  Rsquared = c(max(lm_model$results$Rsquared), max(svm_model$results$Rsquared), max(xgb_model$results$Rsquared))
) %>%
  arrange(RMSE)

knitr::kable(model_comparison, caption = "Final Model Performance Comparison")
Table: Final Model Performance Comparison

|Model             |      RMSE|  Rsquared|
|:-----------------|---------:|---------:|
|XGBoost           | 0.1047195| 0.6357002|
|SVM               | 0.1156254| 0.5579778|
|Linear Regression | 0.1328562| 0.4128602|

Selection Decision: Based on the empirical evidence from cross-validation, the XGBoost model yielded the lowest RMSE. The Linear Regression baseline performed markedly worse, confirming our hypothesis from the EDA that the relationships between manufacturing factors and pH are non-linear and complex. We will proceed with the winning model (lowest RMSE) for final validation.

Final Validation on Test Set

Now that we have selected our “Winner” model, we must validate it against the 20% holdout Test Set that the model has never seen. This gives us a realistic expectation of performance in the actual production environment.

#Determine the winner automatically
best_model_name = model_comparison$Model[1]
cat("The selected winner model is:", best_model_name, "\n")
## The selected winner model is: XGBoost
#Define the model object based on the winner
#(assigned manually here; this should match best_model_name above)
final_model = xgb_model

# Make predictions on the Test Set
predictions = predict(final_model, newdata = test_data)

# Calculate Final Metrics
postResample(pred = predictions, obs = test_data$PH)
##       RMSE   Rsquared        MAE 
## 0.10293945 0.63635542 0.07788711

Visualization of Predictions

To visually assess the quality of our predictions, we will plot the Actual vs. Predicted pH and the Distribution of Residuals (Errors). Ideally, the residuals should be normally distributed around zero.

#Create a dataframe for plotting
val_results = data.frame(
  Actual = test_data$PH,
  Predicted = predictions
) %>%
  mutate(Residuals = Actual - Predicted)

#Plot 1: Actual vs Predicted
p1 = ggplot(val_results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = paste0("Actual vs. Predicted pH (", best_model_name, ")"),
    subtitle = "Red dashed line represents perfect prediction",
    x = "Actual pH",
    y = "Predicted pH"
  )

#Plot 2: Residuals Histogram
p2 = ggplot(val_results, aes(x = Residuals)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white", alpha = 0.7) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Distribution of Residuals",
    subtitle = "Ideally centered at 0",
    x = "Residual Error (Actual - Predicted)",
    y = "Count"
  )

#Display plots
print(p1)

print(p2)

Interpretation: The Actual vs. Predicted plot shows how closely our model tracks the true pH values (aside from one outlier). The tighter the points sit to the red dashed line, the better the accuracy. The residuals histogram shows the error distribution: a bell curve centered at zero indicates a healthy model that is not systematically over- or under-predicting.

Finally, we will export the predictions to an Excel-readable format. Below, we provide additional analysis of the predictions. Specifically, we look at feature importance from the XGBoost output and use SHAP (SHapley Additive exPlanations) to evaluate the impact of specific features on each prediction, allowing us to identify trends and, most importantly, actionable insights for the beverage process.

#Add ID column if needed, here we just export the comparison
output_df = val_results %>%
  select(Actual_PH = Actual, Predicted_PH = Predicted, Error = Residuals)

write_xlsx(output_df, "PH_Model_Predictions.xlsx")

cat("Predictions have been exported to 'PH_Model_Predictions.xlsx'.")
## Predictions have been exported to 'PH_Model_Predictions.xlsx'.

Feature Importance and SHAP Values

This section evaluates which production variables have the strongest impact on predicting pH, based on an XGBoost model refit for interpretability. We first present global feature importance using XGBoost’s built-in scoring, followed by SHAP (SHapley Additive exPlanations), which explains how each variable influences the model’s predictions.


# XGBoost training for interpretability
# (refit on the full imputed data with fixed hyperparameters; this model is
#  used only for the importance/SHAP analysis below, not for the predictions)
library(xgboost)
df2 <- df_imputed
dummies <- model.matrix(PH ~ . - 1, data = df2)
X <- dummies
y <- df2$PH
dmat <- xgb.DMatrix(data = X, label = y)
params <- list(objective = "reg:squarederror", eta = 0.1, max_depth = 6)
xgb_fit2 <- xgb.train(params = params, data = dmat, nrounds = 200, verbose = 0)

1. XGBoost Feature Importance

The XGBoost feature importance plot identifies the variables that contribute the most to predicting pH. Features at the top have the highest relative importance and drive the majority of the predictive power.

library(xgboost)
library(SHAPforxgboost)

importance_df <- xgb.importance(model = xgb_fit2)

xgb.plot.importance(
  importance_matrix = importance_df,
  top_n = 20,
  rel_to_first = TRUE,
  main = "Top 20 Feature Importances (XGBoost)",
  xlab = "Relative Importance"
)


2. SHAP Value Analysis

SHAP values measure how much each feature pushes the prediction higher or lower from the baseline pH value. This provides a more detailed, model-agnostic explanation of how the XGBoost model makes decisions.

Compute SHAP Values

shap_vals <- shap.values(
  xgb_model = xgb_fit2,
  X_train = X
)

shap_long <- shap.prep(
  xgb_model = xgb_fit2,
  X_train = X
)

SHAP Summary Plot

shap.plot.summary(shap_long)

The SHAP summary plot shows:

  • Purple points represent high feature values; yellow points represent low values.
  • Points to the right increase the predicted pH.
  • Points to the left decrease the predicted pH.
  • The horizontal spread of each row indicates the magnitude of a feature’s impact.
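Beyond the summary view, a per-feature dependence plot shows how an individual variable’s value maps onto its SHAP contribution. Below is a minimal sketch using the objects computed above; the top-ranked feature is pulled programmatically because model.matrix can leave backticks in the column names:

# Dependence plot for the highest-impact feature
# (mean_shap_score is sorted in decreasing order of mean |SHAP|)
top_feature <- names(shap_vals$mean_shap_score)[1]
shap.plot.dependence(data_long = shap_long, x = top_feature)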

Interpretation of Feature Importance and SHAP Values

Feature Importance Interpretation

The XGBoost feature importance results show that variables such as Mnf Flow, Usage cont, Brand Code C, Alch Rel, and Oxygen Filler are the strongest predictors of pH. These features have the highest relative importance scores, indicating that they consistently influence pH across the dataset.

SHAP Interpretation

SHAP provides a deeper view of how individual values of each predictor affect the pH prediction:

  • High feature values tend to increase pH for some variables and decrease it for others.
  • Low feature values push predictions in the opposite direction.
  • SHAP confirms that the relationship between production variables and pH is non-linear, supporting XGBoost’s superior performance.

Overall Interpretation

Together, feature importance and SHAP values show that pH is influenced by a combination of operational factors such as flow, pressure, chemical levels, and brand characteristics. SHAP adds clarity by showing not only which variables matter, but how they affect predictions. This supports operational decision-making by pinpointing which factors have the strongest influence on stabilizing pH during manufacturing.