This report documents the end-to-end process of developing a predictive model for beverage pH at ABC Beverage. Following data cleaning, exploratory analysis, and scientific literature review, we implemented a rule-based model to forecast pH levels using operational variables. The project adheres to business requirements: simplicity, transparency, and regulatory clarity.
## Rows: 2,571
## Columns: 33
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
ggplot(data_clean, aes(x = PH)) +
geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
labs(title = "pH Distribution After Cleaning", x = "pH", y = "Frequency",
caption = "This histogram shows the distribution of pH values across 2,571 cleaned records. Most values cluster between 8.3 and 8.7, with minimal skewness (-0.31), indicating a nearly symmetrical dataset post-cleaning. This shape is ideal for applying both rule-based logic and statistical modeling.")
## Saving 7 x 5 in image
## [1] -0.3092434
data_clean <- data_clean %>%
mutate(Rule_PH = case_when(
`Carb Volume` > 5.5 & `Carb Pressure` > 70 ~ 7.2,
Balling < 3 & Density < 1 ~ 8.5,
`Oxygen Filler` > 0.03 ~ 7.9,
`Temperature` > 66 & `Carb Volume` > 5.4 ~ 7.5,
TRUE ~ 8.2
))
rmse_rule <- sqrt(mean((data_clean$PH - data_clean$Rule_PH)^2))
rmse_rule
## [1] 0.5834013
lm_model <- lm(PH ~ `Carb Volume` + Balling + `Oxygen Filler`, data = data_clean)
data_clean$LM_PH <- predict(lm_model)
rmse_lm <- sqrt(mean((data_clean$PH - data_clean$LM_PH)^2))
rmse_lm
## [1] 0.168921
cols_to_drop <- c("Rule_PH", "LM_PH", "RF_PH")
data_rf <- data_clean %>% select(-all_of(intersect(cols_to_drop, colnames(data_clean))))
predictors <- setdiff(names(data_rf), c("PH"))
data_rf <- data_rf %>% select(all_of(c("PH", predictors)))
data_rf <- data_rf %>% mutate(across(where(is.character), as.factor))
data_rf$`Brand Code` <- as.factor(data_rf$`Brand Code`)
names(data_rf) <- make.names(names(data_rf))
rf_model <- randomForest(PH ~ ., data = data_rf, na.action = na.omit)
data_clean$RF_PH <- predict(rf_model, newdata = data_rf)
rmse_rf <- sqrt(mean((data_clean$PH - data_clean$RF_PH)^2))
print(paste("RMSE for Random Forest:", round(rmse_rf, 4)))
## [1] "RMSE for Random Forest: 0.0387"
p_rf <- ggplot(data_clean, aes(x = PH, y = RF_PH)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Random Forest Prediction vs Actual pH",
x = "Actual pH", y = "Predicted pH",
caption = "Each point represents an observation’s predicted pH (via Random Forest) against the true pH. The red line shows the ideal 1:1 prediction line. While most points closely hug the diagonal, the model occasionally extends beyond the logic captured in the rule-based approach—suggesting it's remixing your rules, not replacing them.")
ggsave("random_forest_scatter.png", plot = p_rf)
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Even though Random Forest performed best numerically (RMSE 0.0387), the rule-based model remains the champion in terms of clarity, control, and regulatory readiness. It’s hand-coded logic based on known physical and chemical relationships. It’s explainable. It’s dependable. And frankly, it’s giving main character energy.
Random Forest, we respect you. You tried to remix the same rules and did it well. But you’re not the one we’d want explaining things at a board meeting.
comparison_df <- predictions_excel %>%
mutate(Index = row_number()) %>%
pivot_longer(cols = c(Rule_PH, LM_PH, RF_PH), names_to = "Model", values_to = "Predicted")
ggplot(comparison_df, aes(x = Index, y = Predicted, color = Model)) +
geom_line(alpha = 0.6) +
labs(title = "Model Predictions Across Observations",
x = "Observation Index",
y = "Predicted pH",
color = "Model",
caption = "Each model's predictions are plotted across 10 observations for side-by-side comparison.") +
theme_minimal()
This project explores predictive modeling of beverage pH using a rule-based approach grounded in production logic and scientific literature. The goal was to create an interpretable model suitable for both quality assurance and regulatory review.
Carb Volume > 5.5
and
Carb Pressure > 70
→ predicted pH =
7.2Balling < 3
and
Density < 1
→ predicted pH = 8.5Oxygen Filler > 0.03
→ predicted
pH = 7.9Temperature > 66
and
Carb Volume > 5.4
→ predicted pH =
7.5## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.2 scales_1.4.0 openxlsx_4.2.8
## [4] e1071_1.7-16 caret_7.0-1 lattice_0.22-6
## [7] ggplot2_3.5.2 tidyr_1.3.1 dplyr_1.1.4
## [10] readxl_1.4.5 readr_2.1.5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 pROC_1.18.5 digest_0.6.37
## [7] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.8-3 magrittr_2.0.3 compiler_4.4.3
## [13] rlang_1.1.5 sass_0.4.10 tools_4.4.3
## [16] yaml_2.3.10 data.table_1.17.2 knitr_1.50
## [19] labeling_0.4.3 bit_4.6.0 plyr_1.8.9
## [22] RColorBrewer_1.1-3 withr_3.0.2 purrr_1.0.4
## [25] nnet_7.3-20 grid_4.4.3 stats4_4.4.3
## [28] future_1.49.0 globals_0.18.0 iterators_1.0.14
## [31] MASS_7.3-64 cli_3.6.4 crayon_1.5.3
## [34] rmarkdown_2.29 ragg_1.4.0 generics_0.1.4
## [37] rstudioapi_0.17.1 future.apply_1.11.3 reshape2_1.4.4
## [40] tzdb_0.5.0 cachem_1.1.0 proxy_0.4-27
## [43] stringr_1.5.1 splines_4.4.3 parallel_4.4.3
## [46] cellranger_1.1.0 vctrs_0.6.5 hardhat_1.4.1
## [49] Matrix_1.7-2 jsonlite_2.0.0 hms_1.1.3
## [52] bit64_4.6.0-1 listenv_0.9.1 systemfonts_1.2.3
## [55] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
## [58] recipes_1.3.0 glue_1.8.0 parallelly_1.44.0
## [61] codetools_0.2-20 lubridate_1.9.4 stringi_1.8.7
## [64] gtable_0.3.6 tibble_3.2.1 pillar_1.10.2
## [67] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.1
## [70] R6_2.6.1 textshaping_1.0.1 vroom_1.6.5
## [73] evaluate_1.0.3 bslib_0.9.0 class_7.3-23
## [76] Rcpp_1.0.14 zip_2.3.3 nlme_3.1-167
## [79] prodlim_2025.04.28 mgcv_1.9-1 xfun_0.51
## [82] ModelMetrics_1.2.2.2 pkgconfig_2.0.3