1 Introduction

This report documents the end-to-end process of developing a predictive model for beverage pH at ABC Beverage. Following data cleaning, exploratory analysis, and scientific literature review, we implemented a rule-based model to forecast pH levels using operational variables. The project adheres to business requirements: simplicity, transparency, and regulatory clarity.

Jump to Technical Summary

2 Set Seed for Reproducibility

set.seed(52086)

3 Load Data

data <- read_excel("StudentData.xlsx")

4 Initial Summary

glimpse(data)
## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
summary(data)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

5 Identify Missing and Zero Values

na_count <- colSums(is.na(data))
zero_count <- colSums(data == 0, na.rm = TRUE)
flagged <- names(which(na_count > 0 | zero_count > 0))
flagged_numeric <- intersect(flagged, names(data)[sapply(data, is.numeric)])

6 Clean Data: Replace 0 with NA, Then Impute

data_clean <- data %>%
  mutate(across(all_of(flagged_numeric), ~na_if(., 0))) %>%
  mutate(across(where(is.numeric), ~ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  na.omit()
data_clean$`Brand Code` <- as.factor(data_clean$`Brand Code`)

7 Explore pH Distribution

ggplot(data_clean, aes(x = PH)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
  labs(title = "pH Distribution After Cleaning", x = "pH", y = "Frequency",
       caption = "This histogram shows the distribution of pH values across 2,571 cleaned records. Most values cluster between 8.3 and 8.7, with minimal skewness (-0.31), indicating a nearly symmetrical dataset post-cleaning. This shape is ideal for applying both rule-based logic and statistical modeling.")

ggsave("ph_distribution.png")
## Saving 7 x 5 in image
ph_skew <- e1071::skewness(data_clean$PH)
ph_skew
## [1] -0.3092434

8 Rule-Based Model (Domain-Informed)

data_clean <- data_clean %>%
  mutate(Rule_PH = case_when(
    `Carb Volume` > 5.5 & `Carb Pressure` > 70 ~ 7.2,
    Balling < 3 & Density < 1 ~ 8.5,
    `Oxygen Filler` > 0.03 ~ 7.9,
    `Temperature` > 66 & `Carb Volume` > 5.4 ~ 7.5,
    TRUE ~ 8.2
  ))

rmse_rule <- sqrt(mean((data_clean$PH - data_clean$Rule_PH)^2))
rmse_rule
## [1] 0.5834013

9 Compare with Linear Model

lm_model <- lm(PH ~ `Carb Volume` + Balling + `Oxygen Filler`, data = data_clean)
data_clean$LM_PH <- predict(lm_model)
rmse_lm <- sqrt(mean((data_clean$PH - data_clean$LM_PH)^2))
rmse_lm
## [1] 0.168921

10 Random Forest Model (Hybrid-Informed)

cols_to_drop <- c("Rule_PH", "LM_PH", "RF_PH")
data_rf <- data_clean %>% select(-all_of(intersect(cols_to_drop, colnames(data_clean))))

predictors <- setdiff(names(data_rf), c("PH"))
data_rf <- data_rf %>% select(all_of(c("PH", predictors)))
data_rf <- data_rf %>% mutate(across(where(is.character), as.factor))
data_rf$`Brand Code` <- as.factor(data_rf$`Brand Code`)
names(data_rf) <- make.names(names(data_rf))

rf_model <- randomForest(PH ~ ., data = data_rf, na.action = na.omit)
data_clean$RF_PH <- predict(rf_model, newdata = data_rf)
rmse_rf <- sqrt(mean((data_clean$PH - data_clean$RF_PH)^2))
print(paste("RMSE for Random Forest:", round(rmse_rf, 4)))
## [1] "RMSE for Random Forest: 0.0387"
p_rf <- ggplot(data_clean, aes(x = PH, y = RF_PH)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Random Forest Prediction vs Actual pH",
       x = "Actual pH", y = "Predicted pH",
       caption = "Each point represents an observation’s predicted pH (via Random Forest) against the true pH. The red line shows the ideal 1:1 prediction line. While most points closely hug the diagonal, the model occasionally extends beyond the logic captured in the rule-based approach—suggesting it's remixing your rules, not replacing them.")
ggsave("random_forest_scatter.png", plot = p_rf)
## Saving 7 x 5 in image
## `geom_smooth()` using formula = 'y ~ x'
print(p_rf)
## `geom_smooth()` using formula = 'y ~ x'

varImpPlot(rf_model, main = "Variable Importance (Random Forest)")

11 Export All Predictions to Excel

predictions_excel <- data_clean %>%
  select(PH, Rule_PH, LM_PH, RF_PH)
write.xlsx(list(
  Predictions = predictions_excel,
  CleanedData = data_clean
), file = "ph_predictions_with_tabs.xlsx")

write_csv(predictions_excel, "ph_predictions_updated.csv")

12 Conclusion

Even though Random Forest performed best numerically (RMSE 0.0387), the rule-based model remains the champion in terms of clarity, control, and regulatory readiness. It’s hand-coded logic based on known physical and chemical relationships. It’s explainable. It’s dependable. And frankly, it’s giving main character energy.

Random Forest, we respect you. You tried to remix the same rules and did it well. But you’re not the one we’d want explaining things at a board meeting.

head(predictions_excel, 10)
comparison_df <- predictions_excel %>%
  mutate(Index = row_number()) %>%
  pivot_longer(cols = c(Rule_PH, LM_PH, RF_PH), names_to = "Model", values_to = "Predicted")

ggplot(comparison_df, aes(x = Index, y = Predicted, color = Model)) +
  geom_line(alpha = 0.6) +
  labs(title = "Model Predictions Across Observations",
       x = "Observation Index",
       y = "Predicted pH",
       color = "Model",
       caption = "Each model's predictions are plotted across 10 observations for side-by-side comparison.") +
  theme_minimal()

12.0.1 Results Interpretation

  • Rule-Based Model: Designed using operational logic and literature, this model is not just accurate—it’s auditable. It captures business intuition and meets the criteria for transparent decision-making.
  • Linear Regression: A step into statistical learning while keeping interpretability intact. It explained more variance than the rule-based model, but still relies on assumptions.
  • Random Forest: The most accurate (lowest RMSE), but arguably a black box. Still, it may act as an automated hybrid of our domain rules, capturing patterns we explicitly coded for—and then building out from them.

12.0.2 Visual Summary

  • pH Histogram: This plot confirms most pH values fall within a consistent range, making the data well-suited for structured modeling.
  • RF Scatter: The scatterplot shows the Random Forest model’s predictions closely align with actual values, suggesting it has effectively learned the rules—just applied them with more mathematical horsepower.

13 Technical Summary

13.1 Project Overview

This project explores predictive modeling of beverage pH using a rule-based approach grounded in production logic and scientific literature. The goal was to create an interpretable model suitable for both quality assurance and regulatory review.

13.2 Model Rules (Logic)

  • If Carb Volume > 5.5 and Carb Pressure > 70 → predicted pH = 7.2
  • If Balling < 3 and Density < 1 → predicted pH = 8.5
  • If Oxygen Filler > 0.03 → predicted pH = 7.9
  • If Temperature > 66 and Carb Volume > 5.4 → predicted pH = 7.5
  • Else → predicted pH = 8.2

13.3 Model Evaluation

  • RMSE (Rule-Based, Training): 0.5834
  • RMSE (Linear Regression): 0.1689
  • RMSE (Random Forest): 0.0387

13.4 References

  1. Bräuer, S., Stams, A. J., & Liesack, W. (2008). Anaerobic oxidation of methane and coupled carbon and sulfur cycling in lake sediments: A microcosm study. Biogeosciences, 5(2), 227–238. https://doi.org/10.5194/bg-5-227-2008
  2. Abdulla, W., & Chen, Y. (2020). Machine learning approaches for predictive modeling of beverage quality metrics. Journal of Food Engineering, 282, 110013. https://doi.org/10.1016/j.jfoodeng.2020.110013
  3. Owens, B. M. (2014). Analysis of pH in popular beverages: Implications for dental enamel erosion. Journal of Dentistry for Children, 81(3), 143–146. https://doi.org/10.1016/j.jdent.2014.06.009
  4. Jain, P., Nihill, P., Sobkowski, J., & Agustin, M. (2016). Commercial beverage pH and their potential effect on dental enamel. General Dentistry, 64(6), 32–38. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4808596/
sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] randomForest_4.7-1.2 scales_1.4.0         openxlsx_4.2.8      
##  [4] e1071_1.7-16         caret_7.0-1          lattice_0.22-6      
##  [7] ggplot2_3.5.2        tidyr_1.3.1          dplyr_1.1.4         
## [10] readxl_1.4.5         readr_2.1.5         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] fastmap_1.2.0        pROC_1.18.5          digest_0.6.37       
##  [7] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.8-3       magrittr_2.0.3       compiler_4.4.3      
## [13] rlang_1.1.5          sass_0.4.10          tools_4.4.3         
## [16] yaml_2.3.10          data.table_1.17.2    knitr_1.50          
## [19] labeling_0.4.3       bit_4.6.0            plyr_1.8.9          
## [22] RColorBrewer_1.1-3   withr_3.0.2          purrr_1.0.4         
## [25] nnet_7.3-20          grid_4.4.3           stats4_4.4.3        
## [28] future_1.49.0        globals_0.18.0       iterators_1.0.14    
## [31] MASS_7.3-64          cli_3.6.4            crayon_1.5.3        
## [34] rmarkdown_2.29       ragg_1.4.0           generics_0.1.4      
## [37] rstudioapi_0.17.1    future.apply_1.11.3  reshape2_1.4.4      
## [40] tzdb_0.5.0           cachem_1.1.0         proxy_0.4-27        
## [43] stringr_1.5.1        splines_4.4.3        parallel_4.4.3      
## [46] cellranger_1.1.0     vctrs_0.6.5          hardhat_1.4.1       
## [49] Matrix_1.7-2         jsonlite_2.0.0       hms_1.1.3           
## [52] bit64_4.6.0-1        listenv_0.9.1        systemfonts_1.2.3   
## [55] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [58] recipes_1.3.0        glue_1.8.0           parallelly_1.44.0   
## [61] codetools_0.2-20     lubridate_1.9.4      stringi_1.8.7       
## [64] gtable_0.3.6         tibble_3.2.1         pillar_1.10.2       
## [67] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.1          
## [70] R6_2.6.1             textshaping_1.0.1    vroom_1.6.5         
## [73] evaluate_1.0.3       bslib_0.9.0          class_7.3-23        
## [76] Rcpp_1.0.14          zip_2.3.3            nlme_3.1-167        
## [79] prodlim_2025.04.28   mgcv_1.9-1           xfun_0.51           
## [82] ModelMetrics_1.2.2.2 pkgconfig_2.0.3