Technical Report: Predictive Modeling of pH for ABC Beverage

Authors: John Ferrara, Javier Pajuelo Bazan, Benson Yikseong Toi, Bikash Bhowmik, Jose Fuentes

28 Apr 2025

Introduction

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to be able to report to them our predictive model of pH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the Excel file showing the prediction of your models for pH.

Data Overview

The datasets provided include:

  • Training Data: 2,571 observations, 33 variables (32 predictors + pH).
  • Test Data: 267 observations, 32 predictors (no pH values).
  • Variables: Operational metrics (e.g., Mnf Flow, Carb Pressure, Temperature) and Brand Code (categorical).

The goal is to predict pH, a measure of beverage acidity, to support regulatory compliance and quality control.

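# Load packages used below: readxl (read_excel), dplyr (%>%, filter, select), mice (imputation)
library(readxl); library(dplyr); library(mice)

# Load datasets
train_data <- read_excel("StudentData.xlsx")
test_data <- read_excel("StudentEvaluation.xlsx")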

# Display head of datasets
head(train_data)
# A tibble: 6 × 33
  `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
  <chr>                <dbl>         <dbl>       <dbl>           <dbl>
1 B                     5.34          24.0       0.263            68.2
2 A                     5.43          24.0       0.239            68.4
3 B                     5.29          24.1       0.263            70.8
4 A                     5.44          24.0       0.293            63  
5 A                     5.49          24.3       0.111            67.2
6 A                     5.38          23.9       0.269            66.6
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
#   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
#   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
#   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
#   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
#   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
#   `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
head(test_data)
# A tibble: 6 × 33
  `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
  <chr>                <dbl>         <dbl>       <dbl>           <dbl>
1 D                     5.48          24.0       0.27             65.4
2 A                     5.39          24.0       0.227            63.2
3 B                     5.29          23.9       0.303            66.4
4 B                     5.27          23.9       0.186            64.8
5 B                     5.41          24.2       0.16             69.4
6 B                     5.29          24.1       0.212            73.4
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
#   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
#   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
#   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
#   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
#   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
#   `Pressure Vacuum` <dbl>, PH <lgl>, `Oxygen Filler` <dbl>, …
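
The `PH <lgl>` type in the test preview deserves a short note: a column with no observed values has no numeric type to infer, and in R an all-NA vector is stored as logical. The toy frames below (made-up values; `train_toy` and `test_toy` are hypothetical names) sketch a quick check that both files share the same columns and that the test target is entirely empty. That the real evaluation file behaves this way is inferred from the preview above and has not been verified here.

```r
# Toy stand-ins for the two files (values are made up).
train_toy <- data.frame(`Carb Volume` = c(5.34, 5.43), PH = c(8.36, 8.26),
                        check.names = FALSE)
test_toy  <- data.frame(`Carb Volume` = c(5.48, 5.39), PH = c(NA, NA),
                        check.names = FALSE)

# Do both frames have the same set of columns?
same_columns <- setequal(names(train_toy), names(test_toy))

# An all-NA column is stored as logical, which is why a preview shows <lgl>.
ph_class <- class(test_toy$PH)

# Confirm that PH carries no information in the test frame.
ph_all_missing <- all(is.na(test_toy$PH))

print(list(same_columns = same_columns, ph_class = ph_class,
           ph_all_missing = ph_all_missing))
```

A check like this is cheap to run before cleaning, because a mismatch in column names between the two files would otherwise only surface at prediction time.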

Data Cleaning and Preparation

We performed the following steps to prepare the data:

  1. Remove Missing pH: Dropped 4 training rows where pH was missing.
  2. Handle Missing Values:
    • Dropped rows with ≥ 4 missing values (20 in training, 6 in test).
    • Imputed remaining missing values using MICE (Predictive Mean Matching for numeric, CART for Brand Code).
  3. Convert Brand Code: Changed Brand Code to a factor, ensuring consistent levels.
  4. Split Training Data: 80/20 split for training and validation.
  5. Centering and Scaling: Applied within modeling functions to handle different units.
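
Steps 1 to 3 can be sketched on a toy data frame. This is a minimal base-R illustration with made-up values, not the report's data; the names `toy_train`, `brand`, and `new_brand` are hypothetical. It shows the two row filters and why reusing the training factor levels matters: a brand value that was never seen in training becomes `NA` rather than a new level.

```r
# Toy training frame (made-up values); brand plays the role of Brand Code.
toy_train <- data.frame(
  PH    = c(8.4, NA, 8.5, 8.3, 8.6),
  a     = c(1, 2, NA, NA, 5),
  b     = c(NA, 2, 3, NA, 5),
  c     = c(1, 2, 3, NA, 5),
  brand = c("A", "B", "C", NA, "B"),
  stringsAsFactors = FALSE
)

# Step 1: drop rows where the target is missing.
toy_ph <- toy_train[!is.na(toy_train$PH), ]

# Step 2: keep rows with fewer than 4 missing cells.
toy_clean <- toy_ph[rowSums(is.na(toy_ph)) < 4, ]

# Step 3: convert brand to a factor and reuse its levels for new data.
toy_clean$brand <- factor(toy_clean$brand)
brand_levels <- levels(toy_clean$brand)
new_brand <- factor(c("B", "E", NA), levels = brand_levels)  # "E" is unseen

print(nrow(toy_clean))   # number of rows kept
print(brand_levels)      # levels learned from the cleaned training rows
print(is.na(new_brand))  # the unseen level and the original NA are both missing
```

Any value turned into `NA` this way is then treated like any other missing Brand Code by a later imputation step.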

The data contained missing values that had to be handled before modeling. First, rows in the training set where the target variable PH was missing were removed, since they cannot contribute to model training. Next, rows with four or more missing values were dropped from both the training and test datasets to limit the uncertainty introduced by imputation. The remaining missing values were imputed with MICE (multivariate imputation by chained equations), run with 10 imputations. Numeric variables were imputed using Predictive Mean Matching, and the single categorical variable, Brand Code, was imputed with a Classification and Regression Tree (CART) method within the same MICE run. For the training set, the first of the 10 completed datasets was used as the working data. The result is a fully imputed dataset for both the training and test sets, which allows a complete and consistent modeling pipeline.
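
To make Predictive Mean Matching concrete, here is a simplified single-variable sketch in base R. It is not the `mice` implementation: it skips the random draw of regression parameters that a full multiple-imputation procedure uses, and the function name `pmm_impute`, the donor count `k = 5`, and the synthetic data are all illustrative assumptions. The key property it demonstrates is that every imputed value is copied from an observed row, so imputations always stay within the range of real measurements.

```r
set.seed(100)

# Synthetic data: y has missing values, x is fully observed.
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
y[sample(n, 8)] <- NA

# Simplified predictive mean matching for one variable:
#   1. fit a linear model on the observed rows,
#   2. predict y for every row,
#   3. for each missing row, find the k observed rows whose predictions are
#      closest (the donors) and copy the observed value of one random donor.
pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)
  pred <- predict(fit, newdata = data.frame(x = x))
  y_imp <- y
  for (i in which(!obs)) {
    gap <- abs(pred[obs] - pred[i])
    donors <- order(gap)[seq_len(min(k, sum(obs)))]
    pick <- donors[sample.int(length(donors), 1)]
    y_imp[i] <- y[obs][pick]
  }
  y_imp
}

y_filled <- pmm_impute(y, x)
print(sum(is.na(y_filled)))  # no missing values remain
```

Because donors are real observations, this approach avoids implausible values (for example, negative pressures) that a plain regression prediction could produce.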

Lastly, to compare models on data with known outcomes, the training data from “StudentData.xlsx” was split 80/20 into training and validation sets. As noted above, the evaluation dataset (“StudentEvaluation.xlsx”) contains no PH values, so it cannot be used to measure performance; the held-out 20% with known PH values was used instead to gauge performance and select a model. Predictors were also centered and scaled to account for differing measurement units; this was done inside the modeling functions rather than as a separate preprocessing step.
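
The split and the scaling can be sketched in base R as follows. This is an illustration on synthetic data, not the code used for the report, which does not show its split in this section; the column names `flow` and `temp` are hypothetical. The point of the sketch is the order of operations: centering and scaling parameters are learned from the training rows only and then applied unchanged to the validation rows, so no information from the held-out data leaks into preprocessing.

```r
set.seed(100)

# Synthetic stand-in for the imputed training data (not the real data).
df <- data.frame(
  PH   = rnorm(200, mean = 8.5, sd = 0.2),
  flow = rnorm(200, mean = 100, sd = 30),
  temp = rnorm(200, mean = 66, sd = 1.5)
)

# 80/20 split by random row index.
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train_set <- df[train_idx, ]
valid_set <- df[-train_idx, ]

# Learn centering and scaling parameters on the training rows only.
predictors <- setdiff(names(df), "PH")
mu    <- colMeans(train_set[predictors])
sigma <- apply(train_set[predictors], 2, sd)

# Apply the same parameters to both splits.
train_x <- scale(train_set[predictors], center = mu, scale = sigma)
valid_x <- scale(valid_set[predictors], center = mu, scale = sigma)

print(dim(train_x))
print(dim(valid_x))
```

Modeling frameworks that accept a preprocessing option typically follow this same pattern internally, which is consistent with the report's choice to scale within the modeling functions.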

# Remove rows with missing pH
train_nonullPH <- train_data %>% filter(!is.na(PH))

# Drop rows with 4 or more missing values
missing_per_row <- rowSums(is.na(train_nonullPH))
train_cleaned <- train_nonullPH[missing_per_row < 4, ]

# Convert Brand Code to factor
train_cleaned$`Brand Code` <- as.factor(train_cleaned$`Brand Code`)
train_levels <- levels(train_cleaned$`Brand Code`)

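# Impute missing values with MICE
init <- mice(train_cleaned, maxit = 0)  # dry run to get the default method per column (pmm for numeric)
meth <- init$method
meth["Brand Code"] <- "cart"            # use CART for the categorical Brand Code
imputed_data <- mice(train_cleaned, method = meth, m = 10, seed = 100)  # 10 imputations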

 iter imp variable
  1   1  Brand Code  Carb Volume  Fill Ounces  PC Volume  Carb Pressure  Carb Temp  PSC  PSC Fill  PSC CO2  Carb Pressure1  Fill Pressure  Hyd Pressure1  Hyd Pressure2  Hyd Pressure3  Hyd Pressure4  Filler Level  Filler Speed  Temperature  Usage cont  Carb Flow  MFR  Oxygen Filler  Bowl Setpoint  Pressure Setpoint  Alch Rel  Carb Rel
  ...  (rows for the remaining iteration/imputation pairs omitted; each lists the same variables)
  5   10  Brand Code  Carb Volume  Fill Ounces  PC Volume  Carb Pressure  Carb Temp  PSC  PSC Fill  PSC CO2  Carb Pressure1  Fill Pressure  Hyd Pressure1  Hyd Pressure2  Hyd Pressure3  Hyd Pressure4  Filler Level  Filler Speed  Temperature  Usage cont  Carb Flow  MFR  Oxygen Filler  Bowl Setpoint  Pressure Setpoint  Alch Rel  Carb Rel
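# Use the first of the 10 imputed datasets as the working training data
working_data <- complete(imputed_data, 1)

# Drop test rows with 4 or more missing values
# Note: PH is still present at this point and is empty in the test set,
# so it counts as one missing value in every row
missing_per_row_test <- rowSums(is.na(test_data))
test_cleaned <- test_data[missing_per_row_test < 4, ]

# Drop the empty PH column
test_cleaned <- test_cleaned %>% select(-any_of("PH"))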

# Convert Brand Code to factor
test_cleaned$`Brand Code` <- factor(test_cleaned$`Brand Code`, levels = train_levels)

# Impute missing values (same approach as for the training data)
init_test <- mice(test_cleaned, maxit = 0)
meth_test <- init_test$method
meth_test["Brand Code"] <- "cart"
meth_test <- meth_test[names(meth_test) != "PH"]  # safeguard only; PH was already dropped above
imputed_data_test <- mice(test_cleaned, method = meth_test, m = 10, seed = 100)

 iter imp variable
  1   1  Brand Code  Carb Volume  Fill Ounces  PC Volume  Carb Temp  PSC  PSC Fill  PSC CO2  Carb Pressure1  Hyd Pressure2  Hyd Pressure3  Hyd Pressure4  Filler Speed  Usage cont  MFR  Pressure Setpoint  Alch Rel
  ...  (rows for the intermediate iteration/imputation pairs omitted; each lists the same variables)
  5   9  Brand Code  Carb Volume  Fill Ounces  PC Volume  Carb Temp  PSC  PSC Fill  PSC CO2  Carb Pressure1  Hyd Pressure2  Hyd Pressure3  Hyd Pressure4  Filler Speed  Usage cont  MFR  Pressure Setpoint  Alch Rel
  5   10  Brand Code  Carb Volume  Fill Ounces  PC Volume  Carb Temp  PSC  PSC Fill  PSC CO2  Carb Pressure1  Hyd Pressure2  Hyd Pressure3  Hyd Pressure4  Filler Speed  Usage cont  MFR  Pressure Setpoint  Alch Rel
working_data_test <- complete(imputed_data_test, 1)

# Verify no missing values
if (anyNA(working_data_test)) {
  stop("Imputation failed: missing values remain in working_data_test")
}

# Verify dimensions (expect 261 rows, 32 columns)
print("Dimensions of working_data_test after imputation:")
[1] "Dimensions of working_data_test after imputation:"
print(dim(working_data_test))  # Should be [261, 32]
[1] 261  32
# Verify Brand.Code levels
print("Brand.Code levels in working_data_test:")
[1] "Brand.Code levels in working_data_test:"
print(levels(working_data_test$`Brand Code`))
[1] "A" "B" "C" "D"
# Verify no PH column
if ("PH" %in% colnames(working_data_test)) {
  stop("PH column found in working_data_test")
}

Exploratory Data Analysis

We conducted exploratory analysis to understand the data and the relationships among variables, starting with summary statistics, a correlation matrix, and a series of pairwise plots built with GGally. The correlation matrix revealed several strong linear relationships. For instance, Carb Temp and Carb Pressure are highly correlated with each other, as are Density and Balling. This indicates multicollinearity within the dataset, which can destabilize coefficient estimates in linear models. We took no corrective steps for the tree-based models because the predictions of Random Forest and XGBoost are generally not degraded by correlated predictors, although correlated features can share importance between them. Lastly, some variables showed skewed or non-normal distributions; we did not consider this problematic because tree-based models do not assume normally distributed predictors.

Summary Statistics

summary(working_data)
 Brand Code  Carb Volume     Fill Ounces      PC Volume       Carb Pressure  
 A: 296     Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
 B:1292     1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23933   1st Qu.:65.60  
 C: 347     Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
 D: 612     Mean   :5.371   Mean   :23.97   Mean   :0.27761   Mean   :68.22  
            3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200   3rd Qu.:70.60  
            Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
   Carb Temp          PSC             PSC Fill         PSC CO2       
 Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:138.4   1st Qu.:0.05000   1st Qu.:0.1000   1st Qu.:0.02000  
 Median :140.8   Median :0.07800   Median :0.1800   Median :0.04000  
 Mean   :141.1   Mean   :0.08504   Mean   :0.1961   Mean   :0.05647  
 3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
 Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
    Mnf Flow       Carb Pressure1  Fill Pressure   Hyd Pressure1  
 Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   :-0.80  
 1st Qu.:-100.00   1st Qu.:118.8   1st Qu.:46.00   1st Qu.: 0.00  
 Median :  84.20   Median :123.0   Median :46.40   Median :11.60  
 Mean   :  25.22   Mean   :122.5   Mean   :47.93   Mean   :12.56  
 3rd Qu.: 140.90   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:20.40  
 Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :58.00  
 Hyd Pressure2   Hyd Pressure3   Hyd Pressure4     Filler Level  
 Min.   : 0.00   Min.   :-1.20   Min.   : 62.00   Min.   : 55.8  
 1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 97.7  
 Median :28.80   Median :27.80   Median : 96.00   Median :118.4  
 Mean   :21.12   Mean   :20.58   Mean   : 96.41   Mean   :109.2  
 3rd Qu.:34.80   3rd Qu.:33.40   3rd Qu.:102.00   3rd Qu.:120.0  
 Max.   :59.40   Max.   :50.00   Max.   :142.00   Max.   :161.2  
  Filler Speed   Temperature      Usage cont      Carb Flow       Density     
 Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.340  
 1st Qu.:3866   1st Qu.:65.20   1st Qu.:18.38   1st Qu.:1174   1st Qu.:0.900  
 Median :3980   Median :65.60   Median :21.82   Median :3030   Median :0.980  
 Mean   :3658   Mean   :65.96   Mean   :21.00   Mean   :2481   Mean   :1.175  
 3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.76   3rd Qu.:3188   3rd Qu.:1.620  
 Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920  
      MFR           Balling      Pressure Vacuum        PH       
 Min.   : 31.4   Min.   :0.346   Min.   :-6.600   Min.   :7.880  
 1st Qu.:696.3   1st Qu.:1.496   1st Qu.:-5.600   1st Qu.:8.440  
 Median :721.8   Median :1.648   Median :-5.400   Median :8.540  
 Mean   :673.3   Mean   :2.202   Mean   :-5.216   Mean   :8.545  
 3rd Qu.:730.5   3rd Qu.:3.292   3rd Qu.:-5.000   3rd Qu.:8.680  
 Max.   :868.6   Max.   :4.012   Max.   :-3.600   Max.   :8.940  
 Oxygen Filler     Bowl Setpoint   Pressure Setpoint Air Pressurer  
 Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
 1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
 Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
 Mean   :0.04639   Mean   :109.3   Mean   :47.62     Mean   :142.8  
 3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
 Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
    Alch Rel        Carb Rel      Balling Lvl   
 Min.   :6.320   Min.   :4.960   Min.   :0.000  
 1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.380  
 Median :6.560   Median :5.400   Median :1.480  
 Mean   :6.898   Mean   :5.436   Mean   :2.053  
 3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.140  
 Max.   :8.620   Max.   :6.060   Max.   :3.660  

Correlations

We analyzed correlations among the numeric variables to assess multicollinearity, flagging any pair with an absolute correlation above 0.75 as highly correlated.

Summary: Multicollinearity is present. Twenty variable pairs have absolute correlations above 0.75; the strongest is Balling and Balling Lvl at 0.98. XGBoost, our selected model, is robust to multicollinearity in terms of predictive accuracy, but these relationships should be considered when interpreting feature importance and when optimizing the process.

# Select numeric columns (drops Brand Code; PH is retained)
numeric_data <- working_data %>% select(-`Brand Code`)

# Compute correlation matrix
cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# Find high correlations (> 0.75)
cor_pairs <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)
high_cor <- data.frame(
  Var1 = rownames(cor_matrix)[cor_pairs[, 1]],
  Var2 = colnames(cor_matrix)[cor_pairs[, 2]],
  Correlation = cor_matrix[cor_pairs]
) %>% arrange(desc(abs(Correlation)))

# Display high correlations
knitr::kable(high_cor, caption = "Predictor Pairs with Correlations > 0.75")
Predictor Pairs with Correlations > 0.75

Var1             Var2             Correlation
Balling          Balling Lvl      0.9798955
Density          Balling          0.9548024
Filler Level     Bowl Setpoint    0.9509117
Density          Balling Lvl      0.9490347
Filler Speed     MFR              0.9419364
Balling          Alch Rel         0.9260832
Alch Rel         Balling Lvl      0.9246620
Hyd Pressure2    Hyd Pressure3    0.9244579
Density          Alch Rel         0.9037947
Carb Rel         Balling Lvl      0.8492018
Alch Rel         Carb Rel         0.8480637
Density          Carb Rel         0.8298294
Carb Pressure    Carb Temp        0.8283693
Balling          Carb Rel         0.8279466
Carb Volume      Carb Rel         0.8019557
Carb Volume      Balling Lvl      0.7867955
Carb Volume      Balling          0.7863082
Carb Volume      Alch Rel         0.7837527
Carb Volume      Density          0.7667585
Mnf Flow         Hyd Pressure3    0.7585844

# Visualize correlation matrix
corrplot(cor_matrix, method = "number", tl.cex = 0.8, number.cex = 0.6, 
         diag = FALSE, title = "Correlation Matrix of Numeric Predictors")

Relationships Among Key Predictors

We visualized relationships among the top predictors (Mnf Flow, Carb Pressure, Brand Code) and pH to understand their impact on the target variable.

  • Mnf Flow and Carb Pressure have moderate negative correlations with pH (roughly -0.3 to -0.6), consistent with their high importance in XGBoost.
  • Brand Code introduces recipe-specific patterns, with different flow and pressure settings affecting pH.
  • Non-linear patterns are consistent with XGBoost’s strong performance (validation RMSE of about 0.114).

# Key predictors and PH
key_vars <- working_data %>% select(`Mnf Flow`, `Carb Pressure`, `Brand Code`, PH)

# Plot relationships
print(ggpairs(key_vars, progress = FALSE, 
              mapping = aes(color = `Brand Code`), 
              title = "Pairwise Relationships of Mnf Flow, Carb Pressure, Brand Code, and pH"))

Histograms

Distributions varied:

  • Normal: Carb Pressure, Carb Temp, pH.
  • Skewed: Mnf Flow, Oxygen Filler.
  • Bi-Modal: Density, Balling.

The histograms show that Oxygen Filler is right-skewed and that Mnf Flow is strongly non-normal: the summary statistics place at least a quarter of its values at about -100. Outliers in these variables may contribute to the slight bias in the XGBoost residuals (clustering around -0.2; validation RMSE of about 0.114). Other predictors, such as Carb Pressure and Carb Temp, appear approximately normal, which supports model stability. Negative values in Mnf Flow and Hyd Pressure1 and outliers in Density and Oxygen Filler indicate potential data quality issues that should be addressed for reliable predictions.
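To back the visual impression of skewness with a number, a moment-based skewness statistic can be computed per column. The sketch below uses base R only; the helper name `skewness` and the two toy vectors are hypothetical and not part of the original analysis.

```r
# Moment-based sample skewness: positive = long right tail, negative = long left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical vectors for illustration only
right_tailed <- c(0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.40)
symmetric    <- c(-2, -1, 0, 1, 2)

print(skewness(right_tailed))
print(skewness(symmetric))
```

Applied to the report's data, something like `sapply(hist_data, skewness)` would give one value per numeric column, assuming `hist_data` contains only numeric columns.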

hist_data <- working_data %>% select(-`Brand Code`)
gather_features <- hist_data %>% gather(key = "features", value = "value")
ggplot(gather_features) +
  geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) +  # after_stat() replaces the deprecated ..density..
  geom_density(aes(x = value), color = "green") +
  facet_wrap(.~features, scales = "free", ncol = 4)

Scatter Plots

Relationships between pH and the predictors were linear, roughly quadratic, or absent.

The scatter plots below show pH on the y-axis against each numeric predictor on the x-axis, with a smoothed trend line. Many predictors contain outliers.

For several predictors the trend line is nearly horizontal (slope close to 0), meaning pH stays roughly constant across the predictor's range and the predictor shows little marginal relationship with pH.

Other predictors show a curved, roughly quadratic relationship with pH.

numeric_data <- working_data %>% select_if(is.numeric)

# Lattice theme: semi-transparent points and a red smoother line
theme1 <- trellis.par.get()
theme1$plot.symbol$col <- rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch <- 16
theme1$plot.line$col <- rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
theme1$fontsize$text <- 7

trellis.par.set(theme1)
# Plot every numeric predictor against PH (PH itself is excluded from x)
caret::featurePlot(x = numeric_data %>% select(-PH),
                   y = numeric_data$PH,
                   type = c("p", "smooth"),
                   span = 0.5)

Modeling

We tested six models to predict pH:

  • Decision Tree (unpruned and pruned)
  • Random Forest
  • Support Vector Machine (SVM)
  • Multivariate Adaptive Regression Splines (MARS)
  • XGBoost
  • Linear Model

Data Split

set.seed(123)
split_index <- createDataPartition(working_data$PH, p = 0.8, list = FALSE)
train_data_split <- working_data[split_index, ]
test_data_split <- working_data[-split_index, ]

# Clean column names
colnames(train_data_split) <- make.names(colnames(train_data_split))
colnames(test_data_split) <- make.names(colnames(test_data_split))
colnames(working_data_test) <- make.names(colnames(working_data_test))

Feature Engineering

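We added two interaction terms, Mnf Flow × Carb Pressure and Mnf Flow × Brand Code, to capture complex relationships and improve model performance. The same terms are added to the training split, the validation split, and the evaluation data.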

train_data_split <- train_data_split %>%
  mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
         Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
test_data_split <- test_data_split %>%
  mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
         Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
working_data_test <- working_data_test %>%
  mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
         Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
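
One caveat about the second term is worth illustrating. `as.numeric()` on a factor returns the integer level codes (A = 1, B = 2, and so on), so `Mnf.Flow * as.numeric(Brand.Code)` implies an ordering among brands. The sketch below shows this on toy data and, as a possible alternative rather than what was done in this report, builds one flow-by-brand column per level with `model.matrix()`. The data frame `toy` and its values are hypothetical.

```r
# Toy data: hypothetical values, for illustration only
toy <- data.frame(
  Brand.Code = factor(c("A", "B", "C", "D", "B")),
  Mnf.Flow   = c(-100, 84, 140, 20, 60)
)

# as.numeric() on a factor returns level codes, not a measured quantity
codes <- as.numeric(toy$Brand.Code)
print(codes)

# Interaction as built in the report: flow scaled by the level code
ordinal_term <- toy$Mnf.Flow * codes
print(ordinal_term)

# Alternative sketch: one flow column per brand, with no implied ordering
per_brand <- model.matrix(~ Brand.Code:Mnf.Flow - 1, data = toy)
print(colnames(per_brand))
```

Tree-based models can often recover brand-specific effects from the one-hot brand columns on their own, so whether the per-brand encoding helps is an empirical question.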

Model Training

Decision Tree

We removed one predictor from each highly correlated pair (absolute correlation > 0.8) to simplify the tree.

tree_train <- train_data_split %>% select(-c(Balling, Filler.Level, Hyd.Pressure2, Density, Filler.Speed, Carb.Rel, Carb.Temp))
tree_model <- rpart(PH ~ ., data = tree_train, method = "anova", control = rpart.control(minsplit = 10, cp = 0.01))
pruned_tree <- prune(tree_model, cp = 0.011)
rpart.plot(pruned_tree, box.palette = "auto", nn = TRUE)

Random Forest

set.seed(123)
rf_model <- randomForest(PH ~ ., data = train_data_split, ntree = 500, importance = TRUE)

Support Vector Machine

set.seed(123)
svm_model <- svm(PH ~ ., data = train_data_split, type = "eps-regression", kernel = "radial", scale = TRUE)

MARS

set.seed(123)
mars_model <- earth(PH ~ ., data = train_data_split, degree = 2)

XGBoost

train_features <- model.matrix(PH ~ . -1, data = train_data_split)
train_labels <- train_data_split$PH
test_features <- model.matrix(PH ~ . -1, data = test_data_split)
test_labels <- test_data_split$PH
dtrain <- xgb.DMatrix(data = train_features, label = train_labels)
dtest <- xgb.DMatrix(data = test_features, label = test_labels)

params <- list(
  objective = "reg:squarederror",
  eta = 0.05,
  max_depth = 6,
  gamma = 0.3,
  subsample = 0.8,
  colsample_bytree = 0.8
)

set.seed(123)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 500, watchlist = list(test = dtest), verbose = 0)

Linear Model

ctrl <- trainControl(method = "cv", number = 10)
lm_model <- train(PH ~ ., data = train_data_split, method = "lm", trControl = ctrl)
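
For readers unfamiliar with what `trainControl(method = "cv", number = 10)` does, the following base R sketch spells out 10-fold cross-validation by hand. It uses the built-in `mtcars` data and an arbitrary formula as stand-ins, so the numbers have nothing to do with the pH data. Note that the cross-validated RMSE is a separate estimate from the hold-out RMSE reported in the model comparison.

```r
# Sketch of 10-fold cross-validation for a linear model, base R only.
# mtcars is a stand-in dataset; the formula is an arbitrary example.
set.seed(123)
k <- 10
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- mtcars[fold_id != i, ]
  held_out   <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  fold_rmse[i] <- sqrt(mean((held_out$mpg - pred)^2))
}

# The cross-validated RMSE is the average of the per-fold errors
cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```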

Model Evaluation

We evaluated models using Root Mean Squared Error (RMSE) on the validation set.

  • Models Evaluated: XGBoost, Random Forest, Support Vector Machine (SVM), Multivariate Adaptive Regression Splines (MARS), Decision Tree, Pruned Decision Tree, and Linear Model.
  • Primary Goal: Accurately predict pH using manufacturing process data and identify key factors influencing pH.
  • Best Models: Random Forest (RMSE = 0.0952) and XGBoost (RMSE = 0.1138) emerged as top performers, with XGBoost preferred for production due to its practical advantages.
  • Key Features: Mnf Flow, Carb Pressure, Brand Code (primary), and Balling Lvl, Oxygen Filler, Density (secondary) were identified as key drivers of pH.

Since the primary objective is to understand the manufacturing process, we begin with decision trees. As stated:
“.. that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.”

The Root Mean Squared Error (RMSE) is the square root of the average squared difference between the predicted and actual values, so it quantifies the typical size of a prediction error in the units of the target (here, pH units). A lower RMSE indicates predictions closer to the actual values and therefore a better fit. Because errors are squared before averaging, large errors are penalized more heavily than small ones.
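
As a minimal illustration of the definition, RMSE can be computed by hand in base R. The helper name `rmse` and the four pH values below are hypothetical; the report itself uses `caret::RMSE()`.

```r
# RMSE: square root of the mean squared difference between actual and predicted
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical pH values for illustration only
actual    <- c(8.40, 8.50, 8.60, 8.70)
predicted <- c(8.50, 8.50, 8.50, 8.50)

rmse_value <- rmse(actual, predicted)
print(rmse_value)
```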

# Decision Tree
tree_pred <- predict(tree_model, test_data_split %>% select(colnames(tree_train)))
tree_rmse <- RMSE(tree_pred, test_data_split$PH)

# Pruned Tree
ptree_pred <- predict(pruned_tree, test_data_split %>% select(colnames(tree_train)))
ptree_rmse <- RMSE(ptree_pred, test_data_split$PH)

# Random Forest
rf_pred <- predict(rf_model, test_data_split)
rf_rmse <- RMSE(rf_pred, test_data_split$PH)

# SVM
svm_pred <- predict(svm_model, test_data_split)
svm_rmse <- RMSE(svm_pred, test_data_split$PH)

# MARS
mars_pred <- predict(mars_model, test_data_split)
mars_rmse <- RMSE(mars_pred, test_data_split$PH)

# XGBoost
xgb_pred <- predict(xgb_model, dtest)
xgb_rmse <- RMSE(xgb_pred, test_labels)

# Linear Model
lm_pred <- predict(lm_model, test_data_split)
lm_eval <- postResample(lm_pred, test_data_split$PH)
lm_rmse <- lm_eval["RMSE"]

# Model comparison
model_comparison <- data.frame(
  Model = c("Decision Tree", "Pruned Tree", "Random Forest", "SVM", "MARS", "XGBoost", "Linear Model"),
  RMSE = c(tree_rmse, ptree_rmse, rf_rmse, svm_rmse, mars_rmse, xgb_rmse, lm_rmse)
)
knitr::kable(model_comparison, caption = "Model Performance Comparison")
Model Performance Comparison

Model            RMSE
Decision Tree    0.1285591
Pruned Tree      0.1285591
Random Forest    0.0952128
SVM              0.1177867
MARS             0.1201891
XGBoost          0.1138460
Linear Model     0.1302291

Results: Random Forest achieved the lowest RMSE, followed by XGBoost. Decision Tree and Linear Model performed poorly, likely due to their inability to capture complex relationships.

Feature Importance

XGBoost

The hyperparameter analysis of the tuned XGBoost model revealed a configuration aimed at capturing complex data patterns. The deep trees (max_depth = 8) and the high number of boosting rounds (nrounds = 1000) give the model the capacity to learn intricate relationships but also raise the risk of overfitting the training data. With gamma = 0, no minimum loss reduction is required to make a split, so trees can grow freely and may fit noise. (gamma is a split-level penalty; the L1 and L2 weight penalties in XGBoost are alpha and lambda.) The sampling rates for features (colsample_bytree = 0.8) and rows (subsample = 0.8) help mitigate overfitting, but might also limit the model’s exploration of the feature space. To improve generalization, we recommend adding regularization (e.g., gamma between 0.1 and 0.5), reducing max_depth to 6, and using a learning rate such as eta = 0.05. These values differ from the tuned configuration described here but match the training code in this report (gamma = 0.3, max_depth = 6, eta = 0.05, nrounds = 500), which appears to apply these recommendations already.
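
One way to explore the recommended ranges systematically is to enumerate candidate configurations in a grid and evaluate each one, for example with cross-validation. The sketch below only builds the grid, using base R. The specific candidate values (including max_depth = 4 and eta = 0.1) are illustrative assumptions, not the search that was actually run for this report.

```r
# Candidate values drawn from the ranges discussed above; the grid itself is
# an illustrative assumption, not the search the authors ran.
tune_grid <- expand.grid(
  max_depth        = c(4, 6, 8),
  gamma            = c(0, 0.1, 0.3, 0.5),
  eta              = c(0.05, 0.1),
  subsample        = 0.8,
  colsample_bytree = 0.8
)

# Each row is one configuration to evaluate
n_configs <- nrow(tune_grid)
print(n_configs)
print(head(tune_grid, 3))

# Convert one row into the list format used for the params argument
params_1 <- c(list(objective = "reg:squarederror"), as.list(tune_grid[1, ]))
str(params_1)
```

Each `params_i` list could then be passed to a cross-validated training routine, keeping the configuration with the lowest validation RMSE.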

importance_scores <- xgb.importance(feature_names = colnames(train_features), model = xgb_model)
vip(xgb_model, num_features = 15, geom = "point", horizontal = FALSE) +
  ggtitle("XGBoost Feature Importance") +
  theme_minimal()

Random Forest

importance_rf <- as.data.frame(importance(rf_model))
importance_rf$Feature <- rownames(importance_rf)
importance_rf <- importance_rf[order(-importance_rf$IncNodePurity), ]
ggplot(importance_rf[1:15, ], aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Random Forest Feature Importance", x = "", y = "") +
  theme_minimal()

Key Predictors: Mnf Flow, Carb Pressure, and Brand Code were consistently important across both models.

Residual Diagnostics

We analyzed XGBoost residuals to assess model performance.

results <- data.frame(
  Actual = test_data_split$PH,
  Predicted = xgb_pred,
  Residuals = test_data_split$PH - xgb_pred
)

# Predicted vs Actual
ggplot(results, aes(x = Predicted, y = Actual)) +
  geom_point(alpha = 0.5) +
  geom_abline(color = "red") +
  ggtitle("Predicted vs Actual pH Values") +
  theme_minimal()

# Residual Histogram
ggplot(results, aes(x = Residuals)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  ggtitle("Residual Distribution") +
  theme_minimal()

Findings:

  • Predicted vs. actual plot shows good alignment along the 45-degree line.
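  • Residuals are slightly skewed and cluster around -0.2. Because residuals are computed as actual minus predicted, negative values mean the prediction exceeds the observed pH, which indicates a minor overprediction bias.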
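
The direction of a bias depends on the sign convention for residuals. The results data frame above uses actual minus predicted, and the following base R sketch shows how the sign of the mean residual maps to over- or underprediction under that convention. All values are hypothetical.

```r
# Hypothetical values chosen so that predictions sit above the observations
actual    <- c(8.30, 8.40, 8.50, 8.60)
predicted <- c(8.50, 8.55, 8.70, 8.85)

# Same convention as the results data frame: actual minus predicted
residuals <- actual - predicted

mean_residual <- mean(residuals)
print(mean_residual)

# Negative mean residual: predictions are too high on average (overprediction)
# Positive mean residual: predictions are too low on average (underprediction)
bias_direction <- if (mean_residual < 0) {
  "overprediction"
} else if (mean_residual > 0) {
  "underprediction"
} else {
  "none"
}
print(bias_direction)
```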

Analysis

Ensemble methods, specifically Random Forest and XGBoost, were the top performers in this analysis, surpassing the simpler models. This outcome aligns with the strengths of ensemble techniques, which combine predictions from many base learners to reduce variance and generalize better to unseen data. Random Forest achieved the lowest validation RMSE (0.0952) by averaging many decorrelated trees. XGBoost (0.1138) uses gradient boosting, a sequential process in which each tree focuses on correcting the errors of the preceding trees, which lets it capture intricate relationships within the dataset.

Among the remaining models, the Support Vector Machine (SVM) with a radial basis function kernel (RMSE 0.1178) slightly outperformed Multivariate Adaptive Regression Splines (MARS, RMSE 0.1202). This suggests the data contain non-linear patterns that the kernel captures somewhat better than the piecewise linear approximations of MARS, although the gap is small.

The Decision Tree, its pruned counterpart, and the Linear Model had the highest Root Mean Squared Error (RMSE), indicating the poorest performance among the models evaluated. Single decision trees are prone to high variance, and a linear model cannot represent the non-linear relationships seen in the scatter plots. The pruned and unpruned trees have identical RMSE (0.1285591), which indicates that pruning removed no splits. The tree was grown with cp = 0.01 and pruned at cp = 0.011, so only splits whose complexity value falls between those two thresholds would be removed, and evidently there were none. A more aggressive or cross-validated choice of cp would be needed for pruning to have any effect.
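
A common way to choose the pruning threshold is to grow a large tree and then prune at the cp value with the lowest cross-validated error in rpart's `cptable`. The sketch below illustrates the idea on the built-in `mtcars` data. It is a hedged example of the technique, not the procedure used in this report, and the control values are arbitrary.

```r
library(rpart)  # recommended package included with standard R installations

set.seed(123)
# Grow a deliberately large tree on stand-in data (not the report's data)
big_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(minsplit = 5, cp = 0.001))

# cptable holds the cross-validated error (xerror) for each candidate cp
cp_table <- big_tree$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the cp with the lowest cross-validated error
pruned <- prune(big_tree, cp = best_cp)

# Count terminal nodes before and after pruning
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
print(c(full = n_leaves(big_tree), pruned = n_leaves(pruned)))
```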

Predictions

We generated pH predictions for the test dataset using XGBoost.

The preprocessing checks below verify that no missing values remain, that feature names align with the training matrix, and that factor levels match the training data. These checks prevent common errors in prediction pipelines.

# Ensure consistent feature engineering
working_data_test <- working_data_test %>%
  mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
         Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))

# Remove PHTRUE or PH if present (redundant check)
working_data_test <- working_data_test %>% select(-any_of(c("PHTRUE", "PH")))

# Verify no missing values
if (anyNA(working_data_test)) {
  stop("Missing values detected in working_data_test")
}

# Verify dimensions
print("Dimensions of working_data_test:")
[1] "Dimensions of working_data_test:"
print(dim(working_data_test))  # Should be [261, 34] (32 predictors + 2 engineered)
[1] 261  34
# Ensure Brand.Code is a factor with correct levels
working_data_test$Brand.Code <- factor(working_data_test$Brand.Code, levels = train_levels)

# Create test feature matrix
test_features_final <- tryCatch({
  model.matrix(~ . -1, data = working_data_test)
}, error = function(e) {
  stop("model.matrix failed: ", e$message)
})

# Diagnostic: Check dimensions
print("Dimensions of test_features_final before alignment:")
[1] "Dimensions of test_features_final before alignment:"
print(dim(test_features_final))  # Should be [261, 37]
[1] 261  37
# Align test features with training features
train_feature_names <- colnames(train_features)
test_feature_names <- colnames(test_features_final)

# Check feature alignment
print("Training features:")
[1] "Training features:"
print(train_feature_names)
 [1] "Brand.CodeA"            "Brand.CodeB"            "Brand.CodeC"           
 [4] "Brand.CodeD"            "Carb.Volume"            "Fill.Ounces"           
 [7] "PC.Volume"              "Carb.Pressure"          "Carb.Temp"             
[10] "PSC"                    "PSC.Fill"               "PSC.CO2"               
[13] "Mnf.Flow"               "Carb.Pressure1"         "Fill.Pressure"         
[16] "Hyd.Pressure1"          "Hyd.Pressure2"          "Hyd.Pressure3"         
[19] "Hyd.Pressure4"          "Filler.Level"           "Filler.Speed"          
[22] "Temperature"            "Usage.cont"             "Carb.Flow"             
[25] "Density"                "MFR"                    "Balling"               
[28] "Pressure.Vacuum"        "Oxygen.Filler"          "Bowl.Setpoint"         
[31] "Pressure.Setpoint"      "Air.Pressurer"          "Alch.Rel"              
[34] "Carb.Rel"               "Balling.Lvl"            "Mnf.Flow_Carb.Pressure"
[37] "Mnf.Flow_Brand.Code"   
print("Test features:")
[1] "Test features:"
print(test_feature_names)
 [1] "Brand.CodeA"            "Brand.CodeB"            "Brand.CodeC"           
 [4] "Brand.CodeD"            "Carb.Volume"            "Fill.Ounces"           
 [7] "PC.Volume"              "Carb.Pressure"          "Carb.Temp"             
[10] "PSC"                    "PSC.Fill"               "PSC.CO2"               
[13] "Mnf.Flow"               "Carb.Pressure1"         "Fill.Pressure"         
[16] "Hyd.Pressure1"          "Hyd.Pressure2"          "Hyd.Pressure3"         
[19] "Hyd.Pressure4"          "Filler.Level"           "Filler.Speed"          
[22] "Temperature"            "Usage.cont"             "Carb.Flow"             
[25] "Density"                "MFR"                    "Balling"               
[28] "Pressure.Vacuum"        "Oxygen.Filler"          "Bowl.Setpoint"         
[31] "Pressure.Setpoint"      "Air.Pressurer"          "Alch.Rel"              
[34] "Carb.Rel"               "Balling.Lvl"            "Mnf.Flow_Carb.Pressure"
[37] "Mnf.Flow_Brand.Code"   
print("Feature names identical:")
[1] "Feature names identical:"
print(identical(train_feature_names, test_feature_names))
[1] TRUE
# Check for missing or extra features
missing_features <- setdiff(train_feature_names, test_feature_names)
extra_features <- setdiff(test_feature_names, train_feature_names)
if (length(missing_features) > 0) {
  stop("Missing features in test data: ", paste(missing_features, collapse = ", "))
}
if (length(extra_features) > 0) {
  warning("Extra features in test data removed: ", paste(extra_features, collapse = ", "))
}

# Subset and reorder test features to match training features
test_features_final <- test_features_final[, train_feature_names, drop = FALSE]

# Verify dimensions
if (nrow(test_features_final) != nrow(working_data_test) || ncol(test_features_final) != length(train_feature_names)) {
  stop("Dimension mismatch in test_features_final: expected ", nrow(working_data_test), " rows, ", length(train_feature_names), " columns, got ", nrow(test_features_final), " rows, ", ncol(test_features_final), " columns")
}

# Create DMatrix
dtest_final <- xgb.DMatrix(data = test_features_final)

# Generate predictions with error handling
xgb_final_pred <- tryCatch({
  predict(xgb_model, dtest_final)
}, error = function(e) {
  stop("Prediction failed: ", e$message)
})

# Verify predictions
if (length(xgb_final_pred) != nrow(working_data_test)) {
  stop("Prediction length mismatch: expected ", nrow(working_data_test), ", got ", length(xgb_final_pred))
}

# Save to Excel
output <- working_data_test
output$PH_Predicted <- xgb_final_pred
write_xlsx(output, "PH_Predictions.xlsx")
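
The factor-level step in the pipeline above matters because `model.matrix()` creates one column per factor level. If the scoring data happens to lack a brand, the one-hot encoding has fewer columns than the training matrix unless the training levels are applied first. A minimal base R sketch with hypothetical data (here `train_levels` is simply taken from a toy training frame, which is an assumption about how it is defined in the report):

```r
# Hypothetical training and scoring frames; the scoring data lacks brand "D"
train_df <- data.frame(Brand.Code = factor(c("A", "B", "C", "D")),
                       Mnf.Flow   = c(10, 20, 30, 40))
score_df <- data.frame(Brand.Code = c("A", "B", "C"),
                       Mnf.Flow   = c(15, 25, 35))

train_levels <- levels(train_df$Brand.Code)
train_mat <- model.matrix(~ . - 1, data = train_df)

# Without shared levels, the one-hot encoding has fewer columns
naive_mat <- model.matrix(~ . - 1,
                          data = transform(score_df, Brand.Code = factor(Brand.Code)))

# With the training levels applied, columns match in name and order
score_df$Brand.Code <- factor(score_df$Brand.Code, levels = train_levels)
aligned_mat <- model.matrix(~ . - 1, data = score_df)

print(colnames(naive_mat))
print(identical(colnames(train_mat), colnames(aligned_mat)))
```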

Conclusion

In conclusion, the evaluation of various models for predicting pH showed that ensemble methods outperformed simpler models such as Decision Trees, Pruned Trees, and a Linear Model. Random Forest had the lowest validation RMSE and XGBoost the second lowest; XGBoost was selected for the final predictions. This highlights the effectiveness of ensemble methods in capturing complex relationships within the beverage manufacturing process data. Analysis of the XGBoost model’s predictions against actual pH values and the distribution of its residuals indicated generally good accuracy, although a minor systematic bias was noted. The hyperparameter analysis pointed to a complex model configuration that, while achieving strong results, warrants regularization and possible simplification to mitigate the risk of overfitting. Addressing the data quality issues identified during the initial exploration, such as skewed features and outliers, should further improve the robustness and reliability of the XGBoost model for predicting pH.

Key Factors Influencing pH

Mnf Flow, Carb Pressure, and Brand Code are the primary drivers of pH, with Balling Lvl, Oxygen Filler, and Density as secondary factors. Their influence stems from their roles in mixing, carbonation, and recipe formulation, respectively. Skewness and outliers in Mnf Flow and Oxygen Filler require preprocessing to ensure accurate predictions.

Best Model Predicts pH

Random Forest is the most accurate model on the validation split (RMSE = 0.0952128), but XGBoost’s competitive performance (RMSE = 0.1138460) and practical advantages make it the preferred choice for production. XGBoost predicts pH accurately by modeling non-linear relationships, tolerating multicollinearity among predictors, and leveraging the engineered interaction features. It is also robust to skewed predictors and outliers, although addressing the skewness and data quality issues noted above should further improve performance.