Introduction
Data Overview
Data Cleaning and Preparation
Exploratory Data Analysis
Summary Statistics
Correlations
Relationships Among Key Predictors
Histograms
Scatter Plots
Modeling
Data Split
Feature Engineering
Model Training
Model Evaluation
Feature Importance
XGBoost
Random Forest
pH
Residual Diagnostics
Analysis
Predictions
Conclusion
Key Factors Influencing pH
Best Model Predicts pH
Introduction
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
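The datasets provided include StudentData.xlsx, the historical data with measured PH values, and StudentEvaluation.xlsx, the evaluation data in which PH is empty and must be predicted.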
The goal is to predict pH, a measure of beverage acidity, to support regulatory compliance and quality control.
# Load datasets (read_excel() comes from the readxl package)
library(readxl)
train_data <- read_excel("StudentData.xlsx")
test_data <- read_excel("StudentEvaluation.xlsx")
# Display head of datasets
head(train_data)
# A tibble: 6 × 33
`Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
<chr> <dbl> <dbl> <dbl> <dbl>
1 B 5.34 24.0 0.263 68.2
2 A 5.43 24.0 0.239 68.4
3 B 5.29 24.1 0.263 70.8
4 A 5.44 24.0 0.293 63
5 A 5.49 24.3 0.111 67.2
6 A 5.38 23.9 0.269 66.6
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
# `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
# `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
# `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
# `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
# `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
# `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
head(test_data)
# A tibble: 6 × 33
`Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
<chr> <dbl> <dbl> <dbl> <dbl>
1 D 5.48 24.0 0.27 65.4
2 A 5.39 24.0 0.227 63.2
3 B 5.29 23.9 0.303 66.4
4 B 5.27 23.9 0.186 64.8
5 B 5.41 24.2 0.16 69.4
6 B 5.29 24.1 0.212 73.4
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
# `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
# `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
# `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
# `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
# `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
# `Pressure Vacuum` <dbl>, PH <lgl>, `Oxygen Filler` <dbl>, …
We performed the following steps to prepare the data:
The data contained missing values that had to be handled before modeling. First, rows in the training set where the target variable PH was missing were removed, since they cannot contribute to model training. Next, rows with four or more missing values were dropped from both the training and evaluation datasets, to avoid introducing too much uncertainty into the imputation process. The remaining missing values were imputed with MICE (multivariate imputation by chained equations), with the method chosen by variable type: numeric variables were imputed using Predictive Mean Matching, and the single categorical variable, Brand Code, was imputed using a Classification and Regression Tree method. Ten imputations were generated for each dataset (m = 10). The result was fully imputed training and evaluation sets, enabling a complete and consistent modeling pipeline.
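To make the Predictive Mean Matching step more concrete, here is a minimal base-R sketch for a single numeric column. It is a simplified illustration, not the procedure mice runs internally: the toy data, the pmm_impute() helper, and the donor count of 5 are all assumptions made for this example, and the real algorithm additionally draws the regression coefficients at random and repeats over several iterations.

```r
# Toy predictive mean matching (PMM) for one numeric column.
# Simplified on purpose: no random draw of coefficients, single pass.
set.seed(100)

# Hypothetical data: y depends on x, and 20 of 200 y-values are missing
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 8.5 + 0.1 * toy$x + rnorm(n, sd = 0.05)
toy$y[sample(n, 20)] <- NA

pmm_impute <- function(data, target, predictor, donors = 5) {
  observed <- !is.na(data[[target]])
  # Fit a linear model on the rows where the target is observed
  fit <- lm(reformulate(predictor, response = target), data = data[observed, ])
  pred_all <- predict(fit, newdata = data)
  filled <- data[[target]]
  for (i in which(!observed)) {
    # Rank observed rows by how close their predicted value is to row i's
    distance <- abs(pred_all[observed] - pred_all[i])
    nearest <- order(distance)[seq_len(donors)]
    # Borrow the actual observed value of one randomly chosen donor
    donor <- nearest[sample.int(length(nearest), 1)]
    filled[i] <- data[[target]][observed][donor]
  }
  filled
}

toy$y_imputed <- pmm_impute(toy, target = "y", predictor = "x")
```

Because every imputed value is copied from an observed row, the filled-in values always stay within the range of real measurements, which is one reason PMM is a common default for numeric process data.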
Finally, the training data from StudentData.xlsx was split 80/20 into training and hold-out test sets. The evaluation dataset (StudentEvaluation.xlsx) contains no PH values, so a hold-out set with known PH was needed to compare the candidate models and select one. Predictors were also centered and scaled, within the modeling functions rather than as a separate step, to account for differing measurement units.
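The split and scaling described here can be sketched in base R as follows. This is an illustration under stated assumptions, not the report's own modeling code: the data frame is synthetic, the column names carb_volume and filler_speed are placeholders, and the scaling is done by hand to show that the constants are learned from the training rows only.

```r
set.seed(100)

# Hypothetical stand-in for the cleaned, imputed training data
n <- 500
full <- data.frame(
  carb_volume  = rnorm(n, mean = 5.37, sd = 0.1),
  filler_speed = rnorm(n, mean = 3658, sd = 700)
)
full$PH <- 8.5 + 0.2 * (full$carb_volume - 5.37) + rnorm(n, sd = 0.1)

# 80/20 split by row index
train_idx   <- sample(seq_len(n), size = floor(0.8 * n))
train_set   <- full[train_idx, ]
holdout_set <- full[-train_idx, ]

# Learn centering and scaling constants from the training rows only
predictors <- setdiff(names(full), "PH")
centers <- colMeans(train_set[predictors])
scales  <- apply(train_set[predictors], 2, sd)

# Apply the same constants to both sets so the hold-out data never
# influences the preprocessing
train_x   <- scale(train_set[predictors],   center = centers, scale = scales)
holdout_x <- scale(holdout_set[predictors], center = centers, scale = scales)
```

Doing the scaling inside the modeling step, as the report describes, serves the same purpose: the hold-out rows are transformed with training-set statistics rather than their own.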
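# Load packages: dplyr for filter() and %>%, mice for imputation
library(dplyr)
library(mice)
# Remove rows with missing pH
train_nonullPH <- train_data %>% filter(!is.na(PH))
# Drop rows with four or more missing values
missing_per_row <- rowSums(is.na(train_nonullPH))
train_cleaned <- train_nonullPH[missing_per_row < 4, ]
# Convert Brand Code to factor and record its levels for the evaluation set
train_cleaned$`Brand Code` <- as.factor(train_cleaned$`Brand Code`)
train_levels <- levels(train_cleaned$`Brand Code`)
# Impute missing values: mice defaults (pmm for numeric columns), cart for Brand Code
init <- mice(train_cleaned, maxit = 0)
meth <- init$method
meth["Brand Code"] <- "cart"
imputed_data <- mice(train_cleaned, method = meth, m = 10, seed = 100)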
iter imp variable
1 1 Brand Code Carb Volume Fill Ounces PC Volume Carb Pressure Carb Temp PSC PSC Fill PSC CO2 Carb Pressure1 Fill Pressure Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Level Filler Speed Temperature Usage cont Carb Flow MFR Oxygen Filler Bowl Setpoint Pressure Setpoint Alch Rel Carb Rel
[log condensed: the same variable list is repeated for every combination of iterations 1 to 5 and imputations 1 to 10]
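# Drop rows with four or more missing values
# Note: PH is empty in every evaluation row, so it counts toward this total
missing_per_row_test <- rowSums(is.na(test_data))
test_cleaned <- test_data[missing_per_row_test < 4, ]
# Drop the empty PH column so it is not treated as a variable to impute
test_cleaned <- test_cleaned %>% select(-any_of("PH"))
# Convert Brand Code to factor using the training levels
test_cleaned$`Brand Code` <- factor(test_cleaned$`Brand Code`, levels = train_levels)
# Impute missing values: mice defaults (pmm for numeric columns), cart for Brand Code
init_test <- mice(test_cleaned, maxit = 0)
meth_test <- init_test$method
meth_test["Brand Code"] <- "cart"
# Defensive check: PH was already dropped above, so this leaves meth_test unchanged
meth_test <- meth_test[names(meth_test) != "PH"]
imputed_data_test <- mice(test_cleaned, method = meth_test, m = 10, seed = 100)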
iter imp variable
1 1 Brand Code Carb Volume Fill Ounces PC Volume Carb Temp PSC PSC Fill PSC CO2 Carb Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Speed Usage cont MFR Pressure Setpoint Alch Rel
[log condensed: the same variable list is repeated for every combination of iterations 1 to 5 and imputations 1 to 10]
# Extract the first of the ten completed datasets
working_data_test <- complete(imputed_data_test, 1)
# Verify no missing values
if (anyNA(working_data_test)) {
  stop("Imputation failed: missing values remain in working_data_test")
}
# Verify dimensions (expect 261 rows, 32 columns)
print("Dimensions of working_data_test after imputation:")
print(dim(working_data_test))
[1] "Dimensions of working_data_test after imputation:"
[1] 261 32
[1] "Brand.Code levels in working_data_test:"
[1] "A" "B" "C" "D"
# Verify no PH column
if ("PH" %in% colnames(working_data_test)) {
  stop("PH column found in working_data_test")
}
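One detail of the evaluation-set preparation is worth illustrating: building the Brand Code factor with the training levels turns any value that is not among those levels into NA, so an unseen code would be imputed along with genuinely blank entries. The following is a minimal sketch with made-up values (the code "E" is hypothetical and does not appear in the report's data):

```r
# Training side: the observed brand codes define the allowed levels
train_brand  <- factor(c("A", "B", "C", "D", "B"))
train_levels <- levels(train_brand)

# Hypothetical evaluation values: one blank entry and one unseen code "E"
test_brand_raw <- c("B", NA, "E", "D")
test_brand <- factor(test_brand_raw, levels = train_levels)

levels(test_brand)        # identical to the training levels
is.na(test_brand)         # TRUE for the blank entry and for the unseen "E"

# Count values that were present in the raw data but lost to level matching
n_unseen <- sum(!is.na(test_brand_raw) & is.na(test_brand))
n_unseen
```

Checking a count like n_unseen before imputation would show whether any evaluation rows carry a brand code the model never saw during training.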
We conducted exploratory analysis to understand the data and the relationships among variables, using summary statistics, correlation matrices, and a series of pairwise plots built with GGally. The correlation matrix revealed several strong linear relationships. For instance, Carb Temp and Carb Pressure were highly correlated with each other, as were Density and Balling. This indicates potential multicollinearity within the dataset, which can destabilize coefficient estimates in linear models. No corrective steps were taken, because the intended modeling techniques, Random Forest and XGBoost, are tree-based and their predictions are largely unaffected by correlated predictors, although importance scores can be shared among correlated variables. Lastly, some variables showed skewed or non-normal distributions; this was not considered problematic, because tree-based models do not assume normally distributed predictors.
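The correlation screening described above can be sketched in base R as follows. This is an illustrative example on synthetic data, not the report's own analysis code: the toy columns are constructed so that density and balling are strongly related, and the 0.8 cutoff is an assumed threshold chosen for the example.

```r
set.seed(100)
n <- 300

# Hypothetical predictors: balling is built to track density closely
density <- rnorm(n, mean = 1.2, sd = 0.4)
toy <- data.frame(
  density     = density,
  balling     = 2 * density + rnorm(n, sd = 0.1),
  temperature = rnorm(n, mean = 66, sd = 1.4)
)

cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff
cutoff <- 0.8
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

Listing the flagged pairs in a small table like high_pairs makes it easier to document which predictors overlap, even when the chosen models tolerate the overlap.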
Brand Code Carb Volume Fill Ounces PC Volume Carb Pressure
A: 296 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
B:1292 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933 1st Qu.:65.60
C: 347 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
D: 612 Mean :5.371 Mean :23.97 Mean :0.27761 Mean :68.22
3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
Carb Temp PSC PSC Fill PSC CO2
Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
1st Qu.:138.4 1st Qu.:0.05000 1st Qu.:0.1000 1st Qu.:0.02000
Median :140.8 Median :0.07800 Median :0.1800 Median :0.04000
Mean :141.1 Mean :0.08504 Mean :0.1961 Mean :0.05647
3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure1
Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
1st Qu.:-100.00 1st Qu.:118.8 1st Qu.:46.00 1st Qu.: 0.00
Median : 84.20 Median :123.0 Median :46.40 Median :11.60
Mean : 25.22 Mean :122.5 Mean :47.93 Mean :12.56
3rd Qu.: 140.90 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.40
Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Level
Min. : 0.00 Min. :-1.20 Min. : 62.00 Min. : 55.8
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 97.7
Median :28.80 Median :27.80 Median : 96.00 Median :118.4
Mean :21.12 Mean :20.58 Mean : 96.41 Mean :109.2
3rd Qu.:34.80 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
Filler Speed Temperature Usage cont Carb Flow Density
Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.340
1st Qu.:3866 1st Qu.:65.20 1st Qu.:18.38 1st Qu.:1174 1st Qu.:0.900
Median :3980 Median :65.60 Median :21.82 Median :3030 Median :0.980
Mean :3658 Mean :65.96 Mean :21.00 Mean :2481 Mean :1.175
3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.76 3rd Qu.:3188 3rd Qu.:1.620
Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
MFR Balling Pressure Vacuum PH
Min. : 31.4 Min. :0.346 Min. :-6.600 Min. :7.880
1st Qu.:696.3 1st Qu.:1.496 1st Qu.:-5.600 1st Qu.:8.440
Median :721.8 Median :1.648 Median :-5.400 Median :8.540
Mean :673.3 Mean :2.202 Mean :-5.216 Mean :8.545
3rd Qu.:730.5 3rd Qu.:3.292 3rd Qu.:-5.000 3rd Qu.:8.680
Max. :868.6 Max. :4.012 Max. :-3.600 Max. :8.940
Oxygen Filler Bowl Setpoint Pressure Setpoint Air Pressurer
Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
Median :0.03340 Median :120.0 Median :46.00 Median :142.6
Mean :0.04639 Mean :109.3 Mean :47.62 Mean :142.8
3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
Alch Rel Carb Rel Balling Lvl
Min. :6.320 Min. :4.960 Min. :0.000
1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.380
Median :6.560 Median :5.400 Median :1.480
Mean :6.898 Mean :5.436 Mean :2.053
3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.140
Max. :8.620 Max. :6.060 Max. :3.660
We analyzed correlations among the numeric variables to assess multicollinearity. The goal is to determine whether the predictors are largely uncorrelated (absolute pairwise correlations below 0.75) or whether there are strong correlations that may affect model interpretation.
Summary: The predictors are not largely uncorrelated. Twenty predictor pairs have correlations above 0.75, indicating multicollinearity. For example, Balling Lvl and Balling have a correlation of 0.98. XGBoost, our selected model, is robust to multicollinearity, but these relationships should be considered for process optimization.
# Select numeric predictors (exclude Brand Code)
numeric_data <- working_data %>% select(-c("Brand Code"))
# Compute correlation matrix
cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# Find high correlations (> 0.75)
cor_pairs <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)
high_cor <- data.frame(
Var1 = rownames(cor_matrix)[cor_pairs[, 1]],
Var2 = colnames(cor_matrix)[cor_pairs[, 2]],
Correlation = cor_matrix[cor_pairs]
) %>% arrange(desc(abs(Correlation)))
# Display high correlations
knitr::kable(high_cor, caption = "Predictor Pairs with Correlations > 0.75")
Var1 | Var2 | Correlation |
---|---|---|
Balling | Balling Lvl | 0.9798955 |
Density | Balling | 0.9548024 |
Filler Level | Bowl Setpoint | 0.9509117 |
Density | Balling Lvl | 0.9490347 |
Filler Speed | MFR | 0.9419364 |
Balling | Alch Rel | 0.9260832 |
Alch Rel | Balling Lvl | 0.9246620 |
Hyd Pressure2 | Hyd Pressure3 | 0.9244579 |
Density | Alch Rel | 0.9037947 |
Carb Rel | Balling Lvl | 0.8492018 |
Alch Rel | Carb Rel | 0.8480637 |
Density | Carb Rel | 0.8298294 |
Carb Pressure | Carb Temp | 0.8283693 |
Balling | Carb Rel | 0.8279466 |
Carb Volume | Carb Rel | 0.8019557 |
Carb Volume | Balling Lvl | 0.7867955 |
Carb Volume | Balling | 0.7863082 |
Carb Volume | Alch Rel | 0.7837527 |
Carb Volume | Density | 0.7667585 |
Mnf Flow | Hyd Pressure3 | 0.7585844 |
# Visualize correlation matrix
corrplot(cor_matrix, method = "number", tl.cex = 0.8, number.cex = 0.6,
diag = FALSE, title = "Correlation Matrix of Numeric Predictors")
Relationships Among Key Predictors
We visualized relationships among the top predictors (Mnf Flow, Carb Pressure, Brand Code) and pH to understand their impact on the target variable.
# key predictors and PH
key_vars <- working_data %>% select(`Mnf Flow`, `Carb Pressure`, `Brand Code`, PH)
# Plot relationships
print(ggpairs(key_vars, progress = FALSE,
mapping = aes(color = `Brand Code`),
title = "Pairwise Relationships of Mnf Flow, Carb Pressure, Brand Code, and pH"))
Distributions varied:
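The histograms reveal that Oxygen Filler is right-skewed and that Mnf Flow is far from symmetric: at least a quarter of its values sit at or near -100 (its first quartile is -100.0), well apart from the rest of the distribution. These features, together with outliers, likely contribute to the slight underprediction bias (residuals ~ -0.2) in the XGBoost model (RMSE of about 0.11). Other predictors, such as Carb Pressure (mean 68.22, median 68.20), appear roughly symmetric, which supports model stability, whereas Balling Lvl (mean 2.05, median 1.48) does not. Negative values in Mnf Flow and Hyd Pressure1 and outliers in Density and Oxygen Filler may indicate data quality issues that should be investigated before relying on the predictions.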
hist_data <- working_data %>% select(-`Brand Code`)
gather_features <- hist_data %>% gather(key = "features", value = "value")
ggplot(gather_features) +
geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) +
geom_density(aes(x = value), color = "green") +
facet_wrap(.~features, scales = "free", ncol = 4)
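The skewness described above can also be checked numerically rather than only by eye. The following is a minimal sketch in base R; `sample_skewness` is a hypothetical helper (not part of the original analysis) and the two vectors are illustrative values, not plant data.

```r
# Sample skewness (third standardized moment).
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Illustrative values only, not taken from the plant data
right_tailed <- c(1, 1, 2, 2, 3, 3, 4, 20)
symmetric <- c(-2, -1, 0, 1, 2)

print(sample_skewness(right_tailed))  # positive: long right tail
print(sample_skewness(symmetric))     # zero: symmetric around the mean
```

Applied to the report's data, something like `sapply(hist_data, sample_skewness)` would give one value per numeric column, which makes it easier to rank the variables by how far they depart from symmetry.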
Relationships between pH and the predictors showed linear, quadratic, or no clear patterns.
Next we plot PH against every other numeric feature, with the feature on the x-axis and PH on the y-axis. Many predictors contain outliers.
For several predictors the smoothed fit is close to a horizontal line (slope near 0, so PH stays roughly constant as the predictor changes), which indicates little or no linear association with PH.
Other predictors show a curved, approximately quadratic relationship with PH.
numeric_data <- working_data %>% select_if(is.numeric)
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
theme1$fontsize$text <- 7
trellis.par.set(theme1)
# Use every numeric column except the target (PH) as a predictor
caret::featurePlot(x = numeric_data[, setdiff(names(numeric_data), "PH")],
y = numeric_data$PH,
type = c("p", "smooth"),
span = 0.5)
We tested six models to predict pH:
- Decision Tree (unpruned and pruned)
- Random Forest
- Support Vector Machine (SVM)
- Multivariate Adaptive Regression Splines (MARS)
- XGBoost
- Linear Model
set.seed(123)
split_index <- createDataPartition(working_data$PH, p = 0.8, list = FALSE)
train_data_split <- working_data[split_index, ]
test_data_split <- working_data[-split_index, ]
# Clean column names
colnames(train_data_split) <- make.names(colnames(train_data_split))
colnames(test_data_split) <- make.names(colnames(test_data_split))
colnames(working_data_test) <- make.names(colnames(working_data_test))
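The renaming step explains why later code refers to `Mnf.Flow` and `Brand.Code` rather than the original spaced names. A small sketch of what `make.names` does to a few of the column names from this dataset:

```r
# Column names as they appear in the Excel file
original_names <- c("Brand Code", "Carb Volume", "Mnf Flow", "PH")

# make.names() replaces characters that are invalid in R names (here, spaces) with dots
clean_names <- make.names(original_names)
print(clean_names)
# "Brand.Code" "Carb.Volume" "Mnf.Flow" "PH"
```

Syntactic names avoid the need for backticks in model formulas such as `PH ~ .`, and some modeling functions may not handle non-syntactic names gracefully.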
Added interaction terms to capture complex relationships and improve model performance.
train_data_split <- train_data_split %>%
mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
test_data_split <- test_data_split %>%
mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
working_data_test <- working_data_test %>%
mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
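One point worth keeping in mind about the `Mnf.Flow_Brand.Code` term: `as.numeric()` on a factor returns the integer level codes, not a measured quantity. The sketch below uses illustrative values (not plant data) and assumes the levels are ordered A to D, as printed earlier in the report.

```r
# A factor with the same four brand levels used in the report
brand <- factor(c("B", "A", "D", "C"), levels = c("A", "B", "C", "D"))

# as.numeric() returns the position of each level: A = 1, B = 2, C = 3, D = 4
brand_codes <- as.numeric(brand)
print(brand_codes)
# 2 1 4 3

# Illustrative flow values; the product scales flow by the arbitrary level code
mnf_flow <- c(100, 100, 100, 100)
interaction_term <- mnf_flow * brand_codes
print(interaction_term)
# 200 100 400 300
```

The interaction therefore treats brand as if D were four times A. Tree-based models can still split on such a term, but the encoding is an ordering choice rather than a physical property of the brands.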
Decision Tree
Removed highly correlated predictors (>0.8) to simplify the tree.
tree_train <- train_data_split %>% select(-c(Balling, Filler.Level, Hyd.Pressure2, Density, Filler.Speed, Carb.Rel, Carb.Temp))
tree_model <- rpart(PH ~ ., data = tree_train, method = "anova", control = rpart.control(minsplit = 10, cp = 0.01))
pruned_tree <- prune(tree_model, cp = 0.011)
rpart.plot(pruned_tree, box.palette = "auto", nn = TRUE)
Random Forest
set.seed(123)
rf_model <- randomForest(PH ~ ., data = train_data_split, ntree = 500, importance = TRUE)
Support Vector Machine
set.seed(123)
svm_model <- svm(PH ~ ., data = train_data_split, type = "eps-regression", kernel = "radial", scale = TRUE)
XGBoost
train_features <- model.matrix(PH ~ . -1, data = train_data_split)
train_labels <- train_data_split$PH
test_features <- model.matrix(PH ~ . -1, data = test_data_split)
test_labels <- test_data_split$PH
dtrain <- xgb.DMatrix(data = train_features, label = train_labels)
dtest <- xgb.DMatrix(data = test_features, label = test_labels)
params <- list(
objective = "reg:squarederror",
eta = 0.05,
max_depth = 6,
gamma = 0.3,
subsample = 0.8,
colsample_bytree = 0.8
)
set.seed(123)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 500, watchlist = list(test = dtest), verbose = 0)
Linear Model
ctrl <- trainControl(method = "cv", number = 10)
lm_model <- train(PH ~ ., data = train_data_split, method = "lm", trControl = ctrl)
Model Evaluation
We evaluated models using Root Mean Squared Error (RMSE) on the validation set.
Since the primary objective is to understand the manufacturing
process, we begin with decision trees. As stated:
“.. that new regulations are requiring us to understand our
manufacturing process, the predictive factors and be able to report
to them our predictive model of pH.”
The Root Mean Squared Error (RMSE) summarizes the prediction error of a model on a given set of observations. It is the square root of the average squared difference between the predicted and actual values, so it is expressed in the same units as pH and penalizes large errors more heavily than small ones. A lower RMSE means the predictions are closer to the actual values, indicating a better fit. All models below are compared by RMSE on the held-out 20% of the training data.
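As a worked illustration of the metric, here is a minimal base R version of RMSE applied to three made-up pH readings. The helper name `rmse` and the values are hypothetical; the report itself uses `caret::RMSE`, which computes the same quantity.

```r
# RMSE: square root of the mean squared difference between predictions and observations
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Illustrative pH values, not taken from the dataset
actual <- c(8.5, 8.6, 8.4)
predicted <- c(8.4, 8.6, 8.6)

# Errors are -0.1, 0.0, and 0.2; squared errors are 0.01, 0.00, and 0.04
example_rmse <- rmse(predicted, actual)
print(example_rmse)
# about 0.129
```

Note how the single 0.2 error dominates the result: squaring gives it four times the weight of the 0.1 error.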
# Decision Tree
tree_pred <- predict(tree_model, test_data_split %>% select(colnames(tree_train)))
tree_rmse <- RMSE(tree_pred, test_data_split$PH)
# Pruned Tree
ptree_pred <- predict(pruned_tree, test_data_split %>% select(colnames(tree_train)))
ptree_rmse <- RMSE(ptree_pred, test_data_split$PH)
# Random Forest
rf_pred <- predict(rf_model, test_data_split)
rf_rmse <- RMSE(rf_pred, test_data_split$PH)
# SVM
svm_pred <- predict(svm_model, test_data_split)
svm_rmse <- RMSE(svm_pred, test_data_split$PH)
# MARS
mars_pred <- predict(mars_model, test_data_split)
mars_rmse <- RMSE(mars_pred, test_data_split$PH)
# XGBoost
xgb_pred <- predict(xgb_model, dtest)
xgb_rmse <- RMSE(xgb_pred, test_labels)
# Linear Model
lm_pred <- predict(lm_model, test_data_split)
lm_eval <- postResample(lm_pred, test_data_split$PH)
lm_rmse <- lm_eval["RMSE"]
# Model comparison
model_comparison <- data.frame(
Model = c("Decision Tree", "Pruned Tree", "Random Forest", "SVM", "MARS", "XGBoost", "Linear Model"),
RMSE = c(tree_rmse, ptree_rmse, rf_rmse, svm_rmse, mars_rmse, xgb_rmse, lm_rmse)
)
knitr::kable(model_comparison, caption = "Model Performance Comparison")
Model | RMSE |
---|---|
Decision Tree | 0.1285591 |
Pruned Tree | 0.1285591 |
Random Forest | 0.0952128 |
SVM | 0.1177867 |
MARS | 0.1201891 |
XGBoost | 0.1138460 |
Linear Model | 0.1302291 |
Results: Random Forest achieved the lowest RMSE, followed by XGBoost. Decision Tree and Linear Model performed poorly, likely due to their inability to capture complex relationships.
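To make the ranking explicit, the reported RMSE values can be sorted and expressed relative to the best model. This is a small sketch that re-enters the values from the comparison table above; the column name `Pct_Above_Best` is introduced here for illustration.

```r
# RMSE values copied from the model comparison table
model_rmse <- data.frame(
  Model = c("Decision Tree", "Pruned Tree", "Random Forest", "SVM",
            "MARS", "XGBoost", "Linear Model"),
  RMSE = c(0.1285591, 0.1285591, 0.0952128, 0.1177867,
           0.1201891, 0.1138460, 0.1302291)
)

# Sort from best (lowest RMSE) to worst
ranked <- model_rmse[order(model_rmse$RMSE), ]

# Percentage by which each model's RMSE exceeds the best model's RMSE
ranked$Pct_Above_Best <- round(100 * (ranked$RMSE / min(ranked$RMSE) - 1), 1)
print(ranked, row.names = FALSE)
```

Sorting this way puts Random Forest first and XGBoost second, which is the ordering stated in the results.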
Feature Importance
XGBoost
The hyperparameter analysis of the tuned XGBoost configuration points to a setup aimed at capturing complex data patterns. Deep trees (max_depth=8) and a high number of boosting rounds (nrounds=1000) give the model the capacity to learn intricate relationships but also raise the risk of overfitting the training data. With no minimum loss reduction required for a split (gamma=0), tree growth is unconstrained, which can worsen overfitting by fitting noise in the data. Note that gamma is XGBoost's split-penalty parameter; L1 and L2 regularization are controlled separately by alpha and lambda. Sampling 80% of the features (colsample_bytree=0.8) and 80% of the rows (subsample=0.8) for each tree helps mitigate overfitting, though it also limits how much of the data each tree sees. To improve the model's generalization ability, we recommend adding regularization (e.g., gamma values between 0.1 and 0.5), reducing the tree depth (max_depth of 6), and potentially increasing the learning rate (e.g., eta=0.05) to converge with fewer trees. The parameters in the training code shown in this report (eta = 0.05, max_depth = 6, gamma = 0.3, nrounds = 500) follow these recommendations.
importance_scores <- xgb.importance(feature_names = colnames(train_features), model = xgb_model)
vip(xgb_model, num_features = 15, geom = "point", horizontal = FALSE) +
ggtitle("XGBoost Feature Importance") +
theme_minimal()
Random Forest
importance_rf <- as.data.frame(importance(rf_model))
importance_rf$Feature <- rownames(importance_rf)
importance_rf <- importance_rf[order(-importance_rf$IncNodePurity), ]
ggplot(importance_rf[1:15, ], aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Random Forest Feature Importance", x = "", y = "") +
theme_minimal()
Key Predictors: Mnf Flow, Carb Pressure, and Brand Code were consistently important across both models.
Residual Diagnostics
We analyzed XGBoost residuals to assess model performance.
results <- data.frame(
Actual = test_data_split$PH,
Predicted = xgb_pred,
Residuals = test_data_split$PH - xgb_pred
)
# Predicted vs Actual
ggplot(results, aes(x = Predicted, y = Actual)) +
geom_point(alpha = 0.5) +
geom_abline(color = "red") +
ggtitle("Predicted vs Actual pH Values") +
theme_minimal()
# Residual Histogram
ggplot(results, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "steelblue") +
ggtitle("Residual Distribution") +
theme_minimal()
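When reading the residual plots, the sign convention matters. The report defines residuals as Actual minus Predicted, so a negative residual means the prediction was higher than the observed pH. The sketch below uses illustrative numbers (not model output) to show how the mean residual and the share of negative residuals summarize bias.

```r
# Illustrative values only, not taken from the model
actual <- c(8.50, 8.60, 8.40, 8.70)
predicted <- c(8.55, 8.58, 8.52, 8.71)

# Same convention as the report: residual = actual - predicted
residuals <- actual - predicted

# A mean below zero indicates predictions that are too high on average
mean_residual <- mean(residuals)

# Fraction of observations where the prediction exceeded the observed value
share_overpredicted <- mean(residuals < 0)

print(mean_residual)
print(share_overpredicted)
```

Running the same two summaries on `results$Residuals` would give a quick numeric check of the direction and size of any bias seen in the histogram.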
Findings:
Ensemble methods, specifically Random Forest and XGBoost, emerged as the top performers in this analysis (RMSE of 0.095 and 0.114, respectively), surpassing the simpler models. This outcome aligns with the inherent strengths of ensemble techniques, which combine predictions from multiple base learners to reduce variance and generalize better to unseen data. Random Forest does this by averaging many trees grown on resampled data, while XGBoost uses gradient boosting, a sequential process in which each new tree focuses on correcting the errors of the preceding trees. On this validation split, Random Forest had the lower error of the two.
Among the remaining models, the Support Vector Machine (SVM), a kernel-based method, was slightly more accurate than Multivariate Adaptive Regression Splines (MARS) (RMSE of 0.118 versus 0.120). This suggests that the data may contain non-linear patterns that the radial basis function kernel captures somewhat better than the piecewise linear approximations inherent in MARS, although the difference is small.
Finally, the Decision Tree, its pruned counterpart, and the Linear Model had the highest RMSE (0.129, 0.129, and 0.130), indicating the poorest performance among the models evaluated. This underperformance is typical of individual decision trees, which are susceptible to high variance and overfitting, particularly with a substantial number of predictor variables. The pruned and unpruned trees have identical RMSE: the pruning threshold (cp = 0.011) is only slightly above the threshold used to grow the tree (cp = 0.01), so pruning probably removed few or no splits. The complexity parameter was not tuned, for example by cross-validation, which may also explain why pruning did not help.
Predictions
We generated pH predictions for the test dataset using XGBoost.
Before predicting, we check for missing values, align feature names with the training matrix, and match factor levels. These checks prevent common errors in prediction pipelines.
# Ensure consistent feature engineering
working_data_test <- working_data_test %>%
mutate(Mnf.Flow_Carb.Pressure = Mnf.Flow * Carb.Pressure,
Mnf.Flow_Brand.Code = Mnf.Flow * as.numeric(Brand.Code))
# Remove PHTRUE or PH if present (redundant check)
working_data_test <- working_data_test %>% select(-any_of(c("PHTRUE", "PH")))
# Verify no missing values
if (anyNA(working_data_test)) {
stop("Missing values detected in working_data_test")
}
# Verify dimensions
print("Dimensions of working_data_test:")
[1] "Dimensions of working_data_test:"
[1] 261 34
# Ensure Brand.Code is a factor with correct levels
working_data_test$Brand.Code <- factor(working_data_test$Brand.Code, levels = train_levels)
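A caveat about this step: `factor()` with an explicit `levels` argument silently converts any value outside those levels to `NA`. The sketch below assumes `train_levels` holds the four brand codes printed earlier; the value "E" is a hypothetical unseen brand used only for illustration.

```r
# Assumed training levels, matching the levels printed earlier in the report
train_levels <- c("A", "B", "C", "D")

# Hypothetical new data containing a brand that was not in the training set
new_brand <- c("A", "D", "E")

# Values not listed in `levels` become NA without any warning
aligned <- factor(new_brand, levels = train_levels)
print(aligned)

# Checking for NA after re-leveling catches unseen categories
has_unseen_brand <- anyNA(aligned)
print(has_unseen_brand)
# TRUE
```

Because the missing-value check in the pipeline runs before the re-leveling, repeating `anyNA()` on `Brand.Code` afterwards would be a cheap extra safeguard.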
# Create test feature matrix
test_features_final <- tryCatch({
model.matrix(~ . -1, data = working_data_test)
}, error = function(e) {
stop("model.matrix failed: ", e$message)
})
# Diagnostic: Check dimensions
print("Dimensions of test_features_final before alignment:")
[1] "Dimensions of test_features_final before alignment:"
[1] 261 37
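The jump from 34 data columns to 37 matrix columns comes from the one-hot expansion of `Brand.Code`: with the intercept removed (`- 1`), `model.matrix` creates one indicator column per level of the first factor, so one column becomes four. A toy illustration with made-up values:

```r
# Toy data frame with one factor and one numeric column (illustrative values)
toy <- data.frame(
  Brand.Code = factor(c("A", "B", "C"), levels = c("A", "B", "C")),
  Carb.Volume = c(5.3, 5.4, 5.5)
)

# "- 1" drops the intercept, so every level of the factor gets its own column
toy_matrix <- model.matrix(~ . - 1, data = toy)

print(colnames(toy_matrix))
# "Brand.CodeA" "Brand.CodeB" "Brand.CodeC" "Carb.Volume"
print(dim(toy_matrix))
# 3 4
```

Here two input columns become four matrix columns, the same pattern as the 34-to-37 change above.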
# Align test features with training features
train_feature_names <- colnames(train_features)
test_feature_names <- colnames(test_features_final)
# Check feature alignment
print("Training features:")
[1] "Training features:"
[1] "Brand.CodeA" "Brand.CodeB" "Brand.CodeC"
[4] "Brand.CodeD" "Carb.Volume" "Fill.Ounces"
[7] "PC.Volume" "Carb.Pressure" "Carb.Temp"
[10] "PSC" "PSC.Fill" "PSC.CO2"
[13] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
[16] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
[19] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
[22] "Temperature" "Usage.cont" "Carb.Flow"
[25] "Density" "MFR" "Balling"
[28] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
[31] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
[34] "Carb.Rel" "Balling.Lvl" "Mnf.Flow_Carb.Pressure"
[37] "Mnf.Flow_Brand.Code"
[1] "Test features:"
[1] "Brand.CodeA" "Brand.CodeB" "Brand.CodeC"
[4] "Brand.CodeD" "Carb.Volume" "Fill.Ounces"
[7] "PC.Volume" "Carb.Pressure" "Carb.Temp"
[10] "PSC" "PSC.Fill" "PSC.CO2"
[13] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
[16] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
[19] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
[22] "Temperature" "Usage.cont" "Carb.Flow"
[25] "Density" "MFR" "Balling"
[28] "Pressure.Vacuum" "Oxygen.Filler" "Bowl.Setpoint"
[31] "Pressure.Setpoint" "Air.Pressurer" "Alch.Rel"
[34] "Carb.Rel" "Balling.Lvl" "Mnf.Flow_Carb.Pressure"
[37] "Mnf.Flow_Brand.Code"
[1] "Feature names identical:"
[1] TRUE
# Check for missing or extra features
missing_features <- setdiff(train_feature_names, test_feature_names)
extra_features <- setdiff(test_feature_names, train_feature_names)
if (length(missing_features) > 0) {
stop("Missing features in test data: ", paste(missing_features, collapse = ", "))
}
if (length(extra_features) > 0) {
warning("Extra features in test data removed: ", paste(extra_features, collapse = ", "))
}
# Subset and reorder test features to match training features
test_features_final <- test_features_final[, train_feature_names, drop = FALSE]
# Verify dimensions
if (nrow(test_features_final) != nrow(working_data_test) || ncol(test_features_final) != length(train_feature_names)) {
stop("Dimension mismatch in test_features_final: expected ", nrow(working_data_test), " rows, ", length(train_feature_names), " columns, got ", nrow(test_features_final), " rows, ", ncol(test_features_final), " columns")
}
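The alignment logic above can be seen on a small example. This sketch uses a toy matrix with hypothetical column names, including an `Extra` column that the training data does not have, to show how `setdiff` detects mismatches and how indexing by name both drops extras and restores the training order.

```r
# Column order used at training time (illustrative subset)
train_cols <- c("Brand.CodeA", "Brand.CodeB", "Carb.Volume")

# Toy test matrix: same columns in a different order, plus one extra column
test_mat <- matrix(
  1:8, nrow = 2,
  dimnames = list(NULL, c("Carb.Volume", "Brand.CodeB", "Brand.CodeA", "Extra"))
)

# Columns the model expects but the test data lacks (none here)
missing_cols <- setdiff(train_cols, colnames(test_mat))

# Columns present only in the test data
extra_cols <- setdiff(colnames(test_mat), train_cols)

# Indexing by name drops the extra column and reorders to the training order
aligned_mat <- test_mat[, train_cols, drop = FALSE]

print(missing_cols)
print(extra_cols)
print(colnames(aligned_mat))
```

Using `drop = FALSE` keeps the result a matrix even if only one column or row remains, which is what the downstream `xgb.DMatrix` call expects.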
# Create DMatrix
dtest_final <- xgb.DMatrix(data = test_features_final)
# Generate predictions with error handling
xgb_final_pred <- tryCatch({
predict(xgb_model, dtest_final)
}, error = function(e) {
stop("Prediction failed: ", e$message)
})
# Verify predictions
if (length(xgb_final_pred) != nrow(working_data_test)) {
stop("Prediction length mismatch: expected ", nrow(working_data_test), ", got ", length(xgb_final_pred))
}
# Save to Excel
output <- working_data_test
output$PH_Predicted <- xgb_final_pred
write_xlsx(output, "PH_Predictions.xlsx")
Conclusion
In conclusion, the evaluation of various models for predicting pH revealed that XGBoost, an ensemble method, demonstrated superior performance compared to simpler models like Decision Trees, Pruned Trees, and a Linear Model. This highlights XGBoost’s effectiveness in capturing complex relationships within the beverage manufacturing process data. Analysis of the XGBoost model’s predictions against actual pH values and the distribution of its residuals indicated generally good accuracy, although a minor tendency for underprediction was noted in some cases. The hyperparameter tuning suggested a complex model configuration that, while achieving strong results, warrants consideration for regularization and potential simplification to mitigate the risk of overfitting. Addressing underlying data quality issues identified during the initial exploration, such as skewed features and outliers, will further contribute to the robustness and reliability of the XGBoost model for predicting pH.
Mnf Flow, Carb Pressure, and Brand Code are the primary drivers of pH, with Balling Lvl, Oxygen Filler, and Density as secondary factors. Their influence stems from their roles in mixing, carbonation, and recipe formulation, respectively. Skewness and outliers in Mnf Flow and Oxygen Filler require preprocessing to ensure accurate predictions.
The model comparison shows that Random Forest is the most accurate (RMSE = 0.0952), but XGBoost's competitive performance (RMSE = 0.1138) and practical advantages make it our preferred choice for production. We selected XGBoost because it models non-linear relationships, tolerates the high correlations among predictors, is robust to skewness and outliers, and makes use of the engineered interaction features. Combined with the preprocessing checks applied before prediction, this supports reliable predictions, although addressing skewness and data quality issues should improve performance further.