1. Introduction

Recent regulatory changes emphasize the importance of a comprehensive understanding of manufacturing processes and their impact on product quality. At ABC Beverage, pH is a critical parameter for ensuring consistency and maintaining product standards. This analysis aims to identify and quantify the factors driving pH variability while developing a predictive model that meets the new regulatory standards. By leveraging advanced analytical methods and predictive modeling techniques, it provides data-driven insights to ensure compliance and enhance understanding of the production process. The scope of the analysis includes data preprocessing, the identification of predictive factors, and the development and validation of a robust predictive model for pH levels. The focus is on exploring relationships within the manufacturing data to uncover insights that can drive better decision-making. This structured approach yields actionable insights into the key drivers of pH variability and a reliable framework for prediction.

Understanding and predicting pH levels is crucial not only for meeting regulatory standards but also for maintaining operational excellence. Accurate prediction will enable more efficient quality control, reducing variability and ensuring that ABC Beverage consistently delivers high-quality products. The results of this analysis will empower the company to address compliance needs while optimizing its manufacturing process, underscoring the importance of data-driven decision-making in the production environment.

2. Dataset Overview

The dataset comprises observations collected from a beverage production line, capturing information about carbonation levels, filling processes, environmental conditions, and quality control metrics. Each row represents a single production instance, and each column corresponds to a specific variable measured or controlled during the process.

# Create a dataframe in R
variables <- data.frame(
  Feature = c(
    "Brand Code", "Carb Volume", "Fill Ounces", "PC Volume", "Carb Pressure", 
    "Carb Temp", "PSC", "PSC Fill", "PSC CO2", "Mnf Flow", "Carb Pressure1", 
    "Fill Pressure", "Hyd Pressure1", "Hyd Pressure2", "Hyd Pressure3", "Hyd Pressure4",
    "Filler Level", "Filler Speed", "Temperature", "Usage cont", "Carb Flow", 
    "Density", "MFR", "Balling", "Pressure Vacuum", "PH", "Oxygen Filler", 
    "Bowl Setpoint", "Pressure Setpoint", "Air Pressure", "Alch Rel", "Carb Rel", 
    "Balling Lvl"
  ),
  Description = c(
    "Unique identifier for the product's brand.",
    "Volume of carbon dioxide dissolved in the product.",
    "Volume of liquid dispensed into each container.",
    "Process control volume for monitoring liquid levels.",
    "Pressure level during carbonation.",
    "Temperature during carbonation for CO2 solubility.",
    "Process Setpoint Control for maintaining parameter targets.",
    "Filling setpoint under controlled conditions.",
    "Setpoint for CO2 levels during carbonation.",
    "Manufacturing flow rate for liquid or gas.",
    "Secondary carbonation pressure reading.",
    "Pressure applied during filling operations.",
    "Hydraulic pressure reading 1 for machine operation.",
    "Hydraulic pressure reading 2 for machine operation.",
    "Hydraulic pressure reading 3 for machine operation.",
    "Hydraulic pressure reading 4 for machine operation.",
    "Measurement of product level in containers.",
    "Speed of the filling machine or process.",
    "Temperature of the process environment.",
    "Container usage or consumption metrics.",
    "Flow rate of CO2 during carbonation.",
    "Density of the product for consistency monitoring.",
    "Mass flow rate of the material through the system.",
    "Sugar concentration level measured by the Balling scale.",
    "Vacuum pressure in the system.",
    "Acidity level of the product.",
    "Oxygen levels in the filling process.",
    "Target setpoint for intermediate container levels.",
    "Desired pressure level in the process.",
    "Air pressure measurement in the system.",
    "Alcohol release or related parameter.",
    "Carbonation release or related measurement.",
    "Sugar concentration level in the final product."
  ),
  Type = c(
    "Categorical", "Numerical", "Numerical", "Numerical", "Numerical",
    "Numerical", "Numerical", "Numerical", "Numerical", "Numerical", 
    "Numerical", "Numerical", "Numerical", "Numerical", "Numerical", 
    "Numerical", "Numerical", "Numerical", "Numerical", "Numerical",
    "Numerical", "Numerical", "Numerical", "Numerical", "Numerical", 
    "Numerical", "Numerical", "Numerical", "Numerical", "Numerical", 
    "Numerical", "Numerical", "Numerical"
  ),
  stringsAsFactors = FALSE
)

variables |> kable(caption = "Beverage Manufacturing Process Features") |> kable_styling() |>  kable_classic()
Beverage Manufacturing Process Features
Feature Description Type
Brand Code Unique identifier for the product’s brand. Categorical
Carb Volume Volume of carbon dioxide dissolved in the product. Numerical
Fill Ounces Volume of liquid dispensed into each container. Numerical
PC Volume Process control volume for monitoring liquid levels. Numerical
Carb Pressure Pressure level during carbonation. Numerical
Carb Temp Temperature during carbonation for CO2 solubility. Numerical
PSC Process Setpoint Control for maintaining parameter targets. Numerical
PSC Fill Filling setpoint under controlled conditions. Numerical
PSC CO2 Setpoint for CO2 levels during carbonation. Numerical
Mnf Flow Manufacturing flow rate for liquid or gas. Numerical
Carb Pressure1 Secondary carbonation pressure reading. Numerical
Fill Pressure Pressure applied during filling operations. Numerical
Hyd Pressure1 Hydraulic pressure reading 1 for machine operation. Numerical
Hyd Pressure2 Hydraulic pressure reading 2 for machine operation. Numerical
Hyd Pressure3 Hydraulic pressure reading 3 for machine operation. Numerical
Hyd Pressure4 Hydraulic pressure reading 4 for machine operation. Numerical
Filler Level Measurement of product level in containers. Numerical
Filler Speed Speed of the filling machine or process. Numerical
Temperature Temperature of the process environment. Numerical
Usage cont Container usage or consumption metrics. Numerical
Carb Flow Flow rate of CO2 during carbonation. Numerical
Density Density of the product for consistency monitoring. Numerical
MFR Mass flow rate of the material through the system. Numerical
Balling Sugar concentration level measured by the Balling scale. Numerical
Pressure Vacuum Vacuum pressure in the system. Numerical
PH Acidity level of the product. Numerical
Oxygen Filler Oxygen levels in the filling process. Numerical
Bowl Setpoint Target setpoint for intermediate container levels. Numerical
Pressure Setpoint Desired pressure level in the process. Numerical
Air Pressurer Air pressure measurement in the system. Numerical
Alch Rel Alcohol release or related parameter. Numerical
Carb Rel Carbonation release or related measurement. Numerical
Balling Lvl Sugar concentration level in the final product. Numerical

The variables play a crucial role in capturing and monitoring various aspects of the production process, directly impacting product quality and operational efficiency:

  • Carbonation Variables: These variables (e.g., Carb Volume, Carb Pressure, and Carb Temp) ensure the beverage achieves the desired level of fizziness and retains CO2 effectively. They are critical for meeting product specifications and customer satisfaction.

  • Filling Variables: Variables like Fill Ounces, PC Volume, and Filler Speed ensure accurate and consistent product volume in containers, minimizing waste and maintaining packaging standards.

  • Quality Control Metrics: Metrics such as Density, Balling, and PSC monitor the chemical and physical properties of the beverage, including sugar concentration, density, and carbonation, which significantly influence the product’s pH balance. Maintaining the appropriate pH ensures flavor stability, microbial safety, and overall product quality.

  • Process Control Variables: Variables like PSC CO2 and PSC act as setpoints to maintain optimal operational conditions, reducing variability and enhancing production reliability.

3. Exploratory Data Analysis

beverage_manufacturing_data <- read_xlsx("StudentData.xlsx")
colnames(beverage_manufacturing_data) <- gsub(" ", "_", colnames(beverage_manufacturing_data))

skim(beverage_manufacturing_data) |> kable() |> kable_styling() |>  kable_classic() 
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character Brand_Code 120 0.9533256 1 1 0 4 0 NA NA NA NA NA NA NA NA
numeric Carb_Volume 10 0.9961105 NA NA NA NA NA 5.3701978 0.1063852 5.0400000 5.2933333 5.3466667 5.453333 5.700 ▁▆▇▅▁
numeric Fill_Ounces 38 0.9852198 NA NA NA NA NA 23.9747546 0.0875299 23.6333333 23.9200000 23.9733333 24.026667 24.320 ▁▂▇▂▁
numeric PC_Volume 39 0.9848308 NA NA NA NA NA 0.2771187 0.0606953 0.0793333 0.2391667 0.2713333 0.312000 0.478 ▁▃▇▂▁
numeric Carb_Pressure 27 0.9894982 NA NA NA NA NA 68.1895755 3.5382039 57.0000000 65.6000000 68.2000000 70.600000 79.400 ▁▅▇▃▁
numeric Carb_Temp 26 0.9898872 NA NA NA NA NA 141.0949234 4.0373861 128.6000000 138.4000000 140.8000000 143.800000 154.000 ▁▅▇▃▁
numeric PSC 33 0.9871645 NA NA NA NA NA 0.0845737 0.0492690 0.0020000 0.0480000 0.0760000 0.112000 0.270 ▆▇▃▁▁
numeric PSC_Fill 23 0.9910541 NA NA NA NA NA 0.1953689 0.1177817 0.0000000 0.1000000 0.1800000 0.260000 0.620 ▆▇▃▁▁
numeric PSC_CO2 39 0.9848308 NA NA NA NA NA 0.0564139 0.0430387 0.0000000 0.0200000 0.0400000 0.080000 0.240 ▇▅▂▁▁
numeric Mnf_Flow 2 0.9992221 NA NA NA NA NA 24.5689373 119.4811263 -100.2000000 -100.0000000 65.2000000 140.800000 229.400 ▇▁▁▇▂
numeric Carb_Pressure1 32 0.9875535 NA NA NA NA NA 122.5863726 4.7428819 105.6000000 119.0000000 123.2000000 125.400000 140.200 ▁▃▇▂▁
numeric Fill_Pressure 22 0.9914430 NA NA NA NA NA 47.9221656 3.1775457 34.6000000 46.0000000 46.4000000 50.000000 60.400 ▁▁▇▂▁
numeric Hyd_Pressure1 11 0.9957215 NA NA NA NA NA 12.4375781 12.4332538 -0.8000000 0.0000000 11.4000000 20.200000 58.000 ▇▅▂▁▁
numeric Hyd_Pressure2 15 0.9941657 NA NA NA NA NA 20.9610329 16.3863066 0.0000000 0.0000000 28.6000000 34.600000 59.400 ▇▂▇▅▁
numeric Hyd_Pressure3 15 0.9941657 NA NA NA NA NA 20.4584507 15.9757236 -1.2000000 0.0000000 27.6000000 33.400000 50.000 ▇▁▃▇▁
numeric Hyd_Pressure4 30 0.9883314 NA NA NA NA NA 96.2888627 13.1225594 52.0000000 86.0000000 96.0000000 102.000000 142.000 ▁▃▇▂▁
numeric Filler_Level 20 0.9922209 NA NA NA NA NA 109.2523716 15.6984241 55.8000000 98.3000000 118.4000000 120.000000 161.200 ▁▃▅▇▁
numeric Filler_Speed 57 0.9778296 NA NA NA NA NA 3687.1988862 770.8200208 998.0000000 3888.0000000 3982.0000000 3998.000000 4030.000 ▁▁▁▁▇
numeric Temperature 14 0.9945546 NA NA NA NA NA 65.9675401 1.3827783 63.6000000 65.2000000 65.6000000 66.400000 76.200 ▇▃▁▁▁
numeric Usage_cont 5 0.9980552 NA NA NA NA NA 20.9929618 2.9779364 12.0800000 18.3600000 21.7900000 23.755000 25.900 ▁▃▅▃▇
numeric Carb_Flow 2 0.9992221 NA NA NA NA NA 2468.3542234 1073.6964743 26.0000000 1144.0000000 3028.0000000 3186.000000 5104.000 ▂▅▆▇▁
numeric Density 1 0.9996110 NA NA NA NA NA 1.1736498 0.3775269 0.2400000 0.9000000 0.9800000 1.620000 1.920 ▁▅▇▂▆
numeric MFR 212 0.9175418 NA NA NA NA NA 704.0492582 73.8983094 31.4000000 706.3000000 724.0000000 731.000000 868.600 ▁▁▁▂▇
numeric Balling 1 0.9996110 NA NA NA NA NA 2.1977696 0.9310914 -0.1700000 1.4960000 1.6480000 3.292000 4.012 ▁▇▇▁▇
numeric Pressure_Vacuum 0 1.0000000 NA NA NA NA NA -5.2161027 0.5699933 -6.6000000 -5.6000000 -5.4000000 -5.000000 -3.600 ▂▇▆▂▁
numeric PH 4 0.9984442 NA NA NA NA NA 8.5456486 0.1725162 7.8800000 8.4400000 8.5400000 8.680000 9.360 ▁▅▇▂▁
numeric Oxygen_Filler 12 0.9953326 NA NA NA NA NA 0.0468426 0.0466436 0.0024000 0.0220000 0.0334000 0.060000 0.400 ▇▁▁▁▁
numeric Bowl_Setpoint 2 0.9992221 NA NA NA NA NA 109.3265862 15.3031541 70.0000000 100.0000000 120.0000000 120.000000 140.000 ▁▂▃▇▁
numeric Pressure_Setpoint 12 0.9953326 NA NA NA NA NA 47.6153966 2.0390474 44.0000000 46.0000000 46.0000000 50.000000 52.000 ▁▇▁▆▁
numeric Air_Pressurer 0 1.0000000 NA NA NA NA NA 142.8339946 1.2119170 140.8000000 142.2000000 142.6000000 143.000000 148.200 ▅▇▁▁▁
numeric Alch_Rel 9 0.9964994 NA NA NA NA NA 6.8974161 0.5052753 5.2800000 6.5400000 6.5600000 7.240000 8.620 ▁▇▂▃▁
numeric Carb_Rel 10 0.9961105 NA NA NA NA NA 5.4367825 0.1287183 4.9600000 5.3400000 5.4000000 5.540000 6.060 ▁▇▇▂▁
numeric Balling_Lvl 1 0.9996110 NA NA NA NA NA 2.0500078 0.8703089 0.0000000 1.3800000 1.4800000 3.140000 3.660 ▁▇▂▁▆
beverage_manufacturing_data |> ggplot() + 
  geom_bar(aes(x = Brand_Code)) +
  ggtitle("Distribution of the Brand Codes")

beverage_manufacturing_data %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

beverage_manufacturing_data  |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) + 
  facet_wrap(~key, scales = 'free',  ncol = 6) +
  ggtitle("Boxplots of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

beverage_manufacturing_data_cor <- beverage_manufacturing_data %>% na.omit() %>% select(-Brand_Code) %>% cor()
corrplot::corrplot(tl.cex = 0.7,beverage_manufacturing_data_cor)

The dataset provides a detailed snapshot of variables that are predominantly numeric, along with a single character variable, Brand_Code. Most variables exhibit high levels of completeness, with many exceeding 98% complete. For instance, variables like Pressure_Vacuum and Air_Pressurer are fully complete, whereas MFR stands out with a lower completeness rate of 91.8%. The character variable Brand_Code is 95.3% complete, with four unique codes and no empty values. While the dataset overall shows high data quality, variables like MFR and Brand_Code might require imputation or exclusion if they are deemed critical for analysis.

The numeric variables in the dataset reveal diverse distributions and variability. Some, such as Carb_Volume, Fill_Ounces, PSC, and Temperature, have tightly clustered values and small standard deviations, indicating consistent measurements. On the other hand, variables like Filler_Speed, MFR, and Carb_Flow show wide ranges and high variability, suggesting potential outliers. Distributions are also varied, with variables like PSC_Fill and Alch_Rel showing skewness, while Filler_Speed exhibits a peaked distribution with clustering near the upper bounds. This variability underscores the need for preprocessing steps such as normalization or transformations to handle skewness and outliers effectively.

The target variable, PH, is highly complete (99.8%) with only four missing values. Its mean value of 8.55 and range of 7.88 to 9.36 suggest a well-behaved, normal-like distribution, making it suitable for regression or machine learning tasks. Pressure-related variables such as Carb_Pressure1, Fill_Pressure, and Pressure_Setpoint exhibit narrow standard deviations and consistent trends, making them promising predictors. However, some pressure variables, including Hyd_Pressure1 and Hyd_Pressure3, have negative or zero values that might indicate measurement errors or anomalies. Investigating and addressing these issues is critical to ensure the reliability of these predictors.

Despite its overall robustness, the dataset does have potential challenges. Variables with extreme ranges, such as MFR, Carb_Flow, and Filler_Speed, may require additional preprocessing to minimize the influence of outliers. Additionally, the presence of negative values in pressure-related variables could require further investigation to determine whether they reflect true variability or data recording errors. Standardizing variables with wide ranges and handling missing or anomalous data points are essential steps before modeling.
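
To quantify the skewness noted above, per-variable skewness can be computed directly. A minimal sketch, assuming the e1071 package is installed and using purrr's keep()/map_dbl() as elsewhere in this report:

# Compute skewness for every numeric column, most skewed first
library(e1071)
library(purrr)

beverage_manufacturing_data |>
  keep(is.numeric) |>
  map_dbl(\(x) skewness(x, na.rm = TRUE)) |>
  sort(decreasing = TRUE)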

4. Data Preparation

Before deciding how to handle missing values in our data, we need a better understanding of what is missing. Looking at the columns with the most missing data in the table below, we can see that MFR is an outlier, with roughly four times as many missing values as the next most incomplete variable. The majority of the remaining variables are missing in less than 2% of observations, so we can be comfortable that imputing values will not skew the results. It should be noted that several observations have missing records for PH. As PH is the value we are attempting to predict, and the rows with missing PH values make up such a small proportion of the data, we will drop these rows from our training set. The MFR column does not have enough missing values to justify excluding it from our modeling, so we will include it in our imputation. Before imputing the missing values, we need to explore whether there are patterns to the missingness.

miss_var_summary(tidy_train) |>
  rename(`Missing Count` = n_miss, `Percent Missing` = pct_miss) |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
variable Missing Count Percent Missing
MFR 212 8.25
Filler.Speed 57 2.22
PC.Volume 39 1.52
PSC.CO2 39 1.52
Fill.Ounces 38 1.48
PSC 33 1.28
Carb.Pressure1 32 1.24
Hyd.Pressure4 30 1.17
Carb.Pressure 27 1.05
Carb.Temp 26 1.01
PSC.Fill 23 0.895
Fill.Pressure 22 0.856
Filler.Level 20 0.778
Hyd.Pressure2 15 0.583
Hyd.Pressure3 15 0.583
Temperature 14 0.545
Oxygen.Filler 12 0.467
Pressure.Setpoint 12 0.467
Hyd.Pressure1 11 0.428
Carb.Volume 10 0.389
Carb.Rel 10 0.389
Alch.Rel 9 0.350
Usage.cont 5 0.194
PH 4 0.156
Mnf.Flow 2 0.0778
Carb.Flow 2 0.0778
Bowl.Setpoint 2 0.0778
Density 1 0.0389
Balling 1 0.0389
Balling.Lvl 1 0.0389
Brand.Code 0 0
Pressure.Vacuum 0 0
Air.Pressurer 0 0

Before moving on, it should be noted that Brand.Code has values which could be considered “missing,” as seen in the table below. Comparing the number of observations with no brand code to the missing predictors above, it is actually the second most incomplete column. Since Brand.Code can be converted to a factor and one-hot encoded, we can treat the “missing” codes as another valid level. Doing so allows some of our models to use any predictive power present in the missing values without hampering models that cannot make use of missing values. We will therefore not be imputing values for the missing Brand.Codes; one way to make the blank codes explicit is sketched after the table below.

tidy_train |>
  count(Brand.Code) |>
  arrange(desc(n))|> 
  kable() |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
Brand.Code n
B 1239
D 615
C 304
A 293
120
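
As described above, the blank codes can be recoded as an explicit factor level rather than imputed. A minimal sketch of that idea (the level name "Unknown" and the object brand_example are illustrative, not part of the original pipeline):

# Sketch: recode blank/NA brand codes as an explicit "Unknown" level so
# one-hot encoding treats "missing" as its own category
brand_example <- tidy_train |>
  mutate(Brand.Code = factor(if_else(
    is.na(Brand.Code) | Brand.Code == "", "Unknown", Brand.Code
  )))
levels(brand_example$Brand.Code)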

As a last check before imputing values, we need to examine whether there are any strong patterns in the missingness of our data. Looking at the frequency plot below, we can see that while there is no significant pattern to the missingness of most variables, Filler.Speed does frequently appear to be missing alongside MFR. This is not necessarily surprising, as MFR is by far our most incomplete variable. It could be argued that the correlated missingness between the two variables justifies an imputation technique that accounts for it, such as Multivariate Imputation by Chained Equations (MICE). Referring back to our missing counts, however, Filler.Speed has only 57 missing observations, and fewer than half of those are also missing MFR. Given that 30 of our predictors have missing data, one pair of predictors exhibiting apparently correlated missingness is still consistent with random missingness. As no other predictors exhibit a high level of correlated missingness, we conclude there is no meaningful relationship between missing values across predictors and, even if there were, a method such as MICE would provide little value for so few potentially correlated missing values.

aggr_plot <- aggr(tidy_train, 
                  col=c('navyblue','red'), 
                  numbers=TRUE, 
                  sortVars=TRUE, 
                  labels=names(tidy_train), 
                  cex.axis=.7, 
                  gap=3, 
                  ylab=c("Proportion Missing","Pattern"), 
                  only.miss=TRUE, 
                  sortCombs=TRUE,
                  combined=TRUE)

## 
##  Variables sorted by number of missings: 
##           Variable Count
##                MFR   212
##       Filler.Speed    57
##          PC.Volume    39
##            PSC.CO2    39
##        Fill.Ounces    38
##                PSC    33
##     Carb.Pressure1    32
##      Hyd.Pressure4    30
##      Carb.Pressure    27
##          Carb.Temp    26
##           PSC.Fill    23
##      Fill.Pressure    22
##       Filler.Level    20
##      Hyd.Pressure2    15
##      Hyd.Pressure3    15
##        Temperature    14
##      Oxygen.Filler    12
##  Pressure.Setpoint    12
##      Hyd.Pressure1    11
##        Carb.Volume    10
##           Carb.Rel    10
##           Alch.Rel     9
##         Usage.cont     5
##                 PH     4
##           Mnf.Flow     2
##          Carb.Flow     2
##      Bowl.Setpoint     2
##            Density     1
##            Balling     1
##        Balling.Lvl     1
##         Brand.Code     0
##    Pressure.Vacuum     0
##      Air.Pressurer     0

Having determined what we will do with the special case variables of PH and Brand.Code, as well as determining there is not a significant level of correlated missingness in the remaining predictors, we can safely impute the remaining missing values. We will be using bootstrap aggregation (bagging) as our method of choice for imputing the missing values. We chose this method as it is able to make use of other predictors with missing values when imputing. Given the large proportion of predictors with any missing values this is a necessary trait and will result in better predictions.

# Turn integer columns into numerics to prevent errors
tidy_train %<>% mutate(across(
  .cols = c('Filler.Speed', 'Hyd.Pressure4', 'Bowl.Setpoint', 'Carb.Flow'),
  .fns = as.numeric
  ))

# Dropping rows with missing PH; blank Brand.Code values are kept as their own level
imputation_set <- tidy_train |>
  filter(!is.na(PH))

# Create a one hot encoding tibble
dummies <- dummyVars( ~ Brand.Code, data = imputation_set)
one_hot_df <- predict(dummies, newdata = imputation_set) |>
  as_tibble()

# Add the one hot df to the original and remove the categorical column
one_hot_df <- imputation_set |>
  cbind(one_hot_df) |>
  select(-Brand.Code)

preprocessor <- preProcess(one_hot_df,
                           method = "bagImpute",
                           allowParallelUpdates = TRUE) 

imputed_data <- predict(preprocessor, newdata = one_hot_df)
imputed_data |>
  write_csv("imputed_test_data.csv")

5. Model Development

This section outlines the methodology for building and implementing the predictive model. It includes details on data preprocessing, feature selection, hyperparameter tuning, and the modeling framework used. By leveraging advanced machine learning techniques, the objective is to create a robust, accurate, and generalizable model that captures the relationships between key variables and the target outcome.

manufacturing_tr <- read.csv("https://raw.githubusercontent.com/MarjeteV/data624/refs/heads/main/imputed_test_data.csv")

colnames(manufacturing_tr) <- gsub(" ", "_", colnames(manufacturing_tr))
brand_code_col <- c("Brand.CodeA","Brand.CodeB","Brand.CodeC","Brand.CodeD")

5.1 Support Vector Machine

This section outlines the training and tuning of a Support Vector Machine with Gaussian Radial Basis Function Kernel to predict pH levels, leveraging the model’s ability to handle non-linear relationships. The SVM model was selected due to its flexibility in capturing complex patterns in the data, making it well-suited for the task. Using a Gaussian Radial Basis Function (RBF) kernel, the SVM transforms input data into a higher-dimensional space by computing the similarity between data points. This transformation enables the model to find an optimal decision boundary or regression function in the new space, effectively capturing non-linear relationships between predictors and the target variable. This approach is particularly valuable for addressing the variability observed in pH levels, where intricate interactions among features may influence the outcome.

5.1.1 Support Vector Machine Model Preprocessing

The dataset is split into training (75%) and testing (25%) subsets using random sampling to ensure the model is trained on a representative and diverse portion of the data while reserving an independent set for unbiased performance validation. Feature selection is applied prior to fitting a Radial Support Vector Machine (SVM) model to enhance model performance and efficiency. Recursive Feature Elimination (RFE) is employed as a robust method to identify the most relevant predictors by systematically ranking features based on their importance and iteratively removing those with minimal contribution. The dataset includes a categorical variable, Brand Code, which is transformed using target encoding. Target encoding replaces each category with the mean of the target variable for that category, allowing the relationship between the categorical variable and the target to be represented numerically. This transformation ensures compatibility with the RFE process by converting the categorical variable into a format that can be used effectively in feature selection. A Random Forest (rfFuncs) is used to evaluate feature importance, with performance assessed through 5-fold cross-validation to ensure the reliability and robustness of the selected feature subset.

set.seed(8675309)

trainIndex <- sample(1:nrow(manufacturing_tr), size = 0.75 * nrow(manufacturing_tr))

beverage_man_train <- manufacturing_tr[trainIndex, ]
beverage_man_test <- manufacturing_tr[-trainIndex, ]
rfe_dataset <- beverage_man_train %>%
  rowwise() %>%
  mutate(brand_code = case_when(
    Brand.CodeA == 1 ~ "A",
    Brand.CodeB == 1 ~ "B",
    Brand.CodeC == 1 ~ "C",
    Brand.CodeD == 1 ~ "D",
    TRUE ~ NA_character_
  )) %>%
  ungroup() %>% group_by(brand_code) %>% 
  mutate(brand_code_encoded = mean(PH)) |> ungroup() |>
  select(-c(Brand.CodeA, Brand.CodeB, Brand.CodeC, Brand.CodeD,"brand_code")) 
set.seed(8675309)
ignore_col <- c("PH")

control <- rfeControl(functions = rfFuncs, method = "cv", number = 5,
                      allowParallel = T)

# Perform RFE
rfe_results <- rfe(
    rfe_dataset |> select(-all_of(ignore_col)),
  rfe_dataset$PH, 
  sizes = 1:(ncol(rfe_dataset) - 1),  # one candidate size per predictor, excluding PH
  rfeControl = control
)
plot(rfe_results)

rfeMaxRMSE <- max(rfe_results$resample[,"RMSE"])
rfeMinRMSE <- min(rfe_results$resample[,"RMSE"])

rfe_results$resample |> kable(caption = "Recursive Feature Elimination Results") |> kable_styling() |>  kable_classic()
Recursive Feature Elimination Results
Variables RMSE Rsquared MAE Resample
16 16 0.1006804 0.6535230 0.0735816 Fold1
48 16 0.0900474 0.7115275 0.0667801 Fold2
80 16 0.1017494 0.6698212 0.0743803 Fold3
112 16 0.1026493 0.7016155 0.0733228 Fold4
144 16 0.0954944 0.6961513 0.0683633 Fold5

The plot generated from the Recursive Feature Elimination (RFE) process shows that the model achieves low RMSE values with the 16-feature subset, indicating strong predictive performance during feature selection. Across the 5-fold cross-validation, the results are consistent, with RMSE, R², and MAE remaining stable across folds. The RMSE ranges from 0.0900474 to 0.1026493 and the MAE stays below 0.075, confirming the model’s accuracy and stability, while R² values between 0.6535230 and 0.7115275 indicate the model explains a substantial portion of the variance in the target variable. These consistent metrics highlight the robustness of the model and validate the 16-feature subset as a reliable choice for prediction.

varImportance <- varImp(rfe_results)
rfe_importance_df <- data.frame(Variable= rownames(varImportance),
                                Overall=varImportance$Overall ) 

rfe_importance_df |> ggplot( aes(y = reorder(Variable, +Overall), x = Overall)) + geom_bar(stat = "identity") + labs(
    title = "Variable Importance",
    x = "Overall Score",
    y = "Variable"
  )  + theme_minimal()

imp_predictors <- predictors(rfe_results)
best_predictors <- append(brand_code_col,imp_predictors[imp_predictors != "brand_code_encoded"])

The Recursive Feature Elimination (RFE) results provide a clear ranking of predictors, emphasizing their contributions to the model’s performance. The top-ranked variable, brand_code_encoded, underscores the importance of capturing brand-specific patterns through target encoding, as it strongly correlates with the target variable. Among the operational process variables, Mnf.Flow, Oxygen.Filler, Pressure.Vacuum, and Temperature stand out, reflecting their critical role in driving variability in the target. These variables highlight the influence of flow rates, oxygen levels, pressure conditions, and environmental factors in the manufacturing process, which are essential for maintaining product quality.

Additional contributors, such as Usage.cont, Carb.Rel, and Balling.Lvl, showcase the importance of operational and compositional properties in fine-tuning predictions. While variables like Filler.Speed, Carb.Flow, and Balling rank lower, they still provide valuable supplementary information about the operational flow and composition dynamics.

Features such as Density, Bowl.Setpoint, and Filler.Level rank among the lowest, indicating limited direct impact or indirect influence through higher-ranked variables. Similarly, Carb.Pressure1 and Hyd.Pressure3 show minimal importance, suggesting their contribution to the target is captured by other operational metrics.

The RFE results highlight a robust combination of categorical, operational, and compositional variables, with brand_code_encoded and key operational factors leading the rankings. The brand_code_encoded variable, while ranked as the most significant, will be provided to the final model in its one-hot encoded format to ensure compatibility with the Support Vector Machine (SVM) regression model.

5.1.2 Support Vector Machine Model Setup

This section details the implementation and tuning of a Support Vector Machine (SVM) regression model with a Gaussian Radial Basis Function (RBF) kernel to predict pH levels in the manufacturing process. The model is trained on a carefully selected set of predictors, which will be preprocessed using centering and scaling techniques. These preprocessing steps ensure that all variables contribute proportionally to the model and meet the requirements for SVM, which is sensitive to the magnitude of features.

To optimize model performance, hyperparameter tuning was conducted using a systematic grid search approach. This process explored a range of values for two critical parameters: the kernel width (sigma) and the regularization parameter (C). The kernel width (sigma) was varied from 0.01 to 0.2 in increments of 0.01, controlling the locality of the RBF kernel and determining how far its influence extends in the feature space. The regularization parameter (C) was tested over a range of 1 to 5 in increments of 1, balancing the trade-off between minimizing errors on the training data and maintaining a simpler, more generalizable model. Together, these parameters define the flexibility of the regression function and its ability to manage prediction deviations.

The training process incorporated cross-validation to evaluate model performance and ensure the selected hyperparameters generalize well to unseen data. This approach reduces the risk of overfitting and helps develop a predictive model that is both accurate and robust, meeting regulatory requirements for monitoring and reporting pH variability in the manufacturing process.

set.seed(8675309)

svmRTuned <- train(
  x = beverage_man_train |> select(all_of(best_predictors)),
  y = beverage_man_train$PH,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(sigma = seq(0.01, 0.2, 0.01), C = seq(1, 5, 1)),
  trControl = trainControl(method = "cv", allowParallel = TRUE)
)
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 4 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.18 
## 
## Number of Support Vectors : 1542 
## 
## Objective Function Value : -1201.481 
## Training error : 0.085704
plot(svmRTuned)

sigmaParam <- svmRTuned$bestTune[1,"sigma"]
cParam <- svmRTuned$bestTune[1,"C"]

svmResults <- svmRTuned$results |> filter(sigma == sigmaParam & C==cParam)
svmResults |> kable(caption = "SVM Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
SVM Training Set Evaluation Metrics
sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.18 4 0.1032428 0.6365406 0.0722884 0.0081395 0.057189 0.004639

The results of the cross-validated SVM model with a sigma value of 0.18 and a cost parameter (C) of 4 demonstrate robust performance. The model achieves a low RMSE of 0.1032428 and an MAE of 0.0722884, indicating accurate and consistent predictions on the scaled data. The R-squared value of 0.6365406 suggests the model explains approximately 64% of the variance in the target variable, showing moderate explanatory power. The standard deviations for RMSE (0.0081395), R-squared (0.057189), and MAE (0.004639) across resampling folds are small, highlighting the stability of the model’s performance across different data splits. These metrics indicate that the chosen parameters produce a reliable and well-generalized model for the given training dataset.

Building on these results, it is essential to evaluate the model’s consistency across training and test datasets to ensure its robustness and generalizability. By comparing cross-validation metrics with test set performance, we can assess whether the SVM model maintains its predictive accuracy and explanatory power when applied to unseen data.

svmPred <- predict(svmRTuned,newdata = beverage_man_test |> select(all_of(best_predictors)))

svm_test_post <- postResample(pred = svmPred, obs = beverage_man_test$PH) |> as.data.frame() 

svm_test_metrics <- data.frame(RMSE=svm_test_post[1,1],Rsquared=svm_test_post[2,1],MAE=svm_test_post[3,1])
svm_test_metrics |> kable(caption = "SVM Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
SVM Test Set Evaluation Metrics
RMSE Rsquared MAE
0.107012 0.6314344 0.0751116

The SVM model demonstrates consistent performance between the training and test datasets, as indicated by the metrics from cross-validation and post-resample evaluation. The cross-validated training RMSE (0.1032428) is close to the test RMSE (0.107012), suggesting stable predictive accuracy. Similarly, the MAE values are close, with 0.0722884 in training and 0.0751116 on the test set, reflecting consistent error magnitudes. The R-squared value shows a minor decrease from 0.6365406 in training to 0.6314344 on the test set, indicating that the model maintains reasonable explanatory power without significant overfitting. These results suggest that the model generalizes well and is reliable for making predictions on unseen data.

Overall, the SVM model demonstrates moderate predictive performance, with consistent metrics across training and test datasets that highlight its robustness and reliability. While the RMSE and MAE values indicate good predictive accuracy, the R-squared suggests room for improvement in explaining the variance of the target variable. These results establish a solid benchmark for comparison with other models.

5.2 Partial Least Squares (PLS)

Predictive modeling with partial least squares (PLS) is an effective technique, especially with datasets containing multiple correlated predictor variables. As PLS can identify key relationships between predictors and outcomes, it is ideal for understanding how PH levels in production are determined. PLS performs well by extracting the most relevant variability from both predictors and the outcome, and PLS can handle high-dimensional data. However, challenges are associated with the PLS approach, such as interpretability in understanding its latent components. PLS assumes linear relationships and may overemphasize high-variance predictors, limiting its effectiveness with nonlinear interactions or less relevant variables.
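
One standard way to state the criterion for a single response: the first PLS component is chosen to maximize the squared covariance between a linear combination of the predictors and the outcome, with later components extracted the same way from the deflated data:

\[
w_1 = \underset{\lVert w \rVert = 1}{\arg\max}\ \operatorname{Cov}(Xw,\, y)^2
\]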

The PLS regression below uses repeated cross-validation to minimize the root mean squared error of prediction and determine the optimal number of components, evaluating up to 33 to balance flexibility and simplicity. The model achieved a test RMSE of 0.137, an R² of 0.368, and a test MAE of 0.106. While the RMSE and MAE indicate reasonable predictions, the low R² value shows the model struggles to explain much of PH’s variability, indicating limited predictive power.
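
The PLS model below (and the CART, Cubist, Ridge, LASSO, and Elastic Net models that follow) is fit on caret_train and evaluated on caret_test, which are not defined in the code shown. A minimal sketch of how such a split could be constructed, assuming the same 75/25 partition and seed used elsewhere in this report:

# Assumed split: the object names match the later code, but the original
# partitioning step is not shown in this report
set.seed(8675309)
caret_split <- createDataPartition(manufacturing_tr$PH, p = 0.75, list = FALSE)
caret_train <- manufacturing_tr[caret_split, ]
caret_test  <- manufacturing_tr[-caret_split, ]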

set.seed(8675309)

model_control <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           allowParallel = TRUE)

plsGrid <- expand.grid(ncomp = seq(1, 33, by = 1))

pls_model <- train(
  PH ~ .,
  data = caret_train,
  method = "pls",
  tuneGrid = plsGrid,
  preProcess = c("center", "scale", "nzv", "BoxCox"),
  trControl = model_control
  )
ggplot(pls_model)

plsNcomp <- pls_model$bestTune[1,"ncomp"]
pls_model_res <- pls_model$results |> filter(ncomp == plsNcomp)
pls_model_res |> kable(caption = "Partial Least Squares Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Partial Least Squares Training Set Evaluation Metrics
ncomp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
15 0.131832 0.4187465 0.1026896 0.0052329 0.0298145 0.0037262
pls_model_pred <- predict(pls_model, newdata = caret_test)
plsPostS <- postResample(pred = pls_model_pred, obs = caret_test$PH)
pls_test_metrics <- as.data.frame(t(plsPostS))

pls_test_metrics |> kable(caption = "Partial Least Squares Evaluation Metrics") |> kable_styling() |>  kable_classic()
Partial Least Squares Evaluation Metrics
RMSE Rsquared MAE
0.137191 0.3677537 0.1056799

5.3 Classification And Regression Trees (CART)

Classification And Regression Trees (CART) is a non-parametric decision tree algorithm. For regression, CART splits the data on the values of specific input predictors to predict a continuous target variable. CART is a valuable tool for predicting PH in manufacturing because it can handle non-linear relationships and interactions between predictors. Another advantage is its interpretability: the tree structure makes results easy to understand and communicate. It is robust to outliers and requires minimal preprocessing, as it does not depend on scaling or transformation of predictors. However, CART has limitations: it tends to overfit with overly complex trees that capture noise, though pruning can reduce this at the risk of oversimplifying the model. CART can also be unstable, with small data changes causing large tree adjustments, and it may underperform compared to advanced methods like ensembles. Still, it remains a valuable tool when properly tuned and validated.

set.seed(8675309)

#tuning grid
tune_grid <- expand.grid(
  cp = seq(0.001, 0.05, by = 0.005)  
)

optimized_train_control <- trainControl(
  method = "cv",
  number = 5,           
  verboseIter = FALSE,  
  allowParallel = TRUE
)

# Train  using caret
optimized_cart_model <- train(
  PH ~ ., 
  data = caret_train,
  method = "rpart",
  trControl = optimized_train_control, 
  tuneGrid = tune_grid  
)
plot(optimized_cart_model)

cartCPVal <- optimized_cart_model$bestTune[1,"cp"]
cart_model_res <- optimized_cart_model$results |> filter(cp == cartCPVal)
cart_model_res |> kable(caption = "Cart Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Cart Training Set Evaluation Metrics
cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.001 0.1239147 0.5133253 0.0886809 0.0066336 0.0508 0.0031134
cart_model_pred <- predict(optimized_cart_model, newdata = caret_test)
cartPostS <- postResample(pred = cart_model_pred, obs = caret_test$PH)
cart_test_metrics <- as.data.frame(t(cartPostS))
cart_test_metrics |> kable(caption = "Cart Evaluation Metrics") |> kable_styling() |>  kable_classic()
Cart Evaluation Metrics
RMSE Rsquared MAE
0.1269302 0.4843991 0.0884333

The code uses the caret package to train an optimized CART model for predicting PH, leveraging parallel processing to improve efficiency. A tuning grid is defined for the complexity parameter (cp), ranging from 0.001 to 0.05, to identify the best pruning level. Five-fold cross-validation ensures the model is robust and avoids overfitting. The best cp value, 0.001, is selected based on cross-validation results, and the model’s performance is evaluated on the test dataset using RMSE, R², and MAE. The results show a test RMSE of 0.127, an R² of 0.484, and a test MAE of 0.088, indicating the model explains only part of the variability in PH and highlighting some limitations despite effective tuning.

5.4 Cubist

Cubist models are rule-based models that draw on an amalgamation of other techniques, including boosting-like committees, linear model smoothing, and adjustments based on nearby data points (neighbors). The model was generated using the tuning parameters shown below.

set.seed(8675309)

cubist_grid <- expand.grid(committees = c(1, 10, 50, 100), neighbors = seq(0,9))
model_control <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           allowParallel = TRUE)

cubist_model <- train(
  PH ~ .,
  data = caret_train,
  method = "cubist",
  tuneGrid = cubist_grid,
  trControl = model_control
  )

The plot below shows the Cubist model’s performance across our tuning parameters. Performance largely levels off beyond about 5 neighbors, and there is no benefit beyond 50 committees.

ggplot(cubist_model)

committeesV <- cubist_model$bestTune[1,"committees"]
neighborsV <- cubist_model$bestTune[1,"neighbors"]

cubist_model_res <- cubist_model$results |> filter(committees == committeesV & neighbors==neighborsV)
cubist_model_res |> kable(caption = "Cubist Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Cubist Training Set Evaluation Metrics
committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD MAESD
50 8 0.1008441 0.6638041 0.0724652 0.0056245 0.0333644 0.0029139
cubist_model_pred <- predict(cubist_model, newdata = caret_test)
cubistPostS <- postResample(pred = cubist_model_pred, obs = caret_test$PH)
cubist_test_metrics <- as.data.frame(t(cubistPostS))

cubist_test_metrics |> kable(caption = "Cubist Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Cubist Test Set Evaluation Metrics
RMSE Rsquared MAE
0.0910123 0.7095401 0.0660911

The resulting model had a test RMSE of 0.0910123 and an \(R^2\) of 0.7095401. While one of the best models generated, we chose not to select the Cubist model because the XGBoost model achieved a similar level of performance. While the rules of the Cubist model give some insight into its decision-making process, there is no universally agreed-upon method of determining predictor importance for Cubist models. Additionally, the use of committees and neighbor adjustments further obfuscates interpretability. As the different methods of determining predictor importance can yield significantly different interpretations of the model, and the number of rules is extremely large, XGBoost stands as the more interpretable model. With very similar performance, interpretability is the deciding factor.
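
For readers who do want to inspect the fitted rules and caret’s usage-based importance, both are available directly from the trained object; a brief sketch:

# Print the rule set of the final Cubist model (the output can be very long)
summary(cubist_model$finalModel)

# Usage-based predictor importance as computed by caret for Cubist models
varImp(cubist_model)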

5.5 Ridge

Ridge regression models address high-dimensional data as well as highly correlated predictors, two of the same challenges PLS targets. They are also resilient to overfitting, which can plague other models. As a linear model without feature selection, ridge regression is highly interpretable and generally provides a good performance baseline. The model was tuned using the code below.
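
Formally, ridge regression adds an \(L_2\) penalty, weighted by \(\lambda\), to the least-squares objective, shrinking all coefficients toward zero without eliminating any:

\[
\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]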

ridgeGrid <- expand.grid(lambda = seq(0, 0.03, by = 0.005))

ridge_model <- train(
  PH ~ .,
  data = caret_train,
  method = "ridge",
  preProcess = c("center", "scale", "corr"),
  tuneGrid = ridgeGrid,
  trControl = model_control
  )

The resulting model performed best with a weight decay (lambda) of 0.03, as seen in the plot below. On the test set the model had an RMSE of 0.129377 and an \(R^2\) of 0.4104624 which, while respectable for a linear model, does not compete with the other models under consideration.

ridgeLambdaV <- ridge_model$bestTune[1,"lambda"]
ridge_model_res <- ridge_model$results |> filter(lambda == ridgeLambdaV)
ridge_model_res |> kable(caption = "Ridge Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Ridge Training Set Evaluation Metrics
lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.03 0.1377302 0.3730064 0.1070736 0.0056827 0.0438851 0.0040439
ggplot(ridge_model)

ridge_model_pred <- predict(ridge_model, newdata = caret_test)
ridgePost <- postResample(pred = ridge_model_pred, obs = caret_test$PH)
ridge_test_metrics <- as.data.frame(t(ridgePost))

ridge_test_metrics |> kable(caption = "Ridge Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Ridge Test Set Evaluation Metrics
RMSE Rsquared MAE
0.129377 0.4104624 0.1028956

5.6 LASSO Regression

LASSO regression is an extension of Ridge Regression which, in addition to regularizing, conducts feature selection. The model was generated using the code below.
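
Formally, the LASSO swaps the ridge \(L_2\) penalty for an \(L_1\) penalty, which can shrink coefficients exactly to zero and thereby performs the feature selection described above:

\[
\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]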

lassoGrid <- expand.grid(.fraction = seq(.05, 1, length = 20))

lasso_model <- train(
  PH ~ .,
  data = caret_train,
  method = "lasso",
  preProcess = c("center", "scale"),
  tuneGrid = lassoGrid,
  trControl = model_control
  )
fractionLassoV <- lasso_model$bestTune[1,"fraction"]
lasso_model_res <- lasso_model$results |> filter(fraction == fractionLassoV)
lasso_model_res |> kable(caption = "Lasso Regression Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Lasso Regression Training Set Evaluation Metrics
fraction RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.9 0.1357746 0.3914469 0.1047628 0.0077368 0.0395431 0.0050844

Performance largely leveled off once the fraction reached about 0.75, with the best tune at a fraction of 0.9. The resulting model had an RMSE of 0.1266923 and an \(R^2\) of 0.4345533 on the test set, which is practically equivalent to the Ridge model.

ggplot(lasso_model)

lasso_model_pred <- predict(lasso_model, newdata = caret_test)
lassoPostS <- postResample(pred = lasso_model_pred, obs = caret_test$PH)
lasso_test_metrics <- as.data.frame(t(lassoPostS))
lasso_test_metrics |> kable(caption = "Lasso Regression Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Lasso Regression Test Set Evaluation Metrics
RMSE Rsquared MAE
0.1266923 0.4345533 0.1001281

5.7 Elastic Net (ENET)

Elastic Net models are a generalization of LASSO which makes use of multiple tuning parameters to get the benefits of the regularization of Ridge models with the feature selection of LASSO models. The model was tuned using the code below.
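
The elastic net objective combines both penalties, which is why the grid below tunes two parameters; in caret’s enet parameterization, lambda is the \(L_2\) weight and fraction controls the amount of \(L_1\) shrinkage along the solution path:

\[
\hat{\beta}^{\text{enet}} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
\]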

set.seed(8675309)

enetGrid <- expand.grid(.lambda = c(0, 0.01, .1), .fraction = seq(.05, 1, length = 20))

enet_model <- train(
  PH ~ .,
  data = caret_train,
  method = "enet",
  preProcess = c("center", "scale"),
  tuneGrid = enetGrid,
  trControl = model_control
  )
fractionEnentV <- enet_model$bestTune[1,"fraction"]
enetLambdaV <- enet_model$bestTune[1,"lambda"]

enet_model_res <- enet_model$results |> filter(fraction == fractionEnentV & lambda == enetLambdaV)
enet_model_res |> kable(caption = "Elastic Net Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Elastic Net Training Set Evaluation Metrics
lambda fraction RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.01 1 0.1361057 0.3879753 0.1049803 0.0049833 0.027471 0.0032481

The resulting model suggests that linear models are likely not the best solution for this data, as performance kept improving as more of the full least-squares solution was included. The final model had an RMSE of 0.1267145 and an \(R^2\) of 0.4343588 on the test set, performing equivalently to the Ridge and LASSO models.

ggplot(enet_model)

enet_model_pred <- predict(enet_model, newdata = caret_test)
enetPostS <- postResample(pred = enet_model_pred, obs = caret_test$PH)
enet_test_metrics <- as.data.frame(t(enetPostS))
enet_test_metrics |> kable(caption = "Elastic Net Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Elastic Net Test Set Evaluation Metrics
RMSE Rsquared MAE
0.1267145 0.4343588 0.1003512

5.8 Random Forest

The optimal Random Forest model ended up having a slightly disappointing \(R^2\) of 0.6639744 on the training set; however, when evaluated on the test set we saw an \(R^2\) of 0.7273597 and the RMSE decreased from 0.1018283 to 0.0891283. This is a particularly encouraging result, as one of the dangers of Random Forest models is a propensity to overfit. Better performance on the test set than the training set indicates a low likelihood of overfitting, and this model will likely generalize well.

set.seed(8675309)

data_split <- createDataPartition(imputed_data$PH, p = 0.75, list = FALSE)
training_data <- imputed_data[data_split, ]
testing_data <- imputed_data[-data_split, ]


# Separate predictors and response variable for both training and testing sets
train_x <- training_data %>% select(-PH)
train_y <- training_data$PH
test_x <- testing_data %>% select(-PH)
test_y <- testing_data$PH
set.seed(8675309)

train_control <- trainControl(method = "cv", number = 5)
# Define the grid for hyperparameters to tune (only mtry here)
tune_grid <- expand.grid(
  mtry = 34)  # mtry values to test


# Train the Random Forest model using cross-validation for mtry
rf_cv_model <- train(
  x = train_x,
  y = train_y,
  method = "rf",
  trControl = train_control,
  tuneGrid = tune_grid,
  ntree =  1000,  # Number of trees
  importance = TRUE
)
rfMtryV <- rf_cv_model$bestTune[1,"mtry"]

rf_model_res <- rf_cv_model$results |> filter(mtry == rfMtryV)
rf_model_res |> kable(caption = "Random Forest Training Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Random Forest Training Set Evaluation Metrics
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
34 0.1018283 0.6639744 0.0734955 0.0050027 0.0456675 0.0016966
rf_cv_model_pred <- predict(rf_cv_model, newdata = test_x)
randomForestPost <- postResample(pred = rf_cv_model_pred, obs = test_y)
rf_test_metrics <- as.data.frame(t(randomForestPost))
rf_test_metrics |> kable(caption = "Random Forest Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
Random Forest Test Set Evaluation Metrics
RMSE Rsquared MAE
0.0891283 0.7273597 0.0646925
rf_final_varImportance <- varImp(rf_cv_model)
rf_final_importance_df <- data.frame(Variable= rownames(rf_final_varImportance$importance),
                                Overall=rf_final_varImportance$importance$Overall) 

rf_final_importance_df |> arrange(-Overall) %>% head(10) %>%
  ggplot( aes(y = reorder(Variable, +Overall), x = Overall)) + geom_bar(stat = "identity") + labs(
    title = "Variable Importance",
    x = "Overall Score",
    y = "Variable"
  )  + theme_minimal()

5.9 XGBoost

XGBoost (Extreme Gradient Boosting) is a robust machine learning algorithm built on gradient boosting techniques, which combine gradient descent for numerical optimization with boosting, an ensemble method that iteratively improves weak learners to create strong models. The term “gradient” reflects the algorithm’s use of derivatives of the loss function to optimize predictions. XGBoost operates through three main components: an additive model that sequentially builds improvements, a customizable differentiable loss function to measure prediction errors, and weak learners, typically decision trees, refined iteratively by addressing their shortcomings. By framing boosting as a numerical optimization problem, XGBoost enhances flexibility and predictive performance, making it a favored tool in machine learning.
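
In equation form, each boosting round fits a new weak learner \(h_m\) to the negative gradient of the loss (the pseudo-residuals) and adds it to the ensemble, scaled by the learning rate \(\eta\) (the eta parameter tuned below):

\[
F_m(x) = F_{m-1}(x) + \eta\, h_m(x), \qquad r_{im} = -\left.\frac{\partial L\left(y_i,\, F(x_i)\right)}{\partial F(x_i)}\right|_{F = F_{m-1}}
\]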

set.seed(8675309)

data_split <- createDataPartition(imputed_data$PH, p = 0.75, list = FALSE)
training_data <- imputed_data[data_split, ]
testing_data <- imputed_data[-data_split, ]
set.seed(8675309)

# Define trainControl for cross-validation
train_control <- trainControl(
  method = "cv",                # Cross-validation
  number = 5,                   # Number of folds
  verboseIter = FALSE           # Do not print per-iteration progress
)

# Define a grid for hyperparameter tuning
xgb_grid <- expand.grid(
  nrounds = 1000,               # Number of boosting rounds
  max_depth = 6,                # Maximum tree depth
  eta = 0.05,                   # Learning rate
  gamma = 0,                    # Minimum loss reduction
  colsample_bytree = 0.8,       # Fraction of features used for each tree
  min_child_weight = 6,         # Minimum number of observations for a split
  subsample = 0.9               # Fraction of data used for each tree
)

# Train the XGBoost model using caret
xgb_model <- train(
  PH ~ .,                       # Formula (PH is the target variable)
  data = training_data,         # Training dataset
  method = "xgbTree",           # Use XGBoost
  trControl = train_control,    # Cross-validation control
  tuneGrid = xgb_grid,          # Hyperparameter grid
  metric = "RMSE"               # Optimize for RMSE
)
xgb_model$results |> kable(caption = " XGBoost Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
XGBoost Set Evaluation Metrics
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1000 6 0.05 0 0.8 6 0.9 0.1040341 0.643032 0.0754363 0.003141 0.0328406 0.0028108
# Make predictions using the trained model and iteration_range
model_pred <- predict(xgb_model, newdata = testing_data)
xgBoostPostS <- postResample(pred = model_pred, obs = testing_data$PH)
xgBoos_test_metrics <- as.data.frame(t(xgBoostPostS))
xgBoos_test_metrics |> kable(caption = "XGBoost Test Set Evaluation Metrics") |> kable_styling() |>  kable_classic()
XGBoost Test Set Evaluation Metrics
RMSE Rsquared MAE
0.0890684 0.7218483 0.0664595
xgBoost_varImportance <- varImp(xgb_model)
xgBoost_importance_df <- data.frame(Variable= rownames(xgBoost_varImportance$importance),
                                Overall=xgBoost_varImportance$importance$Overall) 

xgBoost_importance_df |> arrange(-Overall) %>% head(10) %>%
  ggplot( aes(y = reorder(Variable, +Overall), x = Overall)) + geom_bar(stat = "identity") + labs(
    title = "Variable Importance",
    x = "Overall Score",
    y = "Variable"
  )  + theme_minimal()

5.10 Neural Networks

A neural network is a node-and-layer system trained on an existing dataset. An artificial neural network (ANN) contains an input layer, hidden layers, and an output layer, each composed of nodes. At the input layer, each node corresponds to one dimension of the dataset. At the output layer, each node represents a final output or prediction. The hidden layers are the core of the model, where the computation and learning take place. Each hidden-layer node takes in a set of inputs, performs a computation based on its activation function, and produces an output. These outputs are assigned weights, which determine their level of consideration in the final output and are adjusted throughout the training process.

The model is trained over many cycles, called epochs. At the end of each batch, a loss function is evaluated based on the model’s performance. Batches are subsets of the training dataset; in the case of the nnet package in R, the batch size equals the size of the dataset as a whole. With supervised learning, the dataset has various input parameters as well as a known output, and the loss function compares the model output with the known output. Weights for each node are adjusted based on this loss function, and the model is said to converge when the loss function no longer decreases.
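
For the single-hidden-layer networks fit by nnet with a linear output (linout = TRUE), the prediction can be written explicitly, with logistic hidden-unit activations \(g\), weights \(w\), and biases \(b\):

\[
\hat{y} = b_0 + \sum_{k=1}^{K} w_k\, g\!\left(b_k + \sum_{j=1}^{p} w_{jk}\, x_j\right), \qquad g(z) = \frac{1}{1 + e^{-z}}
\]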

ANN models can utilize parallel processing, allowing faster model training. As the weight of each node is adjusted through the process mentioned above, ANN models can distinguish important input parameters; as our dataset contains over 30 input variables, we tested various ANN models as a potential choice for predicting pH. Since some of the input variables are highly correlated, we can account for this in the ANN model by testing various weight decay values to prevent overfitting.

The nnet method from the nnet package was used with caret to tune an ANN model to the data.

StudentData <- read.csv("https://raw.githubusercontent.com/MarjeteV/data624/refs/heads/main/imputed_test_data.csv")

# Parallel processing
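# The original write-up left this section blank; the lines below are one
# common setup (an assumption on our part) using the doParallel package,
# which caret can exploit during cross-validation via allowParallel.
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)
# When training is done, release the workers with stopCluster(cl).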


set.seed(8675309)
trainIndex <- createDataPartition(StudentData$PH, p = 0.75, list = FALSE)
trainData <- StudentData[trainIndex, ]
testData <- StudentData[-trainIndex, ]
# Separate x, y, train
x_train <- trainData[, names(trainData) != "PH"]
y_train <- trainData$PH
x_test <- testData[, names(testData) != "PH"]
y_test <- testData$PH

Data was preprocessed with centering and scaling, and a linear output activation function was used. The model was tuned with cross validation, and a maximum of 500 iterations was specified in the tuning phase. Since the nnet package defaults to a batch size equal to the training set size, the number of epochs equals the specified maximum number of iterations, provided the model does not converge earlier. For the tuning grid, we tested hidden layer sizes of 1 to 35 nodes and weight decay regularization values of 0.01, 0.05, 0.1, 0.5, 1, and 2. Using these parameters, the best tuned model had 17 nodes and a decay value of 1, which resulted in a test set RMSE of 0.113 and an \(R^2\) of 0.555.

# tuning grid
nnetGrid <- expand.grid(
 size = 1:35,
 decay = c(0.01,0.05, 0.1, 0.5, 1, 2)
 )

# model
nnetTuned <- train(
 x = x_train,
 y = y_train,
 method = "nnet", 
 tuneGrid = nnetGrid,
 preProc = c("center", "scale"),
 trControl = trainControl(method = "cv"),
 linout = TRUE,
 trace = FALSE,
 maxit = 500,
 MaxNWts = 35 * (ncol(x_train) + 1) + 35 + 1) # weight cap sized for the largest grid entry (35 hidden nodes)

# Tuning plot
plot(nnetTuned)
# Model results
nnetTuned$bestTune
    size decay
101   17     1
# Test set predictions & metrics
nnetPred <- predict(nnetTuned, newdata = x_test)
postResample(pred = nnetPred, obs = y_test)
      RMSE   Rsquared        MAE 
0.11281243 0.55485083 0.08694197

Taking a look at the variable importance of this nnet model, we can see that the most important variables are Density, Mnf Flow, Bowl Setpoint, and Temperature.

# Variable importance
nnetTuned_final_varImportance <- varImp(nnetTuned)

nnetTuned_final_importance_df <- data.frame(
  Variable = rownames(nnetTuned_final_varImportance$importance),
  Overall  = nnetTuned_final_varImportance$importance$Overall
)

nnetTuned_final_importance_df |>
  ggplot(aes(y = reorder(Variable, Overall), x = Overall)) +
  geom_col() +
  labs(
    title = "Variable Importance",
    x = "Overall Score",
    y = "Variable"
  ) +
  theme_minimal()

In addition to exploring various regularization values, an ensemble neural network method, the average neural net (avNNet), was tuned to account for the highly correlated variables. For the tuning grid, we again tested 1 to 35 nodes and weight decay regularization values of 0.01, 0.05, 0.1, 0.5, 1, and 2, and we included both bootstrap aggregation (bagging) and no bagging. The best tuned model had 15 nodes, a decay of 0.01, and no bagging.

# tuning grid
avNNetGrid <- expand.grid(
 size = 1:35,
 decay = c(0.01,0.05, 0.1, 0.5, 1, 2),
 bag = c(TRUE,FALSE))

# model
avNNetTuned <- train(
 x = x_train,
 y = y_train,
 method = "avNNet", 
 tuneGrid = avNNetGrid,
 preProc = c("center", "scale"),
 trControl = trainControl(method = "cv", allowParallel = TRUE),
 linout = TRUE,
 trace = FALSE,
 maxit = 500,
 MaxNWts = 35 * (ncol(x_train) + 1) + 35 + 1)
# Model results
avNNetTuned$bestTune
    size decay   bag
169   15  0.01 FALSE
# Tuning plot
plot(avNNetTuned)
# Test set predictions & metrics
avNNetPred <- predict(avNNetTuned, newdata = x_test)
postResample(pred = avNNetPred, obs = y_test)
     RMSE  Rsquared       MAE 
0.1143450 0.5677752 0.0824541 

After the best tune was determined, these parameters were used to retrain the model with progressively higher maximum iteration counts until RMSE and \(R^2\) stabilized. The final model shown below (maxit = 1000) had an RMSE of 0.109 and an \(R^2\) of 0.603.

avNNetGrid_final <- expand.grid(
 size = 15,
 decay = 0.01,
 bag = FALSE)

avNNetTuned_final <- train(
 x = x_train,
 y = y_train,
 method = "avNNet", 
 tuneGrid = avNNetGrid_final,
 preProc = c("center", "scale"),
 trControl = trainControl(method = "cv", allowParallel = TRUE),
 linout = TRUE,
 trace = FALSE,
 maxit = 1000,
 MaxNWts = 15 * (ncol(x_train) + 1) + 15 + 1)

avNNetPred_final <- predict(avNNetTuned_final, newdata = x_test)
postResample(pred = avNNetPred_final, obs = y_test)
      RMSE   Rsquared        MAE 
0.10904335 0.60255419 0.07838364 

When examining the variable importance, we can see that Mnf Flow ranks as the most important variable by a large margin, followed by Usage cont and Bowl Setpoint.

# Variable importance
avNNetTuned_final_varImportance <- varImp(avNNetTuned_final)

avNNetTuned_final_importance_df <- data.frame(
  Variable = rownames(avNNetTuned_final_varImportance$importance),
  Overall  = avNNetTuned_final_varImportance$importance$Overall
)

avNNetTuned_final_importance_df |>
  ggplot(aes(y = reorder(Variable, Overall), x = Overall)) +
  geom_col() +
  labs(
    title = "Variable Importance",
    x = "Overall Score",
    y = "Variable"
  ) +
  theme_minimal()

6. Model Selection

From the model summary table above, the best performing models were the XGBoost model and the cross-validated Random Forest model. Both models had a test set \(R^2\) of approximately 0.72, and both had a test set RMSE of approximately 0.089.

Overall, the performance metrics for these two models were very similar, especially on the test set, so model selection took other criteria into account. The Random Forest model has better interpretability, while XGBoost excels in areas such as generalization: it has built-in regularization techniques that help prevent overfitting. Looking at the training metrics, the Random Forest model had a higher \(R^2\) (0.66 vs. 0.64) and a slightly lower RMSE (0.102 vs. 0.104) than the XGBoost model. Since the out-of-sample \(R^2\) was higher than the in-sample \(R^2\) for both models, overfitting does not appear to have been an issue for either.
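
For reference, the short sketch below collects these figures into a single comparison table. The values are transcribed from the results reported in this document; the tibble construction and object name are ours, not part of the original analysis.

library(tibble)
library(knitr)

# Metric values copied from the training and test results reported above
model_comparison <- tribble(
  ~Model,          ~Set,    ~RMSE,  ~Rsquared,
  "Random Forest", "Train", 0.1018, 0.664,
  "Random Forest", "Test",  0.0891, 0.727,
  "XGBoost",       "Train", 0.1040, 0.643,
  "XGBoost",       "Test",  0.0891, 0.722
)

kable(model_comparison, caption = "Random Forest vs. XGBoost Performance")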

When looking at variable importance, the XGBoost and Random Forest models show similar variables of importance, with 6 of the top 10 variables overlapping. Both models also have a similar distribution of importance among their top ten variables: Mnf Flow is the most important variable for both, with an overall score of 100, and after this top variable both models show a steep drop-off in the importance scores of the remaining top variables.
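
The overlap check can be sketched as below. Here xgBoost_importance_df comes from the XGBoost section above, while rf_importance_df is a hypothetical name standing in for the importance data frame built in the earlier Random Forest section.

# Compare the top ten variables of each model
top10_xgb <- xgBoost_importance_df |> arrange(-Overall) |> head(10) |> pull(Variable)
top10_rf  <- rf_importance_df |> arrange(-Overall) |> head(10) |> pull(Variable)

# Variables appearing in both models' top ten
intersect(top10_xgb, top10_rf)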

In summary, the Random Forest and XGBoost models exhibited very similar performance metrics and variable importance distributions. While XGBoost models are less prone to overfitting due to their built-in regularization, comparing the in-sample \(R^2\) and RMSE to their out-of-sample counterparts shows that neither model is overfit. As a result, the Random Forest model was chosen for its better interpretability.

# Retraining selected model on full training set and making final predictions
set.seed(8675309)

train_control <- trainControl(method = "cv", number = 5)
# Define the hyperparameter grid (only mtry, fixed at the value selected earlier)
tune_grid <- expand.grid(
  mtry = 34)


# Train the Random Forest model using cross-validation for mtry
rf_final_model <- train(
  PH ~ .,
  data = imputed_data,
  method = "rf",
  trControl = train_control,
  tuneGrid = tune_grid,
  ntree =  1000,  # Number of trees
  importance = TRUE
)

# Coerce integer-typed columns to numeric so their types match the training data
test %<>% mutate(across(
  .cols = c('Filler.Speed', 'Hyd.Pressure4', 'Bowl.Setpoint', 'Carb.Flow'),
  .fns = as.numeric
  ))

# Create a one hot encoding tibble
dummies_final <- dummyVars( ~ Brand.Code, data = test)

one_hot_df_final <- predict(dummies_final, newdata = test) |>
  as_tibble()

# Add the one hot df to the original and remove the categorical column
one_hot_df_final <- test |>
  cbind(one_hot_df_final) |>
  select(-Brand.Code)

# Store predictions in a data frame so the CSV gets a named PH column
final_pred <- data.frame(PH = predict(rf_final_model, newdata = one_hot_df_final))
write.csv(final_pred, "~/School/data624/final_predictions.csv", row.names = FALSE)

7. Conclusion

This project explored multiple predictive models to address the new regulatory requirement of understanding and predicting pH in the manufacturing process at ABC Beverage. The analysis began with exploratory data analysis to understand variable distributions and relationships, detect outliers, and address missing data. A bagged imputation technique was applied to fill in missing values. Additionally, rows with missing pH values were dropped, and we applied one-hot encoding to the Brand.Code variable to retain its categorical information without introducing imputation risks.

Several models were tested, covering a range of categories: linear models (PLS, Ridge Regression, LASSO, Elastic Net), non-linear models (SVM, Neural Networks), tree-based models (CART, Random Forest, XGBoost), and rule-based hybrid models (Cubist). Each model’s performance was evaluated on RMSE, \(R^2\), and MAE, with a focus on balancing predictive accuracy and interpretability. Among the models, Cubist, Random Forest, and XGBoost demonstrated a strong ability to capture variability. However, the Cubist model, despite achieving an \(R^2\) of 0.66, and XGBoost, with a strong test \(R^2\) of 0.72, were not selected due to their limited interpretability and weaker performance on the training evaluation, making them less suitable for identifying the key drivers of pH variability.

The Random Forest model demonstrated strong performance and was selected as the best model for predicting pH. On the training set, it achieved an \(R^2\) of 0.664 with an RMSE of 0.1018 and an MAE of 0.0735, indicating effective learning of the relationships within the data. On the test set, the model continued to generalize well, achieving an \(R^2\) of 0.727, an RMSE of 0.0891, and an MAE of 0.0647. These results highlight the Random Forest model’s ability to explain variability and produce accurate predictions on unseen data. While other models, such as XGBoost, were considered, Random Forest’s robust performance across both the training and test sets, along with its ability to handle complex, non-linear relationships, made it the preferred choice for this analysis; the balanced test metrics indicate minimal overfitting and strong predictive accuracy. We also used the Random Forest model to identify the variables with the highest importance: Manufacturing Flow Rate (Mnf.Flow), Continuous Usage (Usage.cont), and Oxygen in Filler (Oxygen.Filler) are key contributors. These variables are critical in regulating pH levels during production and ensuring product consistency.

Future improvements could focus on adding more predictors, enhancing feature engineering, and increasing the dataset size with additional observations to better understand the manufacturing process and its predictive factors and to report our pH model accurately. Exploring other interpretable models or improving the current models through advanced tuning and validation techniques could also enhance performance while maintaining transparency. Accurate pH predictions are essential for meeting regulatory standards, improving quality control, and reducing variability, ensuring ABC Beverage consistently delivers high-quality products.

8. Annotated References

  1. Anton Paar Wiki. (n.d.). Carbon Dioxide in Beverages.
    This resource provides an in-depth understanding of the role of carbon dioxide in beverages, including its effect on carbonation, fizziness, and product quality. It directly supports our analysis of variables like Carb Flow and Carb Pressure, which influence carbonation and pH levels.

  2. Omega. (n.d.). What is pH?
    This article explains the concept of pH, its measurement, and its relevance to various industries. It is useful for understanding the chemical principles behind pH variability in beverages and helps contextualize our target variable within the manufacturing process.

  3. Emerson. (n.d.). Training Beverage Process Solutions Guide on De-Aeration.
    This document focuses on de-aeration processes in beverage production, particularly the removal of dissolved oxygen and its impact on carbonation and quality. It directly relates to variables like Oxygen.Filler and Pressure.Vacuum, providing insights into their operational importance.

  4. Jochamp. (n.d.). Carbonated Beverages Manufacturing Process - A Step by Step Guide.
    This guide offers a comprehensive overview of the carbonation process in beverage production, including the role of temperature, pressure, and filling speed. It supports our understanding of variables like Temperature, Filler.Speed, and Carb Pressure, helping us interpret their influence on product quality.