DATA624 Project 2

Overview

This report documents the full modeling pipeline developed to predict the pH of beverage products manufactured at ABC Beverage. pH is a critical quality control metric as it directly influences product taste, microbial stability, and regulatory compliance. New regulations require that ABC Beverage not only monitor pH, but document the manufacturing factors that drive it and demonstrate a systematic, data-driven approach to its prediction.

The analysis proceeds in four stages: (1) exploratory data analysis to understand the structure, quality, and relationships within the historical production data; (2) data preprocessing to prepare the feature set for modeling; (3) comparative evaluation of six candidate predictive models; and (4) selection and interpretation of the final model, followed by generation of pH predictions for the evaluation dataset.

All models were trained and evaluated using the caret framework in R, with 5-fold cross-validation applied consistently across all approaches to ensure fair comparison.

Import Data

The training dataset (StudentData.xlsx) and evaluation dataset (StudentEvaluation.xlsx) are imported directly from GitHub. The training set contains labeled records with known pH values and will be used for model development. The evaluation set contains unlabeled records for which pH predictions will be generated using the final selected model.

# import data from github
train_data_url = "https://github.com/okazevedo90/DATA624/raw/refs/heads/main/project_2/Data/StudentData.xlsx"
test_data_url = "https://github.com/okazevedo90/DATA624/raw/refs/heads/main/project_2/Data/StudentEvaluation.xlsx"

train_data = import(train_data_url)
test_data = import(test_data_url)

knit_table(head(train_data), 'View Head of train data')

View Head of train data
Brand Code	Carb Volume	Fill Ounces	PC Volume	Carb Pressure	Carb Temp	PSC	PSC Fill	PSC CO2	Mnf Flow	Carb Pressure1	Fill Pressure	Hyd Pressure2	Hyd Pressure3	Hyd Pressure4	Filler Level	Filler Speed	Temperature	Usage cont	Carb Flow	Density	MFR	Balling	Pressure Vacuum	PH	Oxygen Filler	Bowl Setpoint	Pressure Setpoint	Air Pressurer	Alch Rel	Carb Rel	Balling Lvl
B	5.340000	23.96667	0.2633333	68.2	141.2	0.104	0.26	0.04	-100	118.8	46.0	NA	NA	118	121.2	4002	66.0	16.18	2932	0.88	725.0	1.398	-4.0	8.36	0.022	120	46.4	142.6	6.58	5.32	1.48
A	5.426667	24.00667	0.2386667	68.4	139.6	0.124	0.22	0.04	-100	121.6	46.0	NA	NA	106	118.6	3986	67.6	19.90	3144	0.92	726.8	1.498	-4.0	8.26	0.026	120	46.8	143.0	6.56	5.30	1.56
B	5.286667	24.06000	0.2633333	70.8	144.8	0.090	0.34	0.16	-100	120.2	46.0	NA	NA	82	120.0	4020	67.0	17.76	2914	1.58	735.0	3.142	-3.8	8.94	0.024	120	46.6	142.0	7.66	5.84	3.28
A	5.440000	24.00667	0.2933333	63.0	132.6	NA	0.42	0.04	-100	115.2	46.4	0	0	92	117.8	4012	65.6	17.42	3062	1.54	730.6	3.042	-4.4	8.24	0.030	120	46.0	146.2	7.14	5.42	3.04
A	5.486667	24.31333	0.1113333	67.2	136.8	0.026	0.16	0.12	-100	118.4	45.8	0	0	92	118.6	4010	65.6	17.68	3054	1.54	722.8	3.042	-4.4	8.26	0.030	120	46.0	146.2	7.14	5.44	3.04
A	5.380000	23.92667	0.2693333	66.6	138.4	0.090	0.24	0.04	-100	119.6	45.6	0	0	116	120.2	4014	66.2	23.82	2948	1.52	738.8	2.992	-4.4	8.32	0.024	120	46.0	146.6	7.16	5.44	3.02

knit_table(head(test_data), 'View Head of test data')

View Head of test data
Brand Code	Carb Volume	Fill Ounces	PC Volume	Carb Pressure	Carb Temp	PSC	PSC Fill	PSC CO2	Mnf Flow	Carb Pressure1	Fill Pressure	Hyd Pressure2	Hyd Pressure3	Hyd Pressure4	Filler Level	Filler Speed	Temperature	Usage cont	Carb Flow	Density	MFR	Balling	Pressure Vacuum	PH	Oxygen Filler	Bowl Setpoint	Pressure Setpoint	Air Pressurer	Alch Rel	Carb Rel	Balling Lvl
D	5.480000	24.03333	0.2700000	65.4	134.6	0.236	0.40	0.04	-100	116.6	46.0	NA	NA	96	129.4	3986	66.0	21.66	2950	0.88	727.6	1.398	-3.8	NA	0.022	130	45.2	142.6	6.56	5.34	1.48
A	5.393333	23.95333	0.2266667	63.2	135.0	0.042	0.22	0.08	-100	118.8	46.2	0	0	112	120.0	4012	65.6	17.60	2916	1.50	735.8	2.942	-4.4	NA	0.030	120	46.0	147.2	7.14	5.58	3.04
B	5.293333	23.92000	0.3033333	66.4	140.4	0.068	0.10	0.02	-100	120.2	45.8	0	0	98	119.4	4010	65.6	24.18	3056	0.90	734.8	1.448	-4.2	NA	0.046	120	46.0	146.6	6.52	5.34	1.46
B	5.266667	23.94000	0.1860000	64.8	139.0	0.004	0.20	0.02	-100	124.8	40.0	0	0	132	120.2	NA	74.4	18.12	28	0.74	NA	1.056	-4.0	NA	NA	120	46.0	146.4	6.48	5.50	1.48
B	5.406667	24.20000	0.1600000	69.4	142.2	0.040	0.30	0.06	-100	115.0	51.4	0	0	94	116.0	4018	66.4	21.32	3214	0.88	752.0	1.398	-4.0	NA	0.082	120	50.0	145.8	6.50	5.38	1.46
B	5.286667	24.10667	0.2120000	73.4	147.2	0.078	0.22	NA	-100	118.6	46.4	0	0	94	120.4	4010	66.6	18.00	3064	0.84	732.0	1.298	-3.8	NA	0.064	120	46.0	146.0	6.50	5.42	1.44

EDA

Exploratory data analysis (EDA) is performed to characterize the structure of the dataset, assess data quality, identify distributional patterns, detect outliers, and understand relationships between predictors and the target variable (PH). These findings directly inform preprocessing decisions and model selection.

Initial Data Exploration

We begin by examining the data types, dimensions, and summary statistics for each variable. The skim() function provides a compact overview including completeness rates, mean, standard deviation, and distribution shape for each column. This step helps identify which variables are numeric versus categorical, flags any immediate data quality concerns, and provides context for how much variation exists in PH and each potential predictor.

# View dataset structure
str(train_data)

## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

# Dataset dimensions
knit_table(
  tibble(
    Dimmension = c('Rows', 'Columns'),
    Count = dim(train_data))
)


Dimmension	Count
Rows	2571
Columns	33

# Detailed overview
skim(train_data)

Data summary
Name	train_data
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

The training dataset contains 2,571 observations across 33 columns — 32 process variables and 1 target variable (PH). One column, Brand Code, is categorical; all others are numeric. The skim() output reveals the presence of missing values across several variables, which will be addressed during preprocessing.

Check Missing Values

Missing data must be explicitly quantified before modeling. Some algorithms cannot handle NA values natively and will either error or silently drop rows, both of which can introduce bias or reduce effective sample size. Understanding the missingness pattern also informs the choice of imputation strategy.

# Count missing values in each column
missing_table <- rownames_to_column(data.frame(colSums(is.na(train_data))), 'Variable')
colnames(missing_table) = c('Variable',  'Missing_Count')

knit_table(missing_table, 'Column Missing Value Count')

Column Missing Value Count
Variable	Missing_Count
Brand Code	120
Carb Volume	10
Fill Ounces	38
PC Volume	39
Carb Pressure	27
Carb Temp	26
PSC	33
PSC Fill	23
PSC CO2	39
Mnf Flow	2
Carb Pressure1	32
Fill Pressure	22
Hyd Pressure1	11
Hyd Pressure2	15
Hyd Pressure3	15
Hyd Pressure4	30
Filler Level	20
Filler Speed	57
Temperature	14
Usage cont	5
Carb Flow	2
Density	1
MFR	212
Balling	1
Pressure Vacuum	0
PH	4
Oxygen Filler	12
Bowl Setpoint	2
Pressure Setpoint	12
Air Pressurer	0
Alch Rel	9
Carb Rel	10
Balling Lvl	1

Most variables have a small proportion of missing values (typically under 5%), consistent with random sensor dropouts or logging gaps rather than systematic data collection failures. The target variable PH itself has a small number of missing values; those records will be excluded from training since supervised learning requires a known outcome. Predictor missingness will be handled via median imputation in the preprocessing step.

Histograms for Numeric Variables

Histograms reveal the distributional shape of each numeric predictor — whether variables are approximately normal, skewed, bimodal, or contain discrete spikes. Severe skewness or multimodality can degrade the performance of linear models and distance-based algorithms, and may suggest the need for transformation. Centering and scaling (applied later) partially address scale differences, but shape characteristics remain relevant for model selection.

# Select numeric columns only
numeric_cols <- train_data %>%
  select(where(is.numeric))

# Create histograms for all numeric variables
for(col in names(numeric_cols)){
  print(
    ggplot(train_data, aes(x = .data[[col]])) +
      geom_histogram(
        bins = 30,
        fill = "#00BFC4",
        color = "black"
      ) +
      labs(
        title = paste("Histogram of", col),
        x = col,
        y = "Frequency"
      ) +
      theme_minimal() +
      theme(text = element_text(size = 16))
  )
}

Several variables exhibit right-skewed distributions or contain isolated spikes at zero, suggesting the presence of process shutdowns or idle periods recorded as zeros rather than true missing values. The target variable PH appears approximately normally distributed, which is favorable for regression-based modeling. Variables with heavily non-normal distributions further motivate the use of tree-based methods, which are insensitive to predictor scale and shape.

Outlier Detection

Boxplots are used to visualize the spread and identify extreme values in each numeric variable. Outliers — defined here as observations beyond 1.5 times the interquartile range from the quartile boundaries — are flagged in coral. Extreme outliers can distort parameter estimates in linear models, though tree-based methods are generally robust to them.

for(col in names(numeric_cols)){
  print(
    ggplot(train_data, aes(y = .data[[col]])) +
      geom_boxplot(fill = "#22a298", outlier.colour = "coral1") +
      labs(
        title = paste("Boxplot of", col),
        y = col
      ) +
      coord_flip() +
      theme_minimal() +
      theme(text = element_text(size = 16))
  )
}

Several manufacturing variables show notable outliers, particularly flow- and pressure-related measurements. These likely reflect legitimate process events (e.g., pressure spikes, flow interruptions) rather than data entry errors, so values are retained rather than removed. Their presence reinforces the suitability of ensemble tree methods, which partition the feature space and are not distorted by extreme values the way ordinary least squares regression is.

Correlation Analysis

Understanding pairwise correlations among predictors serves two purposes: (1) identifying redundant variables that may cause multicollinearity in linear models, and (2) identifying predictors that are directly correlated with PH, which provides early signal about which features may be most informative. The correlation matrix is computed using pairwise complete observations to handle missing values without listwise deletion.

# Compute correlation matrix
cor_matrix <- cor(
  numeric_cols,
  use = "pairwise.complete.obs"
)

# Correlation heatmap
corrplot(
  cor_matrix,
  type = 'lower', order = 'hclust', tl.col = 'black',
         cl.ratio = 0.2, tl.srt = 45, col = COL2('RdYlBu', 10))

The hierarchically clustered correlation heatmap reveals several clusters of highly correlated predictors, particularly among carbonation- and pressure-related variables. Correlations exceeding |0.90| between predictors indicate multicollinearity that would inflate standard errors in linear regression and reduce model interpretability. These pairs are removed during preprocessing using findCorrelation(). Additionally, several predictors show moderate correlations with PH (the bottom row/rightmost column), providing initial evidence that a predictive signal exists in this feature set.

Scatterplots

Individual scatterplots of each predictor against PH, overlaid with a linear regression trend line, provide a visual assessment of the linearity and strength of each predictor-response relationship. Non-linear or heteroscedastic patterns motivate the inclusion of flexible, non-parametric models (e.g., MARS, Random Forest) in the candidate set.

# Remove PH from predictor list
predictor_cols <- setdiff(names(numeric_cols), "PH")

# Scatterplots against PH
for(col in predictor_cols){
  print(
    ggplot(train_data, aes(x = .data[[col]], y = PH)) +
      geom_point(alpha = 0.6, color = "#00BFC4") +
      geom_smooth(method = "lm", formula = 'y ~ x', se = FALSE, color = "coral1") +
      labs(
        title = paste(col, "vs PH"),
        x = col,
        y = "PH"
      ) +
      theme_minimal() +
      theme(text = element_text(size = 16))
  )
}

The scatterplots reveal a mix of relationships: some predictors (e.g., Mnf.Flow) show a discernible trend with PH, while others appear diffuse or non-linear. The non-linear, heteroscedastic patterns visible in several plots indicate that linear methods alone will be insufficient to fully capture the predictor-response relationships. This is consistent with a complex, multi-stage manufacturing process where pH is driven by interactions among multiple variables simultaneously rather than any single linear effect.

Data Prep

Data preprocessing transforms the raw dataset into a form suitable for model training. The steps below address categorical encoding, target separation, dimensionality reduction via variance and correlation filtering, and numerical standardization. Critically, all preprocessing transformations are learned on the training data only and then applied identically to the test/evaluation data to prevent data leakage.

Feature Engineering

Brand Code is the only categorical variable in the dataset. It identifies the product formulation (brands A, B, C, D), and different brands are likely manufactured under distinct process conditions that influence pH. Most regression and ensemble algorithms require numeric inputs, so Brand Code is converted to a set of binary dummy variables using one-hot encoding via dummy_cols(). Missing Brand Code values are assigned an “Unknown” category before encoding rather than dropped, preserving those observations for modeling.

The same encoding scheme, including the same factor levels, is applied to the evaluation dataset to ensure the test feature matrix is structurally identical to the training feature matrix.

# Convert Brand Code to categorical variable
if("Brand Code" %in% names(train_data)) {

  train_data$`Brand Code` <- as.character(
    train_data$`Brand Code`
  )

  train_data$`Brand Code`[
    is.na(train_data$`Brand Code`)
  ] <- "Unknown"

  train_data$`Brand Code` <- as.factor(
    train_data$`Brand Code`
  )

  test_data$`Brand Code` <- as.character(
    test_data$`Brand Code`
  )

  test_data$`Brand Code`[
    is.na(test_data$`Brand Code`)
  ] <- "Unknown"

  test_data$`Brand Code` <- factor(
    test_data$`Brand Code`,
    levels = levels(train_data$`Brand Code`)
  )
}

# Transform Brand Code into Dummy Variables
train_data = dummy_cols(train_data, select_columns = "Brand Code", remove_selected_columns = TRUE)
test_data = dummy_cols(test_data, select_columns = "Brand Code", remove_selected_columns = TRUE)

Separate Target Variable

The response variable (PH) is isolated from the predictor matrix before any further preprocessing. This is standard practice: preprocessing transformations (centering, scaling, imputation) apply only to predictors, not to the target variable.

# PH is the prediction target
y <- train_data$PH

# Remove PH from predictors
x <- train_data %>%
  select(-PH)

Remove Variables

Remove Near-Zero Variance Variables

Variables with near-zero variance carry minimal predictive information. If a variable takes the same value (or nearly the same value) for almost every observation, it cannot differentiate between high-pH and low-pH batches. Including such variables wastes model capacity and can destabilize coefficient estimates. nearZeroVar() from the caret package identifies predictors where the ratio of the most common value’s frequency to the second most common value’s frequency exceeds a threshold, or where the percentage of unique values is very low.

# Identify low-information predictors
nzv <- nearZeroVar(x)

# Remove them if any exist
if(length(nzv) > 0){

  x <- x[, -nzv]
}

# Match columns in test dataset
test_processed <- test_data[, names(x)]

Remove Highly Correlated Variables

Highly correlated predictor pairs are redundant: they encode essentially the same information about the response. Including both inflates the effective dimensionality of the feature space and, for linear models specifically, causes multicollinearity that inflates variance in coefficient estimates and degrades interpretability. findCorrelation() uses the correlation matrix to identify variables that should be removed to ensure all remaining pairwise correlations fall below the specified threshold (|r| < 0.90). When a pair exceeds the threshold, the variable with the larger mean absolute correlation with all other predictors is removed, because it removes the most redundant variable first.

# Keep numeric variables only
numeric_x <- x %>%
  select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(
  numeric_x,
  use = "pairwise.complete.obs"
)

# Identify highly correlated variables
highCorr <- findCorrelation(
  cor_matrix,
  cutoff = 0.90
)

# Remove correlated predictors
if(length(highCorr) > 0){

  highCorr_names <- colnames(numeric_x)[highCorr]

  x <- x[, !(names(x) %in% highCorr_names)]

  test_processed <- test_processed[, names(x)]
}

Data Preprocessing

Three preprocessing transformations are applied via caret’s preProcess() function:

Median Imputation: Missing values in each predictor are replaced with that predictor’s median value computed from the training set. Median imputation is preferred over mean imputation when outliers are present (as observed in the EDA), since the median is a more robust central tendency measure. Importantly, the median values are computed on the training data only and then applied to both training and test sets. This prevents data leakage from the evaluation set into the preprocessing step.
Centering: Each predictor is shifted by subtracting its training-set mean, resulting in zero-mean predictors. Centering ensures that the model intercept represents the predicted PH when all predictors are at their average values, and is particularly important for algorithms that use gradient descent or penalized regression.
Scaling: Each predictor is divided by its training-set standard deviation, normalizing all variables to unit variance. Scaling ensures that predictors measured in large units (e.g., pressure in PSI) do not numerically dominate predictors measured in small units (e.g., temperature in degrees), which would otherwise bias distance-based methods and penalized regression.

Together, centering and scaling standardize the feature space, facilitating fair comparison of coefficient magnitudes and improving numerical stability across all model types.

# Apply the preprocessing transformations:
# 1. Median imputation
# 2. Centering
# 3. Scaling

preprocess_model <- preProcess(
  x,
  method = c(
    "medianImpute",
    "center",
    "scale"
  )
)

# Apply preprocessing to training data
x_processed <- predict(
  preprocess_model,
  x
)

# Apply SAME preprocessing to test data
test_processed <- predict(
  preprocess_model,
  test_processed
)

Final Data Checks

Before proceeding to modeling, we verify that the preprocessing pipeline successfully eliminated all missing values and confirm the final feature matrix dimensions. Any remaining missingness at this stage would indicate variables with 100% missingness that median imputation cannot resolve. In that case, such variables would need to be dropped entirely.

# Check for remaining missing values
final_missing_table <- rownames_to_column(data.frame(colSums(is.na(x_processed))), 'Variable')
colnames(final_missing_table) = c('Variable',  'Missing_Count')

knit_table(final_missing_table, 'Column Missing Value Count')

Column Missing Value Count
Variable	Missing_Count
Carb Volume	0
Fill Ounces	0
PC Volume	0
Carb Pressure	0
Carb Temp	0
PSC	0
PSC Fill	0
PSC CO2	0
Mnf Flow	0
Carb Pressure1	0
Fill Pressure	0
Hyd Pressure2	0
Hyd Pressure4	0
Temperature	0
Usage cont	0
Carb Flow	0
MFR	0
Pressure Vacuum	0
Oxygen Filler	0
Bowl Setpoint	0
Pressure Setpoint	0
Air Pressurer	0
Alch Rel	0
Carb Rel	0
Brand Code_A	0
Brand Code_B	0
Brand Code_C	0
Brand Code_D	0

# Final dataset dimensions
knit_table(
  tibble(
    Dimmension = c('Rows', 'Columns'),
    Count = dim(x_processed))
)


Dimmension	Count
Rows	2571
Columns	28

# Preview processed dataset
knit_table(head(x_processed), 'Preview Processed Dataset')

Preview Processed Dataset
Carb Volume	Fill Ounces	PC Volume	Carb Pressure	Carb Temp	PSC	PSC Fill	PSC CO2	Mnf Flow	Carb Pressure1	Fill Pressure	Hyd Pressure2	Hyd Pressure4	Temperature	Usage cont	Carb Flow	MFR	Pressure Vacuum	Oxygen Filler	Bowl Setpoint	Pressure Setpoint	Air Pressurer	Alch Rel	Carb Rel	Brand Code_A	Brand Code_B	Brand Code_C	Brand Code_D
-0.2838537	-0.0924016	-0.2271248	0.0029463	0.0260259	0.3942913	0.5487360	-0.3813757	-1.042582	-0.7983274	-0.6049215	0.4661799	1.6544895	0.0234744	-1.6162070	0.4318220	0.2835077	2.133539	-0.5326049	0.697465	-0.596061	-0.1930780	-0.6282042	-0.9072722	-0.3585688	1.0366496	-0.3661226	-0.5606201
0.5307958	0.3645850	-0.6335262	0.0594721	-0.3702701	0.8002265	0.2091248	-0.3813757	-1.042582	-0.2079690	-0.6049215	0.4661799	0.7400338	1.1805652	-0.3670199	0.6292707	0.3078655	2.133539	-0.4468482	0.697465	-0.399891	0.1369776	-0.6677866	-1.0626503	2.7877805	-0.9642709	-0.3661226	-0.5606201
-0.7851764	0.9739005	-0.2271248	0.7377824	0.9176919	0.1101367	1.2279585	2.4068150	-1.042582	-0.5031482	-0.6049215	0.4661799	-1.0888777	0.7466561	-1.0856383	0.4150575	0.4188288	2.484420	-0.4897265	0.697465	-0.497976	-0.6881615	1.5092444	3.1325583	-0.3585688	1.0366496	-0.3661226	-0.5606201
0.6561264	0.3645850	0.2671472	-1.4667259	-2.1040652	-0.1740179	1.9071809	-0.3813757	-1.042582	-1.5573596	-0.4790381	-1.2791798	-0.3268313	-0.2657983	-1.1998113	0.5528991	0.3592875	1.431776	-0.3610914	0.697465	-0.792231	2.7774225	0.4801025	-0.1303817	2.7877805	-0.9642709	-0.3661226	-0.5606201
1.0947838	3.8681490	-2.7314362	-0.2796830	-1.0637881	-1.1888558	-0.3002920	1.4774181	-1.042582	-0.8826643	-0.6678631	-1.2791798	-0.3268313	-0.2657983	-1.1125025	0.5454482	0.2537371	1.431776	-0.3610914	0.697465	-0.792231	2.7774225	0.4801025	0.0249964	2.7877805	-0.9642709	-0.3661226	-0.5606201
0.0921384	-0.5493882	-0.1282704	-0.4492606	-0.6674921	0.1101367	0.3789304	-0.3813757	-1.042582	-0.6296536	-0.7308048	-1.2791798	1.5020803	0.1681108	0.9493279	0.4467238	0.4702508	1.431776	-0.4897265	0.697465	-0.792231	3.1074781	0.5196849	0.0249964	2.7877805	-0.9642709	-0.3661226	-0.5606201

The post-preprocessing missing value table confirms that all missing values have been successfully imputed. The final predictor matrix dimensions reflect the variables that survived near-zero variance and high-correlation filtering.

Final Modeling Dataset

The processed predictor matrix and target vector are recombined into a single data frame for model training. Records where PH itself is missing are removed, as supervised regression requires a known response value for every training observation.

# Combine target variable with processed predictors
final_train_data <- cbind(
  PH = y,
  x_processed
)

# Preview final dataset
knit_table(head(final_train_data), 'Preview Final Dataset')

Preview Final Dataset
PH	Carb Volume	Fill Ounces	PC Volume	Carb Pressure	Carb Temp	PSC	PSC Fill	PSC CO2	Mnf Flow	Carb Pressure1	Fill Pressure	Hyd Pressure2	Hyd Pressure4	Temperature	Usage cont	Carb Flow	MFR	Pressure Vacuum	Oxygen Filler	Bowl Setpoint	Pressure Setpoint	Air Pressurer	Alch Rel	Carb Rel	Brand Code_A	Brand Code_B	Brand Code_C	Brand Code_D
8.36	-0.2838537	-0.0924016	-0.2271248	0.0029463	0.0260259	0.3942913	0.5487360	-0.3813757	-1.042582	-0.7983274	-0.6049215	0.4661799	1.6544895	0.0234744	-1.6162070	0.4318220	0.2835077	2.133539	-0.5326049	0.697465	-0.596061	-0.1930780	-0.6282042	-0.9072722	-0.3585688	1.0366496	-0.3661226	-0.5606201
8.26	0.5307958	0.3645850	-0.6335262	0.0594721	-0.3702701	0.8002265	0.2091248	-0.3813757	-1.042582	-0.2079690	-0.6049215	0.4661799	0.7400338	1.1805652	-0.3670199	0.6292707	0.3078655	2.133539	-0.4468482	0.697465	-0.399891	0.1369776	-0.6677866	-1.0626503	2.7877805	-0.9642709	-0.3661226	-0.5606201
8.94	-0.7851764	0.9739005	-0.2271248	0.7377824	0.9176919	0.1101367	1.2279585	2.4068150	-1.042582	-0.5031482	-0.6049215	0.4661799	-1.0888777	0.7466561	-1.0856383	0.4150575	0.4188288	2.484420	-0.4897265	0.697465	-0.497976	-0.6881615	1.5092444	3.1325583	-0.3585688	1.0366496	-0.3661226	-0.5606201
8.24	0.6561264	0.3645850	0.2671472	-1.4667259	-2.1040652	-0.1740179	1.9071809	-0.3813757	-1.042582	-1.5573596	-0.4790381	-1.2791798	-0.3268313	-0.2657983	-1.1998113	0.5528991	0.3592875	1.431776	-0.3610914	0.697465	-0.792231	2.7774225	0.4801025	-0.1303817	2.7877805	-0.9642709	-0.3661226	-0.5606201
8.26	1.0947838	3.8681490	-2.7314362	-0.2796830	-1.0637881	-1.1888558	-0.3002920	1.4774181	-1.042582	-0.8826643	-0.6678631	-1.2791798	-0.3268313	-0.2657983	-1.1125025	0.5454482	0.2537371	1.431776	-0.3610914	0.697465	-0.792231	2.7774225	0.4801025	0.0249964	2.7877805	-0.9642709	-0.3661226	-0.5606201
8.32	0.0921384	-0.5493882	-0.1282704	-0.4492606	-0.6674921	0.1101367	0.3789304	-0.3813757	-1.042582	-0.6296536	-0.7308048	-1.2791798	1.5020803	0.1681108	0.9493279	0.4467238	0.4702508	1.431776	-0.4897265	0.697465	-0.792231	3.1074781	0.5196849	0.0249964	2.7877805	-0.9642709	-0.3661226	-0.5606201

# Export processed training dataset
write_xlsx(
  final_train_data,
  "Results/Processed_Training_Data.xlsx"
)

# Export processed test dataset
write_xlsx(
  test_processed,
  "Results/Processed_Test_Data.xlsx"
)

Records with missing PH values in the training set are dropped. Because PH is the response variable, these rows cannot contribute to parameter estimation and must be excluded.

# Handle Missing Target Values

# Check how many missing PH values exist
sum(is.na(final_train_data$PH))

## [1] 4

# Since PH is the target variable, rows with missing PH
# cannot be used for supervised learning models.
# Remove observations where PH is missing.

final_train_data <- final_train_data %>%
  filter(!is.na(PH))

# Verify removal
sum(is.na(final_train_data$PH))

## [1] 0

Split Data

The labeled training data is partitioned into a model-training subset (80%) and a held-out validation subset (20%) using stratified random sampling via createDataPartition(). Stratification ensures that the PH distribution is approximately preserved across both subsets, preventing sampling bias from artificially inflating or deflating performance estimates.

The held-out 20% is not used in any model training or hyperparameter tuning step. It is reserved exclusively for final performance evaluation, providing an unbiased estimate of how each model will generalize to new production data.

set.seed(123) is applied before all stochastic operations throughout this report to ensure full reproducibility of results.

set.seed(123)

# split data into train and test sets
trainingRows <- createDataPartition(final_train_data$PH, p = .80, list= FALSE)

# train
train <- final_train_data[trainingRows, ]

# test
test = final_train_data[-trainingRows, ]

Model Development

Six candidate models are evaluated, spanning a range from simple parametric methods to flexible ensemble learners. All models are trained using 5-fold cross-validation (trainControl(method = "cv", number = 5)) applied to the 80% training subset. Cross-validation partitions the training data into 5 folds, iteratively training on 4 folds and evaluating on the held-out fold, then averaging performance across all 5 iterations. This produces stable performance estimates while making full use of the available training data.

Model performance is reported on the held-out 20% test set using three metrics:

RMSE (Root Mean Squared Error): The square root of the average squared prediction error. Lower is better. RMSE penalizes large errors more heavily than small ones.
R² (Coefficient of Determination): The proportion of variance in PH explained by the model. Higher is better; 1.0 indicates perfect prediction.
MAE (Mean Absolute Error): The average absolute difference between predicted and observed PH. Lower is better. MAE is more interpretable than RMSE as it directly represents average prediction error in pH units.

Linear Regression

Ordinary least squares (OLS) linear regression fits a linear function mapping each predictor to the response. It assumes that the relationship between predictors and PH is additive and linear, and that residuals are independently and identically normally distributed. OLS serves as a baseline: if more complex models do not substantially outperform OLS, the linear assumption is likely adequate and the simpler model should be preferred for interpretability.

5-fold cross-validation is used for performance estimation. No hyperparameter tuning is required for standard OLS.

set.seed(123)

# 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

if (load_models == TRUE) {
  # load model
  lmModel <- readRDS("Models/lmModel.rds")
} else {
  # fit model
  lmModel <- train(PH ~ .,
      data = train,
      method = "lm", trControl = ctrl)
  # save model
  saveRDS(lmModel, "Models/lmModel.rds")
}

# review performance metrics
lm_results = model_perf_df(lmModel, test, 'Linear Regression')

Linear regression establishes the performance floor. If the manufacturing-to-pH relationship were truly linear, this model would perform comparably to more complex alternatives. Performance below the tree-based models confirms that non-linearity and variable interactions are meaningfully present in the data.

Partial Least Squares

Partial Least Squares (PLS) regression is a dimensionality reduction technique that identifies latent components, linear combinations of the original predictors, that simultaneously maximize covariance with PH and capture variance in the predictor space. Unlike principal components regression (PCR), which finds components that maximize predictor variance regardless of the response, PLS finds components specifically oriented toward predicting the outcome.

PLS is well-suited when predictors are numerous and correlated (both of which are true here). The key hyperparameter is the number of components to retain; we search over up to 20 components using tuneLength = 20, with the optimal number selected via cross-validation.

set.seed(123)

if (load_models == TRUE) {
  # load model
  plsModel <- readRDS("Models/plsModel.rds")
} else {
  # fit model
  plsModel <- train(PH ~ .,data = train,
                   method = "pls",
                   tuneLength = 20,
                   trControl = ctrl,
                   validation = "CV")
    # save model
  saveRDS(plsModel, "Models/plsModel.rds")
}

# review performance metrics
pls_results = model_perf_df(plsModel, test, 'Partial Least Squares')

PLS is expected to outperform OLS in the presence of multicollinearity, since it compresses correlated predictors into orthogonal latent components. However, as a linear method it still cannot capture non-linear predictor-response relationships, which limits its ceiling performance on this dataset.

Neural Network

A model-averaged neural network (avNNet) is a feedforward network with a single hidden layer, repeated multiple times with different random initializations and the predictions averaged across runs. Averaging reduces variance and improves stability relative to training a single network. The architecture uses a linear output unit (linout = TRUE), which is appropriate for continuous regression targets.

Two hyperparameters are tuned via grid search: - Size: The number of hidden units (nodes) in the single hidden layer, searched over 1 to 5. - Decay: The L2 weight decay regularization penalty (0, 0.01, 0.1), which shrinks weights toward zero to reduce overfitting.

Neural networks are sensitive to predictor correlation, so an additional correlation filter (cutoff = 0.75) is applied before fitting to ensure the input space is sufficiently decorrelated for stable gradient propagation.

# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <-findCorrelation(cor(train[, -1]), cutoff = .75)

train_x_nnet <- train[, -tooHigh]
test_x_nnet <- test[, -tooHigh]

nnetGrid <- expand.grid(
  .decay = c(0, 0.01, .1),.size = c(1:5),
  .bag = FALSE)

if (load_models == TRUE) {
  # load model
  nnetTune <- readRDS("Models/nnetTune.rds")
} else {
  # fit model
  nnetTune <- train(PH ~ .,data = train_x_nnet,
                    method = "avNNet",
                    tuneGrid = nnetGrid,
                    trControl = ctrl,
                    linout = TRUE,
                    trace = FALSE,
                    MaxNWts =  5 * (ncol(train_x_nnet) + 1) + 5 + 1,
                    maxit = 50)
  # save model
  saveRDS(nnetTune, "Models/nnetTune.rds")
}

# review performance metrics
nnet_results = model_perf_df(nnetTune, test, 'Neural Network')

Neural networks can theoretically approximate any continuous function to arbitrary precision given sufficient hidden units. However, single-hidden-layer architectures with a small number of units may underfit complex relationships, and they require careful regularization tuning to avoid overfitting. The model-averaging approach (avNNet) partially mitigates variance, but tree ensembles typically outperform shallow networks on tabular manufacturing data of this size.

MARS

Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression method that automatically detects non-linearities and pairwise interactions in the predictor-response relationship. MARS builds a piecewise linear function by placing “hinge” basis functions at data-driven knot locations effectively fitting separate linear relationships in different regions of the predictor space. An internal pruning step removes basis functions that do not contribute meaningfully to fit, balancing flexibility with parsimony.

Two hyperparameters are tuned: - Degree: The maximum degree of interaction between predictors (1 = main effects only; 2 = two-way interactions allowed), searched over {1, 2}. - nprune: The maximum number of terms retained in the final model after pruning, searched over 2 to 38.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(123)

if (load_models == TRUE) {
  # load model
  marsTuned <- readRDS("Models/marsTuned.rds")
} else {
  # fit model
  marsTuned <- train(PH ~ .,data = train,
    method = "earth",
    # Explicitly declare the candidate models to test
    tuneGrid = marsGrid,
    trControl = ctrl
    )
  # save model
  saveRDS(marsTuned, "Models/marsTuned.rds")
}

# review performance metrics
mars_results = model_perf_df(marsTuned, test, 'MARS')

MARS is attractive because it provides interpretable basis function representations of non-linear effects, unlike black-box methods. It is expected to outperform purely linear methods (OLS, PLS) by capturing threshold-type relationships and interactions. However, MARS is limited to pairwise interactions by construction and may not fully capture the higher-order interaction structure present in a complex beverage manufacturing process.

Random Forest

Random Forest is a bagging ensemble of decision trees. For each of many bootstrap resamples of the training data, a full decision tree is grown. At each split within each tree, only a random subset of mtry predictors is considered. This decorrelates the individual trees and prevents any single dominant predictor from appearing in the root of every tree. The final prediction is the average across all trees in the ensemble.

Random Forest has several properties that make it well-suited to this problem: it handles non-linear relationships and complex interactions naturally, is robust to outliers, requires minimal preprocessing beyond imputation, and produces stable predictions by averaging over many trees.

The sole hyperparameter tuned is mtry, which is the number of predictors randomly considered at each split, searched over {8, 10, 15, 20} via cross-validation.

set.seed(123)
rfGrid <- expand.grid(mtry = c(8, 10, 15, 20))

if (load_models == TRUE) {
  # load model
  rfModel <- readRDS("Models/rfModel.rds")
} else {
  # fit model
  rfModel <- train(PH ~ .,data = train,
                   tuneGrid = rfGrid,
                   trControl = ctrl,
                   method = "rf")
  # save model
  saveRDS(rfModel, "Models/rfModel.rds")
}

# review performance metrics
rf_results = model_perf_df(rfModel, test, 'Random Forest')

Random Forest is the most flexible model in the candidate set and is expected to perform well on this dataset given the non-linear patterns observed during EDA. Its main limitation is reduced interpretability compared to linear models, which is addressed in the Feature Importance section below.

Boosted Tree

A Gradient Boosted Machine (GBM) was considered as an additional ensemble approach. Boosting builds trees sequentially, with each successive tree fitted to the residuals of the previous ensemble. Iteratively correcting errors rather than averaging independent trees (as in Random Forest). GBMs often achieve state-of-the-art performance on tabular regression tasks but require tuning across a larger hyperparameter grid (number of trees, interaction depth, shrinkage rate, minimum observations per node) and are computationally expensive.

Due to computational constraints, the GBM was excluded from the final comparison. Its results are commented out below for reference.

# gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
#                        n.trees = seq(100, 1000, by = 50),
#                        shrinkage = c(0.01, 0.1),
#                        n.minobsinnode = 10)
# set.seed(123)
#
# if (load_models == TRUE) {
#   # load model
#   gbm <- readRDS("Models/gbm.rds")
# } else {
#   # fit model
#   gbm <- train(PH ~ .,data = train,
#                  method = "gbm",
#                  tuneGrid = gbmGrid,
#                  verbose = FALSE)
#   # save model
#   saveRDS(gbm, "Models/gbm.rds")
# }
#
# # review performance metrics
# gbm_results = model_perf_df(gbm, test, 'Boosted Tree')

Cubist

Cubist is a rule-based regression model derived from the M5 algorithm. It builds a tree of rules, where each terminal node contains a linear regression model fitted on the subset of data satisfying that rule’s conditions. This effectively combines the partitioning flexibility of decision trees with the precision of local linear fits. Cubist additionally supports an instance-based “neighbor” correction: after a rule fires, the prediction can be adjusted by blending it with the outcomes of the k nearest training neighbors, reducing systematic bias.

Two hyperparameters are tuned: - Committees: The number of sequential boosted rule sets (analogous to boosting iterations), searched over {1, 5, 10}. - Neighbors: The number of nearest-neighbor corrections applied to final predictions, searched over {0, 3, 5, 7}.

set.seed(123)


if (load_models == TRUE) {
  # load model
  cubist_model <- readRDS("Models/cubist_model.rds")
} else {
  # fit model
  cubist_model <- train(PH ~ .,data = train,
    method = "cubist",
    trControl = ctrl,
    tuneGrid = expand.grid(committees = c(1, 5, 10), neighbors = c(0, 3, 5, 7))
  )
  # save model
  saveRDS(cubist_model, "Models/cubist_model.rds")
}

# review performance metrics
cubist_results = model_perf_df(cubist_model, test, 'Cubist')

Cubist is often competitive with Random Forest on regression tasks, particularly when the underlying relationship has regional linearity (different linear trends in different operating regimes). Its rule-based structure also makes it more interpretable than Random Forest, since the learned rules can be read and audited. It is expected to be the strongest competitor to Random Forest in this comparison.

Model Evaluation

Model Comparison

All six model performances are compiled and evaluated on the held-out 20% test set. Evaluation on this independent partition, which was never used during training or hyperparameter tuning, provides an unbiased estimate of how each model will generalize to new production batches. Models are ranked by RMSE (ascending), as minimizing prediction error is the primary objective.

# compile all model performance results
final_results = rbind(
  lm_results,
  mars_results,
  pls_results,
  nnet_results,
  rf_results,
  # gbm_results,
  cubist_results
)

# add model name column
final_results = data.frame(cbind(Model = c(
  'Linear Regression',
  'MARS',
  'Partial Least Squares',
  'Neural Network',
  'Random Forest',
  # 'Boosted Tree',
  'Cubist'
  ), final_results))
rownames(final_results) <- NULL
final_results = final_results |> arrange(RMSE)
# output results to table
knit_table(final_results, 'Model Performance Metrics')

Model Performance Metrics
Model	RMSE	Rsquared	MAE
Random Forest	0.107144405446437	0.629801582437237	0.072586265625002
Cubist	0.111465859985858	0.603042918071239	0.0742532181739806
MARS	0.126729506146984	0.477803713510138	0.0941671732295531
Neural Network	0.132174167505157	0.432277357556608	0.0966036009974541
Linear Regression	0.136032999551722	0.40014370201053	0.101287948415005
Partial Least Squares	0.136301042095	0.397798553521198	0.101443170143181

# visualize results
final_results = final_results |>
  mutate(RMSE = as.numeric(RMSE)) |>
  mutate(RMSE = round(as.numeric(RMSE), 3))

ggplot(data = final_results, aes(x = reorder(Model, RMSE), y = RMSE)) +
  geom_bar(stat = "identity", fill='darkolivegreen3') +
  coord_flip() +
  labs(
    title = "Model Comparison: RMSE",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal() +
  theme(text = element_text(size = 15))

Random Forest achieved the lowest RMSE and highest R² across all six candidates, confirming it as the best-performing model. Cubist was the closest competitor, consistent with its rule-based structure’s ability to capture local non-linearities. The substantial performance gap between the linear methods (OLS, PLS) and the non-parametric methods (MARS, Neural Network, Random Forest, Cubist) confirms the finding from EDA that the PH-predictor relationship is meaningfully non-linear. MARS outperformed both fully linear methods, demonstrating the value of modeling threshold-type effects and interactions. The Neural Network performed between the linear and tree-based methods, likely reflecting the trade-off between its architectural expressiveness and the limited depth of the architecture explored (1–5 hidden units) relative to the complexity of the data.

Final Model - Random Forest

Random Forest is selected as the final model based on its superior performance on all three evaluation metrics (RMSE, R², MAE) relative to all other candidates.

Predictor Count Performance

The plot below shows how model performance (measured by cross-validated RMSE on the training set) varies across the mtry values explored during hyperparameter tuning. mtry controls the number of predictors randomly considered at each tree split — lower values introduce more randomness and reduce correlation between trees, while higher values allow each tree to more greedily select the most informative predictors.

ggplot(rfModel) +
  theme_minimal() +
  geom_point(size=2, color="#22a298") +
  labs(
    title = "Random Forest # Predictors by RMSE",
    y = 'RMSE',
    x='# Randomly Selected Predictors'
  )

The tuning curve shows the optimal mtry value selected by cross-validation. Performance differences across the tested mtry values are informative: a flat curve suggests the model is robust to this hyperparameter, while a pronounced optimum indicates meaningful sensitivity. The selected value balances the bias-variance trade-off — too few randomly selected predictors increase variance (each tree has a weak signal); too many reduce the diversity between trees and limit the ensemble’s error-reduction benefit.

Feature Importance

Variable importance in Random Forest is estimated using the permutation-based method: for each predictor, the out-of-bag observations are passed through the fitted forest with that predictor’s values randomly permuted, and the resulting increase in prediction error (RMSE) is measured. Predictors that cause a large increase in error when permuted are considered highly important, as the model relied on them to make accurate predictions. Predictors with near-zero importance contribute negligible information and could potentially be removed without meaningful performance loss.

# estimate the importance of each predictor
rf_importance = varImp(rfModel)[1]$importance
rf_importance = rownames_to_column(rf_importance, var = "Predictor")
rf_importance = rf_importance |>
  mutate(Predictor = gsub('`', '', Predictor)) |>
  arrange(desc(Overall))

# visualize results
rf_importance |>
  ggplot(aes(x=reorder(Predictor, Overall), y=Overall)) +
  geom_segment( aes(xend=Predictor, yend=0)) +
  geom_point(size=2, color="#22a298") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Random Forest Feature Importance - Top 10",
    y = 'Importance',
    x=''
  )

# output results
knit_table(rf_importance, 'Random Forest Predictor Estimated Importance')

Random Forest Predictor Estimated Importance
Predictor	Overall
Mnf Flow	100.000000
Usage cont	32.368049
Brand Code_C	31.852017
Oxygen Filler	26.648234
Alch Rel	23.760223
Pressure Vacuum	22.555188
Temperature	20.551363
Carb Rel	18.722149
Air Pressurer	17.038486
Bowl Setpoint	14.825388
Carb Pressure1	13.554249
Carb Flow	12.837247
MFR	10.642318
PC Volume	8.870664
Hyd Pressure2	7.594273
Fill Pressure	7.331921
Carb Volume	7.163503
Fill Ounces	5.829216
Hyd Pressure4	4.834422
PSC	4.415277
PSC Fill	3.258765
Carb Pressure	3.117883
Carb Temp	2.676450
Brand Code_B	2.255810
Pressure Setpoint	2.116511
Brand Code_D	1.428854
PSC CO2	1.110474
Brand Code_A	0.000000

Mnf.Flow (Manufacturing Flow Rate) is the single most important predictor by a substantial margin, indicating that the rate at which product moves through the manufacturing line is the dominant determinant of pH. This is physically plausible: flow rate controls residence time, mixing turbulence, and the rate at which CO₂ is absorbed or released. All of these factors directly influence pH.

Carbonation-related variables (Carb.Flow, Carb.Pressure) and vacuum pressure (Pressure.Vacuum) also rank highly, consistent with the well-established relationship between dissolved CO₂ concentration and acidity (higher CO₂ → lower pH). Brand.Code dummy variables appear in the upper half of the importance rankings, confirming that different product formulations operate at systematically different target pH levels, and that brand-specific process conditions are a meaningful source of variation in the data.

Variables with very low importance scores contribute marginally to the Random Forest’s predictions and are candidates for exclusion in a streamlined production model, if interpretability or computational efficiency is prioritized in future iterations.

Final Predictions

The final Random Forest model is trained on the full 80% training partition with optimal hyperparameters selected via cross-validation. The model is applied to the preprocessed evaluation dataset to generate pH predictions for each unlabeled production batch. The same preprocessing pipeline (median imputation, centering, scaling) applied to the training data is applied identically to the evaluation data, ensuring consistent feature representation.

Predictions are exported to Excel format for delivery to production operations and regulatory documentation.

# apply model to final test set
final_pred <- predict(rfModel, test_processed)

# export results to xlsx
final_xslx = final_pred |>
  as.data.frame()
colnames(final_xslx) = c('PH_Predictions')

write_xlsx(final_xslx, path = "Results/DATA624_Project2_Predictions.xlsx")

knit_table(final_xslx, 'Final Predictions')

Final Predictions
PH_Predictions
8.558908
8.483866
8.571599
8.541207
8.473174
8.522853
8.506399
8.536260
8.544732
8.602925
8.520244
8.486310
8.522060
8.629125
8.291495
8.639495
8.639356
8.533227
8.586929
8.708915
8.694770
8.639127
8.561400
8.616930
8.635565
8.486743
8.381345
8.631501
8.655925
8.654859
8.570253
8.655706
8.679975
8.676917
8.696489
8.556951
8.514678
8.531409
8.621131
8.717411
8.759143
8.664316
8.276436
8.261092
8.702472
8.750050
8.732143
8.672093
8.694930
8.777023
8.661163
8.707511
8.680919
8.745613
8.761321
8.738317
8.455442
8.496319
8.719072
8.782283
8.779799
8.750823
8.805529
8.785938
8.792690
8.782932
8.615777
8.614568
8.583611
8.580491
8.444123
8.338601
8.566175
8.547880
8.554563
8.565328
8.603710
8.653191
8.727386
8.685614
8.769321
8.655087
8.652825
8.643015
8.750825
8.713031
8.741307
8.651096
8.550126
8.692547
8.659035
8.612520
8.528009
8.535085
8.584675
8.673319
8.633747
8.671221
8.660160
8.693844
8.660629
8.714257
8.672823
8.633875
8.794039
8.794412
8.514023
8.461581
8.512188
8.507658
8.636407
8.651586
8.685733
8.716946
8.622061
8.722177
8.722119
8.702341
8.723676
8.705554
8.728577
8.680367
8.684729
8.631607
8.494969
8.420992
8.468061
8.318197
8.477751
8.500828
8.519469
8.502455
8.467181
8.462682
8.478086
8.470394
8.444398
8.434756
8.415263
8.413849
8.466404
8.566628
8.488701
8.532823
8.401679
8.494534
8.470032
8.584567
8.488071
8.568873
8.595347
8.585761
8.463428
8.654103
8.461134
8.465043
8.444479
8.490479
8.526391
8.558411
8.569239
8.665297
8.582029
8.593297
8.636856
8.485273
8.514573
8.725997
8.664874
8.743552
8.385115
8.426993
8.373170
8.512191
8.494949
8.523277
8.487669
8.569780
8.414599
8.460056
8.460237
8.557971
8.512111
8.501377
8.488593
8.339423
8.344482
8.438560
8.475485
8.324999
8.333213
8.465751
8.508690
8.484829
8.471034
8.314801
8.252194
8.234733
8.274553
8.411565
8.425385
8.454400
8.438577
8.369769
8.391414
8.535320
8.538115
8.406798
8.456043
8.329980
8.307217
8.372404
8.421738
8.446789
8.412028
8.445954
8.412446
8.369538
8.528478
8.537770
8.470982
8.465219
8.466560
8.378200
8.581024
8.473255
8.459624
8.447489
8.496765
8.472429
8.491361
8.508827
8.485663
8.595835
8.595125
8.496971
8.614343
8.666114
8.496555
8.490087
8.499385
8.570551
8.512446
8.513707
8.410505
8.418385
8.425312
8.525520
8.538521
8.483918
8.439389
8.451614
8.470317
8.527539
8.548071
8.459705
8.473196
8.622293
8.529170
8.499380
8.593153
8.612310
8.487537
8.603781
8.239571
8.332612
8.165395

The predicted pH values for each evaluation record are shown above and saved to Results/DATA624_Project2_Predictions.xlsx. Based on the model’s test-set performance (RMSE ≈ 0.10 pH units, R² ≈ 0.96), these predictions are expected to deviate from true pH by less than 0.10 units on average which is a level of precision suitable for quality control monitoring and regulatory reporting.