ABC Beverage — Manufacturing Process Predictive Modeling for pH Optimization

DATA 624: Predictive Analytics — Technical Report

Author

Arutam Antunish, Domllermut C. Alamo, Tanzil Ehsan and Mohammad Zahid

Published

May 24, 2026

1 Executive Introduction & Project Context

In response to new regulatory directives, this technical investigation establishes a robust predictive modeling framework to understand, quantify, and forecast the factors influencing pH levels within the ABC Beverage manufacturing process. Maintaining tight control over pH is a vital operational benchmark for product quality, chemical stability, and regulatory compliance.

This report documents a comprehensive data science workflow, transforming noisy, raw production telemetry into a high-performance predictive system. By leveraging advanced preprocessing methods and diverse machine learning algorithms via the caret ecosystem, we build an optimized framework tailored to the specific dynamics of our manufacturing line.

library(tidyverse)
library(caret)
library(corrplot)
library(RColorBrewer)
library(doParallel)
library(knitr)
library(kableExtra)
library(bonsai)
library(xgboost)
library(tidymodels)

2 Parallel Computation Setup

To reduce wall-clock training time across the candidate models, we initialize a PSOCK cluster using all available cores minus one, keeping one core free for OS processes. The cluster is registered with caret’s backend and will be explicitly released after model training completes.

n_cores <- max(1L, parallel::detectCores() - 1L)
cl      <- makePSOCKcluster(n_cores)
registerDoParallel(cl)

cat(sprintf("Parallel backend active: %d worker cores\n", n_cores))
Parallel backend active: 11 worker cores

3 Comprehensive Exploratory Data Analysis (EDA)

3.1 Data Loading via GitHub Remote Repository

To facilitate seamless group collaboration and ensure reproducibility across all team members’ environments, the historical training telemetry and the final evaluation sets are loaded directly via GitHub raw URLs.

train_url <- paste0(
  "https://raw.githubusercontent.com/arutam-antunish/",
  "Data_624_Final-Project/refs/heads/main/StudentData.csv"
)
evaluation_url <- paste0(
  "https://raw.githubusercontent.com/arutam-antunish/",
  "Data_624_Final-Project/refs/heads/main/StudentEvaluation.csv"
)

train_data      <- read_csv(train_url)
evaluation_data <- read_csv(evaluation_url)

# Sanitize column names for R syntax compatibility
colnames(train_data)      <- make.names(colnames(train_data))
colnames(evaluation_data) <- make.names(colnames(evaluation_data))

cat(sprintf("Training Set   : %d rows x %d columns\n",
            nrow(train_data), ncol(train_data)))
Training Set   : 2571 rows x 33 columns
cat(sprintf("Evaluation Set : %d rows x %d columns\n",
            nrow(evaluation_data), ncol(evaluation_data)))
Evaluation Set : 267 rows x 33 columns

3.2 Target Variable Diagnostic: pH Distribution

Our target parameter is pH. Evaluating its structural distribution, baseline variance, and central tendencies is crucial before implementing feature transformations.

summary_ph <- train_data %>%
  summarise(
    Mean          = mean(PH,    na.rm = TRUE),
    Median        = median(PH,  na.rm = TRUE),
    SD            = sd(PH,      na.rm = TRUE),
    Min           = min(PH,     na.rm = TRUE),
    Max           = max(PH,     na.rm = TRUE),
    Missing_Count = sum(is.na(PH))
  )

kable(summary_ph, caption = "Descriptive Statistics - Manufacturing pH",
      digits = 4)
Descriptive Statistics - Manufacturing pH
Mean Median SD Min Max Missing_Count
8.5456 8.54 0.1725 7.88 9.36 4
ggplot(train_data, aes(x = PH)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins  = 40,
    fill  = "#2c3e50",
    color = "white",
    alpha = 0.8
  ) +
  geom_density(color = "#e74c3c", linewidth = 1.2) +
  labs(
    title = "Distribution Profiling of Manufacturing pH",
    x     = "pH Level",
    y     = "Density"
  )

Distribution of the target variable pH across the training set.

3.3 Feature Space Deficiency: Missing Value Analysis

We calculate the exact missing percentage across all telemetry features to properly inform our multivariate imputation layer.

missing_pct <- colMeans(is.na(train_data)) * 100
missing_df  <- data.frame(
  Variable           = names(missing_pct),
  Missing_Percentage = missing_pct
) %>%
  filter(Missing_Percentage > 0) %>%
  arrange(desc(Missing_Percentage))

ggplot(
  missing_df,
  aes(x = reorder(Variable, Missing_Percentage), y = Missing_Percentage)
) +
  geom_bar(stat = "identity", fill = "#34495e", width = 0.7) +
  coord_flip() +
  labs(
    title = "Predictor Space - Missing Data Matrix",
    x     = "Process Features",
    y     = "Percentage Missing (%)"
  )

Missing data percentage by predictor. MFR and Brand.Code show the highest deficiency rates.

3.4 Predictor Correlation Structure

With 31 manufacturing features, multicollinearity is expected. The correlation heatmap below motivates both the PLS model choice (which handles collinearity by design) and the imputation order in our preprocessing pipeline.

numeric_only <- train_data %>%
  select(where(is.numeric)) %>%
  select(-PH)

cor_matrix <- cor(numeric_only, use = "pairwise.complete.obs")

corrplot(
  cor_matrix,
  method  = "color",
  type    = "upper",
  order   = "hclust",
  col     = colorRampPalette(c("#e74c3c", "white", "#2c3e50"))(200),
  tl.cex  = 0.65,
  tl.col  = "black",
  addCoef.col = NULL,
  diag    = FALSE,
  title   = "Predictor Correlation Matrix (Training Set)",
  mar     = c(0, 0, 1.5, 0)
)

Pearson correlation matrix across all numeric predictors. Darker blue/red cells indicate strong positive/negative collinearity.

4 Advanced Preprocessing Pipeline

4.1 Target Isolation, Categorical Cleaning & Empirical Outlier Capping

We remove records where the target variable pH is missing or recorded as an impossible zero. Because Brand.Code is a categorical factor that cannot be processed by knnImpute, rows with missing Brand.Code are excluded from training. Before imputing, we verify the actual majority class so any later evaluation-set substitution is grounded in data.

# Verify majority class BEFORE deciding imputation defaults
brand_table <- sort(table(train_data$Brand.Code), decreasing = TRUE)
cat("Brand.Code distribution in training set:\n")
Brand.Code distribution in training set:
print(brand_table)

   B    D    C    A 
1239  615  304  293 
majority_class <- names(brand_table)[1]
cat(sprintf("\nMajority class: '%s' - used as fallback for evaluation NAs\n",
            majority_class))

Majority class: 'B' - used as fallback for evaluation NAs
train_clean <- train_data %>%
  filter(!is.na(PH) & PH > 0) %>%
  filter(!is.na(Brand.Code))

target   <- train_clean$PH
features <- train_clean %>% select(-PH)

features$Brand.Code        <- as.factor(features$Brand.Code)
evaluation_data$Brand.Code <- as.factor(evaluation_data$Brand.Code)

numeric_features     <- features %>% select(where(is.numeric))
categorical_features <- features %>% select(where(is.factor))

cap_outliers <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)
  x[x < q[1]] <- q[1]
  x[x > q[2]] <- q[2]
  return(x)
}

numeric_capped <- as.data.frame(lapply(numeric_features, cap_outliers))

cat(sprintf("Clean training rows after filtering: %d\n", nrow(train_clean)))
Clean training rows after filtering: 2447

4.2 Near-Zero Variance Screening, Structural Imputation, and Box-Cox Transform

Adhering strictly to the textbook workflow (Kuhn & Johnson, Ch. 3), we employ a unified caret::preProcess module. This pipeline scans and drops near-zero variance features, uses K-Nearest Neighbors imputation to replace missing numeric values, applies Box-Cox transformations to correct skewness, then centers and scales.

prep_parameters <- preProcess(
  numeric_capped,
  method = c("nzv", "knnImpute", "BoxCox", "center", "scale")
)

numeric_transformed <- predict(prep_parameters, numeric_capped)

final_features  <- cbind(categorical_features, numeric_transformed)
final_train_set <- cbind(final_features, PH = target)

print(prep_parameters)
Created from 2038 samples and 31 variables

Pre-processing:
  - Box-Cox transformation (25)
  - centered (30)
  - ignored (0)
  - 5 nearest neighbor imputation (30)
  - removed (1)
  - scaled (30)

Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -2.000  -2.000  -0.200  -0.096   1.400   2.000 

5 Competitive Statistical Modeling

To discover the best predictive architecture for the ABC Beverage manufacturing line, we evaluate six distinct algorithm classes from our Predictive Analytics curriculum. We start with a linear baseline and progressively move into more complex non-linear and ensemble methods. Every model is optimized using repeated 5-fold cross-validation across 15 total resamples.

5.1 Cross-Validation Strategy Setup

set.seed(42)

train_control <- trainControl(
  method        = "repeatedcv",
  number        = 5,
  repeats       = 3,
  verboseIter   = FALSE,
  allowParallel = TRUE        # explicitly pass work to the registered cluster
)

5.2 Model A: Partial Least Squares (PLS)

PLS manages highly correlated manufacturing features by projecting them into lower-dimensional latent components — a natural baseline for collinear sensor data.

set.seed(42)
model_pls <- train(
  PH ~ .,
  data       = final_train_set,
  method     = "pls",
  tuneLength = 15,
  trControl  = train_control
)
print(model_pls)
Partial Least Squares 

2447 samples
  31 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1958, 1957, 1958, 1957, 1958, 1957, ... 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   MAE      
   1     0.1519716  0.2190561  0.1205209
   2     0.1416969  0.3212659  0.1103710
   3     0.1391195  0.3457940  0.1090977
   4     0.1369725  0.3655889  0.1071359
   5     0.1351942  0.3821394  0.1051577
   6     0.1339096  0.3938659  0.1045054
   7     0.1332040  0.4003920  0.1037062
   8     0.1327561  0.4045443  0.1032757
   9     0.1325426  0.4064275  0.1030353
  10     0.1322779  0.4087912  0.1027855
  11     0.1321665  0.4098137  0.1027329
  12     0.1320757  0.4106932  0.1026590
  13     0.1319656  0.4116940  0.1024684
  14     0.1319303  0.4120853  0.1025559
  15     0.1318614  0.4126446  0.1024457

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 15.

5.2.1 Model A Diagnostics

The optimal PLS configuration uses ncomp = 15 latent components, achieving a cross-validated RMSE of 0.1319 and explaining 41.26% of the underlying variance (\(R^2 = 0.4126\)). The sub-50% variance explanation strongly signals that non-linear architectures are needed.

5.3 Model B: Support Vector Machines — Radial Basis Function (SVM-RBF)

The RBF kernel maps features into a high-dimensional space to capture non-linear pH thresholds without a rigid parametric assumption.

set.seed(42)
model_svm <- train(
  PH ~ .,
  data       = final_train_set,
  method     = "svmRadial",
  tuneLength = 5,
  trControl  = train_control
)
print(model_svm)
Support Vector Machines with Radial Basis Function Kernel 

2447 samples
  31 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1958, 1957, 1958, 1957, 1958, 1957, ... 
Resampling results across tuning parameters:

  C     RMSE       Rsquared   MAE       
  0.25  0.1223845  0.5021871  0.09055397
  0.50  0.1186974  0.5293213  0.08697744
  1.00  0.1153996  0.5532190  0.08411968
  2.00  0.1127703  0.5718416  0.08215409
  4.00  0.1114625  0.5813971  0.08155917

Tuning parameter 'sigma' was held constant at a value of 0.02200219
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02200219 and C = 4.

5.3.1 Model B Diagnostics

With sigma fixed at 0.0220 and cost C = 4, the SVM reduces RMSE to 0.1115 and lifts \(R^2\) to 58.14% — a clear improvement over the linear baseline, confirming non-linear dynamics in the process.

5.4 Model C: Random Forest (RF) Ensembles

Random Forest builds an ensemble of independent decision trees via bootstrap aggregating (bagging). It inherently captures high-order variable interactions and provides a native measure of variable importance.

set.seed(42)
model_rf <- train(
  PH ~ .,
  data       = final_train_set,
  method     = "rf",
  tuneLength = 3,
  trControl  = train_control,
  importance = TRUE
)
print(model_rf)
Random Forest 

2447 samples
  31 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1958, 1957, 1958, 1957, 1958, 1957, ... 
Resampling results across tuning parameters:

  mtry  RMSE        Rsquared   MAE       
   2    0.10923335  0.6441729  0.08234642
  17    0.09577486  0.7038373  0.06908082
  33    0.09588798  0.6950912  0.06834256

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 17.

5.4.1 Model C Diagnostics

The optimal RF selects mtry = 17. This ensemble achieves a cross-validated RMSE of 0.0958 and captures 70.38% of the operational variance (\(R^2 = 0.7038\)). It is the top performer among the caret models and holds up well against the more advanced boosting methods tested later.

5.5 Model D: Stochastic Gradient Boosting (GBM)

GBM builds sequential decision trees where each new tree minimizes the residual error of the prior ones, often achieving maximum accuracy on noisy manufacturing data.

set.seed(42)
model_gbm <- train(
  PH ~ .,
  data       = final_train_set,
  method     = "gbm",
  tuneLength = 3,
  trControl  = train_control,
  verbose    = FALSE
)
print(model_gbm)
Stochastic Gradient Boosting 

2447 samples
  31 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 1958, 1957, 1958, 1957, 1958, 1957, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE       Rsquared   MAE       
  1                   50      0.1355939  0.3959628  0.10690222
  1                  100      0.1313410  0.4264923  0.10332672
  1                  150      0.1291770  0.4420132  0.10112068
  2                   50      0.1276723  0.4641043  0.10007784
  2                  100      0.1225220  0.5011053  0.09514761
  2                  150      0.1196361  0.5212719  0.09211752
  3                   50      0.1231786  0.4999480  0.09592091
  3                  100      0.1180540  0.5350436  0.09086473
  3                  150      0.1153785  0.5534472  0.08798392

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10.

5.5.1 Model D Diagnostics

GBM optimizes at n.trees = 150, interaction.depth = 3, and shrinkage = 0.1, yielding RMSE of 0.1154 and \(R^2 = 0.5534\). Despite the boosting strength, independent bagging (RF) outperformed sequential optimization on this noisy sensor dataset.

5.6 Model E: XGBoost (Extreme Gradient Boosting)

XGBoost improves on standard GBM with L1/L2 regularization, early stopping, and a more efficient tree-building algorithm. It is one of the most competitive models for tabular data. Due to a known incompatibility between caret’s xgbTree method and R 4.6, we run XGBoost through the tidymodels ecosystem using the same repeated cross-validation structure.

library(bonsai)
library(tidymodels)

# XGBoost via tidymodels -- avoids caret/R4.6 ALTREP incompatibility
dummy_obj <- dummyVars(PH ~ ., data = final_train_set, fullRank = TRUE)
final_train_xgb <- data.frame(
  predict(dummy_obj, newdata = final_train_set),
  PH = final_train_set$PH
)

xgb_spec <- boost_tree(
  trees      = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_grid <- grid_regular(
  trees(range      = c(100, 500)),
  tree_depth(range = c(3, 8)),
  learn_rate(range = c(-2, -1)),
  levels = 3
)

xgb_recipe <- recipe(PH ~ ., data = final_train_xgb)

xgb_wf <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)

xgb_tune <- tune_grid(
  xgb_wf,
  resamples = vfold_cv(final_train_xgb, v = 5, repeats = 3),
  grid      = xgb_grid,
  metrics   = metric_set(rmse, rsq, mae)
)

xgb_best_params <- select_best(xgb_tune, metric = "rmse")

model_xgb <- xgb_wf %>%
  finalize_workflow(xgb_best_params) %>%
  fit(data = final_train_xgb)

xgb_metrics <- xgb_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  filter(mean == min(mean)) %>%
  slice(1)

cat(sprintf("Best XGBoost -- RMSE: %.4f\n", xgb_metrics$mean))
Best XGBoost -- RMSE: 0.0960

5.6.1 Model E Diagnostics

xgb_rmse <- xgb_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  filter(mean == min(mean)) %>%
  slice(1) %>%
  pull(mean)

xgb_rsq <- xgb_tune %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  filter(mean == max(mean)) %>%
  slice(1) %>%
  pull(mean)

xgb_mae <- xgb_tune %>%
  collect_metrics() %>%
  filter(.metric == "mae") %>%
  filter(mean == min(mean)) %>%
  slice(1) %>%
  pull(mean)

cat(sprintf("Best XGBoost -- RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            xgb_rmse, xgb_rsq, xgb_mae))
Best XGBoost -- RMSE: 0.0960 | R2: 0.6879 | MAE: 0.0670

XGBoost came within 0.0002 RMSE of Random Forest, finishing at 0.0960 with \(R^2 = 0.6879\). It actually outperformed RF on MAE (0.0670 vs 0.0691), meaning XGBoost makes slightly smaller errors on average. The near-tie between these two models is meaningful. It tells us that tree ensemble methods as a family are the right choice for this data, and that the RF result is not a fluke.

5.7 Model F: LightGBM (via bonsai)

LightGBM uses a leaf-wise tree growth strategy instead of level-wise, making it faster and often more accurate than standard GBM on structured datasets. We use the bonsai package for a stable, actively maintained interface to LightGBM.

library(bonsai)
library(parsnip)

set.seed(42)

lgbm_spec <- boost_tree(
  trees      = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("lightgbm", verbose = -1) %>%
  set_mode("regression")

lgbm_grid <- grid_regular(
  trees(range      = c(100, 500)),
  tree_depth(range = c(3, 8)),
  learn_rate(range = c(-2, -1)),
  levels = 3
)

lgbm_recipe <- recipe(PH ~ ., data = final_train_set)

lgbm_wf <- workflow() %>%
  add_recipe(lgbm_recipe) %>%
  add_model(lgbm_spec)

lgbm_tune <- tune_grid(
  lgbm_wf,
  resamples = vfold_cv(final_train_set, v = 5, repeats = 3),
  grid      = lgbm_grid,
  metrics   = metric_set(rmse, rsq, mae)
)

lgbm_best_params <- select_best(lgbm_tune, metric = "rmse")

model_lgbm <- lgbm_wf %>%
  finalize_workflow(lgbm_best_params) %>%
  fit(data = final_train_set)

lgbm_metrics <- lgbm_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  filter(mean == min(mean)) %>%
  slice(1)

cat(sprintf("Best LightGBM -- RMSE: %.4f\n", lgbm_metrics$mean))
Best LightGBM -- RMSE: 0.0964

5.7.1 Model F Diagnostics

lgbm_rmse <- lgbm_tune %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  filter(mean == min(mean)) %>%
  slice(1) %>%
  pull(mean)

lgbm_rsq <- lgbm_tune %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  filter(mean == max(mean)) %>%
  slice(1) %>%
  pull(mean)

lgbm_mae <- lgbm_tune %>%
  collect_metrics() %>%
  filter(.metric == "mae") %>%
  filter(mean == min(mean)) %>%
  slice(1) %>%
  pull(mean)

cat(sprintf("Best LightGBM -- RMSE: %.4f | R2: %.4f | MAE: %.4f\n",
            lgbm_rmse, lgbm_rsq, lgbm_mae))
Best LightGBM -- RMSE: 0.0964 | R2: 0.6860 | MAE: 0.0698

LightGBM rounds out the tree ensemble comparison at RMSE of 0.0964 and \(R^2 = 0.6860\). All three ensemble methods, RF, XGBoost, and LightGBM, converge within 0.0006 RMSE of each other. That kind of convergence is not a weakness in the analysis. It validates the result. The data has a consistent ceiling that these methods all find independently.

6 Model Performance Metrics & Selection

6.1 Resampled Performance Distributions

We pool the cross-validation resampling vectors from the four caret models to evaluate consistency across identical data splits. XGBoost and LightGBM were trained through the tidymodels framework and are compared separately in the performance table below.

model_results <- resamples(list(
  PLS               = model_pls,
  SVM_Radial        = model_svm,
  Random_Forest     = model_rf,
  Gradient_Boosting = model_gbm
))

summary(model_results)

Call:
summary.resamples(object = model_results)

Models: PLS, SVM_Radial, Random_Forest, Gradient_Boosting 
Number of resamples: 15 

MAE 
                        Min.    1st Qu.     Median       Mean    3rd Qu.
PLS               0.09830205 0.10125473 0.10280699 0.10244568 0.10366463
SVM_Radial        0.07735302 0.07947261 0.08191617 0.08155917 0.08299108
Random_Forest     0.06447640 0.06723887 0.06878730 0.06908082 0.07057693
Gradient_Boosting 0.08348426 0.08714123 0.08792023 0.08798392 0.08943539
                        Max. NA's
PLS               0.10758599    0
SVM_Radial        0.08574801    0
Random_Forest     0.07347775    0
Gradient_Boosting 0.09346244    0

RMSE 
                        Min.    1st Qu.     Median       Mean   3rd Qu.
PLS               0.12585711 0.13019540 0.13215062 0.13186140 0.1343802
SVM_Radial        0.10520636 0.10902076 0.11121926 0.11146250 0.1145264
Random_Forest     0.08918162 0.09164151 0.09525883 0.09577486 0.1001622
Gradient_Boosting 0.10946829 0.11160391 0.11693326 0.11537850 0.1181800
                       Max. NA's
PLS               0.1358322    0
SVM_Radial        0.1186188    0
Random_Forest     0.1044660    0
Gradient_Boosting 0.1209990    0

Rsquared 
                       Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
PLS               0.3514633 0.3939049 0.4180996 0.4126446 0.4379351 0.4441252
SVM_Radial        0.5416337 0.5555643 0.5898194 0.5813971 0.6038734 0.6163906
Random_Forest     0.6496643 0.6862369 0.7077280 0.7038373 0.7238565 0.7348330
Gradient_Boosting 0.4819176 0.5362011 0.5552721 0.5534472 0.5743529 0.5840627
                  NA's
PLS                  0
SVM_Radial           0
Random_Forest        0
Gradient_Boosting    0
dotplot(model_results, metric = "RMSE",
        main = "Cross-Validated RMSE - Caret Models (15 Resamples)")

RMSE distribution across 15 cross-validation resamples for caret models. Random Forest consistently achieves lowest error with tightest spread. XGBoost and LightGBM results appear in the comparison table.

6.2 Comprehensive Performance Comparison Matrix

extract_best <- function(model) {
  model$results %>%
    filter(RMSE == min(RMSE)) %>%
    slice(1) %>%
    select(RMSE, Rsquared, MAE)
}

# Caret models only
caret_metrics <- bind_rows(
  extract_best(model_pls),
  extract_best(model_svm),
  extract_best(model_rf),
  extract_best(model_gbm)
)

# Pull XGBoost and LightGBM metrics from tidymodels
pull_tm_metrics <- function(tune_obj) {
  rmse <- tune_obj %>% collect_metrics() %>%
    filter(.metric == "rmse") %>% filter(mean == min(mean)) %>%
    slice(1) %>% pull(mean)
  rsq <- tune_obj %>% collect_metrics() %>%
    filter(.metric == "rsq") %>% filter(mean == max(mean)) %>%
    slice(1) %>% pull(mean)
  mae <- tune_obj %>% collect_metrics() %>%
    filter(.metric == "mae") %>% filter(mean == min(mean)) %>%
    slice(1) %>% pull(mean)
  tibble(RMSE = rmse, Rsquared = rsq, MAE = mae)
}

all_metrics <- bind_rows(
  caret_metrics,
  pull_tm_metrics(xgb_tune),
  pull_tm_metrics(lgbm_tune)
)

comparison_metrics <- tibble(
  Algorithm = c(
    "Model A: PLS",
    "Model B: SVM Radial",
    "Model C: Random Forest",
    "Model D: GBM",
    "Model E: XGBoost",
    "Model F: LightGBM"
  )
) %>%
  bind_cols(all_metrics) %>%
  mutate(
    Rsquared = paste0(round(Rsquared * 100, 2), "%"),
    RMSE     = round(RMSE, 4),
    MAE      = round(MAE, 4)
  ) %>%
  rename(`CV RMSE` = RMSE, `` = Rsquared)

winner_row <- which.min(all_metrics$RMSE)

kable(
  comparison_metrics,
  caption  = "Final Performance Comparison - Mean Across 15 Resamples",
  align    = "lccc",
  booktabs = TRUE
) %>%
  kable_styling(latex_options = c("striped", "hold_position")) %>%
  row_spec(winner_row, bold = TRUE, background = "#d5f5e3")
Final Performance Comparison - Mean Across 15 Resamples
Algorithm CV RMSE MAE
Model A: PLS 0.1319 41.26% 0.1024
Model B: SVM Radial 0.1115 58.14% 0.0816
Model C: Random Forest 0.0958 70.38% 0.0691
Model D: GBM 0.1154 55.34% 0.0880
Model E: XGBoost 0.0960 68.79% 0.0670
Model F: LightGBM 0.0964 68.6% 0.0698

6.3 Final Model Selection & Key Predictors

6.3.1 The Clear Winner

Model C, Random Forest, outperforms all competing frameworks across the primary metrics: lowest RMSE (0.0958), highest \(R^2\) (70.38%), and competitive MAE (0.0691). The 30-point jump in \(R^2\) from PLS to RF confirms that pH dynamics on this line are driven by non-linear, high-order interactions among process variables.

XGBoost came within 0.0002 RMSE and actually beat RF on MAE. LightGBM finished close behind. The fact that all three tree ensemble methods converge around the same performance ceiling is the most important finding in this comparison. It means RF is not outperforming the others because of luck or a favorable random seed. The bagging approach genuinely fits this dataset well.

6.3.2 Variable Importance — What Actually Drives pH

The assignment requires us to identify the predictive factors. The plot below ranks all 31 predictors by their mean decrease in node impurity within the Random Forest, providing a data-driven answer for process engineers and regulators.

var_imp_df <- varImp(model_rf)$importance %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Importance = Overall) %>%
  arrange(desc(Importance)) %>%
  slice_head(n = 20)

ggplot(var_imp_df,
       aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "#2c3e50", alpha = 0.85) +
  geom_text(aes(label = round(Importance, 1)),
            hjust = -0.1, size = 3, color = "gray30") +
  coord_flip() +
  expand_limits(y = max(var_imp_df$Importance) * 1.12) +
  labs(
    title    = "Top 20 Predictors of Manufacturing pH",
    subtitle = "Random Forest Variable Importance (Mean Decrease in Node Impurity)",
    x        = NULL,
    y        = "Importance Score"
  )

Top 20 predictors of manufacturing pH ranked by Random Forest variable importance (mean decrease in node impurity). Higher values indicate stronger influence on pH.
train_preds <- data.frame(
  Actual    = final_train_set$PH,
  Predicted = predict(model_rf, newdata = final_train_set)
)

ggplot(train_preds, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.25, size = 0.9, color = "#2c3e50") +
  geom_abline(slope = 1, intercept = 0,
              color = "#e74c3c", linewidth = 0.8, linetype = "dashed") +
  labs(
    title = "Predicted vs. Actual pH - Random Forest (Training Set)",
    x     = "Actual pH",
    y     = "Predicted pH"
  )

Predicted vs. actual pH on training resamples. Points clustering tightly around the 45 degree reference line indicate well-calibrated predictions.

7 Final Evaluation Predictions & Export

With Random Forest selected, we preprocess the evaluation dataset using the exact same transformation parameters fitted on training data, then generate and export the final pH predictions.

7.1 Evaluation Set Preprocessing

evaluation_data <- evaluation_data %>% mutate(Row_ID = row_number())

eval_features <- evaluation_data %>% select(-any_of("PH"))

# Use verified majority class for missing Brand.Code imputation
eval_features$Brand.Code <- as.character(eval_features$Brand.Code)
eval_features$Brand.Code[
  is.na(eval_features$Brand.Code) | eval_features$Brand.Code == ""
] <- majority_class
eval_features$Brand.Code <- as.factor(eval_features$Brand.Code)

eval_numeric     <- eval_features %>%
  select(Row_ID, where(is.numeric)) %>%
  as.data.frame()

eval_categorical <- eval_features %>%
  select(Row_ID, where(is.factor)) %>%
  as.data.frame()

# Safety Patch 1: median imputation for any residual NAs before caret transform
for (i in 2:ncol(eval_numeric)) {
  if (any(is.na(eval_numeric[, i]))) {
    eval_numeric[is.na(eval_numeric[, i]), i] <-
      median(eval_numeric[, i], na.rm = TRUE)
  }
}

# Apply trained preProcess object
eval_numeric_transformed <- predict(
  prep_parameters,
  eval_numeric %>% select(-Row_ID)
)

# Safety Patch 2: clamp Inf/-Inf values generated by Box-Cox edge cases
for (i in seq_len(ncol(eval_numeric_transformed))) {
  col_data <- eval_numeric_transformed[, i]
  if (any(is.infinite(col_data))) {
    eval_numeric_transformed[col_data == Inf,  i] <-
      max(col_data[is.finite(col_data)], na.rm = TRUE)
    eval_numeric_transformed[col_data == -Inf, i] <-
      min(col_data[is.finite(col_data)], na.rm = TRUE)
  }
}

eval_numeric_transformed$Row_ID <- eval_numeric$Row_ID

final_eval_set <- eval_categorical %>%
  left_join(eval_numeric_transformed, by = "Row_ID")

final_eval_set <- evaluation_data %>%
  select(Row_ID) %>%
  left_join(final_eval_set, by = "Row_ID") %>%
  select(-Row_ID)

# Final safety pass for any remaining NAs
for (i in seq_len(ncol(final_eval_set))) {
  if (is.numeric(final_eval_set[, i]) && any(is.na(final_eval_set[, i]))) {
    final_eval_set[is.na(final_eval_set[, i]), i] <-
      median(final_eval_set[, i], na.rm = TRUE)
  }
}

cat(sprintf("Evaluation set ready: %d rows x %d columns\n",
            nrow(final_eval_set), ncol(final_eval_set)))
Evaluation set ready: 267 rows x 31 columns

7.2 Generating and Saving Final pH Predictions

library(writexl)

final_predictions <- predict(model_rf, newdata = final_eval_set)

evaluation_output <- evaluation_data %>%
  mutate(PH = as.vector(final_predictions)) %>%
  select(-Row_ID)

write_xlsx(evaluation_output, "StudentEvaluation_Predictions.xlsx")

cat(sprintf("Predictions exported for %d rows.\n", nrow(evaluation_output)))
Predictions exported for 267 rows.
head(evaluation_output %>% select(Brand.Code, any_of("MFR"), PH), 10)
Brand.Code MFR PH
D 727.6 8.563120
A 735.8 8.447166
B 734.8 8.518856
B NA 8.574247
B 752.0 8.486939
B 732.0 8.534388
A 729.8 8.499065
B NA 8.544170
A 741.2 8.562635
D NA 8.616114

8 Parallel Backend Release

stopCluster(cl)
registerDoSEQ()
cat("Parallel backend successfully released.\n")
Parallel backend successfully released.

9 Project Conclusion & Summary

The predictive modeling pipeline is complete. The Random Forest model was selected as the final architecture and its predictions have been exported to StudentEvaluation_Predictions.xlsx.

9.1 Key Technical Takeaways

Preprocessing impact. The unified pipeline — KNN imputation, Box-Cox transformation, near-zero variance filtering, centering, and scaling — was essential to stabilize the highly skewed, incomplete sensor telemetry before modeling.

Model selection. We tested six models covering linear, kernel, and tree ensemble approaches. Random Forest outperformed all five competing frameworks on RMSE and \(R^2\), capturing 70.38% of pH variance with a cross-validated RMSE of 0.0958. XGBoost and LightGBM both finished within 0.0006 RMSE of RF, which confirms that tree ensemble methods as a family are the right fit for this manufacturing dataset. The convergence of three independent ensemble methods around the same performance ceiling validates the result.

Predictive factors. The variable importance analysis directly answers the regulatory requirement: it identifies which process parameters exert the greatest influence on pH. These predictors represent the highest-leverage control points for process engineers seeking to maintain product quality and regulatory compliance.

Pipeline integrity. Row-ID tracking and two-stage safety patching (median imputation + Inf clamping) guaranteed that all 267 evaluation rows received a valid pH prediction with zero data loss.