1. Introduction

This document presents a technical report on the development of predictive models for manufacturing process pH. The pH of the final product is a critical quality attribute in many manufacturing settings, as it can influence product stability, performance, and regulatory compliance. Our goal is to use historical process data to build a robust, data-driven model that accurately predicts pH and helps identify the key process factors that drive variation in this outcome.

In this project, the response variable pH represents the acidity or alkalinity of the manufacturing process output. It is measured on the standard pH scale, where lower values indicate more acidic conditions and higher values indicate more basic (alkaline) conditions. Because pH affects product quality and stability, it is treated as a critical quality attribute that the process must keep within an acceptable specification range.

The analysis follows a complete end-to-end workflow. We begin by ingesting and cleaning the raw student datasets, standardizing variable names, handling missing values, and preparing the data for modeling. We then conduct exploratory data analysis (EDA) to understand the distribution of pH, examine its relationships with key predictors, and assess the consistency between the training and evaluation datasets. Next, we construct several competing models—ranging from a baseline linear regression to more flexible approaches such as elastic net, MARS, random forest, and Cubist—using a consistent resampling framework for hyperparameter tuning and performance estimation.

Model performance is assessed using cross-validated RMSE and R-squared on the training data, as well as RMSE and R-squared on a held-out test set. Based on these metrics, we select a final model and examine its variable importance to interpret which process features are most influential for pH. The report concludes with final predictions for the evaluation dataset and a brief discussion of reproducibility, including package requirements and session information, so that the entire workflow can be replicated from this R Markdown file.

2. Data Preparation

2.1 Load Required Libraries

To ensure a reproducible and stable analytical workflow, we began by configuring the R environment with all required packages for data handling, visualization, and preprocessing. This setup step guarantees that all dependencies are consistently available throughout the analysis, reduces the risk of version related issues, and establishes a standardized foundation for the subsequent data processing stages.

tidyverse : data manipulatio
skimr : quick data summaries
corrplot : correlation heatmaps
caret : pre-processing, modeling framework
readxl : read excel files**
janitor : data cleaning
forcats : categorical variable handling

*Not used in Mac

# Load required packages for the analysis
library(tidyverse)
library(skimr)
library(corrplot)
library(caret)
library(janitor)
library(forcats)
library(recipes)
library(earth)
library(ggplot2)
library(elasticnet)
library(Cubist)
library(randomForest)

# Note (macOS):
# Input data are read from CSV files using readr::read_csv instead of readxl.
# This avoids platform-specific issues with readxl on macOS.

2.2 Data Ingestion

In this phase, we imported the two primary datasets required for the analysis. StudentData was designated as the training dataset and serves as the foundation for building and calibrating the predictive models. StudentEvaluation, on the other hand, functions as the test dataset and is reserved for out of sample evaluation to assess model generalization.

train_raw <- read_excel("data/StudentData.xlsx")
eval_raw  <- read_excel("data/StudentEvaluation.xlsx")

cat("Train rows x cols: ", paste(dim(train_raw), collapse = " x "), "\n")
cat("Eval  rows x cols: ", paste(dim(eval_raw),  collapse = " x "), "\n")


glimpse(train_raw)

train_raw <- readr::read_csv("data/StudentData.csv", show_col_types = FALSE)
eval_raw  <- readr::read_csv("data/StudentEvaluation.csv", show_col_types = FALSE)

cat("Train rows x cols: ", paste(dim(train_raw), collapse = " x "), "\n")

## Train rows x cols:  2571 x 33

cat("Eval  rows x cols: ", paste(dim(eval_raw),  collapse = " x "), "\n")

## Eval  rows x cols:  267 x 33

glimpse(train_raw)

## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…

We will now take a look at the structure of both datasets to understand the variables and their types.

skim(train_raw)

Data summary
Name	train_raw
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

The training dataset consists of 2,571 observations and 33 variables in total. The response variable is pH, which measures the acidity level of the manufacturing process output. The remaining 32 variables are used as predictors and include process measurements such as temperatures, concentrations, and timing variables, as well as product characteristics (e.g., brand or batch indicators). These predictors capture different aspects of the manufacturing conditions that are expected to influence the final pH.

skim(eval_raw)

Data summary
Name	eval_raw
Number of rows	267
Number of columns	33
_______________________
Column type frequency:
character	1
logical	1
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	8	0.97	1	1	0	4	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
PH	267	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	1	1.00	5.37	0.11	5.15	5.29	5.34	5.47	5.67	▂▇▃▅▁
Fill Ounces	6	0.98	23.97	0.08	23.75	23.92	23.97	24.01	24.20	▁▅▇▃▁
PC Volume	4	0.99	0.28	0.06	0.10	0.23	0.28	0.32	0.46	▁▆▇▅▁
Carb Pressure	0	1.00	68.25	3.86	60.20	65.30	68.00	70.60	77.60	▃▆▇▃▂
Carb Temp	1	1.00	141.23	4.30	130.00	138.40	140.80	143.80	154.00	▁▆▇▃▁
PSC	5	0.98	0.09	0.05	0.00	0.04	0.08	0.11	0.25	▆▇▃▂▁
PSC Fill	3	0.99	0.19	0.11	0.02	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	5	0.98	0.05	0.04	0.00	0.02	0.04	0.06	0.24	▇▃▂▁▁
Mnf Flow	0	1.00	21.03	117.76	-100.20	-100.00	0.20	141.30	220.40	▇▁▁▆▂
Carb Pressure1	4	0.99	123.04	4.42	113.00	120.20	123.40	125.50	136.00	▃▃▇▂▁
Fill Pressure	2	0.99	48.14	3.44	37.80	46.00	47.80	50.20	60.20	▁▇▇▂▁
Hyd Pressure1	0	1.00	12.01	13.53	-50.00	0.00	10.40	20.40	50.00	▁▁▇▆▂
Hyd Pressure2	1	1.00	20.11	17.21	-50.00	0.00	26.80	34.80	61.40	▁▁▆▇▁
Hyd Pressure3	1	1.00	19.61	16.56	-50.00	0.00	27.70	33.00	49.20	▁▁▆▃▇
Hyd Pressure4	4	0.99	97.84	13.92	68.00	90.00	98.00	104.00	140.00	▅▆▇▂▁
Filler Level	2	0.99	110.29	15.50	69.20	100.60	118.60	120.20	153.20	▂▃▇▇▁
Filler Speed	10	0.96	3581.39	911.19	1006.00	3812.00	3978.00	3996.00	4020.00	▁▁▁▁▇
Temperature	2	0.99	66.23	1.69	63.80	65.40	65.80	66.60	75.40	▇▅▁▁▁
Usage cont	2	0.99	20.90	3.00	12.90	18.12	21.44	23.74	24.60	▁▃▃▃▇
Carb Flow	0	1.00	2408.64	1161.36	0.00	1083.00	3038.00	3215.00	3858.00	▂▃▁▆▇
Density	1	1.00	1.18	0.38	0.06	0.92	0.98	1.60	1.84	▁▁▇▁▅
MFR	31	0.88	697.80	96.40	15.60	707.00	724.60	731.45	784.80	▁▁▁▁▇
Balling	1	1.00	2.20	0.92	0.90	1.50	1.65	3.24	3.79	▅▇▁▂▅
Pressure Vacuum	1	1.00	-5.17	0.58	-6.40	-5.60	-5.20	-4.80	-3.60	▁▇▆▃▁
Oxygen Filler	3	0.99	0.05	0.05	0.00	0.02	0.03	0.05	0.40	▇▁▁▁▁
Bowl Setpoint	1	1.00	109.62	15.02	70.00	100.00	120.00	120.00	130.00	▁▂▁▃▇
Pressure Setpoint	2	0.99	47.73	2.06	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	1	1.00	142.83	1.23	141.20	142.20	142.60	142.80	147.20	▅▇▁▁▁
Alch Rel	3	0.99	6.91	0.50	6.40	6.54	6.58	7.18	7.82	▇▁▂▁▃
Carb Rel	2	0.99	5.44	0.13	5.18	5.34	5.40	5.56	5.74	▂▇▂▃▂
Balling Lvl	0	1.00	2.05	0.88	0.00	1.38	1.48	3.08	3.42	▁▃▇▁▇

To further understand the datasets, we will list the variable names present in both the training and evaluation datasets.

names(train_raw)

##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"

Compare that column sets are identical.

identical(names(train_raw), names(eval_raw))

## [1] TRUE

2.2 Cleaning Column Names

To ensure consistency and ease of reference throughout the analysis, we standardized the column names in both datasets. This step involved converting all column names to lowercase and replacing spaces with underscores. Such standardization helps prevent potential issues related to case sensitivity and special characters during data manipulation and modeling.

train <- train_raw |>
  clean_names()  

score <- eval_raw |>
  clean_names()

identical(names(train), names(score))

## [1] TRUE

The result is a consistent naming convention across both datasets, facilitating smoother data handling in subsequent steps.

2.3 Data Cleaning and Imputation

In this step, we addressed missing values and potential anomalies in the datasets to prepare them for analysis. The following actions were taken:

Removed rows with missing target variable (ph) from the training dataset, as these cannot contribute to model training.
Treated missing values in the categorical variable brand_code as an explicit category labeled “Unknown”. This approach allows the model to recognize missingness as a distinct factor rather than ignoring those instances.
Re-coded sentinel values in the mnf_flow variable. Values less than or equal to -99 were identified as indicators of missing or invalid data. We created a new binary indicator variable mnf_flow_off to flag these instances and set the original mnf_flow values to NA for proper handling during imputation.
Defined a preprocessing recipe using the recipes package to systematically handle missing data and scale numeric predictors. The recipe includes median imputation for numeric variables, mode imputation for categorical variables, removal of zero-variance predictors, and centering and scaling of numeric predictors.

train <- train |>
  filter(!is.na(ph))

train <- train |>
  mutate(
    brand_code = fct_explicit_na(brand_code, na_level = "Unknown")
  )

score <- score |>
  mutate(
    brand_code = fct_explicit_na(brand_code, na_level = "Unknown")
  )

summary(train$mnf_flow)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -100.20 -100.00   70.20   24.63  140.80  229.40

table(train$mnf_flow)[1:10]  # inspect -100, -100.2

## 
## -100.2   -100    0.2      2    4.8    6.8    7.6    8.2    8.6    9.6 
##    578    605     78      1      1      1      1      1      1      1

train <- train |>
  mutate(
    mnf_flow_off = ifelse(mnf_flow <= -99, 1, 0),
    mnf_flow     = ifelse(mnf_flow <= -99, NA, mnf_flow)
  )

score <- score |>
  mutate(
    mnf_flow_off = ifelse(mnf_flow <= -99, 1, 0),
    mnf_flow     = ifelse(mnf_flow <= -99, NA, mnf_flow)
  )

train_clean_raw  <- train
score_clean_raw  <- score

ph_recipe <- recipe(ph ~ ., data = train) |>
  update_role(ph,         new_role = "outcome")   |>
  update_role(brand_code, new_role = "predictor") |>
  step_impute_median(all_numeric_predictors())    |>
  step_impute_mode(all_nominal_predictors())      |>
  step_zv(all_predictors())                       |>
  step_center(all_numeric_predictors())           |>
  step_scale(all_numeric_predictors())

ph_prep <- prep(ph_recipe, training = train)

train <- juice(ph_prep)                     # fully processed training data
score <- bake(ph_prep, new_data = score)    # fully processed scoring data

New csv files with cleaned data can be saved for future use.

write.csv(train, "data/Train_Clean_PH.csv", row.names = FALSE)
write.csv(score, "data/Score_Clean_PH.csv", row.names = FALSE)

3 Exploratory Data Analysis (EDA)

The goal of this section is to explore the training dataset to understand the distribution of the target variable (pH) and identify potential predictors through correlation analysis. This will help inform feature selection and engineering for the modeling phase.

3.1 Overview of Cleaned Training Data

The basic structure and summary statistics of the cleaned training and scoring datasets are presented below. This overview provides insights into the number of observations, variables, and key characteristics of the data.

dim(train)

## [1] 2567   34

dim(score)

## [1] 267  34

glimpse(train)

## Rows: 2,567
## Columns: 34
## $ brand_code        <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, C, B…
## $ carb_volume       <dbl> -0.28475798, 0.53130613, -0.78695127, 0.65685445, 1.…
## $ fill_ounces       <dbl> -0.09476889, 0.36597325, 0.98029608, 0.36597325, 3.8…
## $ pc_volume         <dbl> -0.22972785, -0.63987315, -0.22972785, 0.26909754, -…
## $ carb_pressure     <dbl> 0.002735817, 0.059554926, 0.741384239, -1.474561027,…
## $ carb_temp         <dbl> 0.02758583, -0.37105082, 0.92451827, -2.11508613, -1…
## $ psc               <dbl> 0.3977850, 0.8064432, 0.1117243, -0.1743364, -1.1959…
## $ psc_fill          <dbl> 0.55290392, 0.21180698, 1.23509780, 1.91729168, -0.2…
## $ psc_co2           <dbl> -0.37842496, -0.37842496, 2.42643607, -0.37842496, 1…
## $ mnf_flow          <dbl> 0.1327827, 0.1327827, 0.1327827, 0.1327827, 0.132782…
## $ carb_pressure1    <dbl> -0.80419751, -0.20822172, -0.50620961, -1.57045209, …
## $ fill_pressure     <dbl> -0.6031982, -0.6031982, -0.6031982, -0.4769726, -0.6…
## $ hyd_pressure1     <dbl> -1.003715, -1.003715, -1.003715, -1.003715, -1.00371…
## $ hyd_pressure2     <dbl> 0.4627655, 0.4627655, 0.4627655, -1.2874501, -1.2874…
## $ hyd_pressure3     <dbl> 0.4443658, 0.4443658, 0.4443658, -1.2877894, -1.2877…
## $ hyd_pressure4     <dbl> 1.6654791, 0.7442469, -1.0982173, -0.3305239, -0.330…
## $ filler_level      <dbl> 0.7590093, 0.5930444, 0.6824101, 0.5419783, 0.593044…
## $ filler_speed      <dbl> 0.4034736, 0.3824942, 0.4270754, 0.4165857, 0.413963…
## $ temperature       <dbl> 0.02678116, 1.18952507, 0.75349611, -0.26390482, -0.…
## $ usage_cont        <dbl> -1.61956027, -0.36851010, -1.08820025, -1.20254355, …
## $ carb_flow         <dbl> 0.4294009, 0.6275087, 0.4125804, 0.5508821, 0.543406…
## $ density           <dbl> -0.7811071, -0.6749974, 1.0758126, 0.9697029, 0.9697…
## $ mfr               <dbl> 0.2721248, 0.2974595, 0.4128729, 0.3509438, 0.241160…
## $ balling           <dbl> -0.8626157, -0.7550364, 1.0135670, 0.9059877, 0.9059…
## $ pressure_vacuum   <dbl> 2.132323, 2.132323, 2.482975, 1.431020, 1.431020, 1.…
## $ oxygen_filler     <dbl> -0.541796037, -0.452875935, -0.497335986, -0.3639558…
## $ bowl_setpoint     <dbl> 0.6964944, 0.6964944, 0.6964944, 0.6964944, 0.696494…
## $ pressure_setpoint <dbl> -0.5919061, -0.3955348, -0.4937204, -0.7882774, -0.7…
## $ air_pressurer     <dbl> -0.1929304, 0.1369080, -0.6876881, 2.7756158, 2.7756…
## $ alch_rel          <dbl> -0.6276642, -0.6672780, 1.5114835, 0.4815235, 0.4815…
## $ carb_rel          <dbl> -0.90746961, -1.06301679, 3.13675695, -0.12973374, 0…
## $ balling_lvl       <dbl> -0.6576926, -0.5656108, 1.4141480, 1.1379026, 1.1379…
## $ mnf_flow_off      <dbl> 1.081412, 1.081412, 1.081412, 1.081412, 1.081412, 1.…
## $ ph                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…

glimpse(score)

## Rows: 267
## Columns: 34
## $ brand_code        <fct> D, A, B, B, B, B, A, B, A, D, B, B, B, B, C, Unknown…
## $ carb_volume       <dbl> 1.0334994, 0.2174353, -0.7241771, -0.9752738, 0.3429…
## $ fill_ounces       <dbl> 0.67313461, -0.24834969, -0.63230144, -0.40193036, 2…
## $ pc_volume         <dbl> -0.118877757, -0.839403309, 0.435372667, -1.51558883…
## $ carb_pressure     <dbl> -0.79273171, -1.41774192, -0.50863617, -0.96318904, …
## $ carb_temp         <dbl> -1.61679032, -1.51713116, -0.17173249, -0.52053956, …
## $ psc               <dbl> 3.09492882, -0.86905526, -0.33779966, -1.64550575, -…
## $ psc_fill          <dbl> 1.74674321, 0.21180698, -0.81148384, 0.04125851, 0.8…
## $ psc_co2           <dbl> -0.37842496, 0.55652872, -0.84590180, -0.84590180, 0…
## $ mnf_flow          <dbl> 0.1327827, 0.1327827, 0.1327827, 0.1327827, 0.132782…
## $ carb_pressure1    <dbl> -1.2724642, -0.8041975, -0.5062096, 0.4728935, -1.61…
## $ fill_pressure     <dbl> -0.60319819, -0.54008538, -0.66631100, -2.49658257, …
## $ hyd_pressure1     <dbl> -1.003715, -1.003715, -1.003715, -1.003715, -1.00371…
## $ hyd_pressure2     <dbl> 0.4627655, -1.2874501, -1.2874501, -1.2874501, -1.28…
## $ hyd_pressure3     <dbl> 0.4443658, -1.2877894, -1.2877894, -1.2877894, -1.28…
## $ hyd_pressure4     <dbl> -0.0234465, 1.2048630, 0.1300922, 2.7402499, -0.1769…
## $ filler_level      <dbl> 1.2824371, 0.6824101, 0.6441105, 0.6951767, 0.427079…
## $ filler_speed      <dbl> 0.3824942, 0.4165857, 0.4139633, 0.3772494, 0.424453…
## $ temperature       <dbl> 0.02678116, -0.26390482, -0.26390482, 6.13118672, 0.…
## $ usage_cont        <dbl> 0.22338460, -1.14200886, 1.07087019, -0.96713088, 0.…
## $ carb_flow         <dbl> 0.44622135, 0.41444934, 0.54527528, -2.28430282, 0.6…
## $ density           <dbl> -0.7811071, 0.8635932, -0.7280523, -1.1524911, -0.78…
## $ mfr               <dbl> 0.3087193, 0.4241328, 0.4100580, 0.2580500, 0.652144…
## $ balling           <dbl> -0.8626157, 0.7984084, -0.8088261, -1.2305369, -0.86…
## $ pressure_vacuum   <dbl> 2.4829749, 1.4310197, 1.7816714, 2.1323231, 2.132323…
## $ oxygen_filler     <dbl> -0.541796037, -0.363955833, -0.008275424, -0.2883737…
## $ bowl_setpoint     <dbl> 1.3506843, 0.6964944, 0.6964944, 0.6964944, 0.696494…
## $ pressure_setpoint <dbl> -1.1810200, -0.7882774, -0.7882774, -0.7882774, 1.17…
## $ air_pressurer     <dbl> -0.19293045, 3.60021192, 3.10545422, 2.94053498, 2.4…
## $ alch_rel          <dbl> -0.6672780, 0.4815235, -0.7465057, -0.8257334, -0.78…
## $ carb_rel          <dbl> -0.75192244, 1.11464367, -0.75192244, 0.49245496, -0…
## $ balling_lvl       <dbl> -0.6576926, 1.1379026, -0.6807130, -0.6576926, -0.68…
## $ mnf_flow_off      <dbl> 1.081412, 1.081412, 1.081412, 1.081412, 1.081412, 1.…
## $ ph                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

skim(train)

Data summary
Name	train
Number of rows	2567
Number of columns	34
_______________________
Column type frequency:
factor	1
numeric	33
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
brand_code	0	1	FALSE	5	B: 1235, D: 615, C: 304, A: 293

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carb_volume	1	0.00	1.00	-3.11	-0.72	-0.22	0.78	3.11	▁▆▇▅▁
fill_ounces	1	0.00	1.00	-3.93	-0.63	-0.02	0.60	3.98	▁▂▇▂▁
pc_volume	1	0.00	1.00	-3.29	-0.63	-0.10	0.56	3.34	▁▃▇▂▁
carb_pressure	1	0.00	1.00	-3.18	-0.74	0.00	0.68	3.18	▁▅▇▃▁
carb_temp	1	0.00	1.00	-3.11	-0.67	-0.07	0.68	3.22	▁▃▇▃▁
psc	1	0.00	1.00	-1.69	-0.71	-0.17	0.56	3.79	▆▇▃▁▁
psc_fill	1	0.00	1.00	-1.66	-0.81	-0.13	0.55	3.62	▆▇▃▁▁
psc_co2	1	0.00	1.00	-1.31	-0.85	-0.38	0.56	4.30	▇▅▂▁▁
mnf_flow	1	0.00	1.00	-4.27	0.07	0.13	0.19	3.00	▁▁▂▇▁
carb_pressure1	1	0.00	1.00	-3.61	-0.76	0.13	0.60	3.75	▁▃▇▂▁
fill_pressure	1	0.00	1.00	-4.20	-0.60	-0.48	0.66	3.94	▁▁▇▂▁
hyd_pressure1	1	0.00	1.00	-1.07	-1.00	-0.08	0.62	3.67	▇▅▂▁▁
hyd_pressure2	1	0.00	1.00	-1.29	-1.29	0.46	0.83	2.35	▇▂▇▃▁
hyd_pressure3	1	0.00	1.00	-1.36	-1.29	0.44	0.80	1.85	▇▁▃▇▁
hyd_pressure4	1	0.00	1.00	-2.63	-0.79	-0.02	0.44	3.51	▂▆▇▂▁
filler_level	1	0.00	1.00	-3.42	-0.68	0.58	0.68	3.31	▁▃▅▇▁
filler_speed	1	0.00	1.00	-3.54	0.26	0.38	0.40	0.44	▁▁▁▁▇
temperature	1	0.00	1.00	-1.72	-0.55	-0.26	0.32	7.44	▇▃▁▁▁
usage_cont	1	0.00	1.00	-3.00	-0.88	0.27	0.92	1.65	▁▃▅▃▇
carb_flow	1	0.00	1.00	-2.29	-1.22	0.52	0.67	2.46	▂▅▆▇▁
density	1	0.00	1.00	-2.48	-0.73	-0.52	1.18	1.98	▁▅▇▂▆
mfr	1	0.00	1.00	-9.49	0.04	0.26	0.35	2.29	▁▁▁▂▇
balling	1	0.00	1.00	-2.19	-0.76	-0.59	1.17	1.95	▁▇▁▁▅
pressure_vacuum	1	0.00	1.00	-2.43	-0.67	-0.32	0.38	2.83	▂▇▅▃▁
oxygen_filler	1	0.00	1.00	-0.98	-0.54	-0.29	0.28	7.86	▇▁▁▁▁
bowl_setpoint	1	0.00	1.00	-2.57	-0.61	0.70	0.70	2.00	▁▂▃▇▁
pressure_setpoint	1	0.00	1.00	-1.77	-0.79	-0.79	1.18	2.16	▁▇▁▆▁
air_pressurer	1	0.00	1.00	-1.68	-0.52	-0.19	0.14	4.42	▅▇▁▁▁
alch_rel	1	0.00	1.00	-3.20	-0.71	-0.67	0.66	3.41	▁▇▂▃▁
carb_rel	1	0.00	1.00	-3.71	-0.75	-0.29	0.80	4.85	▁▇▇▂▁
balling_lvl	1	0.00	1.00	-2.36	-0.77	-0.66	1.25	1.85	▁▇▂▁▆
mnf_flow_off	1	0.00	1.00	-0.92	-0.92	-0.92	1.08	1.08	▇▁▁▁▇
ph	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁

3.2 Distribution of Target Variable (pH)

Understanding the distribution of the response is a key first step in APM. We examine central tendency, spread, skewness, and potential outliers in pH.

# Summary statistics for pH
summary(train$ph)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.880   8.440   8.540   8.546   8.680   9.360

# Skewness
library(e1071)
skewness(train$ph, na.rm = TRUE)

## [1] -0.2906437

We also visualize the distribution:

ggplot(train, aes(x = ph)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "white", alpha = 0.7) +
  labs(title = "Distribution of pH in Training Data - Frequency",
       x = "pH",
       y = "Frequency") +
  theme_minimal()

The histogram above illustrates the distribution of the target variable pH in the training dataset. We can observe that the pH values are approximately normally distributed, with a slight skew towards higher values. The majority of pH values fall within the range of 6.5 to 8.5, indicating that most manufacturing processes maintain a neutral to slightly alkaline pH level. This insight is crucial for understanding the typical operating conditions and will inform our modeling approach.

ggplot(train, aes(x = ph)) +
geom_histogram(bins = 30, aes(y = ..density..), alpha = 0.6) +
geom_density(color = "blue") +
labs(
title = "Distribution of pH (Training Data) - Density",
x = "pH",
y = "Density"
)+  theme_minimal()

ggplot(train, aes(y = ph)) +
geom_boxplot() +
labs(
title = "Boxplot of pH (Training Data)",
y = "pH"
)+theme_minimal()

3.3 pH by Brand

APM emphasizes understanding how the outcome behaves across important categorical factors. Here we examine pH across brand_code.

brand_ph_summary <- train |>
group_by(brand_code) |>
summarise(
n       = n(),
mean_ph = mean(ph, na.rm = TRUE),
sd_ph   = sd(ph, na.rm = TRUE),
min_ph  = min(ph, na.rm = TRUE),
max_ph  = max(ph, na.rm = TRUE),
.groups = "drop"
)

brand_ph_summary

## # A tibble: 5 × 6
##   brand_code     n mean_ph sd_ph min_ph max_ph
##   <fct>      <int>   <dbl> <dbl>  <dbl>  <dbl>
## 1 A            293    8.50 0.163   8.06   8.86
## 2 B           1235    8.57 0.169   8      8.94
## 3 C            304    8.41 0.177   7.88   9.36
## 4 D            615    8.60 0.135   8.16   8.92
## 5 Unknown      120    8.49 0.177   8.12   8.88

ggplot(train, aes(x = brand_code, y = ph)) +
geom_boxplot() +
labs(
title = "pH by Brand Code",
x = "Brand Code",
y = "pH"
)+theme_minimal()

3.4 Relationships Between Predictors and pH

We explore the relationship between each predictor and the response. For numeric predictors, we calculate Pearson correlations with pH and visualize the most important ones.

# Select numeric predictors (including engineered ones) but keep ph for correlation

numeric_vars <- train |>
select(where(is.numeric))

# Correlation matrix for numeric variables

cor_mat <- cor(numeric_vars, use = "pairwise.complete.obs")

# Correlations of predictors with pH (exclude pH itself)

corr_with_ph <- cor_mat[, "ph"]
corr_with_ph <- corr_with_ph[names(corr_with_ph) != "ph"]

# Create a tidy data frame of correlations

corr_ph_df <- tibble(
variable    = names(corr_with_ph),
correlation = as.numeric(corr_with_ph)
) |>
mutate(abs_corr = abs(correlation)) |>
arrange(desc(abs_corr))

# Display top 15 predictors by absolute correlation with pH

head(corr_ph_df, 15)

## # A tibble: 15 × 3
##    variable          correlation abs_corr
##    <chr>                   <dbl>    <dbl>
##  1 mnf_flow_off            0.468    0.468
##  2 bowl_setpoint           0.349    0.349
##  3 filler_level            0.322    0.322
##  4 usage_cont             -0.318    0.318
##  5 pressure_setpoint      -0.306    0.306
##  6 hyd_pressure3          -0.240    0.240
##  7 pressure_vacuum         0.221    0.221
##  8 fill_pressure          -0.212    0.212
##  9 hyd_pressure2          -0.201    0.201
## 10 oxygen_filler           0.166    0.166
## 11 carb_rel                0.162    0.162
## 12 temperature            -0.158    0.158
## 13 carb_flow               0.156    0.156
## 14 alch_rel                0.147    0.147
## 15 hyd_pressure4          -0.140    0.140

We now visualize the top correlated predictors:

top_corr <- corr_ph_df |>
slice(1:15)

ggplot(top_corr, aes(x = reorder(variable, correlation), y = correlation)) +
geom_col() +
coord_flip() +
labs(
title = "Top Correlations with pH",
x = "Predictor",
y = "Correlation with pH"
)

For a more detailed view, we create scatterplots (for numeric variables) and boxplots (for factors) using caret::featurePlot, as recommended in APM:

# Identify top numeric predictors by absolute correlation

top_numeric_vars <- corr_ph_df |>
arrange(desc(abs_corr)) |>
slice(1:6) |>
pull(variable)

top_numeric_vars

## [1] "mnf_flow_off"      "bowl_setpoint"     "filler_level"     
## [4] "usage_cont"        "pressure_setpoint" "hyd_pressure3"

# Scatterplots of top predictors vs pH

featurePlot(
x = train[, top_numeric_vars],
y = train$ph,
plot = "scatter",
layout = c(3, 2),
type = c("p", "smooth")
)

3.5 Correlation Structure Among Predictors

The importance of identifying highly correlated predictors, as redundancy can affect some models and variable importance measures.

# Correlation matrix among numeric predictors (excluding pH)

pred_numeric <- train |>
select(where(is.numeric), -ph)

pred_cor <- cor(pred_numeric, use = "pairwise.complete.obs")

# Visual correlation heatmap

corrplot(pred_cor,
order = "hclust",
type = "lower",
tl.cex = 0.6,
main = "Correlation Heatmap of Numeric Predictors")

Identify highly correlated variables:

# Find predictors with correlation > 0.9

high_corr_idx <- findCorrelation(pred_cor, cutoff = 0.9, names = TRUE)
high_corr_idx

## [1] "balling"       "hyd_pressure3" "balling_lvl"   "density"      
## [5] "bowl_setpoint"

# Extract highly correlated pairs for reporting

pred_cor_abs <- abs(pred_cor)
diag(pred_cor_abs) <- 0

high_pairs_idx <- which(pred_cor_abs > 0.9, arr.ind = TRUE)

high_pairs <- tibble(
var1 = rownames(pred_cor_abs)[high_pairs_idx[, "row"]],
var2 = colnames(pred_cor_abs)[high_pairs_idx[, "col"]],
r    = pred_cor[high_pairs_idx]
) |>
filter(var1 < var2) |>
arrange(desc(abs(r)))

high_pairs

## # A tibble: 8 × 3
##   var1          var2              r
##   <chr>         <chr>         <dbl>
## 1 balling       balling_lvl   0.979
## 2 balling       density       0.955
## 3 balling_lvl   density       0.949
## 4 bowl_setpoint filler_level  0.938
## 5 hyd_pressure2 hyd_pressure3 0.926
## 6 alch_rel      balling       0.924
## 7 alch_rel      balling_lvl   0.923
## 8 alch_rel      density       0.901

3.6 Train vs Score: Consistency Check

Before modeling, confirming that the scoring set is compatible with the training set. We compare distributions of key predictors and brand composition across train and score.

3.6.1 Brand Composition

# Brand distribution in train

brand_dist_train <- train |>
count(brand_code) |>
mutate(pct_train = 100 * n / sum(n))

# Brand distribution in score

brand_dist_score <- score |>
count(brand_code) |>
mutate(pct_score = 100 * n / sum(n))

brand_compare <- brand_dist_train |>
full_join(brand_dist_score, by = "brand_code",
suffix = c("_train", "_score"))

brand_compare

## # A tibble: 5 × 5
##   brand_code n_train pct_train n_score pct_score
##   <fct>        <int>     <dbl>   <int>     <dbl>
## 1 A              293     11.4       35     13.1 
## 2 B             1235     48.1      129     48.3 
## 3 C              304     11.8       31     11.6 
## 4 D              615     24.0       64     24.0 
## 5 Unknown        120      4.67       8      3.00

Visual comparison:

brand_compare_long <- brand_compare |>
select(brand_code, pct_train, pct_score) |>
pivot_longer(cols = starts_with("pct"),
names_to = "dataset",
values_to = "pct")

ggplot(brand_compare_long,
aes(x = brand_code, y = pct, fill = dataset)) +
geom_col(position = "dodge") +
labs(
title = "Brand Distribution: Train vs Score",
x = "Brand Code",
y = "Percent of Observations"
)

3.6.2 Key Predictor Distributions

We compare the distributions of selected key predictors by overlaying the training and scoring datasets.

plot_train_score <- function(var_name) {
df <- bind_rows(
train |> mutate(dataset = "train"),
score |> mutate(dataset = "score")
) |>
select(dataset, all_of(var_name)) |>
filter(!is.na(.data[[var_name]]))

ggplot(df, aes_string(x = var_name, fill = "dataset")) +
geom_density(alpha = 0.4) +
labs(
title = paste("Distribution of", var_name, "- Train vs Score"),
x = var_name,
y = "Density"
)
}

# Example predictors to compare

plot_train_score("bowl_setpoint")

plot_train_score("filler_level")

plot_train_score("pressure_setpoint")

plot_train_score("mnf_flow")

plot_train_score("temperature")

3.7 EDA Summary

This exploratory analysis provided key insights into the manufacturing process pH prediction task:

The response ph shows moderate variation across the training set, with no extreme outliers after preprocessing. This suggests that standard regression models are appropriate for modeling the outcome.
Several process variables (e.g., temperature, flow, and pressure-related measures) display clear nonlinear relationships with ph, indicating that more flexible models (such as MARS, random forest, or Cubist) may capture additional signal beyond a simple linear model.
The brand composition and key predictor distributions are broadly consistent between the training and scoring datasets, which supports the assumption that the trained models can be reasonably applied to the evaluation data.

4. Predictive Modeling

4.1 Train / Test Split

To evaluate out-of-sample performance, we split the processed training data into a modeling set and a held-out test set. Specifically, 80% of the observations are used for model fitting and hyperparameter tuning, and the remaining 20% are reserved as an independent test set that is only used for final performance evaluation.

idx <- createDataPartition(train$ph, p = 0.80, list = FALSE)

train_set <- train[idx, ]
test_set  <- train[-idx, ]

nrow(train_set); nrow(test_set)

## [1] 2055

## [1] 512

4.2 Model Building

We compare several modeling approaches to predict ph, ranging from a simple linear regression baseline to more flexible, nonlinear methods. All models are fit using the caret framework with repeated 10-fold cross-validation (10 folds, 3 repeats) to obtain stable estimates of predictive performance. The candidate models include: - Linear regression (baseline) - Elastic net (regularized linear model) - Multivariate Adaptive Regression Splines (MARS) - Random forest - Cubist

Each model is trained on the same training set and tuned over a grid of hyperparameters, using RMSE as the primary performance metric.

# Cross-validation setup
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  summaryFunction = defaultSummary,
  savePredictions = "final"
)

The below five models were chosen because they represent a balanced comparison of linear, regularized, and tree-based learning approaches.

Model 1: Linear Regression

Linear regression serves as a baseline model. It assumes a linear relationship between PH and process variables, making it useful for comparison.

## -- Model 1: Linear Regression --

mod_lm <- train(
  ph ~ ., 
  data = train_set,
  method = "lm",
  trControl = ctrl
)

Model 2: Elastic Net

Elastic net was chosen because it is a penalized regression method capable of handling collinearity and feature redundancy. Elastic net combines L1 and L2 regularization, making it more robust than standard linear regression.

## -- Model 2: Elastic Net --

enet_model <- train(
  ph ~ ., 
  data = train_set, 
  method = "enet", 
  trControl = ctrl, 
  preProc = c("center", "scale"),
  tuneLength = 10 
)

Model 3: Cubist

Cubist was included as the most advanced model in this analysis. It fits linear models within partitioned regions, allowing local linear behavior rather than forcing a single global relationship.

## -- Model 3: Cubist --
mod_cubist <- train(
  ph ~ .,
  data = train_set,
  method = "cubist",
  trControl = ctrl,
  tuneLength = 10 
)

Model 4: Multivariate Adaptive Regression Splines (MARS)

MARS was selected because it can model non-linear relationships through hinge functions without fully committing to deep tree structure. It is more flexible than regression but lighter weight than Random Forest or Cubist.

## -- Model 4: MARS --
grid_mars <- expand.grid(
  degree = c(1, 2),
  nprune = seq(5, 45, by = 5)
)

mod_mars <- train(
  ph ~ .,
  data = train_set,
  method = "earth",
  tuneGrid = grid_mars,
  trControl = ctrl
)

Model 5: Random Forest

Random Forest was chosen because it handles complex variable interactions, non-linear patterns, and correlated predictors through ensemble tree learning.

## -- Model 5: Random Forest --
grid_rf <- expand.grid(mtry = c(4, 8, 12))

mod_rf <- train(
  ph ~., 
  data = train_set,
  method = "rf",
  tuneGrid = grid_rf,
  trControl = ctrl,
  importance = TRUE
)

4.3 Model Evaluation

On the held-out test set, the more flexible models (MARS, random forest, and Cubist) generally outperform the linear regression and elastic net baselines. Among all candidates, Cubist achieves the lowest RMSE and highest R-squared, indicating that it provides the best overall predictive accuracy for ph. Random forest performs similarly well, but Cubist offers a slightly lower test error and is therefore selected as the preferred model.

Resampling Performance

## Compare Resampling Performance
resamps <- resamples(
  list(
    LR = mod_lm,
    ENET = enet_model,
    Cubist = mod_cubist,
    MARS = mod_mars,
    RF = mod_rf
  )
)

summary(resamps)

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: LR, ENET, Cubist, MARS, RF 
## Number of resamples: 30 
## 
## MAE 
##              Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## LR     0.09502638 0.10073187 0.10473973 0.10401172 0.10758205 0.11257155    0
## ENET   0.09566784 0.09961397 0.10152405 0.10395614 0.10754995 0.11744316    0
## Cubist 0.06030879 0.06690243 0.06940446 0.06894201 0.07129079 0.07660211    0
## MARS   0.08681832 0.09275998 0.09651002 0.09648758 0.10045255 0.10700731    0
## RF     0.06154997 0.06915886 0.07115037 0.07066135 0.07371902 0.07717479    0
## 
## RMSE 
##              Min.    1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## LR     0.12466236 0.12906901 0.13532887 0.13458006 0.13780751 0.1468140    0
## ENET   0.12317109 0.12891313 0.13151916 0.13422831 0.13723333 0.1545302    0
## Cubist 0.08333979 0.09057539 0.09472819 0.09560511 0.09888564 0.1118435    0
## MARS   0.11211048 0.11998577 0.12495943 0.12512709 0.12919504 0.1403545    0
## RF     0.08323433 0.09348964 0.09655022 0.09668645 0.10299622 0.1094346    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR     0.3232769 0.3673798 0.4013862 0.4002920 0.4232234 0.4945814    0
## ENET   0.2885415 0.3809034 0.4109300 0.4028178 0.4298928 0.4885498    0
## Cubist 0.5844056 0.6673708 0.6996149 0.6962508 0.7234493 0.7790068    0
## MARS   0.3474497 0.4367629 0.4920285 0.4804481 0.5210819 0.5985185    0
## RF     0.6133050 0.6761162 0.7017627 0.7082329 0.7573772 0.7977818    0

Based on the cross-validated resampling, the two tree-based models (Random Forest and Cubist) performed substantially better than linear models. Random Forest showed the highest \(R^2\), indicating the strongest ability to explain variance in PH, while Cubist produced the lowest MAE and RMSE, meaning it generated more accurate predictions on average. Although each model led in a different metric, the performance gap between them was small, and both emerged as the top-performing predictive methods.

resamp_summary <- summary(resamps)$statistics

model_compare <- data.frame(
  "MAE.Mean" = round(resamp_summary$MAE[,"Mean"], 5),
  "RMSE.Mean" = round(resamp_summary$RMSE[,"Mean"], 5),
  "R-Squared.Mean" = round(resamp_summary$Rsquared[,"Mean"], 5)
) %>%
  arrange(desc("R-Squared (Mean)"))

model_compare

##        MAE.Mean RMSE.Mean R.Squared.Mean
## LR      0.10401   0.13458        0.40029
## ENET    0.10396   0.13423        0.40282
## Cubist  0.06894   0.09561        0.69625
## MARS    0.09649   0.12513        0.48045
## RF      0.07066   0.09669        0.70823

Test Set Performance

## Extract actual PH values 
test_y <- test_set$ph

## Predict using each trained model
pred_lm     <- predict(mod_lm, newdata = test_set)
pred_enet   <- predict(enet_model, newdata = test_set)
pred_mars   <- predict(mod_mars, newdata = test_set)
pred_rf     <- predict(mod_rf, newdata = test_set)
pred_cubist <- predict(mod_cubist, newdata = test_set)

## Test performance summary table
test_metrics <- rbind(
  postResample(pred_lm,     test_y),
  postResample(pred_enet,   test_y),
  postResample(pred_mars,   test_y),
  postResample(pred_rf,     test_y),
  postResample(pred_cubist, test_y)
) %>% as.data.frame()

test_results <- tibble::tibble(
  Model = c("Linear Regression","Elastic Net","MARS","Random Forest","Cubist"),
  RMSE = test_metrics$RMSE,
  Rsquared = test_metrics$Rsquared,
  MAE = test_metrics$MAE
) %>%
  arrange(desc(Rsquared))

test_results

## # A tibble: 5 × 4
##   Model               RMSE Rsquared    MAE
##   <chr>              <dbl>    <dbl>  <dbl>
## 1 Random Forest     0.0998    0.656 0.0700
## 2 Cubist            0.0998    0.652 0.0676
## 3 MARS              0.126     0.446 0.0966
## 4 Linear Regression 0.130     0.407 0.103 
## 5 Elastic Net       0.131     0.403 0.103

The test set results show that Cubist and Random Forest remain the strongest predictive models, reinforcing the resampling evaluation findings. Random Forest achieves the highest \(R^2\), while Cubist produced the lowest RMSE and MAE.

Visuals

##  Resampling Performance
resamp_stats <- summary(resamps)$statistics

resamp_results <- tibble::tibble(
  Model    = rownames(resamp_stats$RMSE),
  RMSE     = resamp_stats$RMSE[,"Mean"],
  MAE      = resamp_stats$MAE[,"Mean"],
  Rsquared = resamp_stats$Rsquared[,"Mean"]
)

## Order models by decreasing r-squared
resamp_results <- resamp_results %>%
  arrange(desc(Rsquared)) %>%
  mutate(Model = factor(Model, levels = Model))

## Long format for ggplot
resamp_plot_data <- resamp_results %>%
  tidyr::pivot_longer(
    cols = c(RMSE, MAE, Rsquared),
    names_to  = "Metric",
    values_to = "Value"
  )

## Resampling Performance Bar Chart
ggplot(resamp_plot_data, aes(x = Model, y = Value, fill = Model)) +
  geom_col(width = 0.75, alpha = 0.85) +
  facet_wrap(~ Metric, scales = "free_y", ncol = 3) +
  labs(
    title = "Model Performance - Cross-Validated Resampling",
    x = "Model",
    y = "Metric Value"
  ) +
  theme_minimal(base_size = 14) +
  scale_fill_brewer(palette = "Set2") +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    strip.text  = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

## ---- ##

## Test Set Performance
model_plot_data <- test_results %>%
  mutate(Model = factor(Model, levels = Model[order(-Rsquared)])) %>%  
  pivot_longer(cols = c("RMSE","MAE","Rsquared"),
               names_to = "Metric", values_to = "Value")

ggplot(model_plot_data, aes(x = Model, y = Value, fill = Model)) +
  geom_col(width = 0.75, alpha = 0.85) +
  facet_wrap(~ Metric, scales = "free_y", ncol = 3) +
  labs(title = "Model Performance Comparison - Test Set",
       y = "Metric Value", x = "Model") +
  theme_minimal(base_size = 14) +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none",
        plot.title = element_text(face="bold", size=18, hjust=0.5),
        strip.text = element_text(size=14, face="bold"),
        axis.text.x = element_text(angle =90, hjust=1, size = 8))

4.4 Final Model

The difference between the top two models (Random Forest and Cubist) is minimal, but Cubist demonstrates the lowest overall test error, making it the final model selected for PH prediction.

Based on both cross-validated resampling and held-out test-set performance, the Cubist model is selected as the final model for ph prediction. While random forest achieves comparable accuracy, Cubist attains the lowest RMSE and highest R-squared, and its rule-based structure provides a compact representation of predictor effects. We therefore use the Cubist model for variable importance analysis and for generating final predictions on the evaluation dataset.

Variable Importance

cubist_imp <- varImp(mod_cubist, scale = TRUE)
cubist_imp

## cubist variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                 Overall
## pressure_vacuum  100.00
## mnf_flow_off      99.02
## balling_lvl       97.06
## balling           95.10
## alch_rel          90.20
## density           85.29
## air_pressurer     75.49
## temperature       73.53
## oxygen_filler     69.61
## hyd_pressure3     58.82
## brand_codeC       55.88
## bowl_setpoint     54.90
## carb_pressure1    53.92
## carb_rel          53.92
## hyd_pressure2     51.96
## carb_flow         51.96
## usage_cont        45.10
## hyd_pressure1     43.14
## mnf_flow          42.16
## filler_level      35.29

plot(cubist_imp, top = 10, main = "Cubist - Top 10 Variable Importance")

The Cubist model identifies several key drivers of PH variation. The strongest predictor is pressure_vacuum.

4.5 Final Predictions

In the final step, we apply the selected Cubist model to the evaluation (scoring) dataset to generate predicted pH values. These predictions are based on the same set of processed predictors used during model training, ensuring that the scoring data are handled consistently with the training data. The resulting predicted pH values are exported for downstream use (e.g., submission, reporting, or further quality monitoring) and represent the model’s best estimate of process pH given the observed manufacturing conditions.

final_predictions <- predict(mod_cubist, newdata = score)
eval_output <- score %>% mutate(predicted_ph = final_predictions)

write.csv(eval_output, "PH_predictions_final.csv", row.names = FALSE)

ggplot(eval_output, aes(x=predicted_ph)) +
  geom_histogram(bins=30, fill="steelblue", alpha=.7) +
  labs(
    title="Distribution of Predicted PH – Final Cubist Model",
    x="Predicted PH", y="Frequency"
  )+
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = 'bold')
  )

ph_compare <- bind_rows(
  train %>%
    transmute(dataset = "Training (Observed)", value = ph),
  eval_output %>%
    transmute(dataset = "Scoring (Predicted)", value = predicted_ph)
)

ggplot(ph_compare, aes(x = value, fill = dataset)) +
  geom_density(alpha = 0.45) +
  labs(
    title = "Distribution of pH:\nTraining vs Predicted (Cubist)",
    x     = "pH",
    y     = "Density",
    fill  = "Source"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    legend.title = element_text(size = 12, face='bold')
  )

5. Discussion and Conclusion

In this project, we developed and compared several models for predicting manufacturing process pH using historical process data. After cleaning and exploring the data, we evaluated a baseline linear regression model alongside more flexible methods, including elastic net, MARS, random forest, and Cubist, within a consistent cross-validation framework. The results showed that the nonlinear models substantially improved predictive performance relative to the linear baseline.

Both random forest and Cubist emerged as strong candidates, with Cubist achieving the lowest RMSE and MAE on the held-out test set while maintaining competitive \(R^2\). Based on this balance of accuracy and stability, we selected the Cubist model as the final model and used it to generate pH predictions for the evaluation dataset. The variable importance results highlight a small set of key process variables as the primary drivers of pH, providing insight into which aspects of the manufacturing conditions may be most critical for quality control.

Overall, the workflow demonstrates an end-to-end, data-driven approach to modeling a critical quality attribute. The same framework could be extended to incorporate additional process variables, alternative model types, or more detailed diagnostics if richer data become available.

6. Reproducibility

This analysis is fully reproducible from this R Markdown file:

All data ingestion, cleaning, modeling, and evaluation steps are implemented in code chunks within the document.
Data are loaded using relative paths (for example, from a data/ folder), so the analysis can be re-run by placing the files in the same project structure.
Random seeds (for example, set.seed(2025)) are set before model training to make results replicable across runs.
We assume that the following R packages are installed: tidyverse, skimr, corrplot, caret, janitor, forcats, recipes, earth, elasticnet, Cubist, randomForest, and ggplot2.

Manufacturing Process pH Prediction

Technical Report – Group Project

Alinzon Simon - Data Lead

Kristin Lussi - Modeling Lead

Jerald Melukkaran - Business Report Lead

Joe Garcia - Technical Report Lead

2025-12-07

1. Introduction

2. Data Preparation

2.1 Load Required Libraries

2.2 Data Ingestion

2.2 Cleaning Column Names

2.3 Data Cleaning and Imputation

3 Exploratory Data Analysis (EDA)

3.1 Overview of Cleaned Training Data

3.2 Distribution of Target Variable (pH)

3.3 pH by Brand

3.4 Relationships Between Predictors and pH

3.5 Correlation Structure Among Predictors

3.6 Train vs Score: Consistency Check

3.6.1 Brand Composition

3.6.2 Key Predictor Distributions

3.7 EDA Summary

4. Predictive Modeling

4.1 Train / Test Split

4.2 Model Building

Model 1: Linear Regression

Model 2: Elastic Net

Model 3: Cubist

Model 4: Multivariate Adaptive Regression Splines (MARS)

Model 5: Random Forest

4.3 Model Evaluation

Resampling Performance

Test Set Performance

Visuals

4.4 Final Model

Variable Importance

4.5 Final Predictions

5. Discussion and Conclusion

6. Reproducibility