This document presents a technical report on the development of predictive models for manufacturing process pH. The pH of the final product is a critical quality attribute in many manufacturing settings, as it can influence product stability, performance, and regulatory compliance. Our goal is to use historical process data to build a robust, data-driven model that accurately predicts pH and helps identify the key process factors that drive variation in this outcome.
In this project, the response variable pH represents the acidity or alkalinity of the manufacturing process output. It is measured on the standard pH scale, where lower values indicate more acidic conditions and higher values indicate more basic (alkaline) conditions. Because pH affects product quality and stability, it is treated as a critical quality attribute that the process must keep within an acceptable specification range.
The analysis follows a complete end-to-end workflow. We begin by ingesting and cleaning the raw student datasets, standardizing variable names, handling missing values, and preparing the data for modeling. We then conduct exploratory data analysis (EDA) to understand the distribution of pH, examine its relationships with key predictors, and assess the consistency between the training and evaluation datasets. Next, we construct several competing models—ranging from a baseline linear regression to more flexible approaches such as elastic net, MARS, random forest, and Cubist—using a consistent resampling framework for hyperparameter tuning and performance estimation.
Model performance is assessed using cross-validated RMSE and R-squared on the training data, as well as RMSE and R-squared on a held-out test set. Based on these metrics, we select a final model and examine its variable importance to interpret which process features are most influential for pH. The report concludes with final predictions for the evaluation dataset and a brief discussion of reproducibility, including package requirements and session information, so that the entire workflow can be replicated from this R Markdown file.
To ensure a reproducible and stable analytical workflow, we began by configuring the R environment with all required packages for data handling, visualization, and preprocessing. This setup step guarantees that all dependencies are consistently available throughout the analysis, reduces the risk of version related issues, and establishes a standardized foundation for the subsequent data processing stages.
tidyverse : data manipulatio
skimr : quick data summaries
corrplot : correlation heatmaps
caret : pre-processing, modeling framework
readxl : read excel files**
janitor : data cleaning
forcats : categorical variable handling
*Not used in Mac
# Load required packages for the analysis
library(tidyverse)
library(skimr)
library(corrplot)
library(caret)
library(janitor)
library(forcats)
library(recipes)
library(earth)
library(ggplot2)
library(elasticnet)
library(Cubist)
library(randomForest)
# Note (macOS):
# Input data are read from CSV files using readr::read_csv instead of readxl.
# This avoids platform-specific issues with readxl on macOS.
In this phase, we imported the two primary datasets required for the analysis. StudentData was designated as the training dataset and serves as the foundation for building and calibrating the predictive models. StudentEvaluation, on the other hand, functions as the test dataset and is reserved for out of sample evaluation to assess model generalization.
train_raw <- read_excel("data/StudentData.xlsx")
eval_raw <- read_excel("data/StudentEvaluation.xlsx")
cat("Train rows x cols: ", paste(dim(train_raw), collapse = " x "), "\n")
cat("Eval rows x cols: ", paste(dim(eval_raw), collapse = " x "), "\n")
glimpse(train_raw)
train_raw <- readr::read_csv("data/StudentData.csv", show_col_types = FALSE)
eval_raw <- readr::read_csv("data/StudentEvaluation.csv", show_col_types = FALSE)
cat("Train rows x cols: ", paste(dim(train_raw), collapse = " x "), "\n")
## Train rows x cols: 2571 x 33
cat("Eval rows x cols: ", paste(dim(eval_raw), collapse = " x "), "\n")
## Eval rows x cols: 267 x 33
glimpse(train_raw)
## Rows: 2,571
## Columns: 33
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
We will now take a look at the structure of both datasets to understand the variables and their types.
skim(train_raw)
| Name | train_raw |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
The training dataset consists of 2,571 observations and 33 variables
in total. The response variable is pH, which measures the
acidity level of the manufacturing process output. The remaining 32
variables are used as predictors and include process measurements such
as temperatures, concentrations, and timing variables, as well as
product characteristics (e.g., brand or batch indicators). These
predictors capture different aspects of the manufacturing conditions
that are expected to influence the final pH.
skim(eval_raw)
| Name | eval_raw |
| Number of rows | 267 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| PH | 267 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 1 | 1.00 | 5.37 | 0.11 | 5.15 | 5.29 | 5.34 | 5.47 | 5.67 | ▂▇▃▅▁ |
| Fill Ounces | 6 | 0.98 | 23.97 | 0.08 | 23.75 | 23.92 | 23.97 | 24.01 | 24.20 | ▁▅▇▃▁ |
| PC Volume | 4 | 0.99 | 0.28 | 0.06 | 0.10 | 0.23 | 0.28 | 0.32 | 0.46 | ▁▆▇▅▁ |
| Carb Pressure | 0 | 1.00 | 68.25 | 3.86 | 60.20 | 65.30 | 68.00 | 70.60 | 77.60 | ▃▆▇▃▂ |
| Carb Temp | 1 | 1.00 | 141.23 | 4.30 | 130.00 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▆▇▃▁ |
| PSC | 5 | 0.98 | 0.09 | 0.05 | 0.00 | 0.04 | 0.08 | 0.11 | 0.25 | ▆▇▃▂▁ |
| PSC Fill | 3 | 0.99 | 0.19 | 0.11 | 0.02 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 5 | 0.98 | 0.05 | 0.04 | 0.00 | 0.02 | 0.04 | 0.06 | 0.24 | ▇▃▂▁▁ |
| Mnf Flow | 0 | 1.00 | 21.03 | 117.76 | -100.20 | -100.00 | 0.20 | 141.30 | 220.40 | ▇▁▁▆▂ |
| Carb Pressure1 | 4 | 0.99 | 123.04 | 4.42 | 113.00 | 120.20 | 123.40 | 125.50 | 136.00 | ▃▃▇▂▁ |
| Fill Pressure | 2 | 0.99 | 48.14 | 3.44 | 37.80 | 46.00 | 47.80 | 50.20 | 60.20 | ▁▇▇▂▁ |
| Hyd Pressure1 | 0 | 1.00 | 12.01 | 13.53 | -50.00 | 0.00 | 10.40 | 20.40 | 50.00 | ▁▁▇▆▂ |
| Hyd Pressure2 | 1 | 1.00 | 20.11 | 17.21 | -50.00 | 0.00 | 26.80 | 34.80 | 61.40 | ▁▁▆▇▁ |
| Hyd Pressure3 | 1 | 1.00 | 19.61 | 16.56 | -50.00 | 0.00 | 27.70 | 33.00 | 49.20 | ▁▁▆▃▇ |
| Hyd Pressure4 | 4 | 0.99 | 97.84 | 13.92 | 68.00 | 90.00 | 98.00 | 104.00 | 140.00 | ▅▆▇▂▁ |
| Filler Level | 2 | 0.99 | 110.29 | 15.50 | 69.20 | 100.60 | 118.60 | 120.20 | 153.20 | ▂▃▇▇▁ |
| Filler Speed | 10 | 0.96 | 3581.39 | 911.19 | 1006.00 | 3812.00 | 3978.00 | 3996.00 | 4020.00 | ▁▁▁▁▇ |
| Temperature | 2 | 0.99 | 66.23 | 1.69 | 63.80 | 65.40 | 65.80 | 66.60 | 75.40 | ▇▅▁▁▁ |
| Usage cont | 2 | 0.99 | 20.90 | 3.00 | 12.90 | 18.12 | 21.44 | 23.74 | 24.60 | ▁▃▃▃▇ |
| Carb Flow | 0 | 1.00 | 2408.64 | 1161.36 | 0.00 | 1083.00 | 3038.00 | 3215.00 | 3858.00 | ▂▃▁▆▇ |
| Density | 1 | 1.00 | 1.18 | 0.38 | 0.06 | 0.92 | 0.98 | 1.60 | 1.84 | ▁▁▇▁▅ |
| MFR | 31 | 0.88 | 697.80 | 96.40 | 15.60 | 707.00 | 724.60 | 731.45 | 784.80 | ▁▁▁▁▇ |
| Balling | 1 | 1.00 | 2.20 | 0.92 | 0.90 | 1.50 | 1.65 | 3.24 | 3.79 | ▅▇▁▂▅ |
| Pressure Vacuum | 1 | 1.00 | -5.17 | 0.58 | -6.40 | -5.60 | -5.20 | -4.80 | -3.60 | ▁▇▆▃▁ |
| Oxygen Filler | 3 | 0.99 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.05 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 1 | 1.00 | 109.62 | 15.02 | 70.00 | 100.00 | 120.00 | 120.00 | 130.00 | ▁▂▁▃▇ |
| Pressure Setpoint | 2 | 0.99 | 47.73 | 2.06 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 1 | 1.00 | 142.83 | 1.23 | 141.20 | 142.20 | 142.60 | 142.80 | 147.20 | ▅▇▁▁▁ |
| Alch Rel | 3 | 0.99 | 6.91 | 0.50 | 6.40 | 6.54 | 6.58 | 7.18 | 7.82 | ▇▁▂▁▃ |
| Carb Rel | 2 | 0.99 | 5.44 | 0.13 | 5.18 | 5.34 | 5.40 | 5.56 | 5.74 | ▂▇▂▃▂ |
| Balling Lvl | 0 | 1.00 | 2.05 | 0.88 | 0.00 | 1.38 | 1.48 | 3.08 | 3.42 | ▁▃▇▁▇ |
To further understand the datasets, we will list the variable names present in both the training and evaluation datasets.
names(train_raw)
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
Compare that column sets are identical.
identical(names(train_raw), names(eval_raw))
## [1] TRUE
To ensure consistency and ease of reference throughout the analysis, we standardized the column names in both datasets. This step involved converting all column names to lowercase and replacing spaces with underscores. Such standardization helps prevent potential issues related to case sensitivity and special characters during data manipulation and modeling.
train <- train_raw |>
clean_names()
score <- eval_raw |>
clean_names()
identical(names(train), names(score))
## [1] TRUE
The result is a consistent naming convention across both datasets, facilitating smoother data handling in subsequent steps.
In this step, we addressed missing values and potential anomalies in the datasets to prepare them for analysis. The following actions were taken:
Removed rows with missing target variable (ph) from the training dataset, as these cannot contribute to model training.
Treated missing values in the categorical variable brand_code as an explicit category labeled “Unknown”. This approach allows the model to recognize missingness as a distinct factor rather than ignoring those instances.
Re-coded sentinel values in the mnf_flow variable. Values less than or equal to -99 were identified as indicators of missing or invalid data. We created a new binary indicator variable mnf_flow_off to flag these instances and set the original mnf_flow values to NA for proper handling during imputation.
Defined a preprocessing recipe using the recipes package to systematically handle missing data and scale numeric predictors. The recipe includes median imputation for numeric variables, mode imputation for categorical variables, removal of zero-variance predictors, and centering and scaling of numeric predictors.
train <- train |>
filter(!is.na(ph))
train <- train |>
mutate(
brand_code = fct_explicit_na(brand_code, na_level = "Unknown")
)
score <- score |>
mutate(
brand_code = fct_explicit_na(brand_code, na_level = "Unknown")
)
summary(train$mnf_flow)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.20 -100.00 70.20 24.63 140.80 229.40
table(train$mnf_flow)[1:10] # inspect -100, -100.2
##
## -100.2 -100 0.2 2 4.8 6.8 7.6 8.2 8.6 9.6
## 578 605 78 1 1 1 1 1 1 1
train <- train |>
mutate(
mnf_flow_off = ifelse(mnf_flow <= -99, 1, 0),
mnf_flow = ifelse(mnf_flow <= -99, NA, mnf_flow)
)
score <- score |>
mutate(
mnf_flow_off = ifelse(mnf_flow <= -99, 1, 0),
mnf_flow = ifelse(mnf_flow <= -99, NA, mnf_flow)
)
train_clean_raw <- train
score_clean_raw <- score
ph_recipe <- recipe(ph ~ ., data = train) |>
update_role(ph, new_role = "outcome") |>
update_role(brand_code, new_role = "predictor") |>
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_zv(all_predictors()) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors())
ph_prep <- prep(ph_recipe, training = train)
train <- juice(ph_prep) # fully processed training data
score <- bake(ph_prep, new_data = score) # fully processed scoring data
New csv files with cleaned data can be saved for future use.
write.csv(train, "data/Train_Clean_PH.csv", row.names = FALSE)
write.csv(score, "data/Score_Clean_PH.csv", row.names = FALSE)
The goal of this section is to explore the training dataset to understand the distribution of the target variable (pH) and identify potential predictors through correlation analysis. This will help inform feature selection and engineering for the modeling phase.
The basic structure and summary statistics of the cleaned training and scoring datasets are presented below. This overview provides insights into the number of observations, variables, and key characteristics of the data.
dim(train)
## [1] 2567 34
dim(score)
## [1] 267 34
glimpse(train)
## Rows: 2,567
## Columns: 34
## $ brand_code <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, C, B…
## $ carb_volume <dbl> -0.28475798, 0.53130613, -0.78695127, 0.65685445, 1.…
## $ fill_ounces <dbl> -0.09476889, 0.36597325, 0.98029608, 0.36597325, 3.8…
## $ pc_volume <dbl> -0.22972785, -0.63987315, -0.22972785, 0.26909754, -…
## $ carb_pressure <dbl> 0.002735817, 0.059554926, 0.741384239, -1.474561027,…
## $ carb_temp <dbl> 0.02758583, -0.37105082, 0.92451827, -2.11508613, -1…
## $ psc <dbl> 0.3977850, 0.8064432, 0.1117243, -0.1743364, -1.1959…
## $ psc_fill <dbl> 0.55290392, 0.21180698, 1.23509780, 1.91729168, -0.2…
## $ psc_co2 <dbl> -0.37842496, -0.37842496, 2.42643607, -0.37842496, 1…
## $ mnf_flow <dbl> 0.1327827, 0.1327827, 0.1327827, 0.1327827, 0.132782…
## $ carb_pressure1 <dbl> -0.80419751, -0.20822172, -0.50620961, -1.57045209, …
## $ fill_pressure <dbl> -0.6031982, -0.6031982, -0.6031982, -0.4769726, -0.6…
## $ hyd_pressure1 <dbl> -1.003715, -1.003715, -1.003715, -1.003715, -1.00371…
## $ hyd_pressure2 <dbl> 0.4627655, 0.4627655, 0.4627655, -1.2874501, -1.2874…
## $ hyd_pressure3 <dbl> 0.4443658, 0.4443658, 0.4443658, -1.2877894, -1.2877…
## $ hyd_pressure4 <dbl> 1.6654791, 0.7442469, -1.0982173, -0.3305239, -0.330…
## $ filler_level <dbl> 0.7590093, 0.5930444, 0.6824101, 0.5419783, 0.593044…
## $ filler_speed <dbl> 0.4034736, 0.3824942, 0.4270754, 0.4165857, 0.413963…
## $ temperature <dbl> 0.02678116, 1.18952507, 0.75349611, -0.26390482, -0.…
## $ usage_cont <dbl> -1.61956027, -0.36851010, -1.08820025, -1.20254355, …
## $ carb_flow <dbl> 0.4294009, 0.6275087, 0.4125804, 0.5508821, 0.543406…
## $ density <dbl> -0.7811071, -0.6749974, 1.0758126, 0.9697029, 0.9697…
## $ mfr <dbl> 0.2721248, 0.2974595, 0.4128729, 0.3509438, 0.241160…
## $ balling <dbl> -0.8626157, -0.7550364, 1.0135670, 0.9059877, 0.9059…
## $ pressure_vacuum <dbl> 2.132323, 2.132323, 2.482975, 1.431020, 1.431020, 1.…
## $ oxygen_filler <dbl> -0.541796037, -0.452875935, -0.497335986, -0.3639558…
## $ bowl_setpoint <dbl> 0.6964944, 0.6964944, 0.6964944, 0.6964944, 0.696494…
## $ pressure_setpoint <dbl> -0.5919061, -0.3955348, -0.4937204, -0.7882774, -0.7…
## $ air_pressurer <dbl> -0.1929304, 0.1369080, -0.6876881, 2.7756158, 2.7756…
## $ alch_rel <dbl> -0.6276642, -0.6672780, 1.5114835, 0.4815235, 0.4815…
## $ carb_rel <dbl> -0.90746961, -1.06301679, 3.13675695, -0.12973374, 0…
## $ balling_lvl <dbl> -0.6576926, -0.5656108, 1.4141480, 1.1379026, 1.1379…
## $ mnf_flow_off <dbl> 1.081412, 1.081412, 1.081412, 1.081412, 1.081412, 1.…
## $ ph <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
glimpse(score)
## Rows: 267
## Columns: 34
## $ brand_code <fct> D, A, B, B, B, B, A, B, A, D, B, B, B, B, C, Unknown…
## $ carb_volume <dbl> 1.0334994, 0.2174353, -0.7241771, -0.9752738, 0.3429…
## $ fill_ounces <dbl> 0.67313461, -0.24834969, -0.63230144, -0.40193036, 2…
## $ pc_volume <dbl> -0.118877757, -0.839403309, 0.435372667, -1.51558883…
## $ carb_pressure <dbl> -0.79273171, -1.41774192, -0.50863617, -0.96318904, …
## $ carb_temp <dbl> -1.61679032, -1.51713116, -0.17173249, -0.52053956, …
## $ psc <dbl> 3.09492882, -0.86905526, -0.33779966, -1.64550575, -…
## $ psc_fill <dbl> 1.74674321, 0.21180698, -0.81148384, 0.04125851, 0.8…
## $ psc_co2 <dbl> -0.37842496, 0.55652872, -0.84590180, -0.84590180, 0…
## $ mnf_flow <dbl> 0.1327827, 0.1327827, 0.1327827, 0.1327827, 0.132782…
## $ carb_pressure1 <dbl> -1.2724642, -0.8041975, -0.5062096, 0.4728935, -1.61…
## $ fill_pressure <dbl> -0.60319819, -0.54008538, -0.66631100, -2.49658257, …
## $ hyd_pressure1 <dbl> -1.003715, -1.003715, -1.003715, -1.003715, -1.00371…
## $ hyd_pressure2 <dbl> 0.4627655, -1.2874501, -1.2874501, -1.2874501, -1.28…
## $ hyd_pressure3 <dbl> 0.4443658, -1.2877894, -1.2877894, -1.2877894, -1.28…
## $ hyd_pressure4 <dbl> -0.0234465, 1.2048630, 0.1300922, 2.7402499, -0.1769…
## $ filler_level <dbl> 1.2824371, 0.6824101, 0.6441105, 0.6951767, 0.427079…
## $ filler_speed <dbl> 0.3824942, 0.4165857, 0.4139633, 0.3772494, 0.424453…
## $ temperature <dbl> 0.02678116, -0.26390482, -0.26390482, 6.13118672, 0.…
## $ usage_cont <dbl> 0.22338460, -1.14200886, 1.07087019, -0.96713088, 0.…
## $ carb_flow <dbl> 0.44622135, 0.41444934, 0.54527528, -2.28430282, 0.6…
## $ density <dbl> -0.7811071, 0.8635932, -0.7280523, -1.1524911, -0.78…
## $ mfr <dbl> 0.3087193, 0.4241328, 0.4100580, 0.2580500, 0.652144…
## $ balling <dbl> -0.8626157, 0.7984084, -0.8088261, -1.2305369, -0.86…
## $ pressure_vacuum <dbl> 2.4829749, 1.4310197, 1.7816714, 2.1323231, 2.132323…
## $ oxygen_filler <dbl> -0.541796037, -0.363955833, -0.008275424, -0.2883737…
## $ bowl_setpoint <dbl> 1.3506843, 0.6964944, 0.6964944, 0.6964944, 0.696494…
## $ pressure_setpoint <dbl> -1.1810200, -0.7882774, -0.7882774, -0.7882774, 1.17…
## $ air_pressurer <dbl> -0.19293045, 3.60021192, 3.10545422, 2.94053498, 2.4…
## $ alch_rel <dbl> -0.6672780, 0.4815235, -0.7465057, -0.8257334, -0.78…
## $ carb_rel <dbl> -0.75192244, 1.11464367, -0.75192244, 0.49245496, -0…
## $ balling_lvl <dbl> -0.6576926, 1.1379026, -0.6807130, -0.6576926, -0.68…
## $ mnf_flow_off <dbl> 1.081412, 1.081412, 1.081412, 1.081412, 1.081412, 1.…
## $ ph <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
skim(train)
| Name | train |
| Number of rows | 2567 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 33 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| brand_code | 0 | 1 | FALSE | 5 | B: 1235, D: 615, C: 304, A: 293 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| carb_volume | 0 | 1 | 0.00 | 1.00 | -3.11 | -0.72 | -0.22 | 0.78 | 3.11 | ▁▆▇▅▁ |
| fill_ounces | 0 | 1 | 0.00 | 1.00 | -3.93 | -0.63 | -0.02 | 0.60 | 3.98 | ▁▂▇▂▁ |
| pc_volume | 0 | 1 | 0.00 | 1.00 | -3.29 | -0.63 | -0.10 | 0.56 | 3.34 | ▁▃▇▂▁ |
| carb_pressure | 0 | 1 | 0.00 | 1.00 | -3.18 | -0.74 | 0.00 | 0.68 | 3.18 | ▁▅▇▃▁ |
| carb_temp | 0 | 1 | 0.00 | 1.00 | -3.11 | -0.67 | -0.07 | 0.68 | 3.22 | ▁▃▇▃▁ |
| psc | 0 | 1 | 0.00 | 1.00 | -1.69 | -0.71 | -0.17 | 0.56 | 3.79 | ▆▇▃▁▁ |
| psc_fill | 0 | 1 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.55 | 3.62 | ▆▇▃▁▁ |
| psc_co2 | 0 | 1 | 0.00 | 1.00 | -1.31 | -0.85 | -0.38 | 0.56 | 4.30 | ▇▅▂▁▁ |
| mnf_flow | 0 | 1 | 0.00 | 1.00 | -4.27 | 0.07 | 0.13 | 0.19 | 3.00 | ▁▁▂▇▁ |
| carb_pressure1 | 0 | 1 | 0.00 | 1.00 | -3.61 | -0.76 | 0.13 | 0.60 | 3.75 | ▁▃▇▂▁ |
| fill_pressure | 0 | 1 | 0.00 | 1.00 | -4.20 | -0.60 | -0.48 | 0.66 | 3.94 | ▁▁▇▂▁ |
| hyd_pressure1 | 0 | 1 | 0.00 | 1.00 | -1.07 | -1.00 | -0.08 | 0.62 | 3.67 | ▇▅▂▁▁ |
| hyd_pressure2 | 0 | 1 | 0.00 | 1.00 | -1.29 | -1.29 | 0.46 | 0.83 | 2.35 | ▇▂▇▃▁ |
| hyd_pressure3 | 0 | 1 | 0.00 | 1.00 | -1.36 | -1.29 | 0.44 | 0.80 | 1.85 | ▇▁▃▇▁ |
| hyd_pressure4 | 0 | 1 | 0.00 | 1.00 | -2.63 | -0.79 | -0.02 | 0.44 | 3.51 | ▂▆▇▂▁ |
| filler_level | 0 | 1 | 0.00 | 1.00 | -3.42 | -0.68 | 0.58 | 0.68 | 3.31 | ▁▃▅▇▁ |
| filler_speed | 0 | 1 | 0.00 | 1.00 | -3.54 | 0.26 | 0.38 | 0.40 | 0.44 | ▁▁▁▁▇ |
| temperature | 0 | 1 | 0.00 | 1.00 | -1.72 | -0.55 | -0.26 | 0.32 | 7.44 | ▇▃▁▁▁ |
| usage_cont | 0 | 1 | 0.00 | 1.00 | -3.00 | -0.88 | 0.27 | 0.92 | 1.65 | ▁▃▅▃▇ |
| carb_flow | 0 | 1 | 0.00 | 1.00 | -2.29 | -1.22 | 0.52 | 0.67 | 2.46 | ▂▅▆▇▁ |
| density | 0 | 1 | 0.00 | 1.00 | -2.48 | -0.73 | -0.52 | 1.18 | 1.98 | ▁▅▇▂▆ |
| mfr | 0 | 1 | 0.00 | 1.00 | -9.49 | 0.04 | 0.26 | 0.35 | 2.29 | ▁▁▁▂▇ |
| balling | 0 | 1 | 0.00 | 1.00 | -2.19 | -0.76 | -0.59 | 1.17 | 1.95 | ▁▇▁▁▅ |
| pressure_vacuum | 0 | 1 | 0.00 | 1.00 | -2.43 | -0.67 | -0.32 | 0.38 | 2.83 | ▂▇▅▃▁ |
| oxygen_filler | 0 | 1 | 0.00 | 1.00 | -0.98 | -0.54 | -0.29 | 0.28 | 7.86 | ▇▁▁▁▁ |
| bowl_setpoint | 0 | 1 | 0.00 | 1.00 | -2.57 | -0.61 | 0.70 | 0.70 | 2.00 | ▁▂▃▇▁ |
| pressure_setpoint | 0 | 1 | 0.00 | 1.00 | -1.77 | -0.79 | -0.79 | 1.18 | 2.16 | ▁▇▁▆▁ |
| air_pressurer | 0 | 1 | 0.00 | 1.00 | -1.68 | -0.52 | -0.19 | 0.14 | 4.42 | ▅▇▁▁▁ |
| alch_rel | 0 | 1 | 0.00 | 1.00 | -3.20 | -0.71 | -0.67 | 0.66 | 3.41 | ▁▇▂▃▁ |
| carb_rel | 0 | 1 | 0.00 | 1.00 | -3.71 | -0.75 | -0.29 | 0.80 | 4.85 | ▁▇▇▂▁ |
| balling_lvl | 0 | 1 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.25 | 1.85 | ▁▇▂▁▆ |
| mnf_flow_off | 0 | 1 | 0.00 | 1.00 | -0.92 | -0.92 | -0.92 | 1.08 | 1.08 | ▇▁▁▁▇ |
| ph | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
Understanding the distribution of the response is a key first step in APM. We examine central tendency, spread, skewness, and potential outliers in pH.
# Summary statistics for pH
summary(train$ph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.880 8.440 8.540 8.546 8.680 9.360
# Skewness
library(e1071)
skewness(train$ph, na.rm = TRUE)
## [1] -0.2906437
We also visualize the distribution:
ggplot(train, aes(x = ph)) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "white", alpha = 0.7) +
labs(title = "Distribution of pH in Training Data - Frequency",
x = "pH",
y = "Frequency") +
theme_minimal()
The histogram above illustrates the distribution of the target variable pH in the training dataset. We can observe that the pH values are approximately normally distributed, with a slight skew towards higher values. The majority of pH values fall within the range of 6.5 to 8.5, indicating that most manufacturing processes maintain a neutral to slightly alkaline pH level. This insight is crucial for understanding the typical operating conditions and will inform our modeling approach.
ggplot(train, aes(x = ph)) +
geom_histogram(bins = 30, aes(y = ..density..), alpha = 0.6) +
geom_density(color = "blue") +
labs(
title = "Distribution of pH (Training Data) - Density",
x = "pH",
y = "Density"
)+ theme_minimal()
ggplot(train, aes(y = ph)) +
geom_boxplot() +
labs(
title = "Boxplot of pH (Training Data)",
y = "pH"
)+theme_minimal()
APM emphasizes understanding how the outcome behaves across important categorical factors. Here we examine pH across brand_code.
brand_ph_summary <- train |>
group_by(brand_code) |>
summarise(
n = n(),
mean_ph = mean(ph, na.rm = TRUE),
sd_ph = sd(ph, na.rm = TRUE),
min_ph = min(ph, na.rm = TRUE),
max_ph = max(ph, na.rm = TRUE),
.groups = "drop"
)
brand_ph_summary
## # A tibble: 5 × 6
## brand_code n mean_ph sd_ph min_ph max_ph
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A 293 8.50 0.163 8.06 8.86
## 2 B 1235 8.57 0.169 8 8.94
## 3 C 304 8.41 0.177 7.88 9.36
## 4 D 615 8.60 0.135 8.16 8.92
## 5 Unknown 120 8.49 0.177 8.12 8.88
ggplot(train, aes(x = brand_code, y = ph)) +
geom_boxplot() +
labs(
title = "pH by Brand Code",
x = "Brand Code",
y = "pH"
)+theme_minimal()
We explore the relationship between each predictor and the response. For numeric predictors, we calculate Pearson correlations with pH and visualize the most important ones.
# Select numeric predictors (including engineered ones) but keep ph for correlation
numeric_vars <- train |>
select(where(is.numeric))
# Correlation matrix for numeric variables
cor_mat <- cor(numeric_vars, use = "pairwise.complete.obs")
# Correlations of predictors with pH (exclude pH itself)
corr_with_ph <- cor_mat[, "ph"]
corr_with_ph <- corr_with_ph[names(corr_with_ph) != "ph"]
# Create a tidy data frame of correlations
corr_ph_df <- tibble(
variable = names(corr_with_ph),
correlation = as.numeric(corr_with_ph)
) |>
mutate(abs_corr = abs(correlation)) |>
arrange(desc(abs_corr))
# Display top 15 predictors by absolute correlation with pH
head(corr_ph_df, 15)
## # A tibble: 15 × 3
## variable correlation abs_corr
## <chr> <dbl> <dbl>
## 1 mnf_flow_off 0.468 0.468
## 2 bowl_setpoint 0.349 0.349
## 3 filler_level 0.322 0.322
## 4 usage_cont -0.318 0.318
## 5 pressure_setpoint -0.306 0.306
## 6 hyd_pressure3 -0.240 0.240
## 7 pressure_vacuum 0.221 0.221
## 8 fill_pressure -0.212 0.212
## 9 hyd_pressure2 -0.201 0.201
## 10 oxygen_filler 0.166 0.166
## 11 carb_rel 0.162 0.162
## 12 temperature -0.158 0.158
## 13 carb_flow 0.156 0.156
## 14 alch_rel 0.147 0.147
## 15 hyd_pressure4 -0.140 0.140
We now visualize the top correlated predictors:
top_corr <- corr_ph_df |>
slice(1:15)
ggplot(top_corr, aes(x = reorder(variable, correlation), y = correlation)) +
geom_col() +
coord_flip() +
labs(
title = "Top Correlations with pH",
x = "Predictor",
y = "Correlation with pH"
)
For a more detailed view, we create scatterplots (for numeric variables) and boxplots (for factors) using caret::featurePlot, as recommended in APM:
# Identify top numeric predictors by absolute correlation
top_numeric_vars <- corr_ph_df |>
arrange(desc(abs_corr)) |>
slice(1:6) |>
pull(variable)
top_numeric_vars
## [1] "mnf_flow_off" "bowl_setpoint" "filler_level"
## [4] "usage_cont" "pressure_setpoint" "hyd_pressure3"
# Scatterplots of top predictors vs pH
featurePlot(
x = train[, top_numeric_vars],
y = train$ph,
plot = "scatter",
layout = c(3, 2),
type = c("p", "smooth")
)
The importance of identifying highly correlated predictors, as redundancy can affect some models and variable importance measures.
# Correlation matrix among numeric predictors (excluding pH)
pred_numeric <- train |>
select(where(is.numeric), -ph)
pred_cor <- cor(pred_numeric, use = "pairwise.complete.obs")
# Visual correlation heatmap
corrplot(pred_cor,
order = "hclust",
type = "lower",
tl.cex = 0.6,
main = "Correlation Heatmap of Numeric Predictors")
Identify highly correlated variables:
# Find predictors with correlation > 0.9
high_corr_idx <- findCorrelation(pred_cor, cutoff = 0.9, names = TRUE)
high_corr_idx
## [1] "balling" "hyd_pressure3" "balling_lvl" "density"
## [5] "bowl_setpoint"
# Extract highly correlated pairs for reporting
pred_cor_abs <- abs(pred_cor)
diag(pred_cor_abs) <- 0
high_pairs_idx <- which(pred_cor_abs > 0.9, arr.ind = TRUE)
high_pairs <- tibble(
var1 = rownames(pred_cor_abs)[high_pairs_idx[, "row"]],
var2 = colnames(pred_cor_abs)[high_pairs_idx[, "col"]],
r = pred_cor[high_pairs_idx]
) |>
filter(var1 < var2) |>
arrange(desc(abs(r)))
high_pairs
## # A tibble: 8 × 3
## var1 var2 r
## <chr> <chr> <dbl>
## 1 balling balling_lvl 0.979
## 2 balling density 0.955
## 3 balling_lvl density 0.949
## 4 bowl_setpoint filler_level 0.938
## 5 hyd_pressure2 hyd_pressure3 0.926
## 6 alch_rel balling 0.924
## 7 alch_rel balling_lvl 0.923
## 8 alch_rel density 0.901
Before modeling, confirming that the scoring set is compatible with the training set. We compare distributions of key predictors and brand composition across train and score.
# Brand distribution in train
brand_dist_train <- train |>
count(brand_code) |>
mutate(pct_train = 100 * n / sum(n))
# Brand distribution in score
brand_dist_score <- score |>
count(brand_code) |>
mutate(pct_score = 100 * n / sum(n))
brand_compare <- brand_dist_train |>
full_join(brand_dist_score, by = "brand_code",
suffix = c("_train", "_score"))
brand_compare
## # A tibble: 5 × 5
## brand_code n_train pct_train n_score pct_score
## <fct> <int> <dbl> <int> <dbl>
## 1 A 293 11.4 35 13.1
## 2 B 1235 48.1 129 48.3
## 3 C 304 11.8 31 11.6
## 4 D 615 24.0 64 24.0
## 5 Unknown 120 4.67 8 3.00
Visual comparison:
brand_compare_long <- brand_compare |>
select(brand_code, pct_train, pct_score) |>
pivot_longer(cols = starts_with("pct"),
names_to = "dataset",
values_to = "pct")
ggplot(brand_compare_long,
aes(x = brand_code, y = pct, fill = dataset)) +
geom_col(position = "dodge") +
labs(
title = "Brand Distribution: Train vs Score",
x = "Brand Code",
y = "Percent of Observations"
)
We compare the distributions of selected key predictors by overlaying the training and scoring datasets.
plot_train_score <- function(var_name) {
df <- bind_rows(
train |> mutate(dataset = "train"),
score |> mutate(dataset = "score")
) |>
select(dataset, all_of(var_name)) |>
filter(!is.na(.data[[var_name]]))
ggplot(df, aes_string(x = var_name, fill = "dataset")) +
geom_density(alpha = 0.4) +
labs(
title = paste("Distribution of", var_name, "- Train vs Score"),
x = var_name,
y = "Density"
)
}
# Example predictors to compare
plot_train_score("bowl_setpoint")
plot_train_score("filler_level")
plot_train_score("pressure_setpoint")
plot_train_score("mnf_flow")
plot_train_score("temperature")
This exploratory analysis provided key insights into the manufacturing process pH prediction task:
ph shows moderate variation across the
training set, with no extreme outliers after preprocessing. This
suggests that standard regression models are appropriate for modeling
the outcome.ph, indicating that more flexible models (such as MARS,
random forest, or Cubist) may capture additional signal beyond a simple
linear model.To evaluate out-of-sample performance, we split the processed training data into a modeling set and a held-out test set. Specifically, 80% of the observations are used for model fitting and hyperparameter tuning, and the remaining 20% are reserved as an independent test set that is only used for final performance evaluation.
idx <- createDataPartition(train$ph, p = 0.80, list = FALSE)
train_set <- train[idx, ]
test_set <- train[-idx, ]
nrow(train_set); nrow(test_set)
## [1] 2055
## [1] 512
We compare several modeling approaches to predict ph,
ranging from a simple linear regression baseline to more flexible,
nonlinear methods. All models are fit using the caret
framework with repeated 10-fold cross-validation (10 folds, 3 repeats)
to obtain stable estimates of predictive performance. The candidate
models include: - Linear regression (baseline) - Elastic net
(regularized linear model) - Multivariate Adaptive Regression Splines
(MARS) - Random forest - Cubist
Each model is trained on the same training set and tuned over a grid of hyperparameters, using RMSE as the primary performance metric.
# Cross-validation setup
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
summaryFunction = defaultSummary,
savePredictions = "final"
)
The below five models were chosen because they represent a balanced comparison of linear, regularized, and tree-based learning approaches.
Linear regression serves as a baseline model. It assumes a linear relationship between PH and process variables, making it useful for comparison.
## -- Model 1: Linear Regression --
mod_lm <- train(
ph ~ .,
data = train_set,
method = "lm",
trControl = ctrl
)
Elastic net was chosen because it is a penalized regression method capable of handling collinearity and feature redundancy. Elastic net combines L1 and L2 regularization, making it more robust than standard linear regression.
## -- Model 2: Elastic Net --
enet_model <- train(
ph ~ .,
data = train_set,
method = "enet",
trControl = ctrl,
preProc = c("center", "scale"),
tuneLength = 10
)
Cubist was included as the most advanced model in this analysis. It fits linear models within partitioned regions, allowing local linear behavior rather than forcing a single global relationship.
## -- Model 3: Cubist --
mod_cubist <- train(
ph ~ .,
data = train_set,
method = "cubist",
trControl = ctrl,
tuneLength = 10
)
MARS was selected because it can model non-linear relationships through hinge functions without fully committing to deep tree structure. It is more flexible than regression but lighter weight than Random Forest or Cubist.
## -- Model 4: MARS --
grid_mars <- expand.grid(
degree = c(1, 2),
nprune = seq(5, 45, by = 5)
)
mod_mars <- train(
ph ~ .,
data = train_set,
method = "earth",
tuneGrid = grid_mars,
trControl = ctrl
)
Random Forest was chosen because it handles complex variable interactions, non-linear patterns, and correlated predictors through ensemble tree learning.
## -- Model 5: Random Forest --
grid_rf <- expand.grid(mtry = c(4, 8, 12))
mod_rf <- train(
ph ~.,
data = train_set,
method = "rf",
tuneGrid = grid_rf,
trControl = ctrl,
importance = TRUE
)
On the held-out test set, the more flexible models (MARS, random
forest, and Cubist) generally outperform the linear regression and
elastic net baselines. Among all candidates, Cubist achieves the lowest
RMSE and highest R-squared, indicating that it provides the best overall
predictive accuracy for ph. Random forest performs
similarly well, but Cubist offers a slightly lower test error and is
therefore selected as the preferred model.
## Compare Resampling Performance
resamps <- resamples(
list(
LR = mod_lm,
ENET = enet_model,
Cubist = mod_cubist,
MARS = mod_mars,
RF = mod_rf
)
)
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: LR, ENET, Cubist, MARS, RF
## Number of resamples: 30
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.09502638 0.10073187 0.10473973 0.10401172 0.10758205 0.11257155 0
## ENET 0.09566784 0.09961397 0.10152405 0.10395614 0.10754995 0.11744316 0
## Cubist 0.06030879 0.06690243 0.06940446 0.06894201 0.07129079 0.07660211 0
## MARS 0.08681832 0.09275998 0.09651002 0.09648758 0.10045255 0.10700731 0
## RF 0.06154997 0.06915886 0.07115037 0.07066135 0.07371902 0.07717479 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.12466236 0.12906901 0.13532887 0.13458006 0.13780751 0.1468140 0
## ENET 0.12317109 0.12891313 0.13151916 0.13422831 0.13723333 0.1545302 0
## Cubist 0.08333979 0.09057539 0.09472819 0.09560511 0.09888564 0.1118435 0
## MARS 0.11211048 0.11998577 0.12495943 0.12512709 0.12919504 0.1403545 0
## RF 0.08323433 0.09348964 0.09655022 0.09668645 0.10299622 0.1094346 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LR 0.3232769 0.3673798 0.4013862 0.4002920 0.4232234 0.4945814 0
## ENET 0.2885415 0.3809034 0.4109300 0.4028178 0.4298928 0.4885498 0
## Cubist 0.5844056 0.6673708 0.6996149 0.6962508 0.7234493 0.7790068 0
## MARS 0.3474497 0.4367629 0.4920285 0.4804481 0.5210819 0.5985185 0
## RF 0.6133050 0.6761162 0.7017627 0.7082329 0.7573772 0.7977818 0
Based on the cross-validated resampling, the two tree-based models (Random Forest and Cubist) performed substantially better than linear models. Random Forest showed the highest \(R^2\), indicating the strongest ability to explain variance in PH, while Cubist produced the lowest MAE and RMSE, meaning it generated more accurate predictions on average. Although each model led in a different metric, the performance gap between them was small, and both emerged as the top-performing predictive methods.
resamp_summary <- summary(resamps)$statistics
model_compare <- data.frame(
"MAE.Mean" = round(resamp_summary$MAE[,"Mean"], 5),
"RMSE.Mean" = round(resamp_summary$RMSE[,"Mean"], 5),
"R-Squared.Mean" = round(resamp_summary$Rsquared[,"Mean"], 5)
) %>%
arrange(desc("R-Squared (Mean)"))
model_compare
## MAE.Mean RMSE.Mean R.Squared.Mean
## LR 0.10401 0.13458 0.40029
## ENET 0.10396 0.13423 0.40282
## Cubist 0.06894 0.09561 0.69625
## MARS 0.09649 0.12513 0.48045
## RF 0.07066 0.09669 0.70823
## Extract actual PH values
test_y <- test_set$ph
## Predict using each trained model
pred_lm <- predict(mod_lm, newdata = test_set)
pred_enet <- predict(enet_model, newdata = test_set)
pred_mars <- predict(mod_mars, newdata = test_set)
pred_rf <- predict(mod_rf, newdata = test_set)
pred_cubist <- predict(mod_cubist, newdata = test_set)
## Test performance summary table
test_metrics <- rbind(
postResample(pred_lm, test_y),
postResample(pred_enet, test_y),
postResample(pred_mars, test_y),
postResample(pred_rf, test_y),
postResample(pred_cubist, test_y)
) %>% as.data.frame()
test_results <- tibble::tibble(
Model = c("Linear Regression","Elastic Net","MARS","Random Forest","Cubist"),
RMSE = test_metrics$RMSE,
Rsquared = test_metrics$Rsquared,
MAE = test_metrics$MAE
) %>%
arrange(desc(Rsquared))
test_results
## # A tibble: 5 × 4
## Model RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 Random Forest 0.0998 0.656 0.0700
## 2 Cubist 0.0998 0.652 0.0676
## 3 MARS 0.126 0.446 0.0966
## 4 Linear Regression 0.130 0.407 0.103
## 5 Elastic Net 0.131 0.403 0.103
The test set results show that Cubist and Random Forest remain the strongest predictive models, reinforcing the resampling evaluation findings. Random Forest achieves the highest \(R^2\), while Cubist produced the lowest RMSE and MAE.
## Resampling Performance
resamp_stats <- summary(resamps)$statistics
resamp_results <- tibble::tibble(
Model = rownames(resamp_stats$RMSE),
RMSE = resamp_stats$RMSE[,"Mean"],
MAE = resamp_stats$MAE[,"Mean"],
Rsquared = resamp_stats$Rsquared[,"Mean"]
)
## Order models by decreasing r-squared
resamp_results <- resamp_results %>%
arrange(desc(Rsquared)) %>%
mutate(Model = factor(Model, levels = Model))
## Long format for ggplot
resamp_plot_data <- resamp_results %>%
tidyr::pivot_longer(
cols = c(RMSE, MAE, Rsquared),
names_to = "Metric",
values_to = "Value"
)
## Resampling Performance Bar Chart
ggplot(resamp_plot_data, aes(x = Model, y = Value, fill = Model)) +
geom_col(width = 0.75, alpha = 0.85) +
facet_wrap(~ Metric, scales = "free_y", ncol = 3) +
labs(
title = "Model Performance - Cross-Validated Resampling",
x = "Model",
y = "Metric Value"
) +
theme_minimal(base_size = 14) +
scale_fill_brewer(palette = "Set2") +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
strip.text = element_text(size = 14, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
## ---- ##
## Test Set Performance
model_plot_data <- test_results %>%
mutate(Model = factor(Model, levels = Model[order(-Rsquared)])) %>%
pivot_longer(cols = c("RMSE","MAE","Rsquared"),
names_to = "Metric", values_to = "Value")
ggplot(model_plot_data, aes(x = Model, y = Value, fill = Model)) +
geom_col(width = 0.75, alpha = 0.85) +
facet_wrap(~ Metric, scales = "free_y", ncol = 3) +
labs(title = "Model Performance Comparison - Test Set",
y = "Metric Value", x = "Model") +
theme_minimal(base_size = 14) +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none",
plot.title = element_text(face="bold", size=18, hjust=0.5),
strip.text = element_text(size=14, face="bold"),
axis.text.x = element_text(angle =90, hjust=1, size = 8))
The difference between the top two models (Random Forest and Cubist) is minimal, but Cubist demonstrates the lowest overall test error, making it the final model selected for PH prediction.
Based on both cross-validated resampling and held-out test-set
performance, the Cubist model is selected as the final model for
ph prediction. While random forest achieves comparable
accuracy, Cubist attains the lowest RMSE and highest R-squared, and its
rule-based structure provides a compact representation of predictor
effects. We therefore use the Cubist model for variable importance
analysis and for generating final predictions on the evaluation
dataset.
cubist_imp <- varImp(mod_cubist, scale = TRUE)
cubist_imp
## cubist variable importance
##
## only 20 most important variables shown (out of 36)
##
## Overall
## pressure_vacuum 100.00
## mnf_flow_off 99.02
## balling_lvl 97.06
## balling 95.10
## alch_rel 90.20
## density 85.29
## air_pressurer 75.49
## temperature 73.53
## oxygen_filler 69.61
## hyd_pressure3 58.82
## brand_codeC 55.88
## bowl_setpoint 54.90
## carb_pressure1 53.92
## carb_rel 53.92
## hyd_pressure2 51.96
## carb_flow 51.96
## usage_cont 45.10
## hyd_pressure1 43.14
## mnf_flow 42.16
## filler_level 35.29
plot(cubist_imp, top = 10, main = "Cubist - Top 10 Variable Importance")
The Cubist model identifies several key drivers of PH variation. The strongest predictor is pressure_vacuum.
In the final step, we apply the selected Cubist model to the evaluation (scoring) dataset to generate predicted pH values. These predictions are based on the same set of processed predictors used during model training, ensuring that the scoring data are handled consistently with the training data. The resulting predicted pH values are exported for downstream use (e.g., submission, reporting, or further quality monitoring) and represent the model’s best estimate of process pH given the observed manufacturing conditions.
final_predictions <- predict(mod_cubist, newdata = score)
eval_output <- score %>% mutate(predicted_ph = final_predictions)
write.csv(eval_output, "PH_predictions_final.csv", row.names = FALSE)
ggplot(eval_output, aes(x=predicted_ph)) +
geom_histogram(bins=30, fill="steelblue", alpha=.7) +
labs(
title="Distribution of Predicted PH – Final Cubist Model",
x="Predicted PH", y="Frequency"
)+
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = 'bold')
)
ph_compare <- bind_rows(
train %>%
transmute(dataset = "Training (Observed)", value = ph),
eval_output %>%
transmute(dataset = "Scoring (Predicted)", value = predicted_ph)
)
ggplot(ph_compare, aes(x = value, fill = dataset)) +
geom_density(alpha = 0.45) +
labs(
title = "Distribution of pH:\nTraining vs Predicted (Cubist)",
x = "pH",
y = "Density",
fill = "Source"
) +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
legend.title = element_text(size = 12, face='bold')
)
In this project, we developed and compared several models for predicting manufacturing process pH using historical process data. After cleaning and exploring the data, we evaluated a baseline linear regression model alongside more flexible methods, including elastic net, MARS, random forest, and Cubist, within a consistent cross-validation framework. The results showed that the nonlinear models substantially improved predictive performance relative to the linear baseline.
Both random forest and Cubist emerged as strong candidates, with Cubist achieving the lowest RMSE and MAE on the held-out test set while maintaining competitive \(R^2\). Based on this balance of accuracy and stability, we selected the Cubist model as the final model and used it to generate pH predictions for the evaluation dataset. The variable importance results highlight a small set of key process variables as the primary drivers of pH, providing insight into which aspects of the manufacturing conditions may be most critical for quality control.
Overall, the workflow demonstrates an end-to-end, data-driven approach to modeling a critical quality attribute. The same framework could be extended to incorporate additional process variables, alternative model types, or more detailed diagnostics if richer data become available.
This analysis is fully reproducible from this R Markdown file:
data/ folder), so the analysis can be re-run by placing the
files in the same project structure.set.seed(2025)) are set
before model training to make results replicable across runs.tidyverse, skimr, corrplot,
caret, janitor, forcats,
recipes, earth, elasticnet,
Cubist, randomForest, and
ggplot2.