library(tidyverse)
library(tidymodels)
library(caret)
library(recipes)
library(kableExtra)
library(patchwork)

library(readxl)
library(janitor)

library(glmnet)
library(earth)
library(kernlab)
library(nnet)
library(ranger)
library(xgboost)
library(e1071)
library(naniar)
library(class)
library(rpart)
library(missForest)

library(vip)
library(writexl)
library(doParallel)

set.seed(123)
project_seed = 123

# use_parallel <- TRUE
# 
# if (use_parallel) {
#   #num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)
#   num_cores_to_use <- 2
#   cl <- makePSOCKcluster(num_cores_to_use)
#   registerDoParallel(cl)
# 
#   message("Parallel backend started with ", num_cores_to_use, " cores.")
# }

Project Objective

The project is to better understand the manufacturing process and build a model that can predict PH. New regulations require the company to explain what production factors may affect PH and also report a predictive model based on historical production data.

The main goal of this project is to create a reliable PH prediction workflow. I will review the data, check missing values, test different imputation methods, compare multiple models, and then choose the final model based on validation performance.

This project has both technical and business purpose. The technical part explains data preprocessing, imputation, model tuning, and validation metrics. The business part explains the final model and important production factors in more readable way.

Modeling Plan

The workflow for this project is:

Load and prepare the historical ABC Beverage data.
Split the data into training, validation, and final evaluation sets.
Review preprocessing issues like missing values, categorical variables, skewness, near-zero variance, and highly correlated predictors.
Create several imputed feature sets.
Test the imputed feature sets with different model families.
Tune models using cross-validation when it is needed.
Compare model performance using RMSE, MAE, and R-squared.
Select the final model mostly based on validation RMSE.
Use the final model to predict PH values for the final evaluation data.
Export final predictions in Excel readable format.

PH will not be used during imputation because PH is the target variable. If PH is used to fill missing feature values, then the model can indirectly learn from the answer it is supposed to predict. This would ppotentially create target leakage.

Load Data

set.seed(123)
train_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentData%20%282%29.xlsx"

test_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentEvaluation.xlsx"

train_file <- tempfile(fileext = ".xlsx")
test_file  <- tempfile(fileext = ".xlsx")

download.file(train_url, train_file, mode = "wb")
download.file(test_url, test_file, mode = "wb")

train_data_raw <- read_excel(train_file) |>
  clean_names()

test_data_raw <- read_excel(test_file) |>
  clean_names()

Categorical Variable Review

categorical_vars_raw <- names(train_data_raw)[
  sapply(train_data_raw, function(x) is.character(x) || is.factor(x))
]

categorical_vars_raw

[1] "brand_code"

categorical_distribution <- tibble()

for (var in categorical_vars_raw) {
  
  temp_distribution <- train_data_raw |>
    count(.data[[var]], name = "count") |>
    mutate(
      variable = var,
      percent = round(count / sum(count) * 100, 2)
    )
  
  names(temp_distribution)[1] <- "category"
  
  categorical_distribution <- bind_rows(
    categorical_distribution,
    temp_distribution
  )
}

categorical_distribution |>
  arrange(variable, desc(percent)) |>
  kable(
    caption = "Categorical Variable Distributions",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Categorical Variable Distributions
category	count	variable	percent
B	1239	brand_code	48.19
D	615	brand_code	23.92
C	304	brand_code	11.82
A	293	brand_code	11.40
NA	120	brand_code	4.67

Only brand_code is a categorical variable in this data. Brand B is the most common category, making up about 48% of the rows. Brand D is the second largest group, while Brands A and C are smaller groups but still have enough observations. Around 4.7% of the brand_code values are missing.

Later I will check if the training and validation split keeps brand_code's distributions similar.

Train and Validation Split

After reviewing the categorical variable distribution, I will split train_data_raw data into 80% training and 20% validation sets. The split will be stratified by PH because PH is the target variable.

After the split, I will also check if brand_code’s values has same distributions between training and validation.

set.seed(123)

production_data <- train_data_raw |>
  mutate(
    brand_code = as.factor(brand_code)
  )

eval_data <- test_data_raw |>
  mutate(
    brand_code = as.factor(brand_code)
  )

ph_split <- initial_split(
  production_data,
  prop = 0.80,
  strata = ph
)

train_data <- training(ph_split)
valid_data <- testing(ph_split)

train_features_raw <- train_data |>
  select(-ph)

valid_features_raw <- valid_data |>
  select(-ph)

eval_features_raw <- eval_data |>
  select(-any_of("ph"))

train_ph <- train_data$ph
valid_ph <- valid_data$ph

valid_features_raw$brand_code <- factor(
  valid_features_raw$brand_code,
  levels = levels(train_features_raw$brand_code)
)

eval_features_raw$brand_code <- factor(
  eval_features_raw$brand_code,
  levels = levels(train_features_raw$brand_code)
)

split_overview <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_raw),
    nrow(valid_features_raw),
    nrow(eval_features_raw)
  ),
  feature_columns = c(
    ncol(train_features_raw),
    ncol(valid_features_raw),
    ncol(eval_features_raw)
  ),
  ph_available = c("Yes", "Yes", "No")
)

split_overview |>
  kable(
    caption = "Modeling Data Split"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Modeling Data Split
dataset	rows	feature_columns	ph_available
Training	2055	32	Yes
Validation	516	32	Yes
Final Evaluation	267	32	No

brand_balance_check <- bind_rows(
  train_features_raw |>
    mutate(split = "Training"),
  valid_features_raw |>
    mutate(split = "Validation")
) |>
  count(split, brand_code, name = "count") |>
  group_by(split) |>
  mutate(
    percent = round(count / sum(count) * 100, 2),
    count_percent = paste0(count, " (", percent, "%)")
  ) |>
  ungroup()

brand_balance_wide <- brand_balance_check |>
  select(split, brand_code, count_percent) |>
  mutate(
    brand_code = ifelse(
      is.na(brand_code),
      "Missing",
      as.character(brand_code)
    ),
    brand_code = factor(
      brand_code,
      levels = c("A", "B", "C", "D", "Missing")
    )
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = count_percent
  )

brand_balance_wide |>
  kable(
    caption = "Brand Code Distribution After Train/Validation Split",
    col.names = c(
      "Dataset",
      "Brand A",
      "Brand B",
      "Brand C",
      "Brand D",
      "Missing Brand Code"
    )
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Brand Code Distribution After Train/Validation Split
Dataset	Brand A	Brand B	Brand C	Brand D	Missing Brand Code
Training	225 (10.95%)	986 (47.98%)	255 (12.41%)	495 (24.09%)	94 (4.57%)
Validation	68 (13.18%)	253 (49.03%)	49 (9.5%)	120 (23.26%)	26 (5.04%)

After the split brand_code distribution is pretty much similar between the training and validation sets. Brand B remains the largest group in both sets, and Brand D is also close between the two sets. Brand A and Brand C have some difference, but the split still looks acceptable. Missing brand_code values are also close, around 4.6% in training and 5.0% in validation.

Because the PH distribution was stratified during the split and brand_code distribution also looks reasonable, I will keep this train/validation split.

Since PH is the target variable, I am going to check if PH values has similar distributions between the training and validation sets.

ph_split_summary <- tibble(
  dataset = c("Training", "Validation"),
  count = c(length(train_ph), length(valid_ph)),
  mean_ph = c(mean(train_ph, na.rm = TRUE), mean(valid_ph, na.rm = TRUE)),
  median_ph = c(median(train_ph, na.rm = TRUE), median(valid_ph, na.rm = TRUE)),
  sd_ph = c(sd(train_ph, na.rm = TRUE), sd(valid_ph, na.rm = TRUE)),
  min_ph = c(min(train_ph, na.rm = TRUE), min(valid_ph, na.rm = TRUE)),
  max_ph = c(max(train_ph, na.rm = TRUE), max(valid_ph, na.rm = TRUE))
)

ph_split_summary |>
  kable(
    caption = "PH Distribution After Train/Validation Split",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

PH Distribution After Train/Validation Split
dataset	count	mean_ph	median_ph	sd_ph	min_ph	max_ph
Training	2055	8.5455	8.54	0.1728	7.88	9.36
Validation	516	8.5462	8.54	0.1717	8.00	8.88

ph_split_plot_data <- tibble(
  ph = c(train_ph, valid_ph),
  dataset = c(
    rep("Training", length(train_ph)),
    rep("Validation", length(valid_ph))
  )
)

ggplot(
  ph_split_plot_data,
  aes(x = ph, fill = dataset)
) +
  geom_density(alpha = 0.4) +
  labs(
    title = "PH Distribution in Training and Validation Sets",
    x = "PH",
    y = "Density",
    fill = "Dataset"
  ) +
  theme_minimal()

The PH distribution looks similar enough between the training and validation sets. As we see from the above table mean, median, and standard deviation are almost the same.
The training set has a slightly higher maximum PH value, but the overall distribution still looks balanced. Because of this, I keep this split for the rest of the project.

Data Preprocessing Review

After checking PH balance and brand_code balance, I am going to review the remaining preprocessing issues. In this section I will explore missing values, numeric predictor summaries, skewness, near-zero variance.

This section has only diagnostic purpose. The actual preprocessing steps like dummy variables, transformations, scaling, and feature removal will be handled later inside modeling workflows when needed.

Missing Value Review

Before checking missing predictor values, I will first check if PH itself has missing values in the training/validation data and final evaluation data.

ph_missing_check <- tibble(
  dataset = c("Training/VAlidation Data", "Final Evaluation Data"),
  rows = c(nrow(train_data_raw), nrow(test_data_raw)),
  ph_column_exists = c(
    "ph" %in% names(train_data_raw),
    "ph" %in% names(test_data_raw)
  ),
  missing_ph_count = c(
    if ("ph" %in% names(train_data_raw)) sum(is.na(train_data_raw$ph)) else NA,
    if ("ph" %in% names(test_data_raw)) sum(is.na(test_data_raw$ph)) else NA
  ),
  missing_ph_percent = c(
    if ("ph" %in% names(train_data_raw)) round(mean(is.na(train_data_raw$ph)) * 100, 2) else NA,
    if ("ph" %in% names(test_data_raw)) round(mean(is.na(test_data_raw$ph)) * 100, 2) else NA
  )
)

ph_missing_check |>
  kable(
    caption = "PH Missing Value Check",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

PH Missing Value Check
dataset	rows	ph_column_exists	missing_ph_count	missing_ph_percent
Training/VAlidation Data	2571	TRUE	4	0.16
Final Evaluation Data	267	TRUE	267	100.00

The final evaluation data has PH missing for all rows, which is expected because those are the values the model needs to predict.

The training/validation data has 4 rows with missing PH values. Because PH is the target variable, those rows cannot be used for model training or validation. I will remove any rows with missing PH from the training and validation sets before separating features and target values.

train_data <- train_data |>
  filter(!is.na(ph))

valid_data <- valid_data |>
  filter(!is.na(ph))

train_features_raw <- train_data |>
  select(-ph)

valid_features_raw <- valid_data |>
  select(-ph)

eval_features_raw <- eval_data |>
  select(-any_of("ph"))

train_ph <- train_data$ph
valid_ph <- valid_data$ph

sum(is.na(train_ph))

[1] 0

sum(is.na(valid_ph))

[1] 0

nrow(train_features_raw)

[1] 2052

nrow(valid_features_raw)

[1] 515

length(train_ph)

[1] 2052

length(valid_ph)

[1] 515

missing_summary <- bind_rows(
  train_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Training",
      total_rows = nrow(train_features_raw)
    ),
  
  valid_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Validation",
      total_rows = nrow(valid_features_raw)
    ),
  
  eval_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Final Evaluation",
      total_rows = nrow(eval_features_raw)
    )
) |>
  mutate(
    missing_percent = round(missing_count / total_rows * 100, 2)
    ) |>
    filter(missing_count > 0) |>
    select(
      dataset,
      variable,
      missing_count,
      missing_percent
    ) |>
    mutate(
      dataset = factor(
        dataset,
        levels = c("Training", "Validation", "Final Evaluation")
      )
    ) |>
    arrange(dataset, desc(missing_count))

missing_summary |>
  kable(
    caption = "Missing Predictor Values by Dataset",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Missing Predictor Values by Dataset
dataset	variable	missing_count	missing_percent
Training	mfr	173	8.43
Training	brand_code	94	4.58
Training	filler_speed	43	2.10
Training	psc_co2	36	1.75
Training	pc_volume	32	1.56
Training	fill_ounces	30	1.46
Training	psc	27	1.32
Training	carb_pressure1	25	1.22
Training	hyd_pressure4	24	1.17
Training	carb_pressure	21	1.02
Training	psc_fill	21	1.02
Training	carb_temp	17	0.83
Training	filler_level	14	0.68
Training	fill_pressure	13	0.63
Training	hyd_pressure2	11	0.54
Training	hyd_pressure3	11	0.54
Training	temperature	10	0.49
Training	oxygen_filler	9	0.44
Training	pressure_setpoint	9	0.44
Training	carb_volume	8	0.39
Training	hyd_pressure1	8	0.39
Training	carb_rel	8	0.39
Training	alch_rel	6	0.29
Training	usage_cont	4	0.19
Training	carb_flow	2	0.10
Training	balling_lvl	1	0.05
Validation	mfr	35	6.80
Validation	brand_code	26	5.05
Validation	filler_speed	11	2.14
Validation	carb_temp	9	1.75
Validation	fill_ounces	8	1.55
Validation	pc_volume	7	1.36
Validation	carb_pressure1	7	1.36
Validation	carb_pressure	6	1.17
Validation	psc	6	1.17
Validation	fill_pressure	5	0.97
Validation	hyd_pressure2	4	0.78
Validation	hyd_pressure3	4	0.78
Validation	hyd_pressure4	4	0.78
Validation	psc_co2	3	0.58
Validation	hyd_pressure1	3	0.58
Validation	pressure_setpoint	3	0.58
Validation	carb_volume	2	0.39
Validation	psc_fill	2	0.39
Validation	filler_level	2	0.39
Validation	temperature	2	0.39
Validation	oxygen_filler	2	0.39
Validation	bowl_setpoint	2	0.39
Validation	usage_cont	1	0.19
Validation	alch_rel	1	0.19
Final Evaluation	mfr	31	11.61
Final Evaluation	filler_speed	10	3.75
Final Evaluation	brand_code	8	3.00
Final Evaluation	fill_ounces	6	2.25
Final Evaluation	psc	5	1.87
Final Evaluation	psc_co2	5	1.87
Final Evaluation	pc_volume	4	1.50
Final Evaluation	carb_pressure1	4	1.50
Final Evaluation	hyd_pressure4	4	1.50
Final Evaluation	psc_fill	3	1.12
Final Evaluation	oxygen_filler	3	1.12
Final Evaluation	alch_rel	3	1.12
Final Evaluation	fill_pressure	2	0.75
Final Evaluation	filler_level	2	0.75
Final Evaluation	temperature	2	0.75
Final Evaluation	usage_cont	2	0.75
Final Evaluation	pressure_setpoint	2	0.75
Final Evaluation	carb_rel	2	0.75
Final Evaluation	carb_volume	1	0.37
Final Evaluation	carb_temp	1	0.37
Final Evaluation	hyd_pressure2	1	0.37
Final Evaluation	hyd_pressure3	1	0.37
Final Evaluation	density	1	0.37
Final Evaluation	balling	1	0.37
Final Evaluation	pressure_vacuum	1	0.37
Final Evaluation	bowl_setpoint	1	0.37
Final Evaluation	air_pressurer	1	0.37

The missing value review shows that missing predictor values are present in training, validation, and evaluation sets. The variable mfr has the highest missingness in all three datasets, with about 8.6% missing in training, 7.0% in validation, and 11.6% in final evaluation. brand_code also has missing values in all datasets.

Most other predictors have smaller missing percentages, usually below 2%.

Since evaluation set also has missing predictor values, the imputation process has to be applied to training, validation, and evaluation missing values. PH will not be used during imputation because it is the target variable.

Raw Numeric Predictor Review

These checks are done before missing values imputation to understand the original data. They are not final transformation decisions. After imputation, I will review the completed feature sets again before modeling.

numeric_vars <- names(train_features_raw)[
  sapply(train_features_raw, is.numeric)
]

#numeric_vars

numeric_summary <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(variable) |>
  summarise(
    count = sum(!is.na(value)),
    missing_count = sum(is.na(value)),
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    .groups = "drop"
  )

numeric_summary |>
  kable(
    caption = "Numeric Predictor Summary in Training Data",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )

Numeric Predictor Summary in Training Data
variable	count	missing_count	mean	median	sd	min	q1	q3	max
air_pressurer	2052	0	142.835	142.600	1.211	140.800	142.200	143.000	148.200
alch_rel	2046	6	6.898	6.560	0.508	5.280	6.540	7.240	8.620
balling	2052	0	2.196	1.648	0.931	0.160	1.496	3.292	3.984
balling_lvl	2051	1	2.050	1.460	0.869	0.000	1.380	3.140	3.660
bowl_setpoint	2052	0	109.197	120.000	15.268	70.000	100.000	120.000	140.000
carb_flow	2050	2	2462.747	3029.000	1075.415	26.000	1160.000	3180.000	3834.000
carb_pressure	2031	21	68.162	68.000	3.511	57.000	65.600	70.500	79.400
carb_pressure1	2027	25	122.537	123.200	4.779	105.600	118.800	125.400	140.200
carb_rel	2044	8	5.436	5.400	0.129	4.960	5.340	5.540	6.060
carb_temp	2035	17	141.026	140.800	4.034	128.600	138.400	143.600	154.000
carb_volume	2044	8	5.370	5.347	0.107	5.040	5.293	5.460	5.700
density	2052	0	1.172	0.980	0.377	0.240	0.900	1.620	1.920
fill_ounces	2022	30	23.976	23.973	0.088	23.633	23.920	24.033	24.320
fill_pressure	2039	13	47.923	46.400	3.179	36.000	46.000	50.000	60.400
filler_level	2038	14	109.145	118.200	15.609	55.800	98.450	120.000	153.200
filler_speed	2009	43	3677.520	3982.000	785.011	998.000	3888.000	3998.000	4030.000
hyd_pressure1	2044	8	12.475	11.600	12.366	-0.800	0.000	20.250	58.000
hyd_pressure2	2041	11	20.934	28.600	16.391	0.000	0.000	34.600	59.400
hyd_pressure3	2041	11	20.496	27.800	15.971	-1.200	0.000	33.400	50.000
hyd_pressure4	2028	24	96.417	96.000	13.255	62.000	86.000	102.000	142.000
mfr	1879	173	703.896	724.000	74.710	31.400	706.700	731.000	868.600
mnf_flow	2052	0	25.160	76.200	119.586	-100.200	-100.000	140.800	229.400
oxygen_filler	2043	9	0.047	0.033	0.045	0.002	0.022	0.060	0.400
pc_volume	2020	32	0.277	0.272	0.061	0.079	0.239	0.313	0.471
pressure_setpoint	2043	9	47.618	46.000	2.030	44.000	46.000	50.000	52.000
pressure_vacuum	2052	0	-5.217	-5.400	0.568	-6.600	-5.600	-5.000	-3.600
psc	2025	27	0.085	0.076	0.050	0.002	0.048	0.112	0.268
psc_co2	2016	36	0.056	0.040	0.043	0.000	0.020	0.080	0.240
psc_fill	2031	21	0.194	0.180	0.116	0.000	0.100	0.260	0.620
temperature	2042	10	65.976	65.600	1.405	63.600	65.200	66.400	76.200
usage_cont	2048	4	21.028	21.920	2.969	12.080	18.440	23.740	25.900

As we see from this table, numeric predictors are measured on different scales. Because of this, models like linear, SVM, neural networks, and regularized regression will need centering and scaling later.

Some variables also have zero or negative values, such as mnf_flow, pressure_vacuum, and some hydraulic pressure variables. Because of this, Box-Cox transformation is not safe for all predictors. Yeo-Johnson transformation is a better option because it can handle zero and negative values.

Some variables are stored as numeric, but they may actually represent process settings or categories. To check this, I will use the counts of unique values each numeric predictor has in the training data.
If a numeric variable has only a small number of unique values, I may apply different methods during preprocessing and modeling.

numeric_unique_summary <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  summarise(
    across(
      everything(),
      ~ n_distinct(., na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "unique_values"
  ) |>
  left_join(
    numeric_summary |>
      select(variable, count, missing_count, min, max),
    by = "variable"
  ) |>
  mutate(
    possible_discrete_or_setting = ifelse(
      unique_values <= 10,
      "Yes",
      "No"
    )
  ) |>
  arrange(unique_values)

numeric_discrete_review <- numeric_unique_summary |>
  filter(unique_values <= 50)

numeric_discrete_review |>
  kable(
    caption = "Numeric Predictors With Limited Unique Values",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Numeric Predictors With Limited Unique Values
variable	unique_values	count	missing_count	min	max	possible_discrete_or_setting
pressure_setpoint	7	2043	9	44.00	52.00	Yes
bowl_setpoint	10	2052	0	70.00	140.00	Yes
pressure_vacuum	16	2052	0	-6.60	-3.60	No
psc_co2	23	2016	36	0.00	0.24	No
air_pressurer	32	2052	0	140.80	148.20	No
hyd_pressure4	39	2028	24	62.00	142.00	No
carb_rel	40	2044	8	4.96	6.06	No
alch_rel	49	2046	6	5.28	8.62	No

Based on the unique value check, pressure_setpoint and bowl_setpoint look like process setting variables. However, I preffer do not treat these variables exactly like fully continuous numeric measurements. Later, when Yeo-Johnson transformations are used inside modeling recipes, I might exclude pressure_setpoint and bowl_setpoint numeric variables from that transformation.

exclude_from_yj_vars <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

# recipe(ph ~ ., data = train_data_model) |>
#   step_YeoJohnson(
#     all_numeric_predictors(),
#     -all_of(setting_like_numeric_vars)
#   ) |>
#   step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
#   step_zv(all_predictors()) |>
#   step_nzv(all_predictors()) |>
#   step_normalize(all_numeric_predictors())

Skewness Check

Since some values are missing, this skewness check is based on available raw training values. Imputation may slightly change the distribution, so the final transformation choices will be confirmed after imputed feature sets are created.

Skewness analysis is useful here because some models work better when numeric predictors are more normally distributed. This is especially important for models like linear regression, elastic net, SVM, and neural networks.

However, I do not want to blindly transform every skewed variable. Some numeric variables are process settings or rounded measurements as we saw earlier, so skewness may not mean the same thing for them. Because of this, transformation decisions will be handled inside the modelling section later.

numeric_skewness <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  summarise(
    across(
      everything(),
      ~ skewness(., na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "skewness"
  ) |>
  mutate(
    abs_skewness = abs(skewness),
    skewness_level = case_when(
      abs_skewness >= 2 ~ "High skewness",
      abs_skewness >= 1 ~ "Moderate skewness",
      TRUE ~ "Low skewness"
    ),
    # setting_like = ifelse(
    #   variable %in% exclude_from_yj_vars,
    #   "Yes",
    #   "No"
    #)
  ) |>
  arrange(desc(abs_skewness))

skewed_numeric_vars <- numeric_skewness |>
  filter(abs_skewness >= 1)

skewed_numeric_vars |>
  kable(
    caption = "Numeric Predictors With Moderate or High Skewness",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Numeric Predictors With Moderate or High Skewness
variable	skewness	abs_skewness	skewness_level
mfr	-5.096	5.096	High skewness
filler_speed	-2.790	2.790	High skewness
temperature	2.483	2.483	High skewness
oxygen_filler	2.366	2.366	High skewness
air_pressurer	2.259	2.259	High skewness
psc_co2	1.755	1.755	Moderate skewness

ggplot(
  numeric_skewness,
  aes(
    x = reorder(variable, abs_skewness),
    y = abs_skewness,
    fill = skewness_level
  )
) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Absolute Skewness of Numeric Predictors",
    subtitle = "Variables are colored by skewness level",
    x = "Variable",
    y = "Absolute Skewness",
    fill = "Skewness Level"
  ) +
  theme_minimal()

skewed_vars_to_plot <- skewed_numeric_vars |>
  pull(variable)

skewed_distribution_data <- train_features_raw |>
  select(all_of(skewed_vars_to_plot)) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

ggplot(
  skewed_distribution_data,
  aes(x = value)
) +
  geom_histogram(
    bins = 30,
    color = "white"
  ) +
  facet_wrap(
    ~ variable,
    scales = "free",
    ncol = 3
  ) +
  labs(
    title = "Distributions of Skewed Numeric Predictors",
    subtitle = "Only variables with absolute skewness greater than or equal to 1 are shown",
    #x = "Value",
    y = "Count"
  ) +
  theme_minimal()

The skewness table and plot shows that only a small group of numeric predictors have moderate or high skewness. The variables with absolute skewness greater than or equal to 1 are mfr, filler_speed, oxygen_filler, temperature, air_pressurer, and psc_co2.

At first, these variables may look like candidates for transformation. However, the histograms show that not all skewed variables behave the same way. Some variables look like regular continuous variables with long tails, while others look more like grouped or rounded process measurements.

oxygen_filler and temperature look more like continuous variables with right skewness. These variables are better candidates for Yeo-Johnson transformation in models that are sensitive to skewed predictors.
mfr and filler_speed are left-skewed. Most observations are concentrated around normal operating values, but there are some much lower values. After outliers analysis section, transformation may still be useful for some model types.
Even psc_co2 have skewness, its plot show grouped values and repeated bars respectfully. This suggests psc_co2 may be categorical or rounded measurements. Because of this, I preffer not to apply Yeo-Johnson transformation to psc_co2.

Based on this review, I will keep all skewed variables in the dataset, but I separate the transformation decision by variable type. In the modeling workflow, Yeo-Johnson transformation will be applied before brand_codedummy encoding, and variables that look like process settings or grouped measurements will be excluded from Yeo-Johnson transformation.

exclude_from_yj_vars <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2"
)

Near-Zero Variance Check

Next I will check for near-zero variance predictors.

This matters because predictors with very little variation usually do not have much useful information for the models.

nzv_raw_check <- nearZeroVar(
  train_features_raw,
  saveMetrics = FALSE)

nzv_raw_check

integer(0)

No raw predictors were flagged as near-zero variance. This means I do not remove any predictors at this stage based on near-zero variance.

Outlier and Unusual Value Review

Before imputation, I would check for unusual numeric values. If a value is clearly impossible or likely caused by a recording issue, it should be reviewed before missing values are filled.

I do not automatically remove outliers only because they are extreme. In production data, unusual values may represent real process conditions. My goal here is to identify suspicious values and decide carefully before changing the data.

numeric_vars_raw <- train_features_raw |>
  select(where(is.numeric)) |>
  names()

outlier_summary <- map_dfr(
  numeric_vars_raw,
  function(var) {
    
    values <- train_features_raw[[var]]
    
    q1 <- quantile(values, 0.25, na.rm = TRUE)
    q3 <- quantile(values, 0.75, na.rm = TRUE)
    iqr_value <- q3 - q1
    
    lower_bound <- q1 - 1.5 * iqr_value
    upper_bound <- q3 + 1.5 * iqr_value
    
    tibble(
      variable = var,
      count = sum(!is.na(values)),
      missing_count = sum(is.na(values)),
      min_value = min(values, na.rm = TRUE),
      q1 = q1,
      median = median(values, na.rm = TRUE),
      q3 = q3,
      max_value = max(values, na.rm = TRUE),
      lower_bound = lower_bound,
      upper_bound = upper_bound,
      outlier_count = sum(values < lower_bound | values > upper_bound, na.rm = TRUE),
      outlier_percent = round(outlier_count / count * 100, 2)
    )
  }
) |>
  arrange(desc(outlier_percent))

outlier_summary |>
  kable(
    caption = "Outlier Review Using IQR Rule",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Outlier Review Using IQR Rule
variable	count	missing_count	min_value	q1	median	q3	max_value	lower_bound	upper_bound	outlier_count	outlier_percent
filler_speed	2009	43	998.000	3888.000	3982.000	3998.000	4030.000	3723.000	4163.000	356	17.72
mfr	1879	173	31.400	706.700	724.000	731.000	868.600	670.250	767.450	190	10.11
air_pressurer	2052	0	140.800	142.200	142.600	143.000	148.200	141.000	144.200	177	8.63
oxygen_filler	2043	9	0.002	0.022	0.033	0.060	0.400	-0.035	0.117	147	7.20
temperature	2042	10	63.600	65.200	65.600	66.400	76.200	63.400	68.200	103	5.04
pressure_vacuum	2052	0	-6.600	-5.600	-5.400	-5.000	-3.600	-6.500	-4.100	95	4.63
hyd_pressure4	2028	24	62.000	86.000	96.000	102.000	142.000	62.000	126.000	70	3.45
pc_volume	2020	32	0.079	0.239	0.272	0.313	0.471	0.128	0.424	64	3.17
fill_pressure	2039	13	36.000	46.000	46.400	50.000	60.400	40.000	56.000	64	3.14
psc_co2	2016	36	0.000	0.020	0.040	0.080	0.240	-0.070	0.170	63	3.12
psc_fill	2031	21	0.000	0.100	0.180	0.260	0.620	-0.140	0.500	52	2.56
psc	2025	27	0.002	0.048	0.076	0.112	0.268	-0.048	0.208	48	2.37
fill_ounces	2022	30	23.633	23.920	23.973	24.033	24.320	23.750	24.203	38	1.88
carb_temp	2035	17	128.600	138.400	140.800	143.600	154.000	130.600	151.400	35	1.72
carb_pressure1	2027	25	105.600	118.800	123.200	125.400	140.200	108.900	135.300	17	0.84
carb_pressure	2031	21	57.000	65.600	68.000	70.500	79.400	58.250	77.850	14	0.69
filler_level	2038	14	55.800	98.450	118.200	120.000	153.200	66.125	152.325	6	0.29
alch_rel	2046	6	5.280	6.540	6.560	7.240	8.620	5.490	8.290	6	0.29
carb_rel	2044	8	4.960	5.340	5.400	5.540	6.060	5.040	5.840	6	0.29
hyd_pressure1	2044	8	-0.800	0.000	11.600	20.250	58.000	-30.375	50.625	4	0.20
carb_volume	2044	8	5.040	5.293	5.347	5.460	5.700	5.043	5.710	2	0.10
mnf_flow	2052	0	-100.200	-100.000	76.200	140.800	229.400	-461.200	502.000	0	0.00
hyd_pressure2	2041	11	0.000	0.000	28.600	34.600	59.400	-51.900	86.500	0	0.00
hyd_pressure3	2041	11	-1.200	0.000	27.800	33.400	50.000	-50.100	83.500	0	0.00
usage_cont	2048	4	12.080	18.440	21.920	23.740	25.900	10.490	31.690	0	0.00
carb_flow	2050	2	26.000	1160.000	3029.000	3180.000	3834.000	-1870.000	6210.000	0	0.00
density	2052	0	0.240	0.900	0.980	1.620	1.920	-0.180	2.700	0	0.00
balling	2052	0	0.160	1.496	1.648	3.292	3.984	-1.198	5.986	0	0.00
bowl_setpoint	2052	0	70.000	100.000	120.000	120.000	140.000	70.000	150.000	0	0.00
pressure_setpoint	2043	9	44.000	46.000	46.000	50.000	52.000	40.000	56.000	0	0.00
balling_lvl	2051	1	0.000	1.380	1.460	3.140	3.660	-1.260	5.780	0	0.00

negative_value_check <- train_features_raw |>
  select(where(is.numeric)) |>
  summarise(
    across(
      everything(),
      ~ sum(. < 0, na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "negative_count"
  ) |>
  filter(negative_count > 0) |>
  mutate(
    total_non_missing = map_int(
      variable,
      ~ sum(!is.na(train_features_raw[[.x]]))
    ),
    negative_percent = round(negative_count / total_non_missing * 100, 2),
    min_value = map_dbl(
      variable,
      ~ min(train_features_raw[[.x]], na.rm = TRUE)
    ),
    max_value = map_dbl(
      variable,
      ~ max(train_features_raw[[.x]], na.rm = TRUE)
    )
  ) |>
  arrange(desc(negative_count))

negative_value_check |>
  kable(
    caption = "Numeric Predictors With Negative Values",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Numeric Predictors With Negative Values
variable	negative_count	total_non_missing	negative_percent	min_value	max_value
pressure_vacuum	2052	2052	100.00	-6.6	-3.6
mnf_flow	941	2052	45.86	-100.2	229.4
hyd_pressure3	60	2041	2.94	-1.2	50.0
hyd_pressure1	33	2044	1.61	-0.8	58.0

Review of Negative `mnfflow` Values

The negative-value summary shows that almost 46% of mnf_flow has many negative values . Since almost half of the observed mnf_flow values are negative, I want to review this variable more carefully before deciding whether these values should be treated as errors.

Negative mnf_flow values look suspicious because many of them appear to be negative. However, before deciding on what to do with negative values, I would like to check whether these values are connected to PH and other variables.

First of all, I will check the what actually these negative values look like

mnf_flow_negative_values <- train_features_raw |>
  filter(
    !is.na(mnf_flow),
    mnf_flow < 0
  ) |>
  count(
    mnf_flow,
    sort = TRUE
  )

mnf_flow_negative_values |>
  kable(
    caption = "Negative mnfflow Values in Training Data",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Negative mnfflow Values in Training Data
mnf_flow	n
-100.0	487
-100.2	454

The table shows that negative mnf_flow values are almost entirely concentrated at -100 and -100.2. This suggests that these values may represent an error readings or actual production settings.

Below I will compare PH values for rows with negative and non-negative mnf_flow.

mnf_flow_ph_check <- train_features_raw |>
  mutate(
    mnf_flow_status = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow < 0 ~ "Negative",
      mnf_flow >= 0 ~ "Non-negative"
    )
  ) |>
  bind_cols(
    tibble(ph = train_ph)
  ) |>
  filter(
    !is.na(ph)
  ) |>
  group_by(
    brand_code,
    mnf_flow_status
  ) |>
  summarise(
    n = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(
    brand_code,
    mnf_flow_status
  )

mnf_flow_ph_check |>
  kable(
    caption = "PH Comparison for Negative vs Non-Negative mnfflow",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

PH Comparison for Negative vs Non-Negative mnfflow
brand_code	mnf_flow_status	n	mean_ph	median_ph	sd_ph	min_ph	max_ph
A	Negative	85	8.5779	8.60	0.1388	8.20	8.86
A	Non-negative	140	8.4381	8.46	0.1519	8.06	8.68
B	Negative	476	8.6700	8.70	0.1385	8.18	8.94
B	Non-negative	507	8.4716	8.48	0.1329	8.00	8.86
C	Negative	118	8.4675	8.50	0.1995	8.08	9.36
C	Non-negative	137	8.3737	8.38	0.1495	7.88	8.80
D	Negative	222	8.6627	8.68	0.1316	8.32	8.92
D	Non-negative	273	8.5527	8.56	0.1226	8.16	8.82
NA	Negative	40	8.6330	8.61	0.1261	8.28	8.88
NA	Non-negative	54	8.3930	8.45	0.1289	8.12	8.60

As we see from the table, negative mnf_flow rows have higher mean PH in every Brand Code group. This strongly suggests that the negative values are carrying useful information about the PH outcome.

I also compare a few process variables to see whether negative mnf_flow rows look different in the production process.

mnf_flow_process_check <- train_features_raw |>
  mutate(
    mnf_flow_negative = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow < 0 ~ "Negative",
      mnf_flow >= 0 ~ "Non-negative"
    )
  ) |>
  group_by(
    brand_code,
    mnf_flow_negative
  ) |>
  summarise(
    n = n(),
    mean_temperature = mean(temperature, na.rm = TRUE),
    mean_density = mean(density, na.rm = TRUE),
    mean_carb_rel = mean(carb_rel, na.rm = TRUE),
    mean_oxygen_filler = mean(oxygen_filler, na.rm = TRUE),
    mean_filler_speed = mean(filler_speed, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(
    brand_code,
    mnf_flow_negative
  )

mnf_flow_process_check |>
  kable(
    caption = "Process Variable Comparison for Negative vs Non-Negative mnfflow",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Process Variable Comparison for Negative vs Non-Negative mnfflow
brand_code	mnf_flow_negative	n	mean_temperature	mean_density	mean_carb_rel	mean_oxygen_filler	mean_filler_speed
A	Negative	85	66.1153	1.5522	5.5026	0.0701	3381.349
A	Non-negative	140	66.1696	1.5759	5.5247	0.0265	3684.567
B	Negative	476	65.8474	0.9103	5.3818	0.0713	3735.662
B	Non-negative	507	65.9199	0.9084	5.3424	0.0255	3705.375
C	Negative	118	66.9879	0.8961	5.3483	0.0848	3637.983
C	Non-negative	137	66.4482	0.9330	5.3543	0.0229	3723.293
D	Negative	222	65.5674	1.6842	5.6032	0.0709	3622.532
D	Non-negative	273	65.3735	1.6787	5.6078	0.0232	3706.127
NA	Negative	40	67.3500	0.9605	5.3349	0.0792	3610.350
NA	Non-negative	54	67.2491	0.9919	5.3506	0.0269	3449.077

The process-variable comparison also shows that negative mnf_flow rows are different from non-negative rows.
The clearest example is oxygen_filler. For Brand B:

rows with negative mnf_flow have mean of oxygen_filler of 0.0713, compared with only 0.0255 for non-negative mnf_flow rows.

The same pattern appears for other brands too. For Brand C:

the mean of oxygen_filler is 0.0848 for negative mnf_flow rows and 0.0229 for non-negative rows.

This makes me more careful about treating negative mnf_flow as an error. If these values were only random bad measurements, I would not expect them to clearly separate PH and oxygen_filler.

Based on this review, I will not convert negative mnf_flow values as errors in the main preprocessing workflow at this point. Even though -100 and -100.2 look like suspicious, they appear to contain useful information. For example, Brand B has mean PH of 8.6700 when mnf_flow is negative, compared with 8.4716 when mnf_flow is non-negative. I will decide how to treat them after I impute missing values.

Outliers Review by Brand Code

I think it is more logical to also anlyse outliers by brand_code group. Since different brands may have different production settings, one global outlier cutoff can be misleading.

For example, a value may look unusual when all brands are combined together, but it may be normal for one specific brand. Because of this, I do not want to decide about outliers only from the full dataset. I want to see if values are unusual inside their own brand group.

This brand-level outlier check gives a better view of the production process. If a value is only extreme globally but not extreme within its own brand, I would probably keep it. But if a value is extreme even inside its own brand group and also does not make sense from production logic, then it may need more review before imputation.

At this stage, I use this as a diagnostic check. I will not remove or change outliers automatically, because some unusual values may still represent real production conditions.

brand_outlier_summary <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    count = sum(!is.na(value)),
    missing_count = sum(is.na(value)),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    min_value = min(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    outlier_count = sum(
      value < lower_bound | value > upper_bound,
      na.rm = TRUE
    ),
    outlier_percent = round(outlier_count / count * 100, 2),
    .groups = "drop"
  ) |>
  arrange(desc(outlier_percent))

brand_outlier_summary |>
  filter(outlier_count > 0) |>
  kable(
    caption = "Outlier Review Within Each Brand Code Group",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Outlier Review Within Each Brand Code Group
brand_code	variable	count	missing_count	q1	median	q3	iqr_value	min_value	max_value	lower_bound	upper_bound	outlier_count	outlier_percent
D	pressure_setpoint	493	2	46.000	46.000	46.000	0.000	44.000	52.000	46.000	46.000	159	32.25
D	bowl_setpoint	495	0	110.000	120.000	120.000	10.000	70.000	140.000	95.000	135.000	110	22.22
C	filler_speed	249	6	3892.000	3980.000	3996.000	104.000	1004.000	4030.000	3736.000	4152.000	48	19.28
D	filler_speed	485	10	3892.000	3982.000	3996.000	104.000	998.000	4026.000	3736.000	4152.000	91	18.76
A	filler_speed	217	8	3806.000	3984.000	3998.000	192.000	1004.000	4024.000	3518.000	4286.000	34	15.67
B	filler_speed	966	17	3890.000	3982.000	3998.000	108.000	1002.000	4020.000	3728.000	4160.000	147	15.22
C	mfr	233	22	709.800	721.600	729.800	20.000	76.800	835.400	679.800	759.800	31	13.30
C	filler_level	254	1	108.250	118.100	120.000	11.750	69.400	150.800	90.625	137.625	32	12.60
C	bowl_setpoint	255	0	110.000	120.000	120.000	10.000	70.000	130.000	95.000	135.000	32	12.55
D	hyd_pressure4	494	1	78.000	80.000	84.000	6.000	62.000	140.000	69.000	93.000	61	12.35
B	air_pressurer	983	0	142.200	142.600	143.000	0.800	140.800	148.200	141.000	144.200	120	12.21
A	mfr	194	31	709.200	723.600	731.000	21.800	31.400	868.600	676.500	763.700	23	11.86
D	mfr	453	42	707.400	724.800	732.200	24.800	112.600	792.400	670.200	769.400	51	11.26
C	air_pressurer	255	0	142.200	142.400	142.800	0.600	141.200	147.200	141.300	143.700	28	10.98
D	density	495	0	1.660	1.720	1.780	0.120	0.720	1.920	1.480	1.960	49	9.90
C	density	255	0	0.860	0.940	1.020	0.160	0.240	1.760	0.620	1.260	25	9.80
D	fill_pressure	491	4	45.800	46.200	48.000	2.200	37.800	60.000	42.500	51.300	48	9.78
C	carb_rel	252	3	5.320	5.360	5.380	0.060	5.020	5.580	5.230	5.470	24	9.52
A	oxygen_filler	223	2	0.022	0.032	0.048	0.026	0.002	0.400	-0.017	0.088	21	9.42
B	hyd_pressure4	975	8	96.000	98.000	102.000	6.000	78.000	138.000	87.000	111.000	90	9.23
D	oxygen_filler	494	1	0.019	0.033	0.052	0.033	0.002	0.318	-0.030	0.101	45	9.11
A	density	225	0	1.540	1.600	1.660	0.120	0.840	1.760	1.360	1.840	20	8.89
A	air_pressurer	225	0	142.400	142.600	143.000	0.600	141.000	147.000	141.500	143.900	18	8.00
C	psc_co2	250	5	0.040	0.040	0.080	0.040	0.000	0.240	-0.020	0.140	20	8.00
B	mfr	914	69	706.200	725.000	730.800	24.600	95.400	840.000	669.300	767.700	70	7.66
A	alch_rel	224	1	7.100	7.120	7.160	0.060	6.520	7.760	7.010	7.250	16	7.14
B	oxygen_filler	981	2	0.022	0.034	0.064	0.042	0.002	0.400	-0.041	0.127	60	6.12
B	density	983	0	0.880	0.920	0.960	0.080	0.340	1.840	0.760	1.080	58	5.90
D	pc_volume	489	6	0.221	0.255	0.292	0.071	0.087	0.462	0.115	0.398	27	5.52
B	temperature	980	3	65.200	65.600	66.200	1.000	64.000	75.800	63.700	67.700	51	5.20
C	oxygen_filler	252	3	0.022	0.034	0.072	0.050	0.003	0.238	-0.054	0.148	13	5.16
D	pressure_vacuum	495	0	-5.600	-5.400	-5.000	0.600	-6.600	-3.600	-6.500	-4.100	25	5.05
B	psc_co2	963	20	0.020	0.040	0.070	0.050	0.000	0.240	-0.055	0.145	48	4.98
A	temperature	223	2	65.200	65.800	66.600	1.400	64.200	76.200	63.100	68.700	11	4.93
D	carb_rel	495	0	5.580	5.600	5.640	0.060	4.960	5.780	5.490	5.730	24	4.85
A	pressure_vacuum	225	0	-5.600	-5.200	-5.000	0.600	-6.200	-3.800	-6.500	-4.100	10	4.44
D	air_pressurer	495	0	142.200	142.600	142.800	0.600	141.200	147.400	141.300	143.700	22	4.44
C	fill_ounces	251	4	23.940	23.980	24.040	0.100	23.693	24.313	23.790	24.190	11	4.38
D	temperature	493	2	64.600	65.200	66.000	1.400	63.600	73.600	62.500	68.100	20	4.06
A	balling_lvl	225	0	3.040	3.100	3.200	0.160	1.440	3.400	2.800	3.440	9	4.00
A	fill_pressure	225	0	46.000	47.400	50.200	4.200	36.200	60.400	39.700	56.500	9	4.00
C	pc_volume	251	4	0.258	0.290	0.322	0.064	0.103	0.463	0.163	0.417	10	3.98
C	fill_pressure	252	3	46.000	48.000	50.200	4.200	39.200	58.800	39.700	56.500	10	3.97
A	psc	220	5	0.046	0.072	0.102	0.056	0.006	0.260	-0.038	0.186	8	3.64
D	psc_co2	489	6	0.020	0.040	0.060	0.040	0.000	0.240	-0.040	0.120	17	3.48
C	temperature	253	2	65.800	66.600	67.200	1.400	64.000	76.000	63.700	69.300	8	3.16
B	balling_lvl	983	0	1.360	1.400	1.440	0.080	0.000	3.440	1.240	1.560	31	3.15
A	carb_rel	224	1	5.460	5.520	5.560	0.100	5.120	6.060	5.310	5.710	7	3.12
B	balling	983	0	1.398	1.498	1.596	0.198	0.346	3.864	1.101	1.893	30	3.05
D	carb_temp	492	3	138.800	140.700	143.250	4.450	129.200	153.800	132.125	149.925	15	3.05
B	pc_volume	969	14	0.247	0.277	0.315	0.068	0.109	0.471	0.145	0.417	28	2.89
B	psc_fill	974	9	0.120	0.180	0.260	0.140	0.000	0.620	-0.090	0.470	28	2.87
D	carb_volume	492	3	5.473	5.513	5.540	0.067	5.260	5.700	5.373	5.640	14	2.85
D	filler_level	491	4	100.800	117.400	120.200	19.400	55.800	151.800	71.700	149.300	14	2.85
A	fill_ounces	219	6	23.933	23.980	24.033	0.100	23.753	24.313	23.783	24.183	6	2.74
D	alch_rel	495	0	7.680	7.720	7.740	0.060	6.500	8.200	7.590	7.830	13	2.63
B	carb_pressure	976	7	65.000	66.800	69.000	4.000	57.000	79.400	59.000	75.000	25	2.56
D	psc	490	5	0.042	0.074	0.106	0.064	0.002	0.264	-0.053	0.201	12	2.45
C	psc_fill	253	2	0.100	0.180	0.280	0.180	0.020	0.600	-0.170	0.550	6	2.37
B	psc	971	12	0.052	0.080	0.114	0.062	0.004	0.268	-0.041	0.207	22	2.27
A	psc_co2	222	3	0.020	0.040	0.080	0.060	0.000	0.240	-0.070	0.170	5	2.25
A	psc_fill	223	2	0.100	0.180	0.260	0.160	0.000	0.580	-0.140	0.500	5	2.24
A	balling	225	0	3.092	3.194	3.390	0.298	1.298	3.764	2.645	3.837	5	2.22
B	fill_pressure	980	3	46.000	47.000	50.000	4.000	36.000	59.200	40.000	56.000	21	2.14
D	carb_pressure	484	11	68.750	70.600	72.400	3.650	62.600	78.800	63.275	77.875	10	2.07
D	fill_ounces	490	5	23.907	23.960	24.007	0.100	23.693	24.320	23.757	24.157	10	2.04
D	psc_fill	491	4	0.100	0.180	0.260	0.160	0.000	0.600	-0.140	0.500	10	2.04
D	balling	495	0	3.390	3.490	3.688	0.298	0.998	3.984	2.943	4.135	10	2.02
D	balling_lvl	495	0	3.200	3.280	3.360	0.160	0.900	3.660	2.960	3.600	10	2.02
C	carb_volume	254	1	5.260	5.300	5.340	0.080	5.047	5.473	5.140	5.460	5	1.97
C	balling	255	0	1.472	1.646	1.796	0.324	0.160	3.588	0.986	2.282	5	1.96
B	alch_rel	980	3	6.520	6.540	6.560	0.040	6.400	8.560	6.460	6.620	19	1.94
A	hyd_pressure4	216	9	94.000	102.000	112.000	18.000	62.000	142.000	67.000	139.000	4	1.85
A	carb_volume	225	0	5.373	5.433	5.467	0.093	5.040	5.627	5.233	5.607	4	1.78
B	carb_volume	979	4	5.280	5.313	5.347	0.067	5.080	5.560	5.180	5.447	16	1.63
C	hyd_pressure4	253	2	94.000	102.000	110.000	16.000	74.000	140.000	70.000	134.000	4	1.58
C	psc	253	2	0.052	0.082	0.122	0.070	0.002	0.256	-0.053	0.227	4	1.58
B	carb_rel	981	2	5.320	5.360	5.400	0.080	5.100	5.840	5.200	5.520	15	1.53
A	pc_volume	220	5	0.223	0.255	0.308	0.085	0.111	0.456	0.095	0.436	3	1.36
A	carb_pressure	224	1	67.200	69.200	71.200	4.000	60.600	78.200	61.200	77.200	3	1.34
B	carb_temp	973	10	138.200	140.600	143.800	5.600	128.600	154.000	129.800	152.200	13	1.34
B	fill_ounces	972	11	23.920	23.980	24.033	0.113	23.633	24.280	23.750	24.203	13	1.34
C	pressure_vacuum	255	0	-5.700	-5.400	-5.000	0.700	-6.200	-3.800	-6.750	-3.950	3	1.18
D	carb_pressure1	488	7	119.000	123.400	125.800	6.800	105.600	140.200	108.800	136.000	5	1.02
A	carb_pressure1	222	3	119.800	123.400	125.600	5.800	111.000	134.600	111.100	134.300	2	0.90
C	carb_pressure	254	1	64.800	67.200	69.200	4.400	59.600	78.200	58.200	75.800	2	0.79
C	balling_lvl	255	0	1.400	1.520	1.620	0.220	0.000	3.380	1.070	1.950	2	0.78
B	carb_pressure1	971	12	118.600	123.000	125.400	6.800	107.000	139.400	108.400	135.600	7	0.72
A	hyd_pressure1	225	0	0.000	10.400	19.400	19.400	-0.800	53.800	-29.100	48.500	1	0.44
C	carb_pressure1	253	2	118.600	123.200	125.200	6.600	109.600	138.400	108.700	135.100	1	0.40
C	carb_temp	253	2	138.600	141.600	144.000	5.400	130.600	153.600	130.500	152.100	1	0.40
C	alch_rel	254	1	6.520	6.560	6.600	0.080	6.400	7.660	6.400	6.720	1	0.39
B	hyd_pressure1	981	2	0.000	12.000	20.200	20.200	-0.800	58.000	-30.300	50.500	2	0.20

brand_outlier_summary |>
  filter(outlier_percent >= 5) |>
  arrange(desc(outlier_percent)) |>
  kable(
    caption = "Brand-Level Outlier Checks With At Least 5 Percent Outliers",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Brand-Level Outlier Checks With At Least 5 Percent Outliers
brand_code	variable	count	missing_count	q1	median	q3	iqr_value	min_value	max_value	lower_bound	upper_bound	outlier_count	outlier_percent
D	pressure_setpoint	493	2	46.000	46.000	46.000	0.000	44.000	52.000	46.000	46.000	159	32.25
D	bowl_setpoint	495	0	110.000	120.000	120.000	10.000	70.000	140.000	95.000	135.000	110	22.22
C	filler_speed	249	6	3892.000	3980.000	3996.000	104.000	1004.000	4030.000	3736.000	4152.000	48	19.28
D	filler_speed	485	10	3892.000	3982.000	3996.000	104.000	998.000	4026.000	3736.000	4152.000	91	18.76
A	filler_speed	217	8	3806.000	3984.000	3998.000	192.000	1004.000	4024.000	3518.000	4286.000	34	15.67
B	filler_speed	966	17	3890.000	3982.000	3998.000	108.000	1002.000	4020.000	3728.000	4160.000	147	15.22
C	mfr	233	22	709.800	721.600	729.800	20.000	76.800	835.400	679.800	759.800	31	13.30
C	filler_level	254	1	108.250	118.100	120.000	11.750	69.400	150.800	90.625	137.625	32	12.60
C	bowl_setpoint	255	0	110.000	120.000	120.000	10.000	70.000	130.000	95.000	135.000	32	12.55
D	hyd_pressure4	494	1	78.000	80.000	84.000	6.000	62.000	140.000	69.000	93.000	61	12.35
B	air_pressurer	983	0	142.200	142.600	143.000	0.800	140.800	148.200	141.000	144.200	120	12.21
A	mfr	194	31	709.200	723.600	731.000	21.800	31.400	868.600	676.500	763.700	23	11.86
D	mfr	453	42	707.400	724.800	732.200	24.800	112.600	792.400	670.200	769.400	51	11.26
C	air_pressurer	255	0	142.200	142.400	142.800	0.600	141.200	147.200	141.300	143.700	28	10.98
D	density	495	0	1.660	1.720	1.780	0.120	0.720	1.920	1.480	1.960	49	9.90
C	density	255	0	0.860	0.940	1.020	0.160	0.240	1.760	0.620	1.260	25	9.80
D	fill_pressure	491	4	45.800	46.200	48.000	2.200	37.800	60.000	42.500	51.300	48	9.78
C	carb_rel	252	3	5.320	5.360	5.380	0.060	5.020	5.580	5.230	5.470	24	9.52
A	oxygen_filler	223	2	0.022	0.032	0.048	0.026	0.002	0.400	-0.017	0.088	21	9.42
B	hyd_pressure4	975	8	96.000	98.000	102.000	6.000	78.000	138.000	87.000	111.000	90	9.23
D	oxygen_filler	494	1	0.019	0.033	0.052	0.033	0.002	0.318	-0.030	0.101	45	9.11
A	density	225	0	1.540	1.600	1.660	0.120	0.840	1.760	1.360	1.840	20	8.89
A	air_pressurer	225	0	142.400	142.600	143.000	0.600	141.000	147.000	141.500	143.900	18	8.00
C	psc_co2	250	5	0.040	0.040	0.080	0.040	0.000	0.240	-0.020	0.140	20	8.00
B	mfr	914	69	706.200	725.000	730.800	24.600	95.400	840.000	669.300	767.700	70	7.66
A	alch_rel	224	1	7.100	7.120	7.160	0.060	6.520	7.760	7.010	7.250	16	7.14
B	oxygen_filler	981	2	0.022	0.034	0.064	0.042	0.002	0.400	-0.041	0.127	60	6.12
B	density	983	0	0.880	0.920	0.960	0.080	0.340	1.840	0.760	1.080	58	5.90
D	pc_volume	489	6	0.221	0.255	0.292	0.071	0.087	0.462	0.115	0.398	27	5.52
B	temperature	980	3	65.200	65.600	66.200	1.000	64.000	75.800	63.700	67.700	51	5.20
C	oxygen_filler	252	3	0.022	0.034	0.072	0.050	0.003	0.238	-0.054	0.148	13	5.16
D	pressure_vacuum	495	0	-5.600	-5.400	-5.000	0.600	-6.600	-3.600	-6.500	-4.100	25	5.05

negative_values_by_brand <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    count = sum(!is.na(value)),
    negative_count = sum(value < 0, na.rm = TRUE),
    negative_percent = round(negative_count / count * 100, 2),
    min_value = min(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(negative_count > 0) |>
  arrange(desc(negative_percent))

negative_values_by_brand |>
  kable(
    caption = "Negative Values by Brand Code",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Negative Values by Brand Code
brand_code	variable	count	negative_count	negative_percent	min_value	max_value
A	pressure_vacuum	225	225	100.00	-6.2	-3.8
B	pressure_vacuum	983	983	100.00	-6.4	-3.6
C	pressure_vacuum	255	255	100.00	-6.2	-3.8
D	pressure_vacuum	495	495	100.00	-6.6	-3.6
B	mnf_flow	983	476	48.42	-100.2	213.4
C	mnf_flow	255	118	46.27	-100.2	182.0
D	mnf_flow	495	222	44.85	-100.2	204.6
A	mnf_flow	225	85	37.78	-100.2	216.2
C	hyd_pressure3	255	8	3.14	-1.2	43.8
A	hyd_pressure3	224	7	3.12	-1.2	43.8
B	hyd_pressure3	980	27	2.76	-1.2	45.6
A	hyd_pressure1	225	6	2.67	-0.8	53.8
D	hyd_pressure3	488	13	2.66	-1.2	50.0
C	hyd_pressure1	255	5	1.96	-0.8	43.2
D	hyd_pressure1	489	8	1.64	-0.8	47.0
B	hyd_pressure1	981	12	1.22	-0.8	58.0

Outliers Compared With PH

After finding outliers, I compare them with PH. This helps answer whether unusual predictor values are connected to unusual PH values.

If outlier rows have different PH from non-outlier rows, then removing them may remove important production information. Because of this, I do not want to delete outliers only because they are statistically unusual.

train_features_with_ph <- train_features_raw |>
  mutate(ph = train_ph)

Now I will create outlier flags within each brand group:

brand_outlier_ph_data <- train_features_with_ph |>
  mutate(row_id = row_number()) |>
  filter(!is.na(brand_code)) |>
  select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  mutate(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    is_outlier = value < lower_bound | value > upper_bound
  ) |>
  ungroup()

This will summarize PH for outlier vs non-outlier rows:

outlier_ph_summary <- brand_outlier_ph_data |>
  filter(!is.na(value)) |>
  group_by(variable, is_outlier) |>
  summarise(
    count = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = is_outlier,
    values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
    names_prefix = "outlier_"
  ) |>
  mutate(
    mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
    abs_mean_ph_difference = abs(mean_ph_difference)
  ) |>
  arrange(desc(abs_mean_ph_difference))

outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 15) |>
  kable(
    caption = "Top PH Differences for Outlier Rows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Top PH Differences for Outlier Rows
variable	count_outlier_FALSE	count_outlier_TRUE	mean_ph_outlier_FALSE	mean_ph_outlier_TRUE	median_ph_outlier_FALSE	median_ph_outlier_TRUE	sd_ph_outlier_FALSE	sd_ph_outlier_TRUE	min_ph_outlier_FALSE	min_ph_outlier_TRUE	max_ph_outlier_FALSE	max_ph_outlier_TRUE	mean_ph_difference	abs_mean_ph_difference
carb_pressure1	1919	15	8.5497	8.6680	8.56	8.62	0.1701	0.2733	7.88	8.38	8.94	9.36	0.1183	0.1183
filler_level	1900	46	8.5503	8.4365	8.56	8.45	0.1720	0.1069	7.88	8.22	8.94	8.76	-0.1138	0.1138
carb_rel	1882	70	8.5513	8.4446	8.56	8.46	0.1690	0.2020	7.88	7.90	8.94	8.94	-0.1068	0.1068
temperature	1859	90	8.5495	8.4969	8.56	8.50	0.1703	0.1853	7.88	8.08	8.94	8.92	-0.0527	0.0527
carb_volume	1911	39	8.5488	8.4964	8.54	8.52	0.1721	0.1887	7.88	8.06	9.36	8.86	-0.0524	0.0524
oxygen_filler	1811	139	8.5437	8.5957	8.54	8.60	0.1724	0.1642	7.88	8.20	8.94	9.36	0.0520	0.0520
pressure_setpoint	1790	159	8.5452	8.5877	8.54	8.60	0.1760	0.1156	7.88	8.24	9.36	8.84	0.0425	0.0425
pressure_vacuum	1920	38	8.5487	8.5068	8.56	8.46	0.1727	0.1522	7.88	8.26	9.36	8.78	-0.0419	0.0419
mfr	1619	175	8.5509	8.5123	8.56	8.50	0.1702	0.1675	7.88	8.08	8.94	8.92	-0.0386	0.0386
fill_ounces	1892	40	8.5484	8.5145	8.54	8.55	0.1722	0.2026	7.88	8.02	9.36	8.86	-0.0339	0.0339
hyd_pressure4	1779	159	8.5448	8.5780	8.54	8.58	0.1700	0.1873	7.88	8.02	8.94	8.94	0.0332	0.0332
psc_co2	1834	90	8.5492	8.5169	8.56	8.51	0.1713	0.1875	7.88	8.08	9.36	8.94	-0.0324	0.0324
psc_fill	1892	49	8.5470	8.5780	8.54	8.60	0.1715	0.2126	7.88	8.12	9.36	8.94	0.0310	0.0310
balling_lvl	1906	52	8.5471	8.5777	8.54	8.58	0.1708	0.2228	7.88	8.10	8.94	9.36	0.0306	0.0306
fill_pressure	1860	88	8.5488	8.5189	8.56	8.50	0.1724	0.1533	7.88	8.22	8.94	8.88	-0.0299	0.0299

top_outlier_ph_vars <- outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 6) |>
  pull(variable)

brand_outlier_ph_data |>
  filter(variable %in% top_outlier_ph_vars) |>
  mutate(
    outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier")
  ) |>
  ggplot(
    aes(x = outlier_status, y = ph)
  ) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    title = "PH Distribution for Outlier vs Non-Outlier Rows",
    x = "Outlier Status",
    y = "PH"
  ) +
  theme_minimal()

Brand-Level Outliers Compared With PH

Since outliers were identified within each brand_code group, I also compare them with PH within each brand_code group. This is more logical than combining all brands together, because each brand may have different normal production behavior.

This check helps me see whether outlier values for a specific brand are connected with different PH values.

train_features_with_ph <- train_features_raw |>
  mutate(ph = train_ph)

brand_outlier_ph_data <- train_features_with_ph |>
  mutate(row_id = row_number()) |>
  filter(!is.na(brand_code)) |>
  select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  mutate(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    is_outlier = value < lower_bound | value > upper_bound
  ) |>
  ungroup()

brand_outlier_ph_summary <- brand_outlier_ph_data |>
  filter(!is.na(value)) |>
  group_by(brand_code, variable, is_outlier) |>
  summarise(
    count = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = is_outlier,
    values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
    names_prefix = "outlier_"
  ) |>
  mutate(
    mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
    abs_mean_ph_difference = abs(mean_ph_difference)
  ) |>
  arrange(desc(abs_mean_ph_difference))

top_brand_outlier_ph_differences <- brand_outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 20)

top_brand_outlier_ph_differences |>
  kable(
    caption = "Top PH Differences for Brand-Level Outlier Rows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Top PH Differences for Brand-Level Outlier Rows
brand_code	variable	count_outlier_FALSE	count_outlier_TRUE	mean_ph_outlier_FALSE	mean_ph_outlier_TRUE	median_ph_outlier_FALSE	median_ph_outlier_TRUE	sd_ph_outlier_FALSE	sd_ph_outlier_TRUE	min_ph_outlier_FALSE	min_ph_outlier_TRUE	max_ph_outlier_FALSE	max_ph_outlier_TRUE	mean_ph_difference	abs_mean_ph_difference
C	oxygen_filler	239	13	8.4049	8.5892	8.42	8.58	0.1675	0.2872	7.88	8.26	8.80	9.36	0.1844	0.1844
C	carb_rel	228	24	8.4211	8.3117	8.43	8.33	0.1630	0.1910	7.88	7.90	8.80	8.62	-0.1095	0.1095
B	temperature	929	51	8.5727	8.4639	8.56	8.44	0.1659	0.1698	8.00	8.18	8.94	8.86	-0.1088	0.1088
C	density	230	25	8.4277	8.3200	8.44	8.26	0.1558	0.3185	7.90	7.88	8.80	9.36	-0.1077	0.1077
C	filler_speed	201	48	8.3945	8.4975	8.42	8.51	0.1612	0.1888	7.88	8.02	8.80	8.80	0.1030	0.1030
B	carb_temp	960	13	8.5688	8.4723	8.56	8.46	0.1680	0.1372	8.00	8.24	8.94	8.80	-0.0965	0.0965
D	carb_rel	471	24	8.6061	8.5217	8.62	8.55	0.1363	0.1487	8.24	8.16	8.92	8.78	-0.0844	0.0844
B	carb_volume	963	16	8.5689	8.4875	8.56	8.48	0.1677	0.1805	8.00	8.08	8.94	8.80	-0.0814	0.0814
A	mfr	171	23	8.4916	8.4130	8.52	8.44	0.1563	0.1784	8.06	8.12	8.86	8.74	-0.0785	0.0785
A	filler_speed	183	34	8.4833	8.5524	8.52	8.59	0.1574	0.1773	8.06	8.20	8.86	8.86	0.0691	0.0691
A	oxygen_filler	202	21	8.4830	8.5514	8.50	8.58	0.1607	0.1612	8.06	8.20	8.86	8.86	0.0685	0.0685
B	psc_fill	946	28	8.5661	8.6329	8.56	8.71	0.1674	0.1813	8.00	8.26	8.94	8.94	0.0668	0.0668
B	balling_lvl	952	31	8.5656	8.6316	8.56	8.68	0.1675	0.1716	8.00	8.22	8.94	8.94	0.0660	0.0660
D	psc_fill	481	10	8.6001	8.6660	8.60	8.65	0.1381	0.1310	8.16	8.50	8.92	8.92	0.0659	0.0659
D	fill_pressure	443	48	8.6086	8.5446	8.62	8.52	0.1377	0.1286	8.16	8.24	8.92	8.82	-0.0640	0.0640
D	filler_level	477	14	8.6044	8.5429	8.62	8.54	0.1386	0.0938	8.16	8.40	8.92	8.76	-0.0615	0.0615
C	fill_pressure	242	10	8.4097	8.4700	8.42	8.49	0.1714	0.1367	7.88	8.26	8.80	8.62	0.0603	0.0603
B	fill_ounces	959	13	8.5677	8.5077	8.56	8.52	0.1681	0.1916	8.00	8.18	8.94	8.80	-0.0600	0.0600
D	balling_lvl	485	10	8.6032	8.5440	8.62	8.56	0.1368	0.1849	8.24	8.16	8.92	8.78	-0.0592	0.0592
D	psc	478	12	8.6030	8.5450	8.62	8.55	0.1388	0.1138	8.16	8.34	8.92	8.70	-0.0580	0.0580

The brand-level PH comparison shows that outliers do not behave the same way for every brand.

For Brand C, oxygen_filler outlier rows had average PH of 8.5892, while non-outlier rows had average PH of 8.4049.
This is a difference of about +0.18 PH. Since PH values in this dataset do not vary very widely, this difference is meaningful. It suggests that high or unusual oxygen_filler values may be connected with higher PH for Brand C.
For Brand C, carb_rel outlier rows had average PH of 8.3117, while non-outlier rows had average PH of 8.4211.
This is a difference of about -0.11 PH. This means unusual carb_rel values for Brand C may be connected with lower PH.
For Brand B, temperature outlier rows had average PH of 8.4639, while non-outlier rows had average PH of 8.5727.
This is a difference of about -0.11 PH. This suggests that unusual temperature values may matter for Brand B and should not be removed without thinking.
For Brand A, mfr outlier rows had average PH of 8.4130, while non-outlier rows had average PH of 8.4916.
This is a smaller difference, about -0.08 PH, but it still shows that unusual mfr values may have some relationship with PH.

These examples show that some outliers may contain useful production information. If I remove all outliers without proper analysis, I might remove cases where the process behaves differently and PH changes.

For this project, I will use brand-level capping for selected continuous variables. This means the outlier limits are calculated separately inside each Brand Code group, not from the full dataset. This is more logical because each brand may have its own normal operating range.

I will not cap setting-like or grouped variables such as pressure_setpoint, bowl_setpoint, air_pressurer, psc_co2, and pressure_vacuum. These variables look like they have fixed or grouped values by design, so regular IQR outlier rules may not make sense for them.

Overall, I do not treat outliers as simple errors. I treat them as unusual production values that need careful handling. Capping will be documented as a preprocessing step, and later model performance can show whether this treatment helps or hurts prediction.

vars_excluded_from_capping <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

top_brand_outlier_vars <- top_brand_outlier_ph_differences |>
  distinct(brand_code, variable) |>
  slice_head(n = 8)

# brand_outlier_ph_data |>
#   semi_join(
#     top_brand_outlier_vars,
#     by = c("brand_code", "variable")
#   ) |>
#   mutate(
#     outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier"),
#     brand_variable = paste("Brand", brand_code, "-", variable)
#   ) |>
#   ggplot(
#     aes(x = outlier_status, y = ph)
#   ) +
#   geom_boxplot() +
#   facet_wrap(~ brand_variable, scales = "free_x") +
#   labs(
#     title = "PH Distribution for Brand-Level Outlier vs Non-Outlier Rows",
#     x = "Outlier Status",
#     y = "PH"
#   ) +
#   theme_minimal()

Brand-Level Outlier Capping

Based on the outlier review, I apply capping to selected continuous numeric predictors. The caps are calculated from the training data only and separately within each Brand Code group.

This avoids using validation or final evaluation data to make preprocessing decisions. It also keeps the capping process more fair because each brand can have different normal operating ranges.

I do not cap setting-like or grouped variables because IQR limits may not make sense for them. In the first version, I exclude pressure_setpoint, bowl_setpoint, air_pressurer, psc_co2, and pressure_vacuum from capping.

pressure_setpoint and bowl_setpoint both look like process settings with limited possible values. air_pressurer and psc_co2 have grouped or repeated values, so regular IQR capping may over-treat them. pressure_vacuum is excluded because negative values appear to be part of how this variable is measured.

vars_excluded_from_capping <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
  
  numeric_vars <- data_x |>
    select(where(is.numeric)) |>
    names()
  
  numeric_vars[
    !numeric_vars %in% excluded_vars
  ]
}

create_capping_limits <- function(train_x, numeric_vars_to_cap, brand_col = "brand_code") {
  
  brand_cap_limits <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    select(all_of(brand_col), all_of(numeric_vars_to_cap)) |>
    pivot_longer(
      cols = all_of(numeric_vars_to_cap),
      names_to = "variable",
      values_to = "value"
    ) |>
    group_by(.data[[brand_col]], variable) |>
    summarise(
      q1 = quantile(value, 0.25, na.rm = TRUE),
      q3 = quantile(value, 0.75, na.rm = TRUE),
      iqr_value = q3 - q1,
      lower_cap = q1 - 1.5 * iqr_value,
      upper_cap = q3 + 1.5 * iqr_value,
      .groups = "drop"
    ) |>
    rename(brand_code = all_of(brand_col))
  
  global_cap_limits <- train_x |>
    select(all_of(numeric_vars_to_cap)) |>
    pivot_longer(
      cols = all_of(numeric_vars_to_cap),
      names_to = "variable",
      values_to = "value"
    ) |>
    group_by(variable) |>
    summarise(
      q1 = quantile(value, 0.25, na.rm = TRUE),
      q3 = quantile(value, 0.75, na.rm = TRUE),
      iqr_value = q3 - q1,
      lower_cap = q1 - 1.5 * iqr_value,
      upper_cap = q3 + 1.5 * iqr_value,
      .groups = "drop"
    )
  
  list(
    brand_cap_limits = brand_cap_limits,
    global_cap_limits = global_cap_limits
  )
}

apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
  
  capped_data <- data_x
  
  brand_cap_limits <- cap_limits$brand_cap_limits
  global_cap_limits <- cap_limits$global_cap_limits
  
  for (var in numeric_vars_to_cap) {
    
    for (brand in unique(brand_cap_limits$brand_code)) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == brand,
          variable == var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[var]][rows_to_cap] <- pmin(
          pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
    
    global_caps <- global_cap_limits |>
      filter(variable == var)
    
    missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
    
    if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
      
      capped_data[[var]][missing_brand_rows] <- pmin(
        pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
        global_caps$upper_cap
      )
    }
  }
  
  capped_data
}

apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
  
  capped_data <- data_x
  
  brand_cap_limits <- cap_limits$brand_cap_limits
  global_cap_limits <- cap_limits$global_cap_limits
  
  for (var in numeric_vars_to_cap) {
    
    for (brand in unique(brand_cap_limits$brand_code)) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == brand,
          variable == var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[var]][rows_to_cap] <- pmin(
          pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
    
    global_caps <- global_cap_limits |>
      filter(variable == var)
    
    missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
    
    if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
      
      capped_data[[var]][missing_brand_rows] <- pmin(
        pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
        global_caps$upper_cap
      )
    }
  }
  
  capped_data
}

summarize_capping_by_brand <- function(original_x, capped_x, numeric_vars_to_cap, dataset_name) {
  
  map_dfr(
    numeric_vars_to_cap,
    function(var) {
      
      original_values <- original_x[[var]]
      capped_values <- capped_x[[var]]
      
      changed_values <- !is.na(original_values) &
        !is.na(capped_values) &
        original_values != capped_values
      
      tibble(
        dataset = dataset_name,
        variable = var,
        brand_code = original_x$brand_code,
        was_capped = changed_values
      ) |>
        mutate(
          brand_code = ifelse(
            is.na(brand_code),
            "Missing Brand",
            as.character(brand_code)
          )
        ) |>
        group_by(dataset, variable, brand_code) |>
        summarise(
          values_capped = sum(was_capped),
          .groups = "drop"
        )
    }
  ) |>
    pivot_wider(
      names_from = brand_code,
      values_from = values_capped,
      values_fill = 0
    ) |>
    mutate(
      Total = rowSums(
        across(
          where(is.numeric)
        )
      )
    )
}

First Capping Attempt

First, I run the capping process using the original excluded variable list. This helps me see how much the capping rule changes the data before deciding whether the excluded list needs to be updated.

vars_excluded_from_capping_v1 <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

numeric_vars_to_cap_v1 <- get_numeric_vars_to_cap(
  data_x = train_features_raw,
  excluded_vars = vars_excluded_from_capping_v1
)

numeric_vars_to_cap_v1

 [1] "carb_volume"    "fill_ounces"    "pc_volume"      "carb_pressure" 
 [5] "carb_temp"      "psc"            "psc_fill"       "mnf_flow"      
 [9] "carb_pressure1" "fill_pressure"  "hyd_pressure1"  "hyd_pressure2" 
[13] "hyd_pressure3"  "hyd_pressure4"  "filler_level"   "filler_speed"  
[17] "temperature"    "usage_cont"     "carb_flow"      "density"       
[21] "mfr"            "balling"        "oxygen_filler"  "alch_rel"      
[25] "carb_rel"       "balling_lvl"

cap_limits_v1 <- create_capping_limits(
  train_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  brand_col = "brand_code"
)

train_features_capped_v1 <- apply_brand_caps(
  data_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)

valid_features_capped_v1 <- apply_brand_caps(
  data_x = valid_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)

eval_features_capped_v1 <- apply_brand_caps(
  data_x = eval_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)

get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
  
  numeric_vars <- data_x |>
    select(where(is.numeric)) |>
    names()
  
  numeric_vars[
    !numeric_vars %in% excluded_vars
  ]
}

capping_summary_v1 <- bind_rows(
  summarize_capping_by_brand(
    original_x = train_features_raw,
    capped_x = train_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Training"
  ),
  
  summarize_capping_by_brand(
    original_x = valid_features_raw,
    capped_x = valid_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Validation"
  ),
  
  summarize_capping_by_brand(
    original_x = eval_features_raw,
    capped_x = eval_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Final Evaluation"
  )
) |>
  mutate(
    dataset = factor(
      dataset,
      levels = c("Training", "Validation", "Final Evaluation")
    )
  ) |>
  arrange(dataset, desc(Total))

capping_summary_v1 |>
  kable(
    caption = "First Capping Attempt: Values Capped by Brand Code",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

First Capping Attempt: Values Capped by Brand Code
dataset	variable	A	B	C	D	Missing Brand	Total
Training	filler_speed	34	147	48	91	28	348
Training	mfr	23	70	31	51	19	194
Training	hyd_pressure4	4	90	4	61	5	164
Training	density	20	58	25	49	0	152
Training	oxygen_filler	21	60	13	45	8	147
Training	temperature	11	51	8	20	16	106
Training	fill_pressure	9	21	10	48	7	95
Training	pc_volume	3	28	10	27	13	81
Training	carb_rel	7	15	24	24	0	70
Training	alch_rel	16	19	1	13	5	54
Training	balling_lvl	9	31	2	10	0	52
Training	psc_fill	5	28	6	10	1	50
Training	balling	5	30	5	10	0	50
Training	psc	8	22	4	12	3	49
Training	fill_ounces	6	13	11	10	7	47
Training	filler_level	0	0	32	14	0	46
Training	carb_pressure	3	25	2	10	2	42
Training	carb_volume	4	16	5	14	1	40
Training	carb_temp	0	13	1	15	2	31
Training	carb_pressure1	2	7	1	5	2	17
Training	hyd_pressure1	1	2	0	0	1	4
Training	mnf_flow	0	0	0	0	0	0
Training	hyd_pressure2	0	0	0	0	0	0
Training	hyd_pressure3	0	0	0	0	0	0
Training	usage_cont	0	0	0	0	0	0
Training	carb_flow	0	0	0	0	0	0
Validation	filler_speed	8	33	9	20	9	79
Validation	mfr	7	24	6	14	7	58
Validation	oxygen_filler	3	15	3	12	1	34
Validation	hyd_pressure4	1	18	1	12	0	32
Validation	temperature	2	19	2	3	2	28
Validation	density	4	13	2	9	0	28
Validation	carb_rel	1	3	10	5	1	20
Validation	pc_volume	0	4	3	8	2	17
Validation	psc_fill	1	9	1	5	1	17
Validation	fill_pressure	2	8	3	4	0	17
Validation	balling_lvl	3	9	1	3	0	16
Validation	alch_rel	5	5	1	3	0	14
Validation	carb_pressure	2	6	0	2	1	11
Validation	filler_level	0	0	8	2	0	10
Validation	balling	1	5	1	3	0	10
Validation	carb_volume	1	7	0	1	0	9
Validation	fill_ounces	0	5	1	1	1	8
Validation	carb_temp	3	3	0	1	1	8
Validation	psc	1	1	3	2	0	7
Validation	carb_pressure1	0	1	0	0	2	3
Validation	hyd_pressure1	0	1	0	0	0	1
Validation	mnf_flow	0	0	0	0	0	0
Validation	hyd_pressure2	0	0	0	0	0	0
Validation	hyd_pressure3	0	0	0	0	0	0
Validation	usage_cont	0	0	0	0	0	0
Validation	carb_flow	0	0	0	0	0	0
Final Evaluation	filler_speed	5	20	6	14	2	47
Final Evaluation	hyd_pressure4	1	16	0	14	2	33
Final Evaluation	mfr	6	10	6	8	2	32
Final Evaluation	density	2	11	3	6	0	22
Final Evaluation	temperature	2	8	0	6	4	20
Final Evaluation	oxygen_filler	3	5	4	5	1	18
Final Evaluation	fill_pressure	1	2	0	10	1	14
Final Evaluation	balling	1	9	0	4	0	14
Final Evaluation	psc	3	5	1	2	0	11
Final Evaluation	carb_volume	2	5	1	1	0	9
Final Evaluation	carb_rel	0	2	2	5	0	9
Final Evaluation	pc_volume	2	2	0	4	0	8
Final Evaluation	balling_lvl	2	5	0	1	0	8
Final Evaluation	carb_pressure	2	3	0	1	0	6
Final Evaluation	alch_rel	1	4	0	1	0	6
Final Evaluation	carb_temp	1	2	0	1	0	4
Final Evaluation	psc_fill	1	1	0	2	0	4
Final Evaluation	fill_ounces	1	0	1	1	0	3
Final Evaluation	filler_level	0	0	1	1	0	2
Final Evaluation	carb_pressure1	1	0	0	0	0	1
Final Evaluation	hyd_pressure1	1	0	0	0	0	1
Final Evaluation	mnf_flow	0	0	0	0	0	0
Final Evaluation	hyd_pressure2	0	0	0	0	0	0
Final Evaluation	hyd_pressure3	0	0	0	0	0	0
Final Evaluation	usage_cont	0	0	0	0	0	0
Final Evaluation	carb_flow	0	0	0	0	0	0

The first capping attempt showed that some variables were changed a lot. The largest issue was filler_speed. In the training data, filler_speed had 348 capped values, which is a large number.

This made me review the earlier brand-level outlier table more carefully. filler_speed had high outlier percentages across all brands:

Brand C had about 19.28% outliers for filler_speed.
Brand D had about 18.76% outliers for filler_speed.
Brand A had about 15.67% outliers for filler_speed.
Brand B had about 15.22% outliers for filler_speed.

The low filler_speed values are around 998 to 1004, while most normal values are closer to 3900 to 4000. This may look like an error at first, but because the same pattern appears across all brands, it may represent a real slow-speed production state, startup/shutdown condition, or machine slowdown.

Because of this, I do not want to cap filler_speed. Capping these low values may hide a real condition that could be important for predicting PH.

The first capping attempt was still useful because it showed that the original capping rule was too aggressive for filler_speed. Based on this review, I update the excluded variable list and rerun the capping process.

Outlier Treatment Decision

vars_excluded_from_capping_v2 <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

numeric_vars_to_cap_v2 <- get_numeric_vars_to_cap(
  data_x = train_features_raw,
  excluded_vars = vars_excluded_from_capping_v2
)

numeric_vars_to_cap_v2

 [1] "carb_volume"    "fill_ounces"    "pc_volume"      "carb_pressure" 
 [5] "carb_temp"      "psc"            "psc_fill"       "mnf_flow"      
 [9] "carb_pressure1" "fill_pressure"  "hyd_pressure1"  "hyd_pressure2" 
[13] "hyd_pressure3"  "hyd_pressure4"  "filler_level"   "filler_speed"  
[17] "temperature"    "usage_cont"     "carb_flow"      "density"       
[21] "mfr"            "balling"        "oxygen_filler"  "alch_rel"      
[25] "carb_rel"       "balling_lvl"

ph_difference_cutoff <- 0.05
min_outlier_count_for_rule <- 10

cap_allowed_combos <- brand_outlier_ph_summary |>
  filter(
    count_outlier_TRUE >= min_outlier_count_for_rule,
    abs_mean_ph_difference < ph_difference_cutoff
  ) |>
  select(
    brand_code,
    variable,
    count_outlier_TRUE,
    mean_ph_difference,
    abs_mean_ph_difference
  ) |>
  arrange(abs_mean_ph_difference)

cap_allowed_combos |>
  kable(
    caption = "Brand-Variable Outlier Groups Allowed for Capping",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Brand-Variable Outlier Groups Allowed for Capping
brand_code	variable	count_outlier_TRUE	mean_ph_difference	abs_mean_ph_difference
D	filler_speed	91	0.0015	0.0015
B	filler_speed	147	0.0022	0.0022
C	air_pressurer	28	0.0033	0.0033
B	hyd_pressure4	90	0.0039	0.0039
B	air_pressurer	120	-0.0059	0.0059
D	psc_co2	17	0.0064	0.0064
D	air_pressurer	22	-0.0097	0.0097
D	hyd_pressure4	61	0.0140	0.0140
A	density	20	0.0143	0.0143
C	psc_co2	20	-0.0160	0.0160
B	oxygen_filler	60	0.0166	0.0166
D	carb_pressure	10	-0.0174	0.0174
D	carb_volume	14	-0.0181	0.0181
B	psc	22	-0.0202	0.0202
D	balling	10	-0.0204	0.0204
D	pressure_setpoint	159	-0.0215	0.0215
B	density	58	-0.0254	0.0254
D	density	49	0.0263	0.0263
C	filler_level	32	-0.0268	0.0268
B	psc_co2	48	-0.0277	0.0277
C	bowl_setpoint	32	-0.0281	0.0281
B	pc_volume	28	-0.0288	0.0288
B	balling	30	0.0299	0.0299
A	air_pressurer	18	-0.0300	0.0300
A	pressure_vacuum	10	-0.0303	0.0303
B	carb_rel	15	-0.0313	0.0313
B	carb_pressure	25	-0.0333	0.0333
D	temperature	20	-0.0333	0.0333
D	mfr	51	-0.0334	0.0334
D	bowl_setpoint	110	-0.0353	0.0353
D	oxygen_filler	45	0.0369	0.0369
D	fill_ounces	10	0.0386	0.0386
B	mfr	70	-0.0437	0.0437
C	pc_volume	10	0.0458	0.0458
C	mfr	31	0.0462	0.0462
B	alch_rel	19	0.0467	0.0467
D	carb_temp	15	-0.0478	0.0478
C	fill_ounces	11	0.0488	0.0488

brand_cap_limits_v2 <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_to_cap_v2)) |>
  pivot_longer(
    cols = all_of(numeric_vars_to_cap_v2),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_cap = q1 - 1.5 * iqr_value,
    upper_cap = q3 + 1.5 * iqr_value,
    .groups = "drop"
  )

apply_brand_caps_ph_allowed <- function(
  data_x,
  numeric_vars_to_cap,
  brand_cap_limits,
  cap_allowed_combos,
  brand_col = "brand_code"
) {
  
  capped_data <- data_x
  
  for (i in seq_len(nrow(cap_allowed_combos))) {
    
    current_brand <- cap_allowed_combos$brand_code[i]
    current_var <- cap_allowed_combos$variable[i]
    
    if (current_var %in% numeric_vars_to_cap) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == current_brand,
          variable == current_var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == current_brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[current_var]][rows_to_cap] <- pmin(
          pmax(capped_data[[current_var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
  }
  
  capped_data
}

train_features_capped <- apply_brand_caps_ph_allowed(
  data_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)

valid_features_capped <- apply_brand_caps_ph_allowed(
  data_x = valid_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)

eval_features_capped <- apply_brand_caps_ph_allowed(
  data_x = eval_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)

capping_summary_v2 <- bind_rows(
  summarize_capping_by_brand(
    original_x = train_features_raw,
    capped_x = train_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Training"
  ),
  
  summarize_capping_by_brand(
    original_x = valid_features_raw,
    capped_x = valid_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Validation"
  ),
  
  summarize_capping_by_brand(
    original_x = eval_features_raw,
    capped_x = eval_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Final Evaluation"
  )
) |>
  mutate(
    dataset = factor(
      dataset,
      levels = c("Training", "Validation", "Final Evaluation")
    )
  ) |>
  filter(Total > 0) |>
  arrange(dataset, desc(Total))

capping_summary_v2 |>
  slice_max(
    order_by = Total,
    n = 15,
    with_ties = FALSE
  ) |>
  kable(
    caption = "Top 15 Variables with Values Capped by Brand-Level PH-Difference Rule. 
    
                Counts of variables' values capped are in each Brand Code group",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top 15 Variables with Values Capped by Brand-Level PH-Difference Rule.

`Counts of variables' values capped are in each Brand Code group</caption>`
dataset	variable	A	B	C	D	Total
Training	filler_speed	0	147	0	91	238
Training	mfr	0	70	31	51	152
Training	hyd_pressure4	0	90	0	61	151
Training	density	20	58	0	49	127
Training	oxygen_filler	0	60	0	45	105
Validation	filler_speed	0	33	0	20	53
Validation	mfr	0	24	6	14	44
Training	balling	0	30	0	10	40
Training	pc_volume	0	28	10	0	38
Training	carb_pressure	0	25	0	10	35
Final Evaluation	filler_speed	0	20	0	14	34
Training	filler_level	0	0	32	0	32
Validation	hyd_pressure4	0	18	0	12	30
Final Evaluation	hyd_pressure4	0	16	0	14	30
Validation	oxygen_filler	0	15	0	12	27

After reviewing the outlier-PH comparison, I updated the capping strategy to be selective by both Brand Code and variable.

The PH-difference value compares the average PH of outlier rows with the average PH of non-outlier rows within the same Brand Code and variable group. For example, if Brand B has outliers in filler_speed, I compare the average PH for Brand B rows where filler_speed is an outlier against the average PH for Brand B rows where filler_speed is not an outlier.

This helps separate outliers that may contain useful production signal from outliers that are more likely just extreme numeric values. If the PH difference is large, the outlier group may be related to changes in PH, so capping those values could remove useful information. If the PH difference is small, the outlier group does not appear to behave very differently in terms of PH, so capping is safer.

For this project, I will use averages difference of 0.05 PH as the cutoff. If a specific Brand Code and variable combination has an absolute PH difference below 0.05, capping is allowed for that group. If the absolute PH difference is 0.05 or larger, that group is not capped because those outliers may contain useful signal for predicting PH.

Capping will not be applied globally to an entire variable. It is applied only to the specific Brand Code value’s group and variable combination that meets the rule. For example, if Brand B filler_speed has a very small PH difference, then Brand B filler_speedoutliers can be capped. This does not mean filler_speed is automatically capped for Brands A, C, or D.

For example, filler_speed shows how the brand-level rule works. In the training data, filler_speed had 147 capped values for Brand B and 91 capped values for Brand D, for a total of 238 capped values.

dataset	variable	A	B	C	D	Missing Brand	Total
Training	filler_speed	0	147	0	91	0	238

This does not mean filler_speed was capped globally for every brand. It means the PH-difference rule allowed capping for specific brand groups where the outlier values did not show meaningful PH difference. In this case, Brand B and Brand D filler_speed values were capped, while Brand A and Brand C filler_speed values were not capped.

The final capped feature sets are then passed into the imputation workflow.

Missing Values Imputation Strategy

The preprocessing review showed that predictors’ missing values exist in the training, validation, and final evaluation data. Since many models (for example linear models) cannot use missing predictor values directly, I want to create completed feature sets before modeling. Also, i don’t want to remove rows with missing values.

The important point is that imputation must be learned from the training data only. Validation and final evaluation data should receive the same imputation process, but they should not be used to learn imputation values. This will help to avoid data leakage.

PH is not used during imputation because PH is the target variable. Also, PH is not available in the final evaluation data, so using PH during imputation would not match the real prediction situation.

I will compare multiple imputation approaches and later test them with different model types. The final imputation choice will be based on validation performance, not just assumptions.

Imputation Setup

Before creating imputed feature sets, I make sure categorical variables have consistent factor levels across training, validation, and final evaluation data.

brand_levels <- train_features_raw |>
  pull(brand_code) |>
  as.factor() |>
  levels()

train_features_raw <- train_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )

valid_features_raw <- valid_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )

eval_features_raw <- eval_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )

pre_imputation_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  total_missing_values = c(
    sum(is.na(train_features_raw)),
    sum(is.na(valid_features_raw)),
    sum(is.na(eval_features_raw))
  )
)

pre_imputation_missing_check |>
  kable(
    caption = "Total Missing Predictor Values Before Imputation"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Total Missing Predictor Values Before Imputation
dataset	total_missing_values
Training	657
Validation	155
Final Evaluation	107

This check confirms that missing predictor values are present before imputation. The next step is to create completed versions of the training, validation, and final evaluation feature sets.

get_numeric_predictors <- function(data_x) {
  
  data_x |>
    select(where(is.numeric)) |>
    names()
}

impute_numeric_by_brand_tree <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  fallback_recipe <- recipe(~ ., data = train_x) |>
    step_unknown(
      all_of(brand_col),
      new_level = "Unknown"
    ) |>
    step_impute_bag(
      all_of(numeric_vars),
      impute_with = imp_vars(all_predictors()),
      trees = trees
    )
  
  fallback_fit <- prep(
    fallback_recipe,
    training = train_x
  )
  
  train_fallback <- bake(
    fallback_fit,
    new_data = train_x
  )
  
  valid_fallback <- bake(
    fallback_fit,
    new_data = valid_x
  )
  
  eval_fallback <- bake(
    fallback_fit,
    new_data = eval_x
  )
  
  train_out[numeric_vars] <- train_fallback[numeric_vars]
  valid_out[numeric_vars] <- valid_fallback[numeric_vars]
  eval_out[numeric_vars] <- eval_fallback[numeric_vars]
  
  known_brands <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    distinct(.data[[brand_col]]) |>
    pull(.data[[brand_col]])
  
  for (brand in known_brands) {
    
    train_rows <- which(train_x[[brand_col]] == brand)
    valid_rows <- which(valid_x[[brand_col]] == brand)
    eval_rows <- which(eval_x[[brand_col]] == brand)
    
    brand_train_numeric <- train_x[train_rows, numeric_vars]
    
    if (nrow(brand_train_numeric) >= min_brand_rows) {
      
      brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
        step_impute_bag(
          all_numeric_predictors(),
          impute_with = imp_vars(all_numeric_predictors()),
          trees = trees
        )
      
      brand_fit <- prep(
        brand_recipe,
        training = brand_train_numeric
      )
      
      train_out[train_rows, numeric_vars] <- bake(
        brand_fit,
        new_data = train_x[train_rows, numeric_vars]
      )
      
      if (length(valid_rows) > 0) {
        valid_out[valid_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = valid_x[valid_rows, numeric_vars]
        )
      }
      
      if (length(eval_rows) > 0) {
        eval_out[eval_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = eval_x[eval_rows, numeric_vars]
        )
      }
    }
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars
  )
}

train_features_for_imputation <- train_features_capped
valid_features_for_imputation <- valid_features_capped
eval_features_for_imputation <- eval_features_capped

check_numeric_missing <- function(train_x, valid_x, eval_x, numeric_vars) {
  
  tibble(
    dataset = c("Training", "Validation", "Final Evaluation"),
    numeric_missing_count = c(
      sum(is.na(train_x |> select(all_of(numeric_vars)))),
      sum(is.na(valid_x |> select(all_of(numeric_vars)))),
      sum(is.na(eval_x |> select(all_of(numeric_vars))))
    )
  )
}

collect_imputed_values <- function(before_x, after_x, numeric_vars, brand_col = "brand_code") {
  
  imputed_values <- tibble()
  
  for (var in numeric_vars) {
    
    missing_rows <- which(is.na(before_x[[var]]))
    
    if (length(missing_rows) > 0) {
      
      temp_values <- tibble(
        variable = var,
        row_id = missing_rows,
        brand_code = before_x[[brand_col]][missing_rows],
        original_value = before_x[[var]][missing_rows],
        imputed_value = after_x[[var]][missing_rows]
      )
      
      imputed_values <- bind_rows(
        imputed_values,
        temp_values
      )
    }
  }
  
  imputed_values
}

check_imputed_ranges <- function(before_x, after_x, numeric_vars) {
  
  map_dfr(
    numeric_vars,
    function(var) {
      
      missing_rows <- which(is.na(before_x[[var]]))
      
      if (length(missing_rows) == 0) {
        return(NULL)
      }
      
      imputed_values <- after_x[[var]][missing_rows]
      original_values <- before_x[[var]]
      
      tibble(
        variable = var,
        original_min = min(original_values, na.rm = TRUE),
        original_max = max(original_values, na.rm = TRUE),
        min_imputed_value = min(imputed_values, na.rm = TRUE),
        max_imputed_value = max(imputed_values, na.rm = TRUE),
        imputed_below_original_min = any(imputed_values < original_min),
        imputed_above_original_max = any(imputed_values > original_max)
      )
      
    }
  )
}

predict_missing_brand_code <- function(data_x, brand_model, brand_col = "brand_code") {
  
  completed_x <- data_x
  
  missing_brand_rows <- which(is.na(completed_x[[brand_col]]))
  
  if (length(missing_brand_rows) > 0) {
    
    prediction_data <- completed_x[missing_brand_rows, ]
    
    predicted_brand <- predict(
      brand_model,
      data = prediction_data
    )$predictions
    
    completed_x[[brand_col]] <- as.character(completed_x[[brand_col]])
    completed_x[[brand_col]][missing_brand_rows] <- as.character(predicted_brand)
    completed_x[[brand_col]] <- as.factor(completed_x[[brand_col]])
  }
  
  completed_x
}

Hybrid Imputation Approach

After PH-guided brand-level capping, I wil test several imputation workflows. My goal here is not only to fill in missing values, but also to check which imputation strategy creates the best completed dataset for predictions.

The logic of this section is:

Numeric predictors are imputed first.
PCA is used to check whether brand_code has structure in the completed numeric space.
If missing brand_code rows are located near known brand groups, KNN is a reasonable brand_codeimputation method.
Other numeric imputation methods will be applied with the same KNN Brand Code imputation step.
missForest will be apllied separately because it imputes numeric predictors and brand_code together.
All created feature sets are tested using the same Random Forest validation workflow.

This makes the imputation methods comparison practical because the final goal is not just to predict and fill missing values, but to improve PH prediction performance.

get_numeric_predictors <- function(data_x) {
  data_x |>
    select(where(is.numeric)) |>
    names()
}

calculate_regression_metrics <- function(actual, predicted) {
  
  tibble(
    rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
    mae = mean(abs(actual - predicted), na.rm = TRUE),
    rsq = cor(actual, predicted, use = "complete.obs")^2
  )
}

rf_tune_grid_temp <- expand.grid(
  mtry = c(5, 10, 15, 20)
)

rf_control_temp <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3,
  verboseIter = FALSE
)

run_rf_validation_test <- function(
  train_x,
  valid_x,
  train_y,
  valid_y,
  workflow_name,
  project_seed = 123,
  rf_tune_grid = rf_tune_grid_temp,
  rf_control = rf_control_temp,
  ntree = 500
) {
  
  set.seed(project_seed)
  
  rf_time <- system.time({
    
    rf_model <- train(
      x = train_x,
      y = train_y,
      method = "rf",
      trControl = rf_control,
      tuneGrid = rf_tune_grid,
      metric = "RMSE",
      ntree = ntree,
      importance = TRUE
    )
    
  })
  
  rf_pred <- predict(
    rf_model,
    newdata = valid_x
  )
  
  calculate_regression_metrics(
    actual = valid_y,
    predicted = rf_pred
  ) |>
    mutate(
      workflow = workflow_name,
      training_rows = nrow(train_x),
      validation_rows = nrow(valid_x),
      best_mtry = rf_model$bestTune$mtry,
      training_time_seconds = as.numeric(rf_time["elapsed"])
    ) |>
    select(
      workflow,
      training_rows,
      validation_rows,
      best_mtry,
      rmse,
      mae,
      rsq,
      training_time_seconds
    )
}

Tree-Based Numeric Imputation

In this first workflow I will use is tree-based numeric imputation for the numeric predictors. At this step, brand_code is still allowed to be missing, because brand_code will be handled separately later with KNN.

The reason I impute numeric variables first is because I will need full numeric data in order to get right with both PCA and KNN . PCA usually struggle with missing numeric values and might deliver not correct results, and KNN is at best with complete numeric predictors to calculate distance between observations in a vector space.

So the order is:

bagging-tree numeric imputation first
PCA components explorations for all known and missing brands
KNN brand_code imputation

This keeps the workflow more logical and reasonable. The following process starts with imputation of missing numeric predictors values, then I check associations of missing brand_code with known brand_code groups on PCA plot, and then I use KNN to impute brand_code.

impute_numeric_by_brand_tree <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  fallback_recipe <- recipe(~ ., data = train_x) |>
    step_unknown(
      all_of(brand_col),
      new_level = "Unknown"
    ) |>
    step_impute_bag(
      all_of(numeric_vars),
      impute_with = imp_vars(all_predictors()),
      trees = trees
    )
  
  fallback_fit <- prep(
    fallback_recipe,
    training = train_x
  )
  
  train_fallback <- bake(
    fallback_fit,
    new_data = train_x
  )
  
  valid_fallback <- bake(
    fallback_fit,
    new_data = valid_x
  )
  
  eval_fallback <- bake(
    fallback_fit,
    new_data = eval_x
  )
  
  train_out[numeric_vars] <- train_fallback[numeric_vars]
  valid_out[numeric_vars] <- valid_fallback[numeric_vars]
  eval_out[numeric_vars] <- eval_fallback[numeric_vars]
  
  known_brands <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    distinct(.data[[brand_col]]) |>
    pull(.data[[brand_col]])
  
  for (brand in known_brands) {
    
    train_rows <- which(train_x[[brand_col]] == brand)
    valid_rows <- which(valid_x[[brand_col]] == brand)
    eval_rows <- which(eval_x[[brand_col]] == brand)
    
    brand_train_numeric <- train_x[train_rows, numeric_vars]
    
    if (nrow(brand_train_numeric) >= min_brand_rows) {
      
      brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
        step_impute_bag(
          all_numeric_predictors(),
          impute_with = imp_vars(all_numeric_predictors()),
          trees = trees
        )
      
      brand_fit <- prep(
        brand_recipe,
        training = brand_train_numeric
      )
      
      train_out[train_rows, numeric_vars] <- bake(
        brand_fit,
        new_data = train_x[train_rows, numeric_vars]
      )
      
      if (length(valid_rows) > 0) {
        valid_out[valid_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = valid_x[valid_rows, numeric_vars]
        )
      }
      
      if (length(eval_rows) > 0) {
        eval_out[eval_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = eval_x[eval_rows, numeric_vars]
        )
      }
    }
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars
  )
}

brand_tree_numeric_imputed <- impute_numeric_by_brand_tree(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
)

train_features_brand_tree_numeric <- brand_tree_numeric_imputed$train
valid_features_brand_tree_numeric <- brand_tree_numeric_imputed$valid
eval_features_brand_tree_numeric <- brand_tree_numeric_imputed$eval

tree_numeric_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  numeric_missing_values = c(
    train_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum(),
    valid_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum(),
    eval_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum()
  )
)

tree_numeric_missing_check |>
  kable(
    caption = "Missing Numeric Values After Tree-Based Numeric Imputation",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Missing Numeric Values After Tree-Based Numeric Imputation
dataset	numeric_missing_values
Training	0
Validation	0
Final Evaluation	0

As we see, we do not have missing numeric variables after bagging tree imputation.

Now we can take a look at clucters of all brand_code, and see how rows with missing brand_code are spreaded between those clusters.

PCA Review for Missing Brand Code

Here I will use PCA to visually check if brand_code groups have clusters structure in the completed numeric process-variable space.

Here PCA is not used as the final model. It is just a diagnostic visualization. My goal is to see whether rows with missing brand_code are located near known brand_code groups. If missing brand_code rows are near known brand clusters, then KNN will be a reasonable imputation method since it assigns values for missing brand_code using nearby neighbor observations.

The PCA is created using previous bagging tree imputations of missing numeric predictors, because PCA cannot handle missing numeric values directly.

pca_numeric_vars <- get_numeric_predictors(train_features_brand_tree_numeric)

pca_source_data <- bind_rows(
  train_features_brand_tree_numeric |>
    mutate(dataset = "Training"),
  
  valid_features_brand_tree_numeric |>
    mutate(dataset = "Validation"),
  
  eval_features_brand_tree_numeric |>
    mutate(dataset = "Final Evaluation")
) |>
  mutate(
    brand_code_plot = ifelse(
      is.na(brand_code),
      "Missing Brand",
      as.character(brand_code)
    ),
    brand_code_plot = factor(
      brand_code_plot,
      levels = c("A", "B", "C", "D", "Missing Brand")
    )
  )

pca_input <- pca_source_data |>
  select(all_of(pca_numeric_vars))

pca_fit <- prcomp(
  pca_input,
  center = TRUE,
  scale. = TRUE
)

pca_scores <- pca_source_data |>
  select(dataset, brand_code, brand_code_plot) |>
  bind_cols(
    as_tibble(pca_fit$x[, 1:2])
  )

pca_variance <- summary(pca_fit)$importance[2, 1:2] * 100

brand_colors <- c(
  "A" = "#1f77b4",
  "B" = "#ff7f0e",
  "C" = "#2ca02c",
  "D" = "#9467bd",
  "Missing Brand" = "#d62728"
)

brand_shapes <- c(
  "A" = 16,
  "B" = 17,
  "C" = 15,
  "D" = 18,
  "Missing Brand" = 8
)

ggplot(
  pca_scores,
  aes(
    x = PC1,
    y = PC2,
    color = brand_code_plot,
    shape = brand_code_plot
  )
) +
  geom_point(
    alpha = 0.35,
    size = 2
  ) +
  geom_point(
    data = pca_scores |>
      filter(brand_code_plot == "Missing Brand"),
    aes(
      x = PC1,
      y = PC2,
      color = brand_code_plot,
      shape = brand_code_plot
    ),
    size = 4,
    alpha = 0.95
  ) +
  scale_color_manual(
    values = brand_colors
  ) +
  scale_shape_manual(
    values = brand_shapes
  ) +
  guides(
    color = guide_legend(
      override.aes = list(
        size = 5,
        alpha = 1
      )
    ),
    shape = guide_legend(
      override.aes = list(
        size = 5,
        alpha = 1
      )
    )
  ) +
  labs(
    title = "PCA Plot of Process Variables by Brand Code",
    subtitle = "Rows with missing Brand Code are highlighted with red star markers",
    x = paste0("Principal Component 1 (", round(pca_variance[1], 1), "% variance)"),
    y = paste0("Principal Component 2 (", round(pca_variance[2], 1), "% variance)"),
    color = "Brand Code",
    shape = "Brand Code"
  ) +
  coord_equal() +
  theme_classic(base_size = 14) +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 15, face = "bold"),
    legend.text = element_text(size = 14),
    legend.key.size = unit(1.1, "cm"),
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

The PCA plot shows that the known brand_code groups are not randomly mixed. Brand A and Brand D form clearer separated regions, while Brand B and Brand C overlap more on the left side of the plot.

At the same time, most of the observations with missing brand_code are located inside or very close to the Brand B and Brand C regions. This is strong suggestion that rows with missing brand_code are not just isolated unusual observations. At PCA plot, great majority of them look very similar to existing Brands B and C.

Since the missing rows are clustered near known brand_code groups, KNN distance-based imputationis is my preffered approach in this case.

KNN Brand Code Imputation

In this step, I will apply KNN only to impute missing brand_code values. The numeric predictors are already imputed before this step.

Before applying KNN approach, I center and scale the numeric predictors from the training data. Otherwise without scaling, variables with larger values can dominate the distance calculations and make the KNN result less reliable.

After scaling, KNN will look for the closest rows with known brand_code and use those nearby rows to predict the missing brand_code.

In the following code validate_knn_brand() function tests several K values for KNN using rows where brand_code is already known. It splits the known-brand rows into training and validation parts, centers and scales the numeric predictors, runs KNN, and returns the accuracy table, best K value, and confusion table.

The impute_brand_knn() function then uses the best K value to impute missing Brand Code in the training, validation, and final evaluation sets.

validate_knn_brand <- function(
  train_x,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = 123
  ) {
    
    numeric_vars <- get_numeric_predictors(train_x)
    
    known_data <- train_x |>
      filter(!is.na(.data[[brand_col]]))
    
    known_data[[brand_col]] <- as.factor(known_data[[brand_col]])
    
    set.seed(project_seed)
    
    validation_index <- createDataPartition(
      known_data[[brand_col]],
      p = 0.80,
      list = FALSE
    )
    
    brand_train <- known_data[validation_index, ]
    brand_valid <- known_data[-validation_index, ]
    
    train_numeric <- brand_train |>
      select(all_of(numeric_vars))
    
    valid_numeric <- brand_valid |>
      select(all_of(numeric_vars))
    
    train_center <- map_dbl(train_numeric, mean, na.rm = TRUE)
    train_scale <- map_dbl(train_numeric, sd, na.rm = TRUE)
    
    train_scale[train_scale == 0 | is.na(train_scale)] <- 1
    
    scale_with_train <- function(data_numeric) {
      sweep(
        sweep(
          as.matrix(data_numeric),
          2,
          train_center,
          "-"
        ),
        2,
        train_scale,
        "/"
      )
    }
    
    train_scaled <- scale_with_train(train_numeric)
    valid_scaled <- scale_with_train(valid_numeric)
    
    accuracy_results <- map_dfr(
      k_values,
      function(k_value) {
        
        pred_brand <- class::knn(
          train = train_scaled,
          test = valid_scaled,
          cl = brand_train[[brand_col]],
          k = k_value
        )
        
        tibble(
          k = k_value,
          accuracy = mean(pred_brand == brand_valid[[brand_col]]),
          test_rows = nrow(brand_valid)
        )
      }
    )
    
    best_k <- accuracy_results |>
      arrange(desc(accuracy), k) |>
      slice(1) |>
      pull(k)
    
    best_pred <- class::knn(
      train = train_scaled,
      test = valid_scaled,
      cl = brand_train[[brand_col]],
      k = best_k
    )
    
    confusion_table <- tibble(
      actual_brand = brand_valid[[brand_col]],
      predicted_brand = best_pred
    ) |>
      count(actual_brand, predicted_brand, name = "n")
    
    list(
      accuracy_results = accuracy_results,
      best_k = best_k,
      confusion_table = confusion_table
    )
}

impute_brand_knn <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  k = 5
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  known_train_rows <- which(!is.na(train_x[[brand_col]]))
  
  known_train_x <- train_x[known_train_rows, ]
  known_train_y <- as.factor(known_train_x[[brand_col]])
  
  brand_levels <- levels(known_train_y)
  
  train_numeric_known <- known_train_x |>
    select(all_of(numeric_vars))
  
  train_center <- map_dbl(train_numeric_known, mean, na.rm = TRUE)
  train_scale <- map_dbl(train_numeric_known, sd, na.rm = TRUE)
  
  train_scale[train_scale == 0 | is.na(train_scale)] <- 1
  
  scale_with_train <- function(data_x) {
    
    data_numeric <- data_x |>
      select(all_of(numeric_vars))
    
    sweep(
      sweep(
        as.matrix(data_numeric),
        2,
        train_center,
        "-"
      ),
      2,
      train_scale,
      "/"
    )
  }
  
  train_scaled_all <- scale_with_train(train_x)
  valid_scaled_all <- scale_with_train(valid_x)
  eval_scaled_all <- scale_with_train(eval_x)
  
  train_scaled_known <- train_scaled_all[known_train_rows, , drop = FALSE]
  
  impute_one_dataset <- function(data_out, scaled_all) {
    
    missing_rows <- which(is.na(data_out[[brand_col]]))
    
    if (length(missing_rows) > 0) {
      
      predicted_brand <- class::knn(
        train = train_scaled_known,
        test = scaled_all[missing_rows, , drop = FALSE],
        cl = known_train_y,
        k = k
      )
      
      data_out[[brand_col]][missing_rows] <- as.character(predicted_brand)
    }
    
    data_out[[brand_col]] <- factor(
      data_out[[brand_col]],
      levels = brand_levels
    )
    
    data_out
  }
  
  train_out <- impute_one_dataset(train_out, train_scaled_all)
  valid_out <- impute_one_dataset(valid_out, valid_scaled_all)
  eval_out <- impute_one_dataset(eval_out, eval_scaled_all)
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out
  )
}

knn_brand_validation <- validate_knn_brand(
  train_x = train_features_brand_tree_numeric,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = project_seed
)

knn_brand_validation$accuracy_results |>
  kable(
    caption = "KNN Brand Code Validation Accuracy by K",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

KNN Brand Code Validation Accuracy by K
k	accuracy	test_rows
3	0.8645	391
5	0.8491	391
7	0.8593	391
9	0.8542	391
11	0.8389	391

best_knn_k <- knn_brand_validation$best_k

The KNN Brand Code validation results show that k = 3 gave the best accuracy.

The accuracy values were:

k = 3: 0.8645
k = 5: 0.8491
k = 7: 0.8593
k = 9: 0.8542
k = 11: 0.8389

This means the model predicted the correct brand_code about 86.45% of the time when k = 3, and I will use it for imputing missing brand_code values. A smaller k lets the imputation depend more on the closest nearby observations instead of averaging across too many neighbors. In my opinion this makes sense because the PCA plot showed that missing brand_code rows were located near known brand groups

knn_confusion_wide <- knn_brand_validation$confusion_table |>
  pivot_wider(
    names_from = predicted_brand,
    values_from = n,
    values_fill = 0
  ) |>
  arrange(actual_brand)

knn_confusion_wide |>
  kable(
    caption = "KNN Brand Code Validation Confusion Matrix",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

KNN Brand Code Validation Confusion Matrix
actual_brand	A	B	C	D
A	36	2	1	6
B	1	185	10	0
C	0	31	20	0
D	3	0	0	96

The confusion matrix makes the KNN brand_code predictions validation easier to understand.

The diagonal values are the correct predictions:

Brand A: 36 correctly predicted as A
Brand B: 185 correctly predicted as B
Brand C: 20 correctly predicted as C
Brand D: 96 correctly predicted as D

KNN predicted Brand B and Brand D especially well. Brand B had 185 correct predictions, with only 10 rows predicted as C and 1 row predicted as A. Brand D also performed well, with 96 correct predictions and only 3 rows predicted as A.

Brand C was the most difficult group. Out of actual Brand C rows, 20 were correctly predicted as C, but 31 were predicted as Brand B. This errors can be explained by the earlier PCA plot, where Brand B and Brand C looked more overlapped in the process-variable space, therefore KNN model struggled to correctly make predictions for Brand C.

Brand A was mostly predicted correctly, but there was some confusion with Brand D.

Overall, the confusion matrix supports using KNN to impute missing brand_code values. It performs well overall, but it also shows that Brand C is harder to separate from Brand B.

And now I will create first imputed dataset using bagged-tree model to impute missing numeric values, and KNN model to impute missing brand_code values

tree_numeric_knn_brand <- impute_brand_knn(
  train_x = train_features_brand_tree_numeric,
  valid_x = valid_features_brand_tree_numeric,
  eval_x = eval_features_brand_tree_numeric,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$train
valid_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$valid
eval_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$eval

# Keep old object names too, so later code that already uses them does not break.
train_features_capped_knn_brand <- train_features_tree_numeric_knn_brand
valid_features_capped_knn_brand <- valid_features_tree_numeric_knn_brand
eval_features_capped_knn_brand <- eval_features_tree_numeric_knn_brand

tree_numeric_knn_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  total_missing_values = c(
    sum(is.na(train_features_tree_numeric_knn_brand)),
    sum(is.na(valid_features_tree_numeric_knn_brand)),
    sum(is.na(eval_features_tree_numeric_knn_brand))
  )
)

tree_numeric_knn_missing_check |>
  kable(
    caption = "Missing Values After Tree Numeric + KNN Brand Imputation",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Missing Values After Tree Numeric + KNN Brand Imputation
dataset	total_missing_values
Training	0
Validation	0
Final Evaluation	0

rf_capped_knn_results <- run_rf_validation_test(
  train_x = train_features_tree_numeric_knn_brand,
  valid_x = valid_features_tree_numeric_knn_brand,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "Tree numeric + KNN brand",
  project_seed = project_seed
)

rf_capped_knn_results |>
  kable(
    caption = "Tree Numeric + KNN Brand Random Forest Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Tree Numeric + KNN Brand Random Forest Validation Performance
workflow	training_rows	validation_rows	best_mtry	rmse	mae	rsq	training_time_seconds
Tree numeric + KNN brand	2052	515	20	0.0962	0.0687	0.698	205.3

summarize_imputed_brand_counts <- function(
  original_train_x,
  original_valid_x,
  original_eval_x,
  imputed_train_x,
  imputed_valid_x,
  imputed_eval_x,
  brand_col = "brand_code"
) {
  
  bind_rows(
    imputed_train_x |>
      filter(is.na(original_train_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Training"),
    
    imputed_valid_x |>
      filter(is.na(original_valid_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Validation"),
    
    imputed_eval_x |>
      filter(is.na(original_eval_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Final Evaluation")
  ) |>
    rename(imputed_brand_code = .data[[brand_col]]) |>
    select(dataset, imputed_brand_code, n) |>
    pivot_wider(
      names_from = imputed_brand_code,
      values_from = n,
      values_fill = 0
    ) |>
    mutate(
      Total = rowSums(across(where(is.numeric)))
    )
}

tree_numeric_knn_brand_counts <- summarize_imputed_brand_counts(
  original_train_x = train_features_brand_tree_numeric,
  original_valid_x = valid_features_brand_tree_numeric,
  original_eval_x = eval_features_brand_tree_numeric,
  imputed_train_x = train_features_tree_numeric_knn_brand,
  imputed_valid_x = valid_features_tree_numeric_knn_brand,
  imputed_eval_x = eval_features_tree_numeric_knn_brand,
  brand_col = "brand_code"
)

tree_numeric_knn_brand_counts |>
  kable(
    caption = "Imputed Brand Code Counts: Tree Numeric + KNN Brand",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Imputed Brand Code Counts: Tree Numeric + KNN Brand
dataset	A	B	C	Total
Training	13	65	16	94
Validation	3	21	2	26
Final Evaluation	0	5	3	8

This table shows only the rows where brand_code was originally missing. It does not show the full brand_code distribution.

For the tree-based numeric imputation workflow, KNN assigned most missing Brand Code values to Brand B and Brand C.
In the training data, 65 missing rows were imputed as Brand B and 16 were imputed as Brand C. Some rows were also assigned to Brand A (13), but no missing rows were assigned to Brand D.

This matches the PCA plot reasonably well because most missing brand_code rows were located inside or close to the Brand B and Brand C clusters. The KNN result is therefore not just assigning everything to the largest brand randomly. It is using nearby rows in the completed numeric process-variable space.

Overall, this matches the PCA plot because most missing Brand Code rows were located near the Brand B and Brand C clusters.

Random Forest Numeric Imputation + KNN Brand Code Imputation

I mentioned earlier that I will test few imputations techniques. So here, I will test Random Forest model as another method for imputing missing numeric predictors.

The idea is that each numeric variable with missing values can be predicted from the other available process variables.
Random Forest can be useful here because it can capture nonlinear relationships and interactions between variables, which may exist in the production process. And it is possible it gives better results compared to bagged trees imputation method completed previously.

This process follows these steps:

start with the PH-guided outliers capped feature sets.
use Random Forest models to impute missing numeric predictors.
keep brand_code missing during the numeric imputation step.
after numeric values are completed, use KNN to impute missing brand_code.
test the final completed dataset using Random Forest validation model.

Missing values of brand_code will be handled separately with KNN, since PCA plot showed earlier that missing brand_code rows are mostly located near Brand B and Brand C clusters. So the same KNN brand_code imputation step is applied after the numeric predictors are completed.

impute_numeric_by_random_forest <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  ntree = 250,
  min_obs = 30,
  project_seed = 123
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  train_medians <- map_dbl(
    numeric_vars,
    function(var) {
      med_value <- median(train_x[[var]], na.rm = TRUE)
      
      if (is.nan(med_value)) {
        med_value <- 0
      }
      
      med_value
    }
  )
  
  names(train_medians) <- numeric_vars
  
  fill_numeric_with_train_medians <- function(data_x) {
    
    data_filled <- data_x
    
    for (var in numeric_vars) {
      data_filled[[var]][is.na(data_filled[[var]])] <- train_medians[[var]]
    }
    
    data_filled
  }
  
  train_base <- fill_numeric_with_train_medians(train_x)
  valid_base <- fill_numeric_with_train_medians(valid_x)
  eval_base <- fill_numeric_with_train_medians(eval_x)
  
  train_out[numeric_vars] <- train_base[numeric_vars]
  valid_out[numeric_vars] <- valid_base[numeric_vars]
  eval_out[numeric_vars] <- eval_base[numeric_vars]
  
  brand_levels <- train_x[[brand_col]] |>
    as.character() |>
    na.omit() |>
    unique()
  
  brand_levels <- c(sort(brand_levels), "Unknown")
  
  prepare_brand_for_dummy <- function(data_x) {
    
    data_x |>
      mutate(
        "{brand_col}" := ifelse(
          is.na(.data[[brand_col]]),
          "Unknown",
          as.character(.data[[brand_col]])
        ),
        "{brand_col}" := factor(
          .data[[brand_col]],
          levels = brand_levels
        )
      )
  }
  
  train_brand_dummy_data <- prepare_brand_for_dummy(train_x)
  
  brand_dummy_fit <- caret::dummyVars(
    formula = as.formula(paste("~", brand_col)),
    data = train_brand_dummy_data,
    fullRank = FALSE
  )
  
  make_rf_predictor_matrix <- function(data_filled, original_data, target_var) {
    
    numeric_predictors <- setdiff(numeric_vars, target_var)
    
    brand_dummy_data <- prepare_brand_for_dummy(original_data)
    
    brand_dummy_matrix <- predict(
      brand_dummy_fit,
      newdata = brand_dummy_data
    ) |>
      as.data.frame()
    
    bind_cols(
      data_filled[numeric_predictors],
      brand_dummy_matrix
    )
  }
  
  vars_to_impute <- numeric_vars[
    map_lgl(
      numeric_vars,
      function(var) {
        any(is.na(train_x[[var]])) ||
          any(is.na(valid_x[[var]])) ||
          any(is.na(eval_x[[var]]))
      }
    )
  ]
  
  imputation_log <- list()
  
  for (var in vars_to_impute) {
    
    observed_train_rows <- which(!is.na(train_x[[var]]))
    
    if (length(observed_train_rows) < min_obs) {
      
      imputation_log[[var]] <- tibble(
        variable = var,
        method = "median fallback",
        observed_training_rows = length(observed_train_rows),
        train_missing_imputed = sum(is.na(train_x[[var]])),
        valid_missing_imputed = sum(is.na(valid_x[[var]])),
        eval_missing_imputed = sum(is.na(eval_x[[var]]))
      )
      
      next
    }
    
    x_train_matrix <- make_rf_predictor_matrix(
      data_filled = train_base,
      original_data = train_x,
      target_var = var
    )
    
    x_valid_matrix <- make_rf_predictor_matrix(
      data_filled = valid_base,
      original_data = valid_x,
      target_var = var
    )
    
    x_eval_matrix <- make_rf_predictor_matrix(
      data_filled = eval_base,
      original_data = eval_x,
      target_var = var
    )
    
    set.seed(project_seed)
    
    rf_imp_model <- randomForest::randomForest(
      x = x_train_matrix[observed_train_rows, ],
      y = train_x[[var]][observed_train_rows],
      ntree = ntree
    )
    
    train_missing_rows <- which(is.na(train_x[[var]]))
    valid_missing_rows <- which(is.na(valid_x[[var]]))
    eval_missing_rows <- which(is.na(eval_x[[var]]))
    
    if (length(train_missing_rows) > 0) {
      train_out[[var]][train_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_train_matrix[train_missing_rows, ]
      )
    }
    
    if (length(valid_missing_rows) > 0) {
      valid_out[[var]][valid_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_valid_matrix[valid_missing_rows, ]
      )
    }
    
    if (length(eval_missing_rows) > 0) {
      eval_out[[var]][eval_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_eval_matrix[eval_missing_rows, ]
      )
    }
    
    imputation_log[[var]] <- tibble(
      variable = var,
      method = "random forest",
      observed_training_rows = length(observed_train_rows),
      train_missing_imputed = length(train_missing_rows),
      valid_missing_imputed = length(valid_missing_rows),
      eval_missing_imputed = length(eval_missing_rows)
    )
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars,
    imputation_log = bind_rows(imputation_log)
  )
}

rf_numeric_imputed_temp <- impute_numeric_by_random_forest(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  ntree = 250,
  min_obs = 30,
  project_seed = project_seed
)

train_features_rf_numeric_temp <- rf_numeric_imputed_temp$train
valid_features_rf_numeric_temp <- rf_numeric_imputed_temp$valid
eval_features_rf_numeric_temp <- rf_numeric_imputed_temp$eval

rf_numeric_imputed_temp$imputation_log |>
  arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
  kable(
    caption = "Random Forest Numeric Imputation Log",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

Random Forest Numeric Imputation Log
variable	method	observed_training_rows	train_missing_imputed	valid_missing_imputed	eval_missing_imputed
mfr	random forest	1879	173	35	31
filler_speed	random forest	2009	43	11	10
fill_ounces	random forest	2022	30	8	6
psc_co2	random forest	2016	36	3	5
pc_volume	random forest	2020	32	7	4
psc	random forest	2025	27	6	5
carb_pressure1	random forest	2027	25	7	4
hyd_pressure4	random forest	2028	24	4	4
carb_pressure	random forest	2031	21	6	0
carb_temp	random forest	2035	17	9	1
psc_fill	random forest	2031	21	2	3
fill_pressure	random forest	2039	13	5	2
filler_level	random forest	2038	14	2	2
hyd_pressure2	random forest	2041	11	4	1
hyd_pressure3	random forest	2041	11	4	1
temperature	random forest	2042	10	2	2
oxygen_filler	random forest	2043	9	2	3
pressure_setpoint	random forest	2043	9	3	2
carb_volume	random forest	2044	8	2	1
hyd_pressure1	random forest	2044	8	3	0
alch_rel	random forest	2046	6	1	3
carb_rel	random forest	2044	8	0	2
usage_cont	random forest	2048	4	1	2
bowl_setpoint	random forest	2052	0	2	1
carb_flow	random forest	2050	2	0	0
density	random forest	2052	0	0	1
balling	random forest	2052	0	0	1
pressure_vacuum	random forest	2052	0	0	1
air_pressurer	random forest	2052	0	0	1
balling_lvl	random forest	2051	1	0	0

rf_numeric_knn_brand_temp <- impute_brand_knn(
  train_x = train_features_rf_numeric_temp,
  valid_x = valid_features_rf_numeric_temp,
  eval_x = eval_features_rf_numeric_temp,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$train
valid_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$valid
eval_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$eval

rf_numeric_knn_results_temp <- run_rf_validation_test(
  train_x = train_features_rf_numeric_knn_brand_temp,
  valid_x = valid_features_rf_numeric_knn_brand_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "RF numeric + KNN brand",
  project_seed = project_seed
)

# rf_numeric_knn_results_temp |>
#   kable(
#     caption = "RF Numeric + KNN Brand Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

Now we have all missing values imputed, and let’s try another imputation technique.

XGBoost Numeric Imputation + KNN Brand Code Imputation

Here I will use XGBoost model as another method for predicting and imputing missing numeric predictors.

The idea is pretty much same as previous Bagging Trees and Random Forest numeric imputation workflows. Each numeric variable with missing values is treated as a temporary prediction problem, where the other process variables are used as predictors. XGBoost can be useful here because it can model nonlinear patterns and interactions, and it often performs well on structured tabular data.

The steps are:

start with the PH-guided capped feature sets.
use XGBoost model to impute missing numeric predictors.
keep brand_code missing during the numeric imputation step.
after numeric values are completed, will use KNN to impute missing brand_code.
test the final completed dataset using the same Random Forest validation model.

Missing values of brand_code are still handled separately with KNN model. So after XGBoost completes the numeric predictors, KNN is used to assign brand_code based on nearby observations just like in previous methods - centering and scaling applied in KNN training process.

I’m using this imputations technique to see if XGBoost numeric imputation creates a better completed dataset than the Bagging Tree based and Random Forest numeric imputation methods.

impute_numeric_by_xgboost <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  nrounds = 100,
  max_depth = 3,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_obs = 30,
  project_seed = 123
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  train_medians <- map_dbl(
    numeric_vars,
    function(var) {
      med_value <- median(train_x[[var]], na.rm = TRUE)
      
      if (is.nan(med_value)) {
        med_value <- 0
      }
      
      med_value
    }
  )
  
  names(train_medians) <- numeric_vars
  
  for (var in numeric_vars) {
    train_out[[var]][is.na(train_out[[var]])] <- train_medians[[var]]
    valid_out[[var]][is.na(valid_out[[var]])] <- train_medians[[var]]
    eval_out[[var]][is.na(eval_out[[var]])] <- train_medians[[var]]
  }
  
  brand_levels <- train_x[[brand_col]] |>
    as.character() |>
    na.omit() |>
    unique()
  
  brand_levels <- c(sort(brand_levels), "Unknown")
  
  prepare_brand_for_dummy <- function(data_x) {
    
    data_x |>
      mutate(
        "{brand_col}" := ifelse(
          is.na(.data[[brand_col]]),
          "Unknown",
          as.character(.data[[brand_col]])
        ),
        "{brand_col}" := factor(
          .data[[brand_col]],
          levels = brand_levels
        )
      )
  }
  
  brand_dummy_fit <- caret::dummyVars(
    formula = as.formula(paste("~", brand_col)),
    data = prepare_brand_for_dummy(train_x),
    fullRank = FALSE
  )
  
  make_xgb_matrix <- function(data_x, target_var) {
    
    numeric_predictors <- setdiff(numeric_vars, target_var)
    
    brand_dummy_matrix <- predict(
      brand_dummy_fit,
      newdata = prepare_brand_for_dummy(data_x)
    ) |>
      as.data.frame()
    
    predictor_data <- bind_cols(
      data_x[numeric_predictors],
      brand_dummy_matrix
    )
    
    as.matrix(predictor_data)
  }
  
  vars_to_impute <- numeric_vars[
    map_lgl(
      numeric_vars,
      function(var) {
        any(is.na(train_x[[var]])) ||
          any(is.na(valid_x[[var]])) ||
          any(is.na(eval_x[[var]]))
      }
    )
  ]
  
  imputation_log <- list()
  
  for (var in vars_to_impute) {
    
    observed_train_rows <- which(!is.na(train_x[[var]]))
    
    train_missing_rows <- which(is.na(train_x[[var]]))
    valid_missing_rows <- which(is.na(valid_x[[var]]))
    eval_missing_rows <- which(is.na(eval_x[[var]]))
    
    if (length(observed_train_rows) < min_obs) {
      
      imputation_log[[var]] <- tibble(
        variable = var,
        method = "median fallback",
        observed_training_rows = length(observed_train_rows),
        train_missing_imputed = length(train_missing_rows),
        valid_missing_imputed = length(valid_missing_rows),
        eval_missing_imputed = length(eval_missing_rows)
      )
      
      next
    }
    
    x_train_matrix <- make_xgb_matrix(train_x, target_var = var)
    x_valid_matrix <- make_xgb_matrix(valid_x, target_var = var)
    x_eval_matrix <- make_xgb_matrix(eval_x, target_var = var)
    
    dtrain <- xgboost::xgb.DMatrix(
      data = x_train_matrix[observed_train_rows, , drop = FALSE],
      label = train_x[[var]][observed_train_rows],
      missing = NA
    )
    
    params <- list(
      objective = "reg:squarederror",
      eval_metric = "rmse",
      max_depth = max_depth,
      eta = eta,
      subsample = subsample,
      colsample_bytree = colsample_bytree
    )
    
    set.seed(project_seed)
    
    xgb_imp_model <- xgboost::xgb.train(
      params = params,
      data = dtrain,
      nrounds = nrounds,
      verbose = 0
    )
    
    if (length(train_missing_rows) > 0) {
      train_out[[var]][train_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_train_matrix[train_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    if (length(valid_missing_rows) > 0) {
      valid_out[[var]][valid_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_valid_matrix[valid_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    if (length(eval_missing_rows) > 0) {
      eval_out[[var]][eval_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_eval_matrix[eval_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    imputation_log[[var]] <- tibble(
      variable = var,
      method = "xgboost",
      observed_training_rows = length(observed_train_rows),
      train_missing_imputed = length(train_missing_rows),
      valid_missing_imputed = length(valid_missing_rows),
      eval_missing_imputed = length(eval_missing_rows)
    )
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars,
    imputation_log = bind_rows(imputation_log)
  )
}

xgb_numeric_imputed_temp <- impute_numeric_by_xgboost(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  nrounds = 100,
  max_depth = 3,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_obs = 30,
  project_seed = project_seed
)

train_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$train
valid_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$valid
eval_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$eval

xgb_numeric_imputed_temp$imputation_log |>
  arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
  kable(
    caption = "XGBoost Numeric Imputation Log",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )

XGBoost Numeric Imputation Log
variable	method	observed_training_rows	train_missing_imputed	valid_missing_imputed	eval_missing_imputed
mfr	xgboost	1879	173	35	31
filler_speed	xgboost	2009	43	11	10
fill_ounces	xgboost	2022	30	8	6
psc_co2	xgboost	2016	36	3	5
pc_volume	xgboost	2020	32	7	4
psc	xgboost	2025	27	6	5
carb_pressure1	xgboost	2027	25	7	4
hyd_pressure4	xgboost	2028	24	4	4
carb_pressure	xgboost	2031	21	6	0
carb_temp	xgboost	2035	17	9	1
psc_fill	xgboost	2031	21	2	3
fill_pressure	xgboost	2039	13	5	2
filler_level	xgboost	2038	14	2	2
hyd_pressure2	xgboost	2041	11	4	1
hyd_pressure3	xgboost	2041	11	4	1
temperature	xgboost	2042	10	2	2
oxygen_filler	xgboost	2043	9	2	3
pressure_setpoint	xgboost	2043	9	3	2
carb_volume	xgboost	2044	8	2	1
hyd_pressure1	xgboost	2044	8	3	0
alch_rel	xgboost	2046	6	1	3
carb_rel	xgboost	2044	8	0	2
usage_cont	xgboost	2048	4	1	2
bowl_setpoint	xgboost	2052	0	2	1
carb_flow	xgboost	2050	2	0	0
density	xgboost	2052	0	0	1
balling	xgboost	2052	0	0	1
pressure_vacuum	xgboost	2052	0	0	1
air_pressurer	xgboost	2052	0	0	1
balling_lvl	xgboost	2051	1	0	0

The XGBoost numeric imputation log shows which variables had missing values filled in the training, validation, and final evaluation datasets.

Each variable with missing numeric values was imputed using an XGBoost model trained from the available training data. PH was not used during this process, so the imputation step does not create target leakage.

The largest number of missing values was for mfr, followed by variables such as filler_speed, fill_ounces, psc_co2, and pc_volume.

I included this table as a check that the imputation process was applied consistently across all three datasets before modeling.

xgb_numeric_knn_brand_temp <- impute_brand_knn(
  train_x = train_features_xgb_numeric_temp,
  valid_x = valid_features_xgb_numeric_temp,
  eval_x = eval_features_xgb_numeric_temp,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$train
valid_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$valid
eval_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$eval

xgb_numeric_knn_rf_results_temp <- run_rf_validation_test(
  train_x = train_features_xgb_numeric_knn_brand_temp,
  valid_x = valid_features_xgb_numeric_knn_brand_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "XGBoost numeric + KNN brand",
  project_seed = project_seed
)

# xgb_numeric_knn_rf_results_temp |>
#   kable(
#     caption = "XGBoost Numeric + KNN Brand Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

xgb_knn_imputed_brand_counts <- bind_rows(
  train_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(train_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Training"),
  
  valid_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(valid_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Validation"),
  
  eval_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(eval_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Final Evaluation")
) |>
  rename(imputed_brand_code = brand_code) |>
  pivot_wider(
    names_from = imputed_brand_code,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(
    Total = rowSums(across(where(is.numeric)))
  ) |>
  select(dataset, everything())

xgb_knn_imputed_brand_counts |>
  kable(
    caption = "Imputed Brand Code Counts After XGBoost Numeric Imputation + KNN",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Imputed Brand Code Counts After XGBoost Numeric Imputation + KNN
dataset	A	B	C	Total
Training	7	64	23	94
Validation	1	20	5	26
Final Evaluation	0	6	2	8

xgb_knn_brand_validation <- validate_knn_brand(
  train_x = train_features_xgb_numeric_temp,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = project_seed
)

xgb_knn_brand_validation$confusion_table |>
  pivot_wider(
    names_from = predicted_brand,
    values_from = n,
    values_fill = 0
  ) |>
  arrange(actual_brand) |>
  kable(
    caption = "KNN Brand Code Validation Confusion Matrix Using XGBoost-Imputed Numeric Predictors",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

KNN Brand Code Validation Confusion Matrix Using XGBoost-Imputed Numeric Predictors
actual_brand	A	B	C	D
A	36	2	1	6
B	1	185	10	0
C	1	31	19	0
D	2	0	1	96

Compared with the earlier KNN brand_code confusion matrix, the results using XGBoost-imputed numeric predictors are almost the same.

The previous confusion matrix had:

Brand A: 36 correctly predicted as A
Brand B: 185 correctly predicted as B
Brand C: 20 correctly predicted as C
Brand D: 96 correctly predicted as D

The XGBoost-imputed version had:

Brand A: 36 correctly predicted as A
Brand B: 185 correctly predicted as B
Brand C: 19 correctly predicted as C
Brand D: 96 correctly predicted as D

So the main difference is very small:

Brand A stayed the same.
Brand B stayed the same.
Brand D stayed the same.
Brand C decreased slightly from 20 correct predictions to 19 correct predictions.

As we see, Brand C rows are still the hardest to classify. In the XGBoost-imputed version, 31 actual Brand C rows were predicted as Brand B, which is the same pattern as before. This supports the earlier PCA result, where Brand B and Brand C were more overlapped in the numeric process-variable space.

Overall, the XGBoost numeric imputation did not meaningfully change the KNN brand_codevalidation pattern. The KNN model still performs well for Brands A, B, and D, but Brand C remains harder to separate from Brand B.

Imputation of all NA’s with `missForest` package

In this part, I test missForest as another imputation method.

This method is different from other workflows because it does not separate numeric imputation and brand_code imputation into two steps as I used earlier. Instead, missForest imputes numeric predictors and brand_code together in one process.

The steps are:

start with the PH-guided capped feature sets.
combine the training, validation, and final evaluation feature sets for the imputation step.
use missForest to impute both numeric predictors and missing brand_code.
split the imputed data back into training, validation, and final evaluation sets.
test the completed dataset using the same Random Forest validation model as before.

I included this imputation process as a comparison method because missForest can handle mixed numeric and categorical predictors. However, it is less connected to my PCA and KNN brand_code imputations logic, because it does not use the separate process of completing numeric values first and then assigning brand_code based on nearby observations.

prepare_missforest_data <- function(train_x, valid_x, eval_x, brand_col = "brand_code") {
  
  train_temp <- train_x |>
    mutate(.dataset = "Training", .row_id = row_number())
  
  valid_temp <- valid_x |>
    mutate(.dataset = "Validation", .row_id = row_number())
  
  eval_temp <- eval_x |>
    mutate(.dataset = "Final Evaluation", .row_id = row_number())
  
  combined_data <- bind_rows(
    train_temp,
    valid_temp,
    eval_temp
  )
  
  combined_data <- combined_data |>
    mutate(
      "{brand_col}" := as.factor(.data[[brand_col]])
    )
  
  list(
    combined_data = combined_data,
    id_cols = c(".dataset", ".row_id")
  )
}

missforest_input <- prepare_missforest_data(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code"
)

combined_for_missforest <- missforest_input$combined_data
id_cols <- missforest_input$id_cols

combined_ids <- combined_for_missforest |>
  select(all_of(id_cols))

combined_features_for_missforest <- combined_for_missforest |>
  select(-all_of(id_cols)) |>
  as.data.frame()

combined_features_for_missforest <- combined_features_for_missforest |>
  mutate(
    across(where(is.character), as.factor),
    across(where(is.logical), as.factor)
  )

combined_features_for_missforest$brand_code <- as.factor(
  combined_features_for_missforest$brand_code
)

combined_features_for_missforest <- combined_features_for_missforest |>
  mutate(
    across(
      where(~ inherits(.x, "Date") || inherits(.x, "POSIXct") || inherits(.x, "POSIXlt")),
      as.numeric
    )
  )

all_missing_cols <- names(combined_features_for_missforest)[
  map_lgl(combined_features_for_missforest, ~ all(is.na(.x)))
]

combined_features_for_missforest <- combined_features_for_missforest |>
  select(-any_of(all_missing_cols))

constant_cols <- names(combined_features_for_missforest)[
  map_lgl(
    combined_features_for_missforest,
    ~ length(unique(na.omit(.x))) <= 1
  )
]

combined_features_for_missforest <- combined_features_for_missforest |>
  select(-any_of(constant_cols))

set.seed(project_seed)

missforest_time_temp <- system.time({
  
  missforest_result_temp <- missForest(
    xmis = combined_features_for_missforest,
    maxiter = 3,
    ntree = 50,
    variablewise = TRUE,
    verbose = TRUE
  )
  
})

  missForest iteration 1 in progress...done!
    estimated error(s): 0.02549889 0.002423574 0.006228133 0.002000095 2.476716 3.270997 0.002225144 0.01265173 0.001775185 0 11.25729 3.633197 21.6205 11.23396 7.67219 52.86191 26.78502 66429.97 0.9822 4.856411 174607.9 0.002789376 645.1303 0.00792792 0.07094996 0.0007109196 18.60701 0.7337317 0.6809167 0.01107182 0.00279986 0.01465174 
    difference(s): 0.001248508 0.03246295 
    time: 1.44 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.0225425 0.002431696 0.006274587 0.002011106 2.344886 3.298449 0.002247812 0.01253804 0.001766746 0 11.14045 3.696281 22.65158 10.87017 7.500838 51.82271 27.53997 30159.45 0.9322445 4.845559 191381.8 0.002579745 718.2835 0.007228232 0.0708929 0.0006863428 16.51772 0.7274178 0.6579717 0.01043707 0.002802485 0.01902887 
    difference(s): 0.0001388316 0.00952717 
    time: 1.39 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.02143385 0.002448337 0.006218199 0.002009058 2.13478 3.319124 0.002221153 0.01261356 0.001786276 0 11.25951 3.73122 21.58347 11.16358 7.748629 51.69491 28.4234 33827.62 0.9165848 4.869446 184279.5 0.002834121 655.3057 0.007171282 0.06827684 0.0007088335 15.52839 0.7407598 0.6634867 0.01034152 0.002816703 0.0200716 
    difference(s): 0.00002911975 0.007410021 
    time: 1.41 seconds

missforest_time_temp

   user  system elapsed 
   4.89    2.50    4.24

missforest_result_temp$OOBerror

              PFC               MSE               MSE               MSE 
     0.0214338507      0.0024483373      0.0062181991      0.0020090582 
              MSE               MSE               MSE               MSE 
     2.1347803332      3.3191239656      0.0022211529      0.0126135631 
              MSE               MSE               MSE               MSE 
     0.0017862764      0.0000000000     11.2595069395      3.7312199962 
              MSE               MSE               MSE               MSE 
    21.5834697187     11.1635822029      7.7486286628     51.6949086916 
              MSE               MSE               MSE               MSE 
    28.4234033967  33827.6197125108      0.9165848314      4.8694455645 
              MSE               MSE               MSE               MSE 
184279.4672627342      0.0028341205    655.3057387789      0.0071712824 
              MSE               MSE               MSE               MSE 
     0.0682768382      0.0007088335     15.5283864592      0.7407597868 
              MSE               MSE               MSE               MSE 
     0.6634867181      0.0103415237      0.0028167026      0.0200716008

combined_missforest_imputed <- missforest_result_temp$ximp |>
  bind_cols(combined_ids)

train_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Training") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))

valid_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Validation") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))

eval_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Final Evaluation") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))

missforest_missing_check_temp <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_missforest_temp),
    nrow(valid_features_missforest_temp),
    nrow(eval_features_missforest_temp)
  ),
  columns = c(
    ncol(train_features_missforest_temp),
    ncol(valid_features_missforest_temp),
    ncol(eval_features_missforest_temp)
  ),
  total_missing_values = c(
    sum(is.na(train_features_missforest_temp)),
    sum(is.na(valid_features_missforest_temp)),
    sum(is.na(eval_features_missforest_temp))
  )
)

# missforest_missing_check_temp |>
#   kable(
#     caption = "Missing Values After missForest Imputation",
#     digits = 0
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

missforest_rf_results_temp <- run_rf_validation_test(
  train_x = train_features_missforest_temp,
  valid_x = valid_features_missforest_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "missForest imputation",
  project_seed = project_seed
)

# missforest_rf_results_temp |>
#   kable(
#     caption = "missForest Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

Complete-Case Benchmark Dataset

Before fully relying on imputations, I will also test a complete-case dataset with Random Forest model. In this part, I do not impute missing predictor values. Instead, I simply remove rows that have any missing predictor values to create complete-case daaset.

This gives a useful benchmark because the model is trained only on fully observed records. However, it also removes many observations with missing values, so it may not be fully practical for the final evaluation data where missing predictors also exist.

I will use this as a comparison point, not as the final preprocessing workflow.

# Complete-case feature sets
complete_train_rows <- complete.cases(train_features_for_imputation)
complete_valid_rows <- complete.cases(valid_features_for_imputation)

train_features_complete_case <- train_features_for_imputation[complete_train_rows, ]
valid_features_complete_case <- valid_features_for_imputation[complete_valid_rows, ]

train_ph_complete_case <- train_ph[complete_train_rows]
valid_ph_complete_case <- valid_ph[complete_valid_rows]

complete_case_row_check <- tibble(
  dataset = c("Training", "Validation"),
  original_rows = c(
    nrow(train_features_for_imputation),
    nrow(valid_features_for_imputation)
  ),
  complete_case_rows = c(
    nrow(train_features_complete_case),
    nrow(valid_features_complete_case)
  ),
  removed_rows = original_rows - complete_case_rows
)

complete_case_row_check |>
  kable(
    caption = "Rows Kept and Removed for Complete-Case Random Forest",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Rows Kept and Removed for Complete-Case Random Forest
dataset	original_rows	complete_case_rows	removed_rows
Training	2052	1625	427
Validation	515	413	102

complete_case_rf_results <- run_rf_validation_test(
  train_x = train_features_complete_case,
  valid_x = valid_features_complete_case,
  train_y = train_ph_complete_case,
  valid_y = valid_ph_complete_case,
  workflow_name = "Complete-case random forest",
  project_seed = project_seed
)

complete_case_rf_results |>
  kable(
    caption = "Complete-Case Random Forest Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Complete-Case Random Forest Validation Performance
workflow	training_rows	validation_rows	best_mtry	rmse	mae	rsq	training_time_seconds
Complete-case random forest	1625	413	15	0.092	0.0659	0.7309	136.01

Imputation Workflow Comparison Using Random Forest

After creating the different datasets with imputed NA values, I compare them using the same Random Forest validation test.

My main goal is not just to see which method fills missing values. I want to see which imputation workflow creates the most useful dataset for predicting PH.

Each workflow has been tested with the same general Random Forest setup, same training PH target, and same validation PH target. This makes the comparison more practical because the final decision is based on downstream model performance, not only on the imputation method itself.

I also included complete-case dataset. This model does not use imputation and only kept rows with no missing predictor values. It can be useful as a baseline, but it uses less rows, so it is not exactly the same situation as the imputed workflows.

imputation_workflow_comparison <- list(
  if (exists("complete_case_rf_results")) complete_case_rf_results else NULL,
  if (exists("xgb_numeric_knn_rf_results_temp")) xgb_numeric_knn_rf_results_temp else NULL,
  if (exists("rf_numeric_knn_results_temp")) rf_numeric_knn_results_temp else NULL,
  if (exists("rf_capped_knn_results")) rf_capped_knn_results else NULL,
  if (exists("missforest_rf_results_temp")) missforest_rf_results_temp else NULL
) |>
  purrr::compact() |>
  bind_rows()

imputation_workflow_comparison |>
  select(
    workflow,
    training_rows,
    validation_rows,
    best_mtry,
    rmse,
    mae,
    rsq,
    training_time_seconds
  ) |>
  arrange(rmse) |>
  kable(
    caption = "Random Forest Comparison: Complete-Case and Imputed Workflows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Random Forest Comparison: Complete-Case and Imputed Workflows
workflow	training_rows	validation_rows	best_mtry	rmse	mae	rsq	training_time_seconds
Complete-case random forest	1625	413	15	0.0920	0.0659	0.7309	136.01
XGBoost numeric + KNN brand	2052	515	20	0.0958	0.0682	0.7013	192.01
Tree numeric + KNN brand	2052	515	20	0.0962	0.0687	0.6980	205.30
missForest imputation	2052	515	20	0.0966	0.0694	0.6947	189.60
RF numeric + KNN brand	2052	515	20	0.0967	0.0692	0.6944	189.14

The complete-case dataset has the best validation performance, but it uses fewer rows because records with missing predictors are removed.
This makes it useful as a benchmark, but I wouldn’t use it for following prediction models, because the final evaluation data also contains missing predictor values. And, also I would like to use all given observations.

Among the imputed workflows, Random Forest test shows us that “XGBoost numeric imputation + KNN brand_code imputation” technique performed best.
It has:

the lowest RMSE = 0.0958,
and highest \(R^2\) = 0.7013 among imputed datasets.

The other three workflows results are very close to each other, but none of themis better than XGBoost numeric + KNN Brand Code imputation method.

Based on these results, I select datasets of XGBoost numeric + KNN Brand Code imputation method as the main training/validation and evaluation sets to use for modeling and final evaluation predictions.

Final Completed Feature Sets for Modeling

train_features_final_imputed <- train_features_xgb_numeric_knn_brand_temp
valid_features_final_imputed <- valid_features_xgb_numeric_knn_brand_temp
eval_features_final_imputed <- eval_features_xgb_numeric_knn_brand_temp

final_imputed_set_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_final_imputed),
    nrow(valid_features_final_imputed),
    nrow(eval_features_final_imputed)
  ),
  columns = c(
    ncol(train_features_final_imputed),
    ncol(valid_features_final_imputed),
    ncol(eval_features_final_imputed)
  ),
  total_missing_values = c(
    sum(is.na(train_features_final_imputed)),
    sum(is.na(valid_features_final_imputed)),
    sum(is.na(eval_features_final_imputed))
  )
)

final_imputed_set_check |>
  kable(
    caption = "Final Completed Feature Sets Used for Modeling",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Final Completed Feature Sets Used for Modeling
dataset	rows	columns
Training	2052	32
Validation	515	32
Final Evaluation	267	32

Final Preprocessing Decisions by Model Family

The one single preprocessing approach may not be best for every model family. Some models are sensitive to scale, skewness, and correlated predictors. Other models, especially tree-based models, can handle nonlinear splits and interactions more naturally.

Because of this, I would not force the exact same transformed features set into every model. Instead, I will make preprocessing decisions based on how each model family works.

`mnfflow` Variable Review

The mnf_flow variable earlier needed extra review because almost 50% of varible values are negative.

mnf_flow_negative_values |>
  kable(
    caption = "Negative mnfflow Values in Training Data",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Negative mnfflow Values in Training Data
mnf_flow	n
-100.0	487
-100.2	454

All negative values are exactly -100 or -100.2, so at first they looked like possible coded process states rather than regular continuous flow measurements.

However, the earlier PH comparison showed that these negative mnf_flow rows behave differently from non-negative rows. For example, Brand B had mean PH of 8.6700 when mnf_flow was negative, compared with 8.4716 when mnf_flow was non-negative.

The process-variable comparison also showed differences. For Brand B, mean oxygen_filler was 0.0713 for negative mnf_flow rows, compared with 0.0255 for non-negative rows.

Because of this, I do not treat negative mnf_flow values as simple errors. Even though -100 and -100.2 look like special coded values, they appear to contain useful process information.

Testing `mnfflow` Representation for Linear Models

The mnf_flow review showed that this variable may not behave like one regular continuous predictor. This matters especially for linear models because linear models use coefficients.

If mnf_flow is kept as one numeric variable, the model has to use one linear relationship for both:

negative coded values such as -100 and -100.2
positive flow values that behave like regular continuous measurements

Because of this, I test two simple linear regression versions before making the final linear-model preprocessing decision:

Original mnf_flow: keep mnf_flow as one numeric predictor.
State-based mnf_flow: treat negative values as categorical process states and keep positive flow as a separate continuous variable. Basically I split mnf_flow into two separate variables: one catecorical for negative values (-100and-100.2), and another numerical variable for the rest of values.

This is only a quick preprocessing diagnostic. I am not applying the full final linear-model preprocessing steps yet. The purpose is only to decide whether mnf_flow should be represented differently for linear and regularized models.

add_mnf_flow_state_features <- function(data_x) {
  
  data_x |>
    mutate(
      mnf_flow_state = case_when(
        is.na(mnf_flow) ~ "Missing",
        mnf_flow == -100 ~ "Negative_100",
        mnf_flow == -100.2 ~ "Negative_100_2",
        mnf_flow < 0 ~ "Other_negative",
        mnf_flow >= 0 ~ "Non_negative"
      ),
      
      mnf_flow_state = factor(
        mnf_flow_state,
        levels = c(
          "Non_negative",
          "Negative_100",
          "Negative_100_2",
          "Other_negative",
          "Missing"
        )
      ),
      
      mnf_flow_positive_value = ifelse(
        !is.na(mnf_flow) & mnf_flow >= 0,
        mnf_flow,
        0
      )
    )
}

train_features_lm_original <- train_features_final_imputed
valid_features_lm_original <- valid_features_final_imputed

train_features_lm_state <- train_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

valid_features_lm_state <- valid_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

train_lm_original_data <- train_features_lm_original |>
  mutate(ph = train_ph) |>
  filter(!is.na(ph))

valid_lm_original_data <- valid_features_lm_original |>
  mutate(ph = valid_ph) |>
  filter(!is.na(ph))

train_lm_state_data <- train_features_lm_state |>
  mutate(ph = train_ph) |>
  filter(!is.na(ph))

valid_lm_state_data <- valid_features_lm_state |>
  mutate(ph = valid_ph) |>
  filter(!is.na(ph))

mnf_flow_transformation_example <- bind_rows(
  train_features_lm_original |>
    filter(mnf_flow < 0) |>
    distinct(mnf_flow) |>
    slice_head(n = 2),
  
  train_features_lm_original |>
    filter(mnf_flow >= 0) |>
    distinct(mnf_flow) |>
    slice_head(n = 5)
) |>
  mutate(
    mnf_flow_state = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow == -100 ~ "Negative_100",
      mnf_flow == -100.2 ~ "Negative_100_2",
      mnf_flow < 0 ~ "Other_negative",
      mnf_flow >= 0 ~ "Non_negative"
    ),
    mnf_flow_positive_value = ifelse(
      !is.na(mnf_flow) & mnf_flow >= 0,
      mnf_flow,
      0
    )
  ) |>
  arrange(mnf_flow)

mnf_flow_transformation_example |>
  kable(
    caption = "Example of mnfflow State-Based Transformation",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Example of mnfflow State-Based Transformation
mnf_flow	mnf_flow_state	mnf_flow_positive_value
-100.2	Negative_100_2	0.0
-100.0	Negative_100	0.0
0.2	Non_negative	0.2
161.4	Non_negative	161.4
162.6	Non_negative	162.6
164.4	Non_negative	164.4
167.4	Non_negative	167.4

This table shows how the original mnf_flow variable is changed for the linear-model test. Negative values such as -100 and -100.2 are treated as categorical states, while positive values are kept as a separate numeric variable.

For example, when mnf_flow = -100.2, the new mnf_flow_state is Negative_100_2, and mnf_flow_positive_value is set to 0. When mnf_flow is positive, the state is Non_negative, and the positive value is kept as the continuous flow measurement.

Now let’s see before and after split results with simple linear regression

lm_original_model <- lm(
  ph ~ .,
  data = train_lm_original_data
)

lm_state_model <- lm(
  ph ~ .,
  data = train_lm_state_data
)

lm_original_pred <- predict(
  lm_original_model,
  newdata = valid_lm_original_data
)

lm_state_pred <- predict(
  lm_state_model,
  newdata = valid_lm_state_data
)

lm_mnf_flow_comparison <- bind_rows(
  calculate_regression_metrics(
    actual = valid_lm_original_data$ph,
    predicted = lm_original_pred
  ) |>
    mutate(
      workflow = "Linear model with original mnfflow",
      training_rows = nrow(train_lm_original_data),
      validation_rows = nrow(valid_lm_original_data)
    ),
  
  calculate_regression_metrics(
    actual = valid_lm_state_data$ph,
    predicted = lm_state_pred
  ) |>
    mutate(
      workflow = "Linear model with mnfflow state + positive value",
      training_rows = nrow(train_lm_state_data),
      validation_rows = nrow(valid_lm_state_data)
    )
) |>
  select(
    workflow,
    training_rows,
    validation_rows,
    rmse,
    mae,
    rsq
  ) |>
  arrange(rmse)

lm_mnf_flow_comparison |>
  kable(
    caption = "Simple Linear Model Comparison of mnfflow Representations",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Simple Linear Model Comparison of mnfflow Representations
workflow	training_rows	validation_rows	rmse	mae	rsq
Linear model with mnfflow state + positive value	2052	515	0.1280	0.0990	0.4448
Linear model with original mnfflow	2052	515	0.1311	0.1011	0.4173

The simple linear model comparison supports using the state-based mnf_flow split for linear models.

The model with mnf_flow_state + mnf_flow_positive_value performed better than the model with the original mnf_flow variable:

RMSE improved from 0.1311 to 0.1280,
MAE improved from 0.1011 to 0.0990,
and \(R^2\) improved from 0.4173 to 0.4448.

This supports my idea that mnf_flow should not always be treated as one regular continuous predictor. For linear models, one coefficient may be too restrictive because negative mnf_flow values appear to represent a coded process state, while positive values represent regular flow magnitude.

Because of this, I will use this mnf_flow_state + mnf_flow_positive_value representation dataset for linear and regularized model families.

Final Model-Family Preprocessing Plan

Based on the negative-value review, the mnf_flow behavior check, and the simple linear regression comparison, I use different preprocessing strategies for different model families.

Linear and Regularized Models

For linear and regularized models, I will use the mnf_flow_state + mnf_flow_positive_value representation. These models use coefficients, so separating the negative coded states from positive flow magnitude gives the model more flexibility.

The linear and regularized model preprocessing will include:

near-zero variance predictor removal
dummy encoding for categorical variables
selective Yeo-Johnson transformation for highly skewed continuous numeric variables
centering and scaling of numeric predictors
high-correlation review or filtering if needed

I wouldn’t apply Yeo-Johnson transformation to all numeric variables. It will be only considered for continuous numeric predictors with high skewness, and will not be applied to categorical variables, dummy variables, or variables that behave more like process settings or coded states.

This will keep the transformation more careful because not every numeric-looking variable should be treated as a regular continuous predictor.

Nonlinear Non-Tree Models

For nonlinear non-tree models, such as SVM with radial basis kernel or other distance/kernel-based methods, I will treat the mnf_flow representation as a comparison question.

These models may benefit from the state-based mnf_flow representation because they are sensitive to feature geometry, scale, and distance. However, I will not assume this automatically.

For nonlinear non-tree models, I may compare:

original final imputed feature set
final imputed feature set with mnf_flow_state + mnf_flow_positive_value

These models usually require centering and scaling. Selective Yeo-Johnson transformation may also be useful for highly skewed continuous predictors.

Tree-Based Models

For tree-based models, I will use the best final imputed feature set with the original mnf_flow variable.

This is because tree-based models can naturally split on rules like mnf_flow < 0. Because of that, they do not need the same mnf_flow state representation that linear models need.

Tree-based models also do not require centering and scaling, and I will not apply Yeo-Johnson transformation by default. These models can handle nonlinear splits, interactions, and skewed predictors.

I did side Random Forest test (not included here), and mnf_flow_state + mnf_flow_positive_valuerepresentation had lower accuracy performance compared with the original mnf_flow version. So, for Random Forest and other tree-based models, I will keep using the original final imputed feature set.

Data Leakage Control

All preprocessing steps that learn values from data must be learned from the training set only. This includes Yeo-Johnson transformation, centering, and scaling.

After these preprocessing steps are fitted on the training data during models training, the same fitted steps are applied to the validation and final evaluation sets.

The validation set is used only for model comparison. The final evaluation set will not be used for training, tuning, or model selection. I will use evaluation set at the end after the final model is selected.

#these will be tree-based model feature sets
train_features_tree_family <- train_features_final_imputed
valid_features_tree_family <- valid_features_final_imputed
eval_features_tree_family <- eval_features_final_imputed

#here linear / regularized model feature sets
train_features_linear_family <- train_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

valid_features_linear_family <- valid_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

eval_features_linear_family <- eval_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

model_family_feature_set_check <- tibble(
  feature_set = c(
    "Tree family training",
    "Tree family validation",
    "Tree family final evaluation",
    "Linear family training",
    "Linear family validation",
    "Linear family final evaluation"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family)
  ),
  missing_values = c(
    sum(is.na(train_features_tree_family)),
    sum(is.na(valid_features_tree_family)),
    sum(is.na(eval_features_tree_family)),
    sum(is.na(train_features_linear_family)),
    sum(is.na(valid_features_linear_family)),
    sum(is.na(eval_features_linear_family))
  )
)

model_family_feature_set_check |>
  kable(
    caption = "Model Family Feature Set Check",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Model Family Feature Set Check
feature_set	rows	columns
Tree family training	2052	32
Tree family validation	515	32
Tree family final evaluation	267	32
Linear family training	2052	33
Linear family validation	515	33
Linear family final evaluation	267	33

processed_data_folder <- "processed_modeling_data"

if (!dir.exists(processed_data_folder)) {
  dir.create(processed_data_folder)
}

write_csv(
  train_features_tree_family,
  file.path(processed_data_folder, "train_features_tree_family.csv")
)

write_csv(
  valid_features_tree_family,
  file.path(processed_data_folder, "valid_features_tree_family.csv")
)

write_csv(
  eval_features_tree_family,
  file.path(processed_data_folder, "eval_features_tree_family.csv")
)

write_csv(
  train_features_linear_family,
  file.path(processed_data_folder, "train_features_linear_family.csv")
)

write_csv(
  valid_features_linear_family,
  file.path(processed_data_folder, "valid_features_linear_family.csv")
)

write_csv(
  eval_features_linear_family,
  file.path(processed_data_folder, "eval_features_linear_family.csv")
)

write_csv(
  tibble(ph = train_ph),
  file.path(processed_data_folder, "train_ph.csv")
)

write_csv(
  tibble(ph = valid_ph),
  file.path(processed_data_folder, "valid_ph.csv")
)

processed_file_summary <- tibble(
  file_name = c(
    "train_features_tree_family.csv",
    "valid_features_tree_family.csv",
    "eval_features_tree_family.csv",
    "train_features_linear_family.csv",
    "valid_features_linear_family.csv",
    "eval_features_linear_family.csv",
    "train_ph.csv",
    "valid_ph.csv"
  ),
  description = c(
    "Training predictors for tree-based models using original mnf_flow",
    "Validation predictors for tree-based models using original mnf_flow",
    "Final evaluation predictors for tree-based models using original mnf_flow",
    "Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Training PH target values",
    "Validation PH target values"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family),
    length(train_ph),
    length(valid_ph)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family),
    1,
    1
  )
)

write_csv(
  processed_file_summary,
  file.path(processed_data_folder, "processed_file_summary.csv")
)

processed_file_summary |>
  kable(
    caption = "Saved Processed Modeling Files",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Saved Processed Modeling Files
file_name	description	rows	columns
train_features_tree_family.csv	Training predictors for tree-based models using original mnf_flow	2052	32
valid_features_tree_family.csv	Validation predictors for tree-based models using original mnf_flow	515	32
eval_features_tree_family.csv	Final evaluation predictors for tree-based models using original mnf_flow	267	32
train_features_linear_family.csv	Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value	2052	33
valid_features_linear_family.csv	Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value	515	33
eval_features_linear_family.csv	Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value	267	33
train_ph.csv	Training PH target values	2052	1
valid_ph.csv	Validation PH target values	515	1

set.seed(123)
project_seed = 123


# num_cores <- 10
# 
# 
# cl <- makePSOCKcluster(num_cores)
# registerDoParallel(cl)

Load Processed Modeling Data

processed_data_folder <- "processed_modeling_data"

train_features_tree_family <- read_csv(
  file.path(processed_data_folder, "train_features_tree_family.csv"),
  show_col_types = FALSE
)

valid_features_tree_family <- read_csv(
  file.path(processed_data_folder, "valid_features_tree_family.csv"),
  show_col_types = FALSE
)

eval_features_tree_family <- read_csv(
  file.path(processed_data_folder, "eval_features_tree_family.csv"),
  show_col_types = FALSE
)

train_features_linear_family <- read_csv(
  file.path(processed_data_folder, "train_features_linear_family.csv"),
  show_col_types = FALSE
)

valid_features_linear_family <- read_csv(
  file.path(processed_data_folder, "valid_features_linear_family.csv"),
  show_col_types = FALSE
)

eval_features_linear_family <- read_csv(
  file.path(processed_data_folder, "eval_features_linear_family.csv"),
  show_col_types = FALSE
)

train_ph <- read_csv(
  file.path(processed_data_folder, "train_ph.csv"),
  show_col_types = FALSE
) |>
  pull(ph)

valid_ph <- read_csv(
  file.path(processed_data_folder, "valid_ph.csv"),
  show_col_types = FALSE
) |>
  pull(ph)

loaded_data_check <- tibble(
  object_name = c(
    "train_features_tree_family",
    "valid_features_tree_family",
    "eval_features_tree_family",
    "train_features_linear_family",
    "valid_features_linear_family",
    "eval_features_linear_family",
    "train_ph",
    "valid_ph"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family),
    length(train_ph),
    length(valid_ph)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family),
    1,
    1
  )
)

loaded_data_check |>
  kable(
    caption = "Loaded Modeling Data Check",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Loaded Modeling Data Check
object_name	rows	columns
train_features_tree_family	2052	32
valid_features_tree_family	515	32
eval_features_tree_family	267	32
train_features_linear_family	2052	33
valid_features_linear_family	515	33
eval_features_linear_family	267	33
train_ph	2052	1
valid_ph	515	1

stopifnot(nrow(train_features_tree_family) == length(train_ph))
stopifnot(nrow(valid_features_tree_family) == length(valid_ph))

stopifnot(nrow(train_features_linear_family) == length(train_ph))
stopifnot(nrow(valid_features_linear_family) == length(valid_ph))

stopifnot(nrow(eval_features_tree_family) == nrow(eval_features_linear_family))

Helper Functions

Before fitting models, I define a few helper functions that will be reused throughout the modeling file. This keeps the model sections shorter and makes the results tables more consistent.

Function 1: Regression metrics

calculate_regression_metrics <- function(actual, predicted) {
  
  tibble(
    rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
    mae = mean(abs(actual - predicted), na.rm = TRUE),
    rsq = cor(actual, predicted, use = "complete.obs")^2
  )
}

This function calculates the main validation metrics used in this project: RMSE, MAE, and R-squared.

Function 2: Make a validation result row

make_model_result_row <- function(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    tuning_parameters,
    actual,
    predicted,
    training_time_seconds
) {
  
  metrics <- calculate_regression_metrics(
    actual = actual,
    predicted = predicted
  )
  
  tibble(
    workflow = workflow,
    model_family = model_family,
    model_name = model_name,
    training_rows = training_rows,
    validation_rows = validation_rows,
    tuning_parameters = tuning_parameters,
    rmse = metrics$rmse,
    mae = metrics$mae,
    rsq = metrics$rsq,
    training_time_seconds = training_time_seconds
  )
}

This function creates one clean result row for each model. Extra tuning information, such as best alpha, lambda, mtry, or cp, can be added when needed.

Function 3: Prepare linear model matrix

prepare_linear_model_matrix <- function(train_x, valid_x, eval_x = NULL) {
  
  dummy_fit <- caret::dummyVars(
    ~ .,
    data = train_x,
    fullRank = TRUE
  )
  
  train_dummy <- predict(dummy_fit, newdata = train_x) |>
    as.data.frame()
  
  valid_dummy <- predict(dummy_fit, newdata = valid_x) |>
    as.data.frame()
  
  if (!is.null(eval_x)) {
    eval_dummy <- predict(dummy_fit, newdata = eval_x) |>
      as.data.frame()
  } else {
    eval_dummy <- NULL
  }
  
  list(
    train = train_dummy,
    valid = valid_dummy,
    eval = eval_dummy,
    dummy_fit = dummy_fit
  )
}

This function creates dummy variables using the training data and then applies the same dummy encoding to validation and final evaluation data. This keeps categorical predictors consistent across all datasets.

Function 4: detect skewed continuous variables

find_skewed_numeric_vars <- function(
  train_x,
  skew_cutoff = 1,
  min_unique_values = 10,
  exclude_vars = c(
    "pressure_setpoint",
    "bowl_setpoint"
  )
) {
  
  numeric_vars <- train_x |>
    select(where(is.numeric)) |>
    names()
  
  skew_summary <- train_x |>
    select(all_of(numeric_vars)) |>
    summarise(
      across(
        everything(),
        list(
          unique_values = ~ n_distinct(.x, na.rm = TRUE),
          skewness = ~ e1071::skewness(.x, na.rm = TRUE)
        )
      )
    ) |>
    pivot_longer(
      cols = everything(),
      names_to = "name",
      values_to = "value"
    ) |>
    separate(
      name,
      into = c("variable", "metric"),
      sep = "_(?=[^_]+$)"
    ) |>
    pivot_wider(
      names_from = metric,
      values_from = value
    ) |>
    mutate(
      abs_skewness = abs(skewness),
      use_yeo_johnson = abs_skewness >= skew_cutoff &
        unique_values >= min_unique_values &
        !(variable %in% exclude_vars)
    ) |>
    arrange(desc(abs_skewness))
  
  list(
    skew_summary = skew_summary,
    yeo_johnson_vars = skew_summary |>
      filter(use_yeo_johnson) |>
      pull(variable)
  )
}

This function identifies numeric variables that are strongly skewed and may need a Yeo-Johnson transformation.

Function 5: Apply high-correlation filtering

apply_correlation_filter <- function(
  train_x,
  valid_x,
  eval_x = NULL,
  cutoff = 0.90
) {
  
  numeric_train <- train_x |>
    select(where(is.numeric))
  
  corr_matrix <- cor(
    numeric_train,
    use = "pairwise.complete.obs"
  )
  
  high_corr_cols <- caret::findCorrelation(
    corr_matrix,
    cutoff = cutoff,
    names = TRUE
  )
  
  if (length(high_corr_cols) > 0) {
    
    train_x <- train_x |>
      select(-all_of(high_corr_cols))
    
    valid_x <- valid_x |>
      select(-all_of(high_corr_cols))
    
    if (!is.null(eval_x)) {
      eval_x <- eval_x |>
        select(-all_of(high_corr_cols))
    }
  }
  
  list(
    train = train_x,
    valid = valid_x,
    eval = eval_x,
    removed_columns = high_corr_cols
  )
}

This function removes highly correlated numeric predictors using the training data only. This is mainly useful for linear and regularized models where strong correlation can make coefficients unstable.

Function 6: NZV filtering

apply_nzv_filter <- function(train_x, valid_x, eval_x = NULL) {
  
  nzv_cols_index <- caret::nearZeroVar(train_x)
  
  removed_columns <- names(train_x)[nzv_cols_index]
  
  if (length(removed_columns) > 0) {
    
    train_x <- train_x |>
      select(-all_of(removed_columns))
    
    valid_x <- valid_x |>
      select(-all_of(removed_columns))
    
    if (!is.null(eval_x)) {
      eval_x <- eval_x |>
        select(-all_of(removed_columns))
    }
  }
  
  list(
    train = train_x,
    valid = valid_x,
    eval = eval_x,
    removed_columns = removed_columns
  )
}

Function 7: Plots of predicted vs observed values

plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
  
  diag_df <- data.frame(
    observed = as.numeric(obs),
    predicted = as.numeric(pred),
    residuals = as.numeric(obs) - as.numeric(pred)
  )
  
  p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
    geom_point(alpha = 0.6, color = "blue") +
    geom_abline(slope = 1, intercept = 0, linetype = 2) +
    labs(
      title = paste("Predicted vs Observed: \n", model_name),
      x = "Predicted PH",
      y = "Observed PH"
    )
  
  p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
    geom_point(alpha = 0.6, color = "blue") +
    geom_hline(yintercept = 0, linetype = 2) +
    labs(
      title = paste("Residuals vs Predicted: \n", model_name),
      x = "Predicted PH",
      y = "Residuals"
    )
  
  p1 + p2
}

The first plot compares predicted values with observed values. If the model is doing well, the points should be close to the diagonal line.

The second plot shows residuals. Ideally, the residuals should be scattered around zero without a strong pattern.

Function 8: Model validation results table

make_and_show_model_result <- function(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    tuning_parameters,
    actual,
    predicted,
    training_time_seconds,
    table_caption = NULL
) {
  
  result <- make_model_result_row(
    workflow = workflow,
    model_family = model_family,
    model_name = model_name,
    training_rows = training_rows,
    validation_rows = validation_rows,
    tuning_parameters = tuning_parameters,
    actual = actual,
    predicted = predicted,
    training_time_seconds = training_time_seconds
  )
  
  if (is.null(table_caption)) {
    table_caption <- paste(model_name, "Validation Performance")
  }
  
  print(
    result |>
      kable(
        caption = table_caption,
        digits = 4
      ) |>
      kable_styling(
        full_width = FALSE,
        position = "center"
      )
  )
  
  return(result)
}

Function 9: Comparison table

add_model_result <- function(results_table, new_result) {
  
  bind_rows(
    results_table,
    new_result
  ) |>
    arrange(rmse)
}

linear_model_results <- tibble()
tree_model_results <- tibble()
all_model_results <- tibble()

Function 10: doParallel settings

start_parallel <- function(cores_to_leave_free = 1) {
  
  library(doParallel)
  
  total_cores <- parallel::detectCores(logical = TRUE)
  
  cores_to_use <- max(1, total_cores - cores_to_leave_free)
  
  cl <- makePSOCKcluster(cores_to_use)
  registerDoParallel(cl)
  
  message("Parallel backend started with ", cores_to_use, " cores.")
  
  return(cl)
}

stop_parallel <- function(cl) {
  
  stopCluster(cl)
  registerDoSEQ()
  
  message("Parallel backend stopped.")
}

Modeling Approach

In this part I will compare several model families using the validation set. The training set will be used for model fitting and cross-validation tuning. The validation set is used only after tuning to compare model performance.

I will use RMSE as the main metric because PH prediction is a regression problem and RMSE penalizes larger errors more strongly. I also report MAE because it is easier to interpret as the average absolute error. \(R^2\) will be included to show how much variation in PH is explained by the model.

Different model families will use different feature sets:

Linear and regularized models use the linear-family feature set with mnf_flow_state + mnf_flow_positive_value.
Tree-based models use the tree-family feature set with the original mnf_flow variable.
Nonlinear models may be tested with one or both feature sets depending on the model.

set.seed(123)

cv_control <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3,
  savePredictions = "final"
)

I’m going to use repeated 5-fold cross-validation with 3 repeats.

The validation set is still kept separate. I will use it to compare the models after they are trained.

Linear and Regularized Models

Since I already created the linear-family feature set, I continue with that version here.

The linear-family feature set uses mnf_flow_state and mnf_flow_positive_value instead of the original mnf_flow, because that made more sense for linear models earlier.

train_linear_model_data <- train_features_linear_family |>
  mutate(ph = train_ph)

valid_linear_model_data <- valid_features_linear_family |>
  mutate(ph = valid_ph)

Near-Zero Variance and Correlation Filtering for Linear Models

Before fitting the linear models, I remove predictors with almost no variation. I did NZV test on raw data in preprocessing secton, but I’m doing it again as precation step since datasets now have imputed values

linear_nzv <- apply_nzv_filter(
  train_x = train_features_linear_family,
  valid_x = valid_features_linear_family
)

train_linear_nzv <- linear_nzv$train
valid_linear_nzv <- linear_nzv$valid

linear_nzv_removed <- tibble(
  removed_predictor = linear_nzv$removed_columns
)

linear_nzv_removed

# A tibble: 0 × 1
# ℹ 1 variable: removed_predictor <chr>

No near-zero variance predictors were removed.

linear_corr <- apply_correlation_filter(
  train_x = train_linear_nzv,
  valid_x = valid_linear_nzv,
  cutoff = 0.90
)

train_linear_filtered <- linear_corr$train
valid_linear_filtered <- linear_corr$valid

stopifnot(exists("train_linear_filtered"))
stopifnot(exists("valid_linear_filtered"))

linear_corr_removed <- tibble(
  removed_predictor = linear_corr$removed_columns
)

linear_corr_removed |>
  kable(
    caption = "Highly Correlated Predictors Removed for Linear Models"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Highly Correlated Predictors Removed for Linear Models
removed_predictor
balling
hyd_pressure3
density
balling_lvl
filler_level

The correlation filter removed five predictors: balling, hyd_pressure3, density, balling_lvl, and filler_level.

train_linear_filtered_data <- train_linear_filtered |>
  mutate(ph = train_ph)

valid_linear_filtered_data <- valid_linear_filtered |>
  mutate(ph = valid_ph)

Yeo-Johnson Variable List for Linear Models

For the linear and regularized models, I first create the list of variables that should receive the Yeo-Johnson transformation.

I do not apply the transformation directly here. The actual Yeo-Johnson transformation will be handled inside the modeling workflow together with centering and scaling.

exclude_from_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

yeo_johnson_vars <- train_linear_filtered_data |>
  select(where(is.numeric)) |>
  select(-any_of(c(exclude_from_yj, "ph"))) |>
  names()

tibble(
  yeo_johnson_variable = yeo_johnson_vars
) |>
  kable(
    caption = "Variables Selected for Yeo-Johnson Transformation"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Variables Selected for Yeo-Johnson Transformation
yeo_johnson_variable
carb_volume
fill_ounces
pc_volume
carb_pressure
carb_temp
psc
psc_fill
psc_co2
carb_pressure1
fill_pressure
hyd_pressure1
hyd_pressure2
hyd_pressure4
filler_speed
temperature
usage_cont
carb_flow
mfr
pressure_vacuum
oxygen_filler
air_pressurer
alch_rel
carb_rel
mnf_flow_positive_value

Create Linear Modeling Data

After the NZV and correlation filtering steps, I add ph back to the filtered linear-family training and validation datasets.

The actual Yeo-Johnson, centering, and scaling steps will be applied inside the model training workflow.

train_linear_model_data <- train_linear_filtered |>
  mutate(ph = train_ph)

valid_linear_model_data <- valid_linear_filtered |>
  mutate(ph = valid_ph)

linear_recipe <- recipe(
  ph ~ .,
  data = train_linear_model_data
) |>
  step_YeoJohnson(all_of(yeo_johnson_vars)) |>
  step_dummy(all_nominal_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

Here I created the preprocessing workflow for the linear and regularized models.

The Yeo-Johnson step uses only the variables selected earlier in yeo_johnson_vars. Also, since some variables have negative values it is better to use Yeo-Johnson method instad of BoxCox. This keeps the transformation limited to the numeric variables we decided are appropriate.

After that, categorical predictors are converted into dummy variables. Then numeric predictors are centered and scaled, which is important for regularized models like ridge, lasso, and elastic net.

I keep this inside the modeling workflow so the preprocessing is connected to cross-validation instead of being applied separately by hand.

Linear Regression Baseline

I start with a regular linear regression model as the first baseline.

I originally tried to use caret::train() with a recipe, but it caused an error during repeated cross-validation.

Because this modeling section uses a recipe for Yeo-Johnson, centering, scaling, and dummy encoding, I switch to the tidymodels workflow. This fits better here because recipes, workflows, parsnip, and rsample are designed to work together.

The modeling logic stays the same. I am not changing the preprocessing plan. I am only changing the model training framework so the recipe can be handled more reliably.

set.seed(123)

linear_folds <- vfold_cv(
  train_linear_model_data,
  v = 5,
  repeats = 3
)

I use repeated 5-fold cross-validation again, but now through tidymodels.

set.seed(123)

linear_model_spec <- linear_reg() |>
  set_engine("lm")

linear_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(linear_model_spec)

set.seed(123)

linear_cv_results <- fit_resamples(
  linear_workflow,
  resamples = linear_folds,
  metrics = metric_set(rmse, mae, rsq),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(linear_cv_results) |>
  kable(
    caption = "Linear Regression Cross-Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Linear Regression Cross-Validation Performance
.metric	.estimator	mean	n	std_err	.config
mae	standard	0.1066	15	0.0008	pre0_mod0_post0
rmse	standard	0.1363	15	0.0010	pre0_mod0_post0
rsq	standard	0.3786	15	0.0093	pre0_mod0_post0

linear_start_time <- Sys.time()

linear_fit <- fit(
  linear_workflow,
  data = train_linear_model_data
)

linear_end_time <- Sys.time()

linear_training_time <- as.numeric(
  difftime(linear_end_time, linear_start_time, units = "secs")
)

linear_validation_predictions <- predict(
  linear_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)

linear_result <- make_model_result_row(
  workflow = "Linear-family ",
  model_family = "Linear / Regularized",
  model_name = "Linear Regression",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = "None",
  actual = valid_ph,
  predicted = linear_validation_predictions,
  training_time_seconds = linear_training_time
)

linear_result |>
  kable(
    caption = "Linear Regression Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Linear Regression Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Linear-family	Linear / Regularized	Linear Regression	2052	515	None	0.1305	0.1007	0.4244	0.1201

The first model I tested was regular linear regression.

This model is used as a baseline model because it is simple and easy to understand.

The validation performance was:

RMSE: 0.1305
MAE: 0.1007
\(R^2\): 0.4244

This result gives me a starting point for comparing all other models. The model was very fast to train, but the prediction performance is not very strong as expected.

The \(R^2\) value of 0.4244 means the model explained some of the PH pattern, but a large part of the variation was still not captured. This makes sense because the data we have probably has nonlinear relationships and interactions between variables.

Overall, linear regression is useful as a baseline, but it was not flexible enough.

plot_model_diagnostics(
  obs = valid_ph,
  pred = linear_validation_predictions,
  model_name = "Linear Regression"
)

The diagnostic plots show why linear regression is useful as a baseline, but not strong enough.

In Predicted vs Observed plot, the points follow the general upward direction, so the model captured some relationship between predictors and PH. However, the points spread around the diagonal line is wide. This means the linear regression produced many over and under predicted values.

The Residuals vs Predicted plot also shows a fairly wide spread of residuals. Some residuals are larger than +0.25 or below -0.25, which means the model had some noticeable overprediction and underprediction cases.

It also appear to be some funnel-shape pattern in the residuals instead of a fully random cloud around zero.

Overall, linear regression gave a useful starting point, but the diagnostic plots show that PH prediction likely needs a more flexible model that can capture nonlinear patterns and interactions between predictors.

Elastic Net Regression

After the regular linear regression model, I continue with elastic net regression.

Elastic net is still a linear model, but it adds regularization. This means it can shrink some coefficients and reduce overfitting compared with a regular linear regression model.

Elastic net combines both ridge and lasso regression.

Elastic Net Tuning Parameters

Elastic net has two main tuning parameters: penalty and mixture.

The penalty controls how aggressivly the model shrinks the coefficients. A small penalty means the model is closer to regular linear regression. On the other hand, larger penalty means stronger shrinkage.

The mixture controls the type of regularization. When mixture = 0, the model behaves pretty much like ridge regression. When mixture = 1, it behaves like lasso regression. Values between 0 and 1 combine both ridge and lasso methods.

In this part, I tune both parameters because I do not want to assume ahead of time whether ridge-like, lasso-like, or mixed regularization will work best.

make_linear_recipe <- function(training_data) {
  
  recipe(
    ph ~ .,
    data = training_data
  ) |>
    step_YeoJohnson(all_of(yeo_johnson_vars)) |>
    step_center(all_numeric_predictors()) |>
    step_scale(all_numeric_predictors()) |>
    step_dummy(all_nominal_predictors())
}

elastic_net_spec <- linear_reg(
  penalty = tune(),
  mixture = tune()
) |>
  set_engine("glmnet")

elastic_net_grid <- grid_regular(
  penalty(range = c(-6, -1)),
  mixture(range = c(0, 1)),
  levels = c(
    penalty = 15,
    mixture = 6
  )
)

elastic_net_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(elastic_net_spec)

elastic_net_grid_2 <- grid_regular(
  penalty(range = c(-6, -1)),
  mixture(range = c(0, 1)),
  levels = c(
    penalty = 15,
    mixture = 6
  )
)

set.seed(123)

elastic_net_start_time <- Sys.time()

elastic_net_tune <- tune_grid(
  elastic_net_workflow,
  resamples = linear_folds,
  grid = elastic_net_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

elastic_net_end_time <- Sys.time()

elastic_net_training_time <- as.numeric(
  difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)
#elastic_net_tune

# elastic_net_tune |>
#   collect_metrics() |>
#   arrange(.metric, mean) |>
#   kable(
#     caption = "Elastic Net Cross-Validation Tuning Results",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

best_elastic_net <- select_best(
  elastic_net_tune,
  metric = "rmse"
)

best_elastic_net

# A tibble: 1 × 3
  penalty mixture .config         
    <dbl>   <dbl> <chr>           
1 0.00848       0 pre0_mod67_post0

Now let’s fit chosen model on training and validation sets

final_elastic_net_workflow <- finalize_workflow(
  elastic_net_workflow,
  best_elastic_net
)

elastic_net_start_time <- Sys.time()

final_elastic_net_fit <- fit(
  final_elastic_net_workflow,
  data = train_linear_model_data
)

elastic_net_end_time <- Sys.time()

elastic_net_training_time <- as.numeric(
  difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)

elastic_net_validation_predictions <- predict(
  final_elastic_net_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)

elastic_net_result <- make_and_show_model_result(
  workflow = "Linear-family ",
  model_family = "Linear / Regularized",
  model_name = "Elastic Net",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "penalty =",
    best_elastic_net$penalty,
    ", mixture =",
    best_elastic_net$mixture
  ),
  actual = valid_ph,
  predicted = elastic_net_validation_predictions,
  training_time_seconds = elastic_net_training_time
)

Elastic Net Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Linear-family	Linear / Regularized	Elastic Net	2052	515	penalty = 0.00848342898244073 , mixture = 0	0.1305	0.1007	0.425	0.1375

The best elastic net model selected penalty = 0.0085 and mixture = 0.

The elastic net model produced almost the same validation performance as the regular linear regression baseline.This makes sense because the best tuning result selected penalty = 0.0085 and mixture = 0. A mixture value of 0 means the model is ridge-like, and the very small penalty means only a tiny amount of shrinkage was applied.
Because of this, the elastic net model stayed very close to the regular linear regression model. In this case, regularization did not meaningfully improve validation performance.

plot_model_diagnostics(
  obs = valid_ph,
  pred = elastic_net_validation_predictions,
  model_name = "Elastic Net"
)

The elastic net diagnostic plots show that the model captures the general direction of PH values, and almost identical with diagnosis plots of previous linear regression.

In the predicted vs observed plot, the points follow an upward trend, but they are not tightly clustered around the diagonal line. The predicted PH values are also compressed into a smaller range than the observed PH values, which suggests that the model is pulling predictions toward the average.

The residual plot shows that the residuals are mostly centered around zero, but there is still noticeable spread. The spread has a mild funnel shape, with larger residual variation for smaller predicted PH values. This suggests that the elastic net model struggles more with explaining variation in the lower predicted PH range.

For larger predicted PH values, there also appears to be a slight downward pattern in the residuals.

Overall, the residual plot suggests that the elastic net model captures some of the general PH pattern, but it does not fully explain the variance across the whole PH range.

Partial Least Squares Regression

I also test Partial Least Squares regression because it can be useful when predictors are correlated.

Instead of using the original predictors directly, PLS creates a smaller number of components (supervised version of PCA) from the predictors. These components are created in a way that tries to explain the relationship with ph.

The main tuning parameter is num_comp, which controls how many PLS components are used. Too few components may underfit, while too many components may start acting more like a regular linear model. For this model I will use tuning with up to 15 components.

train_pls_model_data <- train_linear_nzv |>
  mutate(ph = train_ph)

valid_pls_model_data <- valid_linear_nzv |>
  mutate(ph = valid_ph)

exclude_from_pls_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

pls_yeo_johnson_vars <- train_linear_nzv |>
  select(where(is.numeric)) |>
  select(-any_of(exclude_from_pls_yj)) |>
  names()

make_pls_recipe <- function(training_data) {
  
  recipe(
    ph ~ .,
    data = training_data
  ) |>
    step_YeoJohnson(all_of(pls_yeo_johnson_vars)) |>
    step_center(all_numeric_predictors()) |>
    step_scale(all_numeric_predictors()) |>
    step_dummy(all_nominal_predictors())
}

set.seed(123)

pls_folds <- vfold_cv(
  train_pls_model_data,
  v = 5,
  repeats = 3
)

fit_pls_cv_split <- function(split, num_comp) {
  
  train_split_data <- analysis(split)
  valid_split_data <- assessment(split)
  
  split_recipe <- make_pls_recipe(train_split_data)
  
  split_recipe_prep <- prep(
    split_recipe,
    training = train_split_data,
    retain = TRUE
  )
  
  train_split_baked <- bake(
    split_recipe_prep,
    new_data = train_split_data
  )
  
  valid_split_baked <- bake(
    split_recipe_prep,
    new_data = valid_split_data
  )
  
  pls_fit <- plsr(
    ph ~ .,
    data = train_split_baked,
    ncomp = num_comp,
    method = "kernelpls"
  )
  
  pls_predictions <- predict(
    pls_fit,
    newdata = valid_split_baked,
    ncomp = num_comp
  ) |>
    as.numeric()
  
  tibble(
    num_comp = num_comp,
    rmse = rmse_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    ),
    mae = mae_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    ),
    rsq = rsq_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    )
  )
}

pls_grid <- tibble(
  num_comp = 1:20
)

set.seed(123)

pls_uncorr_start_time <- Sys.time()

pls_uncorr_cv_results <- crossing(
  split_id = seq_along(pls_folds$splits),
  num_comp = pls_grid$num_comp
) |>
  mutate(
    metrics = map2(
      split_id,
      num_comp,
      ~ fit_pls_cv_split(
        split = pls_folds$splits[[.x]],
        num_comp = .y
      )
    )
  ) |>
  select(-split_id, -num_comp) |>
  unnest(metrics)

pls_uncorr_end_time <- Sys.time()

pls_uncorr_training_time <- as.numeric(
  difftime(
    pls_uncorr_end_time,
    pls_uncorr_start_time,
    units = "secs"
  )
)

pls_uncorr_cv_summary <- pls_uncorr_cv_results |>
  group_by(num_comp) |>
  summarise(
    rmse = mean(rmse, na.rm = TRUE),
    mae = mean(mae, na.rm = TRUE),
    rsq = mean(rsq, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(rmse)

pls_uncorr_cv_summary |>
  kable(
    caption = "PLS Cross-Validation Results with Correlated Predictors Kept",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

PLS Cross-Validation Results with Correlated Predictors Kept
num_comp	rmse	mae	rsq
14	0.1350	0.1051	0.3906
15	0.1350	0.1051	0.3904
17	0.1350	0.1052	0.3903
16	0.1350	0.1052	0.3903
18	0.1350	0.1052	0.3901
13	0.1350	0.1053	0.3900
19	0.1351	0.1052	0.3899
12	0.1351	0.1052	0.3897
20	0.1351	0.1051	0.3896
10	0.1352	0.1053	0.3889
11	0.1352	0.1053	0.3886
8	0.1354	0.1057	0.3863
9	0.1354	0.1055	0.3863
7	0.1359	0.1058	0.3823
6	0.1364	0.1064	0.3778
5	0.1372	0.1064	0.3696
4	0.1388	0.1085	0.3557
3	0.1398	0.1091	0.3462
2	0.1443	0.1132	0.3042
1	0.1527	0.1210	0.2212

best_pls_uncorr <- pls_uncorr_cv_summary |>
  slice(1)

best_pls_uncorr |>
  kable(
    caption = "Best PLS Result with Correlated Predictors Kept",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best PLS Result with Correlated Predictors Kept
num_comp	rmse	mae	rsq
14	0.135	0.1051	0.3906

best_pls_uncorr_num_comp <- best_pls_uncorr$num_comp

The best PLS model selected num_comp = 14 and it is a lot of componets are being used.

pls_uncorr_final_recipe <- make_pls_recipe(
  train_pls_model_data
)

pls_uncorr_final_recipe_prep <- prep(
  pls_uncorr_final_recipe,
  training = train_pls_model_data,
  retain = TRUE
)

train_pls_uncorr_baked <- bake(
  pls_uncorr_final_recipe_prep,
  new_data = train_pls_model_data
)

valid_pls_uncorr_baked <- bake(
  pls_uncorr_final_recipe_prep,
  new_data = valid_pls_model_data
)

pls_uncorr_final_start_time <- Sys.time()

pls_uncorr_final_model <- plsr(
  ph ~ .,
  data = train_pls_uncorr_baked,
  ncomp = best_pls_uncorr_num_comp,
  method = "kernelpls"
)

pls_uncorr_final_end_time <- Sys.time()

pls_uncorr_final_training_time <- as.numeric(
  difftime(
    pls_uncorr_final_end_time,
    pls_uncorr_final_start_time,
    units = "secs"
  )
)

pls_uncorr_validation_predictions <- predict(
  pls_uncorr_final_model,
  newdata = valid_pls_uncorr_baked,
  ncomp = best_pls_uncorr_num_comp
) |>
  as.numeric()

pls_uncorr_result <- make_and_show_model_result(
  workflow = "Linear-family / PLS keeps correlated predictors",
  model_family = "Linear / Regularized",
  model_name = "PLS Regression",
  training_rows = nrow(train_pls_model_data),
  validation_rows = nrow(valid_pls_model_data),
  tuning_parameters = paste(
    "num_comp =",
    best_pls_uncorr_num_comp
  ),
  actual = valid_ph,
  predicted = pls_uncorr_validation_predictions,
  training_time_seconds = pls_uncorr_final_training_time
)

PLS Regression Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Linear-family / PLS keeps correlated predictors	Linear / Regularized	PLS Regression	2052	515	num_comp = 14	0.1286	0.0987	0.4403	0.0095

The best PLS model selected num_comp = 14 and it is a lot of componets.

This means the model needed many components to get its best cross-validation result. In this case, PLS did not reduce the predictor space into a small number of very strong components.

The best PLS cross-validation performance was:

RMSE: 0.1362
MAE: 0.1066
\(R^2\): 0.3792

Compared with the regular linear regression and elastic net models, PLS performed slightly worse. This suggests that reducing the predictors into PLS components did not improve PH prediction for this dataset.

Because of this, PLS is useful as another linear-family comparison model, but it does not look like the strongest model in this family.

The plots are almost identical to previous plots, and my previous explanations of ENET diagnostics apply to these plots as well.

plot_model_diagnostics(
  obs = valid_ph,
  pred = pls_uncorr_validation_predictions,
  model_name = "PLS Regression"
)

The PLS diagnostic plots look very similar to the regular linear regression plots.

In the Predicted vs Observed plot, the points follow the general upward direction, so PLS captured some relationship between the predictors and PH. However, there is still a wide spread around the diagonal line. This means the predictions are not very close to the observed PH values for many validation rows.

The Residuals vs Predicted plot also shows that residuals are spread fairly wide around zero. There are still some larger overpredictions and underpredictions.

PLS was useful because it can handle correlated predictors better than regular linear regression. This is why I tested it after seeing correlation between some process variables.

However, the plots show that PLS still behaves mostly like a linear model. It improved slightly over regular linear regression, but it still did not capture the nonlinear patterns.

Overall, PLS was a reasonable linear-family model to test, but it was not flexible enough.

linear_model_results <- tibble()

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = linear_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = elastic_net_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = pls_uncorr_result
)

# linear_model_results <- linear_model_results |>
#   filter(model_name != "PLS Regression")
# 
# all_model_results <- all_model_results |>
#   filter(model_name != "PLS Regression")

linear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Linear and Regularized Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Linear and Regularized Model Comparison
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Linear-family / PLS keeps correlated predictors	Linear / Regularized	PLS Regression	2052	515	num_comp = 14	0.1286	0.0987	0.4403	0.0095
Linear-family	Linear / Regularized	Linear Regression	2052	515	None	0.1305	0.1007	0.4244	0.1201
Linear-family	Linear / Regularized	Elastic Net	2052	515	penalty = 0.00848342898244073 , mixture = 0	0.1305	0.1007	0.4250	0.1375

The linear and regularized model comparison shows that PLS regression performed slightly better than regular linear regression and elastic net.

PLS regression:

RMSE: 0.1286
MAE: 0.0987
\(R^2\): 0.4403

Linear regression:

RMSE: 0.1305
MAE: 0.1007
\(R^2\): 0.4244

Elastic net:

RMSE: 0.1305
MAE: 0.1007
\(R^2\): 0.4250

For PLS, I kept the correlated predictors instead of applying correlation filtering. This is because PLS is designed to work with correlated predictors by creating components from them. Removing correlated predictors before PLS could remove useful information.

The best PLS model selected num_comp = 14. This means the model needed many components to reach its best performance, so it did not reduce the data into only a few strong components. Still, PLS gave a small improvement over the regular linear regression and elastic net models.

Elastic net selected mixture = 0, which means it behaved like ridge regression. Its performance was almost the same as regular linear regression, so regularization did not add much improvement in this section.

Overall, PLS was the best model inside the linear and regularized model family, but the improvement was small. This tells us that linear models are able to capture some PH patterns, but linear models still leave a good amount of data structure unexplained.

linear_model_results <- tibble()
all_model_results <- tibble()

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = linear_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = elastic_net_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = pls_uncorr_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = linear_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = elastic_net_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = pls_uncorr_result
)

#all_model_results

Nonlinear Models

After the linear-family models, I will test nonlinear models.

The linear models gave very similar prediction patterns, so the next question is if flexible models can capture nonlinear relationships or interactions in the production data.

For these models, I continue using the linear-family feature set. This is because mnf_flow was already separated into mnf_flow_state and mnf_flow_positive_value, which is safer for models that depend on distances, scaling, or smooth nonlinear relationships.

I also keep the same preprocessing recipe with Yeo-Johnson, centering, scaling, and dummy encoding. This is important for SVM and neural networks because they are sensitive to variable scale.

nonlinear_model_results <- tibble()

SVM with Radial Basis Kernel

I start the nonlinear model section with SVM using a radial basis kernel.
This model can capture nonlinear patterns by comparing observations based on distance. Because of that, here scaling is important. I use the same preprocessing recipe from the previous section, where numeric variables are transformed, centered, and scaled before modeling.

The main tuning parameters are cost and rbf_sigma.

cost controls how much the model tries to avoid errors. A larger cost allows less flexibility for mistakes, while a smaller cost allows more tolerance.
rbf_sigma controls how local or smooth the radial kernel is. Smaller values usually create smoother patterns, while larger values can make the model more sensitive to nearby points.

svm_rbf_spec <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()
) |>
  set_mode("regression") |>
  set_engine("kernlab")

svm_rbf_spec <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()) |>
  set_mode("regression") |>
  set_engine("kernlab")

svm_rbf_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(svm_rbf_spec)

svm_rbf_grid <- grid_regular(
  cost(range = c(-3, 5)),
  rbf_sigma(range = c(-6, 1)),
  levels = c(
    cost = 9,
    rbf_sigma = 8
  )
)

# svm_rbf_grid |>
#   kable(
#     caption = "SVM Radial Tuning Grid",
#     digits = 6
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

set.seed(123)

svm_rbf_start_time <- Sys.time()

svm_rbf_tune <- tune_grid(
  svm_rbf_workflow,
  resamples = linear_folds,
  grid = svm_rbf_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

svm_rbf_end_time <- Sys.time()

svm_rbf_training_time <- as.numeric(
  difftime(svm_rbf_end_time, svm_rbf_start_time, units = "secs")
)

svm_rbf_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top SVM Radial Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top SVM Radial Tuning Results by RMSE
cost	rbf_sigma	.metric	.estimator	mean	n	std_err	.config
2	0.10	rmse	standard	0.1201	15	0.0009	pre0_mod38_post0
4	0.10	rmse	standard	0.1210	15	0.0008	pre0_mod46_post0
8	0.01	rmse	standard	0.1216	15	0.0010	pre0_mod53_post0
1	0.10	rmse	standard	0.1218	15	0.0010	pre0_mod30_post0
8	0.10	rmse	standard	0.1220	15	0.0008	pre0_mod54_post0
4	0.01	rmse	standard	0.1221	15	0.0011	pre0_mod45_post0
16	0.10	rmse	standard	0.1225	15	0.0008	pre0_mod62_post0
32	0.10	rmse	standard	0.1225	15	0.0008	pre0_mod70_post0
16	0.01	rmse	standard	0.1226	15	0.0011	pre0_mod61_post0
2	0.01	rmse	standard	0.1234	15	0.0011	pre0_mod37_post0

best_svm_rbf <- select_best(
  svm_rbf_tune,
  metric = "rmse"
)

best_svm_rbf |>
  kable(
    caption = "Best SVM Radial Tuning Parameters",
    digits = 6
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best SVM Radial Tuning Parameters
cost	rbf_sigma	.config
2	0.1	pre0_mod38_post0

In this SVM radial tuning grid found a good region around cost = 2 and rbf_sigma = 0.10.

As we see in tuning results table above, several of the best tuning combinations are close to each other, which suggests that the model was not depending on just one lucky parameter setting. Instead, there seemed to be a good tuning region.

For example, this tuning grid showed strong results for:

- cost = 2, rbf_sigma = 0.10 , mean RMSE = 0.1201
- cost = 4, rbf_sigma = 0.10``, mean RMSE = 0.1210
- cost = 1, rbf_sigma = 0.10``, mean RMSE = 0.1218
- cost = 8, rbf_sigma = 0.01``, mean RMSE = 0.1216

Because nearby values had similar RMSE scores, below I will run a refined tuning grid around that area. This will allowe me to test smaller changes in cost and rbf_sigma instead of only using the wider first grid.

svm_rbf_grid |>
  summarise(
    min_cost = min(cost),
    max_cost = max(cost),
    min_rbf_sigma = min(rbf_sigma),
    max_rbf_sigma = max(rbf_sigma)
  )

# A tibble: 1 × 4
  min_cost max_cost min_rbf_sigma max_rbf_sigma
     <dbl>    <dbl>         <dbl>         <dbl>
1    0.125       32      0.000001            10

svm_rbf_grid_refined <- expand_grid(
  cost = c(1, 1.5, 2, 3, 4, 6, 8),
  rbf_sigma = c(0.05, 0.075, 0.10, 0.125, 0.15, 0.20)
)

# svm_rbf_grid_refined |>
#   kable(
#     caption = "Refined SVM Radial Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

set.seed(123)

svm_rbf_refined_start_time <- Sys.time()

svm_rbf_tune_refined <- tune_grid(
  svm_rbf_workflow,
  resamples = linear_folds,
  grid = svm_rbf_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

svm_rbf_refined_end_time <- Sys.time()

svm_rbf_refined_training_time <- as.numeric(
  difftime(
    svm_rbf_refined_end_time,
    svm_rbf_refined_start_time,
    units = "secs"
  )
)

best_svm_rbf_refined <- select_best(
  svm_rbf_tune_refined,
  metric = "rmse"
)

best_svm_rbf_refined |>
  kable(
    caption = "Best Refined SVM Radial Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Refined SVM Radial Tuning Parameters
cost	rbf_sigma	.config
3	0.05	pre0_mod19_post0

The refined tuning selected:

cost: 3
rbf_sigma: 0.05
RMSE: 0.1183

This improved the cross-validation RMSE from 0.1201 in the first grid to 0.1183 in the refined grid.

This result can indicate that the SVM radial model performs better with a moderate cost value and a smoother radial kernel. In practical terms, the model is flexible enough to capture nonlinear patterns, but it is not using an overly aggressive kernel setting.

final_svm_rbf_workflow <- finalize_workflow(
  svm_rbf_workflow,
  best_svm_rbf_refined
)

svm_rbf_final_start_time <- Sys.time()

final_svm_rbf_fit <- fit(
  final_svm_rbf_workflow,
  data = train_linear_model_data
)

svm_rbf_final_end_time <- Sys.time()

svm_rbf_final_training_time <- as.numeric(
  difftime(
    svm_rbf_final_end_time,
    svm_rbf_final_start_time,
    units = "secs"
  )
)

svm_rbf_validation_predictions <- predict(
  final_svm_rbf_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)

svm_rbf_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "SVM Radial",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "cost =",
    best_svm_rbf_refined$cost,
    ", / rbf_sigma =",
    best_svm_rbf_refined$rbf_sigma
  ),
  actual = valid_ph,
  predicted = svm_rbf_validation_predictions,
  training_time_seconds = svm_rbf_final_training_time
)

SVM Radial Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Nonlinear-family	Nonlinear	SVM Radial	2052	515	cost = 3 , / rbf_sigma = 0.05	0.1112	0.0812	0.5806	0.8783

The SVM radial model performed better than the linear-family models on the validation data.

RMSE: 0.1112
MAE: 0.0812
\(R^2\): 0.5806

This is an improvement over the linear regression, elastic net, and PLS models, which all had validation RMSE values around 0.1305 and \(R^2\) values around 0.425.

The best SVM radial model used cost = 3 and rbf_sigma = 0.05. This means the model used a moderate cost value and a smoother radial kernel.

This result suggests that the relationships between predictors and PH are not only linear. The radial kernel performed better to capture nonlinear patterns that the linear-family models could not capture.

plot_model_diagnostics(
  obs = valid_ph,
  pred = svm_rbf_validation_predictions,
  model_name = "SVM Radial"
)

The SVM radial diagnostic plots show better behavior than the linear-family models.

In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. There are few far points from the line for predictions less than 8.72, and the cluster of predictions above 8.72 PH is much tighter to diagonal line. This suggests that SVM radial basis model has a tendency to over or under predict PH values less than 8.72.

The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread, especially around the middle predicted PH range with some of the predictions with larger negative residuals (overpredicted), but the residual pattern looks more balanced than the linear-family models.

Overall, there are still some larger errors, which means this model is not perfect. However, compared with the earlier linear models, the SVM radial model captures more of the PH variation and handles nonlinear structure better.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = svm_rbf_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = svm_rbf_result
)

nonlinear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Current Nonlinear Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
    
  )

Current Nonlinear Model Comparison
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Nonlinear-family	Nonlinear	SVM Radial	2052	515	cost = 3 , / rbf_sigma = 0.05	0.1112	0.0812	0.5806	0.8783

Multivariate Adaptive Regression Splines (MARS) Model

Next, I test a MARS model.

MARS can capture nonlinear relationships by creating “hinged” linear terms. This makes it more flexible than regular linear regression, but usually easier to interpret than models like SVM or neural n

The main tuning parameters are num_terms and prod_degree.

num_terms controls how many model terms are allowed. A larger value gives the model more flexibility.
prod_degree controls whether the model can include interaction terms. A value of 1 allows only main effects, while a value of 2 allows two-way interactions, like multiplicative interactions.

mars_spec <- mars(
  num_terms = tune(),
  prod_degree = tune(),
  prune_method = "none") |>
  set_mode("regression") |>
  set_engine("earth")

mars_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(mars_spec)

mars_grid <- expand_grid(
  num_terms = seq(5, 50, by = 5),
  prod_degree = c(1, 2)
)

# mars_grid |>
#   kable(
#     caption = "MARS Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

set.seed(123)

mars_start_time <- Sys.time()

mars_tune <- tune_grid(
  mars_workflow,
  resamples = linear_folds,
  grid = mars_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

mars_end_time <- Sys.time()

mars_training_time <- as.numeric(
  difftime(mars_end_time, mars_start_time, units = "secs")
)

mars_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top MARS Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top MARS Tuning Results by RMSE
num_terms	prod_degree	.metric	.estimator	mean	n	std_err	.config
35	2	rmse	standard	0.1293	15	0.0015	pre0_mod14_post0
40	2	rmse	standard	0.1294	15	0.0014	pre0_mod16_post0
50	2	rmse	standard	0.1295	15	0.0016	pre0_mod20_post0
45	2	rmse	standard	0.1295	15	0.0016	pre0_mod18_post0
30	2	rmse	standard	0.1297	15	0.0015	pre0_mod12_post0
25	2	rmse	standard	0.1315	15	0.0014	pre0_mod10_post0
20	2	rmse	standard	0.1341	15	0.0013	pre0_mod08_post0
15	1	rmse	standard	0.1353	15	0.0010	pre0_mod05_post0
15	2	rmse	standard	0.1366	15	0.0010	pre0_mod06_post0
10	2	rmse	standard	0.1380	15	0.0011	pre0_mod04_post0

best_mars <- select_best(
  mars_tune,
  metric = "rmse"
)

best_mars |>
  kable(
    caption = "Best MARS Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best MARS Tuning Parameters
num_terms	prod_degree	.config
35	2	pre0_mod14_post0

The best MARS model selected num_terms = 35 and prod_degree = 2.

This means the best version allowed up to 35 model terms and included two-way interaction terms. The top MARS results all used prod_degree = 2, which suggests that interactions between predictors helped the model more than main effects only.

The best cross-validation RMSE was 0.1296. This is close to the linear-family models, but it is not as strong as the SVM radial model.

This result suggests that MARS added some flexibility through piecewise linear terms and interactions, but it still did not capture the PH patterns as well as the radial SVM model.

final_mars_workflow <- finalize_workflow(
  mars_workflow,
  best_mars
)

mars_final_start_time <- Sys.time()

final_mars_fit <- fit(
  final_mars_workflow,
  data = train_linear_model_data
)

mars_final_end_time <- Sys.time()

mars_final_training_time <- as.numeric(
  difftime(
    mars_final_end_time,
    mars_final_start_time,
    units = "secs"
  )
)

mars_validation_predictions <- predict(
  final_mars_fit,
  new_data = valid_linear_model_data) |>
  pull(.pred)

mars_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "MARS",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "num_terms =",
    best_mars$num_terms,
    ", prod_degree =",
    best_mars$prod_degree
  ),
  actual = valid_ph,
  predicted = mars_validation_predictions,
  training_time_seconds = mars_final_training_time
)

MARS Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Nonlinear-family	Nonlinear	MARS	2052	515	num_terms = 35 , prod_degree = 2	0.1212	0.0907	0.5013	0.3193

The final MARS model used:

num_terms: 35
prod_degree: 2

Its validation performance was:

RMSE: 0.1212
MAE: 0.0907
\(R^2\): 0.5013

This result is better than overall linear-family models, but weaker than SVM radial.

The selected prod_degree = 2 means the model used two-way interaction terms. This indicate that some relationships between predictors and PH are not just simple one-variable effects. The model improved by allowing predictors to work together.

The selected num_terms = 35 means MARS needed a flexible model with many piecewise terms.

It looks like MARS captured more nonlinear structure than linear regression, elastic net, and PLS. This suggests that piecewise linear interactions helped, but they were still not flexible enough to capture the full data pattern.

plot_model_diagnostics(
  obs = valid_ph,
  pred = mars_validation_predictions,
  model_name = "MARS"
)

MARS diagnostic plots also show better behavior than the linear-family models.

In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. But the model still has some over and under predictions

The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread. As we see around the middle predicted PH range some of the predictions with larger residuals (overpredicted), and it the same tendency as we saw in SVM radial diagnostics plots.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = mars_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = mars_result
)

Neural Network Model

Next, I test a neural network model.

Neural networks can capture nonlinear relationships, but they are also more sensitive to preprocessing and tuning. Because of this, I use the same linear-family preprocessing recipe with Yeo-Johnson, centering, scaling, and dummy encoding.

The main tuning parameters are:

hidden_units: number of neurons in the hidden layer
penalty: regularization strength to reduce overfitting
epochs: number of training iterations

nn_spec <- mlp(
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "nnet",
    trace = FALSE,
    MaxNWts = 5000
  )

nn_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(nn_spec)

nn_grid <- expand_grid(
  hidden_units = c(3, 5, 7, 10),
  penalty = c(0.0001, 0.001, 0.01, 0.1),
  epochs = c(100, 200, 500)
)

# nn_grid |>
#   kable(
#     caption = "Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

set.seed(123)

nn_start_time <- Sys.time()

nn_tune <- tune_grid(
  nn_workflow,
  resamples = linear_folds,
  grid = nn_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

nn_end_time <- Sys.time()

nn_training_time <- as.numeric(
  difftime(nn_end_time, nn_start_time, units = "secs")
)

nn_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top Neural Network Tuning Results by RMSE
hidden_units	penalty	epochs	.metric	.estimator	mean	n	std_err	.config
10	0.10	100	rmse	standard	0.1264	15	0.0017	pre0_mod46_post0
5	0.10	500	rmse	standard	0.1283	15	0.0010	pre0_mod24_post0
3	0.01	500	rmse	standard	0.1292	15	0.0013	pre0_mod09_post0
7	0.10	200	rmse	standard	0.1292	15	0.0010	pre0_mod35_post0
5	0.10	200	rmse	standard	0.1295	15	0.0015	pre0_mod23_post0
3	0.10	500	rmse	standard	0.1295	15	0.0013	pre0_mod12_post0
7	0.10	500	rmse	standard	0.1295	15	0.0019	pre0_mod36_post0
5	0.01	100	rmse	standard	0.1306	15	0.0014	pre0_mod19_post0
7	0.01	100	rmse	standard	0.1307	15	0.0017	pre0_mod31_post0
5	0.01	200	rmse	standard	0.1309	15	0.0011	pre0_mod20_post0

best_nn <- select_best(
  nn_tune,
  metric = "rmse"
)

best_nn |>
  kable(
    caption = "Best Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Neural Network Tuning Parameters
hidden_units	penalty	epochs	.config
10	0.1	100	pre0_mod46_post0

The first neural network tuning grid selected hidden_units = 10, penalty = 0.10, and epochs = 100.

Since hidden_units = 10 and penalty = 0.10 were both at the upper end of the first grid, I will run a new refined grid around that area in following code. This will check if slightly larger network or stronger regularization can improve accuracy.

I also will change the epoch values to the lower range (50, 100, 150) because the first grid did not show improvement from using 200 or 500 epochs.

nn_grid_refined <- expand_grid(
  hidden_units = c(8, 10, 12, 15, 20),
  penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
  epochs = c(50, 100, 150)
)

# nn_grid_refined |>
#   kable(
#     caption = "Refined Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

set.seed(123)

nn_refined_start_time <- Sys.time()

nn_tune_refined <- tune_grid(
  nn_workflow,
  resamples = linear_folds,
  grid = nn_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

nn_refined_end_time <- Sys.time()

nn_refined_training_time <- as.numeric(
  difftime(
    nn_refined_end_time,
    nn_refined_start_time,
    units = "secs"
  )
)

nn_tune_refined |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Refined Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top Refined Neural Network Tuning Results by RMSE
hidden_units	penalty	epochs	.metric	.estimator	mean	n	std_err	.config
12	0.5	150	rmse	standard	0.1232	15	0.0013	pre0_mod45_post0
15	0.5	150	rmse	standard	0.1234	15	0.0014	pre0_mod60_post0
12	0.3	150	rmse	standard	0.1247	15	0.0014	pre0_mod42_post0
10	0.5	150	rmse	standard	0.1251	15	0.0011	pre0_mod30_post0
12	0.5	100	rmse	standard	0.1253	15	0.0017	pre0_mod44_post0
15	0.3	150	rmse	standard	0.1259	15	0.0014	pre0_mod57_post0
15	0.5	100	rmse	standard	0.1259	15	0.0012	pre0_mod59_post0
12	0.2	100	rmse	standard	0.1261	15	0.0012	pre0_mod38_post0
15	0.3	100	rmse	standard	0.1268	15	0.0013	pre0_mod56_post0
10	0.3	100	rmse	standard	0.1273	15	0.0014	pre0_mod26_post0

best_nn_refined <- select_best(
  nn_tune_refined,
  metric = "rmse"
)

best_nn_refined |>
  kable(
    caption = "Best Refined Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Refined Neural Network Tuning Parameters
hidden_units	penalty	epochs	.config
12	0.5	150	pre0_mod45_post0

The refined neural network tuning improved the model compared with the first tuning grid.

The first grid selected:

hidden_units: 10
penalty: 0.10
epochs: 100
Cross-validation RMSE: 0.1264

The refined grid selected:

hidden_units: 12
penalty: 0.50
epochs: 150
Cross-validation RMSE: 0.1232

This shows that the neural network benefited from a slightly larger hidden layer and stronger regularization. The stronger penalty = 0.50 suggests that the model needed shrinkage to avoid overfitting.

final_nn_workflow <- finalize_workflow(
  nn_workflow,
  best_nn_refined
)

nn_final_start_time <- Sys.time()

final_nn_fit <- fit(
  final_nn_workflow,
  data = train_linear_model_data
)

nn_final_end_time <- Sys.time()

nn_final_training_time <- as.numeric(
  difftime(
    nn_final_end_time,
    nn_final_start_time,
    units = "secs"
  )
)

nn_validation_predictions <- predict(
  final_nn_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)

nn_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "Neural Network",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "hidden_units =",
    best_nn_refined$hidden_units,
    ", penalty =",
    best_nn_refined$penalty,
    ", epochs =",
    best_nn_refined$epochs
  ),
  actual = valid_ph,
  predicted = nn_validation_predictions,
  training_time_seconds = nn_final_training_time
)

Neural Network Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Nonlinear-family	Nonlinear	Neural Network	2052	515	hidden_units = 12 , penalty = 0.5 , epochs = 150	0.1226	0.0926	0.4984	1.0357

The refined neural network model improved over the linear-family models, but it did not outperform the SVM radial model.

The final neural network used:

hidden_units: 12
penalty: 0.5
epochs: 150

Its validation performance was:

RMSE: 0.1226
MAE: 0.0926
\(R^2\): 0.4984

This is better than the linear regression, elastic net, and PLS models, which had RMSE values around 0.1305 and \(R^2\) values around 0.425. This suggests that the neural network captured some nonlinear structure that the linear-family models missed.

However, the neural network did not perform as well as the SVM radial model. SVM radial still had lower RMSE and MAE and a higher \(R^2\) on the validation data.

The stronger penalty value, penalty = 0.5, also tells us that regularization was important for the neural network. Without enough regularization, the neural network possibly could overfit training data.

p1 <- plot_model_diagnostics(
  obs = valid_ph,
  pred = nn_validation_predictions,
  model_name = "Neural Network"
)
p1

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = nn_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = nn_result
)

Neural Network with Tree-Family Features Dataset

I also test the neural network using the tree-family feature set.

This is an experimental comparison because the tree-family data keeps the original mnf_flow variable. Neural networks are sensitive to scale and numeric distances, so this version may or may not work better than the linear-family version.

To make this safer, I still use Yeo-Johnson, centering, scaling, and dummy encoding. However, I exclude mnf_flow from Yeo-Johnson because it may represent special process states, not just a regular continuous measurement.

train_tree_nn_model_data <- train_features_tree_family |>
  mutate(ph = train_ph)

valid_tree_nn_model_data <- valid_features_tree_family |>
  mutate(ph = valid_ph)

exclude_from_tree_nn_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "mnf_flow"
)

tree_nn_yj_vars <- train_features_tree_family |>
  select(where(is.numeric)) |>
  select(-any_of(exclude_from_tree_nn_yj)) |>
  names()

tibble(
  yeo_johnson_variable = tree_nn_yj_vars
) |>
  kable(
    caption = "Tree-Family Variables Selected for Yeo-Johnson in Neural Network"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Tree-Family Variables Selected for Yeo-Johnson in Neural Network
yeo_johnson_variable
carb_volume
fill_ounces
pc_volume
carb_pressure
carb_temp
psc
psc_fill
psc_co2
carb_pressure1
fill_pressure
hyd_pressure1
hyd_pressure2
hyd_pressure3
hyd_pressure4
filler_level
filler_speed
temperature
usage_cont
carb_flow
density
mfr
balling
pressure_vacuum
oxygen_filler
air_pressurer
alch_rel
carb_rel
balling_lvl

tree_nn_recipe <- recipe(
  ph ~ .,
  data = train_tree_nn_model_data
) |>
  step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

tree_nn_recipe <- recipe(
  ph ~ .,
  data = train_tree_nn_model_data
) |>
  step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

tree_nn_grid <- expand_grid(
  hidden_units = c(8, 10, 12, 15, 20),
  penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
  epochs = c(50, 100, 150)
)

# tree_nn_grid |>
#   kable(
#     caption = "Tree-Family Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

tree_nn_folds <- vfold_cv(
  train_tree_nn_model_data,
  v = 5,
  repeats = 3
)

tree_nn_spec <- mlp(
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "nnet",
    trace = FALSE,
    MaxNWts = 5000
  )

tree_nn_workflow <- workflow() |>
  add_recipe(tree_nn_recipe) |>
  add_model(tree_nn_spec)

set.seed(123)

tree_nn_start_time <- Sys.time()

tree_nn_tune <- tune_grid(
  tree_nn_workflow,
  resamples = tree_nn_folds,
  grid = tree_nn_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

tree_nn_end_time <- Sys.time()

tree_nn_training_time <- as.numeric(
  difftime(
    tree_nn_end_time,
    tree_nn_start_time,
    units = "secs"
  )
)

tree_nn_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Tree-Family Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top Tree-Family Neural Network Tuning Results by RMSE
hidden_units	penalty	epochs	.metric	.estimator	mean	n	std_err	.config
15	0.5	150	rmse	standard	0.1220	15	0.0017	pre0_mod60_post0
15	0.3	150	rmse	standard	0.1234	15	0.0020	pre0_mod57_post0
12	0.5	150	rmse	standard	0.1247	15	0.0021	pre0_mod45_post0
12	0.2	150	rmse	standard	0.1251	15	0.0019	pre0_mod39_post0
10	0.5	150	rmse	standard	0.1253	15	0.0019	pre0_mod30_post0
12	0.3	150	rmse	standard	0.1258	15	0.0020	pre0_mod42_post0
15	0.2	100	rmse	standard	0.1260	15	0.0018	pre0_mod53_post0
15	0.5	100	rmse	standard	0.1261	15	0.0026	pre0_mod59_post0
12	0.5	100	rmse	standard	0.1263	15	0.0025	pre0_mod44_post0
15	0.3	100	rmse	standard	0.1267	15	0.0019	pre0_mod56_post0

best_tree_nn <- select_best(
  tree_nn_tune,
  metric = "rmse"
)

best_tree_nn |>
  kable(
    caption = "Best Tree-Family Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Tree-Family Neural Network Tuning Parameters
hidden_units	penalty	epochs	.config
15	0.5	150	pre0_mod60_post0

final_tree_nn_workflow <- finalize_workflow(
  tree_nn_workflow,
  best_tree_nn
)

tree_nn_final_start_time <- Sys.time()

final_tree_nn_fit <- fit(
  final_tree_nn_workflow,
  data = train_tree_nn_model_data
)

tree_nn_final_end_time <- Sys.time()

tree_nn_final_training_time <- as.numeric(
  difftime(
    tree_nn_final_end_time,
    tree_nn_final_start_time,
    units = "secs"
  )
)

tree_nn_validation_predictions <- predict(
  final_tree_nn_fit,
  new_data = valid_tree_nn_model_data
) |>
  pull(.pred)

tree_nn_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Nonlinear",
  model_name = "Neural Network with Tree-Family Features",
  training_rows = nrow(train_tree_nn_model_data),
  validation_rows = nrow(valid_tree_nn_model_data),
  tuning_parameters = paste(
    "hidden_units =",
    best_tree_nn$hidden_units,
    ", penalty =",
    best_tree_nn$penalty,
    ", epochs =",
    best_tree_nn$epochs
  ),
  actual = valid_ph,
  predicted = tree_nn_validation_predictions,
  training_time_seconds = tree_nn_final_training_time
)

Neural Network with Tree-Family Features Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Tree-family	Nonlinear	Neural Network with Tree-Family Features	2052	515	hidden_units = 15 , penalty = 0.5 , epochs = 150	0.1432	0.0989	0.366	1.209

I tested two neural network versions.

The first neural network used the nonlinear-family data set. It selected:

hidden_units = 12
penalty = 0.5
epochs = 150

Its validation performance was:

RMSE: 0.1226
MAE: 0.0926
\(R^2\): 0.4984

The second neural network used the tree-family data set. It selected:

hidden_units = 15
penalty = 0.5
epochs = 150

Its validation performance was:

RMSE: 0.1432
MAE: 0.0989
\(R^2\): 0.3660

The regular neural network performed better than the neural network with tree-family dataset. Its RMSE and MAE were lower, and its \(R^2\) was higher.

This means that the tree-family feature version did not improve neural network performance. In fact, it made the neural network worse. This may be because the neural network was more sensitive to how the predictors were represented and scaled.

Overall, the neural network trained/tested with linear family dataset was the better neural network version.

p2 <- plot_model_diagnostics(
  obs = valid_ph,
  pred = tree_nn_validation_predictions,
  model_name = "Neural Network with Tree-Family Features"
)

p1/p2

I also compared the diagnostic plots for the two neural network versions.

The first neural network plot shows that the model captured the general PH trend better than the linear models. Most points follow the diagonal direction, but there is still a noticeable spread around the line. This means the model improved over simple linear models, but still had some larger prediction errors.

The residual plot for the first neural network is mostly centered around zero, but there are still some clear overprediction and underprediction cases. A few residuals are close to -0.50 and above +0.50, which means the model struggled with some validation observations.

The second neural network was trained and tested using the tree-family dataset. Its Predicted vs Observed plot shows a slightly different pattern. The points look more concentrated in the middle and upper PH range, but the overall prediction pattern is weaker.

The residual plot for the second neural network shows that most errors are still close to zero, but there are also a few larger outliers. There are a couple of observations that were severely underpredicted, which can be seen on the left side of the diagnostic plots.

The regular neural network performed better than the neural network with tree-family features. The tree-family feature version did not improve neural network performance and actually produced weaker test results.

This comparison also supports my decision to create separate feature sets for different model families. The same feature engineering approach does not always help every model. As we see, some models benefit from one predictor structure, while other models may perform worse with it.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = tree_nn_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = tree_nn_result
)

nonlinear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Updated Nonlinear Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Updated Nonlinear Model Comparison
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Nonlinear-family	Nonlinear	SVM Radial	2052	515	cost = 3 , / rbf_sigma = 0.05	0.1112	0.0812	0.5806	0.8783
Nonlinear-family	Nonlinear	MARS	2052	515	num_terms = 35 , prod_degree = 2	0.1212	0.0907	0.5013	0.3193
Nonlinear-family	Nonlinear	Neural Network	2052	515	hidden_units = 12 , penalty = 0.5 , epochs = 150	0.1226	0.0926	0.4984	1.0357
Tree-family	Nonlinear	Neural Network with Tree-Family Features	2052	515	hidden_units = 15 , penalty = 0.5 , epochs = 150	0.1432	0.0989	0.3660	1.2090

The nonlinear model comparison shows that SVM radial performed best in this model family.

SVM radial:

RMSE: 0.1112
MAE: 0.0812
\(R^2\): 0.5806

MARS:

RMSE: 0.1212
MAE: 0.0907
\(R^2\): 0.5013

Neural network with nonlinear-family dataset:

RMSE: 0.1226
MAE: 0.0926
\(R^2\): 0.4984

Neural network with tree-family dataset:

RMSE: 0.1432
MAE: 0.0989
\(R^2\): 0.3660

The SVM with radial basis kernel model had the lowest RMSE and MAE and the highest \(R^2\) among the nonlinear models. This suggests that the radial kernel was better able to capture nonlinear relationships in the beverage production data.

MARS was the second-best nonlinear model by RMSE. It performed slightly better than both neural network versions.

The regular neural network performed better than the neural network trained with tree-family dataset. The regular neural network had lower RMSE, lower MAE, and higher \(R^2\).

The neural network with tree-family features did not improve model performance. Its RMSE increased to 0.1432, and its \(R^2\) dropped to 0.3660. This means that using the tree-family dataset made the neural network weaker in this case.

Overall, SVM radial was the strongest nonlinear model. MARS and the regular neural network also performed better than the linear-family models, but they were not as strong as SVM radial. The nonlinear model results still support the idea that PH prediction depends on nonlinear patterns or interactions that regular linear models do not capture well.

Tree-Based Models

After testing linear, regularized, and nonlinear models, I now test tree-based models.

For these models, I use the tree-family feature set. This version keeps the original mnf_flow variable, because tree-based models can naturally split on different values and ranges.

I will not use math transformations, centering, scaling, or correlation filtering for tree-based models. Tree-based models do not depend on variables being on the same scale, and they can handle nonlinear relationships and interactions directly.

train_tree_model_data <- train_features_tree_family |>
  mutate(ph = train_ph)

valid_tree_model_data <- valid_features_tree_family |>
  mutate(ph = valid_ph)

set.seed(123)

tree_folds <- vfold_cv(
  train_tree_model_data,
  v = 5,
  repeats = 3
)

I will use 5-fold cross-validation repeated 3 times for each models.

tree_model_results <- tibble()

Random Forest

I start the tree-based section with Random Forest.

Random Forest builds many decision trees and averages their predictions. Each tree sees a slightly different version of the data, and each split only considers a random subset of predictors.

The main tuning parameters are:

mtry: how many predictors are randomly spread at each split
min_n: the minimum number of observations allowed in a terminal node
trees: the number of trees in the forest

A smaller min_n allows more detailed trees, while a larger min_n creates smoother predictions. A larger mtry lets each split consider more predictors, but it is possible that it creates trees more similar to each other.

rf_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 750
) |>
  set_mode("regression") |>
  set_engine(
    "ranger",
    importance = "impurity",
    num.threads = 1
  )

rf_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(rf_spec)

set.seed(123)

rf_grid_refined <- expand_grid(
  mtry = seq(8, 24, by = 2),
  min_n = seq(1, 6, by = 1)
)

# rf_grid |>
#   kable(
#     caption = "Random Forest Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

#cl <- start_parallel(cores_to_leave_free = 1)

set.seed(123)

rf_refined_start_time <- Sys.time()

rf_tune_refined <- tune_grid(
  rf_workflow,
  resamples = tree_folds,
  grid = rf_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

rf_refined_end_time <- Sys.time()

rf_refined_training_time <- as.numeric(
  difftime(
    rf_refined_end_time,
    rf_refined_start_time,
    units = "secs"
  )
)

#stop_parallel(cl)

rf_tune_refined |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Refined Random Forest Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top Refined Random Forest Tuning Results by RMSE
mtry	min_n	.metric	.estimator	mean	n	std_err	.config
24	1	rmse	standard	0.1052	15	0.0011	pre0_mod49_post0
24	3	rmse	standard	0.1053	15	0.0012	pre0_mod51_post0
22	1	rmse	standard	0.1054	15	0.0012	pre0_mod43_post0
24	2	rmse	standard	0.1054	15	0.0011	pre0_mod50_post0
22	2	rmse	standard	0.1054	15	0.0011	pre0_mod44_post0
20	3	rmse	standard	0.1054	15	0.0011	pre0_mod39_post0
22	3	rmse	standard	0.1055	15	0.0011	pre0_mod45_post0
24	4	rmse	standard	0.1055	15	0.0011	pre0_mod52_post0
20	1	rmse	standard	0.1055	15	0.0011	pre0_mod37_post0
22	4	rmse	standard	0.1055	15	0.0011	pre0_mod46_post0

best_rf_refined <- select_best(
  rf_tune_refined,
  metric = "rmse"
)

best_rf_refined |>
  kable(
    caption = "Best Refined Random Forest Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Refined Random Forest Tuning Parameters
mtry	min_n	.config
24	1	pre0_mod49_post0

The Random Forest tuning selected:

mtry: 24 , meaning that at each split, the model randomly considers 24 predictors.
min_n: 1 meaning the model preferred detailed trees.
Cross-validation RMSE: 0.1052

final_rf_workflow <- finalize_workflow(
  rf_workflow,
  best_rf_refined
)

rf_final_start_time <- Sys.time()

final_rf_fit <- fit(
  final_rf_workflow,
  data = train_tree_model_data
)

rf_final_end_time <- Sys.time()

rf_final_training_time <- as.numeric(
  difftime(
    rf_final_end_time,
    rf_final_start_time,
    units = "secs"
  )
)

rf_validation_predictions <- predict(
  final_rf_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)

rf_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Tree-Based",
  model_name = "Random Forest",
  training_rows = nrow(train_tree_model_data),
  validation_rows = nrow(valid_tree_model_data),
  tuning_parameters = paste(
    "mtry =",
    best_rf_refined$mtry,
    ", min_n =",
    best_rf_refined$min_n,
    ", trees = 1000"
  ),
  actual = valid_ph,
  predicted = rf_validation_predictions,
  training_time_seconds = rf_refined_training_time
)

Random Forest Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Tree-family	Tree-Based	Random Forest	2052	515	mtry = 24 , min_n = 1 , trees = 1000	0.0984	0.0706	0.6835	2200.34

On the validation data Random Forest model performed much better compared to previous models.

The final Random Forest model used:

mtry: 24
min_n: 1
trees: 1000

Its validation performance was:

RMSE: 0.0984
MAE: 0.0706
\(R^2\): 0.6835

This is the strongest test performance so far. Compared with the earlier nonlinear models, RF model has lower RMSE and a higher \(R^2\).

Overall, this result suggests that PH prediction depends on nonlinear relationships and interactions that RF model can capture data structure better than the linear-family and earlier nonlinear models.

rf_diagnostiv_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = rf_validation_predictions,
  model_name = "Random Forest"
)
rf_diagnostiv_plots

In the predicted vs observed plot, the points are tighter to the diagonal reference line compared to earlier models. This means the Random Forest predictions are better aligned with the actual PH values.

The residual plot shows that most residuals are centered around zero. There are still few larger residuals, especially in the middle predicted PH range, and most of residual are in the range of -0.25 to 0.25.

Overall, the diagnostic plots support the validation metrics.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = rf_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = rf_result
)

Cubist Model

Next, I test a Cubist model.

Cubist is different from Random Forest because it does not only average many trees. Instead, it creates rule-based groups and fits linear models inside those rules.

This can be useful because PH may behave differently under different production conditions.

The main tuning parameters are:

committees: how many Cubist model committees are used
neighbors: how many nearby observations are used to adjust predictions

Larger committees can make the model stronger, but at the same time model becomes more complex. The neighbors parameter can help adjust predictions based on nearby observations, which may improve accuracy.

cubist_spec <- cubist_rules(
  committees = tune(),
  neighbors = tune()
) |>
  set_mode("regression") |>
  set_engine("Cubist")

cubist_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(cubist_spec)

cubist_grid <- expand_grid(
  committees = c(15, 20, 25, 35, 40, 45, 50),
  neighbors = c(0, 3, 5, 7, 9, 11)
)

#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)

#cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)

set.seed(123)

cubist_start_time <- Sys.time()

cubist_tune <- tune_grid(
  cubist_workflow,
  resamples = tree_folds,
  grid = cubist_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(
    save_pred = TRUE,
    verbose = TRUE,
    allow_par = TRUE
  )
)

cubist_end_time <- Sys.time()

cubist_training_time <- as.numeric(
  difftime(
    cubist_end_time,
    cubist_start_time,
    units = "secs"
  )
)

#stopCluster(cl)
#registerDoSEQ()

cubist_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Cubist Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top Cubist Tuning Results by RMSE
committees	neighbors	.metric	.estimator	mean	n	std_err	.config
50	5	rmse	standard	0.1003	15	0.0009	pre0_mod39_post0
35	5	rmse	standard	0.1004	15	0.0010	pre0_mod21_post0
45	5	rmse	standard	0.1005	15	0.0009	pre0_mod33_post0
40	5	rmse	standard	0.1006	15	0.0009	pre0_mod27_post0
50	7	rmse	standard	0.1008	15	0.0011	pre0_mod40_post0
25	5	rmse	standard	0.1008	15	0.0010	pre0_mod15_post0
35	7	rmse	standard	0.1009	15	0.0011	pre0_mod22_post0
45	7	rmse	standard	0.1009	15	0.0011	pre0_mod34_post0
40	7	rmse	standard	0.1011	15	0.0011	pre0_mod28_post0
50	3	rmse	standard	0.1012	15	0.0008	pre0_mod38_post0

best_cubist <- select_best(
  cubist_tune,
  metric = "rmse"
)

best_cubist |>
  kable(
    caption = "Best Cubist Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Cubist Tuning Parameters
committees	neighbors	.config
50	5	pre0_mod39_post0

final_cubist_workflow <- finalize_workflow(
  cubist_workflow,
  best_cubist
)

cubist_final_start_time <- Sys.time()

final_cubist_fit <- fit(
  final_cubist_workflow,
  data = train_tree_model_data
)

cubist_final_end_time <- Sys.time()

cubist_final_training_time <- as.numeric(
  difftime(
    cubist_end_time,
    cubist_start_time,
    units = "secs"
  )
)

cubist_validation_predictions <- predict(
  final_cubist_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)

cubist_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Tree-Based",
  model_name = "Cubist",
  training_rows = nrow(train_tree_model_data),
  validation_rows = nrow(valid_tree_model_data),
  tuning_parameters = paste(
    "committees =",
    best_cubist$committees,
    ", neighbors =",
    best_cubist$neighbors
  ),
  actual = valid_ph,
  predicted = cubist_validation_predictions,
  training_time_seconds = cubist_final_training_time
)

Cubist Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Tree-family	Tree-Based	Cubist	2052	515	committees = 50 , neighbors = 5	0.0996	0.0682	0.665	393.0181

The Cubist model also performed very well on the validation data.

The final Cubist model used:

committees: 50
neighbors: 5

Its validation performance was:

RMSE: 0.0996
MAE: 0.0682
\(R^2\): 0.6650

The selected committees = 50 means Cubist performed better with combining many rule-based models. The selected neighbors = 5 means it also used nearby observations to adjust predictions, which probably helped improve local accuracy. Cubist is a strong tree/rule-based model for this dataset and is competitive with RF.

cubist_diagnostic_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = cubist_validation_predictions,
  model_name = "Cubist"
)
cubist_diagnostic_plots

In the predicted vs observed plot, the points follow the diagonal reference line closely. The spread around the line is present with several over and under predicted values, but it is tighter than most earlier models.
The residual plot also looks good overall. Majority of residuals are centered around zero, but there are several values in the midrange being over and under predicted.

Compared with Random Forest, Cubist has slightly higher RMSE and lower \(R^2\), but slightly better MAE. The diagnostic plot supports that idea because many predictions are close to the actual values, even though a few larger errors remain.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = cubist_result
)

XGBoost model

For the XGBoost tuning grid, here I will keep the grid smaller than the full possible tuning space.

I will use tree_depth = 2 and 3 because the dataset is not very large, and very deep trees might overfit. I will use learning rates of 0.03 and 0.05 because they allow gradual boosting without making the model too slow. I test trees = 500 and 1000 to compare a medium and larger number of boosting rounds.

I will keep sample_size = 0.80 fixed for the first pass to reduce the grid size. Also I will tune mtry, min_n, and loss_reduction because they affect how complex or conservative the boosted trees become.

xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  min_n = tune(),
  sample_size = tune(),
  mtry = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "xgboost",
    nthread = 1
  )

xgb_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(xgb_spec)

xgb_grid <- expand_grid(
  trees = c(1000),
  tree_depth = c(2, 3, 4),
  learn_rate = c(0.03, 0.05, 0.07),
  loss_reduction = c(0, 0.01),
  min_n = c(2, 5),
  sample_size = c(0.80),
  mtry = c(16, 24, 28)
)

# xgb_grid |>
#   kable(
#     caption = "XGBoost Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)

xgb_start_time <- Sys.time()
##cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)

set.seed(123)



xgb_tune <- tune_grid(
  xgb_workflow,
  resamples = tree_folds,
  grid = xgb_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(
    save_pred = TRUE,
    verbose = TRUE,
    allow_par = TRUE
  )
)


xgb_end_time <- Sys.time()

xgb_training_time <- as.numeric(
  difftime(
    xgb_end_time,
    xgb_start_time,
    units = "secs"))
  
#stopCluster(cl)
#registerDoSEQ()

xgb_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top XGBoost Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top XGBoost Tuning Results by RMSE
mtry	trees	min_n	tree_depth	learn_rate	sample_size	.metric	.estimator	mean	n	std_err	.config
24	1000	2	4	0.05	0.8	rmse	standard	0.1038	15	0.0012	pre0_mod051_post0
28	1000	2	4	0.07	0.8	rmse	standard	0.1040	15	0.0011	pre0_mod089_post0
24	1000	5	4	0.05	0.8	rmse	standard	0.1041	15	0.0011	pre0_mod069_post0
24	1000	2	4	0.07	0.8	rmse	standard	0.1042	15	0.0012	pre0_mod053_post0
28	1000	2	4	0.03	0.8	rmse	standard	0.1043	15	0.0011	pre0_mod085_post0
24	1000	5	4	0.03	0.8	rmse	standard	0.1043	15	0.0010	pre0_mod067_post0
28	1000	5	4	0.05	0.8	rmse	standard	0.1044	15	0.0011	pre0_mod105_post0
28	1000	2	4	0.05	0.8	rmse	standard	0.1044	15	0.0010	pre0_mod087_post0
24	1000	2	4	0.03	0.8	rmse	standard	0.1045	15	0.0011	pre0_mod049_post0
24	1000	5	4	0.07	0.8	rmse	standard	0.1045	15	0.0011	pre0_mod071_post0

best_xgb <- select_best(
  xgb_tune,
  metric = "rmse"
)

best_xgb |>
  kable(
    caption = "Best XGBoost Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best XGBoost Tuning Parameters
mtry	trees	min_n	tree_depth	learn_rate	loss_reduction	sample_size	.config
24	1000	2	4	0.05	0	0.8	pre0_mod051_post0

The grid tested:

trees: 1000
tree_depth: 2, 3, 4
learn_rate: 0.03, 0.05, 0.07
loss_reduction: 0, 0.01
min_n: 2, 5
sample_size: 0.80
mtry: 16, 24, 28

The best XGBoost tuning result selected:

mtry: 24
trees: 1000
min_n: 2
tree_depth: 4
learn_rate: 0.05
loss_reduction: 0
sample_size: 0.80

This means the best XGBoost model used 1000 boosted trees with moderate tree depth. A tree_depth of 4 allows the model to capture interactions between predictors, but it is not extremely deep.

The selected learn_rate = 0.05 means the model learned gradually instead of making large updates at each boosting step. This is usually safer because each tree only makes a small contribution to the final model.

The selected min_n = 2 means the model allowed fairly detailed splits. The loss_reduction = 0 value means the model did not require extra improvement before making a split, so it was allowed to split more freely when useful.

The selected sample_size = 0.80 means each boosting iteration used a random 80% of the training rows.
I used row sampling because the dataset has some imbalance in brand_code, especially Brand B, which makes up almost half of the observations. The row sampling does not fully fix brand imbalance, but it adds randomness during training and can reduce the chance that the model depends too heavily on one exact training sample.
This randomness can help reduce overfitting and make the boosted trees more stable.

Overall, the selected tuning parameters show that XGBoost benefited from a moderately flexible model: enough trees, moderate depth, gradual learning, small terminal nodes, and row sampling.

# xgb_tune |>
#   collect_metrics() |>
#   filter(.metric == "rmse") |>
#   group_by(tree_depth, learn_rate) |>
#   summarise(
#     best_rmse = min(mean),
#     .groups = "drop"
#   ) |>
#   ggplot(aes(
#     x = factor(tree_depth),
#     y = factor(learn_rate),
#     fill = best_rmse
#   )) +
#   geom_tile() +
#   geom_text(aes(label = round(best_rmse, 4))) +
#   labs(
#     title = "XGBoost Best RMSE by Tree Depth and Learning Rate",
#     x = "Tree Depth",
#     y = "Learning Rate",
#     fill = "RMSE"
#   )

final_xgb_workflow <- finalize_workflow(
  xgb_workflow,
  best_xgb
)

xgb_final_start_time <- Sys.time()

final_xgb_fit <- fit(
  final_xgb_workflow,
  data = train_tree_model_data
)

xgb_final_end_time <- Sys.time()

xgb_final_training_time <- as.numeric(
  difftime(
    xgb_final_end_time,
    xgb_final_start_time,
    units = "secs"
  )
)

xgb_validation_predictions <- predict(
  final_xgb_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)

xgb_result_short <- xgb_result |>
  mutate(
    training_time_minutes = training_time_seconds / 60
  ) |>
  select(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    rmse,
    mae,
    rsq,
    training_time_minutes
  )

xgb_result_short |>
  kable(
    caption = "XGBoost Validation Performance",
    digits = 4,
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed", "responsive")
  )

XGBoost Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	rmse	mae	rsq	training_time_minutes
Tree-family	Tree-Based	XGBoost	2052	515	0.0964	0.0705	0.6852	10.8502

xgb_selected_parameters <- xgb_result |>
  select(
    model_name,
    tuning_parameters
  )

xgb_selected_parameters |>
  kable(
    caption = "Selected XGBoost Tuning Parameters",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed", "responsive")
  )

Selected XGBoost Tuning Parameters
model_name	tuning_parameters
XGBoost	trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24

The final XGBoost model used:

trees = 1000
tree_depth = 4
learn_rate = 0.05
loss_reduction = 0
min_n = 2
sample_size = 0.80
mtry = 24

Its validation performance is:

RMSE: 0.0964
MAE: 0.0705
\(R^2\): 0.6852

This is the best individual model result so far. XGBoost has the lowest validation RMSE and \(R^2\) among the individual models

The model used enough trees to learn the pattern, a moderate tree depth to capture nonlinear relationships, and a learning rate of 0.05 so the model could improve gradually. The sample_size = 0.80 added some randomness during training, which can help reduce overfitting.

xgb_diagn_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = xgb_validation_predictions,
  model_name = "XGBoost"
)
xgb_diagn_plots

The XGBoost diagnostic plots show that the model performed well overall on the validation data.

In the Predicted vs Observed plot, most points follow the diagonal line, which means the predicted PH values were generally close to the observed PH values. There is still some spread, but the overall pattern is strong and shows that the model captured the main PH trend.

The Residuals vs Predicted plot also looks reasonably balanced around zero. This means the model did not strongly overpredict or underpredict across the whole predicted PH range.

There are still a few larger residuals, especially some points below -0.25 and above 0.25.

Overall, these plots support using XGBoost as the final individual model. The model is not perfect, but it shows a good balance between prediction accuracy and residual behavior.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = xgb_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = xgb_result
)

Final Model Comparison Summary

final_model_comparison <- all_model_results |>
  mutate(
    training_time_minutes = training_time_seconds / 60
  ) |>
  select(
    `Dataset type` = workflow,
    `Model Family` = model_family,
    `Model Name` = model_name,
    `Tuning Parameters` = tuning_parameters,
    RMSE = rmse,
    MAE = mae,
    `R-squared` = rsq
  ) |>
  arrange(RMSE)

final_model_comparison |>
  kable(
    caption = "Final Model Comparison on Validation Data",
    digits = 4,
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  ) |>
  column_spec(
    column = 4,
    width = "16em",
    extra_css = "white-space: normal; word-break: break-word;"
  )

Final Model Comparison on Validation Data
Dataset type	Model Family	Model Name	Tuning Parameters	RMSE	MAE	R-squared
Tree-family	Tree-Based	XGBoost	trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24	0.0964	0.0705	0.6852
Tree-family	Tree-Based	Random Forest	mtry = 24 , min_n = 1 , trees = 1000	0.0984	0.0706	0.6835
Tree-family	Tree-Based	Cubist	committees = 50 , neighbors = 5	0.0996	0.0682	0.6650
Nonlinear-family	Nonlinear	SVM Radial	cost = 3 , / rbf_sigma = 0.05	0.1112	0.0812	0.5806
Nonlinear-family	Nonlinear	MARS	num_terms = 35 , prod_degree = 2	0.1212	0.0907	0.5013
Nonlinear-family	Nonlinear	Neural Network	hidden_units = 12 , penalty = 0.5 , epochs = 150	0.1226	0.0926	0.4984
Linear-family / PLS keeps correlated predictors	Linear / Regularized	PLS Regression	num_comp = 14	0.1286	0.0987	0.4403
Linear-family	Linear / Regularized	Linear Regression	None	0.1305	0.1007	0.4244
Linear-family	Linear / Regularized	Elastic Net	penalty = 0.00848342898244073 , mixture = 0	0.1305	0.1007	0.4250
Tree-family	Nonlinear	Neural Network with Tree-Family Features	hidden_units = 15 , penalty = 0.5 , epochs = 150	0.1432	0.0989	0.3660

The final validation comparison shows a clear difference between the model families.

Linear and Regularized Models

The linear and regularized models were the weakest group overall.

Linear regression and elastic net produced almost identical results:
- RMSE: about 0.1305
- MAE: about 0.1007
- \(R^2\): about 0.425
PLS regression performed slightly better:
- RMSE: 0.1286
- MAE: 0.0987
- \(R^2\): 0.4403

For PLS, I kept the correlated predictors because PLS is designed to work with correlated variables by creating components from them. This helped PLS perform slightly better than regular linear regression and elastic net.

However, even with this adjustment, the linear-family models still left a lot of PH variation unexplained.

Nonlinear Models

The nonlinear models improved over the linear-family models.

SVM with radial basis kernel was the best model in this group:
- RMSE: 0.1112
- MAE: 0.0812
- \(R^2\): 0.5806
MARS and neural networks also improved over the linear models, but they were weaker than SVM radial.

This tells us that nonlinear relationships are playing larger role in the PH explanation. However, not all nonlinear models were able to capture those relationships equally well.

Tree-Based Models

The tree-based models performed best overall.

XGBoost had the lowest RMSE:
- RMSE: 0.0964
- MAE: 0.0705
- \(R^2\): 0.6852
Random Forest was very close:
- RMSE: 0.0984
- MAE: 0.0706
- \(R^2\): 0.6835
Cubist also performed strongly:
- RMSE: 0.0996
- MAE: 0.0682
- \(R^2\): 0.6650

Cubist had the lowest MAE among the individual tree-based models, which means it had slightly smaller average absolute errors. XGBoost and Random Forest performed better by RMSE and \(R^2\).

cubist_diagnostic_plots/rf_diagnostiv_plots/xgb_diagn_plots

Since tree-based models produced fairly close results, it is worth to notice that these diagnostic plots represent a better understanding of which model is preformed better overall.

Cubist plots:
Predicted vs Observed plot show that Cubist model had slightly uneven balance in a spread. There are few highly over and underpredicted values.
From Residuals vs Predicted plot it is noticeable that that Cubist model had some larger over and under estimated predictions
Cubist had the best MAE among the tree-based models, which means it made smaller average absolute errors. However, the diagnostic plots show that it still had a few larger misses, there are few residuals in 0.50 and -0.50 ranges.
Random Forest:
There is a visible pattern in the residuals. For predicted PH values below about 8.65, the model appears to have more negative residuals, meaning it tends to overpredict in that range. For predicted PH values above about 8.65, there is more tendency toward positive residuals, meaning it tends to underpredict some higher PH values.
The residuals vs predicted plot shows this pattern more clearly. This suggests that Random Forest performs well overall, but it may still pull some predictions toward the middle range
XGBoost:
The XGBoost diagnostic plots look the most balanced overall.
In the Predicted vs Observed plot, the points are tightly grouped around the diagonal line, and the spread looks more even compared with Cubist and Random Forest.
The residuals vs predicted plot also looks more balanced. The residuals are mostly centered around zero with kinda oval shape, and the spread appears more even across the predicted PH range. Compared with Cubist and Random Forest, XGBoost shows less spread of the residuals.

Because of this, XGBoost looks like the strongest individual tree-based model from the diagnostic plots. It also had the best RMSE in the final comparison table, which supports this choice.

Overall, based on both the validation metrics and the diagnostic plots, I would select XGBoost as the best individual model.

Tuning Process

I did not rely on only one tuning attempt for the stronger models. For several models, I used a staged tuning process:

First, I used a wider tuning grid to find the general region where the model performed best (for some models it took a longer computing time).
Then, I refined the grid around the best-performing parameters area.
This was done for models such as SVM radial, Random Forest, neural networks, and XGBoost.

For example:

Random Forest was first tuned across a wider range of mtry and min_n values.
After the first tuning results showed me that stronger models used larger mtry values and smaller min_n values, so after that I refined the search around that region.
XGBoost was also tuned using same logic because it has many parameters and a full tuning grid can become very expensive to keep and run here every time.

Some intermediate tuning attempts are not included in the final RMD file and knitted reports.

For the final report, I kept only the most important tuning results and the final validation comparison. Otherwise including every tuning attempt would make the report too long and would also increase computing and knitting time.

I also considered saving model metrics after each tuning attempt, but I decided not to do that in the final workflow. Since the modeling process already included many models and tuning steps, saving too many intermediate results could make the project harder to follow and easier to mix up. Instead, I focused on keeping the final model comparison clean and reproducible.

Final Takeaway

Based on the validation results:

XGBoost is the best individual model by RMSE.
Random Forest is very close and is a strong stable benchmark.
Cubist is also competitive and has the best MAE among the individual tree-based models.

So, the tree-based models are the strongest individual model family for this PH prediction task.

Important Predictors from XGBoost and Random Forest

After selecting the best individual models, I also review which predictors were most important.

This part is important because the goal is not only to predict PH, but also to understand which manufacturing and process variables helped the models make predictions.

I use XGBoost variable importance because XGBoost was the strongest individual model by RMSE. I also check Random Forest variable importance because Random Forest had very close validation performance and gives another tree-based view of important predictors.

These importance values should be read as predictive importance, not direct causation. In other words, a variable can help predict PH, but that does not automatically mean it directly causes PH to change.

XGBoost important predictors

final_xgb_engine <- final_xgb_fit |>
  extract_fit_parsnip() |>
  pluck("fit")

xgb_importance <- xgb.importance(
  model = final_xgb_engine
)

xgb_importance_table <- xgb_importance |>
  as_tibble() |>
  slice(1:20)

xgb_importance_table |>
  kable(
    caption = "Top 20 XGBoost Important Predictors",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top 20 XGBoost Important Predictors
Feature	Gain	Cover	Frequency
mnf_flow	0.1613	0.0345	0.0280
usage_cont	0.0680	0.0446	0.0451
oxygen_filler	0.0510	0.0469	0.0489
brand_codeC	0.0508	0.0093	0.0069
temperature	0.0453	0.0281	0.0323
bowl_setpoint	0.0436	0.0074	0.0075
balling_lvl	0.0430	0.0323	0.0344
pressure_vacuum	0.0418	0.0425	0.0374
alch_rel	0.0387	0.0357	0.0292
density	0.0358	0.0324	0.0265
air_pressurer	0.0350	0.0257	0.0280
carb_pressure1	0.0348	0.0476	0.0424
filler_speed	0.0343	0.0510	0.0432
carb_rel	0.0264	0.0216	0.0264
carb_flow	0.0236	0.0500	0.0431
carb_volume	0.0204	0.0344	0.0384
hyd_pressure3	0.0202	0.0207	0.0220
balling	0.0190	0.0232	0.0231
pc_volume	0.0180	0.0360	0.0431
filler_level	0.0172	0.0307	0.0323

xgb_importance_table |>
  ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 XGBoost Important Predictors",
    x = "Predictor",
    y = "Gain"
  )

Random Forest important predictors

final_rf_engine <- final_rf_fit |>
  extract_fit_parsnip() |>
  pluck("fit")

rf_importance_table <- tibble(
  variable = names(final_rf_engine$variable.importance),
  importance = as.numeric(final_rf_engine$variable.importance)
) |>
  arrange(desc(importance)) |>
  slice(1:20)

rf_importance_table |>
  kable(
    caption = "Top 20 Random Forest Important Predictors",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top 20 Random Forest Important Predictors
variable	importance
mnf_flow	11.9829
usage_cont	3.4860
temperature	3.0087
oxygen_filler	2.9751
carb_rel	2.9246
alch_rel	2.8036
balling_lvl	2.6853
pressure_vacuum	2.3379
air_pressurer	2.0996
carb_pressure1	1.9842
density	1.9279
brand_code	1.8040
bowl_setpoint	1.6859
carb_flow	1.6133
filler_speed	1.5604
filler_level	1.4296
balling	1.3673
hyd_pressure3	1.3143
pc_volume	1.1827
carb_volume	1.1820

rf_importance_table |>
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Random Forest Important Predictors",
    x = "Predictor",
    y = "Importance"
  )

xgb_importance_ranked <- xgb_importance |>
  as_tibble() |>
  transmute(
    variable = Feature,
    xgb_gain = Gain
  ) |>
  arrange(desc(xgb_gain)) |>
  mutate(
    xgb_rank = row_number()
  )

rf_importance_ranked <- tibble(
  variable = names(final_rf_engine$variable.importance),
  rf_importance = as.numeric(final_rf_engine$variable.importance)
) |>
  arrange(desc(rf_importance)) |>
  mutate(
    rf_rank = row_number()
  )

importance_rank_comparison <- full_join(
  xgb_importance_ranked,
  rf_importance_ranked,
  by = "variable"
) |>
  mutate(
    appears_in_both = ifelse(
      !is.na(xgb_rank) & !is.na(rf_rank),
      "Yes",
      "No"
    )
  ) |>
  rowwise() |>
  mutate(
    average_rank = mean(
      c(xgb_rank, rf_rank),
      na.rm = TRUE
    )
  ) |>
  ungroup() |>
  arrange(average_rank)

XGBoost and RF models comparison

importance_rank_display <- importance_rank_comparison |>
  select(
    variable,
    xgb_rank,
    rf_rank,
    average_rank,
    appears_in_both
  ) |>
  slice(1:10) |>
  mutate(
    xgb_rank = ifelse(is.na(xgb_rank), "Not ranked", as.character(xgb_rank)),
    rf_rank = ifelse(is.na(rf_rank), "Not ranked", as.character(rf_rank))
  )

importance_rank_display |>
  kable(
    caption = "Rank-Based Comparison of Top-10 Important Predictors",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Rank-Based Comparison of Top-10 Important Predictors
variable	xgb_rank	rf_rank	average_rank	appears_in_both
mnf_flow	1	1	1.0	Yes
usage_cont	2	2	2.0	Yes
oxygen_filler	3	4	3.5	Yes
brand_codeC	4	Not ranked	4.0	No
temperature	5	3	4.0	Yes
balling_lvl	7	7	7.0	Yes
alch_rel	9	6	7.5	Yes
pressure_vacuum	8	8	8.0	Yes
bowl_setpoint	6	13	9.5	Yes
carb_rel	14	5	9.5	Yes

XGBoost and Random Forest both showed that some predictors were much more important than others, but the importance values are not on the same scale.

For XGBoost, mnf_flow had the highest Gain value at 0.1613. This was much larger than next predictor, usage_cont, which had Gain = 0.0680. This tells us that mnf_flow was the strongest predictor in the XGBoost model.

Random Forest showed the same pattern. mnf_flow had the highest importance value at 11.9829, while the next predictor, usage_cont, had importance of 3.4860. So in both models, mnf_flow was clearly the most important predictor for PH.

Both models also found similar process-related variables important, including:

usage_cont
oxygen_filler
temperature
balling_lvl
pressure_vacuum
alch_rel
carb_rel
carb_pressure1
filler_speed
carb_flow

This gives more confidence that these variables are useful predictive factors because they appear in both strong models.

There is one difference is how brand information appears. XGBoost shows brand_codeC as an important predictor with Gain = 0.0508, while Random Forest shows the full brand_code variable with importance = 1.8040. This difference is probably because XGBoost used dummy variables for brand_code and Random Forest handled the brand variable as is and didn’t split it into dummy variables.

Overall, the importance results suggest that PH prediction is mostly related to mnf_flow (I aasume it is manufacturing flow), usage/process conditions, oxygen and filling behavior, temperature, pressure-related variables, carbonation/alcohol-related variables, and brand differences.

I think it is important to note that these are predictive factors, not direct causal conclusions. They tell us which variables helped the models predict PH, but not necessarily which variables directly cause PH to change.

What happens if we change variable’s values?

After selecting XGBoost as the final individual model, I wanted to see how predicted PH changes when models important predictors are changed.

I wanted to see more precise changes for each brand’s average predicted PH, so I calculated the average prediction changes separately by brand_code.

I think this is more useful than using only one global average changes, because I think each brand may have different process behavior. In other words, the same variable change may affect the model predictions differently for each brand type.

For each brand, I changed one XGBoost’s important predictor by 10% while keeping the other predictors the same. Then I compared the new predicted PH with the original predicted PH.

This does not prove that changing one variable directly causes PH to change. It only shows how the trained XGBoost model responds to these changes.

# Baseline row for scenario testing
# I use the validation row closest to the median predicted PH

valid_tree_predictions_for_scenario <- predict(
  final_xgb_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)

median_predicted_ph <- median(valid_tree_predictions_for_scenario)

baseline_row_index <- which.min(
  abs(valid_tree_predictions_for_scenario - median_predicted_ph)
)

baseline_row <- valid_tree_model_data |>
  slice(baseline_row_index) |>
  select(-ph)

baseline_prediction <- predict(
  final_xgb_fit,
  new_data = baseline_row
) |>
  pull(.pred)

baseline_prediction

[1] 8.533664

make_brand_numeric_scenario <- function(variable_name, change_percent = 0.10) {
  
  # Validation predictors only
  valid_x <- valid_tree_model_data |>
    select(-ph)
  
  # Original predictions
  original_predictions <- predict(
    final_xgb_fit,
    new_data = valid_x
  ) |>
    pull(.pred)
  
  # Decreased scenario
  decreased_x <- valid_x |>
    mutate(
      !!variable_name := .data[[variable_name]] * (1 - change_percent)
    )
  
  decreased_predictions <- predict(
    final_xgb_fit,
    new_data = decreased_x
  ) |>
    pull(.pred)
  
  # Increased scenario
  increased_x <- valid_x |>
    mutate(
      !!variable_name := .data[[variable_name]] * (1 + change_percent)
    )
  
  increased_predictions <- predict(
    final_xgb_fit,
    new_data = increased_x
  ) |>
    pull(.pred)
  
  # Combine and summarize by brand
  bind_rows(
    tibble(
      brand_code = valid_x$brand_code,
      variable_changed = variable_name,
      scenario = paste0(variable_name, " decreased by 10%"),
      original_predicted_ph = original_predictions,
      scenario_predicted_ph = decreased_predictions,
      change_from_original = scenario_predicted_ph - original_predicted_ph
    ),
    tibble(
      brand_code = valid_x$brand_code,
      variable_changed = variable_name,
      scenario = paste0(variable_name, " increased by 10%"),
      original_predicted_ph = original_predictions,
      scenario_predicted_ph = increased_predictions,
      change_from_original = scenario_predicted_ph - original_predicted_ph
    )
  ) |>
    group_by(
      brand_code,
      variable_changed,
      scenario
    ) |>
    summarise(
      validation_rows = n(),
      average_original_ph = mean(original_predicted_ph, na.rm = TRUE),
      average_scenario_ph = mean(scenario_predicted_ph, na.rm = TRUE),
      average_change_from_original = mean(change_from_original, na.rm = TRUE),
      min_change = min(change_from_original, na.rm = TRUE),
      max_change = max(change_from_original, na.rm = TRUE),
      .groups = "drop"
    ) |>
    arrange(
      variable_changed,
      brand_code,
      scenario
    )
}

scenario_variables <- c(
  "mnf_flow",
  "usage_cont",
  "oxygen_filler",
  "temperature",
  "pressure_vacuum",
  "carb_rel"
)

brand_scenario_results <- map_dfr(
  scenario_variables,
  make_brand_numeric_scenario
)

format_change <- function(x) {
  ifelse(
    x > 0,
    paste0("+", sprintf("%.4f", x)),
    sprintf("%.4f", x)
  )
}

scenario_decrease_table <- brand_scenario_results |>
  filter(str_detect(scenario, "decreased")) |>
  select(
    brand_code,
    variable_changed,
    average_change_from_original
  ) |>
  mutate(
    average_change_from_original = round(average_change_from_original, 4)
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = average_change_from_original
  ) |>
  arrange(variable_changed)

scenario_increase_table <- brand_scenario_results |>
  filter(str_detect(scenario, "increased")) |>
  select(
    brand_code,
    variable_changed,
    average_change_from_original
  ) |>
  mutate(
    average_change_from_original = round(average_change_from_original, 4)
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = average_change_from_original
  ) |>
  arrange(variable_changed)

scenario_decrease_display <- scenario_decrease_table |>
  mutate(
    across(
      c(A, B, C, D),
      format_change
    )
  )

scenario_decrease_display |>
  kable(
    caption = "Average Predicted PH Change When Variable Decreased by 10%",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
   # bootstrap_options = c("striped", "hover", "condensed")
  )

Average Predicted PH Change When Variable Decreased by 10%
variable_changed	A	B	C	D
carb_rel	-0.0475	-0.0218	+0.0038	-0.0585
mnf_flow	-0.0047	-0.0047	+0.0107	-0.0038
oxygen_filler	+0.0021	+0.0020	+0.0002	+0.0013
pressure_vacuum	+0.0014	-0.0019	+0.0032	0.0000
temperature	+0.0053	+0.0139	+0.0104	+0.0124
usage_cont	+0.0248	+0.0254	+0.0175	+0.0240

The table above shows the average change in predicted PH when each selected predictor was decreased by 10%, separated by brand code.

Positive values mean the XGBoost model predicted higher averafe PH after the variable was decreased by 10%. Negative values mean the model predicted lower average PH after the variable was decreased.

There are few patterns are noticeable:

carb_rel had one of the strongest effects. When carb_rel was decreased by 10%, predicted PH decreased for brands A, B, and D. The strongest average decrease was for brand D (-0.0585), followed by brand A (-0.0475). Brand C behaved differently and had a very small increase (+0.0038).
usage_cont showed a consistent pattern across all brands. When usage_cont was decreased by 10%, predicted PH increased for all brands. The average increase was between +0.0175 and +0.0254.
temperature also showed a consistent pattern. When temperature was decreased by 10%, predicted PH increased for all brands. The increase is not very large, but it is consistent.
oxygen_filler and pressure_vacuum had very small mostly positive average effects in this case.
mnf_flow had a small average effect, but it did not behave exactly the same for all brands. For brands A, B, and D, decreasing mnf_flow slightly lowered predicted PH. For Brand C, it slightly increased predicted PH.

Overall as I expected, this table shows that XGBoost is not treating every brand the same way . The same 10% change can have different effects depending on the brand. This supports the idea that PH prediction depends on the combination of process variables and brand-level behavior, not only one variable by itself.

And again, these results should still be interpreted as model-based prediction scenarios, not direct cause-and-effect results.

scenario_increase_display <- scenario_increase_table |>
  mutate(
    across(
      c(A, B, C, D),
      format_change
    )
  )

scenario_increase_display |>
  kable(
    caption = "Average Predicted PH Change When Variable Increased by 10%",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
   # bootstrap_options = c("striped", "hover")
  )

Average Predicted PH Change When Variable Increased by 10%
variable_changed	A	B	C	D
carb_rel	-0.0062	-0.0274	-0.0251	-0.0097
mnf_flow	+0.0106	+0.0094	+0.0042	+0.0140
oxygen_filler	-0.0027	-0.0017	-0.0002	-0.0033
pressure_vacuum	+0.0017	+0.0030	+0.0111	+0.0021
temperature	-0.0349	-0.0566	-0.0201	-0.0451
usage_cont	-0.0010	-0.0061	-0.0039	+0.0022

And this table above shows the average change in predicted PH when each selected predictor was increased by 10%, separated by brand codes.

There are also patterns are noticeable:

temperature had one of the strongest effects. When temperature was increased by 10%, as we see predicted PH decreased for all brands. The largest average decrease was for brand B (-0.0566), followed by D (-0.0451), brand A (-0.0349), and brand C (-0.0201). This tells us that the model associates higher temperature values with lower predicted PH.
carb_rel also showed mostly negative changes. When carb_rel was increased by 10%, predicted PH decreased for all brands. The decrease was strongest for brand B (-0.0274) and brand C (-0.0251).
mnf_flow showed a consistent positive pattern. When mnf_flow was increased by 10%, predicted PH increased for all brands. The effect was not very large, but it was consistent across brands. Brand D had the largest average increase (+0.0140).
pressure_vacuum had a small positive effect for all brands. The largest average change was for Brand C (+0.0111), while the other brands had smaller changes.
oxygen_filler had a very small negative effect across all brands. The changes were close to zero, so this scenario did not strongly move predicted PH on average.
usage_cont had small average effects. For Brands A, B, and C, increasing usage_cont slightly decreased predicted PH. For Brand D, it slightly increased predicted PH, but the change was very small (+0.0022).

As we see from both tables, temperature, carb_rel, usage_cont, and mnf_flow showed the clearest average changes in predicted PH.

The temperature pattern was especially clear. Decreasing temperature increased predicted PH for all brands, while increasing temperature decreased predicted PH for all brands.

The mnf_flow pattern was also mostly consistent. Increasing mnf_flow increased predicted PH for all brands, although the average changes were not very large.

The carb_rel variable had stronger negative effects for some brands, especially Brand A and Brand D when decreased, and Brand B and Brand C when increased.

I think that the important point here is that these are not simple one-variable rules. The model response can differ by brand, and real process changes may also affect other variables at the same time. Because of that, these scenario results should be used as a guide for decision-making discussion with process and manufacturing involved experts, not as direct production instructions.

This interaction review changes two important predictors at the same time: mnf_flow and oxygen_filler.

If the predicted PH changes in a simple smooth pattern, then the model response is easier to explain. If the prediction changes differently across combinations of the two variables, that suggests the XGBoost model may be using an interaction between those predictors.

This does not prove a physical interaction in the manufacturing process. It only shows that the model prediction depends on the combination of these input values.

Final Evaluation PH Predictions

The final evaluation predictor set was already prepared in the preprocessing part.

Since I already cleaned, capped, and imputed predictors for the final evaluation dataset eval_features_tree_family, I will not repeat the full preprocessing workflow here.

The only remaining step is model-specific prediction. Since the final XGBoost model was trained as a workflow, the model object handles the same model preparation that was used during training. I only need to confirm that the evaluation predictors have the same columns as the tree-family training predictors, then generate PH predictions.

missing_from_eval <- setdiff(
  names(train_tree_model_data |> select(-ph)),
  names(eval_features_tree_family)
)

extra_in_eval <- setdiff(
  names(eval_features_tree_family),
  names(train_tree_model_data |> select(-ph))
)

missing_from_eval

character(0)

extra_in_eval

character(0)

stopifnot(length(missing_from_eval) == 0)
stopifnot(length(extra_in_eval) == 0)

It confirmed that there are same columns and no missing values.

Now we can do predictions

eval_xgb_predictions <- predict(
  final_xgb_fit,
  new_data = eval_features_tree_family
) |>
  rename(predicted_ph = .pred)

eval_predictions_xgboost <- eval_features_tree_family |>
  mutate(
    row_id = row_number(),
    predicted_ph = eval_xgb_predictions$predicted_ph
  ) |>
  select(
    row_id,
    predicted_ph
  )

eval_predictions_xgboost |>
  head(10) |>
  kable(
    caption = "First 10 Final Evaluation PH Predictions",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

First 10 Final Evaluation PH Predictions
row_id	predicted_ph
1	8.5002
2	8.4530
3	8.5082
4	8.6799
5	8.3887
6	8.4879
7	8.5332
8	8.5005
9	8.5930
10	8.7368

write_csv(
  eval_predictions_xgboost,
  "final_evaluation_ph_predictions_xgboost.csv"
)

write_xlsx(
  eval_predictions_xgboost,
  "final_evaluation_ph_predictions_xgboost.xlsx"
)

Final Conclusion

I selected XGBoost as the final individual model for PH prediction.

It had the best validation RMSE among the individual models and showed strong diagnostic performance compared with the other model families. Random Forest and Cubist models had close accuracy metrics, but XGBoost had the best overall validation result.

The most important predictors suggest that PH is connected to manufacturing flow, oxygen/filling behavior, temperature, pressure, carbonation/alcohol-related measurements, and brand-level differences.

The final evaluation data do not contain observed PH values, so I couldn’t calculate model accuracy could on that dataset. I used trained XGBoost model to generate final PH predictions.

The variable importance results should be interpreted as predictive importance, not direct causation.

Optional Bonus Experiment: Ensemble Model Predictions

Note: This section is my additional experiment. It doesn’t have a purpose for selecting the final submitted model.

After comparing the individual models, I also wanted to experiment to see if combining model predictions could improve validation performance.

The idea behind an ensemble is that different models may capture different parts of the PH prediction problem. For example, Random Forest, Cubist, and XGBoost are all tree-based models, but they learn patterns in different ways:

Random Forest averages many independent trees.
Cubist creates rule-based groups and local linear models.
XGBoost builds trees sequentially, where each tree tries to correct previous errors.

Because these models are different, they may not make the same mistakes on the same observations. If one model overpredicts PH for some rows and another model underpredicts those same rows, averaging their predictions might reduce overall error.

For this experiment, I first tested simple average ensembles. Then I tested a weighted ensemble, where stronger models were allowed to receive more weight.

This ensemble section is treated as an experiment because the weights were selected using validation performance. Because of that, I avoid repeatedly changing the weights too many times, so the validation set does not become overused.

For the final modeling approach, I focus on individual models only. This keeps the final model easier to explain, reproduce, and report. Based on the validation results, XGBoost is selected as the best individual model because it had the lowest RMSE and highest \(R^2\) among the individual models tested.

Simple Average Ensemble: Random Forest and Cubist

ensemble_predictions_rf_cubist <- (
  rf_validation_predictions +
  cubist_validation_predictions
) / 2

ensemble_rf_cubist_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Average Ensemble: RF + Cubist",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = "Average of Random Forest and Cubist",
  actual = valid_ph,
  predicted = ensemble_predictions_rf_cubist,
  training_time_seconds = 0
)

Average Ensemble: RF + Cubist Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Ensemble	Model Combination	Average Ensemble: RF + Cubist	2052	515	Average of Random Forest and Cubist	0.0954	0.0666	0.6937	0

The first ensemble experiment used a simple average of the Random Forest and Cubist predictions.

This ensemble had:

RMSE: 0.0954
MAE: 0.0666
\(R^2\): 0.6937

This result is better than both individual models by RMSE and \(R^2\).

Random Forest had:

RMSE: 0.0984
MAE: 0.0706
\(R^2\): 0.6835

Cubist had:

RMSE: 0.0996
MAE: 0.0682
\(R^2\): 0.6650

The ensemble improved because Random Forest and Cubist do not make exactly the same errors. Random Forest had stronger RMSE and \(R^2\), while Cubist had better MAE. By averaging their predictions, the ensemble combined useful parts of both models.

This is also a good ensemble to report because it is simple and easy to explain. It does not require training another model. It only averages the predictions from two already strong models.

ensemble_model_results <- tibble()

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_result
)

Simple Average Ensemble: Random Forest, Cubist, and XGBoost

ensemble_predictions_rf_cubist_xgb <- (
  rf_validation_predictions +
  cubist_validation_predictions +
  xgb_validation_predictions
) / 3

ensemble_rf_cubist_xgb_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Average Ensemble: RF + Cubist + XGBoost",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = "Average of Random Forest, Cubist, and XGBoost",
  actual = valid_ph,
  predicted = ensemble_predictions_rf_cubist_xgb,
  training_time_seconds = 0
)

Average Ensemble: RF + Cubist + XGBoost Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Ensemble	Model Combination	Average Ensemble: RF + Cubist + XGBoost	2052	515	Average of Random Forest, Cubist, and XGBoost	0.0941	0.0667	0.7027	0

The second ensemble experiment averaged the predictions from Random Forest, Cubist, and XGBoost.

This ensemble had:

RMSE: 0.0941
MAE: 0.0667
\(R^2\): 0.7027

This improved over the RF + Cubist average ensemble, which had:

RMSE: 0.0954
MAE: 0.0666
\(R^2\): 0.6937

The improvement in RMSE and \(R^2\) suggests that XGBoost added useful information to the ensemble

Validation-Weighted Ensemble: Random Forest, Cubist, and XGBoost

ensemble_weight_grid <- expand_grid(
  rf_weight = seq(0, 1, by = 0.05),
  cubist_weight = seq(0, 1, by = 0.05),
  xgb_weight = seq(0, 1, by = 0.05)
) |>
  filter(abs(rf_weight + cubist_weight + xgb_weight - 1) < 0.000001)

ensemble_weight_results <- ensemble_weight_grid |>
  mutate(
    predictions = pmap(
      list(rf_weight, cubist_weight, xgb_weight),
      function(rf_weight, cubist_weight, xgb_weight) {
        rf_weight * rf_validation_predictions +
          cubist_weight * cubist_validation_predictions +
          xgb_weight * xgb_validation_predictions
      }
    ),
    rmse = map_dbl(
      predictions,
      ~ yardstick::rmse_vec(
        truth = valid_ph,
        estimate = .x
      )
    ),
    mae = map_dbl(
      predictions,
      ~ yardstick::mae_vec(
        truth = valid_ph,
        estimate = .x
      )
    ),
    rsq = map_dbl(
      predictions,
      ~ yardstick::rsq_vec(
        truth = valid_ph,
        estimate = .x
      )
    )
  ) |>
  select(-predictions) |>
  arrange(rmse)

best_weighted_ensemble <- ensemble_weight_results |>
  slice(1)

best_weighted_ensemble |>
  kable(
    caption = "Best Weighted Ensemble Weights",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Best Weighted Ensemble Weights
rf_weight	cubist_weight	xgb_weight	rmse	mae	rsq
0.25	0.3	0.45	0.094	0.067	0.7026

weighted_ensemble_predictions <- 
  best_weighted_ensemble$rf_weight * rf_validation_predictions +
  best_weighted_ensemble$cubist_weight * cubist_validation_predictions +
  best_weighted_ensemble$xgb_weight * xgb_validation_predictions

weighted_ensemble_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Weighted Ensemble: RF + Cubist + XGBoost",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = paste(
    "RF =",
    best_weighted_ensemble$rf_weight,
    ", Cubist =",
    best_weighted_ensemble$cubist_weight,
    ", XGBoost =",
    best_weighted_ensemble$xgb_weight
  ),
  actual = valid_ph,
  predicted = weighted_ensemble_predictions,
  training_time_seconds = 0
)

Weighted Ensemble: RF + Cubist + XGBoost Validation Performance
workflow	model_family	model_name	training_rows	validation_rows	tuning_parameters	rmse	mae	rsq	training_time_seconds
Ensemble	Model Combination	Weighted Ensemble: RF + Cubist + XGBoost	2052	515	RF = 0.25 , Cubist = 0.3 , XGBoost = 0.45	0.094	0.067	0.7026	0

The weighted ensemble also performed very well.

The selected weights were:

Random Forest: 0.25
Cubist: 0.30
XGBoost: 0.45

The validation performance was:

RMSE: 0.0940
MAE: 0.0670
\(R^2\): 0.7026

This is very close to the simple average ensemble of Random Forest, Cubist, and XGBoost. The weighted ensemble had slightly lower RMSE (by 0.001), while the simple average ensemble had slightly better MAE and almost the same \(R^2\).

Note: The ensemble models were tested only as an experiment to see if combining predictions would improve accuracy. Even ensembling results are better then individual XGBoost’s result, for this project I decided do not use them as official final model to make predictions.

ensemble_model_results <- tibble()

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_result
)

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_xgb_result
)
all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_xgb_result
)

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = weighted_ensemble_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = weighted_ensemble_result
)

#ensemble_model_results

# if (exists("cl")) {
#   stopCluster(cl)
#   registerDoSEQ()
#   message("Parallel backend stopped.")
# }

DATA624 Project 2

Farhod Ibragimov

2026-04-17

Project Objective

Modeling Plan

Load Data

Categorical Variable Review

Train and Validation Split

Data Preprocessing Review

Missing Value Review

Raw Numeric Predictor Review

Skewness Check

Near-Zero Variance Check

Outlier and Unusual Value Review

Review of Negative mnfflow Values

Outliers Review by Brand Code

Outliers Compared With PH

Brand-Level Outliers Compared With PH

Brand-Level Outlier Capping

First Capping Attempt

Outlier Treatment Decision

Missing Values Imputation Strategy

Imputation Setup

Hybrid Imputation Approach

Tree-Based Numeric Imputation

PCA Review for Missing Brand Code

KNN Brand Code Imputation

Random Forest Numeric Imputation + KNN Brand Code Imputation

XGBoost Numeric Imputation + KNN Brand Code Imputation

Imputation of all NA’s with missForest package

Complete-Case Benchmark Dataset

Imputation Workflow Comparison Using Random Forest

Final Completed Feature Sets for Modeling

Final Preprocessing Decisions by Model Family

mnfflow Variable Review

Testing mnfflow Representation for Linear Models

Final Model-Family Preprocessing Plan

Linear and Regularized Models

Nonlinear Non-Tree Models

Tree-Based Models

Data Leakage Control

Load Processed Modeling Data

Helper Functions

Function 1: Regression metrics

Function 2: Make a validation result row

Function 3: Prepare linear model matrix

Function 4: detect skewed continuous variables

Function 5: Apply high-correlation filtering

Function 6: NZV filtering

Function 7: Plots of predicted vs observed values

Function 8: Model validation results table

Function 9: Comparison table

Function 10: doParallel settings

Modeling Approach

Linear and Regularized Models

Near-Zero Variance and Correlation Filtering for Linear Models

Yeo-Johnson Variable List for Linear Models

Create Linear Modeling Data

Linear Regression Baseline

Elastic Net Regression

Elastic Net Tuning Parameters

Partial Least Squares Regression

Nonlinear Models

SVM with Radial Basis Kernel

Multivariate Adaptive Regression Splines (MARS) Model

Neural Network Model

Neural Network with Tree-Family Features Dataset

Tree-Based Models

Random Forest

Cubist Model

XGBoost model

Final Model Comparison Summary

Linear and Regularized Models

Nonlinear Models

Tree-Based Models

Tuning Process

Final Takeaway

Important Predictors from XGBoost and Random Forest

XGBoost important predictors

Random Forest important predictors

Review of Negative `mnfflow` Values

Imputation of all NA’s with `missForest` package

`mnfflow` Variable Review

Testing `mnfflow` Representation for Linear Models