library(tidyverse)
library(tidymodels)
library(caret)
library(recipes)
library(kableExtra)
library(patchwork)

library(readxl)
library(janitor)

library(glmnet)
library(earth)
library(kernlab)
library(nnet)
library(ranger)
library(xgboost)
library(e1071)
library(naniar)
library(class)
library(rpart)
library(missForest)

library(vip)
library(writexl)
library(doParallel)
set.seed(123)
project_seed = 123

# use_parallel <- TRUE
# 
# if (use_parallel) {
#   #num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)
#   num_cores_to_use <- 2
#   cl <- makePSOCKcluster(num_cores_to_use)
#   registerDoParallel(cl)
# 
#   message("Parallel backend started with ", num_cores_to_use, " cores.")
# }

Project Objective

The project is to better understand the manufacturing process and build a model that can predict PH. New regulations require the company to explain what production factors may affect PH and also report a predictive model based on historical production data.

The main goal of this project is to create a reliable PH prediction workflow. I will review the data, check missing values, test different imputation methods, compare multiple models, and then choose the final model based on validation performance.

This project has both technical and business purpose. The technical part explains data preprocessing, imputation, model tuning, and validation metrics. The business part explains the final model and important production factors in more readable way.

Modeling Plan

The workflow for this project is:

  1. Load and prepare the historical ABC Beverage data.
  2. Split the data into training, validation, and final evaluation sets.
  3. Review preprocessing issues like missing values, categorical variables, skewness, near-zero variance, and highly correlated predictors.
  4. Create several imputed feature sets.
  5. Test the imputed feature sets with different model families.
  6. Tune models using cross-validation when it is needed.
  7. Compare model performance using RMSE, MAE, and R-squared.
  8. Select the final model mostly based on validation RMSE.
  9. Use the final model to predict PH values for the final evaluation data.
  10. Export final predictions in Excel readable format.

PH will not be used during imputation because PH is the target variable. If PH is used to fill missing feature values, then the model can indirectly learn from the answer it is supposed to predict. This would ppotentially create target leakage.

Load Data

set.seed(123)
train_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentData%20%282%29.xlsx"

test_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentEvaluation.xlsx"

train_file <- tempfile(fileext = ".xlsx")
test_file  <- tempfile(fileext = ".xlsx")

download.file(train_url, train_file, mode = "wb")
download.file(test_url, test_file, mode = "wb")

train_data_raw <- read_excel(train_file) |>
  clean_names()

test_data_raw <- read_excel(test_file) |>
  clean_names()

Categorical Variable Review

categorical_vars_raw <- names(train_data_raw)[
  sapply(train_data_raw, function(x) is.character(x) || is.factor(x))
]

categorical_vars_raw
[1] "brand_code"
categorical_distribution <- tibble()

for (var in categorical_vars_raw) {
  
  temp_distribution <- train_data_raw |>
    count(.data[[var]], name = "count") |>
    mutate(
      variable = var,
      percent = round(count / sum(count) * 100, 2)
    )
  
  names(temp_distribution)[1] <- "category"
  
  categorical_distribution <- bind_rows(
    categorical_distribution,
    temp_distribution
  )
}

categorical_distribution |>
  arrange(variable, desc(percent)) |>
  kable(
    caption = "Categorical Variable Distributions",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Categorical Variable Distributions
category count variable percent
B 1239 brand_code 48.19
D 615 brand_code 23.92
C 304 brand_code 11.82
A 293 brand_code 11.40
NA 120 brand_code 4.67

Only brand_code is a categorical variable in this data. Brand B is the most common category, making up about 48% of the rows. Brand D is the second largest group, while Brands A and C are smaller groups but still have enough observations. Around 4.7% of the brand_code values are missing.

Later I will check if the training and validation split keeps brand_code's distributions similar.

Train and Validation Split

After reviewing the categorical variable distribution, I will split train_data_raw data into 80% training and 20% validation sets. The split will be stratified by PH because PH is the target variable.

After the split, I will also check if brand_code’s values has same distributions between training and validation.

set.seed(123)

production_data <- train_data_raw |>
  mutate(
    brand_code = as.factor(brand_code)
  )

eval_data <- test_data_raw |>
  mutate(
    brand_code = as.factor(brand_code)
  )

ph_split <- initial_split(
  production_data,
  prop = 0.80,
  strata = ph
)

train_data <- training(ph_split)
valid_data <- testing(ph_split)

train_features_raw <- train_data |>
  select(-ph)

valid_features_raw <- valid_data |>
  select(-ph)

eval_features_raw <- eval_data |>
  select(-any_of("ph"))

train_ph <- train_data$ph
valid_ph <- valid_data$ph
valid_features_raw$brand_code <- factor(
  valid_features_raw$brand_code,
  levels = levels(train_features_raw$brand_code)
)

eval_features_raw$brand_code <- factor(
  eval_features_raw$brand_code,
  levels = levels(train_features_raw$brand_code)
)
split_overview <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_raw),
    nrow(valid_features_raw),
    nrow(eval_features_raw)
  ),
  feature_columns = c(
    ncol(train_features_raw),
    ncol(valid_features_raw),
    ncol(eval_features_raw)
  ),
  ph_available = c("Yes", "Yes", "No")
)

split_overview |>
  kable(
    caption = "Modeling Data Split"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Modeling Data Split
dataset rows feature_columns ph_available
Training 2055 32 Yes
Validation 516 32 Yes
Final Evaluation 267 32 No
brand_balance_check <- bind_rows(
  train_features_raw |>
    mutate(split = "Training"),
  valid_features_raw |>
    mutate(split = "Validation")
) |>
  count(split, brand_code, name = "count") |>
  group_by(split) |>
  mutate(
    percent = round(count / sum(count) * 100, 2),
    count_percent = paste0(count, " (", percent, "%)")
  ) |>
  ungroup()

brand_balance_wide <- brand_balance_check |>
  select(split, brand_code, count_percent) |>
  mutate(
    brand_code = ifelse(
      is.na(brand_code),
      "Missing",
      as.character(brand_code)
    ),
    brand_code = factor(
      brand_code,
      levels = c("A", "B", "C", "D", "Missing")
    )
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = count_percent
  )

brand_balance_wide |>
  kable(
    caption = "Brand Code Distribution After Train/Validation Split",
    col.names = c(
      "Dataset",
      "Brand A",
      "Brand B",
      "Brand C",
      "Brand D",
      "Missing Brand Code"
    )
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Brand Code Distribution After Train/Validation Split
Dataset Brand A Brand B Brand C Brand D Missing Brand Code
Training 225 (10.95%) 986 (47.98%) 255 (12.41%) 495 (24.09%) 94 (4.57%)
Validation 68 (13.18%) 253 (49.03%) 49 (9.5%) 120 (23.26%) 26 (5.04%)

After the split brand_code distribution is pretty much similar between the training and validation sets. Brand B remains the largest group in both sets, and Brand D is also close between the two sets. Brand A and Brand C have some difference, but the split still looks acceptable. Missing brand_code values are also close, around 4.6% in training and 5.0% in validation.

Because the PH distribution was stratified during the split and brand_code distribution also looks reasonable, I will keep this train/validation split.

Since PH is the target variable, I am going to check if PH values has similar distributions between the training and validation sets.

ph_split_summary <- tibble(
  dataset = c("Training", "Validation"),
  count = c(length(train_ph), length(valid_ph)),
  mean_ph = c(mean(train_ph, na.rm = TRUE), mean(valid_ph, na.rm = TRUE)),
  median_ph = c(median(train_ph, na.rm = TRUE), median(valid_ph, na.rm = TRUE)),
  sd_ph = c(sd(train_ph, na.rm = TRUE), sd(valid_ph, na.rm = TRUE)),
  min_ph = c(min(train_ph, na.rm = TRUE), min(valid_ph, na.rm = TRUE)),
  max_ph = c(max(train_ph, na.rm = TRUE), max(valid_ph, na.rm = TRUE))
)

ph_split_summary |>
  kable(
    caption = "PH Distribution After Train/Validation Split",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
PH Distribution After Train/Validation Split
dataset count mean_ph median_ph sd_ph min_ph max_ph
Training 2055 8.5455 8.54 0.1728 7.88 9.36
Validation 516 8.5462 8.54 0.1717 8.00 8.88
ph_split_plot_data <- tibble(
  ph = c(train_ph, valid_ph),
  dataset = c(
    rep("Training", length(train_ph)),
    rep("Validation", length(valid_ph))
  )
)

ggplot(
  ph_split_plot_data,
  aes(x = ph, fill = dataset)
) +
  geom_density(alpha = 0.4) +
  labs(
    title = "PH Distribution in Training and Validation Sets",
    x = "PH",
    y = "Density",
    fill = "Dataset"
  ) +
  theme_minimal()

The PH distribution looks similar enough between the training and validation sets. As we see from the above table mean, median, and standard deviation are almost the same.
The training set has a slightly higher maximum PH value, but the overall distribution still looks balanced. Because of this, I keep this split for the rest of the project.

Data Preprocessing Review

After checking PH balance and brand_code balance, I am going to review the remaining preprocessing issues. In this section I will explore missing values, numeric predictor summaries, skewness, near-zero variance.

This section has only diagnostic purpose. The actual preprocessing steps like dummy variables, transformations, scaling, and feature removal will be handled later inside modeling workflows when needed.

Missing Value Review

Before checking missing predictor values, I will first check if PH itself has missing values in the training/validation data and final evaluation data.

ph_missing_check <- tibble(
  dataset = c("Training/VAlidation Data", "Final Evaluation Data"),
  rows = c(nrow(train_data_raw), nrow(test_data_raw)),
  ph_column_exists = c(
    "ph" %in% names(train_data_raw),
    "ph" %in% names(test_data_raw)
  ),
  missing_ph_count = c(
    if ("ph" %in% names(train_data_raw)) sum(is.na(train_data_raw$ph)) else NA,
    if ("ph" %in% names(test_data_raw)) sum(is.na(test_data_raw$ph)) else NA
  ),
  missing_ph_percent = c(
    if ("ph" %in% names(train_data_raw)) round(mean(is.na(train_data_raw$ph)) * 100, 2) else NA,
    if ("ph" %in% names(test_data_raw)) round(mean(is.na(test_data_raw$ph)) * 100, 2) else NA
  )
)

ph_missing_check |>
  kable(
    caption = "PH Missing Value Check",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
PH Missing Value Check
dataset rows ph_column_exists missing_ph_count missing_ph_percent
Training/VAlidation Data 2571 TRUE 4 0.16
Final Evaluation Data 267 TRUE 267 100.00

The final evaluation data has PH missing for all rows, which is expected because those are the values the model needs to predict.

The training/validation data has 4 rows with missing PH values. Because PH is the target variable, those rows cannot be used for model training or validation. I will remove any rows with missing PH from the training and validation sets before separating features and target values.

train_data <- train_data |>
  filter(!is.na(ph))

valid_data <- valid_data |>
  filter(!is.na(ph))
train_features_raw <- train_data |>
  select(-ph)

valid_features_raw <- valid_data |>
  select(-ph)

eval_features_raw <- eval_data |>
  select(-any_of("ph"))

train_ph <- train_data$ph
valid_ph <- valid_data$ph
sum(is.na(train_ph))
[1] 0
sum(is.na(valid_ph))
[1] 0
nrow(train_features_raw)
[1] 2052
nrow(valid_features_raw)
[1] 515
length(train_ph)
[1] 2052
length(valid_ph)
[1] 515
missing_summary <- bind_rows(
  train_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Training",
      total_rows = nrow(train_features_raw)
    ),
  
  valid_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Validation",
      total_rows = nrow(valid_features_raw)
    ),
  
  eval_features_raw |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      cols = everything(),
      names_to = "variable",
      values_to = "missing_count"
    ) |>
    mutate(
      dataset = "Final Evaluation",
      total_rows = nrow(eval_features_raw)
    )
) |>
  mutate(
    missing_percent = round(missing_count / total_rows * 100, 2)
    ) |>
    filter(missing_count > 0) |>
    select(
      dataset,
      variable,
      missing_count,
      missing_percent
    ) |>
    mutate(
      dataset = factor(
        dataset,
        levels = c("Training", "Validation", "Final Evaluation")
      )
    ) |>
    arrange(dataset, desc(missing_count))

missing_summary |>
  kable(
    caption = "Missing Predictor Values by Dataset",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Missing Predictor Values by Dataset
dataset variable missing_count missing_percent
Training mfr 173 8.43
Training brand_code 94 4.58
Training filler_speed 43 2.10
Training psc_co2 36 1.75
Training pc_volume 32 1.56
Training fill_ounces 30 1.46
Training psc 27 1.32
Training carb_pressure1 25 1.22
Training hyd_pressure4 24 1.17
Training carb_pressure 21 1.02
Training psc_fill 21 1.02
Training carb_temp 17 0.83
Training filler_level 14 0.68
Training fill_pressure 13 0.63
Training hyd_pressure2 11 0.54
Training hyd_pressure3 11 0.54
Training temperature 10 0.49
Training oxygen_filler 9 0.44
Training pressure_setpoint 9 0.44
Training carb_volume 8 0.39
Training hyd_pressure1 8 0.39
Training carb_rel 8 0.39
Training alch_rel 6 0.29
Training usage_cont 4 0.19
Training carb_flow 2 0.10
Training balling_lvl 1 0.05
Validation mfr 35 6.80
Validation brand_code 26 5.05
Validation filler_speed 11 2.14
Validation carb_temp 9 1.75
Validation fill_ounces 8 1.55
Validation pc_volume 7 1.36
Validation carb_pressure1 7 1.36
Validation carb_pressure 6 1.17
Validation psc 6 1.17
Validation fill_pressure 5 0.97
Validation hyd_pressure2 4 0.78
Validation hyd_pressure3 4 0.78
Validation hyd_pressure4 4 0.78
Validation psc_co2 3 0.58
Validation hyd_pressure1 3 0.58
Validation pressure_setpoint 3 0.58
Validation carb_volume 2 0.39
Validation psc_fill 2 0.39
Validation filler_level 2 0.39
Validation temperature 2 0.39
Validation oxygen_filler 2 0.39
Validation bowl_setpoint 2 0.39
Validation usage_cont 1 0.19
Validation alch_rel 1 0.19
Final Evaluation mfr 31 11.61
Final Evaluation filler_speed 10 3.75
Final Evaluation brand_code 8 3.00
Final Evaluation fill_ounces 6 2.25
Final Evaluation psc 5 1.87
Final Evaluation psc_co2 5 1.87
Final Evaluation pc_volume 4 1.50
Final Evaluation carb_pressure1 4 1.50
Final Evaluation hyd_pressure4 4 1.50
Final Evaluation psc_fill 3 1.12
Final Evaluation oxygen_filler 3 1.12
Final Evaluation alch_rel 3 1.12
Final Evaluation fill_pressure 2 0.75
Final Evaluation filler_level 2 0.75
Final Evaluation temperature 2 0.75
Final Evaluation usage_cont 2 0.75
Final Evaluation pressure_setpoint 2 0.75
Final Evaluation carb_rel 2 0.75
Final Evaluation carb_volume 1 0.37
Final Evaluation carb_temp 1 0.37
Final Evaluation hyd_pressure2 1 0.37
Final Evaluation hyd_pressure3 1 0.37
Final Evaluation density 1 0.37
Final Evaluation balling 1 0.37
Final Evaluation pressure_vacuum 1 0.37
Final Evaluation bowl_setpoint 1 0.37
Final Evaluation air_pressurer 1 0.37

The missing value review shows that missing predictor values are present in training, validation, and evaluation sets. The variable mfr has the highest missingness in all three datasets, with about 8.6% missing in training, 7.0% in validation, and 11.6% in final evaluation. brand_code also has missing values in all datasets.

Most other predictors have smaller missing percentages, usually below 2%.

Since evaluation set also has missing predictor values, the imputation process has to be applied to training, validation, and evaluation missing values. PH will not be used during imputation because it is the target variable.

Raw Numeric Predictor Review

These checks are done before missing values imputation to understand the original data. They are not final transformation decisions. After imputation, I will review the completed feature sets again before modeling.

numeric_vars <- names(train_features_raw)[
  sapply(train_features_raw, is.numeric)
]

#numeric_vars
numeric_summary <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(variable) |>
  summarise(
    count = sum(!is.na(value)),
    missing_count = sum(is.na(value)),
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    .groups = "drop"
  )

numeric_summary |>
  kable(
    caption = "Numeric Predictor Summary in Training Data",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )
Numeric Predictor Summary in Training Data
variable count missing_count mean median sd min q1 q3 max
air_pressurer 2052 0 142.835 142.600 1.211 140.800 142.200 143.000 148.200
alch_rel 2046 6 6.898 6.560 0.508 5.280 6.540 7.240 8.620
balling 2052 0 2.196 1.648 0.931 0.160 1.496 3.292 3.984
balling_lvl 2051 1 2.050 1.460 0.869 0.000 1.380 3.140 3.660
bowl_setpoint 2052 0 109.197 120.000 15.268 70.000 100.000 120.000 140.000
carb_flow 2050 2 2462.747 3029.000 1075.415 26.000 1160.000 3180.000 3834.000
carb_pressure 2031 21 68.162 68.000 3.511 57.000 65.600 70.500 79.400
carb_pressure1 2027 25 122.537 123.200 4.779 105.600 118.800 125.400 140.200
carb_rel 2044 8 5.436 5.400 0.129 4.960 5.340 5.540 6.060
carb_temp 2035 17 141.026 140.800 4.034 128.600 138.400 143.600 154.000
carb_volume 2044 8 5.370 5.347 0.107 5.040 5.293 5.460 5.700
density 2052 0 1.172 0.980 0.377 0.240 0.900 1.620 1.920
fill_ounces 2022 30 23.976 23.973 0.088 23.633 23.920 24.033 24.320
fill_pressure 2039 13 47.923 46.400 3.179 36.000 46.000 50.000 60.400
filler_level 2038 14 109.145 118.200 15.609 55.800 98.450 120.000 153.200
filler_speed 2009 43 3677.520 3982.000 785.011 998.000 3888.000 3998.000 4030.000
hyd_pressure1 2044 8 12.475 11.600 12.366 -0.800 0.000 20.250 58.000
hyd_pressure2 2041 11 20.934 28.600 16.391 0.000 0.000 34.600 59.400
hyd_pressure3 2041 11 20.496 27.800 15.971 -1.200 0.000 33.400 50.000
hyd_pressure4 2028 24 96.417 96.000 13.255 62.000 86.000 102.000 142.000
mfr 1879 173 703.896 724.000 74.710 31.400 706.700 731.000 868.600
mnf_flow 2052 0 25.160 76.200 119.586 -100.200 -100.000 140.800 229.400
oxygen_filler 2043 9 0.047 0.033 0.045 0.002 0.022 0.060 0.400
pc_volume 2020 32 0.277 0.272 0.061 0.079 0.239 0.313 0.471
pressure_setpoint 2043 9 47.618 46.000 2.030 44.000 46.000 50.000 52.000
pressure_vacuum 2052 0 -5.217 -5.400 0.568 -6.600 -5.600 -5.000 -3.600
psc 2025 27 0.085 0.076 0.050 0.002 0.048 0.112 0.268
psc_co2 2016 36 0.056 0.040 0.043 0.000 0.020 0.080 0.240
psc_fill 2031 21 0.194 0.180 0.116 0.000 0.100 0.260 0.620
temperature 2042 10 65.976 65.600 1.405 63.600 65.200 66.400 76.200
usage_cont 2048 4 21.028 21.920 2.969 12.080 18.440 23.740 25.900

As we see from this table, numeric predictors are measured on different scales. Because of this, models like linear, SVM, neural networks, and regularized regression will need centering and scaling later.

Some variables also have zero or negative values, such as mnf_flow, pressure_vacuum, and some hydraulic pressure variables. Because of this, Box-Cox transformation is not safe for all predictors. Yeo-Johnson transformation is a better option because it can handle zero and negative values.

Some variables are stored as numeric, but they may actually represent process settings or categories. To check this, I will use the counts of unique values each numeric predictor has in the training data.
If a numeric variable has only a small number of unique values, I may apply different methods during preprocessing and modeling.

numeric_unique_summary <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  summarise(
    across(
      everything(),
      ~ n_distinct(., na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "unique_values"
  ) |>
  left_join(
    numeric_summary |>
      select(variable, count, missing_count, min, max),
    by = "variable"
  ) |>
  mutate(
    possible_discrete_or_setting = ifelse(
      unique_values <= 10,
      "Yes",
      "No"
    )
  ) |>
  arrange(unique_values)

numeric_discrete_review <- numeric_unique_summary |>
  filter(unique_values <= 50)

numeric_discrete_review |>
  kable(
    caption = "Numeric Predictors With Limited Unique Values",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Numeric Predictors With Limited Unique Values
variable unique_values count missing_count min max possible_discrete_or_setting
pressure_setpoint 7 2043 9 44.00 52.00 Yes
bowl_setpoint 10 2052 0 70.00 140.00 Yes
pressure_vacuum 16 2052 0 -6.60 -3.60 No
psc_co2 23 2016 36 0.00 0.24 No
air_pressurer 32 2052 0 140.80 148.20 No
hyd_pressure4 39 2028 24 62.00 142.00 No
carb_rel 40 2044 8 4.96 6.06 No
alch_rel 49 2046 6 5.28 8.62 No

Based on the unique value check, pressure_setpoint and bowl_setpoint look like process setting variables. However, I preffer do not treat these variables exactly like fully continuous numeric measurements. Later, when Yeo-Johnson transformations are used inside modeling recipes, I might exclude pressure_setpoint and bowl_setpoint numeric variables from that transformation.

exclude_from_yj_vars <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

# recipe(ph ~ ., data = train_data_model) |>
#   step_YeoJohnson(
#     all_numeric_predictors(),
#     -all_of(setting_like_numeric_vars)
#   ) |>
#   step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
#   step_zv(all_predictors()) |>
#   step_nzv(all_predictors()) |>
#   step_normalize(all_numeric_predictors())

Skewness Check

Since some values are missing, this skewness check is based on available raw training values. Imputation may slightly change the distribution, so the final transformation choices will be confirmed after imputed feature sets are created.

Skewness analysis is useful here because some models work better when numeric predictors are more normally distributed. This is especially important for models like linear regression, elastic net, SVM, and neural networks.

However, I do not want to blindly transform every skewed variable. Some numeric variables are process settings or rounded measurements as we saw earlier, so skewness may not mean the same thing for them. Because of this, transformation decisions will be handled inside the modelling section later.

numeric_skewness <- train_features_raw |>
  select(all_of(numeric_vars)) |>
  summarise(
    across(
      everything(),
      ~ skewness(., na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "skewness"
  ) |>
  mutate(
    abs_skewness = abs(skewness),
    skewness_level = case_when(
      abs_skewness >= 2 ~ "High skewness",
      abs_skewness >= 1 ~ "Moderate skewness",
      TRUE ~ "Low skewness"
    ),
    # setting_like = ifelse(
    #   variable %in% exclude_from_yj_vars,
    #   "Yes",
    #   "No"
    #)
  ) |>
  arrange(desc(abs_skewness))

skewed_numeric_vars <- numeric_skewness |>
  filter(abs_skewness >= 1)

skewed_numeric_vars |>
  kable(
    caption = "Numeric Predictors With Moderate or High Skewness",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Numeric Predictors With Moderate or High Skewness
variable skewness abs_skewness skewness_level
mfr -5.096 5.096 High skewness
filler_speed -2.790 2.790 High skewness
temperature 2.483 2.483 High skewness
oxygen_filler 2.366 2.366 High skewness
air_pressurer 2.259 2.259 High skewness
psc_co2 1.755 1.755 Moderate skewness
ggplot(
  numeric_skewness,
  aes(
    x = reorder(variable, abs_skewness),
    y = abs_skewness,
    fill = skewness_level
  )
) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Absolute Skewness of Numeric Predictors",
    subtitle = "Variables are colored by skewness level",
    x = "Variable",
    y = "Absolute Skewness",
    fill = "Skewness Level"
  ) +
  theme_minimal()

skewed_vars_to_plot <- skewed_numeric_vars |>
  pull(variable)

skewed_distribution_data <- train_features_raw |>
  select(all_of(skewed_vars_to_plot)) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

ggplot(
  skewed_distribution_data,
  aes(x = value)
) +
  geom_histogram(
    bins = 30,
    color = "white"
  ) +
  facet_wrap(
    ~ variable,
    scales = "free",
    ncol = 3
  ) +
  labs(
    title = "Distributions of Skewed Numeric Predictors",
    subtitle = "Only variables with absolute skewness greater than or equal to 1 are shown",
    #x = "Value",
    y = "Count"
  ) +
  theme_minimal()

The skewness table and plot shows that only a small group of numeric predictors have moderate or high skewness. The variables with absolute skewness greater than or equal to 1 are mfr, filler_speed, oxygen_filler, temperature, air_pressurer, and psc_co2.

At first, these variables may look like candidates for transformation. However, the histograms show that not all skewed variables behave the same way. Some variables look like regular continuous variables with long tails, while others look more like grouped or rounded process measurements.

  • oxygen_filler and temperature look more like continuous variables with right skewness. These variables are better candidates for Yeo-Johnson transformation in models that are sensitive to skewed predictors.

  • mfr and filler_speed are left-skewed. Most observations are concentrated around normal operating values, but there are some much lower values. After outliers analysis section, transformation may still be useful for some model types.

  • Even psc_co2 have skewness, its plot show grouped values and repeated bars respectfully. This suggests psc_co2 may be categorical or rounded measurements. Because of this, I preffer not to apply Yeo-Johnson transformation to psc_co2.

Based on this review, I will keep all skewed variables in the dataset, but I separate the transformation decision by variable type. In the modeling workflow, Yeo-Johnson transformation will be applied before brand_codedummy encoding, and variables that look like process settings or grouped measurements will be excluded from Yeo-Johnson transformation.

exclude_from_yj_vars <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2"
)

Near-Zero Variance Check

Next I will check for near-zero variance predictors.

This matters because predictors with very little variation usually do not have much useful information for the models.

nzv_raw_check <- nearZeroVar(
  train_features_raw,
  saveMetrics = FALSE)

nzv_raw_check
integer(0)

No raw predictors were flagged as near-zero variance. This means I do not remove any predictors at this stage based on near-zero variance.

Outlier and Unusual Value Review

Before imputation, I would check for unusual numeric values. If a value is clearly impossible or likely caused by a recording issue, it should be reviewed before missing values are filled.

I do not automatically remove outliers only because they are extreme. In production data, unusual values may represent real process conditions. My goal here is to identify suspicious values and decide carefully before changing the data.

numeric_vars_raw <- train_features_raw |>
  select(where(is.numeric)) |>
  names()

outlier_summary <- map_dfr(
  numeric_vars_raw,
  function(var) {
    
    values <- train_features_raw[[var]]
    
    q1 <- quantile(values, 0.25, na.rm = TRUE)
    q3 <- quantile(values, 0.75, na.rm = TRUE)
    iqr_value <- q3 - q1
    
    lower_bound <- q1 - 1.5 * iqr_value
    upper_bound <- q3 + 1.5 * iqr_value
    
    tibble(
      variable = var,
      count = sum(!is.na(values)),
      missing_count = sum(is.na(values)),
      min_value = min(values, na.rm = TRUE),
      q1 = q1,
      median = median(values, na.rm = TRUE),
      q3 = q3,
      max_value = max(values, na.rm = TRUE),
      lower_bound = lower_bound,
      upper_bound = upper_bound,
      outlier_count = sum(values < lower_bound | values > upper_bound, na.rm = TRUE),
      outlier_percent = round(outlier_count / count * 100, 2)
    )
  }
) |>
  arrange(desc(outlier_percent))

outlier_summary |>
  kable(
    caption = "Outlier Review Using IQR Rule",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Outlier Review Using IQR Rule
variable count missing_count min_value q1 median q3 max_value lower_bound upper_bound outlier_count outlier_percent
filler_speed 2009 43 998.000 3888.000 3982.000 3998.000 4030.000 3723.000 4163.000 356 17.72
mfr 1879 173 31.400 706.700 724.000 731.000 868.600 670.250 767.450 190 10.11
air_pressurer 2052 0 140.800 142.200 142.600 143.000 148.200 141.000 144.200 177 8.63
oxygen_filler 2043 9 0.002 0.022 0.033 0.060 0.400 -0.035 0.117 147 7.20
temperature 2042 10 63.600 65.200 65.600 66.400 76.200 63.400 68.200 103 5.04
pressure_vacuum 2052 0 -6.600 -5.600 -5.400 -5.000 -3.600 -6.500 -4.100 95 4.63
hyd_pressure4 2028 24 62.000 86.000 96.000 102.000 142.000 62.000 126.000 70 3.45
pc_volume 2020 32 0.079 0.239 0.272 0.313 0.471 0.128 0.424 64 3.17
fill_pressure 2039 13 36.000 46.000 46.400 50.000 60.400 40.000 56.000 64 3.14
psc_co2 2016 36 0.000 0.020 0.040 0.080 0.240 -0.070 0.170 63 3.12
psc_fill 2031 21 0.000 0.100 0.180 0.260 0.620 -0.140 0.500 52 2.56
psc 2025 27 0.002 0.048 0.076 0.112 0.268 -0.048 0.208 48 2.37
fill_ounces 2022 30 23.633 23.920 23.973 24.033 24.320 23.750 24.203 38 1.88
carb_temp 2035 17 128.600 138.400 140.800 143.600 154.000 130.600 151.400 35 1.72
carb_pressure1 2027 25 105.600 118.800 123.200 125.400 140.200 108.900 135.300 17 0.84
carb_pressure 2031 21 57.000 65.600 68.000 70.500 79.400 58.250 77.850 14 0.69
filler_level 2038 14 55.800 98.450 118.200 120.000 153.200 66.125 152.325 6 0.29
alch_rel 2046 6 5.280 6.540 6.560 7.240 8.620 5.490 8.290 6 0.29
carb_rel 2044 8 4.960 5.340 5.400 5.540 6.060 5.040 5.840 6 0.29
hyd_pressure1 2044 8 -0.800 0.000 11.600 20.250 58.000 -30.375 50.625 4 0.20
carb_volume 2044 8 5.040 5.293 5.347 5.460 5.700 5.043 5.710 2 0.10
mnf_flow 2052 0 -100.200 -100.000 76.200 140.800 229.400 -461.200 502.000 0 0.00
hyd_pressure2 2041 11 0.000 0.000 28.600 34.600 59.400 -51.900 86.500 0 0.00
hyd_pressure3 2041 11 -1.200 0.000 27.800 33.400 50.000 -50.100 83.500 0 0.00
usage_cont 2048 4 12.080 18.440 21.920 23.740 25.900 10.490 31.690 0 0.00
carb_flow 2050 2 26.000 1160.000 3029.000 3180.000 3834.000 -1870.000 6210.000 0 0.00
density 2052 0 0.240 0.900 0.980 1.620 1.920 -0.180 2.700 0 0.00
balling 2052 0 0.160 1.496 1.648 3.292 3.984 -1.198 5.986 0 0.00
bowl_setpoint 2052 0 70.000 100.000 120.000 120.000 140.000 70.000 150.000 0 0.00
pressure_setpoint 2043 9 44.000 46.000 46.000 50.000 52.000 40.000 56.000 0 0.00
balling_lvl 2051 1 0.000 1.380 1.460 3.140 3.660 -1.260 5.780 0 0.00
negative_value_check <- train_features_raw |>
  select(where(is.numeric)) |>
  summarise(
    across(
      everything(),
      ~ sum(. < 0, na.rm = TRUE)
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "negative_count"
  ) |>
  filter(negative_count > 0) |>
  mutate(
    total_non_missing = map_int(
      variable,
      ~ sum(!is.na(train_features_raw[[.x]]))
    ),
    negative_percent = round(negative_count / total_non_missing * 100, 2),
    min_value = map_dbl(
      variable,
      ~ min(train_features_raw[[.x]], na.rm = TRUE)
    ),
    max_value = map_dbl(
      variable,
      ~ max(train_features_raw[[.x]], na.rm = TRUE)
    )
  ) |>
  arrange(desc(negative_count))

negative_value_check |>
  kable(
    caption = "Numeric Predictors With Negative Values",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Numeric Predictors With Negative Values
variable negative_count total_non_missing negative_percent min_value max_value
pressure_vacuum 2052 2052 100.00 -6.6 -3.6
mnf_flow 941 2052 45.86 -100.2 229.4
hyd_pressure3 60 2041 2.94 -1.2 50.0
hyd_pressure1 33 2044 1.61 -0.8 58.0

Review of Negative mnfflow Values

The negative-value summary shows that almost 46% of mnf_flow has many negative values . Since almost half of the observed mnf_flow values are negative, I want to review this variable more carefully before deciding whether these values should be treated as errors.

Negative mnf_flow values look suspicious because many of them appear to be negative. However, before deciding on what to do with negative values, I would like to check whether these values are connected to PH and other variables.

First of all, I will check the what actually these negative values look like

mnf_flow_negative_values <- train_features_raw |>
  filter(
    !is.na(mnf_flow),
    mnf_flow < 0
  ) |>
  count(
    mnf_flow,
    sort = TRUE
  )

mnf_flow_negative_values |>
  kable(
    caption = "Negative mnfflow Values in Training Data",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Negative mnfflow Values in Training Data
mnf_flow n
-100.0 487
-100.2 454

The table shows that negative mnf_flow values are almost entirely concentrated at -100 and -100.2. This suggests that these values may represent an error readings or actual production settings.

Below I will compare PH values for rows with negative and non-negative mnf_flow.

mnf_flow_ph_check <- train_features_raw |>
  mutate(
    mnf_flow_status = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow < 0 ~ "Negative",
      mnf_flow >= 0 ~ "Non-negative"
    )
  ) |>
  bind_cols(
    tibble(ph = train_ph)
  ) |>
  filter(
    !is.na(ph)
  ) |>
  group_by(
    brand_code,
    mnf_flow_status
  ) |>
  summarise(
    n = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(
    brand_code,
    mnf_flow_status
  )

mnf_flow_ph_check |>
  kable(
    caption = "PH Comparison for Negative vs Non-Negative mnfflow",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
PH Comparison for Negative vs Non-Negative mnfflow
brand_code mnf_flow_status n mean_ph median_ph sd_ph min_ph max_ph
A Negative 85 8.5779 8.60 0.1388 8.20 8.86
A Non-negative 140 8.4381 8.46 0.1519 8.06 8.68
B Negative 476 8.6700 8.70 0.1385 8.18 8.94
B Non-negative 507 8.4716 8.48 0.1329 8.00 8.86
C Negative 118 8.4675 8.50 0.1995 8.08 9.36
C Non-negative 137 8.3737 8.38 0.1495 7.88 8.80
D Negative 222 8.6627 8.68 0.1316 8.32 8.92
D Non-negative 273 8.5527 8.56 0.1226 8.16 8.82
NA Negative 40 8.6330 8.61 0.1261 8.28 8.88
NA Non-negative 54 8.3930 8.45 0.1289 8.12 8.60

As we see from the table, negative mnf_flow rows have higher mean PH in every Brand Code group. This strongly suggests that the negative values are carrying useful information about the PH outcome.

I also compare a few process variables to see whether negative mnf_flow rows look different in the production process.

mnf_flow_process_check <- train_features_raw |>
  mutate(
    mnf_flow_negative = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow < 0 ~ "Negative",
      mnf_flow >= 0 ~ "Non-negative"
    )
  ) |>
  group_by(
    brand_code,
    mnf_flow_negative
  ) |>
  summarise(
    n = n(),
    mean_temperature = mean(temperature, na.rm = TRUE),
    mean_density = mean(density, na.rm = TRUE),
    mean_carb_rel = mean(carb_rel, na.rm = TRUE),
    mean_oxygen_filler = mean(oxygen_filler, na.rm = TRUE),
    mean_filler_speed = mean(filler_speed, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(
    brand_code,
    mnf_flow_negative
  )

mnf_flow_process_check |>
  kable(
    caption = "Process Variable Comparison for Negative vs Non-Negative mnfflow",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Process Variable Comparison for Negative vs Non-Negative mnfflow
brand_code mnf_flow_negative n mean_temperature mean_density mean_carb_rel mean_oxygen_filler mean_filler_speed
A Negative 85 66.1153 1.5522 5.5026 0.0701 3381.349
A Non-negative 140 66.1696 1.5759 5.5247 0.0265 3684.567
B Negative 476 65.8474 0.9103 5.3818 0.0713 3735.662
B Non-negative 507 65.9199 0.9084 5.3424 0.0255 3705.375
C Negative 118 66.9879 0.8961 5.3483 0.0848 3637.983
C Non-negative 137 66.4482 0.9330 5.3543 0.0229 3723.293
D Negative 222 65.5674 1.6842 5.6032 0.0709 3622.532
D Non-negative 273 65.3735 1.6787 5.6078 0.0232 3706.127
NA Negative 40 67.3500 0.9605 5.3349 0.0792 3610.350
NA Non-negative 54 67.2491 0.9919 5.3506 0.0269 3449.077

The process-variable comparison also shows that negative mnf_flow rows are different from non-negative rows.
The clearest example is oxygen_filler. For Brand B:

  • rows with negative mnf_flow have mean of oxygen_filler of 0.0713, compared with only 0.0255 for non-negative mnf_flow rows.

The same pattern appears for other brands too. For Brand C:

  • the mean of oxygen_filler is 0.0848 for negative mnf_flow rows and 0.0229 for non-negative rows.

This makes me more careful about treating negative mnf_flow as an error. If these values were only random bad measurements, I would not expect them to clearly separate PH and oxygen_filler.

Based on this review, I will not convert negative mnf_flow values as errors in the main preprocessing workflow at this point. Even though -100 and -100.2 look like suspicious, they appear to contain useful information. For example, Brand B has mean PH of 8.6700 when mnf_flow is negative, compared with 8.4716 when mnf_flow is non-negative. I will decide how to treat them after I impute missing values.

Outliers Review by Brand Code

I think it is more logical to also anlyse outliers by brand_code group. Since different brands may have different production settings, one global outlier cutoff can be misleading.

For example, a value may look unusual when all brands are combined together, but it may be normal for one specific brand. Because of this, I do not want to decide about outliers only from the full dataset. I want to see if values are unusual inside their own brand group.

This brand-level outlier check gives a better view of the production process. If a value is only extreme globally but not extreme within its own brand, I would probably keep it. But if a value is extreme even inside its own brand group and also does not make sense from production logic, then it may need more review before imputation.

At this stage, I use this as a diagnostic check. I will not remove or change outliers automatically, because some unusual values may still represent real production conditions.

brand_outlier_summary <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    count = sum(!is.na(value)),
    missing_count = sum(is.na(value)),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    min_value = min(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    outlier_count = sum(
      value < lower_bound | value > upper_bound,
      na.rm = TRUE
    ),
    outlier_percent = round(outlier_count / count * 100, 2),
    .groups = "drop"
  ) |>
  arrange(desc(outlier_percent))

brand_outlier_summary |>
  filter(outlier_count > 0) |>
  kable(
    caption = "Outlier Review Within Each Brand Code Group",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Outlier Review Within Each Brand Code Group
brand_code variable count missing_count q1 median q3 iqr_value min_value max_value lower_bound upper_bound outlier_count outlier_percent
D pressure_setpoint 493 2 46.000 46.000 46.000 0.000 44.000 52.000 46.000 46.000 159 32.25
D bowl_setpoint 495 0 110.000 120.000 120.000 10.000 70.000 140.000 95.000 135.000 110 22.22
C filler_speed 249 6 3892.000 3980.000 3996.000 104.000 1004.000 4030.000 3736.000 4152.000 48 19.28
D filler_speed 485 10 3892.000 3982.000 3996.000 104.000 998.000 4026.000 3736.000 4152.000 91 18.76
A filler_speed 217 8 3806.000 3984.000 3998.000 192.000 1004.000 4024.000 3518.000 4286.000 34 15.67
B filler_speed 966 17 3890.000 3982.000 3998.000 108.000 1002.000 4020.000 3728.000 4160.000 147 15.22
C mfr 233 22 709.800 721.600 729.800 20.000 76.800 835.400 679.800 759.800 31 13.30
C filler_level 254 1 108.250 118.100 120.000 11.750 69.400 150.800 90.625 137.625 32 12.60
C bowl_setpoint 255 0 110.000 120.000 120.000 10.000 70.000 130.000 95.000 135.000 32 12.55
D hyd_pressure4 494 1 78.000 80.000 84.000 6.000 62.000 140.000 69.000 93.000 61 12.35
B air_pressurer 983 0 142.200 142.600 143.000 0.800 140.800 148.200 141.000 144.200 120 12.21
A mfr 194 31 709.200 723.600 731.000 21.800 31.400 868.600 676.500 763.700 23 11.86
D mfr 453 42 707.400 724.800 732.200 24.800 112.600 792.400 670.200 769.400 51 11.26
C air_pressurer 255 0 142.200 142.400 142.800 0.600 141.200 147.200 141.300 143.700 28 10.98
D density 495 0 1.660 1.720 1.780 0.120 0.720 1.920 1.480 1.960 49 9.90
C density 255 0 0.860 0.940 1.020 0.160 0.240 1.760 0.620 1.260 25 9.80
D fill_pressure 491 4 45.800 46.200 48.000 2.200 37.800 60.000 42.500 51.300 48 9.78
C carb_rel 252 3 5.320 5.360 5.380 0.060 5.020 5.580 5.230 5.470 24 9.52
A oxygen_filler 223 2 0.022 0.032 0.048 0.026 0.002 0.400 -0.017 0.088 21 9.42
B hyd_pressure4 975 8 96.000 98.000 102.000 6.000 78.000 138.000 87.000 111.000 90 9.23
D oxygen_filler 494 1 0.019 0.033 0.052 0.033 0.002 0.318 -0.030 0.101 45 9.11
A density 225 0 1.540 1.600 1.660 0.120 0.840 1.760 1.360 1.840 20 8.89
A air_pressurer 225 0 142.400 142.600 143.000 0.600 141.000 147.000 141.500 143.900 18 8.00
C psc_co2 250 5 0.040 0.040 0.080 0.040 0.000 0.240 -0.020 0.140 20 8.00
B mfr 914 69 706.200 725.000 730.800 24.600 95.400 840.000 669.300 767.700 70 7.66
A alch_rel 224 1 7.100 7.120 7.160 0.060 6.520 7.760 7.010 7.250 16 7.14
B oxygen_filler 981 2 0.022 0.034 0.064 0.042 0.002 0.400 -0.041 0.127 60 6.12
B density 983 0 0.880 0.920 0.960 0.080 0.340 1.840 0.760 1.080 58 5.90
D pc_volume 489 6 0.221 0.255 0.292 0.071 0.087 0.462 0.115 0.398 27 5.52
B temperature 980 3 65.200 65.600 66.200 1.000 64.000 75.800 63.700 67.700 51 5.20
C oxygen_filler 252 3 0.022 0.034 0.072 0.050 0.003 0.238 -0.054 0.148 13 5.16
D pressure_vacuum 495 0 -5.600 -5.400 -5.000 0.600 -6.600 -3.600 -6.500 -4.100 25 5.05
B psc_co2 963 20 0.020 0.040 0.070 0.050 0.000 0.240 -0.055 0.145 48 4.98
A temperature 223 2 65.200 65.800 66.600 1.400 64.200 76.200 63.100 68.700 11 4.93
D carb_rel 495 0 5.580 5.600 5.640 0.060 4.960 5.780 5.490 5.730 24 4.85
A pressure_vacuum 225 0 -5.600 -5.200 -5.000 0.600 -6.200 -3.800 -6.500 -4.100 10 4.44
D air_pressurer 495 0 142.200 142.600 142.800 0.600 141.200 147.400 141.300 143.700 22 4.44
C fill_ounces 251 4 23.940 23.980 24.040 0.100 23.693 24.313 23.790 24.190 11 4.38
D temperature 493 2 64.600 65.200 66.000 1.400 63.600 73.600 62.500 68.100 20 4.06
A balling_lvl 225 0 3.040 3.100 3.200 0.160 1.440 3.400 2.800 3.440 9 4.00
A fill_pressure 225 0 46.000 47.400 50.200 4.200 36.200 60.400 39.700 56.500 9 4.00
C pc_volume 251 4 0.258 0.290 0.322 0.064 0.103 0.463 0.163 0.417 10 3.98
C fill_pressure 252 3 46.000 48.000 50.200 4.200 39.200 58.800 39.700 56.500 10 3.97
A psc 220 5 0.046 0.072 0.102 0.056 0.006 0.260 -0.038 0.186 8 3.64
D psc_co2 489 6 0.020 0.040 0.060 0.040 0.000 0.240 -0.040 0.120 17 3.48
C temperature 253 2 65.800 66.600 67.200 1.400 64.000 76.000 63.700 69.300 8 3.16
B balling_lvl 983 0 1.360 1.400 1.440 0.080 0.000 3.440 1.240 1.560 31 3.15
A carb_rel 224 1 5.460 5.520 5.560 0.100 5.120 6.060 5.310 5.710 7 3.12
B balling 983 0 1.398 1.498 1.596 0.198 0.346 3.864 1.101 1.893 30 3.05
D carb_temp 492 3 138.800 140.700 143.250 4.450 129.200 153.800 132.125 149.925 15 3.05
B pc_volume 969 14 0.247 0.277 0.315 0.068 0.109 0.471 0.145 0.417 28 2.89
B psc_fill 974 9 0.120 0.180 0.260 0.140 0.000 0.620 -0.090 0.470 28 2.87
D carb_volume 492 3 5.473 5.513 5.540 0.067 5.260 5.700 5.373 5.640 14 2.85
D filler_level 491 4 100.800 117.400 120.200 19.400 55.800 151.800 71.700 149.300 14 2.85
A fill_ounces 219 6 23.933 23.980 24.033 0.100 23.753 24.313 23.783 24.183 6 2.74
D alch_rel 495 0 7.680 7.720 7.740 0.060 6.500 8.200 7.590 7.830 13 2.63
B carb_pressure 976 7 65.000 66.800 69.000 4.000 57.000 79.400 59.000 75.000 25 2.56
D psc 490 5 0.042 0.074 0.106 0.064 0.002 0.264 -0.053 0.201 12 2.45
C psc_fill 253 2 0.100 0.180 0.280 0.180 0.020 0.600 -0.170 0.550 6 2.37
B psc 971 12 0.052 0.080 0.114 0.062 0.004 0.268 -0.041 0.207 22 2.27
A psc_co2 222 3 0.020 0.040 0.080 0.060 0.000 0.240 -0.070 0.170 5 2.25
A psc_fill 223 2 0.100 0.180 0.260 0.160 0.000 0.580 -0.140 0.500 5 2.24
A balling 225 0 3.092 3.194 3.390 0.298 1.298 3.764 2.645 3.837 5 2.22
B fill_pressure 980 3 46.000 47.000 50.000 4.000 36.000 59.200 40.000 56.000 21 2.14
D carb_pressure 484 11 68.750 70.600 72.400 3.650 62.600 78.800 63.275 77.875 10 2.07
D fill_ounces 490 5 23.907 23.960 24.007 0.100 23.693 24.320 23.757 24.157 10 2.04
D psc_fill 491 4 0.100 0.180 0.260 0.160 0.000 0.600 -0.140 0.500 10 2.04
D balling 495 0 3.390 3.490 3.688 0.298 0.998 3.984 2.943 4.135 10 2.02
D balling_lvl 495 0 3.200 3.280 3.360 0.160 0.900 3.660 2.960 3.600 10 2.02
C carb_volume 254 1 5.260 5.300 5.340 0.080 5.047 5.473 5.140 5.460 5 1.97
C balling 255 0 1.472 1.646 1.796 0.324 0.160 3.588 0.986 2.282 5 1.96
B alch_rel 980 3 6.520 6.540 6.560 0.040 6.400 8.560 6.460 6.620 19 1.94
A hyd_pressure4 216 9 94.000 102.000 112.000 18.000 62.000 142.000 67.000 139.000 4 1.85
A carb_volume 225 0 5.373 5.433 5.467 0.093 5.040 5.627 5.233 5.607 4 1.78
B carb_volume 979 4 5.280 5.313 5.347 0.067 5.080 5.560 5.180 5.447 16 1.63
C hyd_pressure4 253 2 94.000 102.000 110.000 16.000 74.000 140.000 70.000 134.000 4 1.58
C psc 253 2 0.052 0.082 0.122 0.070 0.002 0.256 -0.053 0.227 4 1.58
B carb_rel 981 2 5.320 5.360 5.400 0.080 5.100 5.840 5.200 5.520 15 1.53
A pc_volume 220 5 0.223 0.255 0.308 0.085 0.111 0.456 0.095 0.436 3 1.36
A carb_pressure 224 1 67.200 69.200 71.200 4.000 60.600 78.200 61.200 77.200 3 1.34
B carb_temp 973 10 138.200 140.600 143.800 5.600 128.600 154.000 129.800 152.200 13 1.34
B fill_ounces 972 11 23.920 23.980 24.033 0.113 23.633 24.280 23.750 24.203 13 1.34
C pressure_vacuum 255 0 -5.700 -5.400 -5.000 0.700 -6.200 -3.800 -6.750 -3.950 3 1.18
D carb_pressure1 488 7 119.000 123.400 125.800 6.800 105.600 140.200 108.800 136.000 5 1.02
A carb_pressure1 222 3 119.800 123.400 125.600 5.800 111.000 134.600 111.100 134.300 2 0.90
C carb_pressure 254 1 64.800 67.200 69.200 4.400 59.600 78.200 58.200 75.800 2 0.79
C balling_lvl 255 0 1.400 1.520 1.620 0.220 0.000 3.380 1.070 1.950 2 0.78
B carb_pressure1 971 12 118.600 123.000 125.400 6.800 107.000 139.400 108.400 135.600 7 0.72
A hyd_pressure1 225 0 0.000 10.400 19.400 19.400 -0.800 53.800 -29.100 48.500 1 0.44
C carb_pressure1 253 2 118.600 123.200 125.200 6.600 109.600 138.400 108.700 135.100 1 0.40
C carb_temp 253 2 138.600 141.600 144.000 5.400 130.600 153.600 130.500 152.100 1 0.40
C alch_rel 254 1 6.520 6.560 6.600 0.080 6.400 7.660 6.400 6.720 1 0.39
B hyd_pressure1 981 2 0.000 12.000 20.200 20.200 -0.800 58.000 -30.300 50.500 2 0.20
brand_outlier_summary |>
  filter(outlier_percent >= 5) |>
  arrange(desc(outlier_percent)) |>
  kable(
    caption = "Brand-Level Outlier Checks With At Least 5 Percent Outliers",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Brand-Level Outlier Checks With At Least 5 Percent Outliers
brand_code variable count missing_count q1 median q3 iqr_value min_value max_value lower_bound upper_bound outlier_count outlier_percent
D pressure_setpoint 493 2 46.000 46.000 46.000 0.000 44.000 52.000 46.000 46.000 159 32.25
D bowl_setpoint 495 0 110.000 120.000 120.000 10.000 70.000 140.000 95.000 135.000 110 22.22
C filler_speed 249 6 3892.000 3980.000 3996.000 104.000 1004.000 4030.000 3736.000 4152.000 48 19.28
D filler_speed 485 10 3892.000 3982.000 3996.000 104.000 998.000 4026.000 3736.000 4152.000 91 18.76
A filler_speed 217 8 3806.000 3984.000 3998.000 192.000 1004.000 4024.000 3518.000 4286.000 34 15.67
B filler_speed 966 17 3890.000 3982.000 3998.000 108.000 1002.000 4020.000 3728.000 4160.000 147 15.22
C mfr 233 22 709.800 721.600 729.800 20.000 76.800 835.400 679.800 759.800 31 13.30
C filler_level 254 1 108.250 118.100 120.000 11.750 69.400 150.800 90.625 137.625 32 12.60
C bowl_setpoint 255 0 110.000 120.000 120.000 10.000 70.000 130.000 95.000 135.000 32 12.55
D hyd_pressure4 494 1 78.000 80.000 84.000 6.000 62.000 140.000 69.000 93.000 61 12.35
B air_pressurer 983 0 142.200 142.600 143.000 0.800 140.800 148.200 141.000 144.200 120 12.21
A mfr 194 31 709.200 723.600 731.000 21.800 31.400 868.600 676.500 763.700 23 11.86
D mfr 453 42 707.400 724.800 732.200 24.800 112.600 792.400 670.200 769.400 51 11.26
C air_pressurer 255 0 142.200 142.400 142.800 0.600 141.200 147.200 141.300 143.700 28 10.98
D density 495 0 1.660 1.720 1.780 0.120 0.720 1.920 1.480 1.960 49 9.90
C density 255 0 0.860 0.940 1.020 0.160 0.240 1.760 0.620 1.260 25 9.80
D fill_pressure 491 4 45.800 46.200 48.000 2.200 37.800 60.000 42.500 51.300 48 9.78
C carb_rel 252 3 5.320 5.360 5.380 0.060 5.020 5.580 5.230 5.470 24 9.52
A oxygen_filler 223 2 0.022 0.032 0.048 0.026 0.002 0.400 -0.017 0.088 21 9.42
B hyd_pressure4 975 8 96.000 98.000 102.000 6.000 78.000 138.000 87.000 111.000 90 9.23
D oxygen_filler 494 1 0.019 0.033 0.052 0.033 0.002 0.318 -0.030 0.101 45 9.11
A density 225 0 1.540 1.600 1.660 0.120 0.840 1.760 1.360 1.840 20 8.89
A air_pressurer 225 0 142.400 142.600 143.000 0.600 141.000 147.000 141.500 143.900 18 8.00
C psc_co2 250 5 0.040 0.040 0.080 0.040 0.000 0.240 -0.020 0.140 20 8.00
B mfr 914 69 706.200 725.000 730.800 24.600 95.400 840.000 669.300 767.700 70 7.66
A alch_rel 224 1 7.100 7.120 7.160 0.060 6.520 7.760 7.010 7.250 16 7.14
B oxygen_filler 981 2 0.022 0.034 0.064 0.042 0.002 0.400 -0.041 0.127 60 6.12
B density 983 0 0.880 0.920 0.960 0.080 0.340 1.840 0.760 1.080 58 5.90
D pc_volume 489 6 0.221 0.255 0.292 0.071 0.087 0.462 0.115 0.398 27 5.52
B temperature 980 3 65.200 65.600 66.200 1.000 64.000 75.800 63.700 67.700 51 5.20
C oxygen_filler 252 3 0.022 0.034 0.072 0.050 0.003 0.238 -0.054 0.148 13 5.16
D pressure_vacuum 495 0 -5.600 -5.400 -5.000 0.600 -6.600 -3.600 -6.500 -4.100 25 5.05
negative_values_by_brand <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    count = sum(!is.na(value)),
    negative_count = sum(value < 0, na.rm = TRUE),
    negative_percent = round(negative_count / count * 100, 2),
    min_value = min(value, na.rm = TRUE),
    max_value = max(value, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(negative_count > 0) |>
  arrange(desc(negative_percent))

negative_values_by_brand |>
  kable(
    caption = "Negative Values by Brand Code",
    digits = 3
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Negative Values by Brand Code
brand_code variable count negative_count negative_percent min_value max_value
A pressure_vacuum 225 225 100.00 -6.2 -3.8
B pressure_vacuum 983 983 100.00 -6.4 -3.6
C pressure_vacuum 255 255 100.00 -6.2 -3.8
D pressure_vacuum 495 495 100.00 -6.6 -3.6
B mnf_flow 983 476 48.42 -100.2 213.4
C mnf_flow 255 118 46.27 -100.2 182.0
D mnf_flow 495 222 44.85 -100.2 204.6
A mnf_flow 225 85 37.78 -100.2 216.2
C hyd_pressure3 255 8 3.14 -1.2 43.8
A hyd_pressure3 224 7 3.12 -1.2 43.8
B hyd_pressure3 980 27 2.76 -1.2 45.6
A hyd_pressure1 225 6 2.67 -0.8 53.8
D hyd_pressure3 488 13 2.66 -1.2 50.0
C hyd_pressure1 255 5 1.96 -0.8 43.2
D hyd_pressure1 489 8 1.64 -0.8 47.0
B hyd_pressure1 981 12 1.22 -0.8 58.0

Outliers Compared With PH

After finding outliers, I compare them with PH. This helps answer whether unusual predictor values are connected to unusual PH values.

If outlier rows have different PH from non-outlier rows, then removing them may remove important production information. Because of this, I do not want to delete outliers only because they are statistically unusual.

train_features_with_ph <- train_features_raw |>
  mutate(ph = train_ph)

Now I will create outlier flags within each brand group:

brand_outlier_ph_data <- train_features_with_ph |>
  mutate(row_id = row_number()) |>
  filter(!is.na(brand_code)) |>
  select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  mutate(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    is_outlier = value < lower_bound | value > upper_bound
  ) |>
  ungroup()

This will summarize PH for outlier vs non-outlier rows:

outlier_ph_summary <- brand_outlier_ph_data |>
  filter(!is.na(value)) |>
  group_by(variable, is_outlier) |>
  summarise(
    count = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = is_outlier,
    values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
    names_prefix = "outlier_"
  ) |>
  mutate(
    mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
    abs_mean_ph_difference = abs(mean_ph_difference)
  ) |>
  arrange(desc(abs_mean_ph_difference))

outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 15) |>
  kable(
    caption = "Top PH Differences for Outlier Rows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Top PH Differences for Outlier Rows
variable count_outlier_FALSE count_outlier_TRUE mean_ph_outlier_FALSE mean_ph_outlier_TRUE median_ph_outlier_FALSE median_ph_outlier_TRUE sd_ph_outlier_FALSE sd_ph_outlier_TRUE min_ph_outlier_FALSE min_ph_outlier_TRUE max_ph_outlier_FALSE max_ph_outlier_TRUE mean_ph_difference abs_mean_ph_difference
carb_pressure1 1919 15 8.5497 8.6680 8.56 8.62 0.1701 0.2733 7.88 8.38 8.94 9.36 0.1183 0.1183
filler_level 1900 46 8.5503 8.4365 8.56 8.45 0.1720 0.1069 7.88 8.22 8.94 8.76 -0.1138 0.1138
carb_rel 1882 70 8.5513 8.4446 8.56 8.46 0.1690 0.2020 7.88 7.90 8.94 8.94 -0.1068 0.1068
temperature 1859 90 8.5495 8.4969 8.56 8.50 0.1703 0.1853 7.88 8.08 8.94 8.92 -0.0527 0.0527
carb_volume 1911 39 8.5488 8.4964 8.54 8.52 0.1721 0.1887 7.88 8.06 9.36 8.86 -0.0524 0.0524
oxygen_filler 1811 139 8.5437 8.5957 8.54 8.60 0.1724 0.1642 7.88 8.20 8.94 9.36 0.0520 0.0520
pressure_setpoint 1790 159 8.5452 8.5877 8.54 8.60 0.1760 0.1156 7.88 8.24 9.36 8.84 0.0425 0.0425
pressure_vacuum 1920 38 8.5487 8.5068 8.56 8.46 0.1727 0.1522 7.88 8.26 9.36 8.78 -0.0419 0.0419
mfr 1619 175 8.5509 8.5123 8.56 8.50 0.1702 0.1675 7.88 8.08 8.94 8.92 -0.0386 0.0386
fill_ounces 1892 40 8.5484 8.5145 8.54 8.55 0.1722 0.2026 7.88 8.02 9.36 8.86 -0.0339 0.0339
hyd_pressure4 1779 159 8.5448 8.5780 8.54 8.58 0.1700 0.1873 7.88 8.02 8.94 8.94 0.0332 0.0332
psc_co2 1834 90 8.5492 8.5169 8.56 8.51 0.1713 0.1875 7.88 8.08 9.36 8.94 -0.0324 0.0324
psc_fill 1892 49 8.5470 8.5780 8.54 8.60 0.1715 0.2126 7.88 8.12 9.36 8.94 0.0310 0.0310
balling_lvl 1906 52 8.5471 8.5777 8.54 8.58 0.1708 0.2228 7.88 8.10 8.94 9.36 0.0306 0.0306
fill_pressure 1860 88 8.5488 8.5189 8.56 8.50 0.1724 0.1533 7.88 8.22 8.94 8.88 -0.0299 0.0299
top_outlier_ph_vars <- outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 6) |>
  pull(variable)

brand_outlier_ph_data |>
  filter(variable %in% top_outlier_ph_vars) |>
  mutate(
    outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier")
  ) |>
  ggplot(
    aes(x = outlier_status, y = ph)
  ) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_x") +
  labs(
    title = "PH Distribution for Outlier vs Non-Outlier Rows",
    x = "Outlier Status",
    y = "PH"
  ) +
  theme_minimal()

Brand-Level Outliers Compared With PH

Since outliers were identified within each brand_code group, I also compare them with PH within each brand_code group. This is more logical than combining all brands together, because each brand may have different normal production behavior.

This check helps me see whether outlier values for a specific brand are connected with different PH values.

train_features_with_ph <- train_features_raw |>
  mutate(ph = train_ph)
brand_outlier_ph_data <- train_features_with_ph |>
  mutate(row_id = row_number()) |>
  filter(!is.na(brand_code)) |>
  select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
  pivot_longer(
    cols = all_of(numeric_vars_raw),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  mutate(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_bound = q1 - 1.5 * iqr_value,
    upper_bound = q3 + 1.5 * iqr_value,
    is_outlier = value < lower_bound | value > upper_bound
  ) |>
  ungroup()
brand_outlier_ph_summary <- brand_outlier_ph_data |>
  filter(!is.na(value)) |>
  group_by(brand_code, variable, is_outlier) |>
  summarise(
    count = n(),
    mean_ph = mean(ph, na.rm = TRUE),
    median_ph = median(ph, na.rm = TRUE),
    sd_ph = sd(ph, na.rm = TRUE),
    min_ph = min(ph, na.rm = TRUE),
    max_ph = max(ph, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = is_outlier,
    values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
    names_prefix = "outlier_"
  ) |>
  mutate(
    mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
    abs_mean_ph_difference = abs(mean_ph_difference)
  ) |>
  arrange(desc(abs_mean_ph_difference))

top_brand_outlier_ph_differences <- brand_outlier_ph_summary |>
  filter(count_outlier_TRUE >= 10) |>
  slice_max(abs_mean_ph_difference, n = 20)

top_brand_outlier_ph_differences |>
  kable(
    caption = "Top PH Differences for Brand-Level Outlier Rows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed")
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Top PH Differences for Brand-Level Outlier Rows
brand_code variable count_outlier_FALSE count_outlier_TRUE mean_ph_outlier_FALSE mean_ph_outlier_TRUE median_ph_outlier_FALSE median_ph_outlier_TRUE sd_ph_outlier_FALSE sd_ph_outlier_TRUE min_ph_outlier_FALSE min_ph_outlier_TRUE max_ph_outlier_FALSE max_ph_outlier_TRUE mean_ph_difference abs_mean_ph_difference
C oxygen_filler 239 13 8.4049 8.5892 8.42 8.58 0.1675 0.2872 7.88 8.26 8.80 9.36 0.1844 0.1844
C carb_rel 228 24 8.4211 8.3117 8.43 8.33 0.1630 0.1910 7.88 7.90 8.80 8.62 -0.1095 0.1095
B temperature 929 51 8.5727 8.4639 8.56 8.44 0.1659 0.1698 8.00 8.18 8.94 8.86 -0.1088 0.1088
C density 230 25 8.4277 8.3200 8.44 8.26 0.1558 0.3185 7.90 7.88 8.80 9.36 -0.1077 0.1077
C filler_speed 201 48 8.3945 8.4975 8.42 8.51 0.1612 0.1888 7.88 8.02 8.80 8.80 0.1030 0.1030
B carb_temp 960 13 8.5688 8.4723 8.56 8.46 0.1680 0.1372 8.00 8.24 8.94 8.80 -0.0965 0.0965
D carb_rel 471 24 8.6061 8.5217 8.62 8.55 0.1363 0.1487 8.24 8.16 8.92 8.78 -0.0844 0.0844
B carb_volume 963 16 8.5689 8.4875 8.56 8.48 0.1677 0.1805 8.00 8.08 8.94 8.80 -0.0814 0.0814
A mfr 171 23 8.4916 8.4130 8.52 8.44 0.1563 0.1784 8.06 8.12 8.86 8.74 -0.0785 0.0785
A filler_speed 183 34 8.4833 8.5524 8.52 8.59 0.1574 0.1773 8.06 8.20 8.86 8.86 0.0691 0.0691
A oxygen_filler 202 21 8.4830 8.5514 8.50 8.58 0.1607 0.1612 8.06 8.20 8.86 8.86 0.0685 0.0685
B psc_fill 946 28 8.5661 8.6329 8.56 8.71 0.1674 0.1813 8.00 8.26 8.94 8.94 0.0668 0.0668
B balling_lvl 952 31 8.5656 8.6316 8.56 8.68 0.1675 0.1716 8.00 8.22 8.94 8.94 0.0660 0.0660
D psc_fill 481 10 8.6001 8.6660 8.60 8.65 0.1381 0.1310 8.16 8.50 8.92 8.92 0.0659 0.0659
D fill_pressure 443 48 8.6086 8.5446 8.62 8.52 0.1377 0.1286 8.16 8.24 8.92 8.82 -0.0640 0.0640
D filler_level 477 14 8.6044 8.5429 8.62 8.54 0.1386 0.0938 8.16 8.40 8.92 8.76 -0.0615 0.0615
C fill_pressure 242 10 8.4097 8.4700 8.42 8.49 0.1714 0.1367 7.88 8.26 8.80 8.62 0.0603 0.0603
B fill_ounces 959 13 8.5677 8.5077 8.56 8.52 0.1681 0.1916 8.00 8.18 8.94 8.80 -0.0600 0.0600
D balling_lvl 485 10 8.6032 8.5440 8.62 8.56 0.1368 0.1849 8.24 8.16 8.92 8.78 -0.0592 0.0592
D psc 478 12 8.6030 8.5450 8.62 8.55 0.1388 0.1138 8.16 8.34 8.92 8.70 -0.0580 0.0580

The brand-level PH comparison shows that outliers do not behave the same way for every brand.

  • For Brand C, oxygen_filler outlier rows had average PH of 8.5892, while non-outlier rows had average PH of 8.4049.
    This is a difference of about +0.18 PH. Since PH values in this dataset do not vary very widely, this difference is meaningful. It suggests that high or unusual oxygen_filler values may be connected with higher PH for Brand C.

  • For Brand C, carb_rel outlier rows had average PH of 8.3117, while non-outlier rows had average PH of 8.4211.
    This is a difference of about -0.11 PH. This means unusual carb_rel values for Brand C may be connected with lower PH.

  • For Brand B, temperature outlier rows had average PH of 8.4639, while non-outlier rows had average PH of 8.5727.
    This is a difference of about -0.11 PH. This suggests that unusual temperature values may matter for Brand B and should not be removed without thinking.

  • For Brand A, mfr outlier rows had average PH of 8.4130, while non-outlier rows had average PH of 8.4916.
    This is a smaller difference, about -0.08 PH, but it still shows that unusual mfr values may have some relationship with PH.

These examples show that some outliers may contain useful production information. If I remove all outliers without proper analysis, I might remove cases where the process behaves differently and PH changes.

For this project, I will use brand-level capping for selected continuous variables. This means the outlier limits are calculated separately inside each Brand Code group, not from the full dataset. This is more logical because each brand may have its own normal operating range.

I will not cap setting-like or grouped variables such as pressure_setpoint, bowl_setpoint, air_pressurer, psc_co2, and pressure_vacuum. These variables look like they have fixed or grouped values by design, so regular IQR outlier rules may not make sense for them.

Overall, I do not treat outliers as simple errors. I treat them as unusual production values that need careful handling. Capping will be documented as a preprocessing step, and later model performance can show whether this treatment helps or hurts prediction.

vars_excluded_from_capping <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)
top_brand_outlier_vars <- top_brand_outlier_ph_differences |>
  distinct(brand_code, variable) |>
  slice_head(n = 8)

# brand_outlier_ph_data |>
#   semi_join(
#     top_brand_outlier_vars,
#     by = c("brand_code", "variable")
#   ) |>
#   mutate(
#     outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier"),
#     brand_variable = paste("Brand", brand_code, "-", variable)
#   ) |>
#   ggplot(
#     aes(x = outlier_status, y = ph)
#   ) +
#   geom_boxplot() +
#   facet_wrap(~ brand_variable, scales = "free_x") +
#   labs(
#     title = "PH Distribution for Brand-Level Outlier vs Non-Outlier Rows",
#     x = "Outlier Status",
#     y = "PH"
#   ) +
#   theme_minimal()

Brand-Level Outlier Capping

Based on the outlier review, I apply capping to selected continuous numeric predictors. The caps are calculated from the training data only and separately within each Brand Code group.

This avoids using validation or final evaluation data to make preprocessing decisions. It also keeps the capping process more fair because each brand can have different normal operating ranges.

I do not cap setting-like or grouped variables because IQR limits may not make sense for them. In the first version, I exclude pressure_setpoint, bowl_setpoint, air_pressurer, psc_co2, and pressure_vacuum from capping.

pressure_setpoint and bowl_setpoint both look like process settings with limited possible values. air_pressurer and psc_co2 have grouped or repeated values, so regular IQR capping may over-treat them. pressure_vacuum is excluded because negative values appear to be part of how this variable is measured.

vars_excluded_from_capping <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
  
  numeric_vars <- data_x |>
    select(where(is.numeric)) |>
    names()
  
  numeric_vars[
    !numeric_vars %in% excluded_vars
  ]
}
create_capping_limits <- function(train_x, numeric_vars_to_cap, brand_col = "brand_code") {
  
  brand_cap_limits <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    select(all_of(brand_col), all_of(numeric_vars_to_cap)) |>
    pivot_longer(
      cols = all_of(numeric_vars_to_cap),
      names_to = "variable",
      values_to = "value"
    ) |>
    group_by(.data[[brand_col]], variable) |>
    summarise(
      q1 = quantile(value, 0.25, na.rm = TRUE),
      q3 = quantile(value, 0.75, na.rm = TRUE),
      iqr_value = q3 - q1,
      lower_cap = q1 - 1.5 * iqr_value,
      upper_cap = q3 + 1.5 * iqr_value,
      .groups = "drop"
    ) |>
    rename(brand_code = all_of(brand_col))
  
  global_cap_limits <- train_x |>
    select(all_of(numeric_vars_to_cap)) |>
    pivot_longer(
      cols = all_of(numeric_vars_to_cap),
      names_to = "variable",
      values_to = "value"
    ) |>
    group_by(variable) |>
    summarise(
      q1 = quantile(value, 0.25, na.rm = TRUE),
      q3 = quantile(value, 0.75, na.rm = TRUE),
      iqr_value = q3 - q1,
      lower_cap = q1 - 1.5 * iqr_value,
      upper_cap = q3 + 1.5 * iqr_value,
      .groups = "drop"
    )
  
  list(
    brand_cap_limits = brand_cap_limits,
    global_cap_limits = global_cap_limits
  )
}
apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
  
  capped_data <- data_x
  
  brand_cap_limits <- cap_limits$brand_cap_limits
  global_cap_limits <- cap_limits$global_cap_limits
  
  for (var in numeric_vars_to_cap) {
    
    for (brand in unique(brand_cap_limits$brand_code)) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == brand,
          variable == var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[var]][rows_to_cap] <- pmin(
          pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
    
    global_caps <- global_cap_limits |>
      filter(variable == var)
    
    missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
    
    if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
      
      capped_data[[var]][missing_brand_rows] <- pmin(
        pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
        global_caps$upper_cap
      )
    }
  }
  
  capped_data
}
apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
  
  capped_data <- data_x
  
  brand_cap_limits <- cap_limits$brand_cap_limits
  global_cap_limits <- cap_limits$global_cap_limits
  
  for (var in numeric_vars_to_cap) {
    
    for (brand in unique(brand_cap_limits$brand_code)) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == brand,
          variable == var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[var]][rows_to_cap] <- pmin(
          pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
    
    global_caps <- global_cap_limits |>
      filter(variable == var)
    
    missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
    
    if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
      
      capped_data[[var]][missing_brand_rows] <- pmin(
        pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
        global_caps$upper_cap
      )
    }
  }
  
  capped_data
}
summarize_capping_by_brand <- function(original_x, capped_x, numeric_vars_to_cap, dataset_name) {
  
  map_dfr(
    numeric_vars_to_cap,
    function(var) {
      
      original_values <- original_x[[var]]
      capped_values <- capped_x[[var]]
      
      changed_values <- !is.na(original_values) &
        !is.na(capped_values) &
        original_values != capped_values
      
      tibble(
        dataset = dataset_name,
        variable = var,
        brand_code = original_x$brand_code,
        was_capped = changed_values
      ) |>
        mutate(
          brand_code = ifelse(
            is.na(brand_code),
            "Missing Brand",
            as.character(brand_code)
          )
        ) |>
        group_by(dataset, variable, brand_code) |>
        summarise(
          values_capped = sum(was_capped),
          .groups = "drop"
        )
    }
  ) |>
    pivot_wider(
      names_from = brand_code,
      values_from = values_capped,
      values_fill = 0
    ) |>
    mutate(
      Total = rowSums(
        across(
          where(is.numeric)
        )
      )
    )
}

First Capping Attempt

First, I run the capping process using the original excluded variable list. This helps me see how much the capping rule changes the data before deciding whether the excluded list needs to be updated.

vars_excluded_from_capping_v1 <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

numeric_vars_to_cap_v1 <- get_numeric_vars_to_cap(
  data_x = train_features_raw,
  excluded_vars = vars_excluded_from_capping_v1
)

numeric_vars_to_cap_v1
 [1] "carb_volume"    "fill_ounces"    "pc_volume"      "carb_pressure" 
 [5] "carb_temp"      "psc"            "psc_fill"       "mnf_flow"      
 [9] "carb_pressure1" "fill_pressure"  "hyd_pressure1"  "hyd_pressure2" 
[13] "hyd_pressure3"  "hyd_pressure4"  "filler_level"   "filler_speed"  
[17] "temperature"    "usage_cont"     "carb_flow"      "density"       
[21] "mfr"            "balling"        "oxygen_filler"  "alch_rel"      
[25] "carb_rel"       "balling_lvl"   
cap_limits_v1 <- create_capping_limits(
  train_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  brand_col = "brand_code"
)
train_features_capped_v1 <- apply_brand_caps(
  data_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)

valid_features_capped_v1 <- apply_brand_caps(
  data_x = valid_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)

eval_features_capped_v1 <- apply_brand_caps(
  data_x = eval_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v1,
  cap_limits = cap_limits_v1,
  brand_col = "brand_code"
)
get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
  
  numeric_vars <- data_x |>
    select(where(is.numeric)) |>
    names()
  
  numeric_vars[
    !numeric_vars %in% excluded_vars
  ]
}
capping_summary_v1 <- bind_rows(
  summarize_capping_by_brand(
    original_x = train_features_raw,
    capped_x = train_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Training"
  ),
  
  summarize_capping_by_brand(
    original_x = valid_features_raw,
    capped_x = valid_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Validation"
  ),
  
  summarize_capping_by_brand(
    original_x = eval_features_raw,
    capped_x = eval_features_capped_v1,
    numeric_vars_to_cap = numeric_vars_to_cap_v1,
    dataset_name = "Final Evaluation"
  )
) |>
  mutate(
    dataset = factor(
      dataset,
      levels = c("Training", "Validation", "Final Evaluation")
    )
  ) |>
  arrange(dataset, desc(Total))

capping_summary_v1 |>
  kable(
    caption = "First Capping Attempt: Values Capped by Brand Code",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
First Capping Attempt: Values Capped by Brand Code
dataset variable A B C D Missing Brand Total
Training filler_speed 34 147 48 91 28 348
Training mfr 23 70 31 51 19 194
Training hyd_pressure4 4 90 4 61 5 164
Training density 20 58 25 49 0 152
Training oxygen_filler 21 60 13 45 8 147
Training temperature 11 51 8 20 16 106
Training fill_pressure 9 21 10 48 7 95
Training pc_volume 3 28 10 27 13 81
Training carb_rel 7 15 24 24 0 70
Training alch_rel 16 19 1 13 5 54
Training balling_lvl 9 31 2 10 0 52
Training psc_fill 5 28 6 10 1 50
Training balling 5 30 5 10 0 50
Training psc 8 22 4 12 3 49
Training fill_ounces 6 13 11 10 7 47
Training filler_level 0 0 32 14 0 46
Training carb_pressure 3 25 2 10 2 42
Training carb_volume 4 16 5 14 1 40
Training carb_temp 0 13 1 15 2 31
Training carb_pressure1 2 7 1 5 2 17
Training hyd_pressure1 1 2 0 0 1 4
Training mnf_flow 0 0 0 0 0 0
Training hyd_pressure2 0 0 0 0 0 0
Training hyd_pressure3 0 0 0 0 0 0
Training usage_cont 0 0 0 0 0 0
Training carb_flow 0 0 0 0 0 0
Validation filler_speed 8 33 9 20 9 79
Validation mfr 7 24 6 14 7 58
Validation oxygen_filler 3 15 3 12 1 34
Validation hyd_pressure4 1 18 1 12 0 32
Validation temperature 2 19 2 3 2 28
Validation density 4 13 2 9 0 28
Validation carb_rel 1 3 10 5 1 20
Validation pc_volume 0 4 3 8 2 17
Validation psc_fill 1 9 1 5 1 17
Validation fill_pressure 2 8 3 4 0 17
Validation balling_lvl 3 9 1 3 0 16
Validation alch_rel 5 5 1 3 0 14
Validation carb_pressure 2 6 0 2 1 11
Validation filler_level 0 0 8 2 0 10
Validation balling 1 5 1 3 0 10
Validation carb_volume 1 7 0 1 0 9
Validation fill_ounces 0 5 1 1 1 8
Validation carb_temp 3 3 0 1 1 8
Validation psc 1 1 3 2 0 7
Validation carb_pressure1 0 1 0 0 2 3
Validation hyd_pressure1 0 1 0 0 0 1
Validation mnf_flow 0 0 0 0 0 0
Validation hyd_pressure2 0 0 0 0 0 0
Validation hyd_pressure3 0 0 0 0 0 0
Validation usage_cont 0 0 0 0 0 0
Validation carb_flow 0 0 0 0 0 0
Final Evaluation filler_speed 5 20 6 14 2 47
Final Evaluation hyd_pressure4 1 16 0 14 2 33
Final Evaluation mfr 6 10 6 8 2 32
Final Evaluation density 2 11 3 6 0 22
Final Evaluation temperature 2 8 0 6 4 20
Final Evaluation oxygen_filler 3 5 4 5 1 18
Final Evaluation fill_pressure 1 2 0 10 1 14
Final Evaluation balling 1 9 0 4 0 14
Final Evaluation psc 3 5 1 2 0 11
Final Evaluation carb_volume 2 5 1 1 0 9
Final Evaluation carb_rel 0 2 2 5 0 9
Final Evaluation pc_volume 2 2 0 4 0 8
Final Evaluation balling_lvl 2 5 0 1 0 8
Final Evaluation carb_pressure 2 3 0 1 0 6
Final Evaluation alch_rel 1 4 0 1 0 6
Final Evaluation carb_temp 1 2 0 1 0 4
Final Evaluation psc_fill 1 1 0 2 0 4
Final Evaluation fill_ounces 1 0 1 1 0 3
Final Evaluation filler_level 0 0 1 1 0 2
Final Evaluation carb_pressure1 1 0 0 0 0 1
Final Evaluation hyd_pressure1 1 0 0 0 0 1
Final Evaluation mnf_flow 0 0 0 0 0 0
Final Evaluation hyd_pressure2 0 0 0 0 0 0
Final Evaluation hyd_pressure3 0 0 0 0 0 0
Final Evaluation usage_cont 0 0 0 0 0 0
Final Evaluation carb_flow 0 0 0 0 0 0

The first capping attempt showed that some variables were changed a lot. The largest issue was filler_speed. In the training data, filler_speed had 348 capped values, which is a large number.

This made me review the earlier brand-level outlier table more carefully. filler_speed had high outlier percentages across all brands:

  • Brand C had about 19.28% outliers for filler_speed.
  • Brand D had about 18.76% outliers for filler_speed.
  • Brand A had about 15.67% outliers for filler_speed.
  • Brand B had about 15.22% outliers for filler_speed.

The low filler_speed values are around 998 to 1004, while most normal values are closer to 3900 to 4000. This may look like an error at first, but because the same pattern appears across all brands, it may represent a real slow-speed production state, startup/shutdown condition, or machine slowdown.

Because of this, I do not want to cap filler_speed. Capping these low values may hide a real condition that could be important for predicting PH.

The first capping attempt was still useful because it showed that the original capping rule was too aggressive for filler_speed. Based on this review, I update the excluded variable list and rerun the capping process.

Outlier Treatment Decision

vars_excluded_from_capping_v2 <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "air_pressurer",
  "psc_co2",
  "pressure_vacuum"
)

numeric_vars_to_cap_v2 <- get_numeric_vars_to_cap(
  data_x = train_features_raw,
  excluded_vars = vars_excluded_from_capping_v2
)

numeric_vars_to_cap_v2
 [1] "carb_volume"    "fill_ounces"    "pc_volume"      "carb_pressure" 
 [5] "carb_temp"      "psc"            "psc_fill"       "mnf_flow"      
 [9] "carb_pressure1" "fill_pressure"  "hyd_pressure1"  "hyd_pressure2" 
[13] "hyd_pressure3"  "hyd_pressure4"  "filler_level"   "filler_speed"  
[17] "temperature"    "usage_cont"     "carb_flow"      "density"       
[21] "mfr"            "balling"        "oxygen_filler"  "alch_rel"      
[25] "carb_rel"       "balling_lvl"   
ph_difference_cutoff <- 0.05
min_outlier_count_for_rule <- 10

cap_allowed_combos <- brand_outlier_ph_summary |>
  filter(
    count_outlier_TRUE >= min_outlier_count_for_rule,
    abs_mean_ph_difference < ph_difference_cutoff
  ) |>
  select(
    brand_code,
    variable,
    count_outlier_TRUE,
    mean_ph_difference,
    abs_mean_ph_difference
  ) |>
  arrange(abs_mean_ph_difference)

cap_allowed_combos |>
  kable(
    caption = "Brand-Variable Outlier Groups Allowed for Capping",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Brand-Variable Outlier Groups Allowed for Capping
brand_code variable count_outlier_TRUE mean_ph_difference abs_mean_ph_difference
D filler_speed 91 0.0015 0.0015
B filler_speed 147 0.0022 0.0022
C air_pressurer 28 0.0033 0.0033
B hyd_pressure4 90 0.0039 0.0039
B air_pressurer 120 -0.0059 0.0059
D psc_co2 17 0.0064 0.0064
D air_pressurer 22 -0.0097 0.0097
D hyd_pressure4 61 0.0140 0.0140
A density 20 0.0143 0.0143
C psc_co2 20 -0.0160 0.0160
B oxygen_filler 60 0.0166 0.0166
D carb_pressure 10 -0.0174 0.0174
D carb_volume 14 -0.0181 0.0181
B psc 22 -0.0202 0.0202
D balling 10 -0.0204 0.0204
D pressure_setpoint 159 -0.0215 0.0215
B density 58 -0.0254 0.0254
D density 49 0.0263 0.0263
C filler_level 32 -0.0268 0.0268
B psc_co2 48 -0.0277 0.0277
C bowl_setpoint 32 -0.0281 0.0281
B pc_volume 28 -0.0288 0.0288
B balling 30 0.0299 0.0299
A air_pressurer 18 -0.0300 0.0300
A pressure_vacuum 10 -0.0303 0.0303
B carb_rel 15 -0.0313 0.0313
B carb_pressure 25 -0.0333 0.0333
D temperature 20 -0.0333 0.0333
D mfr 51 -0.0334 0.0334
D bowl_setpoint 110 -0.0353 0.0353
D oxygen_filler 45 0.0369 0.0369
D fill_ounces 10 0.0386 0.0386
B mfr 70 -0.0437 0.0437
C pc_volume 10 0.0458 0.0458
C mfr 31 0.0462 0.0462
B alch_rel 19 0.0467 0.0467
D carb_temp 15 -0.0478 0.0478
C fill_ounces 11 0.0488 0.0488
brand_cap_limits_v2 <- train_features_raw |>
  filter(!is.na(brand_code)) |>
  select(brand_code, all_of(numeric_vars_to_cap_v2)) |>
  pivot_longer(
    cols = all_of(numeric_vars_to_cap_v2),
    names_to = "variable",
    values_to = "value"
  ) |>
  group_by(brand_code, variable) |>
  summarise(
    q1 = quantile(value, 0.25, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    iqr_value = q3 - q1,
    lower_cap = q1 - 1.5 * iqr_value,
    upper_cap = q3 + 1.5 * iqr_value,
    .groups = "drop"
  )
apply_brand_caps_ph_allowed <- function(
  data_x,
  numeric_vars_to_cap,
  brand_cap_limits,
  cap_allowed_combos,
  brand_col = "brand_code"
) {
  
  capped_data <- data_x
  
  for (i in seq_len(nrow(cap_allowed_combos))) {
    
    current_brand <- cap_allowed_combos$brand_code[i]
    current_var <- cap_allowed_combos$variable[i]
    
    if (current_var %in% numeric_vars_to_cap) {
      
      caps <- brand_cap_limits |>
        filter(
          brand_code == current_brand,
          variable == current_var
        )
      
      rows_to_cap <- which(capped_data[[brand_col]] == current_brand)
      
      if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
        
        capped_data[[current_var]][rows_to_cap] <- pmin(
          pmax(capped_data[[current_var]][rows_to_cap], caps$lower_cap),
          caps$upper_cap
        )
      }
    }
  }
  
  capped_data
}
train_features_capped <- apply_brand_caps_ph_allowed(
  data_x = train_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)

valid_features_capped <- apply_brand_caps_ph_allowed(
  data_x = valid_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)

eval_features_capped <- apply_brand_caps_ph_allowed(
  data_x = eval_features_raw,
  numeric_vars_to_cap = numeric_vars_to_cap_v2,
  brand_cap_limits = brand_cap_limits_v2,
  cap_allowed_combos = cap_allowed_combos,
  brand_col = "brand_code"
)
capping_summary_v2 <- bind_rows(
  summarize_capping_by_brand(
    original_x = train_features_raw,
    capped_x = train_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Training"
  ),
  
  summarize_capping_by_brand(
    original_x = valid_features_raw,
    capped_x = valid_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Validation"
  ),
  
  summarize_capping_by_brand(
    original_x = eval_features_raw,
    capped_x = eval_features_capped,
    numeric_vars_to_cap = numeric_vars_to_cap_v2,
    dataset_name = "Final Evaluation"
  )
) |>
  mutate(
    dataset = factor(
      dataset,
      levels = c("Training", "Validation", "Final Evaluation")
    )
  ) |>
  filter(Total > 0) |>
  arrange(dataset, desc(Total))

capping_summary_v2 |>
  slice_max(
    order_by = Total,
    n = 15,
    with_ties = FALSE
  ) |>
  kable(
    caption = "Top 15 Variables with Values Capped by Brand-Level PH-Difference Rule. 
    
                Counts of variables' values capped are in each Brand Code group",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )

Top 15 Variables with Values Capped by Brand-Level PH-Difference Rule.

            Counts of variables' values capped are in each Brand Code group</caption>
dataset variable A B C D Missing Brand Total
Training filler_speed 0 147 0 91 0 238
Training mfr 0 70 31 51 0 152
Training hyd_pressure4 0 90 0 61 0 151
Training density 20 58 0 49 0 127
Training oxygen_filler 0 60 0 45 0 105
Validation filler_speed 0 33 0 20 0 53
Validation mfr 0 24 6 14 0 44
Training balling 0 30 0 10 0 40
Training pc_volume 0 28 10 0 0 38
Training carb_pressure 0 25 0 10 0 35
Final Evaluation filler_speed 0 20 0 14 0 34
Training filler_level 0 0 32 0 0 32
Validation hyd_pressure4 0 18 0 12 0 30
Final Evaluation hyd_pressure4 0 16 0 14 0 30
Validation oxygen_filler 0 15 0 12 0 27

After reviewing the outlier-PH comparison, I updated the capping strategy to be selective by both Brand Code and variable.

The PH-difference value compares the average PH of outlier rows with the average PH of non-outlier rows within the same Brand Code and variable group. For example, if Brand B has outliers in filler_speed, I compare the average PH for Brand B rows where filler_speed is an outlier against the average PH for Brand B rows where filler_speed is not an outlier.

This helps separate outliers that may contain useful production signal from outliers that are more likely just extreme numeric values. If the PH difference is large, the outlier group may be related to changes in PH, so capping those values could remove useful information. If the PH difference is small, the outlier group does not appear to behave very differently in terms of PH, so capping is safer.

For this project, I will use averages difference of 0.05 PH as the cutoff. If a specific Brand Code and variable combination has an absolute PH difference below 0.05, capping is allowed for that group. If the absolute PH difference is 0.05 or larger, that group is not capped because those outliers may contain useful signal for predicting PH.

Capping will not be applied globally to an entire variable. It is applied only to the specific Brand Code value’s group and variable combination that meets the rule. For example, if Brand B filler_speed has a very small PH difference, then Brand B filler_speedoutliers can be capped. This does not mean filler_speed is automatically capped for Brands A, C, or D.

For example, filler_speed shows how the brand-level rule works. In the training data, filler_speed had 147 capped values for Brand B and 91 capped values for Brand D, for a total of 238 capped values.

dataset variable A B C D Missing Brand Total
Training filler_speed 0 147 0 91 0 238

This does not mean filler_speed was capped globally for every brand. It means the PH-difference rule allowed capping for specific brand groups where the outlier values did not show meaningful PH difference. In this case, Brand B and Brand D filler_speed values were capped, while Brand A and Brand C filler_speed values were not capped.

The final capped feature sets are then passed into the imputation workflow.

Missing Values Imputation Strategy

The preprocessing review showed that predictors’ missing values exist in the training, validation, and final evaluation data. Since many models (for example linear models) cannot use missing predictor values directly, I want to create completed feature sets before modeling. Also, i don’t want to remove rows with missing values.

The important point is that imputation must be learned from the training data only. Validation and final evaluation data should receive the same imputation process, but they should not be used to learn imputation values. This will help to avoid data leakage.

PH is not used during imputation because PH is the target variable. Also, PH is not available in the final evaluation data, so using PH during imputation would not match the real prediction situation.

I will compare multiple imputation approaches and later test them with different model types. The final imputation choice will be based on validation performance, not just assumptions.

Imputation Setup

Before creating imputed feature sets, I make sure categorical variables have consistent factor levels across training, validation, and final evaluation data.

brand_levels <- train_features_raw |>
  pull(brand_code) |>
  as.factor() |>
  levels()

train_features_raw <- train_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )

valid_features_raw <- valid_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )

eval_features_raw <- eval_features_raw |>
  mutate(
    brand_code = factor(brand_code, levels = brand_levels)
  )
pre_imputation_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  total_missing_values = c(
    sum(is.na(train_features_raw)),
    sum(is.na(valid_features_raw)),
    sum(is.na(eval_features_raw))
  )
)

pre_imputation_missing_check |>
  kable(
    caption = "Total Missing Predictor Values Before Imputation"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Total Missing Predictor Values Before Imputation
dataset total_missing_values
Training 657
Validation 155
Final Evaluation 107

This check confirms that missing predictor values are present before imputation. The next step is to create completed versions of the training, validation, and final evaluation feature sets.

get_numeric_predictors <- function(data_x) {
  
  data_x |>
    select(where(is.numeric)) |>
    names()
}
impute_numeric_by_brand_tree <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  fallback_recipe <- recipe(~ ., data = train_x) |>
    step_unknown(
      all_of(brand_col),
      new_level = "Unknown"
    ) |>
    step_impute_bag(
      all_of(numeric_vars),
      impute_with = imp_vars(all_predictors()),
      trees = trees
    )
  
  fallback_fit <- prep(
    fallback_recipe,
    training = train_x
  )
  
  train_fallback <- bake(
    fallback_fit,
    new_data = train_x
  )
  
  valid_fallback <- bake(
    fallback_fit,
    new_data = valid_x
  )
  
  eval_fallback <- bake(
    fallback_fit,
    new_data = eval_x
  )
  
  train_out[numeric_vars] <- train_fallback[numeric_vars]
  valid_out[numeric_vars] <- valid_fallback[numeric_vars]
  eval_out[numeric_vars] <- eval_fallback[numeric_vars]
  
  known_brands <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    distinct(.data[[brand_col]]) |>
    pull(.data[[brand_col]])
  
  for (brand in known_brands) {
    
    train_rows <- which(train_x[[brand_col]] == brand)
    valid_rows <- which(valid_x[[brand_col]] == brand)
    eval_rows <- which(eval_x[[brand_col]] == brand)
    
    brand_train_numeric <- train_x[train_rows, numeric_vars]
    
    if (nrow(brand_train_numeric) >= min_brand_rows) {
      
      brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
        step_impute_bag(
          all_numeric_predictors(),
          impute_with = imp_vars(all_numeric_predictors()),
          trees = trees
        )
      
      brand_fit <- prep(
        brand_recipe,
        training = brand_train_numeric
      )
      
      train_out[train_rows, numeric_vars] <- bake(
        brand_fit,
        new_data = train_x[train_rows, numeric_vars]
      )
      
      if (length(valid_rows) > 0) {
        valid_out[valid_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = valid_x[valid_rows, numeric_vars]
        )
      }
      
      if (length(eval_rows) > 0) {
        eval_out[eval_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = eval_x[eval_rows, numeric_vars]
        )
      }
    }
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars
  )
}
train_features_for_imputation <- train_features_capped
valid_features_for_imputation <- valid_features_capped
eval_features_for_imputation <- eval_features_capped
check_numeric_missing <- function(train_x, valid_x, eval_x, numeric_vars) {
  
  tibble(
    dataset = c("Training", "Validation", "Final Evaluation"),
    numeric_missing_count = c(
      sum(is.na(train_x |> select(all_of(numeric_vars)))),
      sum(is.na(valid_x |> select(all_of(numeric_vars)))),
      sum(is.na(eval_x |> select(all_of(numeric_vars))))
    )
  )
}
collect_imputed_values <- function(before_x, after_x, numeric_vars, brand_col = "brand_code") {
  
  imputed_values <- tibble()
  
  for (var in numeric_vars) {
    
    missing_rows <- which(is.na(before_x[[var]]))
    
    if (length(missing_rows) > 0) {
      
      temp_values <- tibble(
        variable = var,
        row_id = missing_rows,
        brand_code = before_x[[brand_col]][missing_rows],
        original_value = before_x[[var]][missing_rows],
        imputed_value = after_x[[var]][missing_rows]
      )
      
      imputed_values <- bind_rows(
        imputed_values,
        temp_values
      )
    }
  }
  
  imputed_values
}
check_imputed_ranges <- function(before_x, after_x, numeric_vars) {
  
  map_dfr(
    numeric_vars,
    function(var) {
      
      missing_rows <- which(is.na(before_x[[var]]))
      
      if (length(missing_rows) == 0) {
        return(NULL)
      }
      
      imputed_values <- after_x[[var]][missing_rows]
      original_values <- before_x[[var]]
      
      tibble(
        variable = var,
        original_min = min(original_values, na.rm = TRUE),
        original_max = max(original_values, na.rm = TRUE),
        min_imputed_value = min(imputed_values, na.rm = TRUE),
        max_imputed_value = max(imputed_values, na.rm = TRUE),
        imputed_below_original_min = any(imputed_values < original_min),
        imputed_above_original_max = any(imputed_values > original_max)
      )
      
    }
  )
}
predict_missing_brand_code <- function(data_x, brand_model, brand_col = "brand_code") {
  
  completed_x <- data_x
  
  missing_brand_rows <- which(is.na(completed_x[[brand_col]]))
  
  if (length(missing_brand_rows) > 0) {
    
    prediction_data <- completed_x[missing_brand_rows, ]
    
    predicted_brand <- predict(
      brand_model,
      data = prediction_data
    )$predictions
    
    completed_x[[brand_col]] <- as.character(completed_x[[brand_col]])
    completed_x[[brand_col]][missing_brand_rows] <- as.character(predicted_brand)
    completed_x[[brand_col]] <- as.factor(completed_x[[brand_col]])
  }
  
  completed_x
}

Hybrid Imputation Approach

After PH-guided brand-level capping, I wil test several imputation workflows. My goal here is not only to fill in missing values, but also to check which imputation strategy creates the best completed dataset for predictions.

The logic of this section is:

  1. Numeric predictors are imputed first.
  2. PCA is used to check whether brand_code has structure in the completed numeric space.
  3. If missing brand_code rows are located near known brand groups, KNN is a reasonable brand_codeimputation method.
  4. Other numeric imputation methods will be applied with the same KNN Brand Code imputation step.
  5. missForest will be apllied separately because it imputes numeric predictors and brand_code together.
  6. All created feature sets are tested using the same Random Forest validation workflow.

This makes the imputation methods comparison practical because the final goal is not just to predict and fill missing values, but to improve PH prediction performance.

get_numeric_predictors <- function(data_x) {
  data_x |>
    select(where(is.numeric)) |>
    names()
}

calculate_regression_metrics <- function(actual, predicted) {
  
  tibble(
    rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
    mae = mean(abs(actual - predicted), na.rm = TRUE),
    rsq = cor(actual, predicted, use = "complete.obs")^2
  )
}

rf_tune_grid_temp <- expand.grid(
  mtry = c(5, 10, 15, 20)
)

rf_control_temp <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3,
  verboseIter = FALSE
)

run_rf_validation_test <- function(
  train_x,
  valid_x,
  train_y,
  valid_y,
  workflow_name,
  project_seed = 123,
  rf_tune_grid = rf_tune_grid_temp,
  rf_control = rf_control_temp,
  ntree = 500
) {
  
  set.seed(project_seed)
  
  rf_time <- system.time({
    
    rf_model <- train(
      x = train_x,
      y = train_y,
      method = "rf",
      trControl = rf_control,
      tuneGrid = rf_tune_grid,
      metric = "RMSE",
      ntree = ntree,
      importance = TRUE
    )
    
  })
  
  rf_pred <- predict(
    rf_model,
    newdata = valid_x
  )
  
  calculate_regression_metrics(
    actual = valid_y,
    predicted = rf_pred
  ) |>
    mutate(
      workflow = workflow_name,
      training_rows = nrow(train_x),
      validation_rows = nrow(valid_x),
      best_mtry = rf_model$bestTune$mtry,
      training_time_seconds = as.numeric(rf_time["elapsed"])
    ) |>
    select(
      workflow,
      training_rows,
      validation_rows,
      best_mtry,
      rmse,
      mae,
      rsq,
      training_time_seconds
    )
}

Tree-Based Numeric Imputation

In this first workflow I will use is tree-based numeric imputation for the numeric predictors. At this step, brand_code is still allowed to be missing, because brand_code will be handled separately later with KNN.

The reason I impute numeric variables first is because I will need full numeric data in order to get right with both PCA and KNN . PCA usually struggle with missing numeric values and might deliver not correct results, and KNN is at best with complete numeric predictors to calculate distance between observations in a vector space.

So the order is:

  1. bagging-tree numeric imputation first

  2. PCA components explorations for all known and missing brands

  3. KNN brand_code imputation

This keeps the workflow more logical and reasonable. The following process starts with imputation of missing numeric predictors values, then I check associations of missing brand_code with known brand_code groups on PCA plot, and then I use KNN to impute brand_code.

impute_numeric_by_brand_tree <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  fallback_recipe <- recipe(~ ., data = train_x) |>
    step_unknown(
      all_of(brand_col),
      new_level = "Unknown"
    ) |>
    step_impute_bag(
      all_of(numeric_vars),
      impute_with = imp_vars(all_predictors()),
      trees = trees
    )
  
  fallback_fit <- prep(
    fallback_recipe,
    training = train_x
  )
  
  train_fallback <- bake(
    fallback_fit,
    new_data = train_x
  )
  
  valid_fallback <- bake(
    fallback_fit,
    new_data = valid_x
  )
  
  eval_fallback <- bake(
    fallback_fit,
    new_data = eval_x
  )
  
  train_out[numeric_vars] <- train_fallback[numeric_vars]
  valid_out[numeric_vars] <- valid_fallback[numeric_vars]
  eval_out[numeric_vars] <- eval_fallback[numeric_vars]
  
  known_brands <- train_x |>
    filter(!is.na(.data[[brand_col]])) |>
    distinct(.data[[brand_col]]) |>
    pull(.data[[brand_col]])
  
  for (brand in known_brands) {
    
    train_rows <- which(train_x[[brand_col]] == brand)
    valid_rows <- which(valid_x[[brand_col]] == brand)
    eval_rows <- which(eval_x[[brand_col]] == brand)
    
    brand_train_numeric <- train_x[train_rows, numeric_vars]
    
    if (nrow(brand_train_numeric) >= min_brand_rows) {
      
      brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
        step_impute_bag(
          all_numeric_predictors(),
          impute_with = imp_vars(all_numeric_predictors()),
          trees = trees
        )
      
      brand_fit <- prep(
        brand_recipe,
        training = brand_train_numeric
      )
      
      train_out[train_rows, numeric_vars] <- bake(
        brand_fit,
        new_data = train_x[train_rows, numeric_vars]
      )
      
      if (length(valid_rows) > 0) {
        valid_out[valid_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = valid_x[valid_rows, numeric_vars]
        )
      }
      
      if (length(eval_rows) > 0) {
        eval_out[eval_rows, numeric_vars] <- bake(
          brand_fit,
          new_data = eval_x[eval_rows, numeric_vars]
        )
      }
    }
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars
  )
}
brand_tree_numeric_imputed <- impute_numeric_by_brand_tree(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  trees = 25,
  min_brand_rows = 20
)

train_features_brand_tree_numeric <- brand_tree_numeric_imputed$train
valid_features_brand_tree_numeric <- brand_tree_numeric_imputed$valid
eval_features_brand_tree_numeric <- brand_tree_numeric_imputed$eval
tree_numeric_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  numeric_missing_values = c(
    train_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum(),
    valid_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum(),
    eval_features_brand_tree_numeric |>
      select(where(is.numeric)) |>
      is.na() |>
      sum()
  )
)

tree_numeric_missing_check |>
  kable(
    caption = "Missing Numeric Values After Tree-Based Numeric Imputation",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Missing Numeric Values After Tree-Based Numeric Imputation
dataset numeric_missing_values
Training 0
Validation 0
Final Evaluation 0

As we see, we do not have missing numeric variables after bagging tree imputation.

Now we can take a look at clucters of all brand_code, and see how rows with missing brand_code are spreaded between those clusters.

PCA Review for Missing Brand Code

Here I will use PCA to visually check if brand_code groups have clusters structure in the completed numeric process-variable space.

Here PCA is not used as the final model. It is just a diagnostic visualization. My goal is to see whether rows with missing brand_code are located near known brand_code groups. If missing brand_code rows are near known brand clusters, then KNN will be a reasonable imputation method since it assigns values for missing brand_code using nearby neighbor observations.

The PCA is created using previous bagging tree imputations of missing numeric predictors, because PCA cannot handle missing numeric values directly.

pca_numeric_vars <- get_numeric_predictors(train_features_brand_tree_numeric)

pca_source_data <- bind_rows(
  train_features_brand_tree_numeric |>
    mutate(dataset = "Training"),
  
  valid_features_brand_tree_numeric |>
    mutate(dataset = "Validation"),
  
  eval_features_brand_tree_numeric |>
    mutate(dataset = "Final Evaluation")
) |>
  mutate(
    brand_code_plot = ifelse(
      is.na(brand_code),
      "Missing Brand",
      as.character(brand_code)
    ),
    brand_code_plot = factor(
      brand_code_plot,
      levels = c("A", "B", "C", "D", "Missing Brand")
    )
  )

pca_input <- pca_source_data |>
  select(all_of(pca_numeric_vars))

pca_fit <- prcomp(
  pca_input,
  center = TRUE,
  scale. = TRUE
)

pca_scores <- pca_source_data |>
  select(dataset, brand_code, brand_code_plot) |>
  bind_cols(
    as_tibble(pca_fit$x[, 1:2])
  )

pca_variance <- summary(pca_fit)$importance[2, 1:2] * 100
brand_colors <- c(
  "A" = "#1f77b4",
  "B" = "#ff7f0e",
  "C" = "#2ca02c",
  "D" = "#9467bd",
  "Missing Brand" = "#d62728"
)

brand_shapes <- c(
  "A" = 16,
  "B" = 17,
  "C" = 15,
  "D" = 18,
  "Missing Brand" = 8
)

ggplot(
  pca_scores,
  aes(
    x = PC1,
    y = PC2,
    color = brand_code_plot,
    shape = brand_code_plot
  )
) +
  geom_point(
    alpha = 0.35,
    size = 2
  ) +
  geom_point(
    data = pca_scores |>
      filter(brand_code_plot == "Missing Brand"),
    aes(
      x = PC1,
      y = PC2,
      color = brand_code_plot,
      shape = brand_code_plot
    ),
    size = 4,
    alpha = 0.95
  ) +
  scale_color_manual(
    values = brand_colors
  ) +
  scale_shape_manual(
    values = brand_shapes
  ) +
  guides(
    color = guide_legend(
      override.aes = list(
        size = 5,
        alpha = 1
      )
    ),
    shape = guide_legend(
      override.aes = list(
        size = 5,
        alpha = 1
      )
    )
  ) +
  labs(
    title = "PCA Plot of Process Variables by Brand Code",
    subtitle = "Rows with missing Brand Code are highlighted with red star markers",
    x = paste0("Principal Component 1 (", round(pca_variance[1], 1), "% variance)"),
    y = paste0("Principal Component 2 (", round(pca_variance[2], 1), "% variance)"),
    color = "Brand Code",
    shape = "Brand Code"
  ) +
  coord_equal() +
  theme_classic(base_size = 14) +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 15, face = "bold"),
    legend.text = element_text(size = 14),
    legend.key.size = unit(1.1, "cm"),
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 13),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

The PCA plot shows that the known brand_code groups are not randomly mixed. Brand A and Brand D form clearer separated regions, while Brand B and Brand C overlap more on the left side of the plot.

At the same time, most of the observations with missing brand_code are located inside or very close to the Brand B and Brand C regions. This is strong suggestion that rows with missing brand_code are not just isolated unusual observations. At PCA plot, great majority of them look very similar to existing Brands B and C.

Since the missing rows are clustered near known brand_code groups, KNN distance-based imputationis is my preffered approach in this case.

KNN Brand Code Imputation

In this step, I will apply KNN only to impute missing brand_code values. The numeric predictors are already imputed before this step.

Before applying KNN approach, I center and scale the numeric predictors from the training data. Otherwise without scaling, variables with larger values can dominate the distance calculations and make the KNN result less reliable.

After scaling, KNN will look for the closest rows with known brand_code and use those nearby rows to predict the missing brand_code.

In the following code validate_knn_brand() function tests several K values for KNN using rows where brand_code is already known. It splits the known-brand rows into training and validation parts, centers and scales the numeric predictors, runs KNN, and returns the accuracy table, best K value, and confusion table.

The impute_brand_knn() function then uses the best K value to impute missing Brand Code in the training, validation, and final evaluation sets.

validate_knn_brand <- function(
  train_x,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = 123
  ) {
    
    numeric_vars <- get_numeric_predictors(train_x)
    
    known_data <- train_x |>
      filter(!is.na(.data[[brand_col]]))
    
    known_data[[brand_col]] <- as.factor(known_data[[brand_col]])
    
    set.seed(project_seed)
    
    validation_index <- createDataPartition(
      known_data[[brand_col]],
      p = 0.80,
      list = FALSE
    )
    
    brand_train <- known_data[validation_index, ]
    brand_valid <- known_data[-validation_index, ]
    
    train_numeric <- brand_train |>
      select(all_of(numeric_vars))
    
    valid_numeric <- brand_valid |>
      select(all_of(numeric_vars))
    
    train_center <- map_dbl(train_numeric, mean, na.rm = TRUE)
    train_scale <- map_dbl(train_numeric, sd, na.rm = TRUE)
    
    train_scale[train_scale == 0 | is.na(train_scale)] <- 1
    
    scale_with_train <- function(data_numeric) {
      sweep(
        sweep(
          as.matrix(data_numeric),
          2,
          train_center,
          "-"
        ),
        2,
        train_scale,
        "/"
      )
    }
    
    train_scaled <- scale_with_train(train_numeric)
    valid_scaled <- scale_with_train(valid_numeric)
    
    accuracy_results <- map_dfr(
      k_values,
      function(k_value) {
        
        pred_brand <- class::knn(
          train = train_scaled,
          test = valid_scaled,
          cl = brand_train[[brand_col]],
          k = k_value
        )
        
        tibble(
          k = k_value,
          accuracy = mean(pred_brand == brand_valid[[brand_col]]),
          test_rows = nrow(brand_valid)
        )
      }
    )
    
    best_k <- accuracy_results |>
      arrange(desc(accuracy), k) |>
      slice(1) |>
      pull(k)
    
    best_pred <- class::knn(
      train = train_scaled,
      test = valid_scaled,
      cl = brand_train[[brand_col]],
      k = best_k
    )
    
    confusion_table <- tibble(
      actual_brand = brand_valid[[brand_col]],
      predicted_brand = best_pred
    ) |>
      count(actual_brand, predicted_brand, name = "n")
    
    list(
      accuracy_results = accuracy_results,
      best_k = best_k,
      confusion_table = confusion_table
    )
}

impute_brand_knn <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  k = 5
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  known_train_rows <- which(!is.na(train_x[[brand_col]]))
  
  known_train_x <- train_x[known_train_rows, ]
  known_train_y <- as.factor(known_train_x[[brand_col]])
  
  brand_levels <- levels(known_train_y)
  
  train_numeric_known <- known_train_x |>
    select(all_of(numeric_vars))
  
  train_center <- map_dbl(train_numeric_known, mean, na.rm = TRUE)
  train_scale <- map_dbl(train_numeric_known, sd, na.rm = TRUE)
  
  train_scale[train_scale == 0 | is.na(train_scale)] <- 1
  
  scale_with_train <- function(data_x) {
    
    data_numeric <- data_x |>
      select(all_of(numeric_vars))
    
    sweep(
      sweep(
        as.matrix(data_numeric),
        2,
        train_center,
        "-"
      ),
      2,
      train_scale,
      "/"
    )
  }
  
  train_scaled_all <- scale_with_train(train_x)
  valid_scaled_all <- scale_with_train(valid_x)
  eval_scaled_all <- scale_with_train(eval_x)
  
  train_scaled_known <- train_scaled_all[known_train_rows, , drop = FALSE]
  
  impute_one_dataset <- function(data_out, scaled_all) {
    
    missing_rows <- which(is.na(data_out[[brand_col]]))
    
    if (length(missing_rows) > 0) {
      
      predicted_brand <- class::knn(
        train = train_scaled_known,
        test = scaled_all[missing_rows, , drop = FALSE],
        cl = known_train_y,
        k = k
      )
      
      data_out[[brand_col]][missing_rows] <- as.character(predicted_brand)
    }
    
    data_out[[brand_col]] <- factor(
      data_out[[brand_col]],
      levels = brand_levels
    )
    
    data_out
  }
  
  train_out <- impute_one_dataset(train_out, train_scaled_all)
  valid_out <- impute_one_dataset(valid_out, valid_scaled_all)
  eval_out <- impute_one_dataset(eval_out, eval_scaled_all)
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out
  )
}
knn_brand_validation <- validate_knn_brand(
  train_x = train_features_brand_tree_numeric,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = project_seed
)

knn_brand_validation$accuracy_results |>
  kable(
    caption = "KNN Brand Code Validation Accuracy by K",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
KNN Brand Code Validation Accuracy by K
k accuracy test_rows
3 0.8645 391
5 0.8491 391
7 0.8593 391
9 0.8542 391
11 0.8389 391
best_knn_k <- knn_brand_validation$best_k

The KNN Brand Code validation results show that k = 3 gave the best accuracy.

The accuracy values were:

  • k = 3: 0.8645
  • k = 5: 0.8491
  • k = 7: 0.8593
  • k = 9: 0.8542
  • k = 11: 0.8389

This means the model predicted the correct brand_code about 86.45% of the time when k = 3, and I will use it for imputing missing brand_code values. A smaller k lets the imputation depend more on the closest nearby observations instead of averaging across too many neighbors. In my opinion this makes sense because the PCA plot showed that missing brand_code rows were located near known brand groups

knn_confusion_wide <- knn_brand_validation$confusion_table |>
  pivot_wider(
    names_from = predicted_brand,
    values_from = n,
    values_fill = 0
  ) |>
  arrange(actual_brand)

knn_confusion_wide |>
  kable(
    caption = "KNN Brand Code Validation Confusion Matrix",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
KNN Brand Code Validation Confusion Matrix
actual_brand A B C D
A 36 2 1 6
B 1 185 10 0
C 0 31 20 0
D 3 0 0 96

The confusion matrix makes the KNN brand_code predictions validation easier to understand.

The diagonal values are the correct predictions:

  • Brand A: 36 correctly predicted as A
  • Brand B: 185 correctly predicted as B
  • Brand C: 20 correctly predicted as C
  • Brand D: 96 correctly predicted as D

KNN predicted Brand B and Brand D especially well. Brand B had 185 correct predictions, with only 10 rows predicted as C and 1 row predicted as A. Brand D also performed well, with 96 correct predictions and only 3 rows predicted as A.

Brand C was the most difficult group. Out of actual Brand C rows, 20 were correctly predicted as C, but 31 were predicted as Brand B. This errors can be explained by the earlier PCA plot, where Brand B and Brand C looked more overlapped in the process-variable space, therefore KNN model struggled to correctly make predictions for Brand C.

Brand A was mostly predicted correctly, but there was some confusion with Brand D.

Overall, the confusion matrix supports using KNN to impute missing brand_code values. It performs well overall, but it also shows that Brand C is harder to separate from Brand B.

And now I will create first imputed dataset using bagged-tree model to impute missing numeric values, and KNN model to impute missing brand_code values

tree_numeric_knn_brand <- impute_brand_knn(
  train_x = train_features_brand_tree_numeric,
  valid_x = valid_features_brand_tree_numeric,
  eval_x = eval_features_brand_tree_numeric,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$train
valid_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$valid
eval_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$eval

# Keep old object names too, so later code that already uses them does not break.
train_features_capped_knn_brand <- train_features_tree_numeric_knn_brand
valid_features_capped_knn_brand <- valid_features_tree_numeric_knn_brand
eval_features_capped_knn_brand <- eval_features_tree_numeric_knn_brand
tree_numeric_knn_missing_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  total_missing_values = c(
    sum(is.na(train_features_tree_numeric_knn_brand)),
    sum(is.na(valid_features_tree_numeric_knn_brand)),
    sum(is.na(eval_features_tree_numeric_knn_brand))
  )
)

tree_numeric_knn_missing_check |>
  kable(
    caption = "Missing Values After Tree Numeric + KNN Brand Imputation",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Missing Values After Tree Numeric + KNN Brand Imputation
dataset total_missing_values
Training 0
Validation 0
Final Evaluation 0
rf_capped_knn_results <- run_rf_validation_test(
  train_x = train_features_tree_numeric_knn_brand,
  valid_x = valid_features_tree_numeric_knn_brand,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "Tree numeric + KNN brand",
  project_seed = project_seed
)

rf_capped_knn_results |>
  kable(
    caption = "Tree Numeric + KNN Brand Random Forest Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Tree Numeric + KNN Brand Random Forest Validation Performance
workflow training_rows validation_rows best_mtry rmse mae rsq training_time_seconds
Tree numeric + KNN brand 2052 515 20 0.0962 0.0687 0.698 205.3
summarize_imputed_brand_counts <- function(
  original_train_x,
  original_valid_x,
  original_eval_x,
  imputed_train_x,
  imputed_valid_x,
  imputed_eval_x,
  brand_col = "brand_code"
) {
  
  bind_rows(
    imputed_train_x |>
      filter(is.na(original_train_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Training"),
    
    imputed_valid_x |>
      filter(is.na(original_valid_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Validation"),
    
    imputed_eval_x |>
      filter(is.na(original_eval_x[[brand_col]])) |>
      count(.data[[brand_col]], name = "n") |>
      mutate(dataset = "Final Evaluation")
  ) |>
    rename(imputed_brand_code = .data[[brand_col]]) |>
    select(dataset, imputed_brand_code, n) |>
    pivot_wider(
      names_from = imputed_brand_code,
      values_from = n,
      values_fill = 0
    ) |>
    mutate(
      Total = rowSums(across(where(is.numeric)))
    )
}
tree_numeric_knn_brand_counts <- summarize_imputed_brand_counts(
  original_train_x = train_features_brand_tree_numeric,
  original_valid_x = valid_features_brand_tree_numeric,
  original_eval_x = eval_features_brand_tree_numeric,
  imputed_train_x = train_features_tree_numeric_knn_brand,
  imputed_valid_x = valid_features_tree_numeric_knn_brand,
  imputed_eval_x = eval_features_tree_numeric_knn_brand,
  brand_col = "brand_code"
)

tree_numeric_knn_brand_counts |>
  kable(
    caption = "Imputed Brand Code Counts: Tree Numeric + KNN Brand",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Imputed Brand Code Counts: Tree Numeric + KNN Brand
dataset A B C Total
Training 13 65 16 94
Validation 3 21 2 26
Final Evaluation 0 5 3 8

This table shows only the rows where brand_code was originally missing. It does not show the full brand_code distribution.

For the tree-based numeric imputation workflow, KNN assigned most missing Brand Code values to Brand B and Brand C.
In the training data, 65 missing rows were imputed as Brand B and 16 were imputed as Brand C. Some rows were also assigned to Brand A (13), but no missing rows were assigned to Brand D.

This matches the PCA plot reasonably well because most missing brand_code rows were located inside or close to the Brand B and Brand C clusters. The KNN result is therefore not just assigning everything to the largest brand randomly. It is using nearby rows in the completed numeric process-variable space.

Overall, this matches the PCA plot because most missing Brand Code rows were located near the Brand B and Brand C clusters.

Random Forest Numeric Imputation + KNN Brand Code Imputation

I mentioned earlier that I will test few imputations techniques. So here, I will test Random Forest model as another method for imputing missing numeric predictors.

The idea is that each numeric variable with missing values can be predicted from the other available process variables.
Random Forest can be useful here because it can capture nonlinear relationships and interactions between variables, which may exist in the production process. And it is possible it gives better results compared to bagged trees imputation method completed previously.

This process follows these steps:

  1. start with the PH-guided outliers capped feature sets.
  2. use Random Forest models to impute missing numeric predictors.
  3. keep brand_code missing during the numeric imputation step.
  4. after numeric values are completed, use KNN to impute missing brand_code.
  5. test the final completed dataset using Random Forest validation model.

Missing values of brand_code will be handled separately with KNN, since PCA plot showed earlier that missing brand_code rows are mostly located near Brand B and Brand C clusters. So the same KNN brand_code imputation step is applied after the numeric predictors are completed.

impute_numeric_by_random_forest <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  ntree = 250,
  min_obs = 30,
  project_seed = 123
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  train_medians <- map_dbl(
    numeric_vars,
    function(var) {
      med_value <- median(train_x[[var]], na.rm = TRUE)
      
      if (is.nan(med_value)) {
        med_value <- 0
      }
      
      med_value
    }
  )
  
  names(train_medians) <- numeric_vars
  
  fill_numeric_with_train_medians <- function(data_x) {
    
    data_filled <- data_x
    
    for (var in numeric_vars) {
      data_filled[[var]][is.na(data_filled[[var]])] <- train_medians[[var]]
    }
    
    data_filled
  }
  
  train_base <- fill_numeric_with_train_medians(train_x)
  valid_base <- fill_numeric_with_train_medians(valid_x)
  eval_base <- fill_numeric_with_train_medians(eval_x)
  
  train_out[numeric_vars] <- train_base[numeric_vars]
  valid_out[numeric_vars] <- valid_base[numeric_vars]
  eval_out[numeric_vars] <- eval_base[numeric_vars]
  
  brand_levels <- train_x[[brand_col]] |>
    as.character() |>
    na.omit() |>
    unique()
  
  brand_levels <- c(sort(brand_levels), "Unknown")
  
  prepare_brand_for_dummy <- function(data_x) {
    
    data_x |>
      mutate(
        "{brand_col}" := ifelse(
          is.na(.data[[brand_col]]),
          "Unknown",
          as.character(.data[[brand_col]])
        ),
        "{brand_col}" := factor(
          .data[[brand_col]],
          levels = brand_levels
        )
      )
  }
  
  train_brand_dummy_data <- prepare_brand_for_dummy(train_x)
  
  brand_dummy_fit <- caret::dummyVars(
    formula = as.formula(paste("~", brand_col)),
    data = train_brand_dummy_data,
    fullRank = FALSE
  )
  
  make_rf_predictor_matrix <- function(data_filled, original_data, target_var) {
    
    numeric_predictors <- setdiff(numeric_vars, target_var)
    
    brand_dummy_data <- prepare_brand_for_dummy(original_data)
    
    brand_dummy_matrix <- predict(
      brand_dummy_fit,
      newdata = brand_dummy_data
    ) |>
      as.data.frame()
    
    bind_cols(
      data_filled[numeric_predictors],
      brand_dummy_matrix
    )
  }
  
  vars_to_impute <- numeric_vars[
    map_lgl(
      numeric_vars,
      function(var) {
        any(is.na(train_x[[var]])) ||
          any(is.na(valid_x[[var]])) ||
          any(is.na(eval_x[[var]]))
      }
    )
  ]
  
  imputation_log <- list()
  
  for (var in vars_to_impute) {
    
    observed_train_rows <- which(!is.na(train_x[[var]]))
    
    if (length(observed_train_rows) < min_obs) {
      
      imputation_log[[var]] <- tibble(
        variable = var,
        method = "median fallback",
        observed_training_rows = length(observed_train_rows),
        train_missing_imputed = sum(is.na(train_x[[var]])),
        valid_missing_imputed = sum(is.na(valid_x[[var]])),
        eval_missing_imputed = sum(is.na(eval_x[[var]]))
      )
      
      next
    }
    
    x_train_matrix <- make_rf_predictor_matrix(
      data_filled = train_base,
      original_data = train_x,
      target_var = var
    )
    
    x_valid_matrix <- make_rf_predictor_matrix(
      data_filled = valid_base,
      original_data = valid_x,
      target_var = var
    )
    
    x_eval_matrix <- make_rf_predictor_matrix(
      data_filled = eval_base,
      original_data = eval_x,
      target_var = var
    )
    
    set.seed(project_seed)
    
    rf_imp_model <- randomForest::randomForest(
      x = x_train_matrix[observed_train_rows, ],
      y = train_x[[var]][observed_train_rows],
      ntree = ntree
    )
    
    train_missing_rows <- which(is.na(train_x[[var]]))
    valid_missing_rows <- which(is.na(valid_x[[var]]))
    eval_missing_rows <- which(is.na(eval_x[[var]]))
    
    if (length(train_missing_rows) > 0) {
      train_out[[var]][train_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_train_matrix[train_missing_rows, ]
      )
    }
    
    if (length(valid_missing_rows) > 0) {
      valid_out[[var]][valid_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_valid_matrix[valid_missing_rows, ]
      )
    }
    
    if (length(eval_missing_rows) > 0) {
      eval_out[[var]][eval_missing_rows] <- predict(
        rf_imp_model,
        newdata = x_eval_matrix[eval_missing_rows, ]
      )
    }
    
    imputation_log[[var]] <- tibble(
      variable = var,
      method = "random forest",
      observed_training_rows = length(observed_train_rows),
      train_missing_imputed = length(train_missing_rows),
      valid_missing_imputed = length(valid_missing_rows),
      eval_missing_imputed = length(eval_missing_rows)
    )
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars,
    imputation_log = bind_rows(imputation_log)
  )
}
rf_numeric_imputed_temp <- impute_numeric_by_random_forest(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  ntree = 250,
  min_obs = 30,
  project_seed = project_seed
)

train_features_rf_numeric_temp <- rf_numeric_imputed_temp$train
valid_features_rf_numeric_temp <- rf_numeric_imputed_temp$valid
eval_features_rf_numeric_temp <- rf_numeric_imputed_temp$eval
rf_numeric_imputed_temp$imputation_log |>
  arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
  kable(
    caption = "Random Forest Numeric Imputation Log",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
Random Forest Numeric Imputation Log
variable method observed_training_rows train_missing_imputed valid_missing_imputed eval_missing_imputed
mfr random forest 1879 173 35 31
filler_speed random forest 2009 43 11 10
fill_ounces random forest 2022 30 8 6
psc_co2 random forest 2016 36 3 5
pc_volume random forest 2020 32 7 4
psc random forest 2025 27 6 5
carb_pressure1 random forest 2027 25 7 4
hyd_pressure4 random forest 2028 24 4 4
carb_pressure random forest 2031 21 6 0
carb_temp random forest 2035 17 9 1
psc_fill random forest 2031 21 2 3
fill_pressure random forest 2039 13 5 2
filler_level random forest 2038 14 2 2
hyd_pressure2 random forest 2041 11 4 1
hyd_pressure3 random forest 2041 11 4 1
temperature random forest 2042 10 2 2
oxygen_filler random forest 2043 9 2 3
pressure_setpoint random forest 2043 9 3 2
carb_volume random forest 2044 8 2 1
hyd_pressure1 random forest 2044 8 3 0
alch_rel random forest 2046 6 1 3
carb_rel random forest 2044 8 0 2
usage_cont random forest 2048 4 1 2
bowl_setpoint random forest 2052 0 2 1
carb_flow random forest 2050 2 0 0
density random forest 2052 0 0 1
balling random forest 2052 0 0 1
pressure_vacuum random forest 2052 0 0 1
air_pressurer random forest 2052 0 0 1
balling_lvl random forest 2051 1 0 0
rf_numeric_knn_brand_temp <- impute_brand_knn(
  train_x = train_features_rf_numeric_temp,
  valid_x = valid_features_rf_numeric_temp,
  eval_x = eval_features_rf_numeric_temp,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$train
valid_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$valid
eval_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$eval
rf_numeric_knn_results_temp <- run_rf_validation_test(
  train_x = train_features_rf_numeric_knn_brand_temp,
  valid_x = valid_features_rf_numeric_knn_brand_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "RF numeric + KNN brand",
  project_seed = project_seed
)

# rf_numeric_knn_results_temp |>
#   kable(
#     caption = "RF Numeric + KNN Brand Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

Now we have all missing values imputed, and let’s try another imputation technique.

XGBoost Numeric Imputation + KNN Brand Code Imputation

Here I will use XGBoost model as another method for predicting and imputing missing numeric predictors.

The idea is pretty much same as previous Bagging Trees and Random Forest numeric imputation workflows. Each numeric variable with missing values is treated as a temporary prediction problem, where the other process variables are used as predictors. XGBoost can be useful here because it can model nonlinear patterns and interactions, and it often performs well on structured tabular data.

The steps are:

  1. start with the PH-guided capped feature sets.
  2. use XGBoost model to impute missing numeric predictors.
  3. keep brand_code missing during the numeric imputation step.
  4. after numeric values are completed, will use KNN to impute missing brand_code.
  5. test the final completed dataset using the same Random Forest validation model.

Missing values of brand_code are still handled separately with KNN model. So after XGBoost completes the numeric predictors, KNN is used to assign brand_code based on nearby observations just like in previous methods - centering and scaling applied in KNN training process.

I’m using this imputations technique to see if XGBoost numeric imputation creates a better completed dataset than the Bagging Tree based and Random Forest numeric imputation methods.

impute_numeric_by_xgboost <- function(
  train_x,
  valid_x,
  eval_x,
  brand_col = "brand_code",
  nrounds = 100,
  max_depth = 3,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_obs = 30,
  project_seed = 123
) {
  
  numeric_vars <- get_numeric_predictors(train_x)
  
  train_out <- train_x
  valid_out <- valid_x
  eval_out <- eval_x
  
  train_medians <- map_dbl(
    numeric_vars,
    function(var) {
      med_value <- median(train_x[[var]], na.rm = TRUE)
      
      if (is.nan(med_value)) {
        med_value <- 0
      }
      
      med_value
    }
  )
  
  names(train_medians) <- numeric_vars
  
  for (var in numeric_vars) {
    train_out[[var]][is.na(train_out[[var]])] <- train_medians[[var]]
    valid_out[[var]][is.na(valid_out[[var]])] <- train_medians[[var]]
    eval_out[[var]][is.na(eval_out[[var]])] <- train_medians[[var]]
  }
  
  brand_levels <- train_x[[brand_col]] |>
    as.character() |>
    na.omit() |>
    unique()
  
  brand_levels <- c(sort(brand_levels), "Unknown")
  
  prepare_brand_for_dummy <- function(data_x) {
    
    data_x |>
      mutate(
        "{brand_col}" := ifelse(
          is.na(.data[[brand_col]]),
          "Unknown",
          as.character(.data[[brand_col]])
        ),
        "{brand_col}" := factor(
          .data[[brand_col]],
          levels = brand_levels
        )
      )
  }
  
  brand_dummy_fit <- caret::dummyVars(
    formula = as.formula(paste("~", brand_col)),
    data = prepare_brand_for_dummy(train_x),
    fullRank = FALSE
  )
  
  make_xgb_matrix <- function(data_x, target_var) {
    
    numeric_predictors <- setdiff(numeric_vars, target_var)
    
    brand_dummy_matrix <- predict(
      brand_dummy_fit,
      newdata = prepare_brand_for_dummy(data_x)
    ) |>
      as.data.frame()
    
    predictor_data <- bind_cols(
      data_x[numeric_predictors],
      brand_dummy_matrix
    )
    
    as.matrix(predictor_data)
  }
  
  vars_to_impute <- numeric_vars[
    map_lgl(
      numeric_vars,
      function(var) {
        any(is.na(train_x[[var]])) ||
          any(is.na(valid_x[[var]])) ||
          any(is.na(eval_x[[var]]))
      }
    )
  ]
  
  imputation_log <- list()
  
  for (var in vars_to_impute) {
    
    observed_train_rows <- which(!is.na(train_x[[var]]))
    
    train_missing_rows <- which(is.na(train_x[[var]]))
    valid_missing_rows <- which(is.na(valid_x[[var]]))
    eval_missing_rows <- which(is.na(eval_x[[var]]))
    
    if (length(observed_train_rows) < min_obs) {
      
      imputation_log[[var]] <- tibble(
        variable = var,
        method = "median fallback",
        observed_training_rows = length(observed_train_rows),
        train_missing_imputed = length(train_missing_rows),
        valid_missing_imputed = length(valid_missing_rows),
        eval_missing_imputed = length(eval_missing_rows)
      )
      
      next
    }
    
    x_train_matrix <- make_xgb_matrix(train_x, target_var = var)
    x_valid_matrix <- make_xgb_matrix(valid_x, target_var = var)
    x_eval_matrix <- make_xgb_matrix(eval_x, target_var = var)
    
    dtrain <- xgboost::xgb.DMatrix(
      data = x_train_matrix[observed_train_rows, , drop = FALSE],
      label = train_x[[var]][observed_train_rows],
      missing = NA
    )
    
    params <- list(
      objective = "reg:squarederror",
      eval_metric = "rmse",
      max_depth = max_depth,
      eta = eta,
      subsample = subsample,
      colsample_bytree = colsample_bytree
    )
    
    set.seed(project_seed)
    
    xgb_imp_model <- xgboost::xgb.train(
      params = params,
      data = dtrain,
      nrounds = nrounds,
      verbose = 0
    )
    
    if (length(train_missing_rows) > 0) {
      train_out[[var]][train_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_train_matrix[train_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    if (length(valid_missing_rows) > 0) {
      valid_out[[var]][valid_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_valid_matrix[valid_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    if (length(eval_missing_rows) > 0) {
      eval_out[[var]][eval_missing_rows] <- predict(
        xgb_imp_model,
        newdata = xgboost::xgb.DMatrix(
          x_eval_matrix[eval_missing_rows, , drop = FALSE],
          missing = NA
        )
      )
    }
    
    imputation_log[[var]] <- tibble(
      variable = var,
      method = "xgboost",
      observed_training_rows = length(observed_train_rows),
      train_missing_imputed = length(train_missing_rows),
      valid_missing_imputed = length(valid_missing_rows),
      eval_missing_imputed = length(eval_missing_rows)
    )
  }
  
  list(
    train = train_out,
    valid = valid_out,
    eval = eval_out,
    numeric_predictors = numeric_vars,
    imputation_log = bind_rows(imputation_log)
  )
}
xgb_numeric_imputed_temp <- impute_numeric_by_xgboost(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code",
  nrounds = 100,
  max_depth = 3,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8,
  min_obs = 30,
  project_seed = project_seed
)

train_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$train
valid_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$valid
eval_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$eval
xgb_numeric_imputed_temp$imputation_log |>
  arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
  kable(
    caption = "XGBoost Numeric Imputation Log",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )|>
  scroll_box(
    width = "100%",
    height = "500px"
  )
XGBoost Numeric Imputation Log
variable method observed_training_rows train_missing_imputed valid_missing_imputed eval_missing_imputed
mfr xgboost 1879 173 35 31
filler_speed xgboost 2009 43 11 10
fill_ounces xgboost 2022 30 8 6
psc_co2 xgboost 2016 36 3 5
pc_volume xgboost 2020 32 7 4
psc xgboost 2025 27 6 5
carb_pressure1 xgboost 2027 25 7 4
hyd_pressure4 xgboost 2028 24 4 4
carb_pressure xgboost 2031 21 6 0
carb_temp xgboost 2035 17 9 1
psc_fill xgboost 2031 21 2 3
fill_pressure xgboost 2039 13 5 2
filler_level xgboost 2038 14 2 2
hyd_pressure2 xgboost 2041 11 4 1
hyd_pressure3 xgboost 2041 11 4 1
temperature xgboost 2042 10 2 2
oxygen_filler xgboost 2043 9 2 3
pressure_setpoint xgboost 2043 9 3 2
carb_volume xgboost 2044 8 2 1
hyd_pressure1 xgboost 2044 8 3 0
alch_rel xgboost 2046 6 1 3
carb_rel xgboost 2044 8 0 2
usage_cont xgboost 2048 4 1 2
bowl_setpoint xgboost 2052 0 2 1
carb_flow xgboost 2050 2 0 0
density xgboost 2052 0 0 1
balling xgboost 2052 0 0 1
pressure_vacuum xgboost 2052 0 0 1
air_pressurer xgboost 2052 0 0 1
balling_lvl xgboost 2051 1 0 0

The XGBoost numeric imputation log shows which variables had missing values filled in the training, validation, and final evaluation datasets.

Each variable with missing numeric values was imputed using an XGBoost model trained from the available training data. PH was not used during this process, so the imputation step does not create target leakage.

The largest number of missing values was for mfr, followed by variables such as filler_speed, fill_ounces, psc_co2, and pc_volume.

I included this table as a check that the imputation process was applied consistently across all three datasets before modeling.

xgb_numeric_knn_brand_temp <- impute_brand_knn(
  train_x = train_features_xgb_numeric_temp,
  valid_x = valid_features_xgb_numeric_temp,
  eval_x = eval_features_xgb_numeric_temp,
  brand_col = "brand_code",
  k = best_knn_k
)

train_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$train
valid_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$valid
eval_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$eval
xgb_numeric_knn_rf_results_temp <- run_rf_validation_test(
  train_x = train_features_xgb_numeric_knn_brand_temp,
  valid_x = valid_features_xgb_numeric_knn_brand_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "XGBoost numeric + KNN brand",
  project_seed = project_seed
)

# xgb_numeric_knn_rf_results_temp |>
#   kable(
#     caption = "XGBoost Numeric + KNN Brand Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
xgb_knn_imputed_brand_counts <- bind_rows(
  train_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(train_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Training"),
  
  valid_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(valid_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Validation"),
  
  eval_features_xgb_numeric_knn_brand_temp |>
    filter(is.na(eval_features_xgb_numeric_temp$brand_code)) |>
    count(brand_code, name = "n") |>
    mutate(dataset = "Final Evaluation")
) |>
  rename(imputed_brand_code = brand_code) |>
  pivot_wider(
    names_from = imputed_brand_code,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(
    Total = rowSums(across(where(is.numeric)))
  ) |>
  select(dataset, everything())

xgb_knn_imputed_brand_counts |>
  kable(
    caption = "Imputed Brand Code Counts After XGBoost Numeric Imputation + KNN",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Imputed Brand Code Counts After XGBoost Numeric Imputation + KNN
dataset A B C Total
Training 7 64 23 94
Validation 1 20 5 26
Final Evaluation 0 6 2 8
xgb_knn_brand_validation <- validate_knn_brand(
  train_x = train_features_xgb_numeric_temp,
  brand_col = "brand_code",
  k_values = c(3, 5, 7, 9, 11),
  project_seed = project_seed
)

xgb_knn_brand_validation$confusion_table |>
  pivot_wider(
    names_from = predicted_brand,
    values_from = n,
    values_fill = 0
  ) |>
  arrange(actual_brand) |>
  kable(
    caption = "KNN Brand Code Validation Confusion Matrix Using XGBoost-Imputed Numeric Predictors",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
KNN Brand Code Validation Confusion Matrix Using XGBoost-Imputed Numeric Predictors
actual_brand A B C D
A 36 2 1 6
B 1 185 10 0
C 1 31 19 0
D 2 0 1 96

Compared with the earlier KNN brand_code confusion matrix, the results using XGBoost-imputed numeric predictors are almost the same.

The previous confusion matrix had:

  • Brand A: 36 correctly predicted as A
  • Brand B: 185 correctly predicted as B
  • Brand C: 20 correctly predicted as C
  • Brand D: 96 correctly predicted as D

The XGBoost-imputed version had:

  • Brand A: 36 correctly predicted as A
  • Brand B: 185 correctly predicted as B
  • Brand C: 19 correctly predicted as C
  • Brand D: 96 correctly predicted as D

So the main difference is very small:

  • Brand A stayed the same.
  • Brand B stayed the same.
  • Brand D stayed the same.
  • Brand C decreased slightly from 20 correct predictions to 19 correct predictions.

As we see, Brand C rows are still the hardest to classify. In the XGBoost-imputed version, 31 actual Brand C rows were predicted as Brand B, which is the same pattern as before. This supports the earlier PCA result, where Brand B and Brand C were more overlapped in the numeric process-variable space.

Overall, the XGBoost numeric imputation did not meaningfully change the KNN brand_codevalidation pattern. The KNN model still performs well for Brands A, B, and D, but Brand C remains harder to separate from Brand B.

Imputation of all NA’s with missForest package

In this part, I test missForest as another imputation method.

This method is different from other workflows because it does not separate numeric imputation and brand_code imputation into two steps as I used earlier. Instead, missForest imputes numeric predictors and brand_code together in one process.

The steps are:

  1. start with the PH-guided capped feature sets.
  2. combine the training, validation, and final evaluation feature sets for the imputation step.
  3. use missForest to impute both numeric predictors and missing brand_code.
  4. split the imputed data back into training, validation, and final evaluation sets.
  5. test the completed dataset using the same Random Forest validation model as before.

I included this imputation process as a comparison method because missForest can handle mixed numeric and categorical predictors. However, it is less connected to my PCA and KNN brand_code imputations logic, because it does not use the separate process of completing numeric values first and then assigning brand_code based on nearby observations.

prepare_missforest_data <- function(train_x, valid_x, eval_x, brand_col = "brand_code") {
  
  train_temp <- train_x |>
    mutate(.dataset = "Training", .row_id = row_number())
  
  valid_temp <- valid_x |>
    mutate(.dataset = "Validation", .row_id = row_number())
  
  eval_temp <- eval_x |>
    mutate(.dataset = "Final Evaluation", .row_id = row_number())
  
  combined_data <- bind_rows(
    train_temp,
    valid_temp,
    eval_temp
  )
  
  combined_data <- combined_data |>
    mutate(
      "{brand_col}" := as.factor(.data[[brand_col]])
    )
  
  list(
    combined_data = combined_data,
    id_cols = c(".dataset", ".row_id")
  )
}

missforest_input <- prepare_missforest_data(
  train_x = train_features_for_imputation,
  valid_x = valid_features_for_imputation,
  eval_x = eval_features_for_imputation,
  brand_col = "brand_code"
)

combined_for_missforest <- missforest_input$combined_data
id_cols <- missforest_input$id_cols

combined_ids <- combined_for_missforest |>
  select(all_of(id_cols))

combined_features_for_missforest <- combined_for_missforest |>
  select(-all_of(id_cols)) |>
  as.data.frame()

combined_features_for_missforest <- combined_features_for_missforest |>
  mutate(
    across(where(is.character), as.factor),
    across(where(is.logical), as.factor)
  )

combined_features_for_missforest$brand_code <- as.factor(
  combined_features_for_missforest$brand_code
)

combined_features_for_missforest <- combined_features_for_missforest |>
  mutate(
    across(
      where(~ inherits(.x, "Date") || inherits(.x, "POSIXct") || inherits(.x, "POSIXlt")),
      as.numeric
    )
  )

all_missing_cols <- names(combined_features_for_missforest)[
  map_lgl(combined_features_for_missforest, ~ all(is.na(.x)))
]

combined_features_for_missforest <- combined_features_for_missforest |>
  select(-any_of(all_missing_cols))

constant_cols <- names(combined_features_for_missforest)[
  map_lgl(
    combined_features_for_missforest,
    ~ length(unique(na.omit(.x))) <= 1
  )
]

combined_features_for_missforest <- combined_features_for_missforest |>
  select(-any_of(constant_cols))
set.seed(project_seed)

missforest_time_temp <- system.time({
  
  missforest_result_temp <- missForest(
    xmis = combined_features_for_missforest,
    maxiter = 3,
    ntree = 50,
    variablewise = TRUE,
    verbose = TRUE
  )
  
})
  missForest iteration 1 in progress...done!
    estimated error(s): 0.02549889 0.002423574 0.006228133 0.002000095 2.476716 3.270997 0.002225144 0.01265173 0.001775185 0 11.25729 3.633197 21.6205 11.23396 7.67219 52.86191 26.78502 66429.97 0.9822 4.856411 174607.9 0.002789376 645.1303 0.00792792 0.07094996 0.0007109196 18.60701 0.7337317 0.6809167 0.01107182 0.00279986 0.01465174 
    difference(s): 0.001248508 0.03246295 
    time: 1.44 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.0225425 0.002431696 0.006274587 0.002011106 2.344886 3.298449 0.002247812 0.01253804 0.001766746 0 11.14045 3.696281 22.65158 10.87017 7.500838 51.82271 27.53997 30159.45 0.9322445 4.845559 191381.8 0.002579745 718.2835 0.007228232 0.0708929 0.0006863428 16.51772 0.7274178 0.6579717 0.01043707 0.002802485 0.01902887 
    difference(s): 0.0001388316 0.00952717 
    time: 1.39 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.02143385 0.002448337 0.006218199 0.002009058 2.13478 3.319124 0.002221153 0.01261356 0.001786276 0 11.25951 3.73122 21.58347 11.16358 7.748629 51.69491 28.4234 33827.62 0.9165848 4.869446 184279.5 0.002834121 655.3057 0.007171282 0.06827684 0.0007088335 15.52839 0.7407598 0.6634867 0.01034152 0.002816703 0.0200716 
    difference(s): 0.00002911975 0.007410021 
    time: 1.41 seconds
missforest_time_temp
   user  system elapsed 
   4.89    2.50    4.24 
missforest_result_temp$OOBerror
              PFC               MSE               MSE               MSE 
     0.0214338507      0.0024483373      0.0062181991      0.0020090582 
              MSE               MSE               MSE               MSE 
     2.1347803332      3.3191239656      0.0022211529      0.0126135631 
              MSE               MSE               MSE               MSE 
     0.0017862764      0.0000000000     11.2595069395      3.7312199962 
              MSE               MSE               MSE               MSE 
    21.5834697187     11.1635822029      7.7486286628     51.6949086916 
              MSE               MSE               MSE               MSE 
    28.4234033967  33827.6197125108      0.9165848314      4.8694455645 
              MSE               MSE               MSE               MSE 
184279.4672627342      0.0028341205    655.3057387789      0.0071712824 
              MSE               MSE               MSE               MSE 
     0.0682768382      0.0007088335     15.5283864592      0.7407597868 
              MSE               MSE               MSE               MSE 
     0.6634867181      0.0103415237      0.0028167026      0.0200716008 
combined_missforest_imputed <- missforest_result_temp$ximp |>
  bind_cols(combined_ids)

train_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Training") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))

valid_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Validation") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))

eval_features_missforest_temp <- combined_missforest_imputed |>
  filter(.dataset == "Final Evaluation") |>
  arrange(.row_id) |>
  select(-all_of(id_cols))
missforest_missing_check_temp <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_missforest_temp),
    nrow(valid_features_missforest_temp),
    nrow(eval_features_missforest_temp)
  ),
  columns = c(
    ncol(train_features_missforest_temp),
    ncol(valid_features_missforest_temp),
    ncol(eval_features_missforest_temp)
  ),
  total_missing_values = c(
    sum(is.na(train_features_missforest_temp)),
    sum(is.na(valid_features_missforest_temp)),
    sum(is.na(eval_features_missforest_temp))
  )
)

# missforest_missing_check_temp |>
#   kable(
#     caption = "Missing Values After missForest Imputation",
#     digits = 0
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
missforest_rf_results_temp <- run_rf_validation_test(
  train_x = train_features_missforest_temp,
  valid_x = valid_features_missforest_temp,
  train_y = train_ph,
  valid_y = valid_ph,
  workflow_name = "missForest imputation",
  project_seed = project_seed
)

# missforest_rf_results_temp |>
#   kable(
#     caption = "missForest Random Forest Validation Performance",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )

Complete-Case Benchmark Dataset

Before fully relying on imputations, I will also test a complete-case dataset with Random Forest model. In this part, I do not impute missing predictor values. Instead, I simply remove rows that have any missing predictor values to create complete-case daaset.

This gives a useful benchmark because the model is trained only on fully observed records. However, it also removes many observations with missing values, so it may not be fully practical for the final evaluation data where missing predictors also exist.

I will use this as a comparison point, not as the final preprocessing workflow.

# Complete-case feature sets
complete_train_rows <- complete.cases(train_features_for_imputation)
complete_valid_rows <- complete.cases(valid_features_for_imputation)

train_features_complete_case <- train_features_for_imputation[complete_train_rows, ]
valid_features_complete_case <- valid_features_for_imputation[complete_valid_rows, ]

train_ph_complete_case <- train_ph[complete_train_rows]
valid_ph_complete_case <- valid_ph[complete_valid_rows]

complete_case_row_check <- tibble(
  dataset = c("Training", "Validation"),
  original_rows = c(
    nrow(train_features_for_imputation),
    nrow(valid_features_for_imputation)
  ),
  complete_case_rows = c(
    nrow(train_features_complete_case),
    nrow(valid_features_complete_case)
  ),
  removed_rows = original_rows - complete_case_rows
)

complete_case_row_check |>
  kable(
    caption = "Rows Kept and Removed for Complete-Case Random Forest",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Rows Kept and Removed for Complete-Case Random Forest
dataset original_rows complete_case_rows removed_rows
Training 2052 1625 427
Validation 515 413 102
complete_case_rf_results <- run_rf_validation_test(
  train_x = train_features_complete_case,
  valid_x = valid_features_complete_case,
  train_y = train_ph_complete_case,
  valid_y = valid_ph_complete_case,
  workflow_name = "Complete-case random forest",
  project_seed = project_seed
)

complete_case_rf_results |>
  kable(
    caption = "Complete-Case Random Forest Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Complete-Case Random Forest Validation Performance
workflow training_rows validation_rows best_mtry rmse mae rsq training_time_seconds
Complete-case random forest 1625 413 15 0.092 0.0659 0.7309 136.01

Imputation Workflow Comparison Using Random Forest

After creating the different datasets with imputed NA values, I compare them using the same Random Forest validation test.

My main goal is not just to see which method fills missing values. I want to see which imputation workflow creates the most useful dataset for predicting PH.

Each workflow has been tested with the same general Random Forest setup, same training PH target, and same validation PH target. This makes the comparison more practical because the final decision is based on downstream model performance, not only on the imputation method itself.

I also included complete-case dataset. This model does not use imputation and only kept rows with no missing predictor values. It can be useful as a baseline, but it uses less rows, so it is not exactly the same situation as the imputed workflows.

imputation_workflow_comparison <- list(
  if (exists("complete_case_rf_results")) complete_case_rf_results else NULL,
  if (exists("xgb_numeric_knn_rf_results_temp")) xgb_numeric_knn_rf_results_temp else NULL,
  if (exists("rf_numeric_knn_results_temp")) rf_numeric_knn_results_temp else NULL,
  if (exists("rf_capped_knn_results")) rf_capped_knn_results else NULL,
  if (exists("missforest_rf_results_temp")) missforest_rf_results_temp else NULL
) |>
  purrr::compact() |>
  bind_rows()

imputation_workflow_comparison |>
  select(
    workflow,
    training_rows,
    validation_rows,
    best_mtry,
    rmse,
    mae,
    rsq,
    training_time_seconds
  ) |>
  arrange(rmse) |>
  kable(
    caption = "Random Forest Comparison: Complete-Case and Imputed Workflows",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Random Forest Comparison: Complete-Case and Imputed Workflows
workflow training_rows validation_rows best_mtry rmse mae rsq training_time_seconds
Complete-case random forest 1625 413 15 0.0920 0.0659 0.7309 136.01
XGBoost numeric + KNN brand 2052 515 20 0.0958 0.0682 0.7013 192.01
Tree numeric + KNN brand 2052 515 20 0.0962 0.0687 0.6980 205.30
missForest imputation 2052 515 20 0.0966 0.0694 0.6947 189.60
RF numeric + KNN brand 2052 515 20 0.0967 0.0692 0.6944 189.14

The complete-case dataset has the best validation performance, but it uses fewer rows because records with missing predictors are removed.
This makes it useful as a benchmark, but I wouldn’t use it for following prediction models, because the final evaluation data also contains missing predictor values. And, also I would like to use all given observations.

Among the imputed workflows, Random Forest test shows us that “XGBoost numeric imputation + KNN brand_code imputation” technique performed best.
It has:

  • the lowest RMSE = 0.0958,

  • and highest \(R^2\) = 0.7013 among imputed datasets.

The other three workflows results are very close to each other, but none of themis better than XGBoost numeric + KNN Brand Code imputation method.

Based on these results, I select datasets of XGBoost numeric + KNN Brand Code imputation method as the main training/validation and evaluation sets to use for modeling and final evaluation predictions.

Final Completed Feature Sets for Modeling

train_features_final_imputed <- train_features_xgb_numeric_knn_brand_temp
valid_features_final_imputed <- valid_features_xgb_numeric_knn_brand_temp
eval_features_final_imputed <- eval_features_xgb_numeric_knn_brand_temp

final_imputed_set_check <- tibble(
  dataset = c("Training", "Validation", "Final Evaluation"),
  rows = c(
    nrow(train_features_final_imputed),
    nrow(valid_features_final_imputed),
    nrow(eval_features_final_imputed)
  ),
  columns = c(
    ncol(train_features_final_imputed),
    ncol(valid_features_final_imputed),
    ncol(eval_features_final_imputed)
  ),
  total_missing_values = c(
    sum(is.na(train_features_final_imputed)),
    sum(is.na(valid_features_final_imputed)),
    sum(is.na(eval_features_final_imputed))
  )
)

final_imputed_set_check |>
  kable(
    caption = "Final Completed Feature Sets Used for Modeling",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Final Completed Feature Sets Used for Modeling
dataset rows columns total_missing_values
Training 2052 32 0
Validation 515 32 0
Final Evaluation 267 32 0

Final Preprocessing Decisions by Model Family

The one single preprocessing approach may not be best for every model family. Some models are sensitive to scale, skewness, and correlated predictors. Other models, especially tree-based models, can handle nonlinear splits and interactions more naturally.

Because of this, I would not force the exact same transformed features set into every model. Instead, I will make preprocessing decisions based on how each model family works.

mnfflow Variable Review

The mnf_flow variable earlier needed extra review because almost 50% of varible values are negative.

mnf_flow_negative_values |>
  kable(
    caption = "Negative mnfflow Values in Training Data",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Negative mnfflow Values in Training Data
mnf_flow n
-100.0 487
-100.2 454

All negative values are exactly -100 or -100.2, so at first they looked like possible coded process states rather than regular continuous flow measurements.

However, the earlier PH comparison showed that these negative mnf_flow rows behave differently from non-negative rows. For example, Brand B had mean PH of 8.6700 when mnf_flow was negative, compared with 8.4716 when mnf_flow was non-negative.

The process-variable comparison also showed differences. For Brand B, mean oxygen_filler was 0.0713 for negative mnf_flow rows, compared with 0.0255 for non-negative rows.

Because of this, I do not treat negative mnf_flow values as simple errors. Even though -100 and -100.2 look like special coded values, they appear to contain useful process information.

Testing mnfflow Representation for Linear Models

The mnf_flow review showed that this variable may not behave like one regular continuous predictor. This matters especially for linear models because linear models use coefficients.

If mnf_flow is kept as one numeric variable, the model has to use one linear relationship for both:

  • negative coded values such as -100 and -100.2
  • positive flow values that behave like regular continuous measurements

Because of this, I test two simple linear regression versions before making the final linear-model preprocessing decision:

  • Original mnf_flow: keep mnf_flow as one numeric predictor.
  • State-based mnf_flow: treat negative values as categorical process states and keep positive flow as a separate continuous variable. Basically I split mnf_flow into two separate variables: one catecorical for negative values (-100and-100.2), and another numerical variable for the rest of values.

This is only a quick preprocessing diagnostic. I am not applying the full final linear-model preprocessing steps yet. The purpose is only to decide whether mnf_flow should be represented differently for linear and regularized models.

add_mnf_flow_state_features <- function(data_x) {
  
  data_x |>
    mutate(
      mnf_flow_state = case_when(
        is.na(mnf_flow) ~ "Missing",
        mnf_flow == -100 ~ "Negative_100",
        mnf_flow == -100.2 ~ "Negative_100_2",
        mnf_flow < 0 ~ "Other_negative",
        mnf_flow >= 0 ~ "Non_negative"
      ),
      
      mnf_flow_state = factor(
        mnf_flow_state,
        levels = c(
          "Non_negative",
          "Negative_100",
          "Negative_100_2",
          "Other_negative",
          "Missing"
        )
      ),
      
      mnf_flow_positive_value = ifelse(
        !is.na(mnf_flow) & mnf_flow >= 0,
        mnf_flow,
        0
      )
    )
}
train_features_lm_original <- train_features_final_imputed
valid_features_lm_original <- valid_features_final_imputed

train_features_lm_state <- train_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

valid_features_lm_state <- valid_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)
train_lm_original_data <- train_features_lm_original |>
  mutate(ph = train_ph) |>
  filter(!is.na(ph))

valid_lm_original_data <- valid_features_lm_original |>
  mutate(ph = valid_ph) |>
  filter(!is.na(ph))

train_lm_state_data <- train_features_lm_state |>
  mutate(ph = train_ph) |>
  filter(!is.na(ph))

valid_lm_state_data <- valid_features_lm_state |>
  mutate(ph = valid_ph) |>
  filter(!is.na(ph))
mnf_flow_transformation_example <- bind_rows(
  train_features_lm_original |>
    filter(mnf_flow < 0) |>
    distinct(mnf_flow) |>
    slice_head(n = 2),
  
  train_features_lm_original |>
    filter(mnf_flow >= 0) |>
    distinct(mnf_flow) |>
    slice_head(n = 5)
) |>
  mutate(
    mnf_flow_state = case_when(
      is.na(mnf_flow) ~ "Missing",
      mnf_flow == -100 ~ "Negative_100",
      mnf_flow == -100.2 ~ "Negative_100_2",
      mnf_flow < 0 ~ "Other_negative",
      mnf_flow >= 0 ~ "Non_negative"
    ),
    mnf_flow_positive_value = ifelse(
      !is.na(mnf_flow) & mnf_flow >= 0,
      mnf_flow,
      0
    )
  ) |>
  arrange(mnf_flow)

mnf_flow_transformation_example |>
  kable(
    caption = "Example of mnfflow State-Based Transformation",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Example of mnfflow State-Based Transformation
mnf_flow mnf_flow_state mnf_flow_positive_value
-100.2 Negative_100_2 0.0
-100.0 Negative_100 0.0
0.2 Non_negative 0.2
161.4 Non_negative 161.4
162.6 Non_negative 162.6
164.4 Non_negative 164.4
167.4 Non_negative 167.4

This table shows how the original mnf_flow variable is changed for the linear-model test. Negative values such as -100 and -100.2 are treated as categorical states, while positive values are kept as a separate numeric variable.

For example, when mnf_flow = -100.2, the new mnf_flow_state is Negative_100_2, and mnf_flow_positive_value is set to 0. When mnf_flow is positive, the state is Non_negative, and the positive value is kept as the continuous flow measurement.

Now let’s see before and after split results with simple linear regression

lm_original_model <- lm(
  ph ~ .,
  data = train_lm_original_data
)

lm_state_model <- lm(
  ph ~ .,
  data = train_lm_state_data
)
lm_original_pred <- predict(
  lm_original_model,
  newdata = valid_lm_original_data
)

lm_state_pred <- predict(
  lm_state_model,
  newdata = valid_lm_state_data
)
lm_mnf_flow_comparison <- bind_rows(
  calculate_regression_metrics(
    actual = valid_lm_original_data$ph,
    predicted = lm_original_pred
  ) |>
    mutate(
      workflow = "Linear model with original mnfflow",
      training_rows = nrow(train_lm_original_data),
      validation_rows = nrow(valid_lm_original_data)
    ),
  
  calculate_regression_metrics(
    actual = valid_lm_state_data$ph,
    predicted = lm_state_pred
  ) |>
    mutate(
      workflow = "Linear model with mnfflow state + positive value",
      training_rows = nrow(train_lm_state_data),
      validation_rows = nrow(valid_lm_state_data)
    )
) |>
  select(
    workflow,
    training_rows,
    validation_rows,
    rmse,
    mae,
    rsq
  ) |>
  arrange(rmse)

lm_mnf_flow_comparison |>
  kable(
    caption = "Simple Linear Model Comparison of mnfflow Representations",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Simple Linear Model Comparison of mnfflow Representations
workflow training_rows validation_rows rmse mae rsq
Linear model with mnfflow state + positive value 2052 515 0.1280 0.0990 0.4448
Linear model with original mnfflow 2052 515 0.1311 0.1011 0.4173

The simple linear model comparison supports using the state-based mnf_flow split for linear models.

The model with mnf_flow_state + mnf_flow_positive_value performed better than the model with the original mnf_flow variable:

  • RMSE improved from 0.1311 to 0.1280,

  • MAE improved from 0.1011 to 0.0990,

  • and \(R^2\) improved from 0.4173 to 0.4448.

This supports my idea that mnf_flow should not always be treated as one regular continuous predictor. For linear models, one coefficient may be too restrictive because negative mnf_flow values appear to represent a coded process state, while positive values represent regular flow magnitude.

Because of this, I will use this mnf_flow_state + mnf_flow_positive_value representation dataset for linear and regularized model families.

Final Model-Family Preprocessing Plan

Based on the negative-value review, the mnf_flow behavior check, and the simple linear regression comparison, I use different preprocessing strategies for different model families.

Linear and Regularized Models

For linear and regularized models, I will use the mnf_flow_state + mnf_flow_positive_value representation. These models use coefficients, so separating the negative coded states from positive flow magnitude gives the model more flexibility.

The linear and regularized model preprocessing will include:

  • near-zero variance predictor removal
  • dummy encoding for categorical variables
  • selective Yeo-Johnson transformation for highly skewed continuous numeric variables
  • centering and scaling of numeric predictors
  • high-correlation review or filtering if needed

I wouldn’t apply Yeo-Johnson transformation to all numeric variables. It will be only considered for continuous numeric predictors with high skewness, and will not be applied to categorical variables, dummy variables, or variables that behave more like process settings or coded states.

This will keep the transformation more careful because not every numeric-looking variable should be treated as a regular continuous predictor.

Nonlinear Non-Tree Models

For nonlinear non-tree models, such as SVM with radial basis kernel or other distance/kernel-based methods, I will treat the mnf_flow representation as a comparison question.

These models may benefit from the state-based mnf_flow representation because they are sensitive to feature geometry, scale, and distance. However, I will not assume this automatically.

For nonlinear non-tree models, I may compare:

  • original final imputed feature set
  • final imputed feature set with mnf_flow_state + mnf_flow_positive_value

These models usually require centering and scaling. Selective Yeo-Johnson transformation may also be useful for highly skewed continuous predictors.

Tree-Based Models

For tree-based models, I will use the best final imputed feature set with the original mnf_flow variable.

This is because tree-based models can naturally split on rules like mnf_flow < 0. Because of that, they do not need the same mnf_flow state representation that linear models need.

Tree-based models also do not require centering and scaling, and I will not apply Yeo-Johnson transformation by default. These models can handle nonlinear splits, interactions, and skewed predictors.

I did side Random Forest test (not included here), and mnf_flow_state + mnf_flow_positive_valuerepresentation had lower accuracy performance compared with the original mnf_flow version. So, for Random Forest and other tree-based models, I will keep using the original final imputed feature set.

Data Leakage Control

All preprocessing steps that learn values from data must be learned from the training set only. This includes Yeo-Johnson transformation, centering, and scaling.

After these preprocessing steps are fitted on the training data during models training, the same fitted steps are applied to the validation and final evaluation sets.

The validation set is used only for model comparison. The final evaluation set will not be used for training, tuning, or model selection. I will use evaluation set at the end after the final model is selected.

#these will be tree-based model feature sets
train_features_tree_family <- train_features_final_imputed
valid_features_tree_family <- valid_features_final_imputed
eval_features_tree_family <- eval_features_final_imputed

#here linear / regularized model feature sets
train_features_linear_family <- train_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

valid_features_linear_family <- valid_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)

eval_features_linear_family <- eval_features_final_imputed |>
  add_mnf_flow_state_features() |>
  select(-mnf_flow)
model_family_feature_set_check <- tibble(
  feature_set = c(
    "Tree family training",
    "Tree family validation",
    "Tree family final evaluation",
    "Linear family training",
    "Linear family validation",
    "Linear family final evaluation"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family)
  ),
  missing_values = c(
    sum(is.na(train_features_tree_family)),
    sum(is.na(valid_features_tree_family)),
    sum(is.na(eval_features_tree_family)),
    sum(is.na(train_features_linear_family)),
    sum(is.na(valid_features_linear_family)),
    sum(is.na(eval_features_linear_family))
  )
)

model_family_feature_set_check |>
  kable(
    caption = "Model Family Feature Set Check",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Model Family Feature Set Check
feature_set rows columns missing_values
Tree family training 2052 32 0
Tree family validation 515 32 0
Tree family final evaluation 267 32 0
Linear family training 2052 33 0
Linear family validation 515 33 0
Linear family final evaluation 267 33 0
processed_data_folder <- "processed_modeling_data"

if (!dir.exists(processed_data_folder)) {
  dir.create(processed_data_folder)
}
write_csv(
  train_features_tree_family,
  file.path(processed_data_folder, "train_features_tree_family.csv")
)

write_csv(
  valid_features_tree_family,
  file.path(processed_data_folder, "valid_features_tree_family.csv")
)

write_csv(
  eval_features_tree_family,
  file.path(processed_data_folder, "eval_features_tree_family.csv")
)

write_csv(
  train_features_linear_family,
  file.path(processed_data_folder, "train_features_linear_family.csv")
)

write_csv(
  valid_features_linear_family,
  file.path(processed_data_folder, "valid_features_linear_family.csv")
)

write_csv(
  eval_features_linear_family,
  file.path(processed_data_folder, "eval_features_linear_family.csv")
)
write_csv(
  tibble(ph = train_ph),
  file.path(processed_data_folder, "train_ph.csv")
)

write_csv(
  tibble(ph = valid_ph),
  file.path(processed_data_folder, "valid_ph.csv")
)
processed_file_summary <- tibble(
  file_name = c(
    "train_features_tree_family.csv",
    "valid_features_tree_family.csv",
    "eval_features_tree_family.csv",
    "train_features_linear_family.csv",
    "valid_features_linear_family.csv",
    "eval_features_linear_family.csv",
    "train_ph.csv",
    "valid_ph.csv"
  ),
  description = c(
    "Training predictors for tree-based models using original mnf_flow",
    "Validation predictors for tree-based models using original mnf_flow",
    "Final evaluation predictors for tree-based models using original mnf_flow",
    "Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
    "Training PH target values",
    "Validation PH target values"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family),
    length(train_ph),
    length(valid_ph)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family),
    1,
    1
  )
)

write_csv(
  processed_file_summary,
  file.path(processed_data_folder, "processed_file_summary.csv")
)

processed_file_summary |>
  kable(
    caption = "Saved Processed Modeling Files",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Saved Processed Modeling Files
file_name description rows columns
train_features_tree_family.csv Training predictors for tree-based models using original mnf_flow 2052 32
valid_features_tree_family.csv Validation predictors for tree-based models using original mnf_flow 515 32
eval_features_tree_family.csv Final evaluation predictors for tree-based models using original mnf_flow 267 32
train_features_linear_family.csv Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value 2052 33
valid_features_linear_family.csv Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value 515 33
eval_features_linear_family.csv Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value 267 33
train_ph.csv Training PH target values 2052 1
valid_ph.csv Validation PH target values 515 1
set.seed(123)
project_seed = 123


# num_cores <- 10
# 
# 
# cl <- makePSOCKcluster(num_cores)
# registerDoParallel(cl)

Load Processed Modeling Data

processed_data_folder <- "processed_modeling_data"

train_features_tree_family <- read_csv(
  file.path(processed_data_folder, "train_features_tree_family.csv"),
  show_col_types = FALSE
)

valid_features_tree_family <- read_csv(
  file.path(processed_data_folder, "valid_features_tree_family.csv"),
  show_col_types = FALSE
)

eval_features_tree_family <- read_csv(
  file.path(processed_data_folder, "eval_features_tree_family.csv"),
  show_col_types = FALSE
)

train_features_linear_family <- read_csv(
  file.path(processed_data_folder, "train_features_linear_family.csv"),
  show_col_types = FALSE
)

valid_features_linear_family <- read_csv(
  file.path(processed_data_folder, "valid_features_linear_family.csv"),
  show_col_types = FALSE
)

eval_features_linear_family <- read_csv(
  file.path(processed_data_folder, "eval_features_linear_family.csv"),
  show_col_types = FALSE
)

train_ph <- read_csv(
  file.path(processed_data_folder, "train_ph.csv"),
  show_col_types = FALSE
) |>
  pull(ph)

valid_ph <- read_csv(
  file.path(processed_data_folder, "valid_ph.csv"),
  show_col_types = FALSE
) |>
  pull(ph)
loaded_data_check <- tibble(
  object_name = c(
    "train_features_tree_family",
    "valid_features_tree_family",
    "eval_features_tree_family",
    "train_features_linear_family",
    "valid_features_linear_family",
    "eval_features_linear_family",
    "train_ph",
    "valid_ph"
  ),
  rows = c(
    nrow(train_features_tree_family),
    nrow(valid_features_tree_family),
    nrow(eval_features_tree_family),
    nrow(train_features_linear_family),
    nrow(valid_features_linear_family),
    nrow(eval_features_linear_family),
    length(train_ph),
    length(valid_ph)
  ),
  columns = c(
    ncol(train_features_tree_family),
    ncol(valid_features_tree_family),
    ncol(eval_features_tree_family),
    ncol(train_features_linear_family),
    ncol(valid_features_linear_family),
    ncol(eval_features_linear_family),
    1,
    1
  )
)

loaded_data_check |>
  kable(
    caption = "Loaded Modeling Data Check",
    digits = 0
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Loaded Modeling Data Check
object_name rows columns
train_features_tree_family 2052 32
valid_features_tree_family 515 32
eval_features_tree_family 267 32
train_features_linear_family 2052 33
valid_features_linear_family 515 33
eval_features_linear_family 267 33
train_ph 2052 1
valid_ph 515 1
stopifnot(nrow(train_features_tree_family) == length(train_ph))
stopifnot(nrow(valid_features_tree_family) == length(valid_ph))

stopifnot(nrow(train_features_linear_family) == length(train_ph))
stopifnot(nrow(valid_features_linear_family) == length(valid_ph))

stopifnot(nrow(eval_features_tree_family) == nrow(eval_features_linear_family))

Helper Functions

Before fitting models, I define a few helper functions that will be reused throughout the modeling file. This keeps the model sections shorter and makes the results tables more consistent.

Function 1: Regression metrics

calculate_regression_metrics <- function(actual, predicted) {
  
  tibble(
    rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
    mae = mean(abs(actual - predicted), na.rm = TRUE),
    rsq = cor(actual, predicted, use = "complete.obs")^2
  )
}

This function calculates the main validation metrics used in this project: RMSE, MAE, and R-squared.

Function 2: Make a validation result row

make_model_result_row <- function(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    tuning_parameters,
    actual,
    predicted,
    training_time_seconds
) {
  
  metrics <- calculate_regression_metrics(
    actual = actual,
    predicted = predicted
  )
  
  tibble(
    workflow = workflow,
    model_family = model_family,
    model_name = model_name,
    training_rows = training_rows,
    validation_rows = validation_rows,
    tuning_parameters = tuning_parameters,
    rmse = metrics$rmse,
    mae = metrics$mae,
    rsq = metrics$rsq,
    training_time_seconds = training_time_seconds
  )
}

This function creates one clean result row for each model. Extra tuning information, such as best alpha, lambda, mtry, or cp, can be added when needed.

Function 3: Prepare linear model matrix

prepare_linear_model_matrix <- function(train_x, valid_x, eval_x = NULL) {
  
  dummy_fit <- caret::dummyVars(
    ~ .,
    data = train_x,
    fullRank = TRUE
  )
  
  train_dummy <- predict(dummy_fit, newdata = train_x) |>
    as.data.frame()
  
  valid_dummy <- predict(dummy_fit, newdata = valid_x) |>
    as.data.frame()
  
  if (!is.null(eval_x)) {
    eval_dummy <- predict(dummy_fit, newdata = eval_x) |>
      as.data.frame()
  } else {
    eval_dummy <- NULL
  }
  
  list(
    train = train_dummy,
    valid = valid_dummy,
    eval = eval_dummy,
    dummy_fit = dummy_fit
  )
}

This function creates dummy variables using the training data and then applies the same dummy encoding to validation and final evaluation data. This keeps categorical predictors consistent across all datasets.

Function 4: detect skewed continuous variables

find_skewed_numeric_vars <- function(
  train_x,
  skew_cutoff = 1,
  min_unique_values = 10,
  exclude_vars = c(
    "pressure_setpoint",
    "bowl_setpoint"
  )
) {
  
  numeric_vars <- train_x |>
    select(where(is.numeric)) |>
    names()
  
  skew_summary <- train_x |>
    select(all_of(numeric_vars)) |>
    summarise(
      across(
        everything(),
        list(
          unique_values = ~ n_distinct(.x, na.rm = TRUE),
          skewness = ~ e1071::skewness(.x, na.rm = TRUE)
        )
      )
    ) |>
    pivot_longer(
      cols = everything(),
      names_to = "name",
      values_to = "value"
    ) |>
    separate(
      name,
      into = c("variable", "metric"),
      sep = "_(?=[^_]+$)"
    ) |>
    pivot_wider(
      names_from = metric,
      values_from = value
    ) |>
    mutate(
      abs_skewness = abs(skewness),
      use_yeo_johnson = abs_skewness >= skew_cutoff &
        unique_values >= min_unique_values &
        !(variable %in% exclude_vars)
    ) |>
    arrange(desc(abs_skewness))
  
  list(
    skew_summary = skew_summary,
    yeo_johnson_vars = skew_summary |>
      filter(use_yeo_johnson) |>
      pull(variable)
  )
}

This function identifies numeric variables that are strongly skewed and may need a Yeo-Johnson transformation.

Function 5: Apply high-correlation filtering

apply_correlation_filter <- function(
  train_x,
  valid_x,
  eval_x = NULL,
  cutoff = 0.90
) {
  
  numeric_train <- train_x |>
    select(where(is.numeric))
  
  corr_matrix <- cor(
    numeric_train,
    use = "pairwise.complete.obs"
  )
  
  high_corr_cols <- caret::findCorrelation(
    corr_matrix,
    cutoff = cutoff,
    names = TRUE
  )
  
  if (length(high_corr_cols) > 0) {
    
    train_x <- train_x |>
      select(-all_of(high_corr_cols))
    
    valid_x <- valid_x |>
      select(-all_of(high_corr_cols))
    
    if (!is.null(eval_x)) {
      eval_x <- eval_x |>
        select(-all_of(high_corr_cols))
    }
  }
  
  list(
    train = train_x,
    valid = valid_x,
    eval = eval_x,
    removed_columns = high_corr_cols
  )
}

This function removes highly correlated numeric predictors using the training data only. This is mainly useful for linear and regularized models where strong correlation can make coefficients unstable.

Function 6: NZV filtering

apply_nzv_filter <- function(train_x, valid_x, eval_x = NULL) {
  
  nzv_cols_index <- caret::nearZeroVar(train_x)
  
  removed_columns <- names(train_x)[nzv_cols_index]
  
  if (length(removed_columns) > 0) {
    
    train_x <- train_x |>
      select(-all_of(removed_columns))
    
    valid_x <- valid_x |>
      select(-all_of(removed_columns))
    
    if (!is.null(eval_x)) {
      eval_x <- eval_x |>
        select(-all_of(removed_columns))
    }
  }
  
  list(
    train = train_x,
    valid = valid_x,
    eval = eval_x,
    removed_columns = removed_columns
  )
}

Function 7: Plots of predicted vs observed values

plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
  
  diag_df <- data.frame(
    observed = as.numeric(obs),
    predicted = as.numeric(pred),
    residuals = as.numeric(obs) - as.numeric(pred)
  )
  
  p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
    geom_point(alpha = 0.6, color = "blue") +
    geom_abline(slope = 1, intercept = 0, linetype = 2) +
    labs(
      title = paste("Predicted vs Observed: \n", model_name),
      x = "Predicted PH",
      y = "Observed PH"
    )
  
  p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
    geom_point(alpha = 0.6, color = "blue") +
    geom_hline(yintercept = 0, linetype = 2) +
    labs(
      title = paste("Residuals vs Predicted: \n", model_name),
      x = "Predicted PH",
      y = "Residuals"
    )
  
  p1 + p2
}

The first plot compares predicted values with observed values. If the model is doing well, the points should be close to the diagonal line.

The second plot shows residuals. Ideally, the residuals should be scattered around zero without a strong pattern.

Function 8: Model validation results table

make_and_show_model_result <- function(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    tuning_parameters,
    actual,
    predicted,
    training_time_seconds,
    table_caption = NULL
) {
  
  result <- make_model_result_row(
    workflow = workflow,
    model_family = model_family,
    model_name = model_name,
    training_rows = training_rows,
    validation_rows = validation_rows,
    tuning_parameters = tuning_parameters,
    actual = actual,
    predicted = predicted,
    training_time_seconds = training_time_seconds
  )
  
  if (is.null(table_caption)) {
    table_caption <- paste(model_name, "Validation Performance")
  }
  
  print(
    result |>
      kable(
        caption = table_caption,
        digits = 4
      ) |>
      kable_styling(
        full_width = FALSE,
        position = "center"
      )
  )
  
  return(result)
}

Function 9: Comparison table

add_model_result <- function(results_table, new_result) {
  
  bind_rows(
    results_table,
    new_result
  ) |>
    arrange(rmse)
}

linear_model_results <- tibble()
tree_model_results <- tibble()
all_model_results <- tibble()

Function 10: doParallel settings

start_parallel <- function(cores_to_leave_free = 1) {
  
  library(doParallel)
  
  total_cores <- parallel::detectCores(logical = TRUE)
  
  cores_to_use <- max(1, total_cores - cores_to_leave_free)
  
  cl <- makePSOCKcluster(cores_to_use)
  registerDoParallel(cl)
  
  message("Parallel backend started with ", cores_to_use, " cores.")
  
  return(cl)
}

stop_parallel <- function(cl) {
  
  stopCluster(cl)
  registerDoSEQ()
  
  message("Parallel backend stopped.")
}

Modeling Approach

In this part I will compare several model families using the validation set. The training set will be used for model fitting and cross-validation tuning. The validation set is used only after tuning to compare model performance.

I will use RMSE as the main metric because PH prediction is a regression problem and RMSE penalizes larger errors more strongly. I also report MAE because it is easier to interpret as the average absolute error. \(R^2\) will be included to show how much variation in PH is explained by the model.

Different model families will use different feature sets:

  • Linear and regularized models use the linear-family feature set with mnf_flow_state + mnf_flow_positive_value.
  • Tree-based models use the tree-family feature set with the original mnf_flow variable.
  • Nonlinear models may be tested with one or both feature sets depending on the model.
set.seed(123)

cv_control <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 3,
  savePredictions = "final"
)

I’m going to use repeated 5-fold cross-validation with 3 repeats.

The validation set is still kept separate. I will use it to compare the models after they are trained.

Linear and Regularized Models

Since I already created the linear-family feature set, I continue with that version here.

The linear-family feature set uses mnf_flow_state and mnf_flow_positive_value instead of the original mnf_flow, because that made more sense for linear models earlier.

train_linear_model_data <- train_features_linear_family |>
  mutate(ph = train_ph)

valid_linear_model_data <- valid_features_linear_family |>
  mutate(ph = valid_ph)

Near-Zero Variance and Correlation Filtering for Linear Models

Before fitting the linear models, I remove predictors with almost no variation. I did NZV test on raw data in preprocessing secton, but I’m doing it again as precation step since datasets now have imputed values

linear_nzv <- apply_nzv_filter(
  train_x = train_features_linear_family,
  valid_x = valid_features_linear_family
)

train_linear_nzv <- linear_nzv$train
valid_linear_nzv <- linear_nzv$valid
linear_nzv_removed <- tibble(
  removed_predictor = linear_nzv$removed_columns
)

linear_nzv_removed 
# A tibble: 0 × 1
# ℹ 1 variable: removed_predictor <chr>

No near-zero variance predictors were removed.

linear_corr <- apply_correlation_filter(
  train_x = train_linear_nzv,
  valid_x = valid_linear_nzv,
  cutoff = 0.90
)

train_linear_filtered <- linear_corr$train
valid_linear_filtered <- linear_corr$valid

stopifnot(exists("train_linear_filtered"))
stopifnot(exists("valid_linear_filtered"))
linear_corr_removed <- tibble(
  removed_predictor = linear_corr$removed_columns
)

linear_corr_removed |>
  kable(
    caption = "Highly Correlated Predictors Removed for Linear Models"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Highly Correlated Predictors Removed for Linear Models
removed_predictor
balling
hyd_pressure3
density
balling_lvl
filler_level

The correlation filter removed five predictors: balling, hyd_pressure3, density, balling_lvl, and filler_level.

train_linear_filtered_data <- train_linear_filtered |>
  mutate(ph = train_ph)

valid_linear_filtered_data <- valid_linear_filtered |>
  mutate(ph = valid_ph)

Yeo-Johnson Variable List for Linear Models

For the linear and regularized models, I first create the list of variables that should receive the Yeo-Johnson transformation.

I do not apply the transformation directly here. The actual Yeo-Johnson transformation will be handled inside the modeling workflow together with centering and scaling.

exclude_from_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

yeo_johnson_vars <- train_linear_filtered_data |>
  select(where(is.numeric)) |>
  select(-any_of(c(exclude_from_yj, "ph"))) |>
  names()
tibble(
  yeo_johnson_variable = yeo_johnson_vars
) |>
  kable(
    caption = "Variables Selected for Yeo-Johnson Transformation"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Variables Selected for Yeo-Johnson Transformation
yeo_johnson_variable
carb_volume
fill_ounces
pc_volume
carb_pressure
carb_temp
psc
psc_fill
psc_co2
carb_pressure1
fill_pressure
hyd_pressure1
hyd_pressure2
hyd_pressure4
filler_speed
temperature
usage_cont
carb_flow
mfr
pressure_vacuum
oxygen_filler
air_pressurer
alch_rel
carb_rel
mnf_flow_positive_value

Create Linear Modeling Data

After the NZV and correlation filtering steps, I add ph back to the filtered linear-family training and validation datasets.

The actual Yeo-Johnson, centering, and scaling steps will be applied inside the model training workflow.

train_linear_model_data <- train_linear_filtered |>
  mutate(ph = train_ph)

valid_linear_model_data <- valid_linear_filtered |>
  mutate(ph = valid_ph)
linear_recipe <- recipe(
  ph ~ .,
  data = train_linear_model_data
) |>
  step_YeoJohnson(all_of(yeo_johnson_vars)) |>
  step_dummy(all_nominal_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

Here I created the preprocessing workflow for the linear and regularized models.

The Yeo-Johnson step uses only the variables selected earlier in yeo_johnson_vars. Also, since some variables have negative values it is better to use Yeo-Johnson method instad of BoxCox. This keeps the transformation limited to the numeric variables we decided are appropriate.

After that, categorical predictors are converted into dummy variables. Then numeric predictors are centered and scaled, which is important for regularized models like ridge, lasso, and elastic net.

I keep this inside the modeling workflow so the preprocessing is connected to cross-validation instead of being applied separately by hand.

Linear Regression Baseline

I start with a regular linear regression model as the first baseline.

I originally tried to use caret::train() with a recipe, but it caused an error during repeated cross-validation.

Because this modeling section uses a recipe for Yeo-Johnson, centering, scaling, and dummy encoding, I switch to the tidymodels workflow. This fits better here because recipes, workflows, parsnip, and rsample are designed to work together.

The modeling logic stays the same. I am not changing the preprocessing plan. I am only changing the model training framework so the recipe can be handled more reliably.

set.seed(123)

linear_folds <- vfold_cv(
  train_linear_model_data,
  v = 5,
  repeats = 3
)

I use repeated 5-fold cross-validation again, but now through tidymodels.

set.seed(123)

linear_model_spec <- linear_reg() |>
  set_engine("lm")

linear_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(linear_model_spec)
set.seed(123)

linear_cv_results <- fit_resamples(
  linear_workflow,
  resamples = linear_folds,
  metrics = metric_set(rmse, mae, rsq),
  control = control_resamples(save_pred = TRUE)
)
collect_metrics(linear_cv_results) |>
  kable(
    caption = "Linear Regression Cross-Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Linear Regression Cross-Validation Performance
.metric .estimator mean n std_err .config
mae standard 0.1066 15 0.0008 pre0_mod0_post0
rmse standard 0.1363 15 0.0010 pre0_mod0_post0
rsq standard 0.3786 15 0.0093 pre0_mod0_post0
linear_start_time <- Sys.time()

linear_fit <- fit(
  linear_workflow,
  data = train_linear_model_data
)

linear_end_time <- Sys.time()

linear_training_time <- as.numeric(
  difftime(linear_end_time, linear_start_time, units = "secs")
)
linear_validation_predictions <- predict(
  linear_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)
linear_result <- make_model_result_row(
  workflow = "Linear-family ",
  model_family = "Linear / Regularized",
  model_name = "Linear Regression",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = "None",
  actual = valid_ph,
  predicted = linear_validation_predictions,
  training_time_seconds = linear_training_time
)

linear_result |>
  kable(
    caption = "Linear Regression Validation Performance",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Linear Regression Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Linear-family Linear / Regularized Linear Regression 2052 515 None 0.1305 0.1007 0.4244 0.1201

The first model I tested was regular linear regression.

This model is used as a baseline model because it is simple and easy to understand.

The validation performance was:

  • RMSE: 0.1305
  • MAE: 0.1007
  • \(R^2\): 0.4244

This result gives me a starting point for comparing all other models. The model was very fast to train, but the prediction performance is not very strong as expected.

The \(R^2\) value of 0.4244 means the model explained some of the PH pattern, but a large part of the variation was still not captured. This makes sense because the data we have probably has nonlinear relationships and interactions between variables.

Overall, linear regression is useful as a baseline, but it was not flexible enough.

plot_model_diagnostics(
  obs = valid_ph,
  pred = linear_validation_predictions,
  model_name = "Linear Regression"
)

The diagnostic plots show why linear regression is useful as a baseline, but not strong enough.

In Predicted vs Observed plot, the points follow the general upward direction, so the model captured some relationship between predictors and PH. However, the points spread around the diagonal line is wide. This means the linear regression produced many over and under predicted values.

The Residuals vs Predicted plot also shows a fairly wide spread of residuals. Some residuals are larger than +0.25 or below -0.25, which means the model had some noticeable overprediction and underprediction cases.

It also appear to be some funnel-shape pattern in the residuals instead of a fully random cloud around zero.

Overall, linear regression gave a useful starting point, but the diagnostic plots show that PH prediction likely needs a more flexible model that can capture nonlinear patterns and interactions between predictors.

Elastic Net Regression

After the regular linear regression model, I continue with elastic net regression.

Elastic net is still a linear model, but it adds regularization. This means it can shrink some coefficients and reduce overfitting compared with a regular linear regression model.

Elastic net combines both ridge and lasso regression.

Elastic Net Tuning Parameters

Elastic net has two main tuning parameters: penalty and mixture.

The penalty controls how aggressivly the model shrinks the coefficients. A small penalty means the model is closer to regular linear regression. On the other hand, larger penalty means stronger shrinkage.

The mixture controls the type of regularization. When mixture = 0, the model behaves pretty much like ridge regression. When mixture = 1, it behaves like lasso regression. Values between 0 and 1 combine both ridge and lasso methods.

In this part, I tune both parameters because I do not want to assume ahead of time whether ridge-like, lasso-like, or mixed regularization will work best.

make_linear_recipe <- function(training_data) {
  
  recipe(
    ph ~ .,
    data = training_data
  ) |>
    step_YeoJohnson(all_of(yeo_johnson_vars)) |>
    step_center(all_numeric_predictors()) |>
    step_scale(all_numeric_predictors()) |>
    step_dummy(all_nominal_predictors())
}
elastic_net_spec <- linear_reg(
  penalty = tune(),
  mixture = tune()
) |>
  set_engine("glmnet")
elastic_net_grid <- grid_regular(
  penalty(range = c(-6, -1)),
  mixture(range = c(0, 1)),
  levels = c(
    penalty = 15,
    mixture = 6
  )
)
elastic_net_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(elastic_net_spec)
elastic_net_grid_2 <- grid_regular(
  penalty(range = c(-6, -1)),
  mixture(range = c(0, 1)),
  levels = c(
    penalty = 15,
    mixture = 6
  )
)
set.seed(123)

elastic_net_start_time <- Sys.time()

elastic_net_tune <- tune_grid(
  elastic_net_workflow,
  resamples = linear_folds,
  grid = elastic_net_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

elastic_net_end_time <- Sys.time()

elastic_net_training_time <- as.numeric(
  difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)
#elastic_net_tune
# elastic_net_tune |>
#   collect_metrics() |>
#   arrange(.metric, mean) |>
#   kable(
#     caption = "Elastic Net Cross-Validation Tuning Results",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
best_elastic_net <- select_best(
  elastic_net_tune,
  metric = "rmse"
)

best_elastic_net
# A tibble: 1 × 3
  penalty mixture .config         
    <dbl>   <dbl> <chr>           
1 0.00848       0 pre0_mod67_post0

Now let’s fit chosen model on training and validation sets

final_elastic_net_workflow <- finalize_workflow(
  elastic_net_workflow,
  best_elastic_net
)
elastic_net_start_time <- Sys.time()

final_elastic_net_fit <- fit(
  final_elastic_net_workflow,
  data = train_linear_model_data
)

elastic_net_end_time <- Sys.time()

elastic_net_training_time <- as.numeric(
  difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)
elastic_net_validation_predictions <- predict(
  final_elastic_net_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)
elastic_net_result <- make_and_show_model_result(
  workflow = "Linear-family ",
  model_family = "Linear / Regularized",
  model_name = "Elastic Net",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "penalty =",
    best_elastic_net$penalty,
    ", mixture =",
    best_elastic_net$mixture
  ),
  actual = valid_ph,
  predicted = elastic_net_validation_predictions,
  training_time_seconds = elastic_net_training_time
)
Elastic Net Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Linear-family Linear / Regularized Elastic Net 2052 515 penalty = 0.00848342898244073 , mixture = 0 0.1305 0.1007 0.425 0.1375

The best elastic net model selected penalty = 0.0085 and mixture = 0.

The elastic net model produced almost the same validation performance as the regular linear regression baseline.This makes sense because the best tuning result selected penalty = 0.0085 and mixture = 0. A mixture value of 0 means the model is ridge-like, and the very small penalty means only a tiny amount of shrinkage was applied.
Because of this, the elastic net model stayed very close to the regular linear regression model. In this case, regularization did not meaningfully improve validation performance.

plot_model_diagnostics(
  obs = valid_ph,
  pred = elastic_net_validation_predictions,
  model_name = "Elastic Net"
)

The elastic net diagnostic plots show that the model captures the general direction of PH values, and almost identical with diagnosis plots of previous linear regression.

In the predicted vs observed plot, the points follow an upward trend, but they are not tightly clustered around the diagonal line. The predicted PH values are also compressed into a smaller range than the observed PH values, which suggests that the model is pulling predictions toward the average.

The residual plot shows that the residuals are mostly centered around zero, but there is still noticeable spread. The spread has a mild funnel shape, with larger residual variation for smaller predicted PH values. This suggests that the elastic net model struggles more with explaining variation in the lower predicted PH range.

For larger predicted PH values, there also appears to be a slight downward pattern in the residuals.

Overall, the residual plot suggests that the elastic net model captures some of the general PH pattern, but it does not fully explain the variance across the whole PH range.

Partial Least Squares Regression

I also test Partial Least Squares regression because it can be useful when predictors are correlated.

Instead of using the original predictors directly, PLS creates a smaller number of components (supervised version of PCA) from the predictors. These components are created in a way that tries to explain the relationship with ph.

The main tuning parameter is num_comp, which controls how many PLS components are used. Too few components may underfit, while too many components may start acting more like a regular linear model. For this model I will use tuning with up to 15 components.

train_pls_model_data <- train_linear_nzv |>
  mutate(ph = train_ph)

valid_pls_model_data <- valid_linear_nzv |>
  mutate(ph = valid_ph)
exclude_from_pls_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint"
)

pls_yeo_johnson_vars <- train_linear_nzv |>
  select(where(is.numeric)) |>
  select(-any_of(exclude_from_pls_yj)) |>
  names()
make_pls_recipe <- function(training_data) {
  
  recipe(
    ph ~ .,
    data = training_data
  ) |>
    step_YeoJohnson(all_of(pls_yeo_johnson_vars)) |>
    step_center(all_numeric_predictors()) |>
    step_scale(all_numeric_predictors()) |>
    step_dummy(all_nominal_predictors())
}
set.seed(123)

pls_folds <- vfold_cv(
  train_pls_model_data,
  v = 5,
  repeats = 3
)
fit_pls_cv_split <- function(split, num_comp) {
  
  train_split_data <- analysis(split)
  valid_split_data <- assessment(split)
  
  split_recipe <- make_pls_recipe(train_split_data)
  
  split_recipe_prep <- prep(
    split_recipe,
    training = train_split_data,
    retain = TRUE
  )
  
  train_split_baked <- bake(
    split_recipe_prep,
    new_data = train_split_data
  )
  
  valid_split_baked <- bake(
    split_recipe_prep,
    new_data = valid_split_data
  )
  
  pls_fit <- plsr(
    ph ~ .,
    data = train_split_baked,
    ncomp = num_comp,
    method = "kernelpls"
  )
  
  pls_predictions <- predict(
    pls_fit,
    newdata = valid_split_baked,
    ncomp = num_comp
  ) |>
    as.numeric()
  
  tibble(
    num_comp = num_comp,
    rmse = rmse_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    ),
    mae = mae_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    ),
    rsq = rsq_vec(
      truth = valid_split_baked$ph,
      estimate = pls_predictions
    )
  )
}
pls_grid <- tibble(
  num_comp = 1:20
)

set.seed(123)

pls_uncorr_start_time <- Sys.time()

pls_uncorr_cv_results <- crossing(
  split_id = seq_along(pls_folds$splits),
  num_comp = pls_grid$num_comp
) |>
  mutate(
    metrics = map2(
      split_id,
      num_comp,
      ~ fit_pls_cv_split(
        split = pls_folds$splits[[.x]],
        num_comp = .y
      )
    )
  ) |>
  select(-split_id, -num_comp) |>
  unnest(metrics)

pls_uncorr_end_time <- Sys.time()

pls_uncorr_training_time <- as.numeric(
  difftime(
    pls_uncorr_end_time,
    pls_uncorr_start_time,
    units = "secs"
  )
)
pls_uncorr_cv_summary <- pls_uncorr_cv_results |>
  group_by(num_comp) |>
  summarise(
    rmse = mean(rmse, na.rm = TRUE),
    mae = mean(mae, na.rm = TRUE),
    rsq = mean(rsq, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(rmse)

pls_uncorr_cv_summary |>
  kable(
    caption = "PLS Cross-Validation Results with Correlated Predictors Kept",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
PLS Cross-Validation Results with Correlated Predictors Kept
num_comp rmse mae rsq
14 0.1350 0.1051 0.3906
15 0.1350 0.1051 0.3904
17 0.1350 0.1052 0.3903
16 0.1350 0.1052 0.3903
18 0.1350 0.1052 0.3901
13 0.1350 0.1053 0.3900
19 0.1351 0.1052 0.3899
12 0.1351 0.1052 0.3897
20 0.1351 0.1051 0.3896
10 0.1352 0.1053 0.3889
11 0.1352 0.1053 0.3886
8 0.1354 0.1057 0.3863
9 0.1354 0.1055 0.3863
7 0.1359 0.1058 0.3823
6 0.1364 0.1064 0.3778
5 0.1372 0.1064 0.3696
4 0.1388 0.1085 0.3557
3 0.1398 0.1091 0.3462
2 0.1443 0.1132 0.3042
1 0.1527 0.1210 0.2212
best_pls_uncorr <- pls_uncorr_cv_summary |>
  slice(1)

best_pls_uncorr |>
  kable(
    caption = "Best PLS Result with Correlated Predictors Kept",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best PLS Result with Correlated Predictors Kept
num_comp rmse mae rsq
14 0.135 0.1051 0.3906
best_pls_uncorr_num_comp <- best_pls_uncorr$num_comp

The best PLS model selected num_comp = 14 and it is a lot of componets are being used.

.

pls_uncorr_final_recipe <- make_pls_recipe(
  train_pls_model_data
)

pls_uncorr_final_recipe_prep <- prep(
  pls_uncorr_final_recipe,
  training = train_pls_model_data,
  retain = TRUE
)

train_pls_uncorr_baked <- bake(
  pls_uncorr_final_recipe_prep,
  new_data = train_pls_model_data
)

valid_pls_uncorr_baked <- bake(
  pls_uncorr_final_recipe_prep,
  new_data = valid_pls_model_data
)
pls_uncorr_final_start_time <- Sys.time()

pls_uncorr_final_model <- plsr(
  ph ~ .,
  data = train_pls_uncorr_baked,
  ncomp = best_pls_uncorr_num_comp,
  method = "kernelpls"
)

pls_uncorr_final_end_time <- Sys.time()

pls_uncorr_final_training_time <- as.numeric(
  difftime(
    pls_uncorr_final_end_time,
    pls_uncorr_final_start_time,
    units = "secs"
  )
)
pls_uncorr_validation_predictions <- predict(
  pls_uncorr_final_model,
  newdata = valid_pls_uncorr_baked,
  ncomp = best_pls_uncorr_num_comp
) |>
  as.numeric()
pls_uncorr_result <- make_and_show_model_result(
  workflow = "Linear-family / PLS keeps correlated predictors",
  model_family = "Linear / Regularized",
  model_name = "PLS Regression",
  training_rows = nrow(train_pls_model_data),
  validation_rows = nrow(valid_pls_model_data),
  tuning_parameters = paste(
    "num_comp =",
    best_pls_uncorr_num_comp
  ),
  actual = valid_ph,
  predicted = pls_uncorr_validation_predictions,
  training_time_seconds = pls_uncorr_final_training_time
)
PLS Regression Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Linear-family / PLS keeps correlated predictors Linear / Regularized PLS Regression 2052 515 num_comp = 14 0.1286 0.0987 0.4403 0.0095

The best PLS model selected num_comp = 14 and it is a lot of componets.

This means the model needed many components to get its best cross-validation result. In this case, PLS did not reduce the predictor space into a small number of very strong components.

The best PLS cross-validation performance was:

  • RMSE: 0.1362
  • MAE: 0.1066
  • \(R^2\): 0.3792

Compared with the regular linear regression and elastic net models, PLS performed slightly worse. This suggests that reducing the predictors into PLS components did not improve PH prediction for this dataset.

Because of this, PLS is useful as another linear-family comparison model, but it does not look like the strongest model in this family.

The plots are almost identical to previous plots, and my previous explanations of ENET diagnostics apply to these plots as well.

plot_model_diagnostics(
  obs = valid_ph,
  pred = pls_uncorr_validation_predictions,
  model_name = "PLS Regression"
)

The PLS diagnostic plots look very similar to the regular linear regression plots.

In the Predicted vs Observed plot, the points follow the general upward direction, so PLS captured some relationship between the predictors and PH. However, there is still a wide spread around the diagonal line. This means the predictions are not very close to the observed PH values for many validation rows.

The Residuals vs Predicted plot also shows that residuals are spread fairly wide around zero. There are still some larger overpredictions and underpredictions.

PLS was useful because it can handle correlated predictors better than regular linear regression. This is why I tested it after seeing correlation between some process variables.

However, the plots show that PLS still behaves mostly like a linear model. It improved slightly over regular linear regression, but it still did not capture the nonlinear patterns.

Overall, PLS was a reasonable linear-family model to test, but it was not flexible enough.

linear_model_results <- tibble()

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = linear_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = elastic_net_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = pls_uncorr_result
)
# linear_model_results <- linear_model_results |>
#   filter(model_name != "PLS Regression")
# 
# all_model_results <- all_model_results |>
#   filter(model_name != "PLS Regression")
linear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Linear and Regularized Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Linear and Regularized Model Comparison
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Linear-family / PLS keeps correlated predictors Linear / Regularized PLS Regression 2052 515 num_comp = 14 0.1286 0.0987 0.4403 0.0095
Linear-family Linear / Regularized Linear Regression 2052 515 None 0.1305 0.1007 0.4244 0.1201
Linear-family Linear / Regularized Elastic Net 2052 515 penalty = 0.00848342898244073 , mixture = 0 0.1305 0.1007 0.4250 0.1375

The linear and regularized model comparison shows that PLS regression performed slightly better than regular linear regression and elastic net.

PLS regression:

  • RMSE: 0.1286
  • MAE: 0.0987
  • \(R^2\): 0.4403

Linear regression:

  • RMSE: 0.1305
  • MAE: 0.1007
  • \(R^2\): 0.4244

Elastic net:

  • RMSE: 0.1305
  • MAE: 0.1007
  • \(R^2\): 0.4250

For PLS, I kept the correlated predictors instead of applying correlation filtering. This is because PLS is designed to work with correlated predictors by creating components from them. Removing correlated predictors before PLS could remove useful information.

The best PLS model selected num_comp = 14. This means the model needed many components to reach its best performance, so it did not reduce the data into only a few strong components. Still, PLS gave a small improvement over the regular linear regression and elastic net models.

Elastic net selected mixture = 0, which means it behaved like ridge regression. Its performance was almost the same as regular linear regression, so regularization did not add much improvement in this section.

Overall, PLS was the best model inside the linear and regularized model family, but the improvement was small. This tells us that linear models are able to capture some PH patterns, but linear models still leave a good amount of data structure unexplained.

linear_model_results <- tibble()
all_model_results <- tibble()

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = linear_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = elastic_net_result
)

linear_model_results <- add_model_result(
  results_table = linear_model_results,
  new_result = pls_uncorr_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = linear_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = elastic_net_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = pls_uncorr_result
)
#all_model_results

Nonlinear Models

After the linear-family models, I will test nonlinear models.

The linear models gave very similar prediction patterns, so the next question is if flexible models can capture nonlinear relationships or interactions in the production data.

For these models, I continue using the linear-family feature set. This is because mnf_flow was already separated into mnf_flow_state and mnf_flow_positive_value, which is safer for models that depend on distances, scaling, or smooth nonlinear relationships.

I also keep the same preprocessing recipe with Yeo-Johnson, centering, scaling, and dummy encoding. This is important for SVM and neural networks because they are sensitive to variable scale.

nonlinear_model_results <- tibble()

SVM with Radial Basis Kernel

I start the nonlinear model section with SVM using a radial basis kernel.
This model can capture nonlinear patterns by comparing observations based on distance. Because of that, here scaling is important. I use the same preprocessing recipe from the previous section, where numeric variables are transformed, centered, and scaled before modeling.

The main tuning parameters are cost and rbf_sigma.

  • cost controls how much the model tries to avoid errors. A larger cost allows less flexibility for mistakes, while a smaller cost allows more tolerance.

  • rbf_sigma controls how local or smooth the radial kernel is. Smaller values usually create smoother patterns, while larger values can make the model more sensitive to nearby points.

svm_rbf_spec <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()
) |>
  set_mode("regression") |>
  set_engine("kernlab")
svm_rbf_spec <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()) |>
  set_mode("regression") |>
  set_engine("kernlab")
svm_rbf_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(svm_rbf_spec)
svm_rbf_grid <- grid_regular(
  cost(range = c(-3, 5)),
  rbf_sigma(range = c(-6, 1)),
  levels = c(
    cost = 9,
    rbf_sigma = 8
  )
)

# svm_rbf_grid |>
#   kable(
#     caption = "SVM Radial Tuning Grid",
#     digits = 6
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
set.seed(123)

svm_rbf_start_time <- Sys.time()

svm_rbf_tune <- tune_grid(
  svm_rbf_workflow,
  resamples = linear_folds,
  grid = svm_rbf_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

svm_rbf_end_time <- Sys.time()

svm_rbf_training_time <- as.numeric(
  difftime(svm_rbf_end_time, svm_rbf_start_time, units = "secs")
)
svm_rbf_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top SVM Radial Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top SVM Radial Tuning Results by RMSE
cost rbf_sigma .metric .estimator mean n std_err .config
2 0.10 rmse standard 0.1201 15 0.0009 pre0_mod38_post0
4 0.10 rmse standard 0.1210 15 0.0008 pre0_mod46_post0
8 0.01 rmse standard 0.1216 15 0.0010 pre0_mod53_post0
1 0.10 rmse standard 0.1218 15 0.0010 pre0_mod30_post0
8 0.10 rmse standard 0.1220 15 0.0008 pre0_mod54_post0
4 0.01 rmse standard 0.1221 15 0.0011 pre0_mod45_post0
16 0.10 rmse standard 0.1225 15 0.0008 pre0_mod62_post0
32 0.10 rmse standard 0.1225 15 0.0008 pre0_mod70_post0
16 0.01 rmse standard 0.1226 15 0.0011 pre0_mod61_post0
2 0.01 rmse standard 0.1234 15 0.0011 pre0_mod37_post0
best_svm_rbf <- select_best(
  svm_rbf_tune,
  metric = "rmse"
)

best_svm_rbf |>
  kable(
    caption = "Best SVM Radial Tuning Parameters",
    digits = 6
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best SVM Radial Tuning Parameters
cost rbf_sigma .config
2 0.1 pre0_mod38_post0

In this SVM radial tuning grid found a good region around cost = 2 and rbf_sigma = 0.10.

As we see in tuning results table above, several of the best tuning combinations are close to each other, which suggests that the model was not depending on just one lucky parameter setting. Instead, there seemed to be a good tuning region.

For example, this tuning grid showed strong results for:

- cost = 2, rbf_sigma = 0.10 , mean RMSE = 0.1201
- cost = 4, rbf_sigma = 0.10``, mean RMSE = 0.1210
- cost = 1, rbf_sigma = 0.10``, mean RMSE = 0.1218
- cost = 8, rbf_sigma = 0.01``, mean RMSE = 0.1216

Because nearby values had similar RMSE scores, below I will run a refined tuning grid around that area. This will allowe me to test smaller changes in cost and rbf_sigma instead of only using the wider first grid.

svm_rbf_grid |>
  summarise(
    min_cost = min(cost),
    max_cost = max(cost),
    min_rbf_sigma = min(rbf_sigma),
    max_rbf_sigma = max(rbf_sigma)
  )
# A tibble: 1 × 4
  min_cost max_cost min_rbf_sigma max_rbf_sigma
     <dbl>    <dbl>         <dbl>         <dbl>
1    0.125       32      0.000001            10
svm_rbf_grid_refined <- expand_grid(
  cost = c(1, 1.5, 2, 3, 4, 6, 8),
  rbf_sigma = c(0.05, 0.075, 0.10, 0.125, 0.15, 0.20)
)

# svm_rbf_grid_refined |>
#   kable(
#     caption = "Refined SVM Radial Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
set.seed(123)

svm_rbf_refined_start_time <- Sys.time()

svm_rbf_tune_refined <- tune_grid(
  svm_rbf_workflow,
  resamples = linear_folds,
  grid = svm_rbf_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

svm_rbf_refined_end_time <- Sys.time()

svm_rbf_refined_training_time <- as.numeric(
  difftime(
    svm_rbf_refined_end_time,
    svm_rbf_refined_start_time,
    units = "secs"
  )
)
best_svm_rbf_refined <- select_best(
  svm_rbf_tune_refined,
  metric = "rmse"
)

best_svm_rbf_refined |>
  kable(
    caption = "Best Refined SVM Radial Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Refined SVM Radial Tuning Parameters
cost rbf_sigma .config
3 0.05 pre0_mod19_post0

The refined tuning selected:

  • cost: 3

  • rbf_sigma: 0.05

  • RMSE: 0.1183

This improved the cross-validation RMSE from 0.1201 in the first grid to 0.1183 in the refined grid.

This result can indicate that the SVM radial model performs better with a moderate cost value and a smoother radial kernel. In practical terms, the model is flexible enough to capture nonlinear patterns, but it is not using an overly aggressive kernel setting.

final_svm_rbf_workflow <- finalize_workflow(
  svm_rbf_workflow,
  best_svm_rbf_refined
)
svm_rbf_final_start_time <- Sys.time()

final_svm_rbf_fit <- fit(
  final_svm_rbf_workflow,
  data = train_linear_model_data
)

svm_rbf_final_end_time <- Sys.time()

svm_rbf_final_training_time <- as.numeric(
  difftime(
    svm_rbf_final_end_time,
    svm_rbf_final_start_time,
    units = "secs"
  )
)
svm_rbf_validation_predictions <- predict(
  final_svm_rbf_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)
svm_rbf_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "SVM Radial",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "cost =",
    best_svm_rbf_refined$cost,
    ", / rbf_sigma =",
    best_svm_rbf_refined$rbf_sigma
  ),
  actual = valid_ph,
  predicted = svm_rbf_validation_predictions,
  training_time_seconds = svm_rbf_final_training_time
)
SVM Radial Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Nonlinear-family Nonlinear SVM Radial 2052 515 cost = 3 , / rbf_sigma = 0.05 0.1112 0.0812 0.5806 0.8783

The SVM radial model performed better than the linear-family models on the validation data.

  • RMSE: 0.1112
  • MAE: 0.0812
  • \(R^2\): 0.5806

This is an improvement over the linear regression, elastic net, and PLS models, which all had validation RMSE values around 0.1305 and \(R^2\) values around 0.425.

The best SVM radial model used cost = 3 and rbf_sigma = 0.05. This means the model used a moderate cost value and a smoother radial kernel.

This result suggests that the relationships between predictors and PH are not only linear. The radial kernel performed better to capture nonlinear patterns that the linear-family models could not capture.

plot_model_diagnostics(
  obs = valid_ph,
  pred = svm_rbf_validation_predictions,
  model_name = "SVM Radial"
)

The SVM radial diagnostic plots show better behavior than the linear-family models.

In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. There are few far points from the line for predictions less than 8.72, and the cluster of predictions above 8.72 PH is much tighter to diagonal line. This suggests that SVM radial basis model has a tendency to over or under predict PH values less than 8.72.

The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread, especially around the middle predicted PH range with some of the predictions with larger negative residuals (overpredicted), but the residual pattern looks more balanced than the linear-family models.

Overall, there are still some larger errors, which means this model is not perfect. However, compared with the earlier linear models, the SVM radial model captures more of the PH variation and handles nonlinear structure better.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = svm_rbf_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = svm_rbf_result
)
nonlinear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Current Nonlinear Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
    
  )
Current Nonlinear Model Comparison
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Nonlinear-family Nonlinear SVM Radial 2052 515 cost = 3 , / rbf_sigma = 0.05 0.1112 0.0812 0.5806 0.8783

Multivariate Adaptive Regression Splines (MARS) Model

Next, I test a MARS model.

MARS can capture nonlinear relationships by creating “hinged” linear terms. This makes it more flexible than regular linear regression, but usually easier to interpret than models like SVM or neural n

The main tuning parameters are num_terms and prod_degree.

  • num_terms controls how many model terms are allowed. A larger value gives the model more flexibility.

  • prod_degree controls whether the model can include interaction terms. A value of 1 allows only main effects, while a value of 2 allows two-way interactions, like multiplicative interactions.

mars_spec <- mars(
  num_terms = tune(),
  prod_degree = tune(),
  prune_method = "none") |>
  set_mode("regression") |>
  set_engine("earth")
mars_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(mars_spec)
mars_grid <- expand_grid(
  num_terms = seq(5, 50, by = 5),
  prod_degree = c(1, 2)
)

# mars_grid |>
#   kable(
#     caption = "MARS Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
set.seed(123)

mars_start_time <- Sys.time()

mars_tune <- tune_grid(
  mars_workflow,
  resamples = linear_folds,
  grid = mars_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

mars_end_time <- Sys.time()

mars_training_time <- as.numeric(
  difftime(mars_end_time, mars_start_time, units = "secs")
)
mars_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top MARS Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top MARS Tuning Results by RMSE
num_terms prod_degree .metric .estimator mean n std_err .config
35 2 rmse standard 0.1293 15 0.0015 pre0_mod14_post0
40 2 rmse standard 0.1294 15 0.0014 pre0_mod16_post0
50 2 rmse standard 0.1295 15 0.0016 pre0_mod20_post0
45 2 rmse standard 0.1295 15 0.0016 pre0_mod18_post0
30 2 rmse standard 0.1297 15 0.0015 pre0_mod12_post0
25 2 rmse standard 0.1315 15 0.0014 pre0_mod10_post0
20 2 rmse standard 0.1341 15 0.0013 pre0_mod08_post0
15 1 rmse standard 0.1353 15 0.0010 pre0_mod05_post0
15 2 rmse standard 0.1366 15 0.0010 pre0_mod06_post0
10 2 rmse standard 0.1380 15 0.0011 pre0_mod04_post0
best_mars <- select_best(
  mars_tune,
  metric = "rmse"
)

best_mars |>
  kable(
    caption = "Best MARS Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best MARS Tuning Parameters
num_terms prod_degree .config
35 2 pre0_mod14_post0

The best MARS model selected num_terms = 35 and prod_degree = 2.

This means the best version allowed up to 35 model terms and included two-way interaction terms. The top MARS results all used prod_degree = 2, which suggests that interactions between predictors helped the model more than main effects only.

The best cross-validation RMSE was 0.1296. This is close to the linear-family models, but it is not as strong as the SVM radial model.

This result suggests that MARS added some flexibility through piecewise linear terms and interactions, but it still did not capture the PH patterns as well as the radial SVM model.

final_mars_workflow <- finalize_workflow(
  mars_workflow,
  best_mars
)

mars_final_start_time <- Sys.time()

final_mars_fit <- fit(
  final_mars_workflow,
  data = train_linear_model_data
)

mars_final_end_time <- Sys.time()

mars_final_training_time <- as.numeric(
  difftime(
    mars_final_end_time,
    mars_final_start_time,
    units = "secs"
  )
)
mars_validation_predictions <- predict(
  final_mars_fit,
  new_data = valid_linear_model_data) |>
  pull(.pred)
mars_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "MARS",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "num_terms =",
    best_mars$num_terms,
    ", prod_degree =",
    best_mars$prod_degree
  ),
  actual = valid_ph,
  predicted = mars_validation_predictions,
  training_time_seconds = mars_final_training_time
)
MARS Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Nonlinear-family Nonlinear MARS 2052 515 num_terms = 35 , prod_degree = 2 0.1212 0.0907 0.5013 0.3193

The final MARS model used:

  • num_terms: 35
  • prod_degree: 2

Its validation performance was:

  • RMSE: 0.1212
  • MAE: 0.0907
  • \(R^2\): 0.5013

This result is better than overall linear-family models, but weaker than SVM radial.

The selected prod_degree = 2 means the model used two-way interaction terms. This indicate that some relationships between predictors and PH are not just simple one-variable effects. The model improved by allowing predictors to work together.

The selected num_terms = 35 means MARS needed a flexible model with many piecewise terms.

It looks like MARS captured more nonlinear structure than linear regression, elastic net, and PLS. This suggests that piecewise linear interactions helped, but they were still not flexible enough to capture the full data pattern.

plot_model_diagnostics(
  obs = valid_ph,
  pred = mars_validation_predictions,
  model_name = "MARS"
)

MARS diagnostic plots also show better behavior than the linear-family models.

In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. But the model still has some over and under predictions

The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread. As we see around the middle predicted PH range some of the predictions with larger residuals (overpredicted), and it the same tendency as we saw in SVM radial diagnostics plots.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = mars_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = mars_result
)

Neural Network Model

Next, I test a neural network model.

Neural networks can capture nonlinear relationships, but they are also more sensitive to preprocessing and tuning. Because of this, I use the same linear-family preprocessing recipe with Yeo-Johnson, centering, scaling, and dummy encoding.

The main tuning parameters are:

  • hidden_units: number of neurons in the hidden layer
  • penalty: regularization strength to reduce overfitting
  • epochs: number of training iterations
nn_spec <- mlp(
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "nnet",
    trace = FALSE,
    MaxNWts = 5000
  )
nn_workflow <- workflow() |>
  add_recipe(linear_recipe) |>
  add_model(nn_spec)
nn_grid <- expand_grid(
  hidden_units = c(3, 5, 7, 10),
  penalty = c(0.0001, 0.001, 0.01, 0.1),
  epochs = c(100, 200, 500)
)

# nn_grid |>
#   kable(
#     caption = "Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
set.seed(123)

nn_start_time <- Sys.time()

nn_tune <- tune_grid(
  nn_workflow,
  resamples = linear_folds,
  grid = nn_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

nn_end_time <- Sys.time()

nn_training_time <- as.numeric(
  difftime(nn_end_time, nn_start_time, units = "secs")
)
nn_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top Neural Network Tuning Results by RMSE
hidden_units penalty epochs .metric .estimator mean n std_err .config
10 0.10 100 rmse standard 0.1264 15 0.0017 pre0_mod46_post0
5 0.10 500 rmse standard 0.1283 15 0.0010 pre0_mod24_post0
3 0.01 500 rmse standard 0.1292 15 0.0013 pre0_mod09_post0
7 0.10 200 rmse standard 0.1292 15 0.0010 pre0_mod35_post0
5 0.10 200 rmse standard 0.1295 15 0.0015 pre0_mod23_post0
3 0.10 500 rmse standard 0.1295 15 0.0013 pre0_mod12_post0
7 0.10 500 rmse standard 0.1295 15 0.0019 pre0_mod36_post0
5 0.01 100 rmse standard 0.1306 15 0.0014 pre0_mod19_post0
7 0.01 100 rmse standard 0.1307 15 0.0017 pre0_mod31_post0
5 0.01 200 rmse standard 0.1309 15 0.0011 pre0_mod20_post0
best_nn <- select_best(
  nn_tune,
  metric = "rmse"
)

best_nn |>
  kable(
    caption = "Best Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Neural Network Tuning Parameters
hidden_units penalty epochs .config
10 0.1 100 pre0_mod46_post0

The first neural network tuning grid selected hidden_units = 10, penalty = 0.10, and epochs = 100.

Since hidden_units = 10 and penalty = 0.10 were both at the upper end of the first grid, I will run a new refined grid around that area in following code. This will check if slightly larger network or stronger regularization can improve accuracy.

I also will change the epoch values to the lower range (50, 100, 150) because the first grid did not show improvement from using 200 or 500 epochs.

nn_grid_refined <- expand_grid(
  hidden_units = c(8, 10, 12, 15, 20),
  penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
  epochs = c(50, 100, 150)
)

# nn_grid_refined |>
#   kable(
#     caption = "Refined Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
set.seed(123)

nn_refined_start_time <- Sys.time()

nn_tune_refined <- tune_grid(
  nn_workflow,
  resamples = linear_folds,
  grid = nn_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

nn_refined_end_time <- Sys.time()

nn_refined_training_time <- as.numeric(
  difftime(
    nn_refined_end_time,
    nn_refined_start_time,
    units = "secs"
  )
)
nn_tune_refined |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Refined Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top Refined Neural Network Tuning Results by RMSE
hidden_units penalty epochs .metric .estimator mean n std_err .config
12 0.5 150 rmse standard 0.1232 15 0.0013 pre0_mod45_post0
15 0.5 150 rmse standard 0.1234 15 0.0014 pre0_mod60_post0
12 0.3 150 rmse standard 0.1247 15 0.0014 pre0_mod42_post0
10 0.5 150 rmse standard 0.1251 15 0.0011 pre0_mod30_post0
12 0.5 100 rmse standard 0.1253 15 0.0017 pre0_mod44_post0
15 0.3 150 rmse standard 0.1259 15 0.0014 pre0_mod57_post0
15 0.5 100 rmse standard 0.1259 15 0.0012 pre0_mod59_post0
12 0.2 100 rmse standard 0.1261 15 0.0012 pre0_mod38_post0
15 0.3 100 rmse standard 0.1268 15 0.0013 pre0_mod56_post0
10 0.3 100 rmse standard 0.1273 15 0.0014 pre0_mod26_post0
best_nn_refined <- select_best(
  nn_tune_refined,
  metric = "rmse"
)

best_nn_refined |>
  kable(
    caption = "Best Refined Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Refined Neural Network Tuning Parameters
hidden_units penalty epochs .config
12 0.5 150 pre0_mod45_post0

The refined neural network tuning improved the model compared with the first tuning grid.

The first grid selected:

  • hidden_units: 10
  • penalty: 0.10
  • epochs: 100
  • Cross-validation RMSE: 0.1264

The refined grid selected:

  • hidden_units: 12
  • penalty: 0.50
  • epochs: 150
  • Cross-validation RMSE: 0.1232

This shows that the neural network benefited from a slightly larger hidden layer and stronger regularization. The stronger penalty = 0.50 suggests that the model needed shrinkage to avoid overfitting.

final_nn_workflow <- finalize_workflow(
  nn_workflow,
  best_nn_refined
)
nn_final_start_time <- Sys.time()

final_nn_fit <- fit(
  final_nn_workflow,
  data = train_linear_model_data
)

nn_final_end_time <- Sys.time()

nn_final_training_time <- as.numeric(
  difftime(
    nn_final_end_time,
    nn_final_start_time,
    units = "secs"
  )
)
nn_validation_predictions <- predict(
  final_nn_fit,
  new_data = valid_linear_model_data
) |>
  pull(.pred)
nn_result <- make_and_show_model_result(
  workflow = "Nonlinear-family",
  model_family = "Nonlinear",
  model_name = "Neural Network",
  training_rows = nrow(train_linear_model_data),
  validation_rows = nrow(valid_linear_model_data),
  tuning_parameters = paste(
    "hidden_units =",
    best_nn_refined$hidden_units,
    ", penalty =",
    best_nn_refined$penalty,
    ", epochs =",
    best_nn_refined$epochs
  ),
  actual = valid_ph,
  predicted = nn_validation_predictions,
  training_time_seconds = nn_final_training_time
)
Neural Network Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Nonlinear-family Nonlinear Neural Network 2052 515 hidden_units = 12 , penalty = 0.5 , epochs = 150 0.1226 0.0926 0.4984 1.0357

The refined neural network model improved over the linear-family models, but it did not outperform the SVM radial model.

The final neural network used:

  • hidden_units: 12
  • penalty: 0.5
  • epochs: 150

Its validation performance was:

  • RMSE: 0.1226
  • MAE: 0.0926
  • \(R^2\): 0.4984

This is better than the linear regression, elastic net, and PLS models, which had RMSE values around 0.1305 and \(R^2\) values around 0.425. This suggests that the neural network captured some nonlinear structure that the linear-family models missed.

However, the neural network did not perform as well as the SVM radial model. SVM radial still had lower RMSE and MAE and a higher \(R^2\) on the validation data.

The stronger penalty value, penalty = 0.5, also tells us that regularization was important for the neural network. Without enough regularization, the neural network possibly could overfit training data.

p1 <- plot_model_diagnostics(
  obs = valid_ph,
  pred = nn_validation_predictions,
  model_name = "Neural Network"
)
p1

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = nn_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = nn_result
)

Neural Network with Tree-Family Features Dataset

I also test the neural network using the tree-family feature set.

This is an experimental comparison because the tree-family data keeps the original mnf_flow variable. Neural networks are sensitive to scale and numeric distances, so this version may or may not work better than the linear-family version.

To make this safer, I still use Yeo-Johnson, centering, scaling, and dummy encoding. However, I exclude mnf_flow from Yeo-Johnson because it may represent special process states, not just a regular continuous measurement.

train_tree_nn_model_data <- train_features_tree_family |>
  mutate(ph = train_ph)

valid_tree_nn_model_data <- valid_features_tree_family |>
  mutate(ph = valid_ph)
exclude_from_tree_nn_yj <- c(
  "pressure_setpoint",
  "bowl_setpoint",
  "mnf_flow"
)

tree_nn_yj_vars <- train_features_tree_family |>
  select(where(is.numeric)) |>
  select(-any_of(exclude_from_tree_nn_yj)) |>
  names()

tibble(
  yeo_johnson_variable = tree_nn_yj_vars
) |>
  kable(
    caption = "Tree-Family Variables Selected for Yeo-Johnson in Neural Network"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Tree-Family Variables Selected for Yeo-Johnson in Neural Network
yeo_johnson_variable
carb_volume
fill_ounces
pc_volume
carb_pressure
carb_temp
psc
psc_fill
psc_co2
carb_pressure1
fill_pressure
hyd_pressure1
hyd_pressure2
hyd_pressure3
hyd_pressure4
filler_level
filler_speed
temperature
usage_cont
carb_flow
density
mfr
balling
pressure_vacuum
oxygen_filler
air_pressurer
alch_rel
carb_rel
balling_lvl
tree_nn_recipe <- recipe(
  ph ~ .,
  data = train_tree_nn_model_data
) |>
  step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
tree_nn_recipe <- recipe(
  ph ~ .,
  data = train_tree_nn_model_data
) |>
  step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
tree_nn_grid <- expand_grid(
  hidden_units = c(8, 10, 12, 15, 20),
  penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
  epochs = c(50, 100, 150)
)

# tree_nn_grid |>
#   kable(
#     caption = "Tree-Family Neural Network Tuning Grid",
#     digits = 4
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
tree_nn_folds <- vfold_cv(
  train_tree_nn_model_data,
  v = 5,
  repeats = 3
)
tree_nn_spec <- mlp(
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "nnet",
    trace = FALSE,
    MaxNWts = 5000
  )
tree_nn_workflow <- workflow() |>
  add_recipe(tree_nn_recipe) |>
  add_model(tree_nn_spec)
set.seed(123)

tree_nn_start_time <- Sys.time()

tree_nn_tune <- tune_grid(
  tree_nn_workflow,
  resamples = tree_nn_folds,
  grid = tree_nn_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

tree_nn_end_time <- Sys.time()

tree_nn_training_time <- as.numeric(
  difftime(
    tree_nn_end_time,
    tree_nn_start_time,
    units = "secs"
  )
)
tree_nn_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Tree-Family Neural Network Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top Tree-Family Neural Network Tuning Results by RMSE
hidden_units penalty epochs .metric .estimator mean n std_err .config
15 0.5 150 rmse standard 0.1220 15 0.0017 pre0_mod60_post0
15 0.3 150 rmse standard 0.1234 15 0.0020 pre0_mod57_post0
12 0.5 150 rmse standard 0.1247 15 0.0021 pre0_mod45_post0
12 0.2 150 rmse standard 0.1251 15 0.0019 pre0_mod39_post0
10 0.5 150 rmse standard 0.1253 15 0.0019 pre0_mod30_post0
12 0.3 150 rmse standard 0.1258 15 0.0020 pre0_mod42_post0
15 0.2 100 rmse standard 0.1260 15 0.0018 pre0_mod53_post0
15 0.5 100 rmse standard 0.1261 15 0.0026 pre0_mod59_post0
12 0.5 100 rmse standard 0.1263 15 0.0025 pre0_mod44_post0
15 0.3 100 rmse standard 0.1267 15 0.0019 pre0_mod56_post0
best_tree_nn <- select_best(
  tree_nn_tune,
  metric = "rmse"
)

best_tree_nn |>
  kable(
    caption = "Best Tree-Family Neural Network Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Tree-Family Neural Network Tuning Parameters
hidden_units penalty epochs .config
15 0.5 150 pre0_mod60_post0
final_tree_nn_workflow <- finalize_workflow(
  tree_nn_workflow,
  best_tree_nn
)
tree_nn_final_start_time <- Sys.time()

final_tree_nn_fit <- fit(
  final_tree_nn_workflow,
  data = train_tree_nn_model_data
)

tree_nn_final_end_time <- Sys.time()

tree_nn_final_training_time <- as.numeric(
  difftime(
    tree_nn_final_end_time,
    tree_nn_final_start_time,
    units = "secs"
  )
)
tree_nn_validation_predictions <- predict(
  final_tree_nn_fit,
  new_data = valid_tree_nn_model_data
) |>
  pull(.pred)
tree_nn_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Nonlinear",
  model_name = "Neural Network with Tree-Family Features",
  training_rows = nrow(train_tree_nn_model_data),
  validation_rows = nrow(valid_tree_nn_model_data),
  tuning_parameters = paste(
    "hidden_units =",
    best_tree_nn$hidden_units,
    ", penalty =",
    best_tree_nn$penalty,
    ", epochs =",
    best_tree_nn$epochs
  ),
  actual = valid_ph,
  predicted = tree_nn_validation_predictions,
  training_time_seconds = tree_nn_final_training_time
)
Neural Network with Tree-Family Features Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Tree-family Nonlinear Neural Network with Tree-Family Features 2052 515 hidden_units = 15 , penalty = 0.5 , epochs = 150 0.1432 0.0989 0.366 1.209

I tested two neural network versions.

The first neural network used the nonlinear-family data set. It selected:

  • hidden_units = 12
  • penalty = 0.5
  • epochs = 150

Its validation performance was:

  • RMSE: 0.1226
  • MAE: 0.0926
  • \(R^2\): 0.4984

The second neural network used the tree-family data set. It selected:

  • hidden_units = 15
  • penalty = 0.5
  • epochs = 150

Its validation performance was:

  • RMSE: 0.1432
  • MAE: 0.0989
  • \(R^2\): 0.3660

The regular neural network performed better than the neural network with tree-family dataset. Its RMSE and MAE were lower, and its \(R^2\) was higher.

This means that the tree-family feature version did not improve neural network performance. In fact, it made the neural network worse. This may be because the neural network was more sensitive to how the predictors were represented and scaled.

Overall, the neural network trained/tested with linear family dataset was the better neural network version.

p2 <- plot_model_diagnostics(
  obs = valid_ph,
  pred = tree_nn_validation_predictions,
  model_name = "Neural Network with Tree-Family Features"
)

p1/p2

I also compared the diagnostic plots for the two neural network versions.

The first neural network plot shows that the model captured the general PH trend better than the linear models. Most points follow the diagonal direction, but there is still a noticeable spread around the line. This means the model improved over simple linear models, but still had some larger prediction errors.

The residual plot for the first neural network is mostly centered around zero, but there are still some clear overprediction and underprediction cases. A few residuals are close to -0.50 and above +0.50, which means the model struggled with some validation observations.

The second neural network was trained and tested using the tree-family dataset. Its Predicted vs Observed plot shows a slightly different pattern. The points look more concentrated in the middle and upper PH range, but the overall prediction pattern is weaker.

The residual plot for the second neural network shows that most errors are still close to zero, but there are also a few larger outliers. There are a couple of observations that were severely underpredicted, which can be seen on the left side of the diagnostic plots.

The regular neural network performed better than the neural network with tree-family features. The tree-family feature version did not improve neural network performance and actually produced weaker test results.

This comparison also supports my decision to create separate feature sets for different model families. The same feature engineering approach does not always help every model. As we see, some models benefit from one predictor structure, while other models may perform worse with it.

nonlinear_model_results <- add_model_result(
  results_table = nonlinear_model_results,
  new_result = tree_nn_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = tree_nn_result
)
nonlinear_model_results |>
  arrange(rmse) |>
  kable(
    caption = "Updated Nonlinear Model Comparison",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Updated Nonlinear Model Comparison
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Nonlinear-family Nonlinear SVM Radial 2052 515 cost = 3 , / rbf_sigma = 0.05 0.1112 0.0812 0.5806 0.8783
Nonlinear-family Nonlinear MARS 2052 515 num_terms = 35 , prod_degree = 2 0.1212 0.0907 0.5013 0.3193
Nonlinear-family Nonlinear Neural Network 2052 515 hidden_units = 12 , penalty = 0.5 , epochs = 150 0.1226 0.0926 0.4984 1.0357
Tree-family Nonlinear Neural Network with Tree-Family Features 2052 515 hidden_units = 15 , penalty = 0.5 , epochs = 150 0.1432 0.0989 0.3660 1.2090

The nonlinear model comparison shows that SVM radial performed best in this model family.

SVM radial:

  • RMSE: 0.1112
  • MAE: 0.0812
  • \(R^2\): 0.5806

MARS:

  • RMSE: 0.1212
  • MAE: 0.0907
  • \(R^2\): 0.5013

Neural network with nonlinear-family dataset:

  • RMSE: 0.1226
  • MAE: 0.0926
  • \(R^2\): 0.4984

Neural network with tree-family dataset:

  • RMSE: 0.1432
  • MAE: 0.0989
  • \(R^2\): 0.3660

The SVM with radial basis kernel model had the lowest RMSE and MAE and the highest \(R^2\) among the nonlinear models. This suggests that the radial kernel was better able to capture nonlinear relationships in the beverage production data.

MARS was the second-best nonlinear model by RMSE. It performed slightly better than both neural network versions.

The regular neural network performed better than the neural network trained with tree-family dataset. The regular neural network had lower RMSE, lower MAE, and higher \(R^2\).

The neural network with tree-family features did not improve model performance. Its RMSE increased to 0.1432, and its \(R^2\) dropped to 0.3660. This means that using the tree-family dataset made the neural network weaker in this case.

Overall, SVM radial was the strongest nonlinear model. MARS and the regular neural network also performed better than the linear-family models, but they were not as strong as SVM radial. The nonlinear model results still support the idea that PH prediction depends on nonlinear patterns or interactions that regular linear models do not capture well.

Tree-Based Models

After testing linear, regularized, and nonlinear models, I now test tree-based models.

For these models, I use the tree-family feature set. This version keeps the original mnf_flow variable, because tree-based models can naturally split on different values and ranges.

I will not use math transformations, centering, scaling, or correlation filtering for tree-based models. Tree-based models do not depend on variables being on the same scale, and they can handle nonlinear relationships and interactions directly.

train_tree_model_data <- train_features_tree_family |>
  mutate(ph = train_ph)

valid_tree_model_data <- valid_features_tree_family |>
  mutate(ph = valid_ph)
set.seed(123)

tree_folds <- vfold_cv(
  train_tree_model_data,
  v = 5,
  repeats = 3
)

I will use 5-fold cross-validation repeated 3 times for each models.

tree_model_results <- tibble()

Random Forest

I start the tree-based section with Random Forest.

Random Forest builds many decision trees and averages their predictions. Each tree sees a slightly different version of the data, and each split only considers a random subset of predictors.

The main tuning parameters are:

  • mtry: how many predictors are randomly spread at each split
  • min_n: the minimum number of observations allowed in a terminal node
  • trees: the number of trees in the forest

A smaller min_n allows more detailed trees, while a larger min_n creates smoother predictions. A larger mtry lets each split consider more predictors, but it is possible that it creates trees more similar to each other.

rf_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 750
) |>
  set_mode("regression") |>
  set_engine(
    "ranger",
    importance = "impurity",
    num.threads = 1
  )
rf_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(rf_spec)
set.seed(123)

rf_grid_refined <- expand_grid(
  mtry = seq(8, 24, by = 2),
  min_n = seq(1, 6, by = 1)
)

# rf_grid |>
#   kable(
#     caption = "Random Forest Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
#cl <- start_parallel(cores_to_leave_free = 1)

set.seed(123)

rf_refined_start_time <- Sys.time()

rf_tune_refined <- tune_grid(
  rf_workflow,
  resamples = tree_folds,
  grid = rf_grid_refined,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(save_pred = TRUE)
)

rf_refined_end_time <- Sys.time()

rf_refined_training_time <- as.numeric(
  difftime(
    rf_refined_end_time,
    rf_refined_start_time,
    units = "secs"
  )
)

#stop_parallel(cl)
rf_tune_refined |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Refined Random Forest Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top Refined Random Forest Tuning Results by RMSE
mtry min_n .metric .estimator mean n std_err .config
24 1 rmse standard 0.1052 15 0.0011 pre0_mod49_post0
24 3 rmse standard 0.1053 15 0.0012 pre0_mod51_post0
22 1 rmse standard 0.1054 15 0.0012 pre0_mod43_post0
24 2 rmse standard 0.1054 15 0.0011 pre0_mod50_post0
22 2 rmse standard 0.1054 15 0.0011 pre0_mod44_post0
20 3 rmse standard 0.1054 15 0.0011 pre0_mod39_post0
22 3 rmse standard 0.1055 15 0.0011 pre0_mod45_post0
24 4 rmse standard 0.1055 15 0.0011 pre0_mod52_post0
20 1 rmse standard 0.1055 15 0.0011 pre0_mod37_post0
22 4 rmse standard 0.1055 15 0.0011 pre0_mod46_post0
best_rf_refined <- select_best(
  rf_tune_refined,
  metric = "rmse"
)

best_rf_refined |>
  kable(
    caption = "Best Refined Random Forest Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Refined Random Forest Tuning Parameters
mtry min_n .config
24 1 pre0_mod49_post0

The Random Forest tuning selected:

  • mtry: 24 , meaning that at each split, the model randomly considers 24 predictors.
  • min_n: 1 meaning the model preferred detailed trees.
  • Cross-validation RMSE: 0.1052
final_rf_workflow <- finalize_workflow(
  rf_workflow,
  best_rf_refined
)

rf_final_start_time <- Sys.time()

final_rf_fit <- fit(
  final_rf_workflow,
  data = train_tree_model_data
)

rf_final_end_time <- Sys.time()

rf_final_training_time <- as.numeric(
  difftime(
    rf_final_end_time,
    rf_final_start_time,
    units = "secs"
  )
)
rf_validation_predictions <- predict(
  final_rf_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)
rf_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Tree-Based",
  model_name = "Random Forest",
  training_rows = nrow(train_tree_model_data),
  validation_rows = nrow(valid_tree_model_data),
  tuning_parameters = paste(
    "mtry =",
    best_rf_refined$mtry,
    ", min_n =",
    best_rf_refined$min_n,
    ", trees = 1000"
  ),
  actual = valid_ph,
  predicted = rf_validation_predictions,
  training_time_seconds = rf_refined_training_time
)
Random Forest Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Tree-family Tree-Based Random Forest 2052 515 mtry = 24 , min_n = 1 , trees = 1000 0.0984 0.0706 0.6835 2200.34

On the validation data Random Forest model performed much better compared to previous models.

The final Random Forest model used:

  • mtry: 24
  • min_n: 1
  • trees: 1000

Its validation performance was:

  • RMSE: 0.0984
  • MAE: 0.0706
  • \(R^2\): 0.6835

This is the strongest test performance so far. Compared with the earlier nonlinear models, RF model has lower RMSE and a higher \(R^2\).

Overall, this result suggests that PH prediction depends on nonlinear relationships and interactions that RF model can capture data structure better than the linear-family and earlier nonlinear models.

rf_diagnostiv_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = rf_validation_predictions,
  model_name = "Random Forest"
)
rf_diagnostiv_plots

In the predicted vs observed plot, the points are tighter to the diagonal reference line compared to earlier models. This means the Random Forest predictions are better aligned with the actual PH values.

The residual plot shows that most residuals are centered around zero. There are still few larger residuals, especially in the middle predicted PH range, and most of residual are in the range of -0.25 to 0.25.

Overall, the diagnostic plots support the validation metrics.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = rf_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = rf_result
)

Cubist Model

Next, I test a Cubist model.

Cubist is different from Random Forest because it does not only average many trees. Instead, it creates rule-based groups and fits linear models inside those rules.

This can be useful because PH may behave differently under different production conditions.

The main tuning parameters are:

  • committees: how many Cubist model committees are used
  • neighbors: how many nearby observations are used to adjust predictions

Larger committees can make the model stronger, but at the same time model becomes more complex. The neighbors parameter can help adjust predictions based on nearby observations, which may improve accuracy.

cubist_spec <- cubist_rules(
  committees = tune(),
  neighbors = tune()
) |>
  set_mode("regression") |>
  set_engine("Cubist")
cubist_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(cubist_spec)
cubist_grid <- expand_grid(
  committees = c(15, 20, 25, 35, 40, 45, 50),
  neighbors = c(0, 3, 5, 7, 9, 11)
)
#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)

#cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)

set.seed(123)

cubist_start_time <- Sys.time()

cubist_tune <- tune_grid(
  cubist_workflow,
  resamples = tree_folds,
  grid = cubist_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(
    save_pred = TRUE,
    verbose = TRUE,
    allow_par = TRUE
  )
)

cubist_end_time <- Sys.time()

cubist_training_time <- as.numeric(
  difftime(
    cubist_end_time,
    cubist_start_time,
    units = "secs"
  )
)

#stopCluster(cl)
#registerDoSEQ()
cubist_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top Cubist Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top Cubist Tuning Results by RMSE
committees neighbors .metric .estimator mean n std_err .config
50 5 rmse standard 0.1003 15 0.0009 pre0_mod39_post0
35 5 rmse standard 0.1004 15 0.0010 pre0_mod21_post0
45 5 rmse standard 0.1005 15 0.0009 pre0_mod33_post0
40 5 rmse standard 0.1006 15 0.0009 pre0_mod27_post0
50 7 rmse standard 0.1008 15 0.0011 pre0_mod40_post0
25 5 rmse standard 0.1008 15 0.0010 pre0_mod15_post0
35 7 rmse standard 0.1009 15 0.0011 pre0_mod22_post0
45 7 rmse standard 0.1009 15 0.0011 pre0_mod34_post0
40 7 rmse standard 0.1011 15 0.0011 pre0_mod28_post0
50 3 rmse standard 0.1012 15 0.0008 pre0_mod38_post0
best_cubist <- select_best(
  cubist_tune,
  metric = "rmse"
)

best_cubist |>
  kable(
    caption = "Best Cubist Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Cubist Tuning Parameters
committees neighbors .config
50 5 pre0_mod39_post0
final_cubist_workflow <- finalize_workflow(
  cubist_workflow,
  best_cubist
)

cubist_final_start_time <- Sys.time()

final_cubist_fit <- fit(
  final_cubist_workflow,
  data = train_tree_model_data
)

cubist_final_end_time <- Sys.time()

cubist_final_training_time <- as.numeric(
  difftime(
    cubist_end_time,
    cubist_start_time,
    units = "secs"
  )
)
cubist_validation_predictions <- predict(
  final_cubist_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)
cubist_result <- make_and_show_model_result(
  workflow = "Tree-family",
  model_family = "Tree-Based",
  model_name = "Cubist",
  training_rows = nrow(train_tree_model_data),
  validation_rows = nrow(valid_tree_model_data),
  tuning_parameters = paste(
    "committees =",
    best_cubist$committees,
    ", neighbors =",
    best_cubist$neighbors
  ),
  actual = valid_ph,
  predicted = cubist_validation_predictions,
  training_time_seconds = cubist_final_training_time
)
Cubist Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Tree-family Tree-Based Cubist 2052 515 committees = 50 , neighbors = 5 0.0996 0.0682 0.665 393.0181

The Cubist model also performed very well on the validation data.

The final Cubist model used:

  • committees: 50
  • neighbors: 5

Its validation performance was:

  • RMSE: 0.0996
  • MAE: 0.0682
  • \(R^2\): 0.6650

The selected committees = 50 means Cubist performed better with combining many rule-based models. The selected neighbors = 5 means it also used nearby observations to adjust predictions, which probably helped improve local accuracy. Cubist is a strong tree/rule-based model for this dataset and is competitive with RF.

cubist_diagnostic_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = cubist_validation_predictions,
  model_name = "Cubist"
)
cubist_diagnostic_plots

In the predicted vs observed plot, the points follow the diagonal reference line closely. The spread around the line is present with several over and under predicted values, but it is tighter than most earlier models.
The residual plot also looks good overall. Majority of residuals are centered around zero, but there are several values in the midrange being over and under predicted.

Compared with Random Forest, Cubist has slightly higher RMSE and lower \(R^2\), but slightly better MAE. The diagnostic plot supports that idea because many predictions are close to the actual values, even though a few larger errors remain.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = cubist_result
)

XGBoost model

For the XGBoost tuning grid, here I will keep the grid smaller than the full possible tuning space.

I will use tree_depth = 2 and 3 because the dataset is not very large, and very deep trees might overfit. I will use learning rates of 0.03 and 0.05 because they allow gradual boosting without making the model too slow. I test trees = 500 and 1000 to compare a medium and larger number of boosting rounds.

I will keep sample_size = 0.80 fixed for the first pass to reduce the grid size. Also I will tune mtry, min_n, and loss_reduction because they affect how complex or conservative the boosted trees become.

xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  min_n = tune(),
  sample_size = tune(),
  mtry = tune()
) |>
  set_mode("regression") |>
  set_engine(
    "xgboost",
    nthread = 1
  )
xgb_workflow <- workflow() |>
  add_formula(ph ~ .) |>
  add_model(xgb_spec)
xgb_grid <- expand_grid(
  trees = c(1000),
  tree_depth = c(2, 3, 4),
  learn_rate = c(0.03, 0.05, 0.07),
  loss_reduction = c(0, 0.01),
  min_n = c(2, 5),
  sample_size = c(0.80),
  mtry = c(16, 24, 28)
)

# xgb_grid |>
#   kable(
#     caption = "XGBoost Tuning Grid"
#   ) |>
#   kable_styling(
#     full_width = FALSE,
#     position = "center"
#   )
#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)

xgb_start_time <- Sys.time()
##cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)

set.seed(123)



xgb_tune <- tune_grid(
  xgb_workflow,
  resamples = tree_folds,
  grid = xgb_grid,
  metrics = metric_set(rmse, mae, rsq),
  control = control_grid(
    save_pred = TRUE,
    verbose = TRUE,
    allow_par = TRUE
  )
)


xgb_end_time <- Sys.time()

xgb_training_time <- as.numeric(
  difftime(
    xgb_end_time,
    xgb_start_time,
    units = "secs"))
  
#stopCluster(cl)
#registerDoSEQ()
xgb_tune |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice(1:10) |>
  kable(
    caption = "Top XGBoost Tuning Results by RMSE",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top XGBoost Tuning Results by RMSE
mtry trees min_n tree_depth learn_rate loss_reduction sample_size .metric .estimator mean n std_err .config
24 1000 2 4 0.05 0 0.8 rmse standard 0.1038 15 0.0012 pre0_mod051_post0
28 1000 2 4 0.07 0 0.8 rmse standard 0.1040 15 0.0011 pre0_mod089_post0
24 1000 5 4 0.05 0 0.8 rmse standard 0.1041 15 0.0011 pre0_mod069_post0
24 1000 2 4 0.07 0 0.8 rmse standard 0.1042 15 0.0012 pre0_mod053_post0
28 1000 2 4 0.03 0 0.8 rmse standard 0.1043 15 0.0011 pre0_mod085_post0
24 1000 5 4 0.03 0 0.8 rmse standard 0.1043 15 0.0010 pre0_mod067_post0
28 1000 5 4 0.05 0 0.8 rmse standard 0.1044 15 0.0011 pre0_mod105_post0
28 1000 2 4 0.05 0 0.8 rmse standard 0.1044 15 0.0010 pre0_mod087_post0
24 1000 2 4 0.03 0 0.8 rmse standard 0.1045 15 0.0011 pre0_mod049_post0
24 1000 5 4 0.07 0 0.8 rmse standard 0.1045 15 0.0011 pre0_mod071_post0
best_xgb <- select_best(
  xgb_tune,
  metric = "rmse"
)

best_xgb |>
  kable(
    caption = "Best XGBoost Tuning Parameters",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best XGBoost Tuning Parameters
mtry trees min_n tree_depth learn_rate loss_reduction sample_size .config
24 1000 2 4 0.05 0 0.8 pre0_mod051_post0

The grid tested:

  • trees: 1000

  • tree_depth: 2, 3, 4

  • learn_rate: 0.03, 0.05, 0.07

  • loss_reduction: 0, 0.01

  • min_n: 2, 5

  • sample_size: 0.80

  • mtry: 16, 24, 28

The best XGBoost tuning result selected:

  • mtry: 24
  • trees: 1000
  • min_n: 2
  • tree_depth: 4
  • learn_rate: 0.05
  • loss_reduction: 0
  • sample_size: 0.80

This means the best XGBoost model used 1000 boosted trees with moderate tree depth. A tree_depth of 4 allows the model to capture interactions between predictors, but it is not extremely deep.

The selected learn_rate = 0.05 means the model learned gradually instead of making large updates at each boosting step. This is usually safer because each tree only makes a small contribution to the final model.

The selected min_n = 2 means the model allowed fairly detailed splits. The loss_reduction = 0 value means the model did not require extra improvement before making a split, so it was allowed to split more freely when useful.

The selected sample_size = 0.80 means each boosting iteration used a random 80% of the training rows.
I used row sampling because the dataset has some imbalance in brand_code, especially Brand B, which makes up almost half of the observations. The row sampling does not fully fix brand imbalance, but it adds randomness during training and can reduce the chance that the model depends too heavily on one exact training sample.
This randomness can help reduce overfitting and make the boosted trees more stable.

Overall, the selected tuning parameters show that XGBoost benefited from a moderately flexible model: enough trees, moderate depth, gradual learning, small terminal nodes, and row sampling.

# xgb_tune |>
#   collect_metrics() |>
#   filter(.metric == "rmse") |>
#   group_by(tree_depth, learn_rate) |>
#   summarise(
#     best_rmse = min(mean),
#     .groups = "drop"
#   ) |>
#   ggplot(aes(
#     x = factor(tree_depth),
#     y = factor(learn_rate),
#     fill = best_rmse
#   )) +
#   geom_tile() +
#   geom_text(aes(label = round(best_rmse, 4))) +
#   labs(
#     title = "XGBoost Best RMSE by Tree Depth and Learning Rate",
#     x = "Tree Depth",
#     y = "Learning Rate",
#     fill = "RMSE"
#   )
final_xgb_workflow <- finalize_workflow(
  xgb_workflow,
  best_xgb
)

xgb_final_start_time <- Sys.time()

final_xgb_fit <- fit(
  final_xgb_workflow,
  data = train_tree_model_data
)

xgb_final_end_time <- Sys.time()

xgb_final_training_time <- as.numeric(
  difftime(
    xgb_final_end_time,
    xgb_final_start_time,
    units = "secs"
  )
)
xgb_validation_predictions <- predict(
  final_xgb_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)
xgb_result_short <- xgb_result |>
  mutate(
    training_time_minutes = training_time_seconds / 60
  ) |>
  select(
    workflow,
    model_family,
    model_name,
    training_rows,
    validation_rows,
    rmse,
    mae,
    rsq,
    training_time_minutes
  )

xgb_result_short |>
  kable(
    caption = "XGBoost Validation Performance",
    digits = 4,
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed", "responsive")
  )
XGBoost Validation Performance
workflow model_family model_name training_rows validation_rows rmse mae rsq training_time_minutes
Tree-family Tree-Based XGBoost 2052 515 0.0964 0.0705 0.6852 10.8502
xgb_selected_parameters <- xgb_result |>
  select(
    model_name,
    tuning_parameters
  )

xgb_selected_parameters |>
  kable(
    caption = "Selected XGBoost Tuning Parameters",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    bootstrap_options = c("striped", "hover", "condensed", "responsive")
  )
Selected XGBoost Tuning Parameters
model_name tuning_parameters
XGBoost trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24

The final XGBoost model used:

  • trees = 1000
  • tree_depth = 4
  • learn_rate = 0.05
  • loss_reduction = 0
  • min_n = 2
  • sample_size = 0.80
  • mtry = 24

Its validation performance is:

  • RMSE: 0.0964
  • MAE: 0.0705
  • \(R^2\): 0.6852

This is the best individual model result so far. XGBoost has the lowest validation RMSE and \(R^2\) among the individual models

The model used enough trees to learn the pattern, a moderate tree depth to capture nonlinear relationships, and a learning rate of 0.05 so the model could improve gradually. The sample_size = 0.80 added some randomness during training, which can help reduce overfitting.

xgb_diagn_plots <- plot_model_diagnostics(
  obs = valid_ph,
  pred = xgb_validation_predictions,
  model_name = "XGBoost"
)
xgb_diagn_plots

The XGBoost diagnostic plots show that the model performed well overall on the validation data.

In the Predicted vs Observed plot, most points follow the diagonal line, which means the predicted PH values were generally close to the observed PH values. There is still some spread, but the overall pattern is strong and shows that the model captured the main PH trend.

The Residuals vs Predicted plot also looks reasonably balanced around zero. This means the model did not strongly overpredict or underpredict across the whole predicted PH range.

There are still a few larger residuals, especially some points below -0.25 and above 0.25.

Overall, these plots support using XGBoost as the final individual model. The model is not perfect, but it shows a good balance between prediction accuracy and residual behavior.

tree_model_results <- add_model_result(
  results_table = tree_model_results,
  new_result = xgb_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = xgb_result
)

Final Model Comparison Summary

final_model_comparison <- all_model_results |>
  mutate(
    training_time_minutes = training_time_seconds / 60
  ) |>
  select(
    `Dataset type` = workflow,
    `Model Family` = model_family,
    `Model Name` = model_name,
    `Tuning Parameters` = tuning_parameters,
    RMSE = rmse,
    MAE = mae,
    `R-squared` = rsq
  ) |>
  arrange(RMSE)

final_model_comparison |>
  kable(
    caption = "Final Model Comparison on Validation Data",
    digits = 4,
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  ) |>
  column_spec(
    column = 4,
    width = "16em",
    extra_css = "white-space: normal; word-break: break-word;"
  )
Final Model Comparison on Validation Data
Dataset type Model Family Model Name Tuning Parameters RMSE MAE R-squared
Tree-family Tree-Based XGBoost trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24 0.0964 0.0705 0.6852
Tree-family Tree-Based Random Forest mtry = 24 , min_n = 1 , trees = 1000 0.0984 0.0706 0.6835
Tree-family Tree-Based Cubist committees = 50 , neighbors = 5 0.0996 0.0682 0.6650
Nonlinear-family Nonlinear SVM Radial cost = 3 , / rbf_sigma = 0.05 0.1112 0.0812 0.5806
Nonlinear-family Nonlinear MARS num_terms = 35 , prod_degree = 2 0.1212 0.0907 0.5013
Nonlinear-family Nonlinear Neural Network hidden_units = 12 , penalty = 0.5 , epochs = 150 0.1226 0.0926 0.4984
Linear-family / PLS keeps correlated predictors Linear / Regularized PLS Regression num_comp = 14 0.1286 0.0987 0.4403
Linear-family Linear / Regularized Linear Regression None 0.1305 0.1007 0.4244
Linear-family Linear / Regularized Elastic Net penalty = 0.00848342898244073 , mixture = 0 0.1305 0.1007 0.4250
Tree-family Nonlinear Neural Network with Tree-Family Features hidden_units = 15 , penalty = 0.5 , epochs = 150 0.1432 0.0989 0.3660

The final validation comparison shows a clear difference between the model families.

Linear and Regularized Models

The linear and regularized models were the weakest group overall.

  • Linear regression and elastic net produced almost identical results:
    • RMSE: about 0.1305
    • MAE: about 0.1007
    • \(R^2\): about 0.425
  • PLS regression performed slightly better:
    • RMSE: 0.1286
    • MAE: 0.0987
    • \(R^2\): 0.4403

For PLS, I kept the correlated predictors because PLS is designed to work with correlated variables by creating components from them. This helped PLS perform slightly better than regular linear regression and elastic net.

However, even with this adjustment, the linear-family models still left a lot of PH variation unexplained.

Nonlinear Models

The nonlinear models improved over the linear-family models.

  • SVM with radial basis kernel was the best model in this group:
    • RMSE: 0.1112
    • MAE: 0.0812
    • \(R^2\): 0.5806
  • MARS and neural networks also improved over the linear models, but they were weaker than SVM radial.

This tells us that nonlinear relationships are playing larger role in the PH explanation. However, not all nonlinear models were able to capture those relationships equally well.

Tree-Based Models

The tree-based models performed best overall.

  • XGBoost had the lowest RMSE:
    • RMSE: 0.0964
    • MAE: 0.0705
    • \(R^2\): 0.6852
  • Random Forest was very close:
    • RMSE: 0.0984
    • MAE: 0.0706
    • \(R^2\): 0.6835
  • Cubist also performed strongly:
    • RMSE: 0.0996
    • MAE: 0.0682
    • \(R^2\): 0.6650

Cubist had the lowest MAE among the individual tree-based models, which means it had slightly smaller average absolute errors. XGBoost and Random Forest performed better by RMSE and \(R^2\).

cubist_diagnostic_plots/rf_diagnostiv_plots/xgb_diagn_plots

Since tree-based models produced fairly close results, it is worth to notice that these diagnostic plots represent a better understanding of which model is preformed better overall.

  • Cubist plots:
    Predicted vs Observed plot show that Cubist model had slightly uneven balance in a spread. There are few highly over and underpredicted values.
    From Residuals vs Predicted plot it is noticeable that that Cubist model had some larger over and under estimated predictions
    Cubist had the best MAE among the tree-based models, which means it made smaller average absolute errors. However, the diagnostic plots show that it still had a few larger misses, there are few residuals in 0.50 and -0.50 ranges.

  • Random Forest:
    There is a visible pattern in the residuals. For predicted PH values below about 8.65, the model appears to have more negative residuals, meaning it tends to overpredict in that range. For predicted PH values above about 8.65, there is more tendency toward positive residuals, meaning it tends to underpredict some higher PH values.
    The residuals vs predicted plot shows this pattern more clearly. This suggests that Random Forest performs well overall, but it may still pull some predictions toward the middle range

  • XGBoost:
    The XGBoost diagnostic plots look the most balanced overall.
    In the Predicted vs Observed plot, the points are tightly grouped around the diagonal line, and the spread looks more even compared with Cubist and Random Forest.
    The residuals vs predicted plot also looks more balanced. The residuals are mostly centered around zero with kinda oval shape, and the spread appears more even across the predicted PH range. Compared with Cubist and Random Forest, XGBoost shows less spread of the residuals.

    Because of this, XGBoost looks like the strongest individual tree-based model from the diagnostic plots. It also had the best RMSE in the final comparison table, which supports this choice.

Overall, based on both the validation metrics and the diagnostic plots, I would select XGBoost as the best individual model.

Tuning Process

I did not rely on only one tuning attempt for the stronger models. For several models, I used a staged tuning process:

  • First, I used a wider tuning grid to find the general region where the model performed best (for some models it took a longer computing time).
  • Then, I refined the grid around the best-performing parameters area.
  • This was done for models such as SVM radial, Random Forest, neural networks, and XGBoost.

For example:

  • Random Forest was first tuned across a wider range of mtry and min_n values.
  • After the first tuning results showed me that stronger models used larger mtry values and smaller min_n values, so after that I refined the search around that region.
  • XGBoost was also tuned using same logic because it has many parameters and a full tuning grid can become very expensive to keep and run here every time.

Some intermediate tuning attempts are not included in the final RMD file and knitted reports.

For the final report, I kept only the most important tuning results and the final validation comparison. Otherwise including every tuning attempt would make the report too long and would also increase computing and knitting time.

I also considered saving model metrics after each tuning attempt, but I decided not to do that in the final workflow. Since the modeling process already included many models and tuning steps, saving too many intermediate results could make the project harder to follow and easier to mix up. Instead, I focused on keeping the final model comparison clean and reproducible.

Final Takeaway

Based on the validation results:

  • XGBoost is the best individual model by RMSE.
  • Random Forest is very close and is a strong stable benchmark.
  • Cubist is also competitive and has the best MAE among the individual tree-based models.

So, the tree-based models are the strongest individual model family for this PH prediction task.

Important Predictors from XGBoost and Random Forest

After selecting the best individual models, I also review which predictors were most important.

This part is important because the goal is not only to predict PH, but also to understand which manufacturing and process variables helped the models make predictions.

I use XGBoost variable importance because XGBoost was the strongest individual model by RMSE. I also check Random Forest variable importance because Random Forest had very close validation performance and gives another tree-based view of important predictors.

These importance values should be read as predictive importance, not direct causation. In other words, a variable can help predict PH, but that does not automatically mean it directly causes PH to change.

XGBoost important predictors

final_xgb_engine <- final_xgb_fit |>
  extract_fit_parsnip() |>
  pluck("fit")

xgb_importance <- xgb.importance(
  model = final_xgb_engine
)

xgb_importance_table <- xgb_importance |>
  as_tibble() |>
  slice(1:20)

xgb_importance_table |>
  kable(
    caption = "Top 20 XGBoost Important Predictors",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top 20 XGBoost Important Predictors
Feature Gain Cover Frequency
mnf_flow 0.1613 0.0345 0.0280
usage_cont 0.0680 0.0446 0.0451
oxygen_filler 0.0510 0.0469 0.0489
brand_codeC 0.0508 0.0093 0.0069
temperature 0.0453 0.0281 0.0323
bowl_setpoint 0.0436 0.0074 0.0075
balling_lvl 0.0430 0.0323 0.0344
pressure_vacuum 0.0418 0.0425 0.0374
alch_rel 0.0387 0.0357 0.0292
density 0.0358 0.0324 0.0265
air_pressurer 0.0350 0.0257 0.0280
carb_pressure1 0.0348 0.0476 0.0424
filler_speed 0.0343 0.0510 0.0432
carb_rel 0.0264 0.0216 0.0264
carb_flow 0.0236 0.0500 0.0431
carb_volume 0.0204 0.0344 0.0384
hyd_pressure3 0.0202 0.0207 0.0220
balling 0.0190 0.0232 0.0231
pc_volume 0.0180 0.0360 0.0431
filler_level 0.0172 0.0307 0.0323
xgb_importance_table |>
  ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 XGBoost Important Predictors",
    x = "Predictor",
    y = "Gain"
  )

Random Forest important predictors

final_rf_engine <- final_rf_fit |>
  extract_fit_parsnip() |>
  pluck("fit")

rf_importance_table <- tibble(
  variable = names(final_rf_engine$variable.importance),
  importance = as.numeric(final_rf_engine$variable.importance)
) |>
  arrange(desc(importance)) |>
  slice(1:20)

rf_importance_table |>
  kable(
    caption = "Top 20 Random Forest Important Predictors",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Top 20 Random Forest Important Predictors
variable importance
mnf_flow 11.9829
usage_cont 3.4860
temperature 3.0087
oxygen_filler 2.9751
carb_rel 2.9246
alch_rel 2.8036
balling_lvl 2.6853
pressure_vacuum 2.3379
air_pressurer 2.0996
carb_pressure1 1.9842
density 1.9279
brand_code 1.8040
bowl_setpoint 1.6859
carb_flow 1.6133
filler_speed 1.5604
filler_level 1.4296
balling 1.3673
hyd_pressure3 1.3143
pc_volume 1.1827
carb_volume 1.1820
rf_importance_table |>
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Random Forest Important Predictors",
    x = "Predictor",
    y = "Importance"
  )

xgb_importance_ranked <- xgb_importance |>
  as_tibble() |>
  transmute(
    variable = Feature,
    xgb_gain = Gain
  ) |>
  arrange(desc(xgb_gain)) |>
  mutate(
    xgb_rank = row_number()
  )

rf_importance_ranked <- tibble(
  variable = names(final_rf_engine$variable.importance),
  rf_importance = as.numeric(final_rf_engine$variable.importance)
) |>
  arrange(desc(rf_importance)) |>
  mutate(
    rf_rank = row_number()
  )

importance_rank_comparison <- full_join(
  xgb_importance_ranked,
  rf_importance_ranked,
  by = "variable"
) |>
  mutate(
    appears_in_both = ifelse(
      !is.na(xgb_rank) & !is.na(rf_rank),
      "Yes",
      "No"
    )
  ) |>
  rowwise() |>
  mutate(
    average_rank = mean(
      c(xgb_rank, rf_rank),
      na.rm = TRUE
    )
  ) |>
  ungroup() |>
  arrange(average_rank)

XGBoost and RF models comparison

importance_rank_display <- importance_rank_comparison |>
  select(
    variable,
    xgb_rank,
    rf_rank,
    average_rank,
    appears_in_both
  ) |>
  slice(1:10) |>
  mutate(
    xgb_rank = ifelse(is.na(xgb_rank), "Not ranked", as.character(xgb_rank)),
    rf_rank = ifelse(is.na(rf_rank), "Not ranked", as.character(rf_rank))
  )

importance_rank_display |>
  kable(
    caption = "Rank-Based Comparison of Top-10 Important Predictors",
    digits = 2
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Rank-Based Comparison of Top-10 Important Predictors
variable xgb_rank rf_rank average_rank appears_in_both
mnf_flow 1 1 1.0 Yes
usage_cont 2 2 2.0 Yes
oxygen_filler 3 4 3.5 Yes
brand_codeC 4 Not ranked 4.0 No
temperature 5 3 4.0 Yes
balling_lvl 7 7 7.0 Yes
alch_rel 9 6 7.5 Yes
pressure_vacuum 8 8 8.0 Yes
bowl_setpoint 6 13 9.5 Yes
carb_rel 14 5 9.5 Yes

XGBoost and Random Forest both showed that some predictors were much more important than others, but the importance values are not on the same scale.

For XGBoost, mnf_flow had the highest Gain value at 0.1613. This was much larger than next predictor, usage_cont, which had Gain = 0.0680. This tells us that mnf_flow was the strongest predictor in the XGBoost model.

Random Forest showed the same pattern. mnf_flow had the highest importance value at 11.9829, while the next predictor, usage_cont, had importance of 3.4860. So in both models, mnf_flow was clearly the most important predictor for PH.

Both models also found similar process-related variables important, including:

  • usage_cont
  • oxygen_filler
  • temperature
  • balling_lvl
  • pressure_vacuum
  • alch_rel
  • carb_rel
  • carb_pressure1
  • filler_speed
  • carb_flow

This gives more confidence that these variables are useful predictive factors because they appear in both strong models.

There is one difference is how brand information appears. XGBoost shows brand_codeC as an important predictor with Gain = 0.0508, while Random Forest shows the full brand_code variable with importance = 1.8040. This difference is probably because XGBoost used dummy variables for brand_code and Random Forest handled the brand variable as is and didn’t split it into dummy variables.

Overall, the importance results suggest that PH prediction is mostly related to mnf_flow (I aasume it is manufacturing flow), usage/process conditions, oxygen and filling behavior, temperature, pressure-related variables, carbonation/alcohol-related variables, and brand differences.

I think it is important to note that these are predictive factors, not direct causal conclusions. They tell us which variables helped the models predict PH, but not necessarily which variables directly cause PH to change.

What happens if we change variable’s values?

After selecting XGBoost as the final individual model, I wanted to see how predicted PH changes when models important predictors are changed.

I wanted to see more precise changes for each brand’s average predicted PH, so I calculated the average prediction changes separately by brand_code.

I think this is more useful than using only one global average changes, because I think each brand may have different process behavior. In other words, the same variable change may affect the model predictions differently for each brand type.

For each brand, I changed one XGBoost’s important predictor by 10% while keeping the other predictors the same. Then I compared the new predicted PH with the original predicted PH.

This does not prove that changing one variable directly causes PH to change. It only shows how the trained XGBoost model responds to these changes.

# Baseline row for scenario testing
# I use the validation row closest to the median predicted PH

valid_tree_predictions_for_scenario <- predict(
  final_xgb_fit,
  new_data = valid_tree_model_data
) |>
  pull(.pred)

median_predicted_ph <- median(valid_tree_predictions_for_scenario)

baseline_row_index <- which.min(
  abs(valid_tree_predictions_for_scenario - median_predicted_ph)
)

baseline_row <- valid_tree_model_data |>
  slice(baseline_row_index) |>
  select(-ph)

baseline_prediction <- predict(
  final_xgb_fit,
  new_data = baseline_row
) |>
  pull(.pred)

baseline_prediction
[1] 8.533664
make_brand_numeric_scenario <- function(variable_name, change_percent = 0.10) {
  
  # Validation predictors only
  valid_x <- valid_tree_model_data |>
    select(-ph)
  
  # Original predictions
  original_predictions <- predict(
    final_xgb_fit,
    new_data = valid_x
  ) |>
    pull(.pred)
  
  # Decreased scenario
  decreased_x <- valid_x |>
    mutate(
      !!variable_name := .data[[variable_name]] * (1 - change_percent)
    )
  
  decreased_predictions <- predict(
    final_xgb_fit,
    new_data = decreased_x
  ) |>
    pull(.pred)
  
  # Increased scenario
  increased_x <- valid_x |>
    mutate(
      !!variable_name := .data[[variable_name]] * (1 + change_percent)
    )
  
  increased_predictions <- predict(
    final_xgb_fit,
    new_data = increased_x
  ) |>
    pull(.pred)
  
  # Combine and summarize by brand
  bind_rows(
    tibble(
      brand_code = valid_x$brand_code,
      variable_changed = variable_name,
      scenario = paste0(variable_name, " decreased by 10%"),
      original_predicted_ph = original_predictions,
      scenario_predicted_ph = decreased_predictions,
      change_from_original = scenario_predicted_ph - original_predicted_ph
    ),
    tibble(
      brand_code = valid_x$brand_code,
      variable_changed = variable_name,
      scenario = paste0(variable_name, " increased by 10%"),
      original_predicted_ph = original_predictions,
      scenario_predicted_ph = increased_predictions,
      change_from_original = scenario_predicted_ph - original_predicted_ph
    )
  ) |>
    group_by(
      brand_code,
      variable_changed,
      scenario
    ) |>
    summarise(
      validation_rows = n(),
      average_original_ph = mean(original_predicted_ph, na.rm = TRUE),
      average_scenario_ph = mean(scenario_predicted_ph, na.rm = TRUE),
      average_change_from_original = mean(change_from_original, na.rm = TRUE),
      min_change = min(change_from_original, na.rm = TRUE),
      max_change = max(change_from_original, na.rm = TRUE),
      .groups = "drop"
    ) |>
    arrange(
      variable_changed,
      brand_code,
      scenario
    )
}
scenario_variables <- c(
  "mnf_flow",
  "usage_cont",
  "oxygen_filler",
  "temperature",
  "pressure_vacuum",
  "carb_rel"
)

brand_scenario_results <- map_dfr(
  scenario_variables,
  make_brand_numeric_scenario
)
format_change <- function(x) {
  ifelse(
    x > 0,
    paste0("+", sprintf("%.4f", x)),
    sprintf("%.4f", x)
  )
}
scenario_decrease_table <- brand_scenario_results |>
  filter(str_detect(scenario, "decreased")) |>
  select(
    brand_code,
    variable_changed,
    average_change_from_original
  ) |>
  mutate(
    average_change_from_original = round(average_change_from_original, 4)
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = average_change_from_original
  ) |>
  arrange(variable_changed)

scenario_increase_table <- brand_scenario_results |>
  filter(str_detect(scenario, "increased")) |>
  select(
    brand_code,
    variable_changed,
    average_change_from_original
  ) |>
  mutate(
    average_change_from_original = round(average_change_from_original, 4)
  ) |>
  pivot_wider(
    names_from = brand_code,
    values_from = average_change_from_original
  ) |>
  arrange(variable_changed)
scenario_decrease_display <- scenario_decrease_table |>
  mutate(
    across(
      c(A, B, C, D),
      format_change
    )
  )

scenario_decrease_display |>
  kable(
    caption = "Average Predicted PH Change When Variable Decreased by 10%",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
   # bootstrap_options = c("striped", "hover", "condensed")
  )
Average Predicted PH Change When Variable Decreased by 10%
variable_changed A B C D
carb_rel -0.0475 -0.0218 +0.0038 -0.0585
mnf_flow -0.0047 -0.0047 +0.0107 -0.0038
oxygen_filler +0.0021 +0.0020 +0.0002 +0.0013
pressure_vacuum +0.0014 -0.0019 +0.0032 0.0000
temperature +0.0053 +0.0139 +0.0104 +0.0124
usage_cont +0.0248 +0.0254 +0.0175 +0.0240

The table above shows the average change in predicted PH when each selected predictor was decreased by 10%, separated by brand code.

Positive values mean the XGBoost model predicted higher averafe PH after the variable was decreased by 10%. Negative values mean the model predicted lower average PH after the variable was decreased.

There are few patterns are noticeable:

  • carb_rel had one of the strongest effects. When carb_rel was decreased by 10%, predicted PH decreased for brands A, B, and D. The strongest average decrease was for brand D (-0.0585), followed by brand A (-0.0475). Brand C behaved differently and had a very small increase (+0.0038).

  • usage_cont showed a consistent pattern across all brands. When usage_cont was decreased by 10%, predicted PH increased for all brands. The average increase was between +0.0175 and +0.0254.

  • temperature also showed a consistent pattern. When temperature was decreased by 10%, predicted PH increased for all brands. The increase is not very large, but it is consistent.

  • oxygen_filler and pressure_vacuum had very small mostly positive average effects in this case.

  • mnf_flow had a small average effect, but it did not behave exactly the same for all brands. For brands A, B, and D, decreasing mnf_flow slightly lowered predicted PH. For Brand C, it slightly increased predicted PH.

Overall as I expected, this table shows that XGBoost is not treating every brand the same way . The same 10% change can have different effects depending on the brand. This supports the idea that PH prediction depends on the combination of process variables and brand-level behavior, not only one variable by itself.

And again, these results should still be interpreted as model-based prediction scenarios, not direct cause-and-effect results.

scenario_increase_display <- scenario_increase_table |>
  mutate(
    across(
      c(A, B, C, D),
      format_change
    )
  )

scenario_increase_display |>
  kable(
    caption = "Average Predicted PH Change When Variable Increased by 10%",
    format = "html"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
   # bootstrap_options = c("striped", "hover")
  )
Average Predicted PH Change When Variable Increased by 10%
variable_changed A B C D
carb_rel -0.0062 -0.0274 -0.0251 -0.0097
mnf_flow +0.0106 +0.0094 +0.0042 +0.0140
oxygen_filler -0.0027 -0.0017 -0.0002 -0.0033
pressure_vacuum +0.0017 +0.0030 +0.0111 +0.0021
temperature -0.0349 -0.0566 -0.0201 -0.0451
usage_cont -0.0010 -0.0061 -0.0039 +0.0022

And this table above shows the average change in predicted PH when each selected predictor was increased by 10%, separated by brand codes.

There are also patterns are noticeable:

  • temperature had one of the strongest effects. When temperature was increased by 10%, as we see predicted PH decreased for all brands. The largest average decrease was for brand B (-0.0566), followed by D (-0.0451), brand A (-0.0349), and brand C (-0.0201). This tells us that the model associates higher temperature values with lower predicted PH.

  • carb_rel also showed mostly negative changes. When carb_rel was increased by 10%, predicted PH decreased for all brands. The decrease was strongest for brand B (-0.0274) and brand C (-0.0251).

  • mnf_flow showed a consistent positive pattern. When mnf_flow was increased by 10%, predicted PH increased for all brands. The effect was not very large, but it was consistent across brands. Brand D had the largest average increase (+0.0140).

  • pressure_vacuum had a small positive effect for all brands. The largest average change was for Brand C (+0.0111), while the other brands had smaller changes.

  • oxygen_filler had a very small negative effect across all brands. The changes were close to zero, so this scenario did not strongly move predicted PH on average.

  • usage_cont had small average effects. For Brands A, B, and C, increasing usage_cont slightly decreased predicted PH. For Brand D, it slightly increased predicted PH, but the change was very small (+0.0022).

As we see from both tables, temperature, carb_rel, usage_cont, and mnf_flow showed the clearest average changes in predicted PH.

The temperature pattern was especially clear. Decreasing temperature increased predicted PH for all brands, while increasing temperature decreased predicted PH for all brands.

The mnf_flow pattern was also mostly consistent. Increasing mnf_flow increased predicted PH for all brands, although the average changes were not very large.

The carb_rel variable had stronger negative effects for some brands, especially Brand A and Brand D when decreased, and Brand B and Brand C when increased.

I think that the important point here is that these are not simple one-variable rules. The model response can differ by brand, and real process changes may also affect other variables at the same time. Because of that, these scenario results should be used as a guide for decision-making discussion with process and manufacturing involved experts, not as direct production instructions.

This interaction review changes two important predictors at the same time: mnf_flow and oxygen_filler.

If the predicted PH changes in a simple smooth pattern, then the model response is easier to explain. If the prediction changes differently across combinations of the two variables, that suggests the XGBoost model may be using an interaction between those predictors.

This does not prove a physical interaction in the manufacturing process. It only shows that the model prediction depends on the combination of these input values.

Final Evaluation PH Predictions

The final evaluation predictor set was already prepared in the preprocessing part.

Since I already cleaned, capped, and imputed predictors for the final evaluation dataset eval_features_tree_family, I will not repeat the full preprocessing workflow here.

The only remaining step is model-specific prediction. Since the final XGBoost model was trained as a workflow, the model object handles the same model preparation that was used during training. I only need to confirm that the evaluation predictors have the same columns as the tree-family training predictors, then generate PH predictions.

missing_from_eval <- setdiff(
  names(train_tree_model_data |> select(-ph)),
  names(eval_features_tree_family)
)

extra_in_eval <- setdiff(
  names(eval_features_tree_family),
  names(train_tree_model_data |> select(-ph))
)

missing_from_eval
character(0)
extra_in_eval
character(0)
stopifnot(length(missing_from_eval) == 0)
stopifnot(length(extra_in_eval) == 0)

It confirmed that there are same columns and no missing values.

Now we can do predictions

eval_xgb_predictions <- predict(
  final_xgb_fit,
  new_data = eval_features_tree_family
) |>
  rename(predicted_ph = .pred)
eval_predictions_xgboost <- eval_features_tree_family |>
  mutate(
    row_id = row_number(),
    predicted_ph = eval_xgb_predictions$predicted_ph
  ) |>
  select(
    row_id,
    predicted_ph
  )

eval_predictions_xgboost |>
  head(10) |>
  kable(
    caption = "First 10 Final Evaluation PH Predictions",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
First 10 Final Evaluation PH Predictions
row_id predicted_ph
1 8.5002
2 8.4530
3 8.5082
4 8.6799
5 8.3887
6 8.4879
7 8.5332
8 8.5005
9 8.5930
10 8.7368
write_csv(
  eval_predictions_xgboost,
  "final_evaluation_ph_predictions_xgboost.csv"
)
write_xlsx(
  eval_predictions_xgboost,
  "final_evaluation_ph_predictions_xgboost.xlsx"
)

Final Conclusion

I selected XGBoost as the final individual model for PH prediction.

It had the best validation RMSE among the individual models and showed strong diagnostic performance compared with the other model families. Random Forest and Cubist models had close accuracy metrics, but XGBoost had the best overall validation result.

The most important predictors suggest that PH is connected to manufacturing flow, oxygen/filling behavior, temperature, pressure, carbonation/alcohol-related measurements, and brand-level differences.

The final evaluation data do not contain observed PH values, so I couldn’t calculate model accuracy could on that dataset. I used trained XGBoost model to generate final PH predictions.

The variable importance results should be interpreted as predictive importance, not direct causation.

Optional Bonus Experiment: Ensemble Model Predictions

Note: This section is my additional experiment. It doesn’t have a purpose for selecting the final submitted model.

After comparing the individual models, I also wanted to experiment to see if combining model predictions could improve validation performance.

The idea behind an ensemble is that different models may capture different parts of the PH prediction problem. For example, Random Forest, Cubist, and XGBoost are all tree-based models, but they learn patterns in different ways:

  • Random Forest averages many independent trees.
  • Cubist creates rule-based groups and local linear models.
  • XGBoost builds trees sequentially, where each tree tries to correct previous errors.

Because these models are different, they may not make the same mistakes on the same observations. If one model overpredicts PH for some rows and another model underpredicts those same rows, averaging their predictions might reduce overall error.

For this experiment, I first tested simple average ensembles. Then I tested a weighted ensemble, where stronger models were allowed to receive more weight.

This ensemble section is treated as an experiment because the weights were selected using validation performance. Because of that, I avoid repeatedly changing the weights too many times, so the validation set does not become overused.

For the final modeling approach, I focus on individual models only. This keeps the final model easier to explain, reproduce, and report. Based on the validation results, XGBoost is selected as the best individual model because it had the lowest RMSE and highest \(R^2\) among the individual models tested.

Simple Average Ensemble: Random Forest and Cubist

ensemble_predictions_rf_cubist <- (
  rf_validation_predictions +
  cubist_validation_predictions
) / 2

ensemble_rf_cubist_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Average Ensemble: RF + Cubist",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = "Average of Random Forest and Cubist",
  actual = valid_ph,
  predicted = ensemble_predictions_rf_cubist,
  training_time_seconds = 0
)
Average Ensemble: RF + Cubist Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Ensemble Model Combination Average Ensemble: RF + Cubist 2052 515 Average of Random Forest and Cubist 0.0954 0.0666 0.6937 0

The first ensemble experiment used a simple average of the Random Forest and Cubist predictions.

This ensemble had:

  • RMSE: 0.0954
  • MAE: 0.0666
  • \(R^2\): 0.6937

This result is better than both individual models by RMSE and \(R^2\).

Random Forest had:

  • RMSE: 0.0984
  • MAE: 0.0706
  • \(R^2\): 0.6835

Cubist had:

  • RMSE: 0.0996
  • MAE: 0.0682
  • \(R^2\): 0.6650

The ensemble improved because Random Forest and Cubist do not make exactly the same errors. Random Forest had stronger RMSE and \(R^2\), while Cubist had better MAE. By averaging their predictions, the ensemble combined useful parts of both models.

This is also a good ensemble to report because it is simple and easy to explain. It does not require training another model. It only averages the predictions from two already strong models.

ensemble_model_results <- tibble()

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_result
)

Simple Average Ensemble: Random Forest, Cubist, and XGBoost

ensemble_predictions_rf_cubist_xgb <- (
  rf_validation_predictions +
  cubist_validation_predictions +
  xgb_validation_predictions
) / 3

ensemble_rf_cubist_xgb_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Average Ensemble: RF + Cubist + XGBoost",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = "Average of Random Forest, Cubist, and XGBoost",
  actual = valid_ph,
  predicted = ensemble_predictions_rf_cubist_xgb,
  training_time_seconds = 0
)
Average Ensemble: RF + Cubist + XGBoost Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Ensemble Model Combination Average Ensemble: RF + Cubist + XGBoost 2052 515 Average of Random Forest, Cubist, and XGBoost 0.0941 0.0667 0.7027 0

The second ensemble experiment averaged the predictions from Random Forest, Cubist, and XGBoost.

This ensemble had:

  • RMSE: 0.0941
  • MAE: 0.0667
  • \(R^2\): 0.7027

This improved over the RF + Cubist average ensemble, which had:

  • RMSE: 0.0954
  • MAE: 0.0666
  • \(R^2\): 0.6937

The improvement in RMSE and \(R^2\) suggests that XGBoost added useful information to the ensemble

Validation-Weighted Ensemble: Random Forest, Cubist, and XGBoost

ensemble_weight_grid <- expand_grid(
  rf_weight = seq(0, 1, by = 0.05),
  cubist_weight = seq(0, 1, by = 0.05),
  xgb_weight = seq(0, 1, by = 0.05)
) |>
  filter(abs(rf_weight + cubist_weight + xgb_weight - 1) < 0.000001)

ensemble_weight_results <- ensemble_weight_grid |>
  mutate(
    predictions = pmap(
      list(rf_weight, cubist_weight, xgb_weight),
      function(rf_weight, cubist_weight, xgb_weight) {
        rf_weight * rf_validation_predictions +
          cubist_weight * cubist_validation_predictions +
          xgb_weight * xgb_validation_predictions
      }
    ),
    rmse = map_dbl(
      predictions,
      ~ yardstick::rmse_vec(
        truth = valid_ph,
        estimate = .x
      )
    ),
    mae = map_dbl(
      predictions,
      ~ yardstick::mae_vec(
        truth = valid_ph,
        estimate = .x
      )
    ),
    rsq = map_dbl(
      predictions,
      ~ yardstick::rsq_vec(
        truth = valid_ph,
        estimate = .x
      )
    )
  ) |>
  select(-predictions) |>
  arrange(rmse)
best_weighted_ensemble <- ensemble_weight_results |>
  slice(1)

best_weighted_ensemble |>
  kable(
    caption = "Best Weighted Ensemble Weights",
    digits = 4
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center"
  )
Best Weighted Ensemble Weights
rf_weight cubist_weight xgb_weight rmse mae rsq
0.25 0.3 0.45 0.094 0.067 0.7026
weighted_ensemble_predictions <- 
  best_weighted_ensemble$rf_weight * rf_validation_predictions +
  best_weighted_ensemble$cubist_weight * cubist_validation_predictions +
  best_weighted_ensemble$xgb_weight * xgb_validation_predictions
weighted_ensemble_result <- make_and_show_model_result(
  workflow = "Ensemble",
  model_family = "Model Combination",
  model_name = "Weighted Ensemble: RF + Cubist + XGBoost",
  training_rows = nrow(train_tree_model_data),
  validation_rows = length(valid_ph),
  tuning_parameters = paste(
    "RF =",
    best_weighted_ensemble$rf_weight,
    ", Cubist =",
    best_weighted_ensemble$cubist_weight,
    ", XGBoost =",
    best_weighted_ensemble$xgb_weight
  ),
  actual = valid_ph,
  predicted = weighted_ensemble_predictions,
  training_time_seconds = 0
)
Weighted Ensemble: RF + Cubist + XGBoost Validation Performance
workflow model_family model_name training_rows validation_rows tuning_parameters rmse mae rsq training_time_seconds
Ensemble Model Combination Weighted Ensemble: RF + Cubist + XGBoost 2052 515 RF = 0.25 , Cubist = 0.3 , XGBoost = 0.45 0.094 0.067 0.7026 0

The weighted ensemble also performed very well.

The selected weights were:

  • Random Forest: 0.25
  • Cubist: 0.30
  • XGBoost: 0.45

The validation performance was:

  • RMSE: 0.0940
  • MAE: 0.0670
  • \(R^2\): 0.7026

This is very close to the simple average ensemble of Random Forest, Cubist, and XGBoost. The weighted ensemble had slightly lower RMSE (by 0.001), while the simple average ensemble had slightly better MAE and almost the same \(R^2\).

Note: The ensemble models were tested only as an experiment to see if combining predictions would improve accuracy. Even ensembling results are better then individual XGBoost’s result, for this project I decided do not use them as official final model to make predictions.

ensemble_model_results <- tibble()

ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_result
)
ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = ensemble_rf_cubist_xgb_result
)
all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = ensemble_rf_cubist_xgb_result
)
ensemble_model_results <- add_model_result(
  results_table = ensemble_model_results,
  new_result = weighted_ensemble_result
)

all_model_results <- add_model_result(
  results_table = all_model_results,
  new_result = weighted_ensemble_result
)

#ensemble_model_results
# if (exists("cl")) {
#   stopCluster(cl)
#   registerDoSEQ()
#   message("Parallel backend stopped.")
# }