library(tidyverse)
library(tidymodels)
library(caret)
library(recipes)
library(kableExtra)
library(patchwork)
library(readxl)
library(janitor)
library(glmnet)
library(earth)
library(kernlab)
library(nnet)
library(ranger)
library(xgboost)
library(e1071)
library(naniar)
library(class)
library(rpart)
library(missForest)
library(vip)
library(writexl)
library(doParallel)
set.seed(123)
project_seed = 123
# use_parallel <- TRUE
#
# if (use_parallel) {
# #num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)
# num_cores_to_use <- 2
# cl <- makePSOCKcluster(num_cores_to_use)
# registerDoParallel(cl)
#
# message("Parallel backend started with ", num_cores_to_use, " cores.")
# }
The project is to better understand the manufacturing process and build a model that can predict PH. New regulations require the company to explain what production factors may affect PH and also report a predictive model based on historical production data.
The main goal of this project is to create a reliable PH prediction workflow. I will review the data, check missing values, test different imputation methods, compare multiple models, and then choose the final model based on validation performance.
This project has both technical and business purpose. The technical part explains data preprocessing, imputation, model tuning, and validation metrics. The business part explains the final model and important production factors in more readable way.
The workflow for this project is:
PH will not be used during imputation because PH is the target variable. If PH is used to fill missing feature values, then the model can indirectly learn from the answer it is supposed to predict. This would ppotentially create target leakage.
set.seed(123)
train_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentData%20%282%29.xlsx"
test_url <- "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA624_Project2/DATA/StudentEvaluation.xlsx"
train_file <- tempfile(fileext = ".xlsx")
test_file <- tempfile(fileext = ".xlsx")
download.file(train_url, train_file, mode = "wb")
download.file(test_url, test_file, mode = "wb")
train_data_raw <- read_excel(train_file) |>
clean_names()
test_data_raw <- read_excel(test_file) |>
clean_names()
categorical_vars_raw <- names(train_data_raw)[
sapply(train_data_raw, function(x) is.character(x) || is.factor(x))
]
categorical_vars_raw
[1] "brand_code"
categorical_distribution <- tibble()
for (var in categorical_vars_raw) {
temp_distribution <- train_data_raw |>
count(.data[[var]], name = "count") |>
mutate(
variable = var,
percent = round(count / sum(count) * 100, 2)
)
names(temp_distribution)[1] <- "category"
categorical_distribution <- bind_rows(
categorical_distribution,
temp_distribution
)
}
categorical_distribution |>
arrange(variable, desc(percent)) |>
kable(
caption = "Categorical Variable Distributions",
digits = 2
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| category | count | variable | percent |
|---|---|---|---|
| B | 1239 | brand_code | 48.19 |
| D | 615 | brand_code | 23.92 |
| C | 304 | brand_code | 11.82 |
| A | 293 | brand_code | 11.40 |
| NA | 120 | brand_code | 4.67 |
Only brand_code is a categorical variable in this data.
Brand B is the most common category, making up about 48% of the
rows. Brand D is the second largest group, while Brands A
and C are smaller groups but still have enough observations. Around
4.7% of the brand_code values are missing.
Later I will check if the training and validation split keeps
brand_code's distributions similar.
After reviewing the categorical variable distribution, I will split
train_data_raw data into 80% training and
20% validation sets. The split will be stratified by
PH because PH is the target variable.
After the split, I will also check if brand_code’s
values has same distributions between training and validation.
set.seed(123)
production_data <- train_data_raw |>
mutate(
brand_code = as.factor(brand_code)
)
eval_data <- test_data_raw |>
mutate(
brand_code = as.factor(brand_code)
)
ph_split <- initial_split(
production_data,
prop = 0.80,
strata = ph
)
train_data <- training(ph_split)
valid_data <- testing(ph_split)
train_features_raw <- train_data |>
select(-ph)
valid_features_raw <- valid_data |>
select(-ph)
eval_features_raw <- eval_data |>
select(-any_of("ph"))
train_ph <- train_data$ph
valid_ph <- valid_data$ph
valid_features_raw$brand_code <- factor(
valid_features_raw$brand_code,
levels = levels(train_features_raw$brand_code)
)
eval_features_raw$brand_code <- factor(
eval_features_raw$brand_code,
levels = levels(train_features_raw$brand_code)
)
split_overview <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
rows = c(
nrow(train_features_raw),
nrow(valid_features_raw),
nrow(eval_features_raw)
),
feature_columns = c(
ncol(train_features_raw),
ncol(valid_features_raw),
ncol(eval_features_raw)
),
ph_available = c("Yes", "Yes", "No")
)
split_overview |>
kable(
caption = "Modeling Data Split"
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | rows | feature_columns | ph_available |
|---|---|---|---|
| Training | 2055 | 32 | Yes |
| Validation | 516 | 32 | Yes |
| Final Evaluation | 267 | 32 | No |
brand_balance_check <- bind_rows(
train_features_raw |>
mutate(split = "Training"),
valid_features_raw |>
mutate(split = "Validation")
) |>
count(split, brand_code, name = "count") |>
group_by(split) |>
mutate(
percent = round(count / sum(count) * 100, 2),
count_percent = paste0(count, " (", percent, "%)")
) |>
ungroup()
brand_balance_wide <- brand_balance_check |>
select(split, brand_code, count_percent) |>
mutate(
brand_code = ifelse(
is.na(brand_code),
"Missing",
as.character(brand_code)
),
brand_code = factor(
brand_code,
levels = c("A", "B", "C", "D", "Missing")
)
) |>
pivot_wider(
names_from = brand_code,
values_from = count_percent
)
brand_balance_wide |>
kable(
caption = "Brand Code Distribution After Train/Validation Split",
col.names = c(
"Dataset",
"Brand A",
"Brand B",
"Brand C",
"Brand D",
"Missing Brand Code"
)
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| Dataset | Brand A | Brand B | Brand C | Brand D | Missing Brand Code |
|---|---|---|---|---|---|
| Training | 225 (10.95%) | 986 (47.98%) | 255 (12.41%) | 495 (24.09%) | 94 (4.57%) |
| Validation | 68 (13.18%) | 253 (49.03%) | 49 (9.5%) | 120 (23.26%) | 26 (5.04%) |
After the split brand_code distribution is pretty much
similar between the training and validation sets. Brand B
remains the largest group in both sets, and Brand D is also
close between the two sets. Brand A and Brand C have
some difference, but the split still looks acceptable. Missing
brand_code values are also close, around 4.6% in
training and 5.0% in validation.
Because the PH distribution was stratified during the split
and brand_code distribution also looks reasonable, I will
keep this train/validation split.
Since PH is the target variable, I am going to check if PH values has similar distributions between the training and validation sets.
ph_split_summary <- tibble(
dataset = c("Training", "Validation"),
count = c(length(train_ph), length(valid_ph)),
mean_ph = c(mean(train_ph, na.rm = TRUE), mean(valid_ph, na.rm = TRUE)),
median_ph = c(median(train_ph, na.rm = TRUE), median(valid_ph, na.rm = TRUE)),
sd_ph = c(sd(train_ph, na.rm = TRUE), sd(valid_ph, na.rm = TRUE)),
min_ph = c(min(train_ph, na.rm = TRUE), min(valid_ph, na.rm = TRUE)),
max_ph = c(max(train_ph, na.rm = TRUE), max(valid_ph, na.rm = TRUE))
)
ph_split_summary |>
kable(
caption = "PH Distribution After Train/Validation Split",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | count | mean_ph | median_ph | sd_ph | min_ph | max_ph |
|---|---|---|---|---|---|---|
| Training | 2055 | 8.5455 | 8.54 | 0.1728 | 7.88 | 9.36 |
| Validation | 516 | 8.5462 | 8.54 | 0.1717 | 8.00 | 8.88 |
ph_split_plot_data <- tibble(
ph = c(train_ph, valid_ph),
dataset = c(
rep("Training", length(train_ph)),
rep("Validation", length(valid_ph))
)
)
ggplot(
ph_split_plot_data,
aes(x = ph, fill = dataset)
) +
geom_density(alpha = 0.4) +
labs(
title = "PH Distribution in Training and Validation Sets",
x = "PH",
y = "Density",
fill = "Dataset"
) +
theme_minimal()
The PH distribution looks similar enough between the training and
validation sets. As we see from the above table mean, median, and
standard deviation are almost the same.
The training set has a slightly higher maximum PH value, but the overall
distribution still looks balanced. Because of this, I keep this split
for the rest of the project.
After checking PH balance and brand_code balance, I am
going to review the remaining preprocessing issues. In this section I
will explore missing values, numeric predictor summaries, skewness,
near-zero variance.
This section has only diagnostic purpose. The actual preprocessing steps like dummy variables, transformations, scaling, and feature removal will be handled later inside modeling workflows when needed.
Before checking missing predictor values, I will first check if PH itself has missing values in the training/validation data and final evaluation data.
ph_missing_check <- tibble(
dataset = c("Training/VAlidation Data", "Final Evaluation Data"),
rows = c(nrow(train_data_raw), nrow(test_data_raw)),
ph_column_exists = c(
"ph" %in% names(train_data_raw),
"ph" %in% names(test_data_raw)
),
missing_ph_count = c(
if ("ph" %in% names(train_data_raw)) sum(is.na(train_data_raw$ph)) else NA,
if ("ph" %in% names(test_data_raw)) sum(is.na(test_data_raw$ph)) else NA
),
missing_ph_percent = c(
if ("ph" %in% names(train_data_raw)) round(mean(is.na(train_data_raw$ph)) * 100, 2) else NA,
if ("ph" %in% names(test_data_raw)) round(mean(is.na(test_data_raw$ph)) * 100, 2) else NA
)
)
ph_missing_check |>
kable(
caption = "PH Missing Value Check",
digits = 2
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | rows | ph_column_exists | missing_ph_count | missing_ph_percent |
|---|---|---|---|---|
| Training/VAlidation Data | 2571 | TRUE | 4 | 0.16 |
| Final Evaluation Data | 267 | TRUE | 267 | 100.00 |
The final evaluation data has PH missing for all rows, which is expected because those are the values the model needs to predict.
The training/validation data has 4 rows with missing PH values. Because PH is the target variable, those rows cannot be used for model training or validation. I will remove any rows with missing PH from the training and validation sets before separating features and target values.
train_data <- train_data |>
filter(!is.na(ph))
valid_data <- valid_data |>
filter(!is.na(ph))
train_features_raw <- train_data |>
select(-ph)
valid_features_raw <- valid_data |>
select(-ph)
eval_features_raw <- eval_data |>
select(-any_of("ph"))
train_ph <- train_data$ph
valid_ph <- valid_data$ph
sum(is.na(train_ph))
[1] 0
sum(is.na(valid_ph))
[1] 0
nrow(train_features_raw)
[1] 2052
nrow(valid_features_raw)
[1] 515
length(train_ph)
[1] 2052
length(valid_ph)
[1] 515
missing_summary <- bind_rows(
train_features_raw |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "missing_count"
) |>
mutate(
dataset = "Training",
total_rows = nrow(train_features_raw)
),
valid_features_raw |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "missing_count"
) |>
mutate(
dataset = "Validation",
total_rows = nrow(valid_features_raw)
),
eval_features_raw |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "missing_count"
) |>
mutate(
dataset = "Final Evaluation",
total_rows = nrow(eval_features_raw)
)
) |>
mutate(
missing_percent = round(missing_count / total_rows * 100, 2)
) |>
filter(missing_count > 0) |>
select(
dataset,
variable,
missing_count,
missing_percent
) |>
mutate(
dataset = factor(
dataset,
levels = c("Training", "Validation", "Final Evaluation")
)
) |>
arrange(dataset, desc(missing_count))
missing_summary |>
kable(
caption = "Missing Predictor Values by Dataset",
digits = 2
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | variable | missing_count | missing_percent |
|---|---|---|---|
| Training | mfr | 173 | 8.43 |
| Training | brand_code | 94 | 4.58 |
| Training | filler_speed | 43 | 2.10 |
| Training | psc_co2 | 36 | 1.75 |
| Training | pc_volume | 32 | 1.56 |
| Training | fill_ounces | 30 | 1.46 |
| Training | psc | 27 | 1.32 |
| Training | carb_pressure1 | 25 | 1.22 |
| Training | hyd_pressure4 | 24 | 1.17 |
| Training | carb_pressure | 21 | 1.02 |
| Training | psc_fill | 21 | 1.02 |
| Training | carb_temp | 17 | 0.83 |
| Training | filler_level | 14 | 0.68 |
| Training | fill_pressure | 13 | 0.63 |
| Training | hyd_pressure2 | 11 | 0.54 |
| Training | hyd_pressure3 | 11 | 0.54 |
| Training | temperature | 10 | 0.49 |
| Training | oxygen_filler | 9 | 0.44 |
| Training | pressure_setpoint | 9 | 0.44 |
| Training | carb_volume | 8 | 0.39 |
| Training | hyd_pressure1 | 8 | 0.39 |
| Training | carb_rel | 8 | 0.39 |
| Training | alch_rel | 6 | 0.29 |
| Training | usage_cont | 4 | 0.19 |
| Training | carb_flow | 2 | 0.10 |
| Training | balling_lvl | 1 | 0.05 |
| Validation | mfr | 35 | 6.80 |
| Validation | brand_code | 26 | 5.05 |
| Validation | filler_speed | 11 | 2.14 |
| Validation | carb_temp | 9 | 1.75 |
| Validation | fill_ounces | 8 | 1.55 |
| Validation | pc_volume | 7 | 1.36 |
| Validation | carb_pressure1 | 7 | 1.36 |
| Validation | carb_pressure | 6 | 1.17 |
| Validation | psc | 6 | 1.17 |
| Validation | fill_pressure | 5 | 0.97 |
| Validation | hyd_pressure2 | 4 | 0.78 |
| Validation | hyd_pressure3 | 4 | 0.78 |
| Validation | hyd_pressure4 | 4 | 0.78 |
| Validation | psc_co2 | 3 | 0.58 |
| Validation | hyd_pressure1 | 3 | 0.58 |
| Validation | pressure_setpoint | 3 | 0.58 |
| Validation | carb_volume | 2 | 0.39 |
| Validation | psc_fill | 2 | 0.39 |
| Validation | filler_level | 2 | 0.39 |
| Validation | temperature | 2 | 0.39 |
| Validation | oxygen_filler | 2 | 0.39 |
| Validation | bowl_setpoint | 2 | 0.39 |
| Validation | usage_cont | 1 | 0.19 |
| Validation | alch_rel | 1 | 0.19 |
| Final Evaluation | mfr | 31 | 11.61 |
| Final Evaluation | filler_speed | 10 | 3.75 |
| Final Evaluation | brand_code | 8 | 3.00 |
| Final Evaluation | fill_ounces | 6 | 2.25 |
| Final Evaluation | psc | 5 | 1.87 |
| Final Evaluation | psc_co2 | 5 | 1.87 |
| Final Evaluation | pc_volume | 4 | 1.50 |
| Final Evaluation | carb_pressure1 | 4 | 1.50 |
| Final Evaluation | hyd_pressure4 | 4 | 1.50 |
| Final Evaluation | psc_fill | 3 | 1.12 |
| Final Evaluation | oxygen_filler | 3 | 1.12 |
| Final Evaluation | alch_rel | 3 | 1.12 |
| Final Evaluation | fill_pressure | 2 | 0.75 |
| Final Evaluation | filler_level | 2 | 0.75 |
| Final Evaluation | temperature | 2 | 0.75 |
| Final Evaluation | usage_cont | 2 | 0.75 |
| Final Evaluation | pressure_setpoint | 2 | 0.75 |
| Final Evaluation | carb_rel | 2 | 0.75 |
| Final Evaluation | carb_volume | 1 | 0.37 |
| Final Evaluation | carb_temp | 1 | 0.37 |
| Final Evaluation | hyd_pressure2 | 1 | 0.37 |
| Final Evaluation | hyd_pressure3 | 1 | 0.37 |
| Final Evaluation | density | 1 | 0.37 |
| Final Evaluation | balling | 1 | 0.37 |
| Final Evaluation | pressure_vacuum | 1 | 0.37 |
| Final Evaluation | bowl_setpoint | 1 | 0.37 |
| Final Evaluation | air_pressurer | 1 | 0.37 |
The missing value review shows that missing predictor values are
present in training, validation, and evaluation sets. The variable
mfr has the highest missingness in all three datasets, with
about 8.6% missing in training, 7.0% in validation, and 11.6% in final
evaluation. brand_code also has missing values in all
datasets.
Most other predictors have smaller missing percentages, usually below 2%.
Since evaluation set also has missing predictor values, the imputation process has to be applied to training, validation, and evaluation missing values. PH will not be used during imputation because it is the target variable.
These checks are done before missing values imputation to understand the original data. They are not final transformation decisions. After imputation, I will review the completed feature sets again before modeling.
numeric_vars <- names(train_features_raw)[
sapply(train_features_raw, is.numeric)
]
#numeric_vars
numeric_summary <- train_features_raw |>
select(all_of(numeric_vars)) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "value"
) |>
group_by(variable) |>
summarise(
count = sum(!is.na(value)),
missing_count = sum(is.na(value)),
mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
max = max(value, na.rm = TRUE),
.groups = "drop"
)
numeric_summary |>
kable(
caption = "Numeric Predictor Summary in Training Data",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)
| variable | count | missing_count | mean | median | sd | min | q1 | q3 | max |
|---|---|---|---|---|---|---|---|---|---|
| air_pressurer | 2052 | 0 | 142.835 | 142.600 | 1.211 | 140.800 | 142.200 | 143.000 | 148.200 |
| alch_rel | 2046 | 6 | 6.898 | 6.560 | 0.508 | 5.280 | 6.540 | 7.240 | 8.620 |
| balling | 2052 | 0 | 2.196 | 1.648 | 0.931 | 0.160 | 1.496 | 3.292 | 3.984 |
| balling_lvl | 2051 | 1 | 2.050 | 1.460 | 0.869 | 0.000 | 1.380 | 3.140 | 3.660 |
| bowl_setpoint | 2052 | 0 | 109.197 | 120.000 | 15.268 | 70.000 | 100.000 | 120.000 | 140.000 |
| carb_flow | 2050 | 2 | 2462.747 | 3029.000 | 1075.415 | 26.000 | 1160.000 | 3180.000 | 3834.000 |
| carb_pressure | 2031 | 21 | 68.162 | 68.000 | 3.511 | 57.000 | 65.600 | 70.500 | 79.400 |
| carb_pressure1 | 2027 | 25 | 122.537 | 123.200 | 4.779 | 105.600 | 118.800 | 125.400 | 140.200 |
| carb_rel | 2044 | 8 | 5.436 | 5.400 | 0.129 | 4.960 | 5.340 | 5.540 | 6.060 |
| carb_temp | 2035 | 17 | 141.026 | 140.800 | 4.034 | 128.600 | 138.400 | 143.600 | 154.000 |
| carb_volume | 2044 | 8 | 5.370 | 5.347 | 0.107 | 5.040 | 5.293 | 5.460 | 5.700 |
| density | 2052 | 0 | 1.172 | 0.980 | 0.377 | 0.240 | 0.900 | 1.620 | 1.920 |
| fill_ounces | 2022 | 30 | 23.976 | 23.973 | 0.088 | 23.633 | 23.920 | 24.033 | 24.320 |
| fill_pressure | 2039 | 13 | 47.923 | 46.400 | 3.179 | 36.000 | 46.000 | 50.000 | 60.400 |
| filler_level | 2038 | 14 | 109.145 | 118.200 | 15.609 | 55.800 | 98.450 | 120.000 | 153.200 |
| filler_speed | 2009 | 43 | 3677.520 | 3982.000 | 785.011 | 998.000 | 3888.000 | 3998.000 | 4030.000 |
| hyd_pressure1 | 2044 | 8 | 12.475 | 11.600 | 12.366 | -0.800 | 0.000 | 20.250 | 58.000 |
| hyd_pressure2 | 2041 | 11 | 20.934 | 28.600 | 16.391 | 0.000 | 0.000 | 34.600 | 59.400 |
| hyd_pressure3 | 2041 | 11 | 20.496 | 27.800 | 15.971 | -1.200 | 0.000 | 33.400 | 50.000 |
| hyd_pressure4 | 2028 | 24 | 96.417 | 96.000 | 13.255 | 62.000 | 86.000 | 102.000 | 142.000 |
| mfr | 1879 | 173 | 703.896 | 724.000 | 74.710 | 31.400 | 706.700 | 731.000 | 868.600 |
| mnf_flow | 2052 | 0 | 25.160 | 76.200 | 119.586 | -100.200 | -100.000 | 140.800 | 229.400 |
| oxygen_filler | 2043 | 9 | 0.047 | 0.033 | 0.045 | 0.002 | 0.022 | 0.060 | 0.400 |
| pc_volume | 2020 | 32 | 0.277 | 0.272 | 0.061 | 0.079 | 0.239 | 0.313 | 0.471 |
| pressure_setpoint | 2043 | 9 | 47.618 | 46.000 | 2.030 | 44.000 | 46.000 | 50.000 | 52.000 |
| pressure_vacuum | 2052 | 0 | -5.217 | -5.400 | 0.568 | -6.600 | -5.600 | -5.000 | -3.600 |
| psc | 2025 | 27 | 0.085 | 0.076 | 0.050 | 0.002 | 0.048 | 0.112 | 0.268 |
| psc_co2 | 2016 | 36 | 0.056 | 0.040 | 0.043 | 0.000 | 0.020 | 0.080 | 0.240 |
| psc_fill | 2031 | 21 | 0.194 | 0.180 | 0.116 | 0.000 | 0.100 | 0.260 | 0.620 |
| temperature | 2042 | 10 | 65.976 | 65.600 | 1.405 | 63.600 | 65.200 | 66.400 | 76.200 |
| usage_cont | 2048 | 4 | 21.028 | 21.920 | 2.969 | 12.080 | 18.440 | 23.740 | 25.900 |
As we see from this table, numeric predictors are measured on different scales. Because of this, models like linear, SVM, neural networks, and regularized regression will need centering and scaling later.
Some variables also have zero or negative values, such as
mnf_flow, pressure_vacuum, and some hydraulic
pressure variables. Because of this, Box-Cox transformation is not safe
for all predictors. Yeo-Johnson transformation is a better option
because it can handle zero and negative values.
Some variables are stored as numeric, but they may actually represent
process settings or categories. To check this, I will use the counts of
unique values each numeric predictor has in the training data.
If a numeric variable has only a small number of unique values, I may
apply different methods during preprocessing and modeling.
numeric_unique_summary <- train_features_raw |>
select(all_of(numeric_vars)) |>
summarise(
across(
everything(),
~ n_distinct(., na.rm = TRUE)
)
) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "unique_values"
) |>
left_join(
numeric_summary |>
select(variable, count, missing_count, min, max),
by = "variable"
) |>
mutate(
possible_discrete_or_setting = ifelse(
unique_values <= 10,
"Yes",
"No"
)
) |>
arrange(unique_values)
numeric_discrete_review <- numeric_unique_summary |>
filter(unique_values <= 50)
numeric_discrete_review |>
kable(
caption = "Numeric Predictors With Limited Unique Values",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| variable | unique_values | count | missing_count | min | max | possible_discrete_or_setting |
|---|---|---|---|---|---|---|
| pressure_setpoint | 7 | 2043 | 9 | 44.00 | 52.00 | Yes |
| bowl_setpoint | 10 | 2052 | 0 | 70.00 | 140.00 | Yes |
| pressure_vacuum | 16 | 2052 | 0 | -6.60 | -3.60 | No |
| psc_co2 | 23 | 2016 | 36 | 0.00 | 0.24 | No |
| air_pressurer | 32 | 2052 | 0 | 140.80 | 148.20 | No |
| hyd_pressure4 | 39 | 2028 | 24 | 62.00 | 142.00 | No |
| carb_rel | 40 | 2044 | 8 | 4.96 | 6.06 | No |
| alch_rel | 49 | 2046 | 6 | 5.28 | 8.62 | No |
Based on the unique value check, pressure_setpoint and
bowl_setpoint look like process setting variables. However,
I preffer do not treat these variables exactly like fully continuous
numeric measurements. Later, when Yeo-Johnson transformations are used
inside modeling recipes, I might exclude pressure_setpoint
and bowl_setpoint numeric variables from that
transformation.
exclude_from_yj_vars <- c(
"pressure_setpoint",
"bowl_setpoint"
)
# recipe(ph ~ ., data = train_data_model) |>
# step_YeoJohnson(
# all_numeric_predictors(),
# -all_of(setting_like_numeric_vars)
# ) |>
# step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
# step_zv(all_predictors()) |>
# step_nzv(all_predictors()) |>
# step_normalize(all_numeric_predictors())
Since some values are missing, this skewness check is based on available raw training values. Imputation may slightly change the distribution, so the final transformation choices will be confirmed after imputed feature sets are created.
Skewness analysis is useful here because some models work better when numeric predictors are more normally distributed. This is especially important for models like linear regression, elastic net, SVM, and neural networks.
However, I do not want to blindly transform every skewed variable. Some numeric variables are process settings or rounded measurements as we saw earlier, so skewness may not mean the same thing for them. Because of this, transformation decisions will be handled inside the modelling section later.
numeric_skewness <- train_features_raw |>
select(all_of(numeric_vars)) |>
summarise(
across(
everything(),
~ skewness(., na.rm = TRUE)
)
) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "skewness"
) |>
mutate(
abs_skewness = abs(skewness),
skewness_level = case_when(
abs_skewness >= 2 ~ "High skewness",
abs_skewness >= 1 ~ "Moderate skewness",
TRUE ~ "Low skewness"
),
# setting_like = ifelse(
# variable %in% exclude_from_yj_vars,
# "Yes",
# "No"
#)
) |>
arrange(desc(abs_skewness))
skewed_numeric_vars <- numeric_skewness |>
filter(abs_skewness >= 1)
skewed_numeric_vars |>
kable(
caption = "Numeric Predictors With Moderate or High Skewness",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| variable | skewness | abs_skewness | skewness_level |
|---|---|---|---|
| mfr | -5.096 | 5.096 | High skewness |
| filler_speed | -2.790 | 2.790 | High skewness |
| temperature | 2.483 | 2.483 | High skewness |
| oxygen_filler | 2.366 | 2.366 | High skewness |
| air_pressurer | 2.259 | 2.259 | High skewness |
| psc_co2 | 1.755 | 1.755 | Moderate skewness |
ggplot(
numeric_skewness,
aes(
x = reorder(variable, abs_skewness),
y = abs_skewness,
fill = skewness_level
)
) +
geom_col() +
coord_flip() +
labs(
title = "Absolute Skewness of Numeric Predictors",
subtitle = "Variables are colored by skewness level",
x = "Variable",
y = "Absolute Skewness",
fill = "Skewness Level"
) +
theme_minimal()
skewed_vars_to_plot <- skewed_numeric_vars |>
pull(variable)
skewed_distribution_data <- train_features_raw |>
select(all_of(skewed_vars_to_plot)) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "value"
)
ggplot(
skewed_distribution_data,
aes(x = value)
) +
geom_histogram(
bins = 30,
color = "white"
) +
facet_wrap(
~ variable,
scales = "free",
ncol = 3
) +
labs(
title = "Distributions of Skewed Numeric Predictors",
subtitle = "Only variables with absolute skewness greater than or equal to 1 are shown",
#x = "Value",
y = "Count"
) +
theme_minimal()
The skewness table and plot shows that only a small group of numeric
predictors have moderate or high skewness. The variables with absolute
skewness greater than or equal to 1 are mfr,
filler_speed, oxygen_filler,
temperature, air_pressurer, and
psc_co2.
At first, these variables may look like candidates for transformation. However, the histograms show that not all skewed variables behave the same way. Some variables look like regular continuous variables with long tails, while others look more like grouped or rounded process measurements.
oxygen_filler and temperature look more
like continuous variables with right skewness. These variables are
better candidates for Yeo-Johnson transformation in models that are
sensitive to skewed predictors.
mfr and filler_speed are left-skewed.
Most observations are concentrated around normal operating values, but
there are some much lower values. After outliers analysis section,
transformation may still be useful for some model types.
Even psc_co2 have skewness, its plot show grouped
values and repeated bars respectfully. This suggests
psc_co2 may be categorical or rounded measurements. Because
of this, I preffer not to apply Yeo-Johnson transformation to
psc_co2.
Based on this review, I will keep all skewed variables in the
dataset, but I separate the transformation decision by variable type. In
the modeling workflow, Yeo-Johnson transformation will be applied before
brand_codedummy encoding, and variables that look like
process settings or grouped measurements will be excluded from
Yeo-Johnson transformation.
exclude_from_yj_vars <- c(
"pressure_setpoint",
"bowl_setpoint",
"air_pressurer",
"psc_co2"
)
Next I will check for near-zero variance predictors.
This matters because predictors with very little variation usually do not have much useful information for the models.
nzv_raw_check <- nearZeroVar(
train_features_raw,
saveMetrics = FALSE)
nzv_raw_check
integer(0)
No raw predictors were flagged as near-zero variance. This means I do not remove any predictors at this stage based on near-zero variance.
Before imputation, I would check for unusual numeric values. If a value is clearly impossible or likely caused by a recording issue, it should be reviewed before missing values are filled.
I do not automatically remove outliers only because they are extreme. In production data, unusual values may represent real process conditions. My goal here is to identify suspicious values and decide carefully before changing the data.
numeric_vars_raw <- train_features_raw |>
select(where(is.numeric)) |>
names()
outlier_summary <- map_dfr(
numeric_vars_raw,
function(var) {
values <- train_features_raw[[var]]
q1 <- quantile(values, 0.25, na.rm = TRUE)
q3 <- quantile(values, 0.75, na.rm = TRUE)
iqr_value <- q3 - q1
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value
tibble(
variable = var,
count = sum(!is.na(values)),
missing_count = sum(is.na(values)),
min_value = min(values, na.rm = TRUE),
q1 = q1,
median = median(values, na.rm = TRUE),
q3 = q3,
max_value = max(values, na.rm = TRUE),
lower_bound = lower_bound,
upper_bound = upper_bound,
outlier_count = sum(values < lower_bound | values > upper_bound, na.rm = TRUE),
outlier_percent = round(outlier_count / count * 100, 2)
)
}
) |>
arrange(desc(outlier_percent))
outlier_summary |>
kable(
caption = "Outlier Review Using IQR Rule",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)|>
scroll_box(
width = "100%",
height = "500px"
)
| variable | count | missing_count | min_value | q1 | median | q3 | max_value | lower_bound | upper_bound | outlier_count | outlier_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| filler_speed | 2009 | 43 | 998.000 | 3888.000 | 3982.000 | 3998.000 | 4030.000 | 3723.000 | 4163.000 | 356 | 17.72 |
| mfr | 1879 | 173 | 31.400 | 706.700 | 724.000 | 731.000 | 868.600 | 670.250 | 767.450 | 190 | 10.11 |
| air_pressurer | 2052 | 0 | 140.800 | 142.200 | 142.600 | 143.000 | 148.200 | 141.000 | 144.200 | 177 | 8.63 |
| oxygen_filler | 2043 | 9 | 0.002 | 0.022 | 0.033 | 0.060 | 0.400 | -0.035 | 0.117 | 147 | 7.20 |
| temperature | 2042 | 10 | 63.600 | 65.200 | 65.600 | 66.400 | 76.200 | 63.400 | 68.200 | 103 | 5.04 |
| pressure_vacuum | 2052 | 0 | -6.600 | -5.600 | -5.400 | -5.000 | -3.600 | -6.500 | -4.100 | 95 | 4.63 |
| hyd_pressure4 | 2028 | 24 | 62.000 | 86.000 | 96.000 | 102.000 | 142.000 | 62.000 | 126.000 | 70 | 3.45 |
| pc_volume | 2020 | 32 | 0.079 | 0.239 | 0.272 | 0.313 | 0.471 | 0.128 | 0.424 | 64 | 3.17 |
| fill_pressure | 2039 | 13 | 36.000 | 46.000 | 46.400 | 50.000 | 60.400 | 40.000 | 56.000 | 64 | 3.14 |
| psc_co2 | 2016 | 36 | 0.000 | 0.020 | 0.040 | 0.080 | 0.240 | -0.070 | 0.170 | 63 | 3.12 |
| psc_fill | 2031 | 21 | 0.000 | 0.100 | 0.180 | 0.260 | 0.620 | -0.140 | 0.500 | 52 | 2.56 |
| psc | 2025 | 27 | 0.002 | 0.048 | 0.076 | 0.112 | 0.268 | -0.048 | 0.208 | 48 | 2.37 |
| fill_ounces | 2022 | 30 | 23.633 | 23.920 | 23.973 | 24.033 | 24.320 | 23.750 | 24.203 | 38 | 1.88 |
| carb_temp | 2035 | 17 | 128.600 | 138.400 | 140.800 | 143.600 | 154.000 | 130.600 | 151.400 | 35 | 1.72 |
| carb_pressure1 | 2027 | 25 | 105.600 | 118.800 | 123.200 | 125.400 | 140.200 | 108.900 | 135.300 | 17 | 0.84 |
| carb_pressure | 2031 | 21 | 57.000 | 65.600 | 68.000 | 70.500 | 79.400 | 58.250 | 77.850 | 14 | 0.69 |
| filler_level | 2038 | 14 | 55.800 | 98.450 | 118.200 | 120.000 | 153.200 | 66.125 | 152.325 | 6 | 0.29 |
| alch_rel | 2046 | 6 | 5.280 | 6.540 | 6.560 | 7.240 | 8.620 | 5.490 | 8.290 | 6 | 0.29 |
| carb_rel | 2044 | 8 | 4.960 | 5.340 | 5.400 | 5.540 | 6.060 | 5.040 | 5.840 | 6 | 0.29 |
| hyd_pressure1 | 2044 | 8 | -0.800 | 0.000 | 11.600 | 20.250 | 58.000 | -30.375 | 50.625 | 4 | 0.20 |
| carb_volume | 2044 | 8 | 5.040 | 5.293 | 5.347 | 5.460 | 5.700 | 5.043 | 5.710 | 2 | 0.10 |
| mnf_flow | 2052 | 0 | -100.200 | -100.000 | 76.200 | 140.800 | 229.400 | -461.200 | 502.000 | 0 | 0.00 |
| hyd_pressure2 | 2041 | 11 | 0.000 | 0.000 | 28.600 | 34.600 | 59.400 | -51.900 | 86.500 | 0 | 0.00 |
| hyd_pressure3 | 2041 | 11 | -1.200 | 0.000 | 27.800 | 33.400 | 50.000 | -50.100 | 83.500 | 0 | 0.00 |
| usage_cont | 2048 | 4 | 12.080 | 18.440 | 21.920 | 23.740 | 25.900 | 10.490 | 31.690 | 0 | 0.00 |
| carb_flow | 2050 | 2 | 26.000 | 1160.000 | 3029.000 | 3180.000 | 3834.000 | -1870.000 | 6210.000 | 0 | 0.00 |
| density | 2052 | 0 | 0.240 | 0.900 | 0.980 | 1.620 | 1.920 | -0.180 | 2.700 | 0 | 0.00 |
| balling | 2052 | 0 | 0.160 | 1.496 | 1.648 | 3.292 | 3.984 | -1.198 | 5.986 | 0 | 0.00 |
| bowl_setpoint | 2052 | 0 | 70.000 | 100.000 | 120.000 | 120.000 | 140.000 | 70.000 | 150.000 | 0 | 0.00 |
| pressure_setpoint | 2043 | 9 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | 40.000 | 56.000 | 0 | 0.00 |
| balling_lvl | 2051 | 1 | 0.000 | 1.380 | 1.460 | 3.140 | 3.660 | -1.260 | 5.780 | 0 | 0.00 |
negative_value_check <- train_features_raw |>
select(where(is.numeric)) |>
summarise(
across(
everything(),
~ sum(. < 0, na.rm = TRUE)
)
) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "negative_count"
) |>
filter(negative_count > 0) |>
mutate(
total_non_missing = map_int(
variable,
~ sum(!is.na(train_features_raw[[.x]]))
),
negative_percent = round(negative_count / total_non_missing * 100, 2),
min_value = map_dbl(
variable,
~ min(train_features_raw[[.x]], na.rm = TRUE)
),
max_value = map_dbl(
variable,
~ max(train_features_raw[[.x]], na.rm = TRUE)
)
) |>
arrange(desc(negative_count))
negative_value_check |>
kable(
caption = "Numeric Predictors With Negative Values",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| variable | negative_count | total_non_missing | negative_percent | min_value | max_value |
|---|---|---|---|---|---|
| pressure_vacuum | 2052 | 2052 | 100.00 | -6.6 | -3.6 |
| mnf_flow | 941 | 2052 | 45.86 | -100.2 | 229.4 |
| hyd_pressure3 | 60 | 2041 | 2.94 | -1.2 | 50.0 |
| hyd_pressure1 | 33 | 2044 | 1.61 | -0.8 | 58.0 |
mnfflow ValuesThe negative-value summary shows that almost 46% of
mnf_flow has many negative values . Since almost half of
the observed mnf_flow values are negative, I want to review
this variable more carefully before deciding whether these values should
be treated as errors.
Negative mnf_flow values look suspicious because many of
them appear to be negative. However, before deciding on what to do with
negative values, I would like to check whether these values are
connected to PH and other variables.
First of all, I will check the what actually these negative values look like
mnf_flow_negative_values <- train_features_raw |>
filter(
!is.na(mnf_flow),
mnf_flow < 0
) |>
count(
mnf_flow,
sort = TRUE
)
mnf_flow_negative_values |>
kable(
caption = "Negative mnfflow Values in Training Data",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mnf_flow | n |
|---|---|
| -100.0 | 487 |
| -100.2 | 454 |
The table shows that negative mnf_flow values are almost
entirely concentrated at -100 and -100.2. This
suggests that these values may represent an error readings or actual
production settings.
Below I will compare PH values for rows with negative and
non-negative mnf_flow.
mnf_flow_ph_check <- train_features_raw |>
mutate(
mnf_flow_status = case_when(
is.na(mnf_flow) ~ "Missing",
mnf_flow < 0 ~ "Negative",
mnf_flow >= 0 ~ "Non-negative"
)
) |>
bind_cols(
tibble(ph = train_ph)
) |>
filter(
!is.na(ph)
) |>
group_by(
brand_code,
mnf_flow_status
) |>
summarise(
n = n(),
mean_ph = mean(ph, na.rm = TRUE),
median_ph = median(ph, na.rm = TRUE),
sd_ph = sd(ph, na.rm = TRUE),
min_ph = min(ph, na.rm = TRUE),
max_ph = max(ph, na.rm = TRUE),
.groups = "drop"
) |>
arrange(
brand_code,
mnf_flow_status
)
mnf_flow_ph_check |>
kable(
caption = "PH Comparison for Negative vs Non-Negative mnfflow",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| brand_code | mnf_flow_status | n | mean_ph | median_ph | sd_ph | min_ph | max_ph |
|---|---|---|---|---|---|---|---|
| A | Negative | 85 | 8.5779 | 8.60 | 0.1388 | 8.20 | 8.86 |
| A | Non-negative | 140 | 8.4381 | 8.46 | 0.1519 | 8.06 | 8.68 |
| B | Negative | 476 | 8.6700 | 8.70 | 0.1385 | 8.18 | 8.94 |
| B | Non-negative | 507 | 8.4716 | 8.48 | 0.1329 | 8.00 | 8.86 |
| C | Negative | 118 | 8.4675 | 8.50 | 0.1995 | 8.08 | 9.36 |
| C | Non-negative | 137 | 8.3737 | 8.38 | 0.1495 | 7.88 | 8.80 |
| D | Negative | 222 | 8.6627 | 8.68 | 0.1316 | 8.32 | 8.92 |
| D | Non-negative | 273 | 8.5527 | 8.56 | 0.1226 | 8.16 | 8.82 |
| NA | Negative | 40 | 8.6330 | 8.61 | 0.1261 | 8.28 | 8.88 |
| NA | Non-negative | 54 | 8.3930 | 8.45 | 0.1289 | 8.12 | 8.60 |
As we see from the table, negative mnf_flow rows have
higher mean PH in every Brand Code group. This strongly suggests that
the negative values are carrying useful information about the PH
outcome.
I also compare a few process variables to see whether negative
mnf_flow rows look different in the production process.
mnf_flow_process_check <- train_features_raw |>
mutate(
mnf_flow_negative = case_when(
is.na(mnf_flow) ~ "Missing",
mnf_flow < 0 ~ "Negative",
mnf_flow >= 0 ~ "Non-negative"
)
) |>
group_by(
brand_code,
mnf_flow_negative
) |>
summarise(
n = n(),
mean_temperature = mean(temperature, na.rm = TRUE),
mean_density = mean(density, na.rm = TRUE),
mean_carb_rel = mean(carb_rel, na.rm = TRUE),
mean_oxygen_filler = mean(oxygen_filler, na.rm = TRUE),
mean_filler_speed = mean(filler_speed, na.rm = TRUE),
.groups = "drop"
) |>
arrange(
brand_code,
mnf_flow_negative
)
mnf_flow_process_check |>
kable(
caption = "Process Variable Comparison for Negative vs Non-Negative mnfflow",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| brand_code | mnf_flow_negative | n | mean_temperature | mean_density | mean_carb_rel | mean_oxygen_filler | mean_filler_speed |
|---|---|---|---|---|---|---|---|
| A | Negative | 85 | 66.1153 | 1.5522 | 5.5026 | 0.0701 | 3381.349 |
| A | Non-negative | 140 | 66.1696 | 1.5759 | 5.5247 | 0.0265 | 3684.567 |
| B | Negative | 476 | 65.8474 | 0.9103 | 5.3818 | 0.0713 | 3735.662 |
| B | Non-negative | 507 | 65.9199 | 0.9084 | 5.3424 | 0.0255 | 3705.375 |
| C | Negative | 118 | 66.9879 | 0.8961 | 5.3483 | 0.0848 | 3637.983 |
| C | Non-negative | 137 | 66.4482 | 0.9330 | 5.3543 | 0.0229 | 3723.293 |
| D | Negative | 222 | 65.5674 | 1.6842 | 5.6032 | 0.0709 | 3622.532 |
| D | Non-negative | 273 | 65.3735 | 1.6787 | 5.6078 | 0.0232 | 3706.127 |
| NA | Negative | 40 | 67.3500 | 0.9605 | 5.3349 | 0.0792 | 3610.350 |
| NA | Non-negative | 54 | 67.2491 | 0.9919 | 5.3506 | 0.0269 | 3449.077 |
The process-variable comparison also shows that negative
mnf_flow rows are different from non-negative rows.
The clearest example is oxygen_filler. For Brand B:
mnf_flow have mean of
oxygen_filler of 0.0713, compared with only
0.0255 for non-negative mnf_flow rows.The same pattern appears for other brands too. For Brand C:
oxygen_filler is 0.0848 for
negative mnf_flow rows and 0.0229 for non-negative
rows.This makes me more careful about treating negative
mnf_flow as an error. If these values were only random bad
measurements, I would not expect them to clearly separate PH and
oxygen_filler.
Based on this review, I will not convert negative
mnf_flow values as errors in the main preprocessing
workflow at this point. Even though -100 and
-100.2 look like suspicious, they appear to contain useful
information. For example, Brand B has mean PH of 8.6700 when
mnf_flow is negative, compared with 8.4716 when
mnf_flow is non-negative. I will decide how to treat them
after I impute missing values.
I think it is more logical to also anlyse outliers by
brand_code group. Since different brands may have different
production settings, one global outlier cutoff can be misleading.
For example, a value may look unusual when all brands are combined together, but it may be normal for one specific brand. Because of this, I do not want to decide about outliers only from the full dataset. I want to see if values are unusual inside their own brand group.
This brand-level outlier check gives a better view of the production process. If a value is only extreme globally but not extreme within its own brand, I would probably keep it. But if a value is extreme even inside its own brand group and also does not make sense from production logic, then it may need more review before imputation.
At this stage, I use this as a diagnostic check. I will not remove or change outliers automatically, because some unusual values may still represent real production conditions.
brand_outlier_summary <- train_features_raw |>
filter(!is.na(brand_code)) |>
select(brand_code, all_of(numeric_vars_raw)) |>
pivot_longer(
cols = all_of(numeric_vars_raw),
names_to = "variable",
values_to = "value"
) |>
group_by(brand_code, variable) |>
summarise(
count = sum(!is.na(value)),
missing_count = sum(is.na(value)),
q1 = quantile(value, 0.25, na.rm = TRUE),
median = median(value, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
min_value = min(value, na.rm = TRUE),
max_value = max(value, na.rm = TRUE),
lower_bound = q1 - 1.5 * iqr_value,
upper_bound = q3 + 1.5 * iqr_value,
outlier_count = sum(
value < lower_bound | value > upper_bound,
na.rm = TRUE
),
outlier_percent = round(outlier_count / count * 100, 2),
.groups = "drop"
) |>
arrange(desc(outlier_percent))
brand_outlier_summary |>
filter(outlier_count > 0) |>
kable(
caption = "Outlier Review Within Each Brand Code Group",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)|>
scroll_box(
width = "100%",
height = "500px"
)
| brand_code | variable | count | missing_count | q1 | median | q3 | iqr_value | min_value | max_value | lower_bound | upper_bound | outlier_count | outlier_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D | pressure_setpoint | 493 | 2 | 46.000 | 46.000 | 46.000 | 0.000 | 44.000 | 52.000 | 46.000 | 46.000 | 159 | 32.25 |
| D | bowl_setpoint | 495 | 0 | 110.000 | 120.000 | 120.000 | 10.000 | 70.000 | 140.000 | 95.000 | 135.000 | 110 | 22.22 |
| C | filler_speed | 249 | 6 | 3892.000 | 3980.000 | 3996.000 | 104.000 | 1004.000 | 4030.000 | 3736.000 | 4152.000 | 48 | 19.28 |
| D | filler_speed | 485 | 10 | 3892.000 | 3982.000 | 3996.000 | 104.000 | 998.000 | 4026.000 | 3736.000 | 4152.000 | 91 | 18.76 |
| A | filler_speed | 217 | 8 | 3806.000 | 3984.000 | 3998.000 | 192.000 | 1004.000 | 4024.000 | 3518.000 | 4286.000 | 34 | 15.67 |
| B | filler_speed | 966 | 17 | 3890.000 | 3982.000 | 3998.000 | 108.000 | 1002.000 | 4020.000 | 3728.000 | 4160.000 | 147 | 15.22 |
| C | mfr | 233 | 22 | 709.800 | 721.600 | 729.800 | 20.000 | 76.800 | 835.400 | 679.800 | 759.800 | 31 | 13.30 |
| C | filler_level | 254 | 1 | 108.250 | 118.100 | 120.000 | 11.750 | 69.400 | 150.800 | 90.625 | 137.625 | 32 | 12.60 |
| C | bowl_setpoint | 255 | 0 | 110.000 | 120.000 | 120.000 | 10.000 | 70.000 | 130.000 | 95.000 | 135.000 | 32 | 12.55 |
| D | hyd_pressure4 | 494 | 1 | 78.000 | 80.000 | 84.000 | 6.000 | 62.000 | 140.000 | 69.000 | 93.000 | 61 | 12.35 |
| B | air_pressurer | 983 | 0 | 142.200 | 142.600 | 143.000 | 0.800 | 140.800 | 148.200 | 141.000 | 144.200 | 120 | 12.21 |
| A | mfr | 194 | 31 | 709.200 | 723.600 | 731.000 | 21.800 | 31.400 | 868.600 | 676.500 | 763.700 | 23 | 11.86 |
| D | mfr | 453 | 42 | 707.400 | 724.800 | 732.200 | 24.800 | 112.600 | 792.400 | 670.200 | 769.400 | 51 | 11.26 |
| C | air_pressurer | 255 | 0 | 142.200 | 142.400 | 142.800 | 0.600 | 141.200 | 147.200 | 141.300 | 143.700 | 28 | 10.98 |
| D | density | 495 | 0 | 1.660 | 1.720 | 1.780 | 0.120 | 0.720 | 1.920 | 1.480 | 1.960 | 49 | 9.90 |
| C | density | 255 | 0 | 0.860 | 0.940 | 1.020 | 0.160 | 0.240 | 1.760 | 0.620 | 1.260 | 25 | 9.80 |
| D | fill_pressure | 491 | 4 | 45.800 | 46.200 | 48.000 | 2.200 | 37.800 | 60.000 | 42.500 | 51.300 | 48 | 9.78 |
| C | carb_rel | 252 | 3 | 5.320 | 5.360 | 5.380 | 0.060 | 5.020 | 5.580 | 5.230 | 5.470 | 24 | 9.52 |
| A | oxygen_filler | 223 | 2 | 0.022 | 0.032 | 0.048 | 0.026 | 0.002 | 0.400 | -0.017 | 0.088 | 21 | 9.42 |
| B | hyd_pressure4 | 975 | 8 | 96.000 | 98.000 | 102.000 | 6.000 | 78.000 | 138.000 | 87.000 | 111.000 | 90 | 9.23 |
| D | oxygen_filler | 494 | 1 | 0.019 | 0.033 | 0.052 | 0.033 | 0.002 | 0.318 | -0.030 | 0.101 | 45 | 9.11 |
| A | density | 225 | 0 | 1.540 | 1.600 | 1.660 | 0.120 | 0.840 | 1.760 | 1.360 | 1.840 | 20 | 8.89 |
| A | air_pressurer | 225 | 0 | 142.400 | 142.600 | 143.000 | 0.600 | 141.000 | 147.000 | 141.500 | 143.900 | 18 | 8.00 |
| C | psc_co2 | 250 | 5 | 0.040 | 0.040 | 0.080 | 0.040 | 0.000 | 0.240 | -0.020 | 0.140 | 20 | 8.00 |
| B | mfr | 914 | 69 | 706.200 | 725.000 | 730.800 | 24.600 | 95.400 | 840.000 | 669.300 | 767.700 | 70 | 7.66 |
| A | alch_rel | 224 | 1 | 7.100 | 7.120 | 7.160 | 0.060 | 6.520 | 7.760 | 7.010 | 7.250 | 16 | 7.14 |
| B | oxygen_filler | 981 | 2 | 0.022 | 0.034 | 0.064 | 0.042 | 0.002 | 0.400 | -0.041 | 0.127 | 60 | 6.12 |
| B | density | 983 | 0 | 0.880 | 0.920 | 0.960 | 0.080 | 0.340 | 1.840 | 0.760 | 1.080 | 58 | 5.90 |
| D | pc_volume | 489 | 6 | 0.221 | 0.255 | 0.292 | 0.071 | 0.087 | 0.462 | 0.115 | 0.398 | 27 | 5.52 |
| B | temperature | 980 | 3 | 65.200 | 65.600 | 66.200 | 1.000 | 64.000 | 75.800 | 63.700 | 67.700 | 51 | 5.20 |
| C | oxygen_filler | 252 | 3 | 0.022 | 0.034 | 0.072 | 0.050 | 0.003 | 0.238 | -0.054 | 0.148 | 13 | 5.16 |
| D | pressure_vacuum | 495 | 0 | -5.600 | -5.400 | -5.000 | 0.600 | -6.600 | -3.600 | -6.500 | -4.100 | 25 | 5.05 |
| B | psc_co2 | 963 | 20 | 0.020 | 0.040 | 0.070 | 0.050 | 0.000 | 0.240 | -0.055 | 0.145 | 48 | 4.98 |
| A | temperature | 223 | 2 | 65.200 | 65.800 | 66.600 | 1.400 | 64.200 | 76.200 | 63.100 | 68.700 | 11 | 4.93 |
| D | carb_rel | 495 | 0 | 5.580 | 5.600 | 5.640 | 0.060 | 4.960 | 5.780 | 5.490 | 5.730 | 24 | 4.85 |
| A | pressure_vacuum | 225 | 0 | -5.600 | -5.200 | -5.000 | 0.600 | -6.200 | -3.800 | -6.500 | -4.100 | 10 | 4.44 |
| D | air_pressurer | 495 | 0 | 142.200 | 142.600 | 142.800 | 0.600 | 141.200 | 147.400 | 141.300 | 143.700 | 22 | 4.44 |
| C | fill_ounces | 251 | 4 | 23.940 | 23.980 | 24.040 | 0.100 | 23.693 | 24.313 | 23.790 | 24.190 | 11 | 4.38 |
| D | temperature | 493 | 2 | 64.600 | 65.200 | 66.000 | 1.400 | 63.600 | 73.600 | 62.500 | 68.100 | 20 | 4.06 |
| A | balling_lvl | 225 | 0 | 3.040 | 3.100 | 3.200 | 0.160 | 1.440 | 3.400 | 2.800 | 3.440 | 9 | 4.00 |
| A | fill_pressure | 225 | 0 | 46.000 | 47.400 | 50.200 | 4.200 | 36.200 | 60.400 | 39.700 | 56.500 | 9 | 4.00 |
| C | pc_volume | 251 | 4 | 0.258 | 0.290 | 0.322 | 0.064 | 0.103 | 0.463 | 0.163 | 0.417 | 10 | 3.98 |
| C | fill_pressure | 252 | 3 | 46.000 | 48.000 | 50.200 | 4.200 | 39.200 | 58.800 | 39.700 | 56.500 | 10 | 3.97 |
| A | psc | 220 | 5 | 0.046 | 0.072 | 0.102 | 0.056 | 0.006 | 0.260 | -0.038 | 0.186 | 8 | 3.64 |
| D | psc_co2 | 489 | 6 | 0.020 | 0.040 | 0.060 | 0.040 | 0.000 | 0.240 | -0.040 | 0.120 | 17 | 3.48 |
| C | temperature | 253 | 2 | 65.800 | 66.600 | 67.200 | 1.400 | 64.000 | 76.000 | 63.700 | 69.300 | 8 | 3.16 |
| B | balling_lvl | 983 | 0 | 1.360 | 1.400 | 1.440 | 0.080 | 0.000 | 3.440 | 1.240 | 1.560 | 31 | 3.15 |
| A | carb_rel | 224 | 1 | 5.460 | 5.520 | 5.560 | 0.100 | 5.120 | 6.060 | 5.310 | 5.710 | 7 | 3.12 |
| B | balling | 983 | 0 | 1.398 | 1.498 | 1.596 | 0.198 | 0.346 | 3.864 | 1.101 | 1.893 | 30 | 3.05 |
| D | carb_temp | 492 | 3 | 138.800 | 140.700 | 143.250 | 4.450 | 129.200 | 153.800 | 132.125 | 149.925 | 15 | 3.05 |
| B | pc_volume | 969 | 14 | 0.247 | 0.277 | 0.315 | 0.068 | 0.109 | 0.471 | 0.145 | 0.417 | 28 | 2.89 |
| B | psc_fill | 974 | 9 | 0.120 | 0.180 | 0.260 | 0.140 | 0.000 | 0.620 | -0.090 | 0.470 | 28 | 2.87 |
| D | carb_volume | 492 | 3 | 5.473 | 5.513 | 5.540 | 0.067 | 5.260 | 5.700 | 5.373 | 5.640 | 14 | 2.85 |
| D | filler_level | 491 | 4 | 100.800 | 117.400 | 120.200 | 19.400 | 55.800 | 151.800 | 71.700 | 149.300 | 14 | 2.85 |
| A | fill_ounces | 219 | 6 | 23.933 | 23.980 | 24.033 | 0.100 | 23.753 | 24.313 | 23.783 | 24.183 | 6 | 2.74 |
| D | alch_rel | 495 | 0 | 7.680 | 7.720 | 7.740 | 0.060 | 6.500 | 8.200 | 7.590 | 7.830 | 13 | 2.63 |
| B | carb_pressure | 976 | 7 | 65.000 | 66.800 | 69.000 | 4.000 | 57.000 | 79.400 | 59.000 | 75.000 | 25 | 2.56 |
| D | psc | 490 | 5 | 0.042 | 0.074 | 0.106 | 0.064 | 0.002 | 0.264 | -0.053 | 0.201 | 12 | 2.45 |
| C | psc_fill | 253 | 2 | 0.100 | 0.180 | 0.280 | 0.180 | 0.020 | 0.600 | -0.170 | 0.550 | 6 | 2.37 |
| B | psc | 971 | 12 | 0.052 | 0.080 | 0.114 | 0.062 | 0.004 | 0.268 | -0.041 | 0.207 | 22 | 2.27 |
| A | psc_co2 | 222 | 3 | 0.020 | 0.040 | 0.080 | 0.060 | 0.000 | 0.240 | -0.070 | 0.170 | 5 | 2.25 |
| A | psc_fill | 223 | 2 | 0.100 | 0.180 | 0.260 | 0.160 | 0.000 | 0.580 | -0.140 | 0.500 | 5 | 2.24 |
| A | balling | 225 | 0 | 3.092 | 3.194 | 3.390 | 0.298 | 1.298 | 3.764 | 2.645 | 3.837 | 5 | 2.22 |
| B | fill_pressure | 980 | 3 | 46.000 | 47.000 | 50.000 | 4.000 | 36.000 | 59.200 | 40.000 | 56.000 | 21 | 2.14 |
| D | carb_pressure | 484 | 11 | 68.750 | 70.600 | 72.400 | 3.650 | 62.600 | 78.800 | 63.275 | 77.875 | 10 | 2.07 |
| D | fill_ounces | 490 | 5 | 23.907 | 23.960 | 24.007 | 0.100 | 23.693 | 24.320 | 23.757 | 24.157 | 10 | 2.04 |
| D | psc_fill | 491 | 4 | 0.100 | 0.180 | 0.260 | 0.160 | 0.000 | 0.600 | -0.140 | 0.500 | 10 | 2.04 |
| D | balling | 495 | 0 | 3.390 | 3.490 | 3.688 | 0.298 | 0.998 | 3.984 | 2.943 | 4.135 | 10 | 2.02 |
| D | balling_lvl | 495 | 0 | 3.200 | 3.280 | 3.360 | 0.160 | 0.900 | 3.660 | 2.960 | 3.600 | 10 | 2.02 |
| C | carb_volume | 254 | 1 | 5.260 | 5.300 | 5.340 | 0.080 | 5.047 | 5.473 | 5.140 | 5.460 | 5 | 1.97 |
| C | balling | 255 | 0 | 1.472 | 1.646 | 1.796 | 0.324 | 0.160 | 3.588 | 0.986 | 2.282 | 5 | 1.96 |
| B | alch_rel | 980 | 3 | 6.520 | 6.540 | 6.560 | 0.040 | 6.400 | 8.560 | 6.460 | 6.620 | 19 | 1.94 |
| A | hyd_pressure4 | 216 | 9 | 94.000 | 102.000 | 112.000 | 18.000 | 62.000 | 142.000 | 67.000 | 139.000 | 4 | 1.85 |
| A | carb_volume | 225 | 0 | 5.373 | 5.433 | 5.467 | 0.093 | 5.040 | 5.627 | 5.233 | 5.607 | 4 | 1.78 |
| B | carb_volume | 979 | 4 | 5.280 | 5.313 | 5.347 | 0.067 | 5.080 | 5.560 | 5.180 | 5.447 | 16 | 1.63 |
| C | hyd_pressure4 | 253 | 2 | 94.000 | 102.000 | 110.000 | 16.000 | 74.000 | 140.000 | 70.000 | 134.000 | 4 | 1.58 |
| C | psc | 253 | 2 | 0.052 | 0.082 | 0.122 | 0.070 | 0.002 | 0.256 | -0.053 | 0.227 | 4 | 1.58 |
| B | carb_rel | 981 | 2 | 5.320 | 5.360 | 5.400 | 0.080 | 5.100 | 5.840 | 5.200 | 5.520 | 15 | 1.53 |
| A | pc_volume | 220 | 5 | 0.223 | 0.255 | 0.308 | 0.085 | 0.111 | 0.456 | 0.095 | 0.436 | 3 | 1.36 |
| A | carb_pressure | 224 | 1 | 67.200 | 69.200 | 71.200 | 4.000 | 60.600 | 78.200 | 61.200 | 77.200 | 3 | 1.34 |
| B | carb_temp | 973 | 10 | 138.200 | 140.600 | 143.800 | 5.600 | 128.600 | 154.000 | 129.800 | 152.200 | 13 | 1.34 |
| B | fill_ounces | 972 | 11 | 23.920 | 23.980 | 24.033 | 0.113 | 23.633 | 24.280 | 23.750 | 24.203 | 13 | 1.34 |
| C | pressure_vacuum | 255 | 0 | -5.700 | -5.400 | -5.000 | 0.700 | -6.200 | -3.800 | -6.750 | -3.950 | 3 | 1.18 |
| D | carb_pressure1 | 488 | 7 | 119.000 | 123.400 | 125.800 | 6.800 | 105.600 | 140.200 | 108.800 | 136.000 | 5 | 1.02 |
| A | carb_pressure1 | 222 | 3 | 119.800 | 123.400 | 125.600 | 5.800 | 111.000 | 134.600 | 111.100 | 134.300 | 2 | 0.90 |
| C | carb_pressure | 254 | 1 | 64.800 | 67.200 | 69.200 | 4.400 | 59.600 | 78.200 | 58.200 | 75.800 | 2 | 0.79 |
| C | balling_lvl | 255 | 0 | 1.400 | 1.520 | 1.620 | 0.220 | 0.000 | 3.380 | 1.070 | 1.950 | 2 | 0.78 |
| B | carb_pressure1 | 971 | 12 | 118.600 | 123.000 | 125.400 | 6.800 | 107.000 | 139.400 | 108.400 | 135.600 | 7 | 0.72 |
| A | hyd_pressure1 | 225 | 0 | 0.000 | 10.400 | 19.400 | 19.400 | -0.800 | 53.800 | -29.100 | 48.500 | 1 | 0.44 |
| C | carb_pressure1 | 253 | 2 | 118.600 | 123.200 | 125.200 | 6.600 | 109.600 | 138.400 | 108.700 | 135.100 | 1 | 0.40 |
| C | carb_temp | 253 | 2 | 138.600 | 141.600 | 144.000 | 5.400 | 130.600 | 153.600 | 130.500 | 152.100 | 1 | 0.40 |
| C | alch_rel | 254 | 1 | 6.520 | 6.560 | 6.600 | 0.080 | 6.400 | 7.660 | 6.400 | 6.720 | 1 | 0.39 |
| B | hyd_pressure1 | 981 | 2 | 0.000 | 12.000 | 20.200 | 20.200 | -0.800 | 58.000 | -30.300 | 50.500 | 2 | 0.20 |
brand_outlier_summary |>
filter(outlier_percent >= 5) |>
arrange(desc(outlier_percent)) |>
kable(
caption = "Brand-Level Outlier Checks With At Least 5 Percent Outliers",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center"
)|>
scroll_box(
width = "100%",
height = "500px"
)
| brand_code | variable | count | missing_count | q1 | median | q3 | iqr_value | min_value | max_value | lower_bound | upper_bound | outlier_count | outlier_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D | pressure_setpoint | 493 | 2 | 46.000 | 46.000 | 46.000 | 0.000 | 44.000 | 52.000 | 46.000 | 46.000 | 159 | 32.25 |
| D | bowl_setpoint | 495 | 0 | 110.000 | 120.000 | 120.000 | 10.000 | 70.000 | 140.000 | 95.000 | 135.000 | 110 | 22.22 |
| C | filler_speed | 249 | 6 | 3892.000 | 3980.000 | 3996.000 | 104.000 | 1004.000 | 4030.000 | 3736.000 | 4152.000 | 48 | 19.28 |
| D | filler_speed | 485 | 10 | 3892.000 | 3982.000 | 3996.000 | 104.000 | 998.000 | 4026.000 | 3736.000 | 4152.000 | 91 | 18.76 |
| A | filler_speed | 217 | 8 | 3806.000 | 3984.000 | 3998.000 | 192.000 | 1004.000 | 4024.000 | 3518.000 | 4286.000 | 34 | 15.67 |
| B | filler_speed | 966 | 17 | 3890.000 | 3982.000 | 3998.000 | 108.000 | 1002.000 | 4020.000 | 3728.000 | 4160.000 | 147 | 15.22 |
| C | mfr | 233 | 22 | 709.800 | 721.600 | 729.800 | 20.000 | 76.800 | 835.400 | 679.800 | 759.800 | 31 | 13.30 |
| C | filler_level | 254 | 1 | 108.250 | 118.100 | 120.000 | 11.750 | 69.400 | 150.800 | 90.625 | 137.625 | 32 | 12.60 |
| C | bowl_setpoint | 255 | 0 | 110.000 | 120.000 | 120.000 | 10.000 | 70.000 | 130.000 | 95.000 | 135.000 | 32 | 12.55 |
| D | hyd_pressure4 | 494 | 1 | 78.000 | 80.000 | 84.000 | 6.000 | 62.000 | 140.000 | 69.000 | 93.000 | 61 | 12.35 |
| B | air_pressurer | 983 | 0 | 142.200 | 142.600 | 143.000 | 0.800 | 140.800 | 148.200 | 141.000 | 144.200 | 120 | 12.21 |
| A | mfr | 194 | 31 | 709.200 | 723.600 | 731.000 | 21.800 | 31.400 | 868.600 | 676.500 | 763.700 | 23 | 11.86 |
| D | mfr | 453 | 42 | 707.400 | 724.800 | 732.200 | 24.800 | 112.600 | 792.400 | 670.200 | 769.400 | 51 | 11.26 |
| C | air_pressurer | 255 | 0 | 142.200 | 142.400 | 142.800 | 0.600 | 141.200 | 147.200 | 141.300 | 143.700 | 28 | 10.98 |
| D | density | 495 | 0 | 1.660 | 1.720 | 1.780 | 0.120 | 0.720 | 1.920 | 1.480 | 1.960 | 49 | 9.90 |
| C | density | 255 | 0 | 0.860 | 0.940 | 1.020 | 0.160 | 0.240 | 1.760 | 0.620 | 1.260 | 25 | 9.80 |
| D | fill_pressure | 491 | 4 | 45.800 | 46.200 | 48.000 | 2.200 | 37.800 | 60.000 | 42.500 | 51.300 | 48 | 9.78 |
| C | carb_rel | 252 | 3 | 5.320 | 5.360 | 5.380 | 0.060 | 5.020 | 5.580 | 5.230 | 5.470 | 24 | 9.52 |
| A | oxygen_filler | 223 | 2 | 0.022 | 0.032 | 0.048 | 0.026 | 0.002 | 0.400 | -0.017 | 0.088 | 21 | 9.42 |
| B | hyd_pressure4 | 975 | 8 | 96.000 | 98.000 | 102.000 | 6.000 | 78.000 | 138.000 | 87.000 | 111.000 | 90 | 9.23 |
| D | oxygen_filler | 494 | 1 | 0.019 | 0.033 | 0.052 | 0.033 | 0.002 | 0.318 | -0.030 | 0.101 | 45 | 9.11 |
| A | density | 225 | 0 | 1.540 | 1.600 | 1.660 | 0.120 | 0.840 | 1.760 | 1.360 | 1.840 | 20 | 8.89 |
| A | air_pressurer | 225 | 0 | 142.400 | 142.600 | 143.000 | 0.600 | 141.000 | 147.000 | 141.500 | 143.900 | 18 | 8.00 |
| C | psc_co2 | 250 | 5 | 0.040 | 0.040 | 0.080 | 0.040 | 0.000 | 0.240 | -0.020 | 0.140 | 20 | 8.00 |
| B | mfr | 914 | 69 | 706.200 | 725.000 | 730.800 | 24.600 | 95.400 | 840.000 | 669.300 | 767.700 | 70 | 7.66 |
| A | alch_rel | 224 | 1 | 7.100 | 7.120 | 7.160 | 0.060 | 6.520 | 7.760 | 7.010 | 7.250 | 16 | 7.14 |
| B | oxygen_filler | 981 | 2 | 0.022 | 0.034 | 0.064 | 0.042 | 0.002 | 0.400 | -0.041 | 0.127 | 60 | 6.12 |
| B | density | 983 | 0 | 0.880 | 0.920 | 0.960 | 0.080 | 0.340 | 1.840 | 0.760 | 1.080 | 58 | 5.90 |
| D | pc_volume | 489 | 6 | 0.221 | 0.255 | 0.292 | 0.071 | 0.087 | 0.462 | 0.115 | 0.398 | 27 | 5.52 |
| B | temperature | 980 | 3 | 65.200 | 65.600 | 66.200 | 1.000 | 64.000 | 75.800 | 63.700 | 67.700 | 51 | 5.20 |
| C | oxygen_filler | 252 | 3 | 0.022 | 0.034 | 0.072 | 0.050 | 0.003 | 0.238 | -0.054 | 0.148 | 13 | 5.16 |
| D | pressure_vacuum | 495 | 0 | -5.600 | -5.400 | -5.000 | 0.600 | -6.600 | -3.600 | -6.500 | -4.100 | 25 | 5.05 |
negative_values_by_brand <- train_features_raw |>
filter(!is.na(brand_code)) |>
select(brand_code, all_of(numeric_vars_raw)) |>
pivot_longer(
cols = all_of(numeric_vars_raw),
names_to = "variable",
values_to = "value"
) |>
group_by(brand_code, variable) |>
summarise(
count = sum(!is.na(value)),
negative_count = sum(value < 0, na.rm = TRUE),
negative_percent = round(negative_count / count * 100, 2),
min_value = min(value, na.rm = TRUE),
max_value = max(value, na.rm = TRUE),
.groups = "drop"
) |>
filter(negative_count > 0) |>
arrange(desc(negative_percent))
negative_values_by_brand |>
kable(
caption = "Negative Values by Brand Code",
digits = 3
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| brand_code | variable | count | negative_count | negative_percent | min_value | max_value |
|---|---|---|---|---|---|---|
| A | pressure_vacuum | 225 | 225 | 100.00 | -6.2 | -3.8 |
| B | pressure_vacuum | 983 | 983 | 100.00 | -6.4 | -3.6 |
| C | pressure_vacuum | 255 | 255 | 100.00 | -6.2 | -3.8 |
| D | pressure_vacuum | 495 | 495 | 100.00 | -6.6 | -3.6 |
| B | mnf_flow | 983 | 476 | 48.42 | -100.2 | 213.4 |
| C | mnf_flow | 255 | 118 | 46.27 | -100.2 | 182.0 |
| D | mnf_flow | 495 | 222 | 44.85 | -100.2 | 204.6 |
| A | mnf_flow | 225 | 85 | 37.78 | -100.2 | 216.2 |
| C | hyd_pressure3 | 255 | 8 | 3.14 | -1.2 | 43.8 |
| A | hyd_pressure3 | 224 | 7 | 3.12 | -1.2 | 43.8 |
| B | hyd_pressure3 | 980 | 27 | 2.76 | -1.2 | 45.6 |
| A | hyd_pressure1 | 225 | 6 | 2.67 | -0.8 | 53.8 |
| D | hyd_pressure3 | 488 | 13 | 2.66 | -1.2 | 50.0 |
| C | hyd_pressure1 | 255 | 5 | 1.96 | -0.8 | 43.2 |
| D | hyd_pressure1 | 489 | 8 | 1.64 | -0.8 | 47.0 |
| B | hyd_pressure1 | 981 | 12 | 1.22 | -0.8 | 58.0 |
After finding outliers, I compare them with PH. This helps answer whether unusual predictor values are connected to unusual PH values.
If outlier rows have different PH from non-outlier rows, then removing them may remove important production information. Because of this, I do not want to delete outliers only because they are statistically unusual.
train_features_with_ph <- train_features_raw |>
mutate(ph = train_ph)
Now I will create outlier flags within each brand group:
brand_outlier_ph_data <- train_features_with_ph |>
mutate(row_id = row_number()) |>
filter(!is.na(brand_code)) |>
select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
pivot_longer(
cols = all_of(numeric_vars_raw),
names_to = "variable",
values_to = "value"
) |>
group_by(brand_code, variable) |>
mutate(
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
lower_bound = q1 - 1.5 * iqr_value,
upper_bound = q3 + 1.5 * iqr_value,
is_outlier = value < lower_bound | value > upper_bound
) |>
ungroup()
This will summarize PH for outlier vs non-outlier rows:
outlier_ph_summary <- brand_outlier_ph_data |>
filter(!is.na(value)) |>
group_by(variable, is_outlier) |>
summarise(
count = n(),
mean_ph = mean(ph, na.rm = TRUE),
median_ph = median(ph, na.rm = TRUE),
sd_ph = sd(ph, na.rm = TRUE),
min_ph = min(ph, na.rm = TRUE),
max_ph = max(ph, na.rm = TRUE),
.groups = "drop"
) |>
pivot_wider(
names_from = is_outlier,
values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
names_prefix = "outlier_"
) |>
mutate(
mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
abs_mean_ph_difference = abs(mean_ph_difference)
) |>
arrange(desc(abs_mean_ph_difference))
outlier_ph_summary |>
filter(count_outlier_TRUE >= 10) |>
slice_max(abs_mean_ph_difference, n = 15) |>
kable(
caption = "Top PH Differences for Outlier Rows",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)|>
scroll_box(
width = "100%",
height = "500px"
)
| variable | count_outlier_FALSE | count_outlier_TRUE | mean_ph_outlier_FALSE | mean_ph_outlier_TRUE | median_ph_outlier_FALSE | median_ph_outlier_TRUE | sd_ph_outlier_FALSE | sd_ph_outlier_TRUE | min_ph_outlier_FALSE | min_ph_outlier_TRUE | max_ph_outlier_FALSE | max_ph_outlier_TRUE | mean_ph_difference | abs_mean_ph_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| carb_pressure1 | 1919 | 15 | 8.5497 | 8.6680 | 8.56 | 8.62 | 0.1701 | 0.2733 | 7.88 | 8.38 | 8.94 | 9.36 | 0.1183 | 0.1183 |
| filler_level | 1900 | 46 | 8.5503 | 8.4365 | 8.56 | 8.45 | 0.1720 | 0.1069 | 7.88 | 8.22 | 8.94 | 8.76 | -0.1138 | 0.1138 |
| carb_rel | 1882 | 70 | 8.5513 | 8.4446 | 8.56 | 8.46 | 0.1690 | 0.2020 | 7.88 | 7.90 | 8.94 | 8.94 | -0.1068 | 0.1068 |
| temperature | 1859 | 90 | 8.5495 | 8.4969 | 8.56 | 8.50 | 0.1703 | 0.1853 | 7.88 | 8.08 | 8.94 | 8.92 | -0.0527 | 0.0527 |
| carb_volume | 1911 | 39 | 8.5488 | 8.4964 | 8.54 | 8.52 | 0.1721 | 0.1887 | 7.88 | 8.06 | 9.36 | 8.86 | -0.0524 | 0.0524 |
| oxygen_filler | 1811 | 139 | 8.5437 | 8.5957 | 8.54 | 8.60 | 0.1724 | 0.1642 | 7.88 | 8.20 | 8.94 | 9.36 | 0.0520 | 0.0520 |
| pressure_setpoint | 1790 | 159 | 8.5452 | 8.5877 | 8.54 | 8.60 | 0.1760 | 0.1156 | 7.88 | 8.24 | 9.36 | 8.84 | 0.0425 | 0.0425 |
| pressure_vacuum | 1920 | 38 | 8.5487 | 8.5068 | 8.56 | 8.46 | 0.1727 | 0.1522 | 7.88 | 8.26 | 9.36 | 8.78 | -0.0419 | 0.0419 |
| mfr | 1619 | 175 | 8.5509 | 8.5123 | 8.56 | 8.50 | 0.1702 | 0.1675 | 7.88 | 8.08 | 8.94 | 8.92 | -0.0386 | 0.0386 |
| fill_ounces | 1892 | 40 | 8.5484 | 8.5145 | 8.54 | 8.55 | 0.1722 | 0.2026 | 7.88 | 8.02 | 9.36 | 8.86 | -0.0339 | 0.0339 |
| hyd_pressure4 | 1779 | 159 | 8.5448 | 8.5780 | 8.54 | 8.58 | 0.1700 | 0.1873 | 7.88 | 8.02 | 8.94 | 8.94 | 0.0332 | 0.0332 |
| psc_co2 | 1834 | 90 | 8.5492 | 8.5169 | 8.56 | 8.51 | 0.1713 | 0.1875 | 7.88 | 8.08 | 9.36 | 8.94 | -0.0324 | 0.0324 |
| psc_fill | 1892 | 49 | 8.5470 | 8.5780 | 8.54 | 8.60 | 0.1715 | 0.2126 | 7.88 | 8.12 | 9.36 | 8.94 | 0.0310 | 0.0310 |
| balling_lvl | 1906 | 52 | 8.5471 | 8.5777 | 8.54 | 8.58 | 0.1708 | 0.2228 | 7.88 | 8.10 | 8.94 | 9.36 | 0.0306 | 0.0306 |
| fill_pressure | 1860 | 88 | 8.5488 | 8.5189 | 8.56 | 8.50 | 0.1724 | 0.1533 | 7.88 | 8.22 | 8.94 | 8.88 | -0.0299 | 0.0299 |
top_outlier_ph_vars <- outlier_ph_summary |>
filter(count_outlier_TRUE >= 10) |>
slice_max(abs_mean_ph_difference, n = 6) |>
pull(variable)
brand_outlier_ph_data |>
filter(variable %in% top_outlier_ph_vars) |>
mutate(
outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier")
) |>
ggplot(
aes(x = outlier_status, y = ph)
) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free_x") +
labs(
title = "PH Distribution for Outlier vs Non-Outlier Rows",
x = "Outlier Status",
y = "PH"
) +
theme_minimal()
Since outliers were identified within each brand_code
group, I also compare them with PH within each brand_code
group. This is more logical than combining all brands together, because
each brand may have different normal production behavior.
This check helps me see whether outlier values for a specific brand are connected with different PH values.
train_features_with_ph <- train_features_raw |>
mutate(ph = train_ph)
brand_outlier_ph_data <- train_features_with_ph |>
mutate(row_id = row_number()) |>
filter(!is.na(brand_code)) |>
select(row_id, brand_code, ph, all_of(numeric_vars_raw)) |>
pivot_longer(
cols = all_of(numeric_vars_raw),
names_to = "variable",
values_to = "value"
) |>
group_by(brand_code, variable) |>
mutate(
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
lower_bound = q1 - 1.5 * iqr_value,
upper_bound = q3 + 1.5 * iqr_value,
is_outlier = value < lower_bound | value > upper_bound
) |>
ungroup()
brand_outlier_ph_summary <- brand_outlier_ph_data |>
filter(!is.na(value)) |>
group_by(brand_code, variable, is_outlier) |>
summarise(
count = n(),
mean_ph = mean(ph, na.rm = TRUE),
median_ph = median(ph, na.rm = TRUE),
sd_ph = sd(ph, na.rm = TRUE),
min_ph = min(ph, na.rm = TRUE),
max_ph = max(ph, na.rm = TRUE),
.groups = "drop"
) |>
pivot_wider(
names_from = is_outlier,
values_from = c(count, mean_ph, median_ph, sd_ph, min_ph, max_ph),
names_prefix = "outlier_"
) |>
mutate(
mean_ph_difference = mean_ph_outlier_TRUE - mean_ph_outlier_FALSE,
abs_mean_ph_difference = abs(mean_ph_difference)
) |>
arrange(desc(abs_mean_ph_difference))
top_brand_outlier_ph_differences <- brand_outlier_ph_summary |>
filter(count_outlier_TRUE >= 10) |>
slice_max(abs_mean_ph_difference, n = 20)
top_brand_outlier_ph_differences |>
kable(
caption = "Top PH Differences for Brand-Level Outlier Rows",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed")
)|>
scroll_box(
width = "100%",
height = "500px"
)
| brand_code | variable | count_outlier_FALSE | count_outlier_TRUE | mean_ph_outlier_FALSE | mean_ph_outlier_TRUE | median_ph_outlier_FALSE | median_ph_outlier_TRUE | sd_ph_outlier_FALSE | sd_ph_outlier_TRUE | min_ph_outlier_FALSE | min_ph_outlier_TRUE | max_ph_outlier_FALSE | max_ph_outlier_TRUE | mean_ph_difference | abs_mean_ph_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | oxygen_filler | 239 | 13 | 8.4049 | 8.5892 | 8.42 | 8.58 | 0.1675 | 0.2872 | 7.88 | 8.26 | 8.80 | 9.36 | 0.1844 | 0.1844 |
| C | carb_rel | 228 | 24 | 8.4211 | 8.3117 | 8.43 | 8.33 | 0.1630 | 0.1910 | 7.88 | 7.90 | 8.80 | 8.62 | -0.1095 | 0.1095 |
| B | temperature | 929 | 51 | 8.5727 | 8.4639 | 8.56 | 8.44 | 0.1659 | 0.1698 | 8.00 | 8.18 | 8.94 | 8.86 | -0.1088 | 0.1088 |
| C | density | 230 | 25 | 8.4277 | 8.3200 | 8.44 | 8.26 | 0.1558 | 0.3185 | 7.90 | 7.88 | 8.80 | 9.36 | -0.1077 | 0.1077 |
| C | filler_speed | 201 | 48 | 8.3945 | 8.4975 | 8.42 | 8.51 | 0.1612 | 0.1888 | 7.88 | 8.02 | 8.80 | 8.80 | 0.1030 | 0.1030 |
| B | carb_temp | 960 | 13 | 8.5688 | 8.4723 | 8.56 | 8.46 | 0.1680 | 0.1372 | 8.00 | 8.24 | 8.94 | 8.80 | -0.0965 | 0.0965 |
| D | carb_rel | 471 | 24 | 8.6061 | 8.5217 | 8.62 | 8.55 | 0.1363 | 0.1487 | 8.24 | 8.16 | 8.92 | 8.78 | -0.0844 | 0.0844 |
| B | carb_volume | 963 | 16 | 8.5689 | 8.4875 | 8.56 | 8.48 | 0.1677 | 0.1805 | 8.00 | 8.08 | 8.94 | 8.80 | -0.0814 | 0.0814 |
| A | mfr | 171 | 23 | 8.4916 | 8.4130 | 8.52 | 8.44 | 0.1563 | 0.1784 | 8.06 | 8.12 | 8.86 | 8.74 | -0.0785 | 0.0785 |
| A | filler_speed | 183 | 34 | 8.4833 | 8.5524 | 8.52 | 8.59 | 0.1574 | 0.1773 | 8.06 | 8.20 | 8.86 | 8.86 | 0.0691 | 0.0691 |
| A | oxygen_filler | 202 | 21 | 8.4830 | 8.5514 | 8.50 | 8.58 | 0.1607 | 0.1612 | 8.06 | 8.20 | 8.86 | 8.86 | 0.0685 | 0.0685 |
| B | psc_fill | 946 | 28 | 8.5661 | 8.6329 | 8.56 | 8.71 | 0.1674 | 0.1813 | 8.00 | 8.26 | 8.94 | 8.94 | 0.0668 | 0.0668 |
| B | balling_lvl | 952 | 31 | 8.5656 | 8.6316 | 8.56 | 8.68 | 0.1675 | 0.1716 | 8.00 | 8.22 | 8.94 | 8.94 | 0.0660 | 0.0660 |
| D | psc_fill | 481 | 10 | 8.6001 | 8.6660 | 8.60 | 8.65 | 0.1381 | 0.1310 | 8.16 | 8.50 | 8.92 | 8.92 | 0.0659 | 0.0659 |
| D | fill_pressure | 443 | 48 | 8.6086 | 8.5446 | 8.62 | 8.52 | 0.1377 | 0.1286 | 8.16 | 8.24 | 8.92 | 8.82 | -0.0640 | 0.0640 |
| D | filler_level | 477 | 14 | 8.6044 | 8.5429 | 8.62 | 8.54 | 0.1386 | 0.0938 | 8.16 | 8.40 | 8.92 | 8.76 | -0.0615 | 0.0615 |
| C | fill_pressure | 242 | 10 | 8.4097 | 8.4700 | 8.42 | 8.49 | 0.1714 | 0.1367 | 7.88 | 8.26 | 8.80 | 8.62 | 0.0603 | 0.0603 |
| B | fill_ounces | 959 | 13 | 8.5677 | 8.5077 | 8.56 | 8.52 | 0.1681 | 0.1916 | 8.00 | 8.18 | 8.94 | 8.80 | -0.0600 | 0.0600 |
| D | balling_lvl | 485 | 10 | 8.6032 | 8.5440 | 8.62 | 8.56 | 0.1368 | 0.1849 | 8.24 | 8.16 | 8.92 | 8.78 | -0.0592 | 0.0592 |
| D | psc | 478 | 12 | 8.6030 | 8.5450 | 8.62 | 8.55 | 0.1388 | 0.1138 | 8.16 | 8.34 | 8.92 | 8.70 | -0.0580 | 0.0580 |
The brand-level PH comparison shows that outliers do not behave the same way for every brand.
For Brand C, oxygen_filler outlier rows had average
PH of 8.5892, while non-outlier rows had average PH of
8.4049.
This is a difference of about +0.18 PH. Since PH values in this
dataset do not vary very widely, this difference is meaningful. It
suggests that high or unusual oxygen_filler values may be
connected with higher PH for Brand C.
For Brand C, carb_rel outlier rows had average PH of
8.3117, while non-outlier rows had average PH of
8.4211.
This is a difference of about -0.11 PH. This means unusual
carb_rel values for Brand C may be connected with lower
PH.
For Brand B, temperature outlier rows had average PH
of 8.4639, while non-outlier rows had average PH of
8.5727.
This is a difference of about -0.11 PH. This suggests that
unusual temperature values may matter for Brand B and should not be
removed without thinking.
For Brand A, mfr outlier rows had average PH of
8.4130, while non-outlier rows had average PH of
8.4916.
This is a smaller difference, about -0.08 PH, but it still
shows that unusual mfr values may have some relationship
with PH.
These examples show that some outliers may contain useful production information. If I remove all outliers without proper analysis, I might remove cases where the process behaves differently and PH changes.
For this project, I will use brand-level capping for selected continuous variables. This means the outlier limits are calculated separately inside each Brand Code group, not from the full dataset. This is more logical because each brand may have its own normal operating range.
I will not cap setting-like or grouped variables such as
pressure_setpoint, bowl_setpoint,
air_pressurer, psc_co2, and
pressure_vacuum. These variables look like they have fixed
or grouped values by design, so regular IQR outlier rules may not make
sense for them.
Overall, I do not treat outliers as simple errors. I treat them as unusual production values that need careful handling. Capping will be documented as a preprocessing step, and later model performance can show whether this treatment helps or hurts prediction.
vars_excluded_from_capping <- c(
"pressure_setpoint",
"bowl_setpoint",
"air_pressurer",
"psc_co2",
"pressure_vacuum"
)
top_brand_outlier_vars <- top_brand_outlier_ph_differences |>
distinct(brand_code, variable) |>
slice_head(n = 8)
# brand_outlier_ph_data |>
# semi_join(
# top_brand_outlier_vars,
# by = c("brand_code", "variable")
# ) |>
# mutate(
# outlier_status = ifelse(is_outlier, "Outlier", "Not Outlier"),
# brand_variable = paste("Brand", brand_code, "-", variable)
# ) |>
# ggplot(
# aes(x = outlier_status, y = ph)
# ) +
# geom_boxplot() +
# facet_wrap(~ brand_variable, scales = "free_x") +
# labs(
# title = "PH Distribution for Brand-Level Outlier vs Non-Outlier Rows",
# x = "Outlier Status",
# y = "PH"
# ) +
# theme_minimal()
Based on the outlier review, I apply capping to selected continuous numeric predictors. The caps are calculated from the training data only and separately within each Brand Code group.
This avoids using validation or final evaluation data to make preprocessing decisions. It also keeps the capping process more fair because each brand can have different normal operating ranges.
I do not cap setting-like or grouped variables because IQR limits may
not make sense for them. In the first version, I exclude
pressure_setpoint, bowl_setpoint,
air_pressurer, psc_co2, and
pressure_vacuum from capping.
pressure_setpoint and bowl_setpoint both
look like process settings with limited possible values.
air_pressurer and psc_co2 have grouped or
repeated values, so regular IQR capping may over-treat them.
pressure_vacuum is excluded because negative values appear
to be part of how this variable is measured.
vars_excluded_from_capping <- c(
"pressure_setpoint",
"bowl_setpoint",
"air_pressurer",
"psc_co2",
"pressure_vacuum"
)
get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
numeric_vars <- data_x |>
select(where(is.numeric)) |>
names()
numeric_vars[
!numeric_vars %in% excluded_vars
]
}
create_capping_limits <- function(train_x, numeric_vars_to_cap, brand_col = "brand_code") {
brand_cap_limits <- train_x |>
filter(!is.na(.data[[brand_col]])) |>
select(all_of(brand_col), all_of(numeric_vars_to_cap)) |>
pivot_longer(
cols = all_of(numeric_vars_to_cap),
names_to = "variable",
values_to = "value"
) |>
group_by(.data[[brand_col]], variable) |>
summarise(
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
lower_cap = q1 - 1.5 * iqr_value,
upper_cap = q3 + 1.5 * iqr_value,
.groups = "drop"
) |>
rename(brand_code = all_of(brand_col))
global_cap_limits <- train_x |>
select(all_of(numeric_vars_to_cap)) |>
pivot_longer(
cols = all_of(numeric_vars_to_cap),
names_to = "variable",
values_to = "value"
) |>
group_by(variable) |>
summarise(
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
lower_cap = q1 - 1.5 * iqr_value,
upper_cap = q3 + 1.5 * iqr_value,
.groups = "drop"
)
list(
brand_cap_limits = brand_cap_limits,
global_cap_limits = global_cap_limits
)
}
apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
capped_data <- data_x
brand_cap_limits <- cap_limits$brand_cap_limits
global_cap_limits <- cap_limits$global_cap_limits
for (var in numeric_vars_to_cap) {
for (brand in unique(brand_cap_limits$brand_code)) {
caps <- brand_cap_limits |>
filter(
brand_code == brand,
variable == var
)
rows_to_cap <- which(capped_data[[brand_col]] == brand)
if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
capped_data[[var]][rows_to_cap] <- pmin(
pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
caps$upper_cap
)
}
}
global_caps <- global_cap_limits |>
filter(variable == var)
missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
capped_data[[var]][missing_brand_rows] <- pmin(
pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
global_caps$upper_cap
)
}
}
capped_data
}
apply_brand_caps <- function(data_x, numeric_vars_to_cap, cap_limits, brand_col = "brand_code") {
capped_data <- data_x
brand_cap_limits <- cap_limits$brand_cap_limits
global_cap_limits <- cap_limits$global_cap_limits
for (var in numeric_vars_to_cap) {
for (brand in unique(brand_cap_limits$brand_code)) {
caps <- brand_cap_limits |>
filter(
brand_code == brand,
variable == var
)
rows_to_cap <- which(capped_data[[brand_col]] == brand)
if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
capped_data[[var]][rows_to_cap] <- pmin(
pmax(capped_data[[var]][rows_to_cap], caps$lower_cap),
caps$upper_cap
)
}
}
global_caps <- global_cap_limits |>
filter(variable == var)
missing_brand_rows <- which(is.na(capped_data[[brand_col]]))
if (nrow(global_caps) == 1 && length(missing_brand_rows) > 0) {
capped_data[[var]][missing_brand_rows] <- pmin(
pmax(capped_data[[var]][missing_brand_rows], global_caps$lower_cap),
global_caps$upper_cap
)
}
}
capped_data
}
summarize_capping_by_brand <- function(original_x, capped_x, numeric_vars_to_cap, dataset_name) {
map_dfr(
numeric_vars_to_cap,
function(var) {
original_values <- original_x[[var]]
capped_values <- capped_x[[var]]
changed_values <- !is.na(original_values) &
!is.na(capped_values) &
original_values != capped_values
tibble(
dataset = dataset_name,
variable = var,
brand_code = original_x$brand_code,
was_capped = changed_values
) |>
mutate(
brand_code = ifelse(
is.na(brand_code),
"Missing Brand",
as.character(brand_code)
)
) |>
group_by(dataset, variable, brand_code) |>
summarise(
values_capped = sum(was_capped),
.groups = "drop"
)
}
) |>
pivot_wider(
names_from = brand_code,
values_from = values_capped,
values_fill = 0
) |>
mutate(
Total = rowSums(
across(
where(is.numeric)
)
)
)
}
First, I run the capping process using the original excluded variable list. This helps me see how much the capping rule changes the data before deciding whether the excluded list needs to be updated.
vars_excluded_from_capping_v1 <- c(
"pressure_setpoint",
"bowl_setpoint",
"air_pressurer",
"psc_co2",
"pressure_vacuum"
)
numeric_vars_to_cap_v1 <- get_numeric_vars_to_cap(
data_x = train_features_raw,
excluded_vars = vars_excluded_from_capping_v1
)
numeric_vars_to_cap_v1
[1] "carb_volume" "fill_ounces" "pc_volume" "carb_pressure"
[5] "carb_temp" "psc" "psc_fill" "mnf_flow"
[9] "carb_pressure1" "fill_pressure" "hyd_pressure1" "hyd_pressure2"
[13] "hyd_pressure3" "hyd_pressure4" "filler_level" "filler_speed"
[17] "temperature" "usage_cont" "carb_flow" "density"
[21] "mfr" "balling" "oxygen_filler" "alch_rel"
[25] "carb_rel" "balling_lvl"
cap_limits_v1 <- create_capping_limits(
train_x = train_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
brand_col = "brand_code"
)
train_features_capped_v1 <- apply_brand_caps(
data_x = train_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
cap_limits = cap_limits_v1,
brand_col = "brand_code"
)
valid_features_capped_v1 <- apply_brand_caps(
data_x = valid_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
cap_limits = cap_limits_v1,
brand_col = "brand_code"
)
eval_features_capped_v1 <- apply_brand_caps(
data_x = eval_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
cap_limits = cap_limits_v1,
brand_col = "brand_code"
)
get_numeric_vars_to_cap <- function(data_x, excluded_vars) {
numeric_vars <- data_x |>
select(where(is.numeric)) |>
names()
numeric_vars[
!numeric_vars %in% excluded_vars
]
}
capping_summary_v1 <- bind_rows(
summarize_capping_by_brand(
original_x = train_features_raw,
capped_x = train_features_capped_v1,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
dataset_name = "Training"
),
summarize_capping_by_brand(
original_x = valid_features_raw,
capped_x = valid_features_capped_v1,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
dataset_name = "Validation"
),
summarize_capping_by_brand(
original_x = eval_features_raw,
capped_x = eval_features_capped_v1,
numeric_vars_to_cap = numeric_vars_to_cap_v1,
dataset_name = "Final Evaluation"
)
) |>
mutate(
dataset = factor(
dataset,
levels = c("Training", "Validation", "Final Evaluation")
)
) |>
arrange(dataset, desc(Total))
capping_summary_v1 |>
kable(
caption = "First Capping Attempt: Values Capped by Brand Code",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)|>
scroll_box(
width = "100%",
height = "500px"
)
| dataset | variable | A | B | C | D | Missing Brand | Total |
|---|---|---|---|---|---|---|---|
| Training | filler_speed | 34 | 147 | 48 | 91 | 28 | 348 |
| Training | mfr | 23 | 70 | 31 | 51 | 19 | 194 |
| Training | hyd_pressure4 | 4 | 90 | 4 | 61 | 5 | 164 |
| Training | density | 20 | 58 | 25 | 49 | 0 | 152 |
| Training | oxygen_filler | 21 | 60 | 13 | 45 | 8 | 147 |
| Training | temperature | 11 | 51 | 8 | 20 | 16 | 106 |
| Training | fill_pressure | 9 | 21 | 10 | 48 | 7 | 95 |
| Training | pc_volume | 3 | 28 | 10 | 27 | 13 | 81 |
| Training | carb_rel | 7 | 15 | 24 | 24 | 0 | 70 |
| Training | alch_rel | 16 | 19 | 1 | 13 | 5 | 54 |
| Training | balling_lvl | 9 | 31 | 2 | 10 | 0 | 52 |
| Training | psc_fill | 5 | 28 | 6 | 10 | 1 | 50 |
| Training | balling | 5 | 30 | 5 | 10 | 0 | 50 |
| Training | psc | 8 | 22 | 4 | 12 | 3 | 49 |
| Training | fill_ounces | 6 | 13 | 11 | 10 | 7 | 47 |
| Training | filler_level | 0 | 0 | 32 | 14 | 0 | 46 |
| Training | carb_pressure | 3 | 25 | 2 | 10 | 2 | 42 |
| Training | carb_volume | 4 | 16 | 5 | 14 | 1 | 40 |
| Training | carb_temp | 0 | 13 | 1 | 15 | 2 | 31 |
| Training | carb_pressure1 | 2 | 7 | 1 | 5 | 2 | 17 |
| Training | hyd_pressure1 | 1 | 2 | 0 | 0 | 1 | 4 |
| Training | mnf_flow | 0 | 0 | 0 | 0 | 0 | 0 |
| Training | hyd_pressure2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Training | hyd_pressure3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Training | usage_cont | 0 | 0 | 0 | 0 | 0 | 0 |
| Training | carb_flow | 0 | 0 | 0 | 0 | 0 | 0 |
| Validation | filler_speed | 8 | 33 | 9 | 20 | 9 | 79 |
| Validation | mfr | 7 | 24 | 6 | 14 | 7 | 58 |
| Validation | oxygen_filler | 3 | 15 | 3 | 12 | 1 | 34 |
| Validation | hyd_pressure4 | 1 | 18 | 1 | 12 | 0 | 32 |
| Validation | temperature | 2 | 19 | 2 | 3 | 2 | 28 |
| Validation | density | 4 | 13 | 2 | 9 | 0 | 28 |
| Validation | carb_rel | 1 | 3 | 10 | 5 | 1 | 20 |
| Validation | pc_volume | 0 | 4 | 3 | 8 | 2 | 17 |
| Validation | psc_fill | 1 | 9 | 1 | 5 | 1 | 17 |
| Validation | fill_pressure | 2 | 8 | 3 | 4 | 0 | 17 |
| Validation | balling_lvl | 3 | 9 | 1 | 3 | 0 | 16 |
| Validation | alch_rel | 5 | 5 | 1 | 3 | 0 | 14 |
| Validation | carb_pressure | 2 | 6 | 0 | 2 | 1 | 11 |
| Validation | filler_level | 0 | 0 | 8 | 2 | 0 | 10 |
| Validation | balling | 1 | 5 | 1 | 3 | 0 | 10 |
| Validation | carb_volume | 1 | 7 | 0 | 1 | 0 | 9 |
| Validation | fill_ounces | 0 | 5 | 1 | 1 | 1 | 8 |
| Validation | carb_temp | 3 | 3 | 0 | 1 | 1 | 8 |
| Validation | psc | 1 | 1 | 3 | 2 | 0 | 7 |
| Validation | carb_pressure1 | 0 | 1 | 0 | 0 | 2 | 3 |
| Validation | hyd_pressure1 | 0 | 1 | 0 | 0 | 0 | 1 |
| Validation | mnf_flow | 0 | 0 | 0 | 0 | 0 | 0 |
| Validation | hyd_pressure2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Validation | hyd_pressure3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Validation | usage_cont | 0 | 0 | 0 | 0 | 0 | 0 |
| Validation | carb_flow | 0 | 0 | 0 | 0 | 0 | 0 |
| Final Evaluation | filler_speed | 5 | 20 | 6 | 14 | 2 | 47 |
| Final Evaluation | hyd_pressure4 | 1 | 16 | 0 | 14 | 2 | 33 |
| Final Evaluation | mfr | 6 | 10 | 6 | 8 | 2 | 32 |
| Final Evaluation | density | 2 | 11 | 3 | 6 | 0 | 22 |
| Final Evaluation | temperature | 2 | 8 | 0 | 6 | 4 | 20 |
| Final Evaluation | oxygen_filler | 3 | 5 | 4 | 5 | 1 | 18 |
| Final Evaluation | fill_pressure | 1 | 2 | 0 | 10 | 1 | 14 |
| Final Evaluation | balling | 1 | 9 | 0 | 4 | 0 | 14 |
| Final Evaluation | psc | 3 | 5 | 1 | 2 | 0 | 11 |
| Final Evaluation | carb_volume | 2 | 5 | 1 | 1 | 0 | 9 |
| Final Evaluation | carb_rel | 0 | 2 | 2 | 5 | 0 | 9 |
| Final Evaluation | pc_volume | 2 | 2 | 0 | 4 | 0 | 8 |
| Final Evaluation | balling_lvl | 2 | 5 | 0 | 1 | 0 | 8 |
| Final Evaluation | carb_pressure | 2 | 3 | 0 | 1 | 0 | 6 |
| Final Evaluation | alch_rel | 1 | 4 | 0 | 1 | 0 | 6 |
| Final Evaluation | carb_temp | 1 | 2 | 0 | 1 | 0 | 4 |
| Final Evaluation | psc_fill | 1 | 1 | 0 | 2 | 0 | 4 |
| Final Evaluation | fill_ounces | 1 | 0 | 1 | 1 | 0 | 3 |
| Final Evaluation | filler_level | 0 | 0 | 1 | 1 | 0 | 2 |
| Final Evaluation | carb_pressure1 | 1 | 0 | 0 | 0 | 0 | 1 |
| Final Evaluation | hyd_pressure1 | 1 | 0 | 0 | 0 | 0 | 1 |
| Final Evaluation | mnf_flow | 0 | 0 | 0 | 0 | 0 | 0 |
| Final Evaluation | hyd_pressure2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Final Evaluation | hyd_pressure3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Final Evaluation | usage_cont | 0 | 0 | 0 | 0 | 0 | 0 |
| Final Evaluation | carb_flow | 0 | 0 | 0 | 0 | 0 | 0 |
The first capping attempt showed that some variables were changed a
lot. The largest issue was filler_speed. In the training
data, filler_speed had 348 capped values,
which is a large number.
This made me review the earlier brand-level outlier table more
carefully. filler_speed had high outlier percentages across
all brands:
filler_speed.filler_speed.filler_speed.filler_speed.The low filler_speed values are around 998 to
1004, while most normal values are closer to 3900 to
4000. This may look like an error at first, but because the
same pattern appears across all brands, it may represent a real
slow-speed production state, startup/shutdown condition, or machine
slowdown.
Because of this, I do not want to cap filler_speed.
Capping these low values may hide a real condition that could be
important for predicting PH.
The first capping attempt was still useful because it showed that the
original capping rule was too aggressive for filler_speed.
Based on this review, I update the excluded variable list and rerun the
capping process.
vars_excluded_from_capping_v2 <- c(
"pressure_setpoint",
"bowl_setpoint",
"air_pressurer",
"psc_co2",
"pressure_vacuum"
)
numeric_vars_to_cap_v2 <- get_numeric_vars_to_cap(
data_x = train_features_raw,
excluded_vars = vars_excluded_from_capping_v2
)
numeric_vars_to_cap_v2
[1] "carb_volume" "fill_ounces" "pc_volume" "carb_pressure"
[5] "carb_temp" "psc" "psc_fill" "mnf_flow"
[9] "carb_pressure1" "fill_pressure" "hyd_pressure1" "hyd_pressure2"
[13] "hyd_pressure3" "hyd_pressure4" "filler_level" "filler_speed"
[17] "temperature" "usage_cont" "carb_flow" "density"
[21] "mfr" "balling" "oxygen_filler" "alch_rel"
[25] "carb_rel" "balling_lvl"
ph_difference_cutoff <- 0.05
min_outlier_count_for_rule <- 10
cap_allowed_combos <- brand_outlier_ph_summary |>
filter(
count_outlier_TRUE >= min_outlier_count_for_rule,
abs_mean_ph_difference < ph_difference_cutoff
) |>
select(
brand_code,
variable,
count_outlier_TRUE,
mean_ph_difference,
abs_mean_ph_difference
) |>
arrange(abs_mean_ph_difference)
cap_allowed_combos |>
kable(
caption = "Brand-Variable Outlier Groups Allowed for Capping",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)|>
scroll_box(
width = "100%",
height = "500px"
)
| brand_code | variable | count_outlier_TRUE | mean_ph_difference | abs_mean_ph_difference |
|---|---|---|---|---|
| D | filler_speed | 91 | 0.0015 | 0.0015 |
| B | filler_speed | 147 | 0.0022 | 0.0022 |
| C | air_pressurer | 28 | 0.0033 | 0.0033 |
| B | hyd_pressure4 | 90 | 0.0039 | 0.0039 |
| B | air_pressurer | 120 | -0.0059 | 0.0059 |
| D | psc_co2 | 17 | 0.0064 | 0.0064 |
| D | air_pressurer | 22 | -0.0097 | 0.0097 |
| D | hyd_pressure4 | 61 | 0.0140 | 0.0140 |
| A | density | 20 | 0.0143 | 0.0143 |
| C | psc_co2 | 20 | -0.0160 | 0.0160 |
| B | oxygen_filler | 60 | 0.0166 | 0.0166 |
| D | carb_pressure | 10 | -0.0174 | 0.0174 |
| D | carb_volume | 14 | -0.0181 | 0.0181 |
| B | psc | 22 | -0.0202 | 0.0202 |
| D | balling | 10 | -0.0204 | 0.0204 |
| D | pressure_setpoint | 159 | -0.0215 | 0.0215 |
| B | density | 58 | -0.0254 | 0.0254 |
| D | density | 49 | 0.0263 | 0.0263 |
| C | filler_level | 32 | -0.0268 | 0.0268 |
| B | psc_co2 | 48 | -0.0277 | 0.0277 |
| C | bowl_setpoint | 32 | -0.0281 | 0.0281 |
| B | pc_volume | 28 | -0.0288 | 0.0288 |
| B | balling | 30 | 0.0299 | 0.0299 |
| A | air_pressurer | 18 | -0.0300 | 0.0300 |
| A | pressure_vacuum | 10 | -0.0303 | 0.0303 |
| B | carb_rel | 15 | -0.0313 | 0.0313 |
| B | carb_pressure | 25 | -0.0333 | 0.0333 |
| D | temperature | 20 | -0.0333 | 0.0333 |
| D | mfr | 51 | -0.0334 | 0.0334 |
| D | bowl_setpoint | 110 | -0.0353 | 0.0353 |
| D | oxygen_filler | 45 | 0.0369 | 0.0369 |
| D | fill_ounces | 10 | 0.0386 | 0.0386 |
| B | mfr | 70 | -0.0437 | 0.0437 |
| C | pc_volume | 10 | 0.0458 | 0.0458 |
| C | mfr | 31 | 0.0462 | 0.0462 |
| B | alch_rel | 19 | 0.0467 | 0.0467 |
| D | carb_temp | 15 | -0.0478 | 0.0478 |
| C | fill_ounces | 11 | 0.0488 | 0.0488 |
brand_cap_limits_v2 <- train_features_raw |>
filter(!is.na(brand_code)) |>
select(brand_code, all_of(numeric_vars_to_cap_v2)) |>
pivot_longer(
cols = all_of(numeric_vars_to_cap_v2),
names_to = "variable",
values_to = "value"
) |>
group_by(brand_code, variable) |>
summarise(
q1 = quantile(value, 0.25, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
iqr_value = q3 - q1,
lower_cap = q1 - 1.5 * iqr_value,
upper_cap = q3 + 1.5 * iqr_value,
.groups = "drop"
)
apply_brand_caps_ph_allowed <- function(
data_x,
numeric_vars_to_cap,
brand_cap_limits,
cap_allowed_combos,
brand_col = "brand_code"
) {
capped_data <- data_x
for (i in seq_len(nrow(cap_allowed_combos))) {
current_brand <- cap_allowed_combos$brand_code[i]
current_var <- cap_allowed_combos$variable[i]
if (current_var %in% numeric_vars_to_cap) {
caps <- brand_cap_limits |>
filter(
brand_code == current_brand,
variable == current_var
)
rows_to_cap <- which(capped_data[[brand_col]] == current_brand)
if (nrow(caps) == 1 && length(rows_to_cap) > 0) {
capped_data[[current_var]][rows_to_cap] <- pmin(
pmax(capped_data[[current_var]][rows_to_cap], caps$lower_cap),
caps$upper_cap
)
}
}
}
capped_data
}
train_features_capped <- apply_brand_caps_ph_allowed(
data_x = train_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
brand_cap_limits = brand_cap_limits_v2,
cap_allowed_combos = cap_allowed_combos,
brand_col = "brand_code"
)
valid_features_capped <- apply_brand_caps_ph_allowed(
data_x = valid_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
brand_cap_limits = brand_cap_limits_v2,
cap_allowed_combos = cap_allowed_combos,
brand_col = "brand_code"
)
eval_features_capped <- apply_brand_caps_ph_allowed(
data_x = eval_features_raw,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
brand_cap_limits = brand_cap_limits_v2,
cap_allowed_combos = cap_allowed_combos,
brand_col = "brand_code"
)
capping_summary_v2 <- bind_rows(
summarize_capping_by_brand(
original_x = train_features_raw,
capped_x = train_features_capped,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
dataset_name = "Training"
),
summarize_capping_by_brand(
original_x = valid_features_raw,
capped_x = valid_features_capped,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
dataset_name = "Validation"
),
summarize_capping_by_brand(
original_x = eval_features_raw,
capped_x = eval_features_capped,
numeric_vars_to_cap = numeric_vars_to_cap_v2,
dataset_name = "Final Evaluation"
)
) |>
mutate(
dataset = factor(
dataset,
levels = c("Training", "Validation", "Final Evaluation")
)
) |>
filter(Total > 0) |>
arrange(dataset, desc(Total))
capping_summary_v2 |>
slice_max(
order_by = Total,
n = 15,
with_ties = FALSE
) |>
kable(
caption = "Top 15 Variables with Values Capped by Brand-Level PH-Difference Rule.
Counts of variables' values capped are in each Brand Code group",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | variable | A | B | C | D | Missing Brand | Total |
|---|---|---|---|---|---|---|---|
| Training | filler_speed | 0 | 147 | 0 | 91 | 0 | 238 |
| Training | mfr | 0 | 70 | 31 | 51 | 0 | 152 |
| Training | hyd_pressure4 | 0 | 90 | 0 | 61 | 0 | 151 |
| Training | density | 20 | 58 | 0 | 49 | 0 | 127 |
| Training | oxygen_filler | 0 | 60 | 0 | 45 | 0 | 105 |
| Validation | filler_speed | 0 | 33 | 0 | 20 | 0 | 53 |
| Validation | mfr | 0 | 24 | 6 | 14 | 0 | 44 |
| Training | balling | 0 | 30 | 0 | 10 | 0 | 40 |
| Training | pc_volume | 0 | 28 | 10 | 0 | 0 | 38 |
| Training | carb_pressure | 0 | 25 | 0 | 10 | 0 | 35 |
| Final Evaluation | filler_speed | 0 | 20 | 0 | 14 | 0 | 34 |
| Training | filler_level | 0 | 0 | 32 | 0 | 0 | 32 |
| Validation | hyd_pressure4 | 0 | 18 | 0 | 12 | 0 | 30 |
| Final Evaluation | hyd_pressure4 | 0 | 16 | 0 | 14 | 0 | 30 |
| Validation | oxygen_filler | 0 | 15 | 0 | 12 | 0 | 27 |
After reviewing the outlier-PH comparison, I updated the capping strategy to be selective by both Brand Code and variable.
The PH-difference value compares the average PH of outlier rows with
the average PH of non-outlier rows within the same Brand Code and
variable group. For example, if Brand B has outliers in
filler_speed, I compare the average PH for Brand B rows
where filler_speed is an outlier against the average PH for
Brand B rows where filler_speed is not an outlier.
This helps separate outliers that may contain useful production signal from outliers that are more likely just extreme numeric values. If the PH difference is large, the outlier group may be related to changes in PH, so capping those values could remove useful information. If the PH difference is small, the outlier group does not appear to behave very differently in terms of PH, so capping is safer.
For this project, I will use averages difference of 0.05 PH as the cutoff. If a specific Brand Code and variable combination has an absolute PH difference below 0.05, capping is allowed for that group. If the absolute PH difference is 0.05 or larger, that group is not capped because those outliers may contain useful signal for predicting PH.
Capping will not be applied globally to an entire variable. It is
applied only to the specific Brand Code value’s group and variable
combination that meets the rule. For example, if Brand B
filler_speed has a very small PH difference, then Brand B
filler_speedoutliers can be capped. This does not mean
filler_speed is automatically capped for Brands A, C, or
D.
For example, filler_speed shows how the brand-level rule
works. In the training data, filler_speed had 147
capped values for Brand B and 91 capped values for Brand
D, for a total of 238 capped values.
| dataset | variable | A | B | C | D | Missing Brand | Total |
|---|---|---|---|---|---|---|---|
| Training | filler_speed | 0 | 147 | 0 | 91 | 0 | 238 |
This does not mean filler_speed was capped globally for
every brand. It means the PH-difference rule allowed capping for
specific brand groups where the outlier values did not show meaningful
PH difference. In this case, Brand B and Brand D
filler_speed values were capped, while Brand A and Brand C
filler_speed values were not capped.
The final capped feature sets are then passed into the imputation workflow.
The preprocessing review showed that predictors’ missing values exist in the training, validation, and final evaluation data. Since many models (for example linear models) cannot use missing predictor values directly, I want to create completed feature sets before modeling. Also, i don’t want to remove rows with missing values.
The important point is that imputation must be learned from the training data only. Validation and final evaluation data should receive the same imputation process, but they should not be used to learn imputation values. This will help to avoid data leakage.
PH is not used during imputation because PH is the target variable. Also, PH is not available in the final evaluation data, so using PH during imputation would not match the real prediction situation.
I will compare multiple imputation approaches and later test them with different model types. The final imputation choice will be based on validation performance, not just assumptions.
Before creating imputed feature sets, I make sure categorical variables have consistent factor levels across training, validation, and final evaluation data.
brand_levels <- train_features_raw |>
pull(brand_code) |>
as.factor() |>
levels()
train_features_raw <- train_features_raw |>
mutate(
brand_code = factor(brand_code, levels = brand_levels)
)
valid_features_raw <- valid_features_raw |>
mutate(
brand_code = factor(brand_code, levels = brand_levels)
)
eval_features_raw <- eval_features_raw |>
mutate(
brand_code = factor(brand_code, levels = brand_levels)
)
pre_imputation_missing_check <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
total_missing_values = c(
sum(is.na(train_features_raw)),
sum(is.na(valid_features_raw)),
sum(is.na(eval_features_raw))
)
)
pre_imputation_missing_check |>
kable(
caption = "Total Missing Predictor Values Before Imputation"
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | total_missing_values |
|---|---|
| Training | 657 |
| Validation | 155 |
| Final Evaluation | 107 |
This check confirms that missing predictor values are present before imputation. The next step is to create completed versions of the training, validation, and final evaluation feature sets.
get_numeric_predictors <- function(data_x) {
data_x |>
select(where(is.numeric)) |>
names()
}
impute_numeric_by_brand_tree <- function(
train_x,
valid_x,
eval_x,
brand_col = "brand_code",
trees = 25,
min_brand_rows = 20
) {
numeric_vars <- get_numeric_predictors(train_x)
train_out <- train_x
valid_out <- valid_x
eval_out <- eval_x
fallback_recipe <- recipe(~ ., data = train_x) |>
step_unknown(
all_of(brand_col),
new_level = "Unknown"
) |>
step_impute_bag(
all_of(numeric_vars),
impute_with = imp_vars(all_predictors()),
trees = trees
)
fallback_fit <- prep(
fallback_recipe,
training = train_x
)
train_fallback <- bake(
fallback_fit,
new_data = train_x
)
valid_fallback <- bake(
fallback_fit,
new_data = valid_x
)
eval_fallback <- bake(
fallback_fit,
new_data = eval_x
)
train_out[numeric_vars] <- train_fallback[numeric_vars]
valid_out[numeric_vars] <- valid_fallback[numeric_vars]
eval_out[numeric_vars] <- eval_fallback[numeric_vars]
known_brands <- train_x |>
filter(!is.na(.data[[brand_col]])) |>
distinct(.data[[brand_col]]) |>
pull(.data[[brand_col]])
for (brand in known_brands) {
train_rows <- which(train_x[[brand_col]] == brand)
valid_rows <- which(valid_x[[brand_col]] == brand)
eval_rows <- which(eval_x[[brand_col]] == brand)
brand_train_numeric <- train_x[train_rows, numeric_vars]
if (nrow(brand_train_numeric) >= min_brand_rows) {
brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
step_impute_bag(
all_numeric_predictors(),
impute_with = imp_vars(all_numeric_predictors()),
trees = trees
)
brand_fit <- prep(
brand_recipe,
training = brand_train_numeric
)
train_out[train_rows, numeric_vars] <- bake(
brand_fit,
new_data = train_x[train_rows, numeric_vars]
)
if (length(valid_rows) > 0) {
valid_out[valid_rows, numeric_vars] <- bake(
brand_fit,
new_data = valid_x[valid_rows, numeric_vars]
)
}
if (length(eval_rows) > 0) {
eval_out[eval_rows, numeric_vars] <- bake(
brand_fit,
new_data = eval_x[eval_rows, numeric_vars]
)
}
}
}
list(
train = train_out,
valid = valid_out,
eval = eval_out,
numeric_predictors = numeric_vars
)
}
train_features_for_imputation <- train_features_capped
valid_features_for_imputation <- valid_features_capped
eval_features_for_imputation <- eval_features_capped
check_numeric_missing <- function(train_x, valid_x, eval_x, numeric_vars) {
tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
numeric_missing_count = c(
sum(is.na(train_x |> select(all_of(numeric_vars)))),
sum(is.na(valid_x |> select(all_of(numeric_vars)))),
sum(is.na(eval_x |> select(all_of(numeric_vars))))
)
)
}
collect_imputed_values <- function(before_x, after_x, numeric_vars, brand_col = "brand_code") {
imputed_values <- tibble()
for (var in numeric_vars) {
missing_rows <- which(is.na(before_x[[var]]))
if (length(missing_rows) > 0) {
temp_values <- tibble(
variable = var,
row_id = missing_rows,
brand_code = before_x[[brand_col]][missing_rows],
original_value = before_x[[var]][missing_rows],
imputed_value = after_x[[var]][missing_rows]
)
imputed_values <- bind_rows(
imputed_values,
temp_values
)
}
}
imputed_values
}
check_imputed_ranges <- function(before_x, after_x, numeric_vars) {
map_dfr(
numeric_vars,
function(var) {
missing_rows <- which(is.na(before_x[[var]]))
if (length(missing_rows) == 0) {
return(NULL)
}
imputed_values <- after_x[[var]][missing_rows]
original_values <- before_x[[var]]
tibble(
variable = var,
original_min = min(original_values, na.rm = TRUE),
original_max = max(original_values, na.rm = TRUE),
min_imputed_value = min(imputed_values, na.rm = TRUE),
max_imputed_value = max(imputed_values, na.rm = TRUE),
imputed_below_original_min = any(imputed_values < original_min),
imputed_above_original_max = any(imputed_values > original_max)
)
}
)
}
predict_missing_brand_code <- function(data_x, brand_model, brand_col = "brand_code") {
completed_x <- data_x
missing_brand_rows <- which(is.na(completed_x[[brand_col]]))
if (length(missing_brand_rows) > 0) {
prediction_data <- completed_x[missing_brand_rows, ]
predicted_brand <- predict(
brand_model,
data = prediction_data
)$predictions
completed_x[[brand_col]] <- as.character(completed_x[[brand_col]])
completed_x[[brand_col]][missing_brand_rows] <- as.character(predicted_brand)
completed_x[[brand_col]] <- as.factor(completed_x[[brand_col]])
}
completed_x
}
After PH-guided brand-level capping, I wil test several imputation workflows. My goal here is not only to fill in missing values, but also to check which imputation strategy creates the best completed dataset for predictions.
The logic of this section is:
brand_code has structure
in the completed numeric space.brand_code rows are located near known brand
groups, KNN is a reasonable brand_codeimputation
method.brand_code together.This makes the imputation methods comparison practical because the final goal is not just to predict and fill missing values, but to improve PH prediction performance.
get_numeric_predictors <- function(data_x) {
data_x |>
select(where(is.numeric)) |>
names()
}
calculate_regression_metrics <- function(actual, predicted) {
tibble(
rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
mae = mean(abs(actual - predicted), na.rm = TRUE),
rsq = cor(actual, predicted, use = "complete.obs")^2
)
}
rf_tune_grid_temp <- expand.grid(
mtry = c(5, 10, 15, 20)
)
rf_control_temp <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 3,
verboseIter = FALSE
)
run_rf_validation_test <- function(
train_x,
valid_x,
train_y,
valid_y,
workflow_name,
project_seed = 123,
rf_tune_grid = rf_tune_grid_temp,
rf_control = rf_control_temp,
ntree = 500
) {
set.seed(project_seed)
rf_time <- system.time({
rf_model <- train(
x = train_x,
y = train_y,
method = "rf",
trControl = rf_control,
tuneGrid = rf_tune_grid,
metric = "RMSE",
ntree = ntree,
importance = TRUE
)
})
rf_pred <- predict(
rf_model,
newdata = valid_x
)
calculate_regression_metrics(
actual = valid_y,
predicted = rf_pred
) |>
mutate(
workflow = workflow_name,
training_rows = nrow(train_x),
validation_rows = nrow(valid_x),
best_mtry = rf_model$bestTune$mtry,
training_time_seconds = as.numeric(rf_time["elapsed"])
) |>
select(
workflow,
training_rows,
validation_rows,
best_mtry,
rmse,
mae,
rsq,
training_time_seconds
)
}
In this first workflow I will use is tree-based numeric imputation
for the numeric predictors. At this step, brand_code is
still allowed to be missing, because brand_code will be
handled separately later with KNN.
The reason I impute numeric variables first is because I will need full numeric data in order to get right with both PCA and KNN . PCA usually struggle with missing numeric values and might deliver not correct results, and KNN is at best with complete numeric predictors to calculate distance between observations in a vector space.
So the order is:
bagging-tree numeric imputation first
PCA components explorations for all known and missing brands
KNN brand_code imputation
This keeps the workflow more logical and reasonable. The following
process starts with imputation of missing numeric predictors values,
then I check associations of missing brand_code with known
brand_code groups on PCA plot, and then I use KNN to impute
brand_code.
impute_numeric_by_brand_tree <- function(
train_x,
valid_x,
eval_x,
brand_col = "brand_code",
trees = 25,
min_brand_rows = 20
) {
numeric_vars <- get_numeric_predictors(train_x)
train_out <- train_x
valid_out <- valid_x
eval_out <- eval_x
fallback_recipe <- recipe(~ ., data = train_x) |>
step_unknown(
all_of(brand_col),
new_level = "Unknown"
) |>
step_impute_bag(
all_of(numeric_vars),
impute_with = imp_vars(all_predictors()),
trees = trees
)
fallback_fit <- prep(
fallback_recipe,
training = train_x
)
train_fallback <- bake(
fallback_fit,
new_data = train_x
)
valid_fallback <- bake(
fallback_fit,
new_data = valid_x
)
eval_fallback <- bake(
fallback_fit,
new_data = eval_x
)
train_out[numeric_vars] <- train_fallback[numeric_vars]
valid_out[numeric_vars] <- valid_fallback[numeric_vars]
eval_out[numeric_vars] <- eval_fallback[numeric_vars]
known_brands <- train_x |>
filter(!is.na(.data[[brand_col]])) |>
distinct(.data[[brand_col]]) |>
pull(.data[[brand_col]])
for (brand in known_brands) {
train_rows <- which(train_x[[brand_col]] == brand)
valid_rows <- which(valid_x[[brand_col]] == brand)
eval_rows <- which(eval_x[[brand_col]] == brand)
brand_train_numeric <- train_x[train_rows, numeric_vars]
if (nrow(brand_train_numeric) >= min_brand_rows) {
brand_recipe <- recipe(~ ., data = brand_train_numeric) |>
step_impute_bag(
all_numeric_predictors(),
impute_with = imp_vars(all_numeric_predictors()),
trees = trees
)
brand_fit <- prep(
brand_recipe,
training = brand_train_numeric
)
train_out[train_rows, numeric_vars] <- bake(
brand_fit,
new_data = train_x[train_rows, numeric_vars]
)
if (length(valid_rows) > 0) {
valid_out[valid_rows, numeric_vars] <- bake(
brand_fit,
new_data = valid_x[valid_rows, numeric_vars]
)
}
if (length(eval_rows) > 0) {
eval_out[eval_rows, numeric_vars] <- bake(
brand_fit,
new_data = eval_x[eval_rows, numeric_vars]
)
}
}
}
list(
train = train_out,
valid = valid_out,
eval = eval_out,
numeric_predictors = numeric_vars
)
}
brand_tree_numeric_imputed <- impute_numeric_by_brand_tree(
train_x = train_features_for_imputation,
valid_x = valid_features_for_imputation,
eval_x = eval_features_for_imputation,
brand_col = "brand_code",
trees = 25,
min_brand_rows = 20
)
train_features_brand_tree_numeric <- brand_tree_numeric_imputed$train
valid_features_brand_tree_numeric <- brand_tree_numeric_imputed$valid
eval_features_brand_tree_numeric <- brand_tree_numeric_imputed$eval
tree_numeric_missing_check <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
numeric_missing_values = c(
train_features_brand_tree_numeric |>
select(where(is.numeric)) |>
is.na() |>
sum(),
valid_features_brand_tree_numeric |>
select(where(is.numeric)) |>
is.na() |>
sum(),
eval_features_brand_tree_numeric |>
select(where(is.numeric)) |>
is.na() |>
sum()
)
)
tree_numeric_missing_check |>
kable(
caption = "Missing Numeric Values After Tree-Based Numeric Imputation",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | numeric_missing_values |
|---|---|
| Training | 0 |
| Validation | 0 |
| Final Evaluation | 0 |
As we see, we do not have missing numeric variables after bagging tree imputation.
Now we can take a look at clucters of all brand_code,
and see how rows with missing brand_code are spreaded
between those clusters.
Here I will use PCA to visually check if brand_code
groups have clusters structure in the completed numeric process-variable
space.
Here PCA is not used as the final model. It is just a diagnostic
visualization. My goal is to see whether rows with missing
brand_code are located near known brand_code
groups. If missing brand_code rows are near known brand
clusters, then KNN will be a reasonable imputation method since it
assigns values for missing brand_code using nearby neighbor
observations.
The PCA is created using previous bagging tree imputations of missing numeric predictors, because PCA cannot handle missing numeric values directly.
pca_numeric_vars <- get_numeric_predictors(train_features_brand_tree_numeric)
pca_source_data <- bind_rows(
train_features_brand_tree_numeric |>
mutate(dataset = "Training"),
valid_features_brand_tree_numeric |>
mutate(dataset = "Validation"),
eval_features_brand_tree_numeric |>
mutate(dataset = "Final Evaluation")
) |>
mutate(
brand_code_plot = ifelse(
is.na(brand_code),
"Missing Brand",
as.character(brand_code)
),
brand_code_plot = factor(
brand_code_plot,
levels = c("A", "B", "C", "D", "Missing Brand")
)
)
pca_input <- pca_source_data |>
select(all_of(pca_numeric_vars))
pca_fit <- prcomp(
pca_input,
center = TRUE,
scale. = TRUE
)
pca_scores <- pca_source_data |>
select(dataset, brand_code, brand_code_plot) |>
bind_cols(
as_tibble(pca_fit$x[, 1:2])
)
pca_variance <- summary(pca_fit)$importance[2, 1:2] * 100
brand_colors <- c(
"A" = "#1f77b4",
"B" = "#ff7f0e",
"C" = "#2ca02c",
"D" = "#9467bd",
"Missing Brand" = "#d62728"
)
brand_shapes <- c(
"A" = 16,
"B" = 17,
"C" = 15,
"D" = 18,
"Missing Brand" = 8
)
ggplot(
pca_scores,
aes(
x = PC1,
y = PC2,
color = brand_code_plot,
shape = brand_code_plot
)
) +
geom_point(
alpha = 0.35,
size = 2
) +
geom_point(
data = pca_scores |>
filter(brand_code_plot == "Missing Brand"),
aes(
x = PC1,
y = PC2,
color = brand_code_plot,
shape = brand_code_plot
),
size = 4,
alpha = 0.95
) +
scale_color_manual(
values = brand_colors
) +
scale_shape_manual(
values = brand_shapes
) +
guides(
color = guide_legend(
override.aes = list(
size = 5,
alpha = 1
)
),
shape = guide_legend(
override.aes = list(
size = 5,
alpha = 1
)
)
) +
labs(
title = "PCA Plot of Process Variables by Brand Code",
subtitle = "Rows with missing Brand Code are highlighted with red star markers",
x = paste0("Principal Component 1 (", round(pca_variance[1], 1), "% variance)"),
y = paste0("Principal Component 2 (", round(pca_variance[2], 1), "% variance)"),
color = "Brand Code",
shape = "Brand Code"
) +
coord_equal() +
theme_classic(base_size = 14) +
theme(
legend.position = "right",
legend.title = element_text(size = 15, face = "bold"),
legend.text = element_text(size = 14),
legend.key.size = unit(1.1, "cm"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 13),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
The PCA plot shows that the known brand_code groups are
not randomly mixed. Brand A and Brand D form clearer separated regions,
while Brand B and Brand C overlap more on the left side of the plot.
At the same time, most of the observations with missing
brand_code are located inside or very close to the Brand B
and Brand C regions. This is strong suggestion that rows with missing
brand_code are not just isolated unusual observations. At
PCA plot, great majority of them look very similar to existing Brands B
and C.
Since the missing rows are clustered near known
brand_code groups, KNN distance-based imputationis is my
preffered approach in this case.
In this step, I will apply KNN only to impute missing
brand_code values. The numeric predictors are already
imputed before this step.
Before applying KNN approach, I center and scale the numeric predictors from the training data. Otherwise without scaling, variables with larger values can dominate the distance calculations and make the KNN result less reliable.
After scaling, KNN will look for the closest rows with known
brand_code and use those nearby rows to predict the missing
brand_code.
In the following code validate_knn_brand() function
tests several K values for KNN using rows where brand_code
is already known. It splits the known-brand rows into training and
validation parts, centers and scales the numeric predictors, runs KNN,
and returns the accuracy table, best K value, and confusion table.
The impute_brand_knn() function then uses the best K value
to impute missing Brand Code in the training, validation, and final
evaluation sets.
validate_knn_brand <- function(
train_x,
brand_col = "brand_code",
k_values = c(3, 5, 7, 9, 11),
project_seed = 123
) {
numeric_vars <- get_numeric_predictors(train_x)
known_data <- train_x |>
filter(!is.na(.data[[brand_col]]))
known_data[[brand_col]] <- as.factor(known_data[[brand_col]])
set.seed(project_seed)
validation_index <- createDataPartition(
known_data[[brand_col]],
p = 0.80,
list = FALSE
)
brand_train <- known_data[validation_index, ]
brand_valid <- known_data[-validation_index, ]
train_numeric <- brand_train |>
select(all_of(numeric_vars))
valid_numeric <- brand_valid |>
select(all_of(numeric_vars))
train_center <- map_dbl(train_numeric, mean, na.rm = TRUE)
train_scale <- map_dbl(train_numeric, sd, na.rm = TRUE)
train_scale[train_scale == 0 | is.na(train_scale)] <- 1
scale_with_train <- function(data_numeric) {
sweep(
sweep(
as.matrix(data_numeric),
2,
train_center,
"-"
),
2,
train_scale,
"/"
)
}
train_scaled <- scale_with_train(train_numeric)
valid_scaled <- scale_with_train(valid_numeric)
accuracy_results <- map_dfr(
k_values,
function(k_value) {
pred_brand <- class::knn(
train = train_scaled,
test = valid_scaled,
cl = brand_train[[brand_col]],
k = k_value
)
tibble(
k = k_value,
accuracy = mean(pred_brand == brand_valid[[brand_col]]),
test_rows = nrow(brand_valid)
)
}
)
best_k <- accuracy_results |>
arrange(desc(accuracy), k) |>
slice(1) |>
pull(k)
best_pred <- class::knn(
train = train_scaled,
test = valid_scaled,
cl = brand_train[[brand_col]],
k = best_k
)
confusion_table <- tibble(
actual_brand = brand_valid[[brand_col]],
predicted_brand = best_pred
) |>
count(actual_brand, predicted_brand, name = "n")
list(
accuracy_results = accuracy_results,
best_k = best_k,
confusion_table = confusion_table
)
}
impute_brand_knn <- function(
train_x,
valid_x,
eval_x,
brand_col = "brand_code",
k = 5
) {
numeric_vars <- get_numeric_predictors(train_x)
train_out <- train_x
valid_out <- valid_x
eval_out <- eval_x
known_train_rows <- which(!is.na(train_x[[brand_col]]))
known_train_x <- train_x[known_train_rows, ]
known_train_y <- as.factor(known_train_x[[brand_col]])
brand_levels <- levels(known_train_y)
train_numeric_known <- known_train_x |>
select(all_of(numeric_vars))
train_center <- map_dbl(train_numeric_known, mean, na.rm = TRUE)
train_scale <- map_dbl(train_numeric_known, sd, na.rm = TRUE)
train_scale[train_scale == 0 | is.na(train_scale)] <- 1
scale_with_train <- function(data_x) {
data_numeric <- data_x |>
select(all_of(numeric_vars))
sweep(
sweep(
as.matrix(data_numeric),
2,
train_center,
"-"
),
2,
train_scale,
"/"
)
}
train_scaled_all <- scale_with_train(train_x)
valid_scaled_all <- scale_with_train(valid_x)
eval_scaled_all <- scale_with_train(eval_x)
train_scaled_known <- train_scaled_all[known_train_rows, , drop = FALSE]
impute_one_dataset <- function(data_out, scaled_all) {
missing_rows <- which(is.na(data_out[[brand_col]]))
if (length(missing_rows) > 0) {
predicted_brand <- class::knn(
train = train_scaled_known,
test = scaled_all[missing_rows, , drop = FALSE],
cl = known_train_y,
k = k
)
data_out[[brand_col]][missing_rows] <- as.character(predicted_brand)
}
data_out[[brand_col]] <- factor(
data_out[[brand_col]],
levels = brand_levels
)
data_out
}
train_out <- impute_one_dataset(train_out, train_scaled_all)
valid_out <- impute_one_dataset(valid_out, valid_scaled_all)
eval_out <- impute_one_dataset(eval_out, eval_scaled_all)
list(
train = train_out,
valid = valid_out,
eval = eval_out
)
}
knn_brand_validation <- validate_knn_brand(
train_x = train_features_brand_tree_numeric,
brand_col = "brand_code",
k_values = c(3, 5, 7, 9, 11),
project_seed = project_seed
)
knn_brand_validation$accuracy_results |>
kable(
caption = "KNN Brand Code Validation Accuracy by K",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| k | accuracy | test_rows |
|---|---|---|
| 3 | 0.8645 | 391 |
| 5 | 0.8491 | 391 |
| 7 | 0.8593 | 391 |
| 9 | 0.8542 | 391 |
| 11 | 0.8389 | 391 |
best_knn_k <- knn_brand_validation$best_k
The KNN Brand Code validation results show that k = 3
gave the best accuracy.
The accuracy values were:
k = 3: 0.8645k = 5: 0.8491k = 7: 0.8593k = 9: 0.8542k = 11: 0.8389This means the model predicted the correct brand_code
about 86.45% of the time when k = 3, and I will use it for
imputing missing brand_code values. A smaller
k lets the imputation depend more on the closest nearby
observations instead of averaging across too many neighbors. In my
opinion this makes sense because the PCA plot showed that missing
brand_code rows were located near known brand groups
knn_confusion_wide <- knn_brand_validation$confusion_table |>
pivot_wider(
names_from = predicted_brand,
values_from = n,
values_fill = 0
) |>
arrange(actual_brand)
knn_confusion_wide |>
kable(
caption = "KNN Brand Code Validation Confusion Matrix",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| actual_brand | A | B | C | D |
|---|---|---|---|---|
| A | 36 | 2 | 1 | 6 |
| B | 1 | 185 | 10 | 0 |
| C | 0 | 31 | 20 | 0 |
| D | 3 | 0 | 0 | 96 |
The confusion matrix makes the KNN brand_code
predictions validation easier to understand.
The diagonal values are the correct predictions:
KNN predicted Brand B and Brand D especially well. Brand B had 185 correct predictions, with only 10 rows predicted as C and 1 row predicted as A. Brand D also performed well, with 96 correct predictions and only 3 rows predicted as A.
Brand C was the most difficult group. Out of actual Brand C rows, 20 were correctly predicted as C, but 31 were predicted as Brand B. This errors can be explained by the earlier PCA plot, where Brand B and Brand C looked more overlapped in the process-variable space, therefore KNN model struggled to correctly make predictions for Brand C.
Brand A was mostly predicted correctly, but there was some confusion with Brand D.
Overall, the confusion matrix supports using KNN to impute missing
brand_code values. It performs well overall, but it also
shows that Brand C is harder to separate from Brand B.
And now I will create first imputed dataset using bagged-tree model to impute missing numeric values, and KNN model to impute missing brand_code values
tree_numeric_knn_brand <- impute_brand_knn(
train_x = train_features_brand_tree_numeric,
valid_x = valid_features_brand_tree_numeric,
eval_x = eval_features_brand_tree_numeric,
brand_col = "brand_code",
k = best_knn_k
)
train_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$train
valid_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$valid
eval_features_tree_numeric_knn_brand <- tree_numeric_knn_brand$eval
# Keep old object names too, so later code that already uses them does not break.
train_features_capped_knn_brand <- train_features_tree_numeric_knn_brand
valid_features_capped_knn_brand <- valid_features_tree_numeric_knn_brand
eval_features_capped_knn_brand <- eval_features_tree_numeric_knn_brand
tree_numeric_knn_missing_check <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
total_missing_values = c(
sum(is.na(train_features_tree_numeric_knn_brand)),
sum(is.na(valid_features_tree_numeric_knn_brand)),
sum(is.na(eval_features_tree_numeric_knn_brand))
)
)
tree_numeric_knn_missing_check |>
kable(
caption = "Missing Values After Tree Numeric + KNN Brand Imputation",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | total_missing_values |
|---|---|
| Training | 0 |
| Validation | 0 |
| Final Evaluation | 0 |
rf_capped_knn_results <- run_rf_validation_test(
train_x = train_features_tree_numeric_knn_brand,
valid_x = valid_features_tree_numeric_knn_brand,
train_y = train_ph,
valid_y = valid_ph,
workflow_name = "Tree numeric + KNN brand",
project_seed = project_seed
)
rf_capped_knn_results |>
kable(
caption = "Tree Numeric + KNN Brand Random Forest Validation Performance",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | training_rows | validation_rows | best_mtry | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|
| Tree numeric + KNN brand | 2052 | 515 | 20 | 0.0962 | 0.0687 | 0.698 | 205.3 |
summarize_imputed_brand_counts <- function(
original_train_x,
original_valid_x,
original_eval_x,
imputed_train_x,
imputed_valid_x,
imputed_eval_x,
brand_col = "brand_code"
) {
bind_rows(
imputed_train_x |>
filter(is.na(original_train_x[[brand_col]])) |>
count(.data[[brand_col]], name = "n") |>
mutate(dataset = "Training"),
imputed_valid_x |>
filter(is.na(original_valid_x[[brand_col]])) |>
count(.data[[brand_col]], name = "n") |>
mutate(dataset = "Validation"),
imputed_eval_x |>
filter(is.na(original_eval_x[[brand_col]])) |>
count(.data[[brand_col]], name = "n") |>
mutate(dataset = "Final Evaluation")
) |>
rename(imputed_brand_code = .data[[brand_col]]) |>
select(dataset, imputed_brand_code, n) |>
pivot_wider(
names_from = imputed_brand_code,
values_from = n,
values_fill = 0
) |>
mutate(
Total = rowSums(across(where(is.numeric)))
)
}
tree_numeric_knn_brand_counts <- summarize_imputed_brand_counts(
original_train_x = train_features_brand_tree_numeric,
original_valid_x = valid_features_brand_tree_numeric,
original_eval_x = eval_features_brand_tree_numeric,
imputed_train_x = train_features_tree_numeric_knn_brand,
imputed_valid_x = valid_features_tree_numeric_knn_brand,
imputed_eval_x = eval_features_tree_numeric_knn_brand,
brand_col = "brand_code"
)
tree_numeric_knn_brand_counts |>
kable(
caption = "Imputed Brand Code Counts: Tree Numeric + KNN Brand",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | A | B | C | Total |
|---|---|---|---|---|
| Training | 13 | 65 | 16 | 94 |
| Validation | 3 | 21 | 2 | 26 |
| Final Evaluation | 0 | 5 | 3 | 8 |
This table shows only the rows where brand_code was
originally missing. It does not show the full brand_code
distribution.
For the tree-based numeric imputation workflow, KNN assigned most
missing Brand Code values to Brand B and Brand C.
In the training data, 65 missing rows were imputed as Brand B
and 16 were imputed as Brand C. Some rows were also assigned to
Brand A (13), but no missing rows were assigned to Brand D.
This matches the PCA plot reasonably well because most missing
brand_code rows were located inside or close to the Brand B
and Brand C clusters. The KNN result is therefore not just assigning
everything to the largest brand randomly. It is using nearby rows in the
completed numeric process-variable space.
Overall, this matches the PCA plot because most missing Brand Code rows were located near the Brand B and Brand C clusters.
I mentioned earlier that I will test few imputations techniques. So here, I will test Random Forest model as another method for imputing missing numeric predictors.
The idea is that each numeric variable with missing values can be
predicted from the other available process variables.
Random Forest can be useful here because it can capture nonlinear
relationships and interactions between variables, which may exist in the
production process. And it is possible it gives better results compared
to bagged trees imputation method completed previously.
This process follows these steps:
brand_code missing during the numeric
imputation step.brand_code.Missing values of brand_code will be handled separately
with KNN, since PCA plot showed earlier that missing
brand_code rows are mostly located near Brand B
and Brand C clusters. So the same KNN brand_code
imputation step is applied after the numeric predictors are
completed.
impute_numeric_by_random_forest <- function(
train_x,
valid_x,
eval_x,
brand_col = "brand_code",
ntree = 250,
min_obs = 30,
project_seed = 123
) {
numeric_vars <- get_numeric_predictors(train_x)
train_out <- train_x
valid_out <- valid_x
eval_out <- eval_x
train_medians <- map_dbl(
numeric_vars,
function(var) {
med_value <- median(train_x[[var]], na.rm = TRUE)
if (is.nan(med_value)) {
med_value <- 0
}
med_value
}
)
names(train_medians) <- numeric_vars
fill_numeric_with_train_medians <- function(data_x) {
data_filled <- data_x
for (var in numeric_vars) {
data_filled[[var]][is.na(data_filled[[var]])] <- train_medians[[var]]
}
data_filled
}
train_base <- fill_numeric_with_train_medians(train_x)
valid_base <- fill_numeric_with_train_medians(valid_x)
eval_base <- fill_numeric_with_train_medians(eval_x)
train_out[numeric_vars] <- train_base[numeric_vars]
valid_out[numeric_vars] <- valid_base[numeric_vars]
eval_out[numeric_vars] <- eval_base[numeric_vars]
brand_levels <- train_x[[brand_col]] |>
as.character() |>
na.omit() |>
unique()
brand_levels <- c(sort(brand_levels), "Unknown")
prepare_brand_for_dummy <- function(data_x) {
data_x |>
mutate(
"{brand_col}" := ifelse(
is.na(.data[[brand_col]]),
"Unknown",
as.character(.data[[brand_col]])
),
"{brand_col}" := factor(
.data[[brand_col]],
levels = brand_levels
)
)
}
train_brand_dummy_data <- prepare_brand_for_dummy(train_x)
brand_dummy_fit <- caret::dummyVars(
formula = as.formula(paste("~", brand_col)),
data = train_brand_dummy_data,
fullRank = FALSE
)
make_rf_predictor_matrix <- function(data_filled, original_data, target_var) {
numeric_predictors <- setdiff(numeric_vars, target_var)
brand_dummy_data <- prepare_brand_for_dummy(original_data)
brand_dummy_matrix <- predict(
brand_dummy_fit,
newdata = brand_dummy_data
) |>
as.data.frame()
bind_cols(
data_filled[numeric_predictors],
brand_dummy_matrix
)
}
vars_to_impute <- numeric_vars[
map_lgl(
numeric_vars,
function(var) {
any(is.na(train_x[[var]])) ||
any(is.na(valid_x[[var]])) ||
any(is.na(eval_x[[var]]))
}
)
]
imputation_log <- list()
for (var in vars_to_impute) {
observed_train_rows <- which(!is.na(train_x[[var]]))
if (length(observed_train_rows) < min_obs) {
imputation_log[[var]] <- tibble(
variable = var,
method = "median fallback",
observed_training_rows = length(observed_train_rows),
train_missing_imputed = sum(is.na(train_x[[var]])),
valid_missing_imputed = sum(is.na(valid_x[[var]])),
eval_missing_imputed = sum(is.na(eval_x[[var]]))
)
next
}
x_train_matrix <- make_rf_predictor_matrix(
data_filled = train_base,
original_data = train_x,
target_var = var
)
x_valid_matrix <- make_rf_predictor_matrix(
data_filled = valid_base,
original_data = valid_x,
target_var = var
)
x_eval_matrix <- make_rf_predictor_matrix(
data_filled = eval_base,
original_data = eval_x,
target_var = var
)
set.seed(project_seed)
rf_imp_model <- randomForest::randomForest(
x = x_train_matrix[observed_train_rows, ],
y = train_x[[var]][observed_train_rows],
ntree = ntree
)
train_missing_rows <- which(is.na(train_x[[var]]))
valid_missing_rows <- which(is.na(valid_x[[var]]))
eval_missing_rows <- which(is.na(eval_x[[var]]))
if (length(train_missing_rows) > 0) {
train_out[[var]][train_missing_rows] <- predict(
rf_imp_model,
newdata = x_train_matrix[train_missing_rows, ]
)
}
if (length(valid_missing_rows) > 0) {
valid_out[[var]][valid_missing_rows] <- predict(
rf_imp_model,
newdata = x_valid_matrix[valid_missing_rows, ]
)
}
if (length(eval_missing_rows) > 0) {
eval_out[[var]][eval_missing_rows] <- predict(
rf_imp_model,
newdata = x_eval_matrix[eval_missing_rows, ]
)
}
imputation_log[[var]] <- tibble(
variable = var,
method = "random forest",
observed_training_rows = length(observed_train_rows),
train_missing_imputed = length(train_missing_rows),
valid_missing_imputed = length(valid_missing_rows),
eval_missing_imputed = length(eval_missing_rows)
)
}
list(
train = train_out,
valid = valid_out,
eval = eval_out,
numeric_predictors = numeric_vars,
imputation_log = bind_rows(imputation_log)
)
}
rf_numeric_imputed_temp <- impute_numeric_by_random_forest(
train_x = train_features_for_imputation,
valid_x = valid_features_for_imputation,
eval_x = eval_features_for_imputation,
brand_col = "brand_code",
ntree = 250,
min_obs = 30,
project_seed = project_seed
)
train_features_rf_numeric_temp <- rf_numeric_imputed_temp$train
valid_features_rf_numeric_temp <- rf_numeric_imputed_temp$valid
eval_features_rf_numeric_temp <- rf_numeric_imputed_temp$eval
rf_numeric_imputed_temp$imputation_log |>
arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
kable(
caption = "Random Forest Numeric Imputation Log",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)|>
scroll_box(
width = "100%",
height = "500px"
)
| variable | method | observed_training_rows | train_missing_imputed | valid_missing_imputed | eval_missing_imputed |
|---|---|---|---|---|---|
| mfr | random forest | 1879 | 173 | 35 | 31 |
| filler_speed | random forest | 2009 | 43 | 11 | 10 |
| fill_ounces | random forest | 2022 | 30 | 8 | 6 |
| psc_co2 | random forest | 2016 | 36 | 3 | 5 |
| pc_volume | random forest | 2020 | 32 | 7 | 4 |
| psc | random forest | 2025 | 27 | 6 | 5 |
| carb_pressure1 | random forest | 2027 | 25 | 7 | 4 |
| hyd_pressure4 | random forest | 2028 | 24 | 4 | 4 |
| carb_pressure | random forest | 2031 | 21 | 6 | 0 |
| carb_temp | random forest | 2035 | 17 | 9 | 1 |
| psc_fill | random forest | 2031 | 21 | 2 | 3 |
| fill_pressure | random forest | 2039 | 13 | 5 | 2 |
| filler_level | random forest | 2038 | 14 | 2 | 2 |
| hyd_pressure2 | random forest | 2041 | 11 | 4 | 1 |
| hyd_pressure3 | random forest | 2041 | 11 | 4 | 1 |
| temperature | random forest | 2042 | 10 | 2 | 2 |
| oxygen_filler | random forest | 2043 | 9 | 2 | 3 |
| pressure_setpoint | random forest | 2043 | 9 | 3 | 2 |
| carb_volume | random forest | 2044 | 8 | 2 | 1 |
| hyd_pressure1 | random forest | 2044 | 8 | 3 | 0 |
| alch_rel | random forest | 2046 | 6 | 1 | 3 |
| carb_rel | random forest | 2044 | 8 | 0 | 2 |
| usage_cont | random forest | 2048 | 4 | 1 | 2 |
| bowl_setpoint | random forest | 2052 | 0 | 2 | 1 |
| carb_flow | random forest | 2050 | 2 | 0 | 0 |
| density | random forest | 2052 | 0 | 0 | 1 |
| balling | random forest | 2052 | 0 | 0 | 1 |
| pressure_vacuum | random forest | 2052 | 0 | 0 | 1 |
| air_pressurer | random forest | 2052 | 0 | 0 | 1 |
| balling_lvl | random forest | 2051 | 1 | 0 | 0 |
rf_numeric_knn_brand_temp <- impute_brand_knn(
train_x = train_features_rf_numeric_temp,
valid_x = valid_features_rf_numeric_temp,
eval_x = eval_features_rf_numeric_temp,
brand_col = "brand_code",
k = best_knn_k
)
train_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$train
valid_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$valid
eval_features_rf_numeric_knn_brand_temp <- rf_numeric_knn_brand_temp$eval
rf_numeric_knn_results_temp <- run_rf_validation_test(
train_x = train_features_rf_numeric_knn_brand_temp,
valid_x = valid_features_rf_numeric_knn_brand_temp,
train_y = train_ph,
valid_y = valid_ph,
workflow_name = "RF numeric + KNN brand",
project_seed = project_seed
)
# rf_numeric_knn_results_temp |>
# kable(
# caption = "RF Numeric + KNN Brand Random Forest Validation Performance",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
Now we have all missing values imputed, and let’s try another imputation technique.
Here I will use XGBoost model as another method for predicting and imputing missing numeric predictors.
The idea is pretty much same as previous Bagging Trees and Random Forest numeric imputation workflows. Each numeric variable with missing values is treated as a temporary prediction problem, where the other process variables are used as predictors. XGBoost can be useful here because it can model nonlinear patterns and interactions, and it often performs well on structured tabular data.
The steps are:
brand_code missing during the numeric
imputation step.brand_code.Missing values of brand_code are still handled
separately with KNN model. So after XGBoost completes the numeric
predictors, KNN is used to assign brand_code based on
nearby observations just like in previous methods - centering and
scaling applied in KNN training process.
I’m using this imputations technique to see if XGBoost numeric imputation creates a better completed dataset than the Bagging Tree based and Random Forest numeric imputation methods.
impute_numeric_by_xgboost <- function(
train_x,
valid_x,
eval_x,
brand_col = "brand_code",
nrounds = 100,
max_depth = 3,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
min_obs = 30,
project_seed = 123
) {
numeric_vars <- get_numeric_predictors(train_x)
train_out <- train_x
valid_out <- valid_x
eval_out <- eval_x
train_medians <- map_dbl(
numeric_vars,
function(var) {
med_value <- median(train_x[[var]], na.rm = TRUE)
if (is.nan(med_value)) {
med_value <- 0
}
med_value
}
)
names(train_medians) <- numeric_vars
for (var in numeric_vars) {
train_out[[var]][is.na(train_out[[var]])] <- train_medians[[var]]
valid_out[[var]][is.na(valid_out[[var]])] <- train_medians[[var]]
eval_out[[var]][is.na(eval_out[[var]])] <- train_medians[[var]]
}
brand_levels <- train_x[[brand_col]] |>
as.character() |>
na.omit() |>
unique()
brand_levels <- c(sort(brand_levels), "Unknown")
prepare_brand_for_dummy <- function(data_x) {
data_x |>
mutate(
"{brand_col}" := ifelse(
is.na(.data[[brand_col]]),
"Unknown",
as.character(.data[[brand_col]])
),
"{brand_col}" := factor(
.data[[brand_col]],
levels = brand_levels
)
)
}
brand_dummy_fit <- caret::dummyVars(
formula = as.formula(paste("~", brand_col)),
data = prepare_brand_for_dummy(train_x),
fullRank = FALSE
)
make_xgb_matrix <- function(data_x, target_var) {
numeric_predictors <- setdiff(numeric_vars, target_var)
brand_dummy_matrix <- predict(
brand_dummy_fit,
newdata = prepare_brand_for_dummy(data_x)
) |>
as.data.frame()
predictor_data <- bind_cols(
data_x[numeric_predictors],
brand_dummy_matrix
)
as.matrix(predictor_data)
}
vars_to_impute <- numeric_vars[
map_lgl(
numeric_vars,
function(var) {
any(is.na(train_x[[var]])) ||
any(is.na(valid_x[[var]])) ||
any(is.na(eval_x[[var]]))
}
)
]
imputation_log <- list()
for (var in vars_to_impute) {
observed_train_rows <- which(!is.na(train_x[[var]]))
train_missing_rows <- which(is.na(train_x[[var]]))
valid_missing_rows <- which(is.na(valid_x[[var]]))
eval_missing_rows <- which(is.na(eval_x[[var]]))
if (length(observed_train_rows) < min_obs) {
imputation_log[[var]] <- tibble(
variable = var,
method = "median fallback",
observed_training_rows = length(observed_train_rows),
train_missing_imputed = length(train_missing_rows),
valid_missing_imputed = length(valid_missing_rows),
eval_missing_imputed = length(eval_missing_rows)
)
next
}
x_train_matrix <- make_xgb_matrix(train_x, target_var = var)
x_valid_matrix <- make_xgb_matrix(valid_x, target_var = var)
x_eval_matrix <- make_xgb_matrix(eval_x, target_var = var)
dtrain <- xgboost::xgb.DMatrix(
data = x_train_matrix[observed_train_rows, , drop = FALSE],
label = train_x[[var]][observed_train_rows],
missing = NA
)
params <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
max_depth = max_depth,
eta = eta,
subsample = subsample,
colsample_bytree = colsample_bytree
)
set.seed(project_seed)
xgb_imp_model <- xgboost::xgb.train(
params = params,
data = dtrain,
nrounds = nrounds,
verbose = 0
)
if (length(train_missing_rows) > 0) {
train_out[[var]][train_missing_rows] <- predict(
xgb_imp_model,
newdata = xgboost::xgb.DMatrix(
x_train_matrix[train_missing_rows, , drop = FALSE],
missing = NA
)
)
}
if (length(valid_missing_rows) > 0) {
valid_out[[var]][valid_missing_rows] <- predict(
xgb_imp_model,
newdata = xgboost::xgb.DMatrix(
x_valid_matrix[valid_missing_rows, , drop = FALSE],
missing = NA
)
)
}
if (length(eval_missing_rows) > 0) {
eval_out[[var]][eval_missing_rows] <- predict(
xgb_imp_model,
newdata = xgboost::xgb.DMatrix(
x_eval_matrix[eval_missing_rows, , drop = FALSE],
missing = NA
)
)
}
imputation_log[[var]] <- tibble(
variable = var,
method = "xgboost",
observed_training_rows = length(observed_train_rows),
train_missing_imputed = length(train_missing_rows),
valid_missing_imputed = length(valid_missing_rows),
eval_missing_imputed = length(eval_missing_rows)
)
}
list(
train = train_out,
valid = valid_out,
eval = eval_out,
numeric_predictors = numeric_vars,
imputation_log = bind_rows(imputation_log)
)
}
xgb_numeric_imputed_temp <- impute_numeric_by_xgboost(
train_x = train_features_for_imputation,
valid_x = valid_features_for_imputation,
eval_x = eval_features_for_imputation,
brand_col = "brand_code",
nrounds = 100,
max_depth = 3,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
min_obs = 30,
project_seed = project_seed
)
train_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$train
valid_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$valid
eval_features_xgb_numeric_temp <- xgb_numeric_imputed_temp$eval
xgb_numeric_imputed_temp$imputation_log |>
arrange(desc(train_missing_imputed + valid_missing_imputed + eval_missing_imputed)) |>
kable(
caption = "XGBoost Numeric Imputation Log",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)|>
scroll_box(
width = "100%",
height = "500px"
)
| variable | method | observed_training_rows | train_missing_imputed | valid_missing_imputed | eval_missing_imputed |
|---|---|---|---|---|---|
| mfr | xgboost | 1879 | 173 | 35 | 31 |
| filler_speed | xgboost | 2009 | 43 | 11 | 10 |
| fill_ounces | xgboost | 2022 | 30 | 8 | 6 |
| psc_co2 | xgboost | 2016 | 36 | 3 | 5 |
| pc_volume | xgboost | 2020 | 32 | 7 | 4 |
| psc | xgboost | 2025 | 27 | 6 | 5 |
| carb_pressure1 | xgboost | 2027 | 25 | 7 | 4 |
| hyd_pressure4 | xgboost | 2028 | 24 | 4 | 4 |
| carb_pressure | xgboost | 2031 | 21 | 6 | 0 |
| carb_temp | xgboost | 2035 | 17 | 9 | 1 |
| psc_fill | xgboost | 2031 | 21 | 2 | 3 |
| fill_pressure | xgboost | 2039 | 13 | 5 | 2 |
| filler_level | xgboost | 2038 | 14 | 2 | 2 |
| hyd_pressure2 | xgboost | 2041 | 11 | 4 | 1 |
| hyd_pressure3 | xgboost | 2041 | 11 | 4 | 1 |
| temperature | xgboost | 2042 | 10 | 2 | 2 |
| oxygen_filler | xgboost | 2043 | 9 | 2 | 3 |
| pressure_setpoint | xgboost | 2043 | 9 | 3 | 2 |
| carb_volume | xgboost | 2044 | 8 | 2 | 1 |
| hyd_pressure1 | xgboost | 2044 | 8 | 3 | 0 |
| alch_rel | xgboost | 2046 | 6 | 1 | 3 |
| carb_rel | xgboost | 2044 | 8 | 0 | 2 |
| usage_cont | xgboost | 2048 | 4 | 1 | 2 |
| bowl_setpoint | xgboost | 2052 | 0 | 2 | 1 |
| carb_flow | xgboost | 2050 | 2 | 0 | 0 |
| density | xgboost | 2052 | 0 | 0 | 1 |
| balling | xgboost | 2052 | 0 | 0 | 1 |
| pressure_vacuum | xgboost | 2052 | 0 | 0 | 1 |
| air_pressurer | xgboost | 2052 | 0 | 0 | 1 |
| balling_lvl | xgboost | 2051 | 1 | 0 | 0 |
The XGBoost numeric imputation log shows which variables had missing values filled in the training, validation, and final evaluation datasets.
Each variable with missing numeric values was imputed using an XGBoost model trained from the available training data. PH was not used during this process, so the imputation step does not create target leakage.
The largest number of missing values was for mfr,
followed by variables such as filler_speed,
fill_ounces, psc_co2, and
pc_volume.
I included this table as a check that the imputation process was applied consistently across all three datasets before modeling.
xgb_numeric_knn_brand_temp <- impute_brand_knn(
train_x = train_features_xgb_numeric_temp,
valid_x = valid_features_xgb_numeric_temp,
eval_x = eval_features_xgb_numeric_temp,
brand_col = "brand_code",
k = best_knn_k
)
train_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$train
valid_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$valid
eval_features_xgb_numeric_knn_brand_temp <- xgb_numeric_knn_brand_temp$eval
xgb_numeric_knn_rf_results_temp <- run_rf_validation_test(
train_x = train_features_xgb_numeric_knn_brand_temp,
valid_x = valid_features_xgb_numeric_knn_brand_temp,
train_y = train_ph,
valid_y = valid_ph,
workflow_name = "XGBoost numeric + KNN brand",
project_seed = project_seed
)
# xgb_numeric_knn_rf_results_temp |>
# kable(
# caption = "XGBoost Numeric + KNN Brand Random Forest Validation Performance",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
xgb_knn_imputed_brand_counts <- bind_rows(
train_features_xgb_numeric_knn_brand_temp |>
filter(is.na(train_features_xgb_numeric_temp$brand_code)) |>
count(brand_code, name = "n") |>
mutate(dataset = "Training"),
valid_features_xgb_numeric_knn_brand_temp |>
filter(is.na(valid_features_xgb_numeric_temp$brand_code)) |>
count(brand_code, name = "n") |>
mutate(dataset = "Validation"),
eval_features_xgb_numeric_knn_brand_temp |>
filter(is.na(eval_features_xgb_numeric_temp$brand_code)) |>
count(brand_code, name = "n") |>
mutate(dataset = "Final Evaluation")
) |>
rename(imputed_brand_code = brand_code) |>
pivot_wider(
names_from = imputed_brand_code,
values_from = n,
values_fill = 0
) |>
mutate(
Total = rowSums(across(where(is.numeric)))
) |>
select(dataset, everything())
xgb_knn_imputed_brand_counts |>
kable(
caption = "Imputed Brand Code Counts After XGBoost Numeric Imputation + KNN",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | A | B | C | Total |
|---|---|---|---|---|
| Training | 7 | 64 | 23 | 94 |
| Validation | 1 | 20 | 5 | 26 |
| Final Evaluation | 0 | 6 | 2 | 8 |
xgb_knn_brand_validation <- validate_knn_brand(
train_x = train_features_xgb_numeric_temp,
brand_col = "brand_code",
k_values = c(3, 5, 7, 9, 11),
project_seed = project_seed
)
xgb_knn_brand_validation$confusion_table |>
pivot_wider(
names_from = predicted_brand,
values_from = n,
values_fill = 0
) |>
arrange(actual_brand) |>
kable(
caption = "KNN Brand Code Validation Confusion Matrix Using XGBoost-Imputed Numeric Predictors",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| actual_brand | A | B | C | D |
|---|---|---|---|---|
| A | 36 | 2 | 1 | 6 |
| B | 1 | 185 | 10 | 0 |
| C | 1 | 31 | 19 | 0 |
| D | 2 | 0 | 1 | 96 |
Compared with the earlier KNN brand_code confusion
matrix, the results using XGBoost-imputed numeric predictors are almost
the same.
The previous confusion matrix had:
The XGBoost-imputed version had:
So the main difference is very small:
As we see, Brand C rows are still the hardest to classify. In the XGBoost-imputed version, 31 actual Brand C rows were predicted as Brand B, which is the same pattern as before. This supports the earlier PCA result, where Brand B and Brand C were more overlapped in the numeric process-variable space.
Overall, the XGBoost numeric imputation did not meaningfully change
the KNN brand_codevalidation pattern. The KNN model still
performs well for Brands A, B, and D, but Brand C remains harder to
separate from Brand B.
missForest packageIn this part, I test missForest as another imputation
method.
This method is different from other workflows because it does not
separate numeric imputation and brand_code imputation into
two steps as I used earlier. Instead, missForest imputes
numeric predictors and brand_code together in one
process.
The steps are:
missForest to impute both numeric predictors
and missing brand_code.I included this imputation process as a comparison method because
missForest can handle mixed numeric and categorical
predictors. However, it is less connected to my PCA and KNN
brand_code imputations logic, because it does not use the
separate process of completing numeric values first and then assigning
brand_code based on nearby observations.
prepare_missforest_data <- function(train_x, valid_x, eval_x, brand_col = "brand_code") {
train_temp <- train_x |>
mutate(.dataset = "Training", .row_id = row_number())
valid_temp <- valid_x |>
mutate(.dataset = "Validation", .row_id = row_number())
eval_temp <- eval_x |>
mutate(.dataset = "Final Evaluation", .row_id = row_number())
combined_data <- bind_rows(
train_temp,
valid_temp,
eval_temp
)
combined_data <- combined_data |>
mutate(
"{brand_col}" := as.factor(.data[[brand_col]])
)
list(
combined_data = combined_data,
id_cols = c(".dataset", ".row_id")
)
}
missforest_input <- prepare_missforest_data(
train_x = train_features_for_imputation,
valid_x = valid_features_for_imputation,
eval_x = eval_features_for_imputation,
brand_col = "brand_code"
)
combined_for_missforest <- missforest_input$combined_data
id_cols <- missforest_input$id_cols
combined_ids <- combined_for_missforest |>
select(all_of(id_cols))
combined_features_for_missforest <- combined_for_missforest |>
select(-all_of(id_cols)) |>
as.data.frame()
combined_features_for_missforest <- combined_features_for_missforest |>
mutate(
across(where(is.character), as.factor),
across(where(is.logical), as.factor)
)
combined_features_for_missforest$brand_code <- as.factor(
combined_features_for_missforest$brand_code
)
combined_features_for_missforest <- combined_features_for_missforest |>
mutate(
across(
where(~ inherits(.x, "Date") || inherits(.x, "POSIXct") || inherits(.x, "POSIXlt")),
as.numeric
)
)
all_missing_cols <- names(combined_features_for_missforest)[
map_lgl(combined_features_for_missforest, ~ all(is.na(.x)))
]
combined_features_for_missforest <- combined_features_for_missforest |>
select(-any_of(all_missing_cols))
constant_cols <- names(combined_features_for_missforest)[
map_lgl(
combined_features_for_missforest,
~ length(unique(na.omit(.x))) <= 1
)
]
combined_features_for_missforest <- combined_features_for_missforest |>
select(-any_of(constant_cols))
set.seed(project_seed)
missforest_time_temp <- system.time({
missforest_result_temp <- missForest(
xmis = combined_features_for_missforest,
maxiter = 3,
ntree = 50,
variablewise = TRUE,
verbose = TRUE
)
})
missForest iteration 1 in progress...done!
estimated error(s): 0.02549889 0.002423574 0.006228133 0.002000095 2.476716 3.270997 0.002225144 0.01265173 0.001775185 0 11.25729 3.633197 21.6205 11.23396 7.67219 52.86191 26.78502 66429.97 0.9822 4.856411 174607.9 0.002789376 645.1303 0.00792792 0.07094996 0.0007109196 18.60701 0.7337317 0.6809167 0.01107182 0.00279986 0.01465174
difference(s): 0.001248508 0.03246295
time: 1.44 seconds
missForest iteration 2 in progress...done!
estimated error(s): 0.0225425 0.002431696 0.006274587 0.002011106 2.344886 3.298449 0.002247812 0.01253804 0.001766746 0 11.14045 3.696281 22.65158 10.87017 7.500838 51.82271 27.53997 30159.45 0.9322445 4.845559 191381.8 0.002579745 718.2835 0.007228232 0.0708929 0.0006863428 16.51772 0.7274178 0.6579717 0.01043707 0.002802485 0.01902887
difference(s): 0.0001388316 0.00952717
time: 1.39 seconds
missForest iteration 3 in progress...done!
estimated error(s): 0.02143385 0.002448337 0.006218199 0.002009058 2.13478 3.319124 0.002221153 0.01261356 0.001786276 0 11.25951 3.73122 21.58347 11.16358 7.748629 51.69491 28.4234 33827.62 0.9165848 4.869446 184279.5 0.002834121 655.3057 0.007171282 0.06827684 0.0007088335 15.52839 0.7407598 0.6634867 0.01034152 0.002816703 0.0200716
difference(s): 0.00002911975 0.007410021
time: 1.41 seconds
missforest_time_temp
user system elapsed
4.89 2.50 4.24
missforest_result_temp$OOBerror
PFC MSE MSE MSE
0.0214338507 0.0024483373 0.0062181991 0.0020090582
MSE MSE MSE MSE
2.1347803332 3.3191239656 0.0022211529 0.0126135631
MSE MSE MSE MSE
0.0017862764 0.0000000000 11.2595069395 3.7312199962
MSE MSE MSE MSE
21.5834697187 11.1635822029 7.7486286628 51.6949086916
MSE MSE MSE MSE
28.4234033967 33827.6197125108 0.9165848314 4.8694455645
MSE MSE MSE MSE
184279.4672627342 0.0028341205 655.3057387789 0.0071712824
MSE MSE MSE MSE
0.0682768382 0.0007088335 15.5283864592 0.7407597868
MSE MSE MSE MSE
0.6634867181 0.0103415237 0.0028167026 0.0200716008
combined_missforest_imputed <- missforest_result_temp$ximp |>
bind_cols(combined_ids)
train_features_missforest_temp <- combined_missforest_imputed |>
filter(.dataset == "Training") |>
arrange(.row_id) |>
select(-all_of(id_cols))
valid_features_missforest_temp <- combined_missforest_imputed |>
filter(.dataset == "Validation") |>
arrange(.row_id) |>
select(-all_of(id_cols))
eval_features_missforest_temp <- combined_missforest_imputed |>
filter(.dataset == "Final Evaluation") |>
arrange(.row_id) |>
select(-all_of(id_cols))
missforest_missing_check_temp <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
rows = c(
nrow(train_features_missforest_temp),
nrow(valid_features_missforest_temp),
nrow(eval_features_missforest_temp)
),
columns = c(
ncol(train_features_missforest_temp),
ncol(valid_features_missforest_temp),
ncol(eval_features_missforest_temp)
),
total_missing_values = c(
sum(is.na(train_features_missforest_temp)),
sum(is.na(valid_features_missforest_temp)),
sum(is.na(eval_features_missforest_temp))
)
)
# missforest_missing_check_temp |>
# kable(
# caption = "Missing Values After missForest Imputation",
# digits = 0
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
missforest_rf_results_temp <- run_rf_validation_test(
train_x = train_features_missforest_temp,
valid_x = valid_features_missforest_temp,
train_y = train_ph,
valid_y = valid_ph,
workflow_name = "missForest imputation",
project_seed = project_seed
)
# missforest_rf_results_temp |>
# kable(
# caption = "missForest Random Forest Validation Performance",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
Before fully relying on imputations, I will also test a complete-case dataset with Random Forest model. In this part, I do not impute missing predictor values. Instead, I simply remove rows that have any missing predictor values to create complete-case daaset.
This gives a useful benchmark because the model is trained only on fully observed records. However, it also removes many observations with missing values, so it may not be fully practical for the final evaluation data where missing predictors also exist.
I will use this as a comparison point, not as the final preprocessing workflow.
# Complete-case feature sets
complete_train_rows <- complete.cases(train_features_for_imputation)
complete_valid_rows <- complete.cases(valid_features_for_imputation)
train_features_complete_case <- train_features_for_imputation[complete_train_rows, ]
valid_features_complete_case <- valid_features_for_imputation[complete_valid_rows, ]
train_ph_complete_case <- train_ph[complete_train_rows]
valid_ph_complete_case <- valid_ph[complete_valid_rows]
complete_case_row_check <- tibble(
dataset = c("Training", "Validation"),
original_rows = c(
nrow(train_features_for_imputation),
nrow(valid_features_for_imputation)
),
complete_case_rows = c(
nrow(train_features_complete_case),
nrow(valid_features_complete_case)
),
removed_rows = original_rows - complete_case_rows
)
complete_case_row_check |>
kable(
caption = "Rows Kept and Removed for Complete-Case Random Forest",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | original_rows | complete_case_rows | removed_rows |
|---|---|---|---|
| Training | 2052 | 1625 | 427 |
| Validation | 515 | 413 | 102 |
complete_case_rf_results <- run_rf_validation_test(
train_x = train_features_complete_case,
valid_x = valid_features_complete_case,
train_y = train_ph_complete_case,
valid_y = valid_ph_complete_case,
workflow_name = "Complete-case random forest",
project_seed = project_seed
)
complete_case_rf_results |>
kable(
caption = "Complete-Case Random Forest Validation Performance",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | training_rows | validation_rows | best_mtry | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|
| Complete-case random forest | 1625 | 413 | 15 | 0.092 | 0.0659 | 0.7309 | 136.01 |
After creating the different datasets with imputed NA values, I compare them using the same Random Forest validation test.
My main goal is not just to see which method fills missing values. I want to see which imputation workflow creates the most useful dataset for predicting PH.
Each workflow has been tested with the same general Random Forest setup, same training PH target, and same validation PH target. This makes the comparison more practical because the final decision is based on downstream model performance, not only on the imputation method itself.
I also included complete-case dataset. This model does not use imputation and only kept rows with no missing predictor values. It can be useful as a baseline, but it uses less rows, so it is not exactly the same situation as the imputed workflows.
imputation_workflow_comparison <- list(
if (exists("complete_case_rf_results")) complete_case_rf_results else NULL,
if (exists("xgb_numeric_knn_rf_results_temp")) xgb_numeric_knn_rf_results_temp else NULL,
if (exists("rf_numeric_knn_results_temp")) rf_numeric_knn_results_temp else NULL,
if (exists("rf_capped_knn_results")) rf_capped_knn_results else NULL,
if (exists("missforest_rf_results_temp")) missforest_rf_results_temp else NULL
) |>
purrr::compact() |>
bind_rows()
imputation_workflow_comparison |>
select(
workflow,
training_rows,
validation_rows,
best_mtry,
rmse,
mae,
rsq,
training_time_seconds
) |>
arrange(rmse) |>
kable(
caption = "Random Forest Comparison: Complete-Case and Imputed Workflows",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | training_rows | validation_rows | best_mtry | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|
| Complete-case random forest | 1625 | 413 | 15 | 0.0920 | 0.0659 | 0.7309 | 136.01 |
| XGBoost numeric + KNN brand | 2052 | 515 | 20 | 0.0958 | 0.0682 | 0.7013 | 192.01 |
| Tree numeric + KNN brand | 2052 | 515 | 20 | 0.0962 | 0.0687 | 0.6980 | 205.30 |
| missForest imputation | 2052 | 515 | 20 | 0.0966 | 0.0694 | 0.6947 | 189.60 |
| RF numeric + KNN brand | 2052 | 515 | 20 | 0.0967 | 0.0692 | 0.6944 | 189.14 |
The complete-case dataset has the best validation performance, but it
uses fewer rows because records with missing predictors are
removed.
This makes it useful as a benchmark, but I wouldn’t use it for following
prediction models, because the final evaluation data also contains
missing predictor values. And, also I would like to use all given
observations.
Among the imputed workflows, Random Forest test shows us that
“XGBoost numeric imputation + KNN brand_code imputation”
technique performed best.
It has:
the lowest RMSE = 0.0958,
and highest \(R^2\) = 0.7013 among imputed datasets.
The other three workflows results are very close to each other, but none of themis better than XGBoost numeric + KNN Brand Code imputation method.
Based on these results, I select datasets of XGBoost numeric + KNN Brand Code imputation method as the main training/validation and evaluation sets to use for modeling and final evaluation predictions.
train_features_final_imputed <- train_features_xgb_numeric_knn_brand_temp
valid_features_final_imputed <- valid_features_xgb_numeric_knn_brand_temp
eval_features_final_imputed <- eval_features_xgb_numeric_knn_brand_temp
final_imputed_set_check <- tibble(
dataset = c("Training", "Validation", "Final Evaluation"),
rows = c(
nrow(train_features_final_imputed),
nrow(valid_features_final_imputed),
nrow(eval_features_final_imputed)
),
columns = c(
ncol(train_features_final_imputed),
ncol(valid_features_final_imputed),
ncol(eval_features_final_imputed)
),
total_missing_values = c(
sum(is.na(train_features_final_imputed)),
sum(is.na(valid_features_final_imputed)),
sum(is.na(eval_features_final_imputed))
)
)
final_imputed_set_check |>
kable(
caption = "Final Completed Feature Sets Used for Modeling",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| dataset | rows | columns | total_missing_values |
|---|---|---|---|
| Training | 2052 | 32 | 0 |
| Validation | 515 | 32 | 0 |
| Final Evaluation | 267 | 32 | 0 |
The one single preprocessing approach may not be best for every model family. Some models are sensitive to scale, skewness, and correlated predictors. Other models, especially tree-based models, can handle nonlinear splits and interactions more naturally.
Because of this, I would not force the exact same transformed features set into every model. Instead, I will make preprocessing decisions based on how each model family works.
mnfflow Variable ReviewThe mnf_flow variable earlier needed extra review
because almost 50% of varible values are negative.
mnf_flow_negative_values |>
kable(
caption = "Negative mnfflow Values in Training Data",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mnf_flow | n |
|---|---|
| -100.0 | 487 |
| -100.2 | 454 |
All negative values are exactly -100 or
-100.2, so at first they looked like possible coded process
states rather than regular continuous flow measurements.
However, the earlier PH comparison showed that these negative
mnf_flow rows behave differently from non-negative rows.
For example, Brand B had mean PH of 8.6700 when
mnf_flow was negative, compared with 8.4716 when
mnf_flow was non-negative.
The process-variable comparison also showed differences. For Brand B,
mean oxygen_filler was 0.0713 for negative
mnf_flow rows, compared with 0.0255 for
non-negative rows.
Because of this, I do not treat negative mnf_flow values
as simple errors. Even though -100 and -100.2
look like special coded values, they appear to contain useful process
information.
mnfflow Representation for Linear ModelsThe mnf_flow review showed that this variable may not
behave like one regular continuous predictor. This matters especially
for linear models because linear models use coefficients.
If mnf_flow is kept as one numeric variable, the model
has to use one linear relationship for both:
-100 and
-100.2Because of this, I test two simple linear regression versions before making the final linear-model preprocessing decision:
mnf_flow: keep
mnf_flow as one numeric predictor.mnf_flow: treat negative
values as categorical process states and keep positive flow as a
separate continuous variable. Basically I split mnf_flow
into two separate variables: one catecorical for negative values
(-100and-100.2), and another numerical
variable for the rest of values.This is only a quick preprocessing diagnostic. I am not applying the
full final linear-model preprocessing steps yet. The purpose is only to
decide whether mnf_flow should be represented differently
for linear and regularized models.
add_mnf_flow_state_features <- function(data_x) {
data_x |>
mutate(
mnf_flow_state = case_when(
is.na(mnf_flow) ~ "Missing",
mnf_flow == -100 ~ "Negative_100",
mnf_flow == -100.2 ~ "Negative_100_2",
mnf_flow < 0 ~ "Other_negative",
mnf_flow >= 0 ~ "Non_negative"
),
mnf_flow_state = factor(
mnf_flow_state,
levels = c(
"Non_negative",
"Negative_100",
"Negative_100_2",
"Other_negative",
"Missing"
)
),
mnf_flow_positive_value = ifelse(
!is.na(mnf_flow) & mnf_flow >= 0,
mnf_flow,
0
)
)
}
train_features_lm_original <- train_features_final_imputed
valid_features_lm_original <- valid_features_final_imputed
train_features_lm_state <- train_features_final_imputed |>
add_mnf_flow_state_features() |>
select(-mnf_flow)
valid_features_lm_state <- valid_features_final_imputed |>
add_mnf_flow_state_features() |>
select(-mnf_flow)
train_lm_original_data <- train_features_lm_original |>
mutate(ph = train_ph) |>
filter(!is.na(ph))
valid_lm_original_data <- valid_features_lm_original |>
mutate(ph = valid_ph) |>
filter(!is.na(ph))
train_lm_state_data <- train_features_lm_state |>
mutate(ph = train_ph) |>
filter(!is.na(ph))
valid_lm_state_data <- valid_features_lm_state |>
mutate(ph = valid_ph) |>
filter(!is.na(ph))
mnf_flow_transformation_example <- bind_rows(
train_features_lm_original |>
filter(mnf_flow < 0) |>
distinct(mnf_flow) |>
slice_head(n = 2),
train_features_lm_original |>
filter(mnf_flow >= 0) |>
distinct(mnf_flow) |>
slice_head(n = 5)
) |>
mutate(
mnf_flow_state = case_when(
is.na(mnf_flow) ~ "Missing",
mnf_flow == -100 ~ "Negative_100",
mnf_flow == -100.2 ~ "Negative_100_2",
mnf_flow < 0 ~ "Other_negative",
mnf_flow >= 0 ~ "Non_negative"
),
mnf_flow_positive_value = ifelse(
!is.na(mnf_flow) & mnf_flow >= 0,
mnf_flow,
0
)
) |>
arrange(mnf_flow)
mnf_flow_transformation_example |>
kable(
caption = "Example of mnfflow State-Based Transformation",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mnf_flow | mnf_flow_state | mnf_flow_positive_value |
|---|---|---|
| -100.2 | Negative_100_2 | 0.0 |
| -100.0 | Negative_100 | 0.0 |
| 0.2 | Non_negative | 0.2 |
| 161.4 | Non_negative | 161.4 |
| 162.6 | Non_negative | 162.6 |
| 164.4 | Non_negative | 164.4 |
| 167.4 | Non_negative | 167.4 |
This table shows how the original mnf_flow variable is
changed for the linear-model test. Negative values such as
-100 and -100.2 are treated as categorical
states, while positive values are kept as a separate numeric
variable.
For example, when mnf_flow = -100.2, the new
mnf_flow_state is Negative_100_2, and
mnf_flow_positive_value is set to 0. When
mnf_flow is positive, the state is
Non_negative, and the positive value is kept as the
continuous flow measurement.
Now let’s see before and after split results with simple linear regression
lm_original_model <- lm(
ph ~ .,
data = train_lm_original_data
)
lm_state_model <- lm(
ph ~ .,
data = train_lm_state_data
)
lm_original_pred <- predict(
lm_original_model,
newdata = valid_lm_original_data
)
lm_state_pred <- predict(
lm_state_model,
newdata = valid_lm_state_data
)
lm_mnf_flow_comparison <- bind_rows(
calculate_regression_metrics(
actual = valid_lm_original_data$ph,
predicted = lm_original_pred
) |>
mutate(
workflow = "Linear model with original mnfflow",
training_rows = nrow(train_lm_original_data),
validation_rows = nrow(valid_lm_original_data)
),
calculate_regression_metrics(
actual = valid_lm_state_data$ph,
predicted = lm_state_pred
) |>
mutate(
workflow = "Linear model with mnfflow state + positive value",
training_rows = nrow(train_lm_state_data),
validation_rows = nrow(valid_lm_state_data)
)
) |>
select(
workflow,
training_rows,
validation_rows,
rmse,
mae,
rsq
) |>
arrange(rmse)
lm_mnf_flow_comparison |>
kable(
caption = "Simple Linear Model Comparison of mnfflow Representations",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | training_rows | validation_rows | rmse | mae | rsq |
|---|---|---|---|---|---|
| Linear model with mnfflow state + positive value | 2052 | 515 | 0.1280 | 0.0990 | 0.4448 |
| Linear model with original mnfflow | 2052 | 515 | 0.1311 | 0.1011 | 0.4173 |
The simple linear model comparison supports using the state-based
mnf_flow split for linear models.
The model with mnf_flow_state + mnf_flow_positive_value
performed better than the model with the original mnf_flow
variable:
RMSE improved from 0.1311 to 0.1280,
MAE improved from 0.1011 to 0.0990,
and \(R^2\) improved from 0.4173 to 0.4448.
This supports my idea that mnf_flow should not always be
treated as one regular continuous predictor. For linear models, one
coefficient may be too restrictive because negative
mnf_flow values appear to represent a coded process state,
while positive values represent regular flow magnitude.
Because of this, I will use this
mnf_flow_state + mnf_flow_positive_value representation
dataset for linear and regularized model families.
Based on the negative-value review, the mnf_flow
behavior check, and the simple linear regression comparison, I use
different preprocessing strategies for different model families.
For linear and regularized models, I will use the
mnf_flow_state + mnf_flow_positive_value representation.
These models use coefficients, so separating the negative coded states
from positive flow magnitude gives the model more flexibility.
The linear and regularized model preprocessing will include:
I wouldn’t apply Yeo-Johnson transformation to all numeric variables. It will be only considered for continuous numeric predictors with high skewness, and will not be applied to categorical variables, dummy variables, or variables that behave more like process settings or coded states.
This will keep the transformation more careful because not every numeric-looking variable should be treated as a regular continuous predictor.
For nonlinear non-tree models, such as SVM with radial basis kernel
or other distance/kernel-based methods, I will treat the
mnf_flow representation as a comparison question.
These models may benefit from the state-based mnf_flow
representation because they are sensitive to feature geometry, scale,
and distance. However, I will not assume this automatically.
For nonlinear non-tree models, I may compare:
mnf_flow_state + mnf_flow_positive_valueThese models usually require centering and scaling. Selective Yeo-Johnson transformation may also be useful for highly skewed continuous predictors.
For tree-based models, I will use the best final imputed feature set
with the original mnf_flow variable.
This is because tree-based models can naturally split on rules like
mnf_flow < 0. Because of that, they do not need the same
mnf_flow state representation that linear models need.
Tree-based models also do not require centering and scaling, and I will not apply Yeo-Johnson transformation by default. These models can handle nonlinear splits, interactions, and skewed predictors.
I did side Random Forest test (not included here), and
mnf_flow_state + mnf_flow_positive_valuerepresentation had
lower accuracy performance compared with the original
mnf_flow version. So, for Random Forest and other
tree-based models, I will keep using the original final imputed feature
set.
All preprocessing steps that learn values from data must be learned from the training set only. This includes Yeo-Johnson transformation, centering, and scaling.
After these preprocessing steps are fitted on the training data during models training, the same fitted steps are applied to the validation and final evaluation sets.
The validation set is used only for model comparison. The final evaluation set will not be used for training, tuning, or model selection. I will use evaluation set at the end after the final model is selected.
#these will be tree-based model feature sets
train_features_tree_family <- train_features_final_imputed
valid_features_tree_family <- valid_features_final_imputed
eval_features_tree_family <- eval_features_final_imputed
#here linear / regularized model feature sets
train_features_linear_family <- train_features_final_imputed |>
add_mnf_flow_state_features() |>
select(-mnf_flow)
valid_features_linear_family <- valid_features_final_imputed |>
add_mnf_flow_state_features() |>
select(-mnf_flow)
eval_features_linear_family <- eval_features_final_imputed |>
add_mnf_flow_state_features() |>
select(-mnf_flow)
model_family_feature_set_check <- tibble(
feature_set = c(
"Tree family training",
"Tree family validation",
"Tree family final evaluation",
"Linear family training",
"Linear family validation",
"Linear family final evaluation"
),
rows = c(
nrow(train_features_tree_family),
nrow(valid_features_tree_family),
nrow(eval_features_tree_family),
nrow(train_features_linear_family),
nrow(valid_features_linear_family),
nrow(eval_features_linear_family)
),
columns = c(
ncol(train_features_tree_family),
ncol(valid_features_tree_family),
ncol(eval_features_tree_family),
ncol(train_features_linear_family),
ncol(valid_features_linear_family),
ncol(eval_features_linear_family)
),
missing_values = c(
sum(is.na(train_features_tree_family)),
sum(is.na(valid_features_tree_family)),
sum(is.na(eval_features_tree_family)),
sum(is.na(train_features_linear_family)),
sum(is.na(valid_features_linear_family)),
sum(is.na(eval_features_linear_family))
)
)
model_family_feature_set_check |>
kable(
caption = "Model Family Feature Set Check",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| feature_set | rows | columns | missing_values |
|---|---|---|---|
| Tree family training | 2052 | 32 | 0 |
| Tree family validation | 515 | 32 | 0 |
| Tree family final evaluation | 267 | 32 | 0 |
| Linear family training | 2052 | 33 | 0 |
| Linear family validation | 515 | 33 | 0 |
| Linear family final evaluation | 267 | 33 | 0 |
processed_data_folder <- "processed_modeling_data"
if (!dir.exists(processed_data_folder)) {
dir.create(processed_data_folder)
}
write_csv(
train_features_tree_family,
file.path(processed_data_folder, "train_features_tree_family.csv")
)
write_csv(
valid_features_tree_family,
file.path(processed_data_folder, "valid_features_tree_family.csv")
)
write_csv(
eval_features_tree_family,
file.path(processed_data_folder, "eval_features_tree_family.csv")
)
write_csv(
train_features_linear_family,
file.path(processed_data_folder, "train_features_linear_family.csv")
)
write_csv(
valid_features_linear_family,
file.path(processed_data_folder, "valid_features_linear_family.csv")
)
write_csv(
eval_features_linear_family,
file.path(processed_data_folder, "eval_features_linear_family.csv")
)
write_csv(
tibble(ph = train_ph),
file.path(processed_data_folder, "train_ph.csv")
)
write_csv(
tibble(ph = valid_ph),
file.path(processed_data_folder, "valid_ph.csv")
)
processed_file_summary <- tibble(
file_name = c(
"train_features_tree_family.csv",
"valid_features_tree_family.csv",
"eval_features_tree_family.csv",
"train_features_linear_family.csv",
"valid_features_linear_family.csv",
"eval_features_linear_family.csv",
"train_ph.csv",
"valid_ph.csv"
),
description = c(
"Training predictors for tree-based models using original mnf_flow",
"Validation predictors for tree-based models using original mnf_flow",
"Final evaluation predictors for tree-based models using original mnf_flow",
"Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
"Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
"Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value",
"Training PH target values",
"Validation PH target values"
),
rows = c(
nrow(train_features_tree_family),
nrow(valid_features_tree_family),
nrow(eval_features_tree_family),
nrow(train_features_linear_family),
nrow(valid_features_linear_family),
nrow(eval_features_linear_family),
length(train_ph),
length(valid_ph)
),
columns = c(
ncol(train_features_tree_family),
ncol(valid_features_tree_family),
ncol(eval_features_tree_family),
ncol(train_features_linear_family),
ncol(valid_features_linear_family),
ncol(eval_features_linear_family),
1,
1
)
)
write_csv(
processed_file_summary,
file.path(processed_data_folder, "processed_file_summary.csv")
)
processed_file_summary |>
kable(
caption = "Saved Processed Modeling Files",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| file_name | description | rows | columns |
|---|---|---|---|
| train_features_tree_family.csv | Training predictors for tree-based models using original mnf_flow | 2052 | 32 |
| valid_features_tree_family.csv | Validation predictors for tree-based models using original mnf_flow | 515 | 32 |
| eval_features_tree_family.csv | Final evaluation predictors for tree-based models using original mnf_flow | 267 | 32 |
| train_features_linear_family.csv | Training predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value | 2052 | 33 |
| valid_features_linear_family.csv | Validation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value | 515 | 33 |
| eval_features_linear_family.csv | Final evaluation predictors for linear/regularized models using mnf_flow_state + mnf_flow_positive_value | 267 | 33 |
| train_ph.csv | Training PH target values | 2052 | 1 |
| valid_ph.csv | Validation PH target values | 515 | 1 |
set.seed(123)
project_seed = 123
# num_cores <- 10
#
#
# cl <- makePSOCKcluster(num_cores)
# registerDoParallel(cl)
processed_data_folder <- "processed_modeling_data"
train_features_tree_family <- read_csv(
file.path(processed_data_folder, "train_features_tree_family.csv"),
show_col_types = FALSE
)
valid_features_tree_family <- read_csv(
file.path(processed_data_folder, "valid_features_tree_family.csv"),
show_col_types = FALSE
)
eval_features_tree_family <- read_csv(
file.path(processed_data_folder, "eval_features_tree_family.csv"),
show_col_types = FALSE
)
train_features_linear_family <- read_csv(
file.path(processed_data_folder, "train_features_linear_family.csv"),
show_col_types = FALSE
)
valid_features_linear_family <- read_csv(
file.path(processed_data_folder, "valid_features_linear_family.csv"),
show_col_types = FALSE
)
eval_features_linear_family <- read_csv(
file.path(processed_data_folder, "eval_features_linear_family.csv"),
show_col_types = FALSE
)
train_ph <- read_csv(
file.path(processed_data_folder, "train_ph.csv"),
show_col_types = FALSE
) |>
pull(ph)
valid_ph <- read_csv(
file.path(processed_data_folder, "valid_ph.csv"),
show_col_types = FALSE
) |>
pull(ph)
loaded_data_check <- tibble(
object_name = c(
"train_features_tree_family",
"valid_features_tree_family",
"eval_features_tree_family",
"train_features_linear_family",
"valid_features_linear_family",
"eval_features_linear_family",
"train_ph",
"valid_ph"
),
rows = c(
nrow(train_features_tree_family),
nrow(valid_features_tree_family),
nrow(eval_features_tree_family),
nrow(train_features_linear_family),
nrow(valid_features_linear_family),
nrow(eval_features_linear_family),
length(train_ph),
length(valid_ph)
),
columns = c(
ncol(train_features_tree_family),
ncol(valid_features_tree_family),
ncol(eval_features_tree_family),
ncol(train_features_linear_family),
ncol(valid_features_linear_family),
ncol(eval_features_linear_family),
1,
1
)
)
loaded_data_check |>
kable(
caption = "Loaded Modeling Data Check",
digits = 0
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| object_name | rows | columns |
|---|---|---|
| train_features_tree_family | 2052 | 32 |
| valid_features_tree_family | 515 | 32 |
| eval_features_tree_family | 267 | 32 |
| train_features_linear_family | 2052 | 33 |
| valid_features_linear_family | 515 | 33 |
| eval_features_linear_family | 267 | 33 |
| train_ph | 2052 | 1 |
| valid_ph | 515 | 1 |
stopifnot(nrow(train_features_tree_family) == length(train_ph))
stopifnot(nrow(valid_features_tree_family) == length(valid_ph))
stopifnot(nrow(train_features_linear_family) == length(train_ph))
stopifnot(nrow(valid_features_linear_family) == length(valid_ph))
stopifnot(nrow(eval_features_tree_family) == nrow(eval_features_linear_family))
Before fitting models, I define a few helper functions that will be reused throughout the modeling file. This keeps the model sections shorter and makes the results tables more consistent.
calculate_regression_metrics <- function(actual, predicted) {
tibble(
rmse = sqrt(mean((actual - predicted)^2, na.rm = TRUE)),
mae = mean(abs(actual - predicted), na.rm = TRUE),
rsq = cor(actual, predicted, use = "complete.obs")^2
)
}
This function calculates the main validation metrics used in this project: RMSE, MAE, and R-squared.
make_model_result_row <- function(
workflow,
model_family,
model_name,
training_rows,
validation_rows,
tuning_parameters,
actual,
predicted,
training_time_seconds
) {
metrics <- calculate_regression_metrics(
actual = actual,
predicted = predicted
)
tibble(
workflow = workflow,
model_family = model_family,
model_name = model_name,
training_rows = training_rows,
validation_rows = validation_rows,
tuning_parameters = tuning_parameters,
rmse = metrics$rmse,
mae = metrics$mae,
rsq = metrics$rsq,
training_time_seconds = training_time_seconds
)
}
This function creates one clean result row for each model. Extra tuning information, such as best alpha, lambda, mtry, or cp, can be added when needed.
prepare_linear_model_matrix <- function(train_x, valid_x, eval_x = NULL) {
dummy_fit <- caret::dummyVars(
~ .,
data = train_x,
fullRank = TRUE
)
train_dummy <- predict(dummy_fit, newdata = train_x) |>
as.data.frame()
valid_dummy <- predict(dummy_fit, newdata = valid_x) |>
as.data.frame()
if (!is.null(eval_x)) {
eval_dummy <- predict(dummy_fit, newdata = eval_x) |>
as.data.frame()
} else {
eval_dummy <- NULL
}
list(
train = train_dummy,
valid = valid_dummy,
eval = eval_dummy,
dummy_fit = dummy_fit
)
}
This function creates dummy variables using the training data and then applies the same dummy encoding to validation and final evaluation data. This keeps categorical predictors consistent across all datasets.
find_skewed_numeric_vars <- function(
train_x,
skew_cutoff = 1,
min_unique_values = 10,
exclude_vars = c(
"pressure_setpoint",
"bowl_setpoint"
)
) {
numeric_vars <- train_x |>
select(where(is.numeric)) |>
names()
skew_summary <- train_x |>
select(all_of(numeric_vars)) |>
summarise(
across(
everything(),
list(
unique_values = ~ n_distinct(.x, na.rm = TRUE),
skewness = ~ e1071::skewness(.x, na.rm = TRUE)
)
)
) |>
pivot_longer(
cols = everything(),
names_to = "name",
values_to = "value"
) |>
separate(
name,
into = c("variable", "metric"),
sep = "_(?=[^_]+$)"
) |>
pivot_wider(
names_from = metric,
values_from = value
) |>
mutate(
abs_skewness = abs(skewness),
use_yeo_johnson = abs_skewness >= skew_cutoff &
unique_values >= min_unique_values &
!(variable %in% exclude_vars)
) |>
arrange(desc(abs_skewness))
list(
skew_summary = skew_summary,
yeo_johnson_vars = skew_summary |>
filter(use_yeo_johnson) |>
pull(variable)
)
}
This function identifies numeric variables that are strongly skewed and may need a Yeo-Johnson transformation.
apply_correlation_filter <- function(
train_x,
valid_x,
eval_x = NULL,
cutoff = 0.90
) {
numeric_train <- train_x |>
select(where(is.numeric))
corr_matrix <- cor(
numeric_train,
use = "pairwise.complete.obs"
)
high_corr_cols <- caret::findCorrelation(
corr_matrix,
cutoff = cutoff,
names = TRUE
)
if (length(high_corr_cols) > 0) {
train_x <- train_x |>
select(-all_of(high_corr_cols))
valid_x <- valid_x |>
select(-all_of(high_corr_cols))
if (!is.null(eval_x)) {
eval_x <- eval_x |>
select(-all_of(high_corr_cols))
}
}
list(
train = train_x,
valid = valid_x,
eval = eval_x,
removed_columns = high_corr_cols
)
}
This function removes highly correlated numeric predictors using the training data only. This is mainly useful for linear and regularized models where strong correlation can make coefficients unstable.
apply_nzv_filter <- function(train_x, valid_x, eval_x = NULL) {
nzv_cols_index <- caret::nearZeroVar(train_x)
removed_columns <- names(train_x)[nzv_cols_index]
if (length(removed_columns) > 0) {
train_x <- train_x |>
select(-all_of(removed_columns))
valid_x <- valid_x |>
select(-all_of(removed_columns))
if (!is.null(eval_x)) {
eval_x <- eval_x |>
select(-all_of(removed_columns))
}
}
list(
train = train_x,
valid = valid_x,
eval = eval_x,
removed_columns = removed_columns
)
}
plot_model_diagnostics <- function(obs, pred, model_name = "Model") {
diag_df <- data.frame(
observed = as.numeric(obs),
predicted = as.numeric(pred),
residuals = as.numeric(obs) - as.numeric(pred)
)
p1 <- ggplot(data = diag_df, aes(x = predicted, y = observed)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(slope = 1, intercept = 0, linetype = 2) +
labs(
title = paste("Predicted vs Observed: \n", model_name),
x = "Predicted PH",
y = "Observed PH"
)
p2 <- ggplot(data = diag_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.6, color = "blue") +
geom_hline(yintercept = 0, linetype = 2) +
labs(
title = paste("Residuals vs Predicted: \n", model_name),
x = "Predicted PH",
y = "Residuals"
)
p1 + p2
}
The first plot compares predicted values with observed values. If the model is doing well, the points should be close to the diagonal line.
The second plot shows residuals. Ideally, the residuals should be scattered around zero without a strong pattern.
make_and_show_model_result <- function(
workflow,
model_family,
model_name,
training_rows,
validation_rows,
tuning_parameters,
actual,
predicted,
training_time_seconds,
table_caption = NULL
) {
result <- make_model_result_row(
workflow = workflow,
model_family = model_family,
model_name = model_name,
training_rows = training_rows,
validation_rows = validation_rows,
tuning_parameters = tuning_parameters,
actual = actual,
predicted = predicted,
training_time_seconds = training_time_seconds
)
if (is.null(table_caption)) {
table_caption <- paste(model_name, "Validation Performance")
}
print(
result |>
kable(
caption = table_caption,
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
)
return(result)
}
add_model_result <- function(results_table, new_result) {
bind_rows(
results_table,
new_result
) |>
arrange(rmse)
}
linear_model_results <- tibble()
tree_model_results <- tibble()
all_model_results <- tibble()
start_parallel <- function(cores_to_leave_free = 1) {
library(doParallel)
total_cores <- parallel::detectCores(logical = TRUE)
cores_to_use <- max(1, total_cores - cores_to_leave_free)
cl <- makePSOCKcluster(cores_to_use)
registerDoParallel(cl)
message("Parallel backend started with ", cores_to_use, " cores.")
return(cl)
}
stop_parallel <- function(cl) {
stopCluster(cl)
registerDoSEQ()
message("Parallel backend stopped.")
}
In this part I will compare several model families using the validation set. The training set will be used for model fitting and cross-validation tuning. The validation set is used only after tuning to compare model performance.
I will use RMSE as the main metric because PH prediction is a regression problem and RMSE penalizes larger errors more strongly. I also report MAE because it is easier to interpret as the average absolute error. \(R^2\) will be included to show how much variation in PH is explained by the model.
Different model families will use different feature sets:
mnf_flow_state + mnf_flow_positive_value.mnf_flow variable.set.seed(123)
cv_control <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 3,
savePredictions = "final"
)
I’m going to use repeated 5-fold cross-validation with 3 repeats.
The validation set is still kept separate. I will use it to compare the models after they are trained.
Since I already created the linear-family feature set, I continue with that version here.
The linear-family feature set uses mnf_flow_state and
mnf_flow_positive_value instead of the original
mnf_flow, because that made more sense for linear models
earlier.
train_linear_model_data <- train_features_linear_family |>
mutate(ph = train_ph)
valid_linear_model_data <- valid_features_linear_family |>
mutate(ph = valid_ph)
Before fitting the linear models, I remove predictors with almost no variation. I did NZV test on raw data in preprocessing secton, but I’m doing it again as precation step since datasets now have imputed values
linear_nzv <- apply_nzv_filter(
train_x = train_features_linear_family,
valid_x = valid_features_linear_family
)
train_linear_nzv <- linear_nzv$train
valid_linear_nzv <- linear_nzv$valid
linear_nzv_removed <- tibble(
removed_predictor = linear_nzv$removed_columns
)
linear_nzv_removed
# A tibble: 0 × 1
# ℹ 1 variable: removed_predictor <chr>
No near-zero variance predictors were removed.
linear_corr <- apply_correlation_filter(
train_x = train_linear_nzv,
valid_x = valid_linear_nzv,
cutoff = 0.90
)
train_linear_filtered <- linear_corr$train
valid_linear_filtered <- linear_corr$valid
stopifnot(exists("train_linear_filtered"))
stopifnot(exists("valid_linear_filtered"))
linear_corr_removed <- tibble(
removed_predictor = linear_corr$removed_columns
)
linear_corr_removed |>
kable(
caption = "Highly Correlated Predictors Removed for Linear Models"
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| removed_predictor |
|---|
| balling |
| hyd_pressure3 |
| density |
| balling_lvl |
| filler_level |
The correlation filter removed five predictors: balling,
hyd_pressure3, density,
balling_lvl, and filler_level.
train_linear_filtered_data <- train_linear_filtered |>
mutate(ph = train_ph)
valid_linear_filtered_data <- valid_linear_filtered |>
mutate(ph = valid_ph)
For the linear and regularized models, I first create the list of variables that should receive the Yeo-Johnson transformation.
I do not apply the transformation directly here. The actual Yeo-Johnson transformation will be handled inside the modeling workflow together with centering and scaling.
exclude_from_yj <- c(
"pressure_setpoint",
"bowl_setpoint"
)
yeo_johnson_vars <- train_linear_filtered_data |>
select(where(is.numeric)) |>
select(-any_of(c(exclude_from_yj, "ph"))) |>
names()
tibble(
yeo_johnson_variable = yeo_johnson_vars
) |>
kable(
caption = "Variables Selected for Yeo-Johnson Transformation"
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| yeo_johnson_variable |
|---|
| carb_volume |
| fill_ounces |
| pc_volume |
| carb_pressure |
| carb_temp |
| psc |
| psc_fill |
| psc_co2 |
| carb_pressure1 |
| fill_pressure |
| hyd_pressure1 |
| hyd_pressure2 |
| hyd_pressure4 |
| filler_speed |
| temperature |
| usage_cont |
| carb_flow |
| mfr |
| pressure_vacuum |
| oxygen_filler |
| air_pressurer |
| alch_rel |
| carb_rel |
| mnf_flow_positive_value |
After the NZV and correlation filtering steps, I add ph
back to the filtered linear-family training and validation datasets.
The actual Yeo-Johnson, centering, and scaling steps will be applied inside the model training workflow.
train_linear_model_data <- train_linear_filtered |>
mutate(ph = train_ph)
valid_linear_model_data <- valid_linear_filtered |>
mutate(ph = valid_ph)
linear_recipe <- recipe(
ph ~ .,
data = train_linear_model_data
) |>
step_YeoJohnson(all_of(yeo_johnson_vars)) |>
step_dummy(all_nominal_predictors()) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors())
Here I created the preprocessing workflow for the linear and regularized models.
The Yeo-Johnson step uses only the variables selected earlier in
yeo_johnson_vars. Also, since some variables have negative
values it is better to use Yeo-Johnson method instad of BoxCox. This
keeps the transformation limited to the numeric variables we decided are
appropriate.
After that, categorical predictors are converted into dummy variables. Then numeric predictors are centered and scaled, which is important for regularized models like ridge, lasso, and elastic net.
I keep this inside the modeling workflow so the preprocessing is connected to cross-validation instead of being applied separately by hand.
I start with a regular linear regression model as the first baseline.
I originally tried to use caret::train() with a recipe,
but it caused an error during repeated cross-validation.
Because this modeling section uses a recipe for Yeo-Johnson,
centering, scaling, and dummy encoding, I switch to the
tidymodels workflow. This fits better here because
recipes, workflows, parsnip, and
rsample are designed to work together.
The modeling logic stays the same. I am not changing the preprocessing plan. I am only changing the model training framework so the recipe can be handled more reliably.
set.seed(123)
linear_folds <- vfold_cv(
train_linear_model_data,
v = 5,
repeats = 3
)
I use repeated 5-fold cross-validation again, but now through
tidymodels.
set.seed(123)
linear_model_spec <- linear_reg() |>
set_engine("lm")
linear_workflow <- workflow() |>
add_recipe(linear_recipe) |>
add_model(linear_model_spec)
set.seed(123)
linear_cv_results <- fit_resamples(
linear_workflow,
resamples = linear_folds,
metrics = metric_set(rmse, mae, rsq),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(linear_cv_results) |>
kable(
caption = "Linear Regression Cross-Validation Performance",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| mae | standard | 0.1066 | 15 | 0.0008 | pre0_mod0_post0 |
| rmse | standard | 0.1363 | 15 | 0.0010 | pre0_mod0_post0 |
| rsq | standard | 0.3786 | 15 | 0.0093 | pre0_mod0_post0 |
linear_start_time <- Sys.time()
linear_fit <- fit(
linear_workflow,
data = train_linear_model_data
)
linear_end_time <- Sys.time()
linear_training_time <- as.numeric(
difftime(linear_end_time, linear_start_time, units = "secs")
)
linear_validation_predictions <- predict(
linear_fit,
new_data = valid_linear_model_data
) |>
pull(.pred)
linear_result <- make_model_result_row(
workflow = "Linear-family ",
model_family = "Linear / Regularized",
model_name = "Linear Regression",
training_rows = nrow(train_linear_model_data),
validation_rows = nrow(valid_linear_model_data),
tuning_parameters = "None",
actual = valid_ph,
predicted = linear_validation_predictions,
training_time_seconds = linear_training_time
)
linear_result |>
kable(
caption = "Linear Regression Validation Performance",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Linear-family | Linear / Regularized | Linear Regression | 2052 | 515 | None | 0.1305 | 0.1007 | 0.4244 | 0.1201 |
The first model I tested was regular linear regression.
This model is used as a baseline model because it is simple and easy to understand.
The validation performance was:
This result gives me a starting point for comparing all other models. The model was very fast to train, but the prediction performance is not very strong as expected.
The \(R^2\) value of 0.4244 means the model explained some of the PH pattern, but a large part of the variation was still not captured. This makes sense because the data we have probably has nonlinear relationships and interactions between variables.
Overall, linear regression is useful as a baseline, but it was not flexible enough.
plot_model_diagnostics(
obs = valid_ph,
pred = linear_validation_predictions,
model_name = "Linear Regression"
)
The diagnostic plots show why linear regression is useful as a baseline, but not strong enough.
In Predicted vs Observed plot, the points follow the general upward direction, so the model captured some relationship between predictors and PH. However, the points spread around the diagonal line is wide. This means the linear regression produced many over and under predicted values.
The Residuals vs Predicted plot also shows a fairly wide spread of residuals. Some residuals are larger than +0.25 or below -0.25, which means the model had some noticeable overprediction and underprediction cases.
It also appear to be some funnel-shape pattern in the residuals instead of a fully random cloud around zero.
Overall, linear regression gave a useful starting point, but the diagnostic plots show that PH prediction likely needs a more flexible model that can capture nonlinear patterns and interactions between predictors.
After the regular linear regression model, I continue with elastic net regression.
Elastic net is still a linear model, but it adds regularization. This means it can shrink some coefficients and reduce overfitting compared with a regular linear regression model.
Elastic net combines both ridge and lasso regression.
Elastic net has two main tuning parameters: penalty and
mixture.
The penalty controls how aggressivly the model shrinks
the coefficients. A small penalty means the model is closer to regular
linear regression. On the other hand, larger penalty means stronger
shrinkage.
The mixture controls the type of regularization. When
mixture = 0, the model behaves pretty much like ridge
regression. When mixture = 1, it behaves like lasso
regression. Values between 0 and 1 combine both ridge and lasso
methods.
In this part, I tune both parameters because I do not want to assume ahead of time whether ridge-like, lasso-like, or mixed regularization will work best.
make_linear_recipe <- function(training_data) {
recipe(
ph ~ .,
data = training_data
) |>
step_YeoJohnson(all_of(yeo_johnson_vars)) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())
}
elastic_net_spec <- linear_reg(
penalty = tune(),
mixture = tune()
) |>
set_engine("glmnet")
elastic_net_grid <- grid_regular(
penalty(range = c(-6, -1)),
mixture(range = c(0, 1)),
levels = c(
penalty = 15,
mixture = 6
)
)
elastic_net_workflow <- workflow() |>
add_recipe(linear_recipe) |>
add_model(elastic_net_spec)
elastic_net_grid_2 <- grid_regular(
penalty(range = c(-6, -1)),
mixture(range = c(0, 1)),
levels = c(
penalty = 15,
mixture = 6
)
)
set.seed(123)
elastic_net_start_time <- Sys.time()
elastic_net_tune <- tune_grid(
elastic_net_workflow,
resamples = linear_folds,
grid = elastic_net_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
elastic_net_end_time <- Sys.time()
elastic_net_training_time <- as.numeric(
difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)
#elastic_net_tune
# elastic_net_tune |>
# collect_metrics() |>
# arrange(.metric, mean) |>
# kable(
# caption = "Elastic Net Cross-Validation Tuning Results",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
best_elastic_net <- select_best(
elastic_net_tune,
metric = "rmse"
)
best_elastic_net
# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.00848 0 pre0_mod67_post0
Now let’s fit chosen model on training and validation sets
final_elastic_net_workflow <- finalize_workflow(
elastic_net_workflow,
best_elastic_net
)
elastic_net_start_time <- Sys.time()
final_elastic_net_fit <- fit(
final_elastic_net_workflow,
data = train_linear_model_data
)
elastic_net_end_time <- Sys.time()
elastic_net_training_time <- as.numeric(
difftime(elastic_net_end_time, elastic_net_start_time, units = "secs")
)
elastic_net_validation_predictions <- predict(
final_elastic_net_fit,
new_data = valid_linear_model_data
) |>
pull(.pred)
elastic_net_result <- make_and_show_model_result(
workflow = "Linear-family ",
model_family = "Linear / Regularized",
model_name = "Elastic Net",
training_rows = nrow(train_linear_model_data),
validation_rows = nrow(valid_linear_model_data),
tuning_parameters = paste(
"penalty =",
best_elastic_net$penalty,
", mixture =",
best_elastic_net$mixture
),
actual = valid_ph,
predicted = elastic_net_validation_predictions,
training_time_seconds = elastic_net_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Linear-family | Linear / Regularized | Elastic Net | 2052 | 515 | penalty = 0.00848342898244073 , mixture = 0 | 0.1305 | 0.1007 | 0.425 | 0.1375 |
The best elastic net model selected penalty = 0.0085 and
mixture = 0.
The elastic net model produced almost the same validation performance
as the regular linear regression baseline.This makes sense because the
best tuning result selected penalty = 0.0085 and
mixture = 0. A mixture value of 0 means the
model is ridge-like, and the very small penalty means only a tiny amount
of shrinkage was applied.
Because of this, the elastic net model stayed very close to the regular
linear regression model. In this case, regularization did not
meaningfully improve validation performance.
plot_model_diagnostics(
obs = valid_ph,
pred = elastic_net_validation_predictions,
model_name = "Elastic Net"
)
The elastic net diagnostic plots show that the model captures the general direction of PH values, and almost identical with diagnosis plots of previous linear regression.
In the predicted vs observed plot, the points follow an upward trend, but they are not tightly clustered around the diagonal line. The predicted PH values are also compressed into a smaller range than the observed PH values, which suggests that the model is pulling predictions toward the average.
The residual plot shows that the residuals are mostly centered around zero, but there is still noticeable spread. The spread has a mild funnel shape, with larger residual variation for smaller predicted PH values. This suggests that the elastic net model struggles more with explaining variation in the lower predicted PH range.
For larger predicted PH values, there also appears to be a slight downward pattern in the residuals.
Overall, the residual plot suggests that the elastic net model captures some of the general PH pattern, but it does not fully explain the variance across the whole PH range.
I also test Partial Least Squares regression because it can be useful when predictors are correlated.
Instead of using the original predictors directly, PLS creates a
smaller number of components (supervised version of PCA) from the
predictors. These components are created in a way that tries to explain
the relationship with ph.
The main tuning parameter is num_comp, which controls
how many PLS components are used. Too few components may underfit, while
too many components may start acting more like a regular linear model.
For this model I will use tuning with up to 15 components.
train_pls_model_data <- train_linear_nzv |>
mutate(ph = train_ph)
valid_pls_model_data <- valid_linear_nzv |>
mutate(ph = valid_ph)
exclude_from_pls_yj <- c(
"pressure_setpoint",
"bowl_setpoint"
)
pls_yeo_johnson_vars <- train_linear_nzv |>
select(where(is.numeric)) |>
select(-any_of(exclude_from_pls_yj)) |>
names()
make_pls_recipe <- function(training_data) {
recipe(
ph ~ .,
data = training_data
) |>
step_YeoJohnson(all_of(pls_yeo_johnson_vars)) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())
}
set.seed(123)
pls_folds <- vfold_cv(
train_pls_model_data,
v = 5,
repeats = 3
)
fit_pls_cv_split <- function(split, num_comp) {
train_split_data <- analysis(split)
valid_split_data <- assessment(split)
split_recipe <- make_pls_recipe(train_split_data)
split_recipe_prep <- prep(
split_recipe,
training = train_split_data,
retain = TRUE
)
train_split_baked <- bake(
split_recipe_prep,
new_data = train_split_data
)
valid_split_baked <- bake(
split_recipe_prep,
new_data = valid_split_data
)
pls_fit <- plsr(
ph ~ .,
data = train_split_baked,
ncomp = num_comp,
method = "kernelpls"
)
pls_predictions <- predict(
pls_fit,
newdata = valid_split_baked,
ncomp = num_comp
) |>
as.numeric()
tibble(
num_comp = num_comp,
rmse = rmse_vec(
truth = valid_split_baked$ph,
estimate = pls_predictions
),
mae = mae_vec(
truth = valid_split_baked$ph,
estimate = pls_predictions
),
rsq = rsq_vec(
truth = valid_split_baked$ph,
estimate = pls_predictions
)
)
}
pls_grid <- tibble(
num_comp = 1:20
)
set.seed(123)
pls_uncorr_start_time <- Sys.time()
pls_uncorr_cv_results <- crossing(
split_id = seq_along(pls_folds$splits),
num_comp = pls_grid$num_comp
) |>
mutate(
metrics = map2(
split_id,
num_comp,
~ fit_pls_cv_split(
split = pls_folds$splits[[.x]],
num_comp = .y
)
)
) |>
select(-split_id, -num_comp) |>
unnest(metrics)
pls_uncorr_end_time <- Sys.time()
pls_uncorr_training_time <- as.numeric(
difftime(
pls_uncorr_end_time,
pls_uncorr_start_time,
units = "secs"
)
)
pls_uncorr_cv_summary <- pls_uncorr_cv_results |>
group_by(num_comp) |>
summarise(
rmse = mean(rmse, na.rm = TRUE),
mae = mean(mae, na.rm = TRUE),
rsq = mean(rsq, na.rm = TRUE),
.groups = "drop"
) |>
arrange(rmse)
pls_uncorr_cv_summary |>
kable(
caption = "PLS Cross-Validation Results with Correlated Predictors Kept",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| num_comp | rmse | mae | rsq |
|---|---|---|---|
| 14 | 0.1350 | 0.1051 | 0.3906 |
| 15 | 0.1350 | 0.1051 | 0.3904 |
| 17 | 0.1350 | 0.1052 | 0.3903 |
| 16 | 0.1350 | 0.1052 | 0.3903 |
| 18 | 0.1350 | 0.1052 | 0.3901 |
| 13 | 0.1350 | 0.1053 | 0.3900 |
| 19 | 0.1351 | 0.1052 | 0.3899 |
| 12 | 0.1351 | 0.1052 | 0.3897 |
| 20 | 0.1351 | 0.1051 | 0.3896 |
| 10 | 0.1352 | 0.1053 | 0.3889 |
| 11 | 0.1352 | 0.1053 | 0.3886 |
| 8 | 0.1354 | 0.1057 | 0.3863 |
| 9 | 0.1354 | 0.1055 | 0.3863 |
| 7 | 0.1359 | 0.1058 | 0.3823 |
| 6 | 0.1364 | 0.1064 | 0.3778 |
| 5 | 0.1372 | 0.1064 | 0.3696 |
| 4 | 0.1388 | 0.1085 | 0.3557 |
| 3 | 0.1398 | 0.1091 | 0.3462 |
| 2 | 0.1443 | 0.1132 | 0.3042 |
| 1 | 0.1527 | 0.1210 | 0.2212 |
best_pls_uncorr <- pls_uncorr_cv_summary |>
slice(1)
best_pls_uncorr |>
kable(
caption = "Best PLS Result with Correlated Predictors Kept",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| num_comp | rmse | mae | rsq |
|---|---|---|---|
| 14 | 0.135 | 0.1051 | 0.3906 |
best_pls_uncorr_num_comp <- best_pls_uncorr$num_comp
The best PLS model selected num_comp = 14 and it is a
lot of componets are being used.
.
pls_uncorr_final_recipe <- make_pls_recipe(
train_pls_model_data
)
pls_uncorr_final_recipe_prep <- prep(
pls_uncorr_final_recipe,
training = train_pls_model_data,
retain = TRUE
)
train_pls_uncorr_baked <- bake(
pls_uncorr_final_recipe_prep,
new_data = train_pls_model_data
)
valid_pls_uncorr_baked <- bake(
pls_uncorr_final_recipe_prep,
new_data = valid_pls_model_data
)
pls_uncorr_final_start_time <- Sys.time()
pls_uncorr_final_model <- plsr(
ph ~ .,
data = train_pls_uncorr_baked,
ncomp = best_pls_uncorr_num_comp,
method = "kernelpls"
)
pls_uncorr_final_end_time <- Sys.time()
pls_uncorr_final_training_time <- as.numeric(
difftime(
pls_uncorr_final_end_time,
pls_uncorr_final_start_time,
units = "secs"
)
)
pls_uncorr_validation_predictions <- predict(
pls_uncorr_final_model,
newdata = valid_pls_uncorr_baked,
ncomp = best_pls_uncorr_num_comp
) |>
as.numeric()
pls_uncorr_result <- make_and_show_model_result(
workflow = "Linear-family / PLS keeps correlated predictors",
model_family = "Linear / Regularized",
model_name = "PLS Regression",
training_rows = nrow(train_pls_model_data),
validation_rows = nrow(valid_pls_model_data),
tuning_parameters = paste(
"num_comp =",
best_pls_uncorr_num_comp
),
actual = valid_ph,
predicted = pls_uncorr_validation_predictions,
training_time_seconds = pls_uncorr_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Linear-family / PLS keeps correlated predictors | Linear / Regularized | PLS Regression | 2052 | 515 | num_comp = 14 | 0.1286 | 0.0987 | 0.4403 | 0.0095 |
The best PLS model selected num_comp = 14 and it is a
lot of componets.
This means the model needed many components to get its best cross-validation result. In this case, PLS did not reduce the predictor space into a small number of very strong components.
The best PLS cross-validation performance was:
Compared with the regular linear regression and elastic net models, PLS performed slightly worse. This suggests that reducing the predictors into PLS components did not improve PH prediction for this dataset.
Because of this, PLS is useful as another linear-family comparison model, but it does not look like the strongest model in this family.
The plots are almost identical to previous plots, and my previous explanations of ENET diagnostics apply to these plots as well.
plot_model_diagnostics(
obs = valid_ph,
pred = pls_uncorr_validation_predictions,
model_name = "PLS Regression"
)
The PLS diagnostic plots look very similar to the regular linear regression plots.
In the Predicted vs Observed plot, the points follow the general upward direction, so PLS captured some relationship between the predictors and PH. However, there is still a wide spread around the diagonal line. This means the predictions are not very close to the observed PH values for many validation rows.
The Residuals vs Predicted plot also shows that residuals are spread fairly wide around zero. There are still some larger overpredictions and underpredictions.
PLS was useful because it can handle correlated predictors better than regular linear regression. This is why I tested it after seeing correlation between some process variables.
However, the plots show that PLS still behaves mostly like a linear model. It improved slightly over regular linear regression, but it still did not capture the nonlinear patterns.
Overall, PLS was a reasonable linear-family model to test, but it was not flexible enough.
linear_model_results <- tibble()
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = linear_result
)
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = elastic_net_result
)
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = pls_uncorr_result
)
# linear_model_results <- linear_model_results |>
# filter(model_name != "PLS Regression")
#
# all_model_results <- all_model_results |>
# filter(model_name != "PLS Regression")
linear_model_results |>
arrange(rmse) |>
kable(
caption = "Linear and Regularized Model Comparison",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Linear-family / PLS keeps correlated predictors | Linear / Regularized | PLS Regression | 2052 | 515 | num_comp = 14 | 0.1286 | 0.0987 | 0.4403 | 0.0095 |
| Linear-family | Linear / Regularized | Linear Regression | 2052 | 515 | None | 0.1305 | 0.1007 | 0.4244 | 0.1201 |
| Linear-family | Linear / Regularized | Elastic Net | 2052 | 515 | penalty = 0.00848342898244073 , mixture = 0 | 0.1305 | 0.1007 | 0.4250 | 0.1375 |
The linear and regularized model comparison shows that PLS regression performed slightly better than regular linear regression and elastic net.
PLS regression:
Linear regression:
Elastic net:
For PLS, I kept the correlated predictors instead of applying correlation filtering. This is because PLS is designed to work with correlated predictors by creating components from them. Removing correlated predictors before PLS could remove useful information.
The best PLS model selected num_comp = 14. This means
the model needed many components to reach its best performance, so it
did not reduce the data into only a few strong components. Still, PLS
gave a small improvement over the regular linear regression and elastic
net models.
Elastic net selected mixture = 0, which means it behaved
like ridge regression. Its performance was almost the same as regular
linear regression, so regularization did not add much improvement in
this section.
Overall, PLS was the best model inside the linear and regularized model family, but the improvement was small. This tells us that linear models are able to capture some PH patterns, but linear models still leave a good amount of data structure unexplained.
linear_model_results <- tibble()
all_model_results <- tibble()
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = linear_result
)
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = elastic_net_result
)
linear_model_results <- add_model_result(
results_table = linear_model_results,
new_result = pls_uncorr_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = linear_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = elastic_net_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = pls_uncorr_result
)
#all_model_results
After the linear-family models, I will test nonlinear models.
The linear models gave very similar prediction patterns, so the next
question is if flexible models can capture nonlinear relationships or
interactions in the production data.
For these models, I continue using the linear-family feature set. This
is because mnf_flow was already separated into
mnf_flow_state and mnf_flow_positive_value,
which is safer for models that depend on distances, scaling, or smooth
nonlinear relationships.
I also keep the same preprocessing recipe with Yeo-Johnson, centering,
scaling, and dummy encoding. This is important for SVM and neural
networks because they are sensitive to variable scale.
nonlinear_model_results <- tibble()
I start the nonlinear model section with SVM using a radial basis
kernel.
This model can capture nonlinear patterns by comparing observations
based on distance. Because of that, here scaling is important. I use the
same preprocessing recipe from the previous section, where numeric
variables are transformed, centered, and scaled before modeling.
The main tuning parameters are cost and
rbf_sigma.
cost controls how much the model tries to avoid
errors. A larger cost allows less flexibility for mistakes, while a
smaller cost allows more tolerance.
rbf_sigma controls how local or smooth the radial
kernel is. Smaller values usually create smoother patterns, while larger
values can make the model more sensitive to nearby points.
svm_rbf_spec <- svm_rbf(
cost = tune(),
rbf_sigma = tune()
) |>
set_mode("regression") |>
set_engine("kernlab")
svm_rbf_spec <- svm_rbf(
cost = tune(),
rbf_sigma = tune()) |>
set_mode("regression") |>
set_engine("kernlab")
svm_rbf_workflow <- workflow() |>
add_recipe(linear_recipe) |>
add_model(svm_rbf_spec)
svm_rbf_grid <- grid_regular(
cost(range = c(-3, 5)),
rbf_sigma(range = c(-6, 1)),
levels = c(
cost = 9,
rbf_sigma = 8
)
)
# svm_rbf_grid |>
# kable(
# caption = "SVM Radial Tuning Grid",
# digits = 6
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
set.seed(123)
svm_rbf_start_time <- Sys.time()
svm_rbf_tune <- tune_grid(
svm_rbf_workflow,
resamples = linear_folds,
grid = svm_rbf_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
svm_rbf_end_time <- Sys.time()
svm_rbf_training_time <- as.numeric(
difftime(svm_rbf_end_time, svm_rbf_start_time, units = "secs")
)
svm_rbf_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top SVM Radial Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| cost | rbf_sigma | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 2 | 0.10 | rmse | standard | 0.1201 | 15 | 0.0009 | pre0_mod38_post0 |
| 4 | 0.10 | rmse | standard | 0.1210 | 15 | 0.0008 | pre0_mod46_post0 |
| 8 | 0.01 | rmse | standard | 0.1216 | 15 | 0.0010 | pre0_mod53_post0 |
| 1 | 0.10 | rmse | standard | 0.1218 | 15 | 0.0010 | pre0_mod30_post0 |
| 8 | 0.10 | rmse | standard | 0.1220 | 15 | 0.0008 | pre0_mod54_post0 |
| 4 | 0.01 | rmse | standard | 0.1221 | 15 | 0.0011 | pre0_mod45_post0 |
| 16 | 0.10 | rmse | standard | 0.1225 | 15 | 0.0008 | pre0_mod62_post0 |
| 32 | 0.10 | rmse | standard | 0.1225 | 15 | 0.0008 | pre0_mod70_post0 |
| 16 | 0.01 | rmse | standard | 0.1226 | 15 | 0.0011 | pre0_mod61_post0 |
| 2 | 0.01 | rmse | standard | 0.1234 | 15 | 0.0011 | pre0_mod37_post0 |
best_svm_rbf <- select_best(
svm_rbf_tune,
metric = "rmse"
)
best_svm_rbf |>
kable(
caption = "Best SVM Radial Tuning Parameters",
digits = 6
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| cost | rbf_sigma | .config |
|---|---|---|
| 2 | 0.1 | pre0_mod38_post0 |
In this SVM radial tuning grid found a good region around
cost = 2 and rbf_sigma = 0.10.
As we see in tuning results table above, several of the best tuning
combinations are close to each other, which suggests that the model was
not depending on just one lucky parameter setting. Instead, there seemed
to be a good tuning region.
For example, this tuning grid showed strong results for:
- cost = 2,
rbf_sigma = 0.10 , mean RMSE = 0.1201
- cost = 4,
rbf_sigma = 0.10``, mean RMSE = 0.1210
- cost = 1,
rbf_sigma = 0.10``, mean RMSE = 0.1218
- cost = 8,
rbf_sigma = 0.01``, mean RMSE = 0.1216
Because nearby values had similar RMSE scores, below I will run a
refined tuning grid around that area. This will allowe me to test
smaller changes in cost and rbf_sigma instead
of only using the wider first grid.
svm_rbf_grid |>
summarise(
min_cost = min(cost),
max_cost = max(cost),
min_rbf_sigma = min(rbf_sigma),
max_rbf_sigma = max(rbf_sigma)
)
# A tibble: 1 × 4
min_cost max_cost min_rbf_sigma max_rbf_sigma
<dbl> <dbl> <dbl> <dbl>
1 0.125 32 0.000001 10
svm_rbf_grid_refined <- expand_grid(
cost = c(1, 1.5, 2, 3, 4, 6, 8),
rbf_sigma = c(0.05, 0.075, 0.10, 0.125, 0.15, 0.20)
)
# svm_rbf_grid_refined |>
# kable(
# caption = "Refined SVM Radial Tuning Grid",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
set.seed(123)
svm_rbf_refined_start_time <- Sys.time()
svm_rbf_tune_refined <- tune_grid(
svm_rbf_workflow,
resamples = linear_folds,
grid = svm_rbf_grid_refined,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
svm_rbf_refined_end_time <- Sys.time()
svm_rbf_refined_training_time <- as.numeric(
difftime(
svm_rbf_refined_end_time,
svm_rbf_refined_start_time,
units = "secs"
)
)
best_svm_rbf_refined <- select_best(
svm_rbf_tune_refined,
metric = "rmse"
)
best_svm_rbf_refined |>
kable(
caption = "Best Refined SVM Radial Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| cost | rbf_sigma | .config |
|---|---|---|
| 3 | 0.05 | pre0_mod19_post0 |
The refined tuning selected:
cost: 3
rbf_sigma: 0.05
RMSE: 0.1183
This improved the cross-validation RMSE from 0.1201 in the
first grid to 0.1183 in the refined grid.
This result can indicate that the SVM radial model performs better with
a moderate cost value and a smoother radial kernel. In
practical terms, the model is flexible enough to capture nonlinear
patterns, but it is not using an overly aggressive kernel setting.
final_svm_rbf_workflow <- finalize_workflow(
svm_rbf_workflow,
best_svm_rbf_refined
)
svm_rbf_final_start_time <- Sys.time()
final_svm_rbf_fit <- fit(
final_svm_rbf_workflow,
data = train_linear_model_data
)
svm_rbf_final_end_time <- Sys.time()
svm_rbf_final_training_time <- as.numeric(
difftime(
svm_rbf_final_end_time,
svm_rbf_final_start_time,
units = "secs"
)
)
svm_rbf_validation_predictions <- predict(
final_svm_rbf_fit,
new_data = valid_linear_model_data
) |>
pull(.pred)
svm_rbf_result <- make_and_show_model_result(
workflow = "Nonlinear-family",
model_family = "Nonlinear",
model_name = "SVM Radial",
training_rows = nrow(train_linear_model_data),
validation_rows = nrow(valid_linear_model_data),
tuning_parameters = paste(
"cost =",
best_svm_rbf_refined$cost,
", / rbf_sigma =",
best_svm_rbf_refined$rbf_sigma
),
actual = valid_ph,
predicted = svm_rbf_validation_predictions,
training_time_seconds = svm_rbf_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Nonlinear-family | Nonlinear | SVM Radial | 2052 | 515 | cost = 3 , / rbf_sigma = 0.05 | 0.1112 | 0.0812 | 0.5806 | 0.8783 |
The SVM radial model performed better than the linear-family models on the validation data.
This is an improvement over the linear regression, elastic net, and PLS models, which all had validation RMSE values around 0.1305 and \(R^2\) values around 0.425.
The best SVM radial model used cost = 3 and
rbf_sigma = 0.05. This means the model used a moderate cost
value and a smoother radial kernel.
This result suggests that the relationships between predictors and PH are not only linear. The radial kernel performed better to capture nonlinear patterns that the linear-family models could not capture.
plot_model_diagnostics(
obs = valid_ph,
pred = svm_rbf_validation_predictions,
model_name = "SVM Radial"
)
The SVM radial diagnostic plots show better behavior than the linear-family models.
In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. There are few far points from the line for predictions less than 8.72, and the cluster of predictions above 8.72 PH is much tighter to diagonal line. This suggests that SVM radial basis model has a tendency to over or under predict PH values less than 8.72.
The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread, especially around the middle predicted PH range with some of the predictions with larger negative residuals (overpredicted), but the residual pattern looks more balanced than the linear-family models.
Overall, there are still some larger errors, which means this model is not perfect. However, compared with the earlier linear models, the SVM radial model captures more of the PH variation and handles nonlinear structure better.
nonlinear_model_results <- add_model_result(
results_table = nonlinear_model_results,
new_result = svm_rbf_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = svm_rbf_result
)
nonlinear_model_results |>
arrange(rmse) |>
kable(
caption = "Current Nonlinear Model Comparison",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Nonlinear-family | Nonlinear | SVM Radial | 2052 | 515 | cost = 3 , / rbf_sigma = 0.05 | 0.1112 | 0.0812 | 0.5806 | 0.8783 |
Next, I test a MARS model.
MARS can capture nonlinear relationships by creating “hinged” linear terms. This makes it more flexible than regular linear regression, but usually easier to interpret than models like SVM or neural n
The main tuning parameters are num_terms and
prod_degree.
num_terms controls how many model terms are allowed.
A larger value gives the model more flexibility.
prod_degree controls whether the model can include
interaction terms. A value of 1 allows only main effects, while a value
of 2 allows two-way interactions, like multiplicative
interactions.
mars_spec <- mars(
num_terms = tune(),
prod_degree = tune(),
prune_method = "none") |>
set_mode("regression") |>
set_engine("earth")
mars_workflow <- workflow() |>
add_recipe(linear_recipe) |>
add_model(mars_spec)
mars_grid <- expand_grid(
num_terms = seq(5, 50, by = 5),
prod_degree = c(1, 2)
)
# mars_grid |>
# kable(
# caption = "MARS Tuning Grid"
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
set.seed(123)
mars_start_time <- Sys.time()
mars_tune <- tune_grid(
mars_workflow,
resamples = linear_folds,
grid = mars_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
mars_end_time <- Sys.time()
mars_training_time <- as.numeric(
difftime(mars_end_time, mars_start_time, units = "secs")
)
mars_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top MARS Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| num_terms | prod_degree | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 35 | 2 | rmse | standard | 0.1293 | 15 | 0.0015 | pre0_mod14_post0 |
| 40 | 2 | rmse | standard | 0.1294 | 15 | 0.0014 | pre0_mod16_post0 |
| 50 | 2 | rmse | standard | 0.1295 | 15 | 0.0016 | pre0_mod20_post0 |
| 45 | 2 | rmse | standard | 0.1295 | 15 | 0.0016 | pre0_mod18_post0 |
| 30 | 2 | rmse | standard | 0.1297 | 15 | 0.0015 | pre0_mod12_post0 |
| 25 | 2 | rmse | standard | 0.1315 | 15 | 0.0014 | pre0_mod10_post0 |
| 20 | 2 | rmse | standard | 0.1341 | 15 | 0.0013 | pre0_mod08_post0 |
| 15 | 1 | rmse | standard | 0.1353 | 15 | 0.0010 | pre0_mod05_post0 |
| 15 | 2 | rmse | standard | 0.1366 | 15 | 0.0010 | pre0_mod06_post0 |
| 10 | 2 | rmse | standard | 0.1380 | 15 | 0.0011 | pre0_mod04_post0 |
best_mars <- select_best(
mars_tune,
metric = "rmse"
)
best_mars |>
kable(
caption = "Best MARS Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| num_terms | prod_degree | .config |
|---|---|---|
| 35 | 2 | pre0_mod14_post0 |
The best MARS model selected num_terms = 35 and
prod_degree = 2.
This means the best version allowed up to 35 model terms and included
two-way interaction terms. The top MARS results all used
prod_degree = 2, which suggests that interactions between
predictors helped the model more than main effects only.
The best cross-validation RMSE was 0.1296. This is close to the linear-family models, but it is not as strong as the SVM radial model.
This result suggests that MARS added some flexibility through piecewise linear terms and interactions, but it still did not capture the PH patterns as well as the radial SVM model.
final_mars_workflow <- finalize_workflow(
mars_workflow,
best_mars
)
mars_final_start_time <- Sys.time()
final_mars_fit <- fit(
final_mars_workflow,
data = train_linear_model_data
)
mars_final_end_time <- Sys.time()
mars_final_training_time <- as.numeric(
difftime(
mars_final_end_time,
mars_final_start_time,
units = "secs"
)
)
mars_validation_predictions <- predict(
final_mars_fit,
new_data = valid_linear_model_data) |>
pull(.pred)
mars_result <- make_and_show_model_result(
workflow = "Nonlinear-family",
model_family = "Nonlinear",
model_name = "MARS",
training_rows = nrow(train_linear_model_data),
validation_rows = nrow(valid_linear_model_data),
tuning_parameters = paste(
"num_terms =",
best_mars$num_terms,
", prod_degree =",
best_mars$prod_degree
),
actual = valid_ph,
predicted = mars_validation_predictions,
training_time_seconds = mars_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Nonlinear-family | Nonlinear | MARS | 2052 | 515 | num_terms = 35 , prod_degree = 2 | 0.1212 | 0.0907 | 0.5013 | 0.3193 |
The final MARS model used:
num_terms: 35prod_degree: 2Its validation performance was:
This result is better than overall linear-family models, but weaker than SVM radial.
The selected prod_degree = 2 means the model used
two-way interaction terms. This indicate that some relationships between
predictors and PH are not just simple one-variable effects. The model
improved by allowing predictors to work together.
The selected num_terms = 35 means MARS needed a flexible
model with many piecewise terms.
It looks like MARS captured more nonlinear structure than linear regression, elastic net, and PLS. This suggests that piecewise linear interactions helped, but they were still not flexible enough to capture the full data pattern.
plot_model_diagnostics(
obs = valid_ph,
pred = mars_validation_predictions,
model_name = "MARS"
)
MARS diagnostic plots also show better behavior than the linear-family models.
In the predicted vs observed plot, the points follow a clearer upward pattern and are tighter to the diagonal line compared to linear models. But the model still has some over and under predictions
The residual plot shows that most residuals are centered and tighter around zero. There is still noticeable spread. As we see around the middle predicted PH range some of the predictions with larger residuals (overpredicted), and it the same tendency as we saw in SVM radial diagnostics plots.
nonlinear_model_results <- add_model_result(
results_table = nonlinear_model_results,
new_result = mars_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = mars_result
)
Next, I test a neural network model.
Neural networks can capture nonlinear relationships, but they are also more sensitive to preprocessing and tuning. Because of this, I use the same linear-family preprocessing recipe with Yeo-Johnson, centering, scaling, and dummy encoding.
The main tuning parameters are:
hidden_units: number of neurons in the hidden
layerpenalty: regularization strength to reduce
overfittingepochs: number of training iterationsnn_spec <- mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) |>
set_mode("regression") |>
set_engine(
"nnet",
trace = FALSE,
MaxNWts = 5000
)
nn_workflow <- workflow() |>
add_recipe(linear_recipe) |>
add_model(nn_spec)
nn_grid <- expand_grid(
hidden_units = c(3, 5, 7, 10),
penalty = c(0.0001, 0.001, 0.01, 0.1),
epochs = c(100, 200, 500)
)
# nn_grid |>
# kable(
# caption = "Neural Network Tuning Grid",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
set.seed(123)
nn_start_time <- Sys.time()
nn_tune <- tune_grid(
nn_workflow,
resamples = linear_folds,
grid = nn_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
nn_end_time <- Sys.time()
nn_training_time <- as.numeric(
difftime(nn_end_time, nn_start_time, units = "secs")
)
nn_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top Neural Network Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.10 | 100 | rmse | standard | 0.1264 | 15 | 0.0017 | pre0_mod46_post0 |
| 5 | 0.10 | 500 | rmse | standard | 0.1283 | 15 | 0.0010 | pre0_mod24_post0 |
| 3 | 0.01 | 500 | rmse | standard | 0.1292 | 15 | 0.0013 | pre0_mod09_post0 |
| 7 | 0.10 | 200 | rmse | standard | 0.1292 | 15 | 0.0010 | pre0_mod35_post0 |
| 5 | 0.10 | 200 | rmse | standard | 0.1295 | 15 | 0.0015 | pre0_mod23_post0 |
| 3 | 0.10 | 500 | rmse | standard | 0.1295 | 15 | 0.0013 | pre0_mod12_post0 |
| 7 | 0.10 | 500 | rmse | standard | 0.1295 | 15 | 0.0019 | pre0_mod36_post0 |
| 5 | 0.01 | 100 | rmse | standard | 0.1306 | 15 | 0.0014 | pre0_mod19_post0 |
| 7 | 0.01 | 100 | rmse | standard | 0.1307 | 15 | 0.0017 | pre0_mod31_post0 |
| 5 | 0.01 | 200 | rmse | standard | 0.1309 | 15 | 0.0011 | pre0_mod20_post0 |
best_nn <- select_best(
nn_tune,
metric = "rmse"
)
best_nn |>
kable(
caption = "Best Neural Network Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .config |
|---|---|---|---|
| 10 | 0.1 | 100 | pre0_mod46_post0 |
The first neural network tuning grid selected
hidden_units = 10, penalty = 0.10, and
epochs = 100.
Since hidden_units = 10 and penalty = 0.10
were both at the upper end of the first grid, I will run a new refined
grid around that area in following code. This will check if slightly
larger network or stronger regularization can improve accuracy.
I also will change the epoch values to the lower range (50, 100, 150) because the first grid did not show improvement from using 200 or 500 epochs.
nn_grid_refined <- expand_grid(
hidden_units = c(8, 10, 12, 15, 20),
penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
epochs = c(50, 100, 150)
)
# nn_grid_refined |>
# kable(
# caption = "Refined Neural Network Tuning Grid",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
set.seed(123)
nn_refined_start_time <- Sys.time()
nn_tune_refined <- tune_grid(
nn_workflow,
resamples = linear_folds,
grid = nn_grid_refined,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
nn_refined_end_time <- Sys.time()
nn_refined_training_time <- as.numeric(
difftime(
nn_refined_end_time,
nn_refined_start_time,
units = "secs"
)
)
nn_tune_refined |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top Refined Neural Network Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 12 | 0.5 | 150 | rmse | standard | 0.1232 | 15 | 0.0013 | pre0_mod45_post0 |
| 15 | 0.5 | 150 | rmse | standard | 0.1234 | 15 | 0.0014 | pre0_mod60_post0 |
| 12 | 0.3 | 150 | rmse | standard | 0.1247 | 15 | 0.0014 | pre0_mod42_post0 |
| 10 | 0.5 | 150 | rmse | standard | 0.1251 | 15 | 0.0011 | pre0_mod30_post0 |
| 12 | 0.5 | 100 | rmse | standard | 0.1253 | 15 | 0.0017 | pre0_mod44_post0 |
| 15 | 0.3 | 150 | rmse | standard | 0.1259 | 15 | 0.0014 | pre0_mod57_post0 |
| 15 | 0.5 | 100 | rmse | standard | 0.1259 | 15 | 0.0012 | pre0_mod59_post0 |
| 12 | 0.2 | 100 | rmse | standard | 0.1261 | 15 | 0.0012 | pre0_mod38_post0 |
| 15 | 0.3 | 100 | rmse | standard | 0.1268 | 15 | 0.0013 | pre0_mod56_post0 |
| 10 | 0.3 | 100 | rmse | standard | 0.1273 | 15 | 0.0014 | pre0_mod26_post0 |
best_nn_refined <- select_best(
nn_tune_refined,
metric = "rmse"
)
best_nn_refined |>
kable(
caption = "Best Refined Neural Network Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .config |
|---|---|---|---|
| 12 | 0.5 | 150 | pre0_mod45_post0 |
The refined neural network tuning improved the model compared with the first tuning grid.
The first grid selected:
hidden_units: 10penalty: 0.10epochs: 100The refined grid selected:
hidden_units: 12penalty: 0.50epochs: 150This shows that the neural network benefited from a slightly larger
hidden layer and stronger regularization. The stronger
penalty = 0.50 suggests that the model needed shrinkage to
avoid overfitting.
final_nn_workflow <- finalize_workflow(
nn_workflow,
best_nn_refined
)
nn_final_start_time <- Sys.time()
final_nn_fit <- fit(
final_nn_workflow,
data = train_linear_model_data
)
nn_final_end_time <- Sys.time()
nn_final_training_time <- as.numeric(
difftime(
nn_final_end_time,
nn_final_start_time,
units = "secs"
)
)
nn_validation_predictions <- predict(
final_nn_fit,
new_data = valid_linear_model_data
) |>
pull(.pred)
nn_result <- make_and_show_model_result(
workflow = "Nonlinear-family",
model_family = "Nonlinear",
model_name = "Neural Network",
training_rows = nrow(train_linear_model_data),
validation_rows = nrow(valid_linear_model_data),
tuning_parameters = paste(
"hidden_units =",
best_nn_refined$hidden_units,
", penalty =",
best_nn_refined$penalty,
", epochs =",
best_nn_refined$epochs
),
actual = valid_ph,
predicted = nn_validation_predictions,
training_time_seconds = nn_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Nonlinear-family | Nonlinear | Neural Network | 2052 | 515 | hidden_units = 12 , penalty = 0.5 , epochs = 150 | 0.1226 | 0.0926 | 0.4984 | 1.0357 |
The refined neural network model improved over the linear-family models, but it did not outperform the SVM radial model.
The final neural network used:
hidden_units: 12penalty: 0.5epochs: 150Its validation performance was:
This is better than the linear regression, elastic net, and PLS models, which had RMSE values around 0.1305 and \(R^2\) values around 0.425. This suggests that the neural network captured some nonlinear structure that the linear-family models missed.
However, the neural network did not perform as well as the SVM radial model. SVM radial still had lower RMSE and MAE and a higher \(R^2\) on the validation data.
The stronger penalty value, penalty = 0.5, also tells us
that regularization was important for the neural network. Without enough
regularization, the neural network possibly could overfit training
data.
p1 <- plot_model_diagnostics(
obs = valid_ph,
pred = nn_validation_predictions,
model_name = "Neural Network"
)
p1
nonlinear_model_results <- add_model_result(
results_table = nonlinear_model_results,
new_result = nn_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = nn_result
)
I also test the neural network using the tree-family feature set.
This is an experimental comparison because the tree-family data keeps
the original mnf_flow variable. Neural networks are
sensitive to scale and numeric distances, so this version may or may not
work better than the linear-family version.
To make this safer, I still use Yeo-Johnson, centering, scaling, and
dummy encoding. However, I exclude mnf_flow from
Yeo-Johnson because it may represent special process states, not just a
regular continuous measurement.
train_tree_nn_model_data <- train_features_tree_family |>
mutate(ph = train_ph)
valid_tree_nn_model_data <- valid_features_tree_family |>
mutate(ph = valid_ph)
exclude_from_tree_nn_yj <- c(
"pressure_setpoint",
"bowl_setpoint",
"mnf_flow"
)
tree_nn_yj_vars <- train_features_tree_family |>
select(where(is.numeric)) |>
select(-any_of(exclude_from_tree_nn_yj)) |>
names()
tibble(
yeo_johnson_variable = tree_nn_yj_vars
) |>
kable(
caption = "Tree-Family Variables Selected for Yeo-Johnson in Neural Network"
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| yeo_johnson_variable |
|---|
| carb_volume |
| fill_ounces |
| pc_volume |
| carb_pressure |
| carb_temp |
| psc |
| psc_fill |
| psc_co2 |
| carb_pressure1 |
| fill_pressure |
| hyd_pressure1 |
| hyd_pressure2 |
| hyd_pressure3 |
| hyd_pressure4 |
| filler_level |
| filler_speed |
| temperature |
| usage_cont |
| carb_flow |
| density |
| mfr |
| balling |
| pressure_vacuum |
| oxygen_filler |
| air_pressurer |
| alch_rel |
| carb_rel |
| balling_lvl |
tree_nn_recipe <- recipe(
ph ~ .,
data = train_tree_nn_model_data
) |>
step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())
tree_nn_recipe <- recipe(
ph ~ .,
data = train_tree_nn_model_data
) |>
step_YeoJohnson(all_of(tree_nn_yj_vars)) |>
step_center(all_numeric_predictors()) |>
step_scale(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())
tree_nn_grid <- expand_grid(
hidden_units = c(8, 10, 12, 15, 20),
penalty = c(0.05, 0.10, 0.20, 0.30, 0.50),
epochs = c(50, 100, 150)
)
# tree_nn_grid |>
# kable(
# caption = "Tree-Family Neural Network Tuning Grid",
# digits = 4
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
tree_nn_folds <- vfold_cv(
train_tree_nn_model_data,
v = 5,
repeats = 3
)
tree_nn_spec <- mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) |>
set_mode("regression") |>
set_engine(
"nnet",
trace = FALSE,
MaxNWts = 5000
)
tree_nn_workflow <- workflow() |>
add_recipe(tree_nn_recipe) |>
add_model(tree_nn_spec)
set.seed(123)
tree_nn_start_time <- Sys.time()
tree_nn_tune <- tune_grid(
tree_nn_workflow,
resamples = tree_nn_folds,
grid = tree_nn_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
tree_nn_end_time <- Sys.time()
tree_nn_training_time <- as.numeric(
difftime(
tree_nn_end_time,
tree_nn_start_time,
units = "secs"
)
)
tree_nn_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top Tree-Family Neural Network Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 15 | 0.5 | 150 | rmse | standard | 0.1220 | 15 | 0.0017 | pre0_mod60_post0 |
| 15 | 0.3 | 150 | rmse | standard | 0.1234 | 15 | 0.0020 | pre0_mod57_post0 |
| 12 | 0.5 | 150 | rmse | standard | 0.1247 | 15 | 0.0021 | pre0_mod45_post0 |
| 12 | 0.2 | 150 | rmse | standard | 0.1251 | 15 | 0.0019 | pre0_mod39_post0 |
| 10 | 0.5 | 150 | rmse | standard | 0.1253 | 15 | 0.0019 | pre0_mod30_post0 |
| 12 | 0.3 | 150 | rmse | standard | 0.1258 | 15 | 0.0020 | pre0_mod42_post0 |
| 15 | 0.2 | 100 | rmse | standard | 0.1260 | 15 | 0.0018 | pre0_mod53_post0 |
| 15 | 0.5 | 100 | rmse | standard | 0.1261 | 15 | 0.0026 | pre0_mod59_post0 |
| 12 | 0.5 | 100 | rmse | standard | 0.1263 | 15 | 0.0025 | pre0_mod44_post0 |
| 15 | 0.3 | 100 | rmse | standard | 0.1267 | 15 | 0.0019 | pre0_mod56_post0 |
best_tree_nn <- select_best(
tree_nn_tune,
metric = "rmse"
)
best_tree_nn |>
kable(
caption = "Best Tree-Family Neural Network Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| hidden_units | penalty | epochs | .config |
|---|---|---|---|
| 15 | 0.5 | 150 | pre0_mod60_post0 |
final_tree_nn_workflow <- finalize_workflow(
tree_nn_workflow,
best_tree_nn
)
tree_nn_final_start_time <- Sys.time()
final_tree_nn_fit <- fit(
final_tree_nn_workflow,
data = train_tree_nn_model_data
)
tree_nn_final_end_time <- Sys.time()
tree_nn_final_training_time <- as.numeric(
difftime(
tree_nn_final_end_time,
tree_nn_final_start_time,
units = "secs"
)
)
tree_nn_validation_predictions <- predict(
final_tree_nn_fit,
new_data = valid_tree_nn_model_data
) |>
pull(.pred)
tree_nn_result <- make_and_show_model_result(
workflow = "Tree-family",
model_family = "Nonlinear",
model_name = "Neural Network with Tree-Family Features",
training_rows = nrow(train_tree_nn_model_data),
validation_rows = nrow(valid_tree_nn_model_data),
tuning_parameters = paste(
"hidden_units =",
best_tree_nn$hidden_units,
", penalty =",
best_tree_nn$penalty,
", epochs =",
best_tree_nn$epochs
),
actual = valid_ph,
predicted = tree_nn_validation_predictions,
training_time_seconds = tree_nn_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Tree-family | Nonlinear | Neural Network with Tree-Family Features | 2052 | 515 | hidden_units = 15 , penalty = 0.5 , epochs = 150 | 0.1432 | 0.0989 | 0.366 | 1.209 |
I tested two neural network versions.
The first neural network used the nonlinear-family data set. It selected:
hidden_units = 12penalty = 0.5epochs = 150Its validation performance was:
The second neural network used the tree-family data set. It selected:
hidden_units = 15penalty = 0.5epochs = 150Its validation performance was:
The regular neural network performed better than the neural network with tree-family dataset. Its RMSE and MAE were lower, and its \(R^2\) was higher.
This means that the tree-family feature version did not improve neural network performance. In fact, it made the neural network worse. This may be because the neural network was more sensitive to how the predictors were represented and scaled.
Overall, the neural network trained/tested with linear family dataset was the better neural network version.
p2 <- plot_model_diagnostics(
obs = valid_ph,
pred = tree_nn_validation_predictions,
model_name = "Neural Network with Tree-Family Features"
)
p1/p2
I also compared the diagnostic plots for the two neural network versions.
The first neural network plot shows that the model captured the general PH trend better than the linear models. Most points follow the diagonal direction, but there is still a noticeable spread around the line. This means the model improved over simple linear models, but still had some larger prediction errors.
The residual plot for the first neural network is mostly centered around zero, but there are still some clear overprediction and underprediction cases. A few residuals are close to -0.50 and above +0.50, which means the model struggled with some validation observations.
The second neural network was trained and tested using the
tree-family dataset. Its Predicted vs Observed plot shows a slightly
different pattern. The points look more concentrated in the middle and
upper PH range, but the overall prediction pattern is weaker.
The residual plot for the second neural network shows that most errors
are still close to zero, but there are also a few larger outliers. There
are a couple of observations that were severely underpredicted, which
can be seen on the left side of the diagnostic plots.
The regular neural network performed better than the neural network with
tree-family features. The tree-family feature version did not improve
neural network performance and actually produced weaker test
results.
This comparison also supports my decision to create separate feature sets for different model families. The same feature engineering approach does not always help every model. As we see, some models benefit from one predictor structure, while other models may perform worse with it.
nonlinear_model_results <- add_model_result(
results_table = nonlinear_model_results,
new_result = tree_nn_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = tree_nn_result
)
nonlinear_model_results |>
arrange(rmse) |>
kable(
caption = "Updated Nonlinear Model Comparison",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Nonlinear-family | Nonlinear | SVM Radial | 2052 | 515 | cost = 3 , / rbf_sigma = 0.05 | 0.1112 | 0.0812 | 0.5806 | 0.8783 |
| Nonlinear-family | Nonlinear | MARS | 2052 | 515 | num_terms = 35 , prod_degree = 2 | 0.1212 | 0.0907 | 0.5013 | 0.3193 |
| Nonlinear-family | Nonlinear | Neural Network | 2052 | 515 | hidden_units = 12 , penalty = 0.5 , epochs = 150 | 0.1226 | 0.0926 | 0.4984 | 1.0357 |
| Tree-family | Nonlinear | Neural Network with Tree-Family Features | 2052 | 515 | hidden_units = 15 , penalty = 0.5 , epochs = 150 | 0.1432 | 0.0989 | 0.3660 | 1.2090 |
The nonlinear model comparison shows that SVM radial performed best in this model family.
SVM radial:
MARS:
Neural network with nonlinear-family dataset:
Neural network with tree-family dataset:
The SVM with radial basis kernel model had the lowest RMSE and MAE and the highest \(R^2\) among the nonlinear models. This suggests that the radial kernel was better able to capture nonlinear relationships in the beverage production data.
MARS was the second-best nonlinear model by RMSE. It performed slightly better than both neural network versions.
The regular neural network performed better than the neural network trained with tree-family dataset. The regular neural network had lower RMSE, lower MAE, and higher \(R^2\).
The neural network with tree-family features did not improve model performance. Its RMSE increased to 0.1432, and its \(R^2\) dropped to 0.3660. This means that using the tree-family dataset made the neural network weaker in this case.
Overall, SVM radial was the strongest nonlinear model. MARS and the regular neural network also performed better than the linear-family models, but they were not as strong as SVM radial. The nonlinear model results still support the idea that PH prediction depends on nonlinear patterns or interactions that regular linear models do not capture well.
After testing linear, regularized, and nonlinear models, I now test tree-based models.
For these models, I use the tree-family feature set. This version
keeps the original mnf_flow variable, because tree-based
models can naturally split on different values and ranges.
I will not use math transformations, centering, scaling, or correlation filtering for tree-based models. Tree-based models do not depend on variables being on the same scale, and they can handle nonlinear relationships and interactions directly.
train_tree_model_data <- train_features_tree_family |>
mutate(ph = train_ph)
valid_tree_model_data <- valid_features_tree_family |>
mutate(ph = valid_ph)
set.seed(123)
tree_folds <- vfold_cv(
train_tree_model_data,
v = 5,
repeats = 3
)
I will use 5-fold cross-validation repeated 3 times for each models.
tree_model_results <- tibble()
I start the tree-based section with Random Forest.
Random Forest builds many decision trees and averages their predictions. Each tree sees a slightly different version of the data, and each split only considers a random subset of predictors.
The main tuning parameters are:
mtry: how many predictors are randomly spread at each
splitmin_n: the minimum number of observations allowed in a
terminal nodetrees: the number of trees in the forestA smaller min_n allows more detailed trees, while a
larger min_n creates smoother predictions. A larger
mtry lets each split consider more predictors, but it is
possible that it creates trees more similar to each other.
rf_spec <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 750
) |>
set_mode("regression") |>
set_engine(
"ranger",
importance = "impurity",
num.threads = 1
)
rf_workflow <- workflow() |>
add_formula(ph ~ .) |>
add_model(rf_spec)
set.seed(123)
rf_grid_refined <- expand_grid(
mtry = seq(8, 24, by = 2),
min_n = seq(1, 6, by = 1)
)
# rf_grid |>
# kable(
# caption = "Random Forest Tuning Grid"
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
#cl <- start_parallel(cores_to_leave_free = 1)
set.seed(123)
rf_refined_start_time <- Sys.time()
rf_tune_refined <- tune_grid(
rf_workflow,
resamples = tree_folds,
grid = rf_grid_refined,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(save_pred = TRUE)
)
rf_refined_end_time <- Sys.time()
rf_refined_training_time <- as.numeric(
difftime(
rf_refined_end_time,
rf_refined_start_time,
units = "secs"
)
)
#stop_parallel(cl)
rf_tune_refined |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top Refined Random Forest Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 24 | 1 | rmse | standard | 0.1052 | 15 | 0.0011 | pre0_mod49_post0 |
| 24 | 3 | rmse | standard | 0.1053 | 15 | 0.0012 | pre0_mod51_post0 |
| 22 | 1 | rmse | standard | 0.1054 | 15 | 0.0012 | pre0_mod43_post0 |
| 24 | 2 | rmse | standard | 0.1054 | 15 | 0.0011 | pre0_mod50_post0 |
| 22 | 2 | rmse | standard | 0.1054 | 15 | 0.0011 | pre0_mod44_post0 |
| 20 | 3 | rmse | standard | 0.1054 | 15 | 0.0011 | pre0_mod39_post0 |
| 22 | 3 | rmse | standard | 0.1055 | 15 | 0.0011 | pre0_mod45_post0 |
| 24 | 4 | rmse | standard | 0.1055 | 15 | 0.0011 | pre0_mod52_post0 |
| 20 | 1 | rmse | standard | 0.1055 | 15 | 0.0011 | pre0_mod37_post0 |
| 22 | 4 | rmse | standard | 0.1055 | 15 | 0.0011 | pre0_mod46_post0 |
best_rf_refined <- select_best(
rf_tune_refined,
metric = "rmse"
)
best_rf_refined |>
kable(
caption = "Best Refined Random Forest Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mtry | min_n | .config |
|---|---|---|
| 24 | 1 | pre0_mod49_post0 |
The Random Forest tuning selected:
mtry: 24 , meaning that at each split, the model
randomly considers 24 predictors.min_n: 1 meaning the model preferred detailed
trees.final_rf_workflow <- finalize_workflow(
rf_workflow,
best_rf_refined
)
rf_final_start_time <- Sys.time()
final_rf_fit <- fit(
final_rf_workflow,
data = train_tree_model_data
)
rf_final_end_time <- Sys.time()
rf_final_training_time <- as.numeric(
difftime(
rf_final_end_time,
rf_final_start_time,
units = "secs"
)
)
rf_validation_predictions <- predict(
final_rf_fit,
new_data = valid_tree_model_data
) |>
pull(.pred)
rf_result <- make_and_show_model_result(
workflow = "Tree-family",
model_family = "Tree-Based",
model_name = "Random Forest",
training_rows = nrow(train_tree_model_data),
validation_rows = nrow(valid_tree_model_data),
tuning_parameters = paste(
"mtry =",
best_rf_refined$mtry,
", min_n =",
best_rf_refined$min_n,
", trees = 1000"
),
actual = valid_ph,
predicted = rf_validation_predictions,
training_time_seconds = rf_refined_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Tree-family | Tree-Based | Random Forest | 2052 | 515 | mtry = 24 , min_n = 1 , trees = 1000 | 0.0984 | 0.0706 | 0.6835 | 2200.34 |
On the validation data Random Forest model performed much better compared to previous models.
The final Random Forest model used:
mtry: 24min_n: 1trees: 1000Its validation performance was:
This is the strongest test performance so far. Compared with the earlier nonlinear models, RF model has lower RMSE and a higher \(R^2\).
Overall, this result suggests that PH prediction depends on nonlinear relationships and interactions that RF model can capture data structure better than the linear-family and earlier nonlinear models.
rf_diagnostiv_plots <- plot_model_diagnostics(
obs = valid_ph,
pred = rf_validation_predictions,
model_name = "Random Forest"
)
rf_diagnostiv_plots
In the predicted vs observed plot, the points are tighter to the diagonal reference line compared to earlier models. This means the Random Forest predictions are better aligned with the actual PH values.
The residual plot shows that most residuals are centered around zero. There are still few larger residuals, especially in the middle predicted PH range, and most of residual are in the range of -0.25 to 0.25.
Overall, the diagnostic plots support the validation metrics.
tree_model_results <- add_model_result(
results_table = tree_model_results,
new_result = rf_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = rf_result
)
Next, I test a Cubist model.
Cubist is different from Random Forest because it does not only average many trees. Instead, it creates rule-based groups and fits linear models inside those rules.
This can be useful because PH may behave differently under different production conditions.
The main tuning parameters are:
committees: how many Cubist model committees are
usedneighbors: how many nearby observations are used to
adjust predictionsLarger committees can make the model stronger, but at
the same time model becomes more complex. The neighbors
parameter can help adjust predictions based on nearby observations,
which may improve accuracy.
cubist_spec <- cubist_rules(
committees = tune(),
neighbors = tune()
) |>
set_mode("regression") |>
set_engine("Cubist")
cubist_workflow <- workflow() |>
add_formula(ph ~ .) |>
add_model(cubist_spec)
cubist_grid <- expand_grid(
committees = c(15, 20, 25, 35, 40, 45, 50),
neighbors = c(0, 3, 5, 7, 9, 11)
)
#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)
#cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)
set.seed(123)
cubist_start_time <- Sys.time()
cubist_tune <- tune_grid(
cubist_workflow,
resamples = tree_folds,
grid = cubist_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(
save_pred = TRUE,
verbose = TRUE,
allow_par = TRUE
)
)
cubist_end_time <- Sys.time()
cubist_training_time <- as.numeric(
difftime(
cubist_end_time,
cubist_start_time,
units = "secs"
)
)
#stopCluster(cl)
#registerDoSEQ()
cubist_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top Cubist Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| committees | neighbors | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 50 | 5 | rmse | standard | 0.1003 | 15 | 0.0009 | pre0_mod39_post0 |
| 35 | 5 | rmse | standard | 0.1004 | 15 | 0.0010 | pre0_mod21_post0 |
| 45 | 5 | rmse | standard | 0.1005 | 15 | 0.0009 | pre0_mod33_post0 |
| 40 | 5 | rmse | standard | 0.1006 | 15 | 0.0009 | pre0_mod27_post0 |
| 50 | 7 | rmse | standard | 0.1008 | 15 | 0.0011 | pre0_mod40_post0 |
| 25 | 5 | rmse | standard | 0.1008 | 15 | 0.0010 | pre0_mod15_post0 |
| 35 | 7 | rmse | standard | 0.1009 | 15 | 0.0011 | pre0_mod22_post0 |
| 45 | 7 | rmse | standard | 0.1009 | 15 | 0.0011 | pre0_mod34_post0 |
| 40 | 7 | rmse | standard | 0.1011 | 15 | 0.0011 | pre0_mod28_post0 |
| 50 | 3 | rmse | standard | 0.1012 | 15 | 0.0008 | pre0_mod38_post0 |
best_cubist <- select_best(
cubist_tune,
metric = "rmse"
)
best_cubist |>
kable(
caption = "Best Cubist Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| committees | neighbors | .config |
|---|---|---|
| 50 | 5 | pre0_mod39_post0 |
final_cubist_workflow <- finalize_workflow(
cubist_workflow,
best_cubist
)
cubist_final_start_time <- Sys.time()
final_cubist_fit <- fit(
final_cubist_workflow,
data = train_tree_model_data
)
cubist_final_end_time <- Sys.time()
cubist_final_training_time <- as.numeric(
difftime(
cubist_end_time,
cubist_start_time,
units = "secs"
)
)
cubist_validation_predictions <- predict(
final_cubist_fit,
new_data = valid_tree_model_data
) |>
pull(.pred)
cubist_result <- make_and_show_model_result(
workflow = "Tree-family",
model_family = "Tree-Based",
model_name = "Cubist",
training_rows = nrow(train_tree_model_data),
validation_rows = nrow(valid_tree_model_data),
tuning_parameters = paste(
"committees =",
best_cubist$committees,
", neighbors =",
best_cubist$neighbors
),
actual = valid_ph,
predicted = cubist_validation_predictions,
training_time_seconds = cubist_final_training_time
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Tree-family | Tree-Based | Cubist | 2052 | 515 | committees = 50 , neighbors = 5 | 0.0996 | 0.0682 | 0.665 | 393.0181 |
The Cubist model also performed very well on the validation data.
The final Cubist model used:
committees: 50neighbors: 5Its validation performance was:
The selected committees = 50 means Cubist performed
better with combining many rule-based models. The selected
neighbors = 5 means it also used nearby observations to
adjust predictions, which probably helped improve local accuracy. Cubist
is a strong tree/rule-based model for this dataset and is competitive
with RF.
cubist_diagnostic_plots <- plot_model_diagnostics(
obs = valid_ph,
pred = cubist_validation_predictions,
model_name = "Cubist"
)
cubist_diagnostic_plots
In the predicted vs observed plot, the points follow the diagonal
reference line closely. The spread around the line is present with
several over and under predicted values, but it is tighter than most
earlier models.
The residual plot also looks good overall. Majority of residuals are
centered around zero, but there are several values in the midrange being
over and under predicted.
Compared with Random Forest, Cubist has slightly higher RMSE and lower
\(R^2\), but slightly better MAE. The
diagnostic plot supports that idea because many predictions are close to
the actual values, even though a few larger errors remain.
tree_model_results <- add_model_result(
results_table = tree_model_results,
new_result = cubist_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = cubist_result
)
For the XGBoost tuning grid, here I will keep the grid smaller than the full possible tuning space.
I will use tree_depth = 2 and 3 because the
dataset is not very large, and very deep trees might overfit. I will use
learning rates of 0.03 and 0.05 because they
allow gradual boosting without making the model too slow. I test
trees = 500 and 1000 to compare a medium and
larger number of boosting rounds.
I will keep sample_size = 0.80 fixed for the first pass
to reduce the grid size. Also I will tune mtry,
min_n, and loss_reduction because they affect
how complex or conservative the boosted trees become.
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
min_n = tune(),
sample_size = tune(),
mtry = tune()
) |>
set_mode("regression") |>
set_engine(
"xgboost",
nthread = 1
)
xgb_workflow <- workflow() |>
add_formula(ph ~ .) |>
add_model(xgb_spec)
xgb_grid <- expand_grid(
trees = c(1000),
tree_depth = c(2, 3, 4),
learn_rate = c(0.03, 0.05, 0.07),
loss_reduction = c(0, 0.01),
min_n = c(2, 5),
sample_size = c(0.80),
mtry = c(16, 24, 28)
)
# xgb_grid |>
# kable(
# caption = "XGBoost Tuning Grid"
# ) |>
# kable_styling(
# full_width = FALSE,
# position = "center"
# )
#num_cores_to_use <- max(1, parallel::detectCores(logical = TRUE) - 1)
xgb_start_time <- Sys.time()
##cl <- makePSOCKcluster(num_cores_to_use)
#registerDoParallel(cl)
set.seed(123)
xgb_tune <- tune_grid(
xgb_workflow,
resamples = tree_folds,
grid = xgb_grid,
metrics = metric_set(rmse, mae, rsq),
control = control_grid(
save_pred = TRUE,
verbose = TRUE,
allow_par = TRUE
)
)
xgb_end_time <- Sys.time()
xgb_training_time <- as.numeric(
difftime(
xgb_end_time,
xgb_start_time,
units = "secs"))
#stopCluster(cl)
#registerDoSEQ()
xgb_tune |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
slice(1:10) |>
kable(
caption = "Top XGBoost Tuning Results by RMSE",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mtry | trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 1000 | 2 | 4 | 0.05 | 0 | 0.8 | rmse | standard | 0.1038 | 15 | 0.0012 | pre0_mod051_post0 |
| 28 | 1000 | 2 | 4 | 0.07 | 0 | 0.8 | rmse | standard | 0.1040 | 15 | 0.0011 | pre0_mod089_post0 |
| 24 | 1000 | 5 | 4 | 0.05 | 0 | 0.8 | rmse | standard | 0.1041 | 15 | 0.0011 | pre0_mod069_post0 |
| 24 | 1000 | 2 | 4 | 0.07 | 0 | 0.8 | rmse | standard | 0.1042 | 15 | 0.0012 | pre0_mod053_post0 |
| 28 | 1000 | 2 | 4 | 0.03 | 0 | 0.8 | rmse | standard | 0.1043 | 15 | 0.0011 | pre0_mod085_post0 |
| 24 | 1000 | 5 | 4 | 0.03 | 0 | 0.8 | rmse | standard | 0.1043 | 15 | 0.0010 | pre0_mod067_post0 |
| 28 | 1000 | 5 | 4 | 0.05 | 0 | 0.8 | rmse | standard | 0.1044 | 15 | 0.0011 | pre0_mod105_post0 |
| 28 | 1000 | 2 | 4 | 0.05 | 0 | 0.8 | rmse | standard | 0.1044 | 15 | 0.0010 | pre0_mod087_post0 |
| 24 | 1000 | 2 | 4 | 0.03 | 0 | 0.8 | rmse | standard | 0.1045 | 15 | 0.0011 | pre0_mod049_post0 |
| 24 | 1000 | 5 | 4 | 0.07 | 0 | 0.8 | rmse | standard | 0.1045 | 15 | 0.0011 | pre0_mod071_post0 |
best_xgb <- select_best(
xgb_tune,
metric = "rmse"
)
best_xgb |>
kable(
caption = "Best XGBoost Tuning Parameters",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| mtry | trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .config |
|---|---|---|---|---|---|---|---|
| 24 | 1000 | 2 | 4 | 0.05 | 0 | 0.8 | pre0_mod051_post0 |
The grid tested:
trees: 1000
tree_depth: 2, 3, 4
learn_rate: 0.03, 0.05, 0.07
loss_reduction: 0, 0.01
min_n: 2, 5
sample_size: 0.80
mtry: 16, 24, 28
The best XGBoost tuning result selected:
mtry: 24trees: 1000min_n: 2tree_depth: 4learn_rate: 0.05loss_reduction: 0sample_size: 0.80This means the best XGBoost model used 1000 boosted trees with
moderate tree depth. A tree_depth of 4 allows the model to
capture interactions between predictors, but it is not extremely
deep.
The selected learn_rate = 0.05 means the model learned
gradually instead of making large updates at each boosting step. This is
usually safer because each tree only makes a small contribution to the
final model.
The selected min_n = 2 means the model allowed fairly
detailed splits. The loss_reduction = 0 value means the
model did not require extra improvement before making a split, so it was
allowed to split more freely when useful.
The selected sample_size = 0.80 means each boosting
iteration used a random 80% of the training rows.
I used row sampling because the dataset has some imbalance in
brand_code, especially Brand B, which makes up almost half
of the observations. The row sampling does not fully fix brand
imbalance, but it adds randomness during training and can reduce the
chance that the model depends too heavily on one exact training
sample.
This randomness can help reduce overfitting and make the boosted trees
more stable.
Overall, the selected tuning parameters show that XGBoost benefited from
a moderately flexible model: enough trees, moderate depth, gradual
learning, small terminal nodes, and row sampling.
# xgb_tune |>
# collect_metrics() |>
# filter(.metric == "rmse") |>
# group_by(tree_depth, learn_rate) |>
# summarise(
# best_rmse = min(mean),
# .groups = "drop"
# ) |>
# ggplot(aes(
# x = factor(tree_depth),
# y = factor(learn_rate),
# fill = best_rmse
# )) +
# geom_tile() +
# geom_text(aes(label = round(best_rmse, 4))) +
# labs(
# title = "XGBoost Best RMSE by Tree Depth and Learning Rate",
# x = "Tree Depth",
# y = "Learning Rate",
# fill = "RMSE"
# )
final_xgb_workflow <- finalize_workflow(
xgb_workflow,
best_xgb
)
xgb_final_start_time <- Sys.time()
final_xgb_fit <- fit(
final_xgb_workflow,
data = train_tree_model_data
)
xgb_final_end_time <- Sys.time()
xgb_final_training_time <- as.numeric(
difftime(
xgb_final_end_time,
xgb_final_start_time,
units = "secs"
)
)
xgb_validation_predictions <- predict(
final_xgb_fit,
new_data = valid_tree_model_data
) |>
pull(.pred)
xgb_result_short <- xgb_result |>
mutate(
training_time_minutes = training_time_seconds / 60
) |>
select(
workflow,
model_family,
model_name,
training_rows,
validation_rows,
rmse,
mae,
rsq,
training_time_minutes
)
xgb_result_short |>
kable(
caption = "XGBoost Validation Performance",
digits = 4,
format = "html"
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
| workflow | model_family | model_name | training_rows | validation_rows | rmse | mae | rsq | training_time_minutes |
|---|---|---|---|---|---|---|---|---|
| Tree-family | Tree-Based | XGBoost | 2052 | 515 | 0.0964 | 0.0705 | 0.6852 | 10.8502 |
xgb_selected_parameters <- xgb_result |>
select(
model_name,
tuning_parameters
)
xgb_selected_parameters |>
kable(
caption = "Selected XGBoost Tuning Parameters",
format = "html"
) |>
kable_styling(
full_width = FALSE,
position = "center",
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
| model_name | tuning_parameters |
|---|---|
| XGBoost | trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24 |
The final XGBoost model used:
trees = 1000tree_depth = 4learn_rate = 0.05loss_reduction = 0min_n = 2sample_size = 0.80mtry = 24Its validation performance is:
This is the best individual model result so far. XGBoost has the lowest validation RMSE and \(R^2\) among the individual models
The model used enough trees to learn the pattern, a moderate tree
depth to capture nonlinear relationships, and a learning rate of 0.05 so
the model could improve gradually. The sample_size = 0.80
added some randomness during training, which can help reduce
overfitting.
xgb_diagn_plots <- plot_model_diagnostics(
obs = valid_ph,
pred = xgb_validation_predictions,
model_name = "XGBoost"
)
xgb_diagn_plots
The XGBoost diagnostic plots show that the model performed well overall on the validation data.
In the Predicted vs Observed plot, most points follow the diagonal line, which means the predicted PH values were generally close to the observed PH values. There is still some spread, but the overall pattern is strong and shows that the model captured the main PH trend.
The Residuals vs Predicted plot also looks reasonably balanced around zero. This means the model did not strongly overpredict or underpredict across the whole predicted PH range.
There are still a few larger residuals, especially some points below -0.25 and above 0.25.
Overall, these plots support using XGBoost as the final individual model. The model is not perfect, but it shows a good balance between prediction accuracy and residual behavior.
tree_model_results <- add_model_result(
results_table = tree_model_results,
new_result = xgb_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = xgb_result
)
final_model_comparison <- all_model_results |>
mutate(
training_time_minutes = training_time_seconds / 60
) |>
select(
`Dataset type` = workflow,
`Model Family` = model_family,
`Model Name` = model_name,
`Tuning Parameters` = tuning_parameters,
RMSE = rmse,
MAE = mae,
`R-squared` = rsq
) |>
arrange(RMSE)
final_model_comparison |>
kable(
caption = "Final Model Comparison on Validation Data",
digits = 4,
format = "html"
) |>
kable_styling(
full_width = FALSE,
position = "center"
) |>
column_spec(
column = 4,
width = "16em",
extra_css = "white-space: normal; word-break: break-word;"
)
| Dataset type | Model Family | Model Name | Tuning Parameters | RMSE | MAE | R-squared |
|---|---|---|---|---|---|---|
| Tree-family | Tree-Based | XGBoost | trees = 1000 , tree_depth = 4 , learn_rate = 0.05 , loss_reduction = 0 , min_n = 2 , sample_size = 0.8 , mtry = 24 | 0.0964 | 0.0705 | 0.6852 |
| Tree-family | Tree-Based | Random Forest | mtry = 24 , min_n = 1 , trees = 1000 | 0.0984 | 0.0706 | 0.6835 |
| Tree-family | Tree-Based | Cubist | committees = 50 , neighbors = 5 | 0.0996 | 0.0682 | 0.6650 |
| Nonlinear-family | Nonlinear | SVM Radial | cost = 3 , / rbf_sigma = 0.05 | 0.1112 | 0.0812 | 0.5806 |
| Nonlinear-family | Nonlinear | MARS | num_terms = 35 , prod_degree = 2 | 0.1212 | 0.0907 | 0.5013 |
| Nonlinear-family | Nonlinear | Neural Network | hidden_units = 12 , penalty = 0.5 , epochs = 150 | 0.1226 | 0.0926 | 0.4984 |
| Linear-family / PLS keeps correlated predictors | Linear / Regularized | PLS Regression | num_comp = 14 | 0.1286 | 0.0987 | 0.4403 |
| Linear-family | Linear / Regularized | Linear Regression | None | 0.1305 | 0.1007 | 0.4244 |
| Linear-family | Linear / Regularized | Elastic Net | penalty = 0.00848342898244073 , mixture = 0 | 0.1305 | 0.1007 | 0.4250 |
| Tree-family | Nonlinear | Neural Network with Tree-Family Features | hidden_units = 15 , penalty = 0.5 , epochs = 150 | 0.1432 | 0.0989 | 0.3660 |
The final validation comparison shows a clear difference between the model families.
The linear and regularized models were the weakest group overall.
For PLS, I kept the correlated predictors because PLS is designed to work with correlated variables by creating components from them. This helped PLS perform slightly better than regular linear regression and elastic net.
However, even with this adjustment, the linear-family models still left a lot of PH variation unexplained.
The nonlinear models improved over the linear-family models.
This tells us that nonlinear relationships are playing larger role in the PH explanation. However, not all nonlinear models were able to capture those relationships equally well.
The tree-based models performed best overall.
Cubist had the lowest MAE among the individual tree-based models, which means it had slightly smaller average absolute errors. XGBoost and Random Forest performed better by RMSE and \(R^2\).
cubist_diagnostic_plots/rf_diagnostiv_plots/xgb_diagn_plots
Since tree-based models produced fairly close results, it is worth to notice that these diagnostic plots represent a better understanding of which model is preformed better overall.
Cubist plots:
Predicted vs Observed plot show that Cubist model had slightly uneven
balance in a spread. There are few highly over and underpredicted
values.
From Residuals vs Predicted plot it is noticeable that that Cubist model
had some larger over and under estimated predictions
Cubist had the best MAE among the tree-based models, which means it made
smaller average absolute errors. However, the diagnostic plots show that
it still had a few larger misses, there are few residuals in 0.50 and
-0.50 ranges.
Random Forest:
There is a visible pattern in the residuals. For predicted PH values
below about 8.65, the model appears to have more negative
residuals, meaning it tends to overpredict in that range. For predicted
PH values above about 8.65, there is more tendency toward
positive residuals, meaning it tends to underpredict some higher PH
values.
The residuals vs predicted plot shows this pattern more clearly. This
suggests that Random Forest performs well overall, but it may still pull
some predictions toward the middle range
XGBoost:
The XGBoost diagnostic plots look the most balanced overall.
In the Predicted vs Observed plot, the points are tightly grouped around
the diagonal line, and the spread looks more even compared with Cubist
and Random Forest.
The residuals vs predicted plot also looks more balanced. The residuals
are mostly centered around zero with kinda oval shape, and the spread
appears more even across the predicted PH range. Compared with Cubist
and Random Forest, XGBoost shows less spread of the residuals.
Because of this, XGBoost looks like the strongest individual tree-based model from the diagnostic plots. It also had the best RMSE in the final comparison table, which supports this choice.
Overall, based on both the validation metrics and the diagnostic plots, I would select XGBoost as the best individual model.
I did not rely on only one tuning attempt for the stronger models. For several models, I used a staged tuning process:
For example:
mtry and min_n values.mtry values and smaller min_n values,
so after that I refined the search around that region.Some intermediate tuning attempts are not included in the final RMD file and knitted reports.
For the final report, I kept only the most important tuning results and the final validation comparison. Otherwise including every tuning attempt would make the report too long and would also increase computing and knitting time.
I also considered saving model metrics after each tuning attempt, but I decided not to do that in the final workflow. Since the modeling process already included many models and tuning steps, saving too many intermediate results could make the project harder to follow and easier to mix up. Instead, I focused on keeping the final model comparison clean and reproducible.
Based on the validation results:
So, the tree-based models are the strongest individual model family for this PH prediction task.
After selecting the best individual models, I also review which predictors were most important.
This part is important because the goal is not only to predict PH, but also to understand which manufacturing and process variables helped the models make predictions.
I use XGBoost variable importance because XGBoost was the strongest individual model by RMSE. I also check Random Forest variable importance because Random Forest had very close validation performance and gives another tree-based view of important predictors.
These importance values should be read as predictive importance, not direct causation. In other words, a variable can help predict PH, but that does not automatically mean it directly causes PH to change.
final_xgb_engine <- final_xgb_fit |>
extract_fit_parsnip() |>
pluck("fit")
xgb_importance <- xgb.importance(
model = final_xgb_engine
)
xgb_importance_table <- xgb_importance |>
as_tibble() |>
slice(1:20)
xgb_importance_table |>
kable(
caption = "Top 20 XGBoost Important Predictors",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| mnf_flow | 0.1613 | 0.0345 | 0.0280 |
| usage_cont | 0.0680 | 0.0446 | 0.0451 |
| oxygen_filler | 0.0510 | 0.0469 | 0.0489 |
| brand_codeC | 0.0508 | 0.0093 | 0.0069 |
| temperature | 0.0453 | 0.0281 | 0.0323 |
| bowl_setpoint | 0.0436 | 0.0074 | 0.0075 |
| balling_lvl | 0.0430 | 0.0323 | 0.0344 |
| pressure_vacuum | 0.0418 | 0.0425 | 0.0374 |
| alch_rel | 0.0387 | 0.0357 | 0.0292 |
| density | 0.0358 | 0.0324 | 0.0265 |
| air_pressurer | 0.0350 | 0.0257 | 0.0280 |
| carb_pressure1 | 0.0348 | 0.0476 | 0.0424 |
| filler_speed | 0.0343 | 0.0510 | 0.0432 |
| carb_rel | 0.0264 | 0.0216 | 0.0264 |
| carb_flow | 0.0236 | 0.0500 | 0.0431 |
| carb_volume | 0.0204 | 0.0344 | 0.0384 |
| hyd_pressure3 | 0.0202 | 0.0207 | 0.0220 |
| balling | 0.0190 | 0.0232 | 0.0231 |
| pc_volume | 0.0180 | 0.0360 | 0.0431 |
| filler_level | 0.0172 | 0.0307 | 0.0323 |
xgb_importance_table |>
ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
geom_col() +
coord_flip() +
labs(
title = "Top 20 XGBoost Important Predictors",
x = "Predictor",
y = "Gain"
)
final_rf_engine <- final_rf_fit |>
extract_fit_parsnip() |>
pluck("fit")
rf_importance_table <- tibble(
variable = names(final_rf_engine$variable.importance),
importance = as.numeric(final_rf_engine$variable.importance)
) |>
arrange(desc(importance)) |>
slice(1:20)
rf_importance_table |>
kable(
caption = "Top 20 Random Forest Important Predictors",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| variable | importance |
|---|---|
| mnf_flow | 11.9829 |
| usage_cont | 3.4860 |
| temperature | 3.0087 |
| oxygen_filler | 2.9751 |
| carb_rel | 2.9246 |
| alch_rel | 2.8036 |
| balling_lvl | 2.6853 |
| pressure_vacuum | 2.3379 |
| air_pressurer | 2.0996 |
| carb_pressure1 | 1.9842 |
| density | 1.9279 |
| brand_code | 1.8040 |
| bowl_setpoint | 1.6859 |
| carb_flow | 1.6133 |
| filler_speed | 1.5604 |
| filler_level | 1.4296 |
| balling | 1.3673 |
| hyd_pressure3 | 1.3143 |
| pc_volume | 1.1827 |
| carb_volume | 1.1820 |
rf_importance_table |>
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = "Top 20 Random Forest Important Predictors",
x = "Predictor",
y = "Importance"
)
xgb_importance_ranked <- xgb_importance |>
as_tibble() |>
transmute(
variable = Feature,
xgb_gain = Gain
) |>
arrange(desc(xgb_gain)) |>
mutate(
xgb_rank = row_number()
)
rf_importance_ranked <- tibble(
variable = names(final_rf_engine$variable.importance),
rf_importance = as.numeric(final_rf_engine$variable.importance)
) |>
arrange(desc(rf_importance)) |>
mutate(
rf_rank = row_number()
)
importance_rank_comparison <- full_join(
xgb_importance_ranked,
rf_importance_ranked,
by = "variable"
) |>
mutate(
appears_in_both = ifelse(
!is.na(xgb_rank) & !is.na(rf_rank),
"Yes",
"No"
)
) |>
rowwise() |>
mutate(
average_rank = mean(
c(xgb_rank, rf_rank),
na.rm = TRUE
)
) |>
ungroup() |>
arrange(average_rank)
importance_rank_display <- importance_rank_comparison |>
select(
variable,
xgb_rank,
rf_rank,
average_rank,
appears_in_both
) |>
slice(1:10) |>
mutate(
xgb_rank = ifelse(is.na(xgb_rank), "Not ranked", as.character(xgb_rank)),
rf_rank = ifelse(is.na(rf_rank), "Not ranked", as.character(rf_rank))
)
importance_rank_display |>
kable(
caption = "Rank-Based Comparison of Top-10 Important Predictors",
digits = 2
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| variable | xgb_rank | rf_rank | average_rank | appears_in_both |
|---|---|---|---|---|
| mnf_flow | 1 | 1 | 1.0 | Yes |
| usage_cont | 2 | 2 | 2.0 | Yes |
| oxygen_filler | 3 | 4 | 3.5 | Yes |
| brand_codeC | 4 | Not ranked | 4.0 | No |
| temperature | 5 | 3 | 4.0 | Yes |
| balling_lvl | 7 | 7 | 7.0 | Yes |
| alch_rel | 9 | 6 | 7.5 | Yes |
| pressure_vacuum | 8 | 8 | 8.0 | Yes |
| bowl_setpoint | 6 | 13 | 9.5 | Yes |
| carb_rel | 14 | 5 | 9.5 | Yes |
XGBoost and Random Forest both showed that some predictors were much more important than others, but the importance values are not on the same scale.
For XGBoost, mnf_flow had the highest Gain value at
0.1613. This was much larger than next predictor,
usage_cont, which had Gain = 0.0680. This tells us
that mnf_flow was the strongest predictor in the XGBoost
model.
Random Forest showed the same pattern. mnf_flow had the
highest importance value at 11.9829, while the next predictor,
usage_cont, had importance of 3.4860. So in both
models, mnf_flow was clearly the most important predictor
for PH.
Both models also found similar process-related variables important, including:
usage_contoxygen_fillertemperatureballing_lvlpressure_vacuumalch_relcarb_relcarb_pressure1filler_speedcarb_flowThis gives more confidence that these variables are useful predictive factors because they appear in both strong models.
There is one difference is how brand information appears. XGBoost
shows brand_codeC as an important predictor with Gain =
0.0508, while Random Forest shows the full brand_code
variable with importance = 1.8040. This difference is probably because
XGBoost used dummy variables for brand_code and Random
Forest handled the brand variable as is and didn’t split it into dummy
variables.
Overall, the importance results suggest that PH prediction is mostly
related to mnf_flow (I aasume it is manufacturing flow),
usage/process conditions, oxygen and filling behavior, temperature,
pressure-related variables, carbonation/alcohol-related variables, and
brand differences.
I think it is important to note that these are predictive factors, not direct causal conclusions. They tell us which variables helped the models predict PH, but not necessarily which variables directly cause PH to change.
After selecting XGBoost as the final individual model, I wanted to see how predicted PH changes when models important predictors are changed.
I wanted to see more precise changes for each brand’s average
predicted PH, so I calculated the average prediction changes separately
by brand_code.
I think this is more useful than using only one global average changes, because I think each brand may have different process behavior. In other words, the same variable change may affect the model predictions differently for each brand type.
For each brand, I changed one XGBoost’s important predictor by 10% while keeping the other predictors the same. Then I compared the new predicted PH with the original predicted PH.
This does not prove that changing one variable directly causes PH to change. It only shows how the trained XGBoost model responds to these changes.
# Baseline row for scenario testing
# I use the validation row closest to the median predicted PH
valid_tree_predictions_for_scenario <- predict(
final_xgb_fit,
new_data = valid_tree_model_data
) |>
pull(.pred)
median_predicted_ph <- median(valid_tree_predictions_for_scenario)
baseline_row_index <- which.min(
abs(valid_tree_predictions_for_scenario - median_predicted_ph)
)
baseline_row <- valid_tree_model_data |>
slice(baseline_row_index) |>
select(-ph)
baseline_prediction <- predict(
final_xgb_fit,
new_data = baseline_row
) |>
pull(.pred)
baseline_prediction
[1] 8.533664
make_brand_numeric_scenario <- function(variable_name, change_percent = 0.10) {
# Validation predictors only
valid_x <- valid_tree_model_data |>
select(-ph)
# Original predictions
original_predictions <- predict(
final_xgb_fit,
new_data = valid_x
) |>
pull(.pred)
# Decreased scenario
decreased_x <- valid_x |>
mutate(
!!variable_name := .data[[variable_name]] * (1 - change_percent)
)
decreased_predictions <- predict(
final_xgb_fit,
new_data = decreased_x
) |>
pull(.pred)
# Increased scenario
increased_x <- valid_x |>
mutate(
!!variable_name := .data[[variable_name]] * (1 + change_percent)
)
increased_predictions <- predict(
final_xgb_fit,
new_data = increased_x
) |>
pull(.pred)
# Combine and summarize by brand
bind_rows(
tibble(
brand_code = valid_x$brand_code,
variable_changed = variable_name,
scenario = paste0(variable_name, " decreased by 10%"),
original_predicted_ph = original_predictions,
scenario_predicted_ph = decreased_predictions,
change_from_original = scenario_predicted_ph - original_predicted_ph
),
tibble(
brand_code = valid_x$brand_code,
variable_changed = variable_name,
scenario = paste0(variable_name, " increased by 10%"),
original_predicted_ph = original_predictions,
scenario_predicted_ph = increased_predictions,
change_from_original = scenario_predicted_ph - original_predicted_ph
)
) |>
group_by(
brand_code,
variable_changed,
scenario
) |>
summarise(
validation_rows = n(),
average_original_ph = mean(original_predicted_ph, na.rm = TRUE),
average_scenario_ph = mean(scenario_predicted_ph, na.rm = TRUE),
average_change_from_original = mean(change_from_original, na.rm = TRUE),
min_change = min(change_from_original, na.rm = TRUE),
max_change = max(change_from_original, na.rm = TRUE),
.groups = "drop"
) |>
arrange(
variable_changed,
brand_code,
scenario
)
}
scenario_variables <- c(
"mnf_flow",
"usage_cont",
"oxygen_filler",
"temperature",
"pressure_vacuum",
"carb_rel"
)
brand_scenario_results <- map_dfr(
scenario_variables,
make_brand_numeric_scenario
)
format_change <- function(x) {
ifelse(
x > 0,
paste0("+", sprintf("%.4f", x)),
sprintf("%.4f", x)
)
}
scenario_decrease_table <- brand_scenario_results |>
filter(str_detect(scenario, "decreased")) |>
select(
brand_code,
variable_changed,
average_change_from_original
) |>
mutate(
average_change_from_original = round(average_change_from_original, 4)
) |>
pivot_wider(
names_from = brand_code,
values_from = average_change_from_original
) |>
arrange(variable_changed)
scenario_increase_table <- brand_scenario_results |>
filter(str_detect(scenario, "increased")) |>
select(
brand_code,
variable_changed,
average_change_from_original
) |>
mutate(
average_change_from_original = round(average_change_from_original, 4)
) |>
pivot_wider(
names_from = brand_code,
values_from = average_change_from_original
) |>
arrange(variable_changed)
scenario_decrease_display <- scenario_decrease_table |>
mutate(
across(
c(A, B, C, D),
format_change
)
)
scenario_decrease_display |>
kable(
caption = "Average Predicted PH Change When Variable Decreased by 10%",
format = "html"
) |>
kable_styling(
full_width = FALSE,
position = "center"
# bootstrap_options = c("striped", "hover", "condensed")
)
| variable_changed | A | B | C | D |
|---|---|---|---|---|
| carb_rel | -0.0475 | -0.0218 | +0.0038 | -0.0585 |
| mnf_flow | -0.0047 | -0.0047 | +0.0107 | -0.0038 |
| oxygen_filler | +0.0021 | +0.0020 | +0.0002 | +0.0013 |
| pressure_vacuum | +0.0014 | -0.0019 | +0.0032 | 0.0000 |
| temperature | +0.0053 | +0.0139 | +0.0104 | +0.0124 |
| usage_cont | +0.0248 | +0.0254 | +0.0175 | +0.0240 |
The table above shows the average change in predicted PH when each selected predictor was decreased by 10%, separated by brand code.
Positive values mean the XGBoost model predicted higher averafe PH after the variable was decreased by 10%. Negative values mean the model predicted lower average PH after the variable was decreased.
There are few patterns are noticeable:
carb_rel had one of the strongest effects. When
carb_rel was decreased by 10%, predicted PH decreased for
brands A, B, and D. The strongest average decrease was for brand D
(-0.0585), followed by brand A (-0.0475). Brand C behaved
differently and had a very small increase (+0.0038).
usage_cont showed a consistent pattern across all
brands. When usage_cont was decreased by 10%, predicted PH
increased for all brands. The average increase was between +0.0175 and
+0.0254.
temperature also showed a consistent pattern. When
temperature was decreased by 10%, predicted PH increased for all brands.
The increase is not very large, but it is consistent.
oxygen_filler and pressure_vacuum had
very small mostly positive average effects in this case.
mnf_flow had a small average effect, but it did not
behave exactly the same for all brands. For brands A, B, and D,
decreasing mnf_flow slightly lowered predicted PH. For
Brand C, it slightly increased predicted PH.
Overall as I expected, this table shows that XGBoost is not treating every brand the same way . The same 10% change can have different effects depending on the brand. This supports the idea that PH prediction depends on the combination of process variables and brand-level behavior, not only one variable by itself.
And again, these results should still be interpreted as model-based prediction scenarios, not direct cause-and-effect results.
scenario_increase_display <- scenario_increase_table |>
mutate(
across(
c(A, B, C, D),
format_change
)
)
scenario_increase_display |>
kable(
caption = "Average Predicted PH Change When Variable Increased by 10%",
format = "html"
) |>
kable_styling(
full_width = FALSE,
position = "center"
# bootstrap_options = c("striped", "hover")
)
| variable_changed | A | B | C | D |
|---|---|---|---|---|
| carb_rel | -0.0062 | -0.0274 | -0.0251 | -0.0097 |
| mnf_flow | +0.0106 | +0.0094 | +0.0042 | +0.0140 |
| oxygen_filler | -0.0027 | -0.0017 | -0.0002 | -0.0033 |
| pressure_vacuum | +0.0017 | +0.0030 | +0.0111 | +0.0021 |
| temperature | -0.0349 | -0.0566 | -0.0201 | -0.0451 |
| usage_cont | -0.0010 | -0.0061 | -0.0039 | +0.0022 |
And this table above shows the average change in predicted PH when each selected predictor was increased by 10%, separated by brand codes.
There are also patterns are noticeable:
temperature had one of the strongest effects. When
temperature was increased by 10%, as we see predicted PH
decreased for all brands. The largest average decrease was for brand B
(-0.0566), followed by D (-0.0451), brand A
(-0.0349), and brand C (-0.0201). This tells us that
the model associates higher temperature values with lower predicted
PH.
carb_rel also showed mostly negative changes. When
carb_rel was increased by 10%, predicted PH decreased for
all brands. The decrease was strongest for brand B (-0.0274)
and brand C (-0.0251).
mnf_flow showed a consistent positive pattern. When
mnf_flow was increased by 10%, predicted PH increased for
all brands. The effect was not very large, but it was consistent across
brands. Brand D had the largest average increase
(+0.0140).
pressure_vacuum had a small positive effect for all
brands. The largest average change was for Brand C (+0.0111), while the
other brands had smaller changes.
oxygen_filler had a very small negative effect
across all brands. The changes were close to zero, so this scenario did
not strongly move predicted PH on average.
usage_cont had small average effects. For Brands A,
B, and C, increasing usage_cont slightly decreased
predicted PH. For Brand D, it slightly increased predicted PH, but the
change was very small (+0.0022).
As we see from both tables, temperature,
carb_rel, usage_cont, and
mnf_flow showed the clearest average changes in predicted
PH.
The temperature pattern was especially clear. Decreasing temperature increased predicted PH for all brands, while increasing temperature decreased predicted PH for all brands.
The mnf_flow pattern was also mostly consistent.
Increasing mnf_flow increased predicted PH for all brands,
although the average changes were not very large.
The carb_rel variable had stronger negative effects for
some brands, especially Brand A and Brand D when decreased, and Brand B
and Brand C when increased.
I think that the important point here is that these are not simple one-variable rules. The model response can differ by brand, and real process changes may also affect other variables at the same time. Because of that, these scenario results should be used as a guide for decision-making discussion with process and manufacturing involved experts, not as direct production instructions.
This interaction review changes two important predictors at the same
time: mnf_flow and oxygen_filler.
If the predicted PH changes in a simple smooth pattern, then the model response is easier to explain. If the prediction changes differently across combinations of the two variables, that suggests the XGBoost model may be using an interaction between those predictors.
This does not prove a physical interaction in the manufacturing process. It only shows that the model prediction depends on the combination of these input values.
The final evaluation predictor set was already prepared in the preprocessing part.
Since I already cleaned, capped, and imputed predictors for the final
evaluation dataset eval_features_tree_family, I will not
repeat the full preprocessing workflow here.
The only remaining step is model-specific prediction. Since the final XGBoost model was trained as a workflow, the model object handles the same model preparation that was used during training. I only need to confirm that the evaluation predictors have the same columns as the tree-family training predictors, then generate PH predictions.
missing_from_eval <- setdiff(
names(train_tree_model_data |> select(-ph)),
names(eval_features_tree_family)
)
extra_in_eval <- setdiff(
names(eval_features_tree_family),
names(train_tree_model_data |> select(-ph))
)
missing_from_eval
character(0)
extra_in_eval
character(0)
stopifnot(length(missing_from_eval) == 0)
stopifnot(length(extra_in_eval) == 0)
It confirmed that there are same columns and no missing values.
Now we can do predictions
eval_xgb_predictions <- predict(
final_xgb_fit,
new_data = eval_features_tree_family
) |>
rename(predicted_ph = .pred)
eval_predictions_xgboost <- eval_features_tree_family |>
mutate(
row_id = row_number(),
predicted_ph = eval_xgb_predictions$predicted_ph
) |>
select(
row_id,
predicted_ph
)
eval_predictions_xgboost |>
head(10) |>
kable(
caption = "First 10 Final Evaluation PH Predictions",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| row_id | predicted_ph |
|---|---|
| 1 | 8.5002 |
| 2 | 8.4530 |
| 3 | 8.5082 |
| 4 | 8.6799 |
| 5 | 8.3887 |
| 6 | 8.4879 |
| 7 | 8.5332 |
| 8 | 8.5005 |
| 9 | 8.5930 |
| 10 | 8.7368 |
write_csv(
eval_predictions_xgboost,
"final_evaluation_ph_predictions_xgboost.csv"
)
write_xlsx(
eval_predictions_xgboost,
"final_evaluation_ph_predictions_xgboost.xlsx"
)
I selected XGBoost as the final individual model for PH prediction.
It had the best validation RMSE among the individual models and showed strong diagnostic performance compared with the other model families. Random Forest and Cubist models had close accuracy metrics, but XGBoost had the best overall validation result.
The most important predictors suggest that PH is connected to manufacturing flow, oxygen/filling behavior, temperature, pressure, carbonation/alcohol-related measurements, and brand-level differences.
The final evaluation data do not contain observed PH values, so I couldn’t calculate model accuracy could on that dataset. I used trained XGBoost model to generate final PH predictions.
The variable importance results should be interpreted as predictive importance, not direct causation.
Note: This section is my additional experiment. It doesn’t have a purpose for selecting the final submitted model.
After comparing the individual models, I also wanted to experiment to see if combining model predictions could improve validation performance.
The idea behind an ensemble is that different models may capture different parts of the PH prediction problem. For example, Random Forest, Cubist, and XGBoost are all tree-based models, but they learn patterns in different ways:
Because these models are different, they may not make the same mistakes on the same observations. If one model overpredicts PH for some rows and another model underpredicts those same rows, averaging their predictions might reduce overall error.
For this experiment, I first tested simple average ensembles. Then I tested a weighted ensemble, where stronger models were allowed to receive more weight.
This ensemble section is treated as an experiment because the weights were selected using validation performance. Because of that, I avoid repeatedly changing the weights too many times, so the validation set does not become overused.
For the final modeling approach, I focus on individual models only. This keeps the final model easier to explain, reproduce, and report. Based on the validation results, XGBoost is selected as the best individual model because it had the lowest RMSE and highest \(R^2\) among the individual models tested.
ensemble_predictions_rf_cubist <- (
rf_validation_predictions +
cubist_validation_predictions
) / 2
ensemble_rf_cubist_result <- make_and_show_model_result(
workflow = "Ensemble",
model_family = "Model Combination",
model_name = "Average Ensemble: RF + Cubist",
training_rows = nrow(train_tree_model_data),
validation_rows = length(valid_ph),
tuning_parameters = "Average of Random Forest and Cubist",
actual = valid_ph,
predicted = ensemble_predictions_rf_cubist,
training_time_seconds = 0
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Ensemble | Model Combination | Average Ensemble: RF + Cubist | 2052 | 515 | Average of Random Forest and Cubist | 0.0954 | 0.0666 | 0.6937 | 0 |
The first ensemble experiment used a simple average of the Random Forest and Cubist predictions.
This ensemble had:
This result is better than both individual models by RMSE and \(R^2\).
Random Forest had:
Cubist had:
The ensemble improved because Random Forest and Cubist do not make exactly the same errors. Random Forest had stronger RMSE and \(R^2\), while Cubist had better MAE. By averaging their predictions, the ensemble combined useful parts of both models.
This is also a good ensemble to report because it is simple and easy to explain. It does not require training another model. It only averages the predictions from two already strong models.
ensemble_model_results <- tibble()
ensemble_model_results <- add_model_result(
results_table = ensemble_model_results,
new_result = ensemble_rf_cubist_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = ensemble_rf_cubist_result
)
ensemble_predictions_rf_cubist_xgb <- (
rf_validation_predictions +
cubist_validation_predictions +
xgb_validation_predictions
) / 3
ensemble_rf_cubist_xgb_result <- make_and_show_model_result(
workflow = "Ensemble",
model_family = "Model Combination",
model_name = "Average Ensemble: RF + Cubist + XGBoost",
training_rows = nrow(train_tree_model_data),
validation_rows = length(valid_ph),
tuning_parameters = "Average of Random Forest, Cubist, and XGBoost",
actual = valid_ph,
predicted = ensemble_predictions_rf_cubist_xgb,
training_time_seconds = 0
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Ensemble | Model Combination | Average Ensemble: RF + Cubist + XGBoost | 2052 | 515 | Average of Random Forest, Cubist, and XGBoost | 0.0941 | 0.0667 | 0.7027 | 0 |
The second ensemble experiment averaged the predictions from Random Forest, Cubist, and XGBoost.
This ensemble had:
This improved over the RF + Cubist average ensemble, which had:
The improvement in RMSE and \(R^2\) suggests that XGBoost added useful information to the ensemble
ensemble_weight_grid <- expand_grid(
rf_weight = seq(0, 1, by = 0.05),
cubist_weight = seq(0, 1, by = 0.05),
xgb_weight = seq(0, 1, by = 0.05)
) |>
filter(abs(rf_weight + cubist_weight + xgb_weight - 1) < 0.000001)
ensemble_weight_results <- ensemble_weight_grid |>
mutate(
predictions = pmap(
list(rf_weight, cubist_weight, xgb_weight),
function(rf_weight, cubist_weight, xgb_weight) {
rf_weight * rf_validation_predictions +
cubist_weight * cubist_validation_predictions +
xgb_weight * xgb_validation_predictions
}
),
rmse = map_dbl(
predictions,
~ yardstick::rmse_vec(
truth = valid_ph,
estimate = .x
)
),
mae = map_dbl(
predictions,
~ yardstick::mae_vec(
truth = valid_ph,
estimate = .x
)
),
rsq = map_dbl(
predictions,
~ yardstick::rsq_vec(
truth = valid_ph,
estimate = .x
)
)
) |>
select(-predictions) |>
arrange(rmse)
best_weighted_ensemble <- ensemble_weight_results |>
slice(1)
best_weighted_ensemble |>
kable(
caption = "Best Weighted Ensemble Weights",
digits = 4
) |>
kable_styling(
full_width = FALSE,
position = "center"
)
| rf_weight | cubist_weight | xgb_weight | rmse | mae | rsq |
|---|---|---|---|---|---|
| 0.25 | 0.3 | 0.45 | 0.094 | 0.067 | 0.7026 |
weighted_ensemble_predictions <-
best_weighted_ensemble$rf_weight * rf_validation_predictions +
best_weighted_ensemble$cubist_weight * cubist_validation_predictions +
best_weighted_ensemble$xgb_weight * xgb_validation_predictions
weighted_ensemble_result <- make_and_show_model_result(
workflow = "Ensemble",
model_family = "Model Combination",
model_name = "Weighted Ensemble: RF + Cubist + XGBoost",
training_rows = nrow(train_tree_model_data),
validation_rows = length(valid_ph),
tuning_parameters = paste(
"RF =",
best_weighted_ensemble$rf_weight,
", Cubist =",
best_weighted_ensemble$cubist_weight,
", XGBoost =",
best_weighted_ensemble$xgb_weight
),
actual = valid_ph,
predicted = weighted_ensemble_predictions,
training_time_seconds = 0
)
| workflow | model_family | model_name | training_rows | validation_rows | tuning_parameters | rmse | mae | rsq | training_time_seconds |
|---|---|---|---|---|---|---|---|---|---|
| Ensemble | Model Combination | Weighted Ensemble: RF + Cubist + XGBoost | 2052 | 515 | RF = 0.25 , Cubist = 0.3 , XGBoost = 0.45 | 0.094 | 0.067 | 0.7026 | 0 |
The weighted ensemble also performed very well.
The selected weights were:
The validation performance was:
This is very close to the simple average ensemble of Random Forest, Cubist, and XGBoost. The weighted ensemble had slightly lower RMSE (by 0.001), while the simple average ensemble had slightly better MAE and almost the same \(R^2\).
Note: The ensemble models were tested only as an experiment to see if combining predictions would improve accuracy. Even ensembling results are better then individual XGBoost’s result, for this project I decided do not use them as official final model to make predictions.
ensemble_model_results <- tibble()
ensemble_model_results <- add_model_result(
results_table = ensemble_model_results,
new_result = ensemble_rf_cubist_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = ensemble_rf_cubist_result
)
ensemble_model_results <- add_model_result(
results_table = ensemble_model_results,
new_result = ensemble_rf_cubist_xgb_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = ensemble_rf_cubist_xgb_result
)
ensemble_model_results <- add_model_result(
results_table = ensemble_model_results,
new_result = weighted_ensemble_result
)
all_model_results <- add_model_result(
results_table = all_model_results,
new_result = weighted_ensemble_result
)
#ensemble_model_results
# if (exists("cl")) {
# stopCluster(cl)
# registerDoSEQ()
# message("Parallel backend stopped.")
# }