library(readxl)
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(writexl)
library(ggplot2)
library(ggridges)
library(DataExplorer)
library(skimr)
library(caret)
library(VIM)
library(gridExtra)
library(smotefamily)
library(fastDummies)
library(rsample)
library(earth)
theme_set(theme_minimal())
train_raw <- read_excel("StudentData.xlsx")
eval_raw <- read_excel("StudentEvaluation.xlsx")
names(train_raw) <- str_replace_all(names(train_raw), " ", "_")
names(eval_raw) <- str_replace_all(names(eval_raw), " ", "_")
Overview

The dataset, loaded from StudentData.xlsx, contains 2,571 observations and 33 variables capturing detailed measurements from ABC Beverage’s manufacturing process. These variables span both process metrics (pressure, flow, temperature) and chemical indicators (alcohol content, density, carbonation), with pH as the modeling target. The Brand Code column, a categorical identifier with values A, B, C, and D, includes 120 missing entries; it is encoded numerically (1–4) later in the analysis. Core production variables like Carb Volume (5.04–5.70) and Fill Ounces (23.63–24.32) show tight operational ranges, whereas others like PC Volume and PSC exhibit wider variability and higher missingness (up to 39 missing values).
More complex or sensor-derived features, such as Hyd Pressure1–4, Balling, Oxygen Filler, and Carb Rel, also show notable gaps and fluctuations, suggesting areas for equipment calibration or process standardization. A structural diagnostic shows that 96.97% of the columns are continuous and just 3.03% discrete, making the dataset well suited to numerical modeling. Importantly, 79.27% of rows are complete, and only 0.99% of all data points are missing, indicating minimal and scattered missingness. With a footprint of just 670.4 Kb, the dataset is lightweight and efficient to work with in R or Python. In summary, the data is mostly clean, structurally sound, and ready for predictive modeling with only minor imputation required.
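As a quick sanity check (a minimal sketch, not part of the original script), the figures quoted above can be reproduced directly from the raw frame:

# Hedged check of the summary figures quoted above
dim(train_raw)                  # 2571 rows, 33 columns
mean(complete.cases(train_raw)) # share of complete rows (~0.79)
mean(is.na(train_raw))          # share of missing cells (~0.0099)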
#take a look at the column and data types
# Advanced data descriptions
skim(train_raw)
Name | train_raw |
Number of rows | 2571 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 32 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand_Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Carb_Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
Fill_Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
PC_Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
Carb_Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
Carb_Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
PSC_Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
PSC_CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
Mnf_Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
Carb_Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
Fill_Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
Hyd_Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
Hyd_Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
Hyd_Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
Hyd_Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
Filler_Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
Filler_Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
Usage_cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
Carb_Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
Pressure_Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
Oxygen_Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
Bowl_Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
Pressure_Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
Air_Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
Alch_Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
Carb_Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
Balling_Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
summary(train_raw)
## Brand_Code Carb_Volume Fill_Ounces PC_Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb_Pressure Carb_Temp PSC PSC_Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC_CO2 Mnf_Flow Carb_Pressure1 Fill_Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd_Pressure1 Hyd_Pressure2 Hyd_Pressure3 Hyd_Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler_Level Filler_Speed Temperature Usage_cont Carb_Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure_Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen_Filler Bowl_Setpoint Pressure_Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air_Pressurer Alch_Rel Carb_Rel Balling_Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
# give an idea of what the data set is
plot_intro(train_raw)
Missing Data
# plot missing data
plot_missing(train_raw)
Variables
# Observations with the target variables
colnames(train_raw)
## [1] "Brand_Code" "Carb_Volume" "Fill_Ounces"
## [4] "PC_Volume" "Carb_Pressure" "Carb_Temp"
## [7] "PSC" "PSC_Fill" "PSC_CO2"
## [10] "Mnf_Flow" "Carb_Pressure1" "Fill_Pressure"
## [13] "Hyd_Pressure1" "Hyd_Pressure2" "Hyd_Pressure3"
## [16] "Hyd_Pressure4" "Filler_Level" "Filler_Speed"
## [19] "Temperature" "Usage_cont" "Carb_Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure_Vacuum" "PH" "Oxygen_Filler"
## [28] "Bowl_Setpoint" "Pressure_Setpoint" "Air_Pressurer"
## [31] "Alch_Rel" "Carb_Rel" "Balling_Lvl"
# Plot the only categorical column (Brand_Code)
ggplot(train_raw, aes(x = Brand_Code, fill = Brand_Code)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
vjust = 1.4, colour = "white") +
labs(title = "Brand_Code frequency", x = NULL, y = "Count") +
theme(legend.position = "none")
Brand B is the most frequently produced product in this dataset, with 1,239 observations. This imbalance suggests that Brand B dominates production, which could influence the predictive model for pH due to its high representation. It’s important to verify whether pH behavior differs by brand and, if so, account for it in modeling or stratify the analysis accordingly.
# plot average PH by brand code
train_raw %>% group_by(Brand_Code) %>%
summarise(Mean_PH = mean(PH, na.rm = T)) %>%
ggplot(aes(x = Brand_Code, y = Mean_PH, fill = Brand_Code)) + geom_col() +
geom_text(aes(label = round(Mean_PH, 3)), vjust = 1.5, colour = "white") +
labs(title = "Average pH by Brand Code", y = "Mean pH")
We can see that average pH differs across brand codes. This provides further evidence that a more balanced representation of Brand Code will be important in reducing bias.
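To test this difference formally (a hedged sketch, not part of the original analysis), a one-way ANOVA on the raw data can confirm whether mean pH varies across brands:

# Hedged sketch: one-way ANOVA of pH across Brand_Code, dropping rows
# where either value is missing (Brand_Code is still character here)
brand_ph <- drop_na(train_raw, PH, Brand_Code)
summary(aov(PH ~ Brand_Code, data = brand_ph))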
# Convert categorical columns to numeric codes before computing correlations
categorical_cols <- sapply(train_raw, is.character) | sapply(train_raw, is.factor)
train_raw[categorical_cols] <- lapply(train_raw[categorical_cols], function(x) as.numeric(as.factor(x)))
target <- "PH"
# Initializing an empty data frame
results <- data.frame(Feature = character(), Correlation = numeric(), P_value = numeric(), stringsAsFactors = FALSE)
for (col in names(train_raw)) {
if (col != target) {
correlation <- cor(train_raw[[col]], train_raw[[target]], use = "complete.obs", method = 'pearson')
p_value <- cor.test(train_raw[[col]], train_raw[[target]], method = 'pearson')$p.value
results <- rbind(results, data.frame(Feature = col, Correlation = correlation, P_value = p_value))
}
}
print(results)
## Feature Correlation P_value
## 1 Brand_Code 0.09668448 1.655458e-06
## 2 Carb_Volume 0.06349111 1.317168e-03
## 3 Fill_Ounces -0.09505837 1.677065e-06
## 4 PC_Volume 0.04580178 2.128180e-02
## 5 Carb_Pressure 0.05958033 2.665036e-03
## 6 Carb_Temp 0.02887456 1.456400e-01
## 7 PSC -0.09003060 5.651603e-06
## 8 PSC_Fill -0.03625652 6.748775e-02
## 9 PSC_CO2 -0.07558459 1.424376e-04
## 10 Mnf_Flow -0.44697510 2.557184e-126
## 11 Carb_Pressure1 -0.07932723 6.378245e-05
## 12 Fill_Pressure -0.21343571 1.199531e-27
## 13 Hyd_Pressure1 -0.07398881 1.811517e-04
## 14 Hyd_Pressure2 -0.20067834 1.352668e-24
## 15 Hyd_Pressure3 -0.24030003 7.593496e-35
## 16 Hyd_Pressure4 -0.14153475 7.818363e-13
## 17 Filler_Level 0.32327031 3.917540e-63
## 18 Filler_Speed -0.02456179 2.183784e-01
## 19 Temperature -0.15824135 8.592733e-16
## 20 Usage_cont -0.31841555 1.842646e-61
## 21 Carb_Flow 0.15737137 1.089641e-15
## 22 Density 0.07843973 6.938385e-05
## 23 MFR -0.00871805 6.721382e-01
## 24 Balling 0.06527259 9.363280e-04
## 25 Pressure_Vacuum 0.22053722 1.201970e-29
## 26 Oxygen_Filler 0.16798960 1.234045e-17
## 27 Bowl_Setpoint 0.34911980 2.063873e-74
## 28 Pressure_Setpoint -0.31101906 2.010989e-58
## 29 Air_Pressurer -0.01365873 4.891116e-01
## 30 Alch_Rel 0.14893852 3.609012e-14
## 31 Carb_Rel 0.16414447 6.477984e-17
## 32 Balling_Lvl 0.10047517 3.395271e-07
The correlations between pH and the other variables reveal that Bowl Setpoint, Filler Level, and Pressure Vacuum have the strongest positive associations, while Mnf Flow, Pressure Setpoint, and Usage Cont show the strongest negative correlations; all of these are statistically significant with very low p-values, indicating that pH is influenced by key operational variables. Variables like Carb Temp, Filler Speed, and Air Pressurer show weak or non-significant relationships, suggesting minimal individual influence on pH.
# Drop non-numeric columns and rows with missing PH
data_clean <- train_raw %>%
select_if(is.numeric) %>%
drop_na(PH)
# Compute correlation with PH
correlations <- data_clean %>%
select(-PH) %>%
summarise_all(~cor(.x, data_clean$PH, use = "complete.obs")) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Correlation") %>%
arrange(desc(abs(Correlation)))
# Barplot of correlations
ggplot(correlations, aes(x = reorder(Variable, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Correlation of Variables with PH",
x = "Variable",
y = "Correlation Coefficient") +
theme_minimal()
The bar plot visualizes the strength and direction of correlation between manufacturing variables and pH. At a high level, Bowl Setpoint, Filler Level, and Pressure Vacuum have the strongest positive correlations with pH, suggesting that higher values in these variables are typically associated with higher pH. Conversely, variables like Mnf Flow, Pressure Setpoint, and the hydraulic pressures show the most negative correlations, indicating that increases in these are associated with lower pH values. Most other variables exhibit weak or negligible correlations, implying limited individual influence on pH. This insight helps prioritize which process parameters to monitor or adjust when targeting pH control.
Relationships, Distributions and Checking Outliers
# Keep only numeric columns and drop rows with missing PH
data_numeric <- train_raw %>%
select(where(is.numeric)) %>%
drop_na(PH)
# Get all column names except PH
predictors <- setdiff(names(data_numeric), "PH")
# Create scatter plots for all numeric variables against PH (excluding PH itself)
# Generate scatter plots where color is mapped to PH
plots <- map(predictors, function(var) {
ggplot(data_numeric, aes(x = .data[[var]], y = PH, color = PH)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
scale_color_gradient(low = "blue", high = "red") +
labs(title = paste("PH vs", var), x = var, y = "PH") +
theme_minimal()
})
# Display first 6 plots
grid.arrange(grobs = plots[1:6], ncol = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# plot histogram for only continuous data
plot_histogram(train_raw)
Some features, like Carb Pressure, Fill Ounces, and PC Volume, display near-normal distributions, indicating stable process control. Others, such as Oxygen Filler, PSC CO2, Carb Flow, and Usage Cont, are heavily right-skewed, with long tails suggesting sporadic high values or potential outliers.
# Select only numeric columns
numeric_data <- train_raw %>% select(where(is.numeric))
# Convert data to long format for ggplot
long_data <- numeric_data %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Generate boxplots with facets
ggplot(long_data, aes(x = "", y = Value)) +
geom_boxplot(fill = "skyblue", outlier.alpha = 0.3) +
facet_wrap(~ Variable, scales = "free_y") +
labs(title = "Boxplots of All Numeric Variables", x = NULL, y = "Value") +
theme_minimal() +
theme(axis.text.x = element_blank(),
strip.text = element_text(size = 8))
## Warning: Removed 844 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot grid highlights the distribution and outliers across all numeric variables. Several features, such as MFR, Hydraulic Pressures, Filler Speed, Oxygen Filler, and PSC CO2, display a high number of outliers, indicating measurement variability or extreme values. In contrast, variables like pH, Carb Volume, and Fill Ounces are tightly clustered with minimal deviation, suggesting strong process control.
The findCorrelation() function from caret computes the absolute pairwise correlations between predictors and flags those exceeding the cutoff value. For each highly correlated pair, it considers each predictor’s mean absolute correlation with all other predictors and recommends removing the one with the larger mean. We don’t have to eliminate these predictors, but it gives an idea of which ones carry the most redundant information.
# Correlation coefficients on complete cases (computed before imputation)
cor_coef <- cor(train_raw %>% select(-Brand_Code), use = "complete.obs")
## Create a corplot to visualize the correlations
## order the correlations into clusters or regions of high correlations
corrplot::corrplot(cor_coef, order = "hclust")
# Find high correlations
## Returns column numbers denoting the predictors that are recommended for
## deletion using the findCorrelation() function in the caret package
highCorr <- findCorrelation(cor_coef, cutoff = .80, exact = TRUE)
## Get names of the flagged columns. Index into cor_coef rather than
## train_raw, since Brand_Code was dropped before computing the matrix
## (indexing train_raw directly would shift every name by one column).
colnames(cor_coef)[highCorr]
Another way of understanding the distribution of the data is to calculate the skewness and kurtosis. Skewness measures the asymmetry of a distribution, while kurtosis captures how heavy the tails are (i.e how many extreme values reside in the tails) relative to a normal curve. Highly skewed variables are more likely to harbor extreme values that can distort model training.
library(e1071) # for skewness / kurtosis
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
num_only <- select(train_raw, where(is.numeric))
# compute skewness / kurtosis for all numeric predictors except PH
sk_kurt <- tibble(
Feature = setdiff(names(num_only), c("PH")),
Skewness = apply(select(num_only, -PH), 2, skewness, na.rm = TRUE),
Kurtosis = apply(select(num_only, -PH), 2, kurtosis, na.rm = TRUE)
) %>% arrange(desc(Skewness))
head(sk_kurt, 5) # top positively skewed
## # A tibble: 5 × 3
## Feature Skewness Kurtosis
## <chr> <dbl> <dbl>
## 1 Oxygen_Filler 2.66 11.1
## 2 Temperature 2.39 10.2
## 3 Air_Pressurer 2.25 4.73
## 4 PSC_CO2 1.73 3.73
## 5 PSC_Fill 0.933 0.769
tail(sk_kurt, 5) # most negatively skewed
## # A tibble: 5 × 3
## Feature Skewness Kurtosis
## <chr> <dbl> <dbl>
## 1 Filler_Level -0.848 0.0460
## 2 Bowl_Setpoint -0.974 -0.0564
## 3 Carb_Flow -0.988 -0.583
## 4 Filler_Speed -2.87 6.71
## 5 MFR -5.09 30.5
We see Oxygen_Filler, Temperature, and Air_Pressurer have the largest positive skew; Filler_Speed and MFR are the most negatively skewed.
An observation beyond \(Q_1 - 1.5 \times IQR\) or \(Q_3 + 1.5 \times IQR\) is flagged as an extreme value.
# Count values falling outside the 1.5*IQR whiskers of a numeric vector
count_outliers <- function(x) {
x <- na.omit(x)
iqr <- IQR(x)
q1 <- quantile(x, .25)
q3 <- quantile(x, .75)
sum(x < q1 - 1.5*iqr | x > q3 + 1.5*iqr)
}
# Tabulate outlier counts per predictor (one row per feature, sorted descending)
outlier_tbl <- tibble(
  Feature = names(select(num_only, -PH)),
  Outlier_Count = map_int(select(num_only, -PH), count_outliers)
) %>%
  arrange(desc(Outlier_Count))
head(outlier_tbl, 10)
The same trio (MFR, Filler_Speed, Oxygen_Filler) tops the outlier‑count list, validating our earlier focus.
Rather than deleting rows, we convert extreme tails (beyond 1.5×IQR) in MFR, Filler_Speed, and Oxygen_Filler to NA. The focus was on handling outliers only in these three features, as they had the most extreme outliers. The recoded NA values will be imputed in the next step.
trim_cols <- c("MFR", "Filler_Speed", "Oxygen_Filler")
for (col in trim_cols) {
iqr <- IQR(train_raw[[col]], na.rm = TRUE)
bounds <- quantile(train_raw[[col]], c(.25, .75), na.rm = TRUE) + c(-1, 1) * 1.5 * iqr
train_raw[[col]] <- ifelse(between(train_raw[[col]], bounds[1], bounds[2]),
train_raw[[col]], NA_real_)
}
studentData_outliers_rm <- train_raw # keep legacy name for downstream code
# 5a· kNN on outlier‑removed frame ---------------------------
set.seed(100)
students_cleaned <- studentData_outliers_rm %>%
kNN(k = 5) %>%
select(1:33) # drop VIM appended columns
# 5b· kNN on original raw frame ------------------------------
set.seed(101)
students_naFilled <- train_raw %>%
kNN(k = 5) %>%
select(1:33)
set.seed(102)
smote_res <- SMOTE(students_naFilled[,-1], target = students_naFilled$Brand_Code,
K = 5, dup_size = 2)
students_balanced <- smote_res$data %>%
rename(Brand_Code = class)
ggplot(students_balanced, aes(Brand_Code, fill = Brand_Code)) +
  geom_bar() +
  labs(title = "Balanced Brand_Code after SMOTE", y = "Count", x = NULL) +
  geom_text(stat = "count", aes(label = after_stat(count)),
            vjust = 1.4, colour = "white") +
  theme(legend.position = "none")
Using SMOTE partially rebalanced the data, adding synthetic observations for Brand_Code A (recoded as 1) and Brand_Code C (recoded as 3).
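As a quick check (a minimal sketch using the objects created above), comparing class counts before and after SMOTE makes the partial rebalancing explicit:

# Hedged check: Brand_Code class counts before vs. after SMOTE
table(students_naFilled$Brand_Code)
table(students_balanced$Brand_Code)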
students_balanced <- dummy_cols(students_balanced, select_columns = "Brand_Code") %>%
select(-Brand_Code)
set.seed(200)
train_test_split <- initial_split(students_balanced, prop = 0.7)
train_df <- training(train_test_split) %>% drop_na()
test_df <- testing(train_test_split) %>% drop_na()
x_train <- select(train_df, -PH)
y_train <- train_df$PH
x_test <- select(test_df, -PH)
y_test <- test_df$PH
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
The exploratory data analysis revealed several factors that guided model selection for predicting pH. The weak linear correlations and the presence of heavy outliers argued against linear models, as these would not capture the true relationship between the predictors and pH. With the goal of understanding which predictors most influence pH, a strategy that narrows the predictors through feature selection was also an important consideration. An initial round of base models was fitted to evaluate and confirm which models were better suited to the data and our goal.
run_models <- function(data, target = "ph") {
library(caret)
library(randomForest)
library(xgboost)
library(e1071)
library(dplyr)
set.seed(123)
# Step 1: Impute missing values and scale data
preproc <- preProcess(data, method = c("medianImpute", "center", "scale"))
data_clean <- predict(preproc, newdata = data)
# Step 2: Train/test split
train_index <- createDataPartition(data_clean[[target]], p = 0.8, list = FALSE)
train_data <- data_clean[train_index, ]
test_data <- data_clean[-train_index, ]
# Extract x and y for models that need matrix input
x_train <- train_data %>% select(-all_of(target)) %>% as.matrix()
y_train <- train_data[[target]]
x_test <- test_data %>% select(-all_of(target)) %>% as.matrix()
y_test <- test_data[[target]]
results <- list()
## 1. Linear Regression
lm_model <- lm(as.formula(paste(target, "~ .")), data = train_data)
lm_preds <- predict(lm_model, newdata = test_data)
results$Linear_Regression <- postResample(lm_preds, y_test)
## 2. Random Forest
rf_model <- randomForest(as.formula(paste(target, "~ .")), data = train_data)
rf_preds <- predict(rf_model, newdata = test_data)
results$Random_Forest <- postResample(rf_preds, y_test)
## 3. XGBoost
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest <- xgb.DMatrix(data = x_test, label = y_test)
xgb_model <- xgboost(data = dtrain, nrounds = 100, objective = "reg:squarederror", verbose = 0)
xgb_preds <- predict(xgb_model, newdata = dtest)
results$XGBoost <- postResample(xgb_preds, y_test)
## 4. SVR
svr_model <- svm(x = x_train, y = y_train, type = "eps-regression", kernel = "radial")
svr_preds <- predict(svr_model, x_test)
results$SVR <- postResample(svr_preds, y_test)
## Print all results
for (model in names(results)) {
cat("Model:", model, "\n")
print(results[[model]])
cat("\n")
}
return(invisible(results))
}
cleaned_data <- janitor::clean_names(students_balanced)
model_results <- run_models(cleaned_data)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'xgboost' was built under R version 4.3.3
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
## Model: Linear_Regression
## RMSE Rsquared MAE
## 0.7861197 0.3636575 0.6120384
##
## Model: Random_Forest
## RMSE Rsquared MAE
## 0.5486348 0.7026078 0.3833599
##
## Model: XGBoost
## RMSE Rsquared MAE
## 0.5877401 0.6449252 0.3906336
##
## Model: SVR
## RMSE Rsquared MAE
## 0.6580015 0.5538357 0.4538674
# Convert results to a tidy data frame
performance_df <- bind_rows(model_results, .id = "Model") %>%
pivot_longer(cols = -Model, names_to = "Metric", values_to = "Value")
# Plot
ggplot(performance_df, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison: RMSE, R², MAE",
y = "Metric Value",
x = "Model") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
As expected, the linear model had the poorest performance, with the lowest R² and the highest RMSE. Non-linear models were selected to move forward, with the aim of improving accuracy through cross-validation and hyperparameter tuning. The final set of models tested was MARS, SVM, Random Forest, XGBoost, and Cubist. The model with the highest R² was used to predict pH for the new data (eval_raw).
A flexible spline‑based regression that automatically selects knots and interactions.
set.seed(201)
mars_grid <- expand.grid(degree = 1:4, nprune = seq(5,50,5))
mars_fit <- train(x_train, y_train,
method = "earth",
tuneGrid = mars_grid,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
Captures non‑linear relationships by projecting data into a high‑dimensional feature space.
set.seed(202)
svm_fit <- train(x_train, y_train,
method = "svmRadial",
tuneLength = 14,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
set.seed(203)
rf_grid <- expand.grid(mtry = c(12,14,16)) # Number of predictors sampled at each split
rf_fit <- train(x_train, y_train,
method = "rf",
tuneLength = 10,
tuneGrid = rf_grid,
trControl = ctrl,
metric = "Rsquared")
set.seed(204)
xgb_grid <- expand.grid(
nrounds = seq(from = 200, to = 500, by = 50),
eta = c(0.1, 0.3),
max_depth = c(6, 8),
gamma = c(0, 0.5,1),
colsample_bytree = c(0.6, 0.8, 1),
min_child_weight = 1,
subsample = 1
)
xgb_fit <- train(x = x_train,
y = y_train,
method = "xgbTree",
# Use R-squared to select the best model
metric = "Rsquared",
# Use 10 fold cross validation and Grid search
trControl = ctrl,
tuneGrid = xgb_grid,
verbose = FALSE,
verbosity = 0
)
set.seed(205)
cub_grid <- expand.grid(committees = c(1,10,50,100), neighbors = c(0,1,5,9))
cubist_fit <- train(x_train, y_train,
method = "cubist",
tuneGrid = cub_grid,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
get_metrics <- function(fit, x, y) {
p <- as.numeric(predict(fit, x))
tibble(RMSE = RMSE(p, y),
Rsq = R2(p, y),
MAE = MAE(p, y))
}
metrics_tbl <- bind_rows(
get_metrics(mars_fit, x_test, y_test),
get_metrics(svm_fit, x_test, y_test),
get_metrics(rf_fit, x_test, y_test),
get_metrics(xgb_fit, x_test, y_test),
get_metrics(cubist_fit,x_test, y_test)
) %>%
mutate(Model = c("MARS","SVM","Random Forest","XGB","Cubist"), .before = 1)
#knitr::kable(metrics_tbl, digits = 3)
metrics_tbl
## # A tibble: 5 × 4
## Model RMSE Rsq MAE
## <chr> <dbl> <dbl> <dbl>
## 1 MARS 1.39 0.000614 0.153
## 2 SVM 0.112 0.588 0.0757
## 3 Random Forest 0.0994 0.695 0.0685
## 4 XGB 0.101 0.669 0.0699
## 5 Cubist 0.0930 0.717 0.0607
performance_plot_tbl <- metrics_tbl %>%
pivot_longer(cols = c(RMSE, Rsq, MAE), names_to = "Metric", values_to = "Value")
# Plot
ggplot(performance_plot_tbl, aes(x = Model, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(
title = "Model Performance Metrics on Held-Out Test Data",
subtitle = "Comparison of RMSE, R², and MAE for 5 Regression Models",
y = "Metric Value",
x = "Model"
) +
theme_minimal() +
scale_fill_brewer(palette = "Set1")
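Beyond held-out metrics, the cross-validated resampling distributions can also be compared (a hedged sketch using caret’s resamples(), assuming the five fitted objects above; the folds were not generated from a shared seed, so this is an approximate comparison):

# Hedged sketch: compare CV resampling distributions across the tuned models
res <- resamples(list(MARS = mars_fit, SVM = svm_fit, RF = rf_fit,
                      XGB = xgb_fit, Cubist = cubist_fit))
summary(res)
bwplot(res, metric = "Rsquared")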
cubist_imp <- varImp(cubist_fit)$importance %>%
tibble::rownames_to_column("Feature") %>%
arrange(desc(Overall)) %>%
slice_head(n = 20)
ggplot(cubist_imp, aes(reorder(Feature, Overall), Overall)) +
geom_col(fill = "steelblue") + coord_flip() +
labs(title = "Cubist – Top 20", x = NULL, y = "Importance")
PH in the evaluation file is entirely NA, so we exclude it before imputation to avoid VIM’s subscript error. We then ensure that the dummy columns match the training matrix, adding any missing columns as zeroes in the correct order.
# 12a· kNN‑impute predictors (drop PH first)
eval_knn <- eval_raw %>%
select(-PH) %>% # remove all‑NA response
kNN(k = 5) %>%
select(-matches("(_imp$|_NAflag$)"))
# 12b· dummy‑encode Brand_Code
eval_dummy <- dummy_cols(eval_knn, select_columns = "Brand_Code") %>%
select(-Brand_Code)
# 12c· Align columns with training design matrix
missing_cols <- setdiff(names(x_train), names(eval_dummy))
if (length(missing_cols) > 0) {
eval_dummy[missing_cols] <- 0
}
# reorder to match training column order
eval_dummy <- eval_dummy[, names(x_train)]
# 12d· Predict pH using the Cubist model
pred_eval <- predict(cubist_fit, eval_dummy)
# 12e· Save predictions
write_xlsx(bind_cols(eval_raw, Predicted_PH = pred_eval), "ph_predictions.xlsx")
To conclude, the exploratory analysis of the ABC Beverage dataset demonstrates a complex relationship between process variables and pH, a crucial quality indicator in beverage production. While the majority of variables exhibit minimal missingness and stable distributions, a few features, such as Temperature, MFR, Oxygen Filler, and Filler Speed, show substantial skewness and outliers. This suggests that these features might be candidates for process improvement or sensor recalibration.
Both correlation coefficients and scatter plots support stronger linear relationships between pH and Bowl Setpoint, Filler Level, and Pressure Vacuum than with other predictors. On the other hand, factors like Air Pressurer and Filler Speed have little predictive value, which might make it possible to simplify the model without noticeably sacrificing accuracy.
It’s interesting to note that Brand Code’s categorical imbalance, in which Brand B dominates the dataset, raises questions about possible modeling bias. Testing whether pH behaves differently between brands is essential, particularly when applying the model to diverse production lines. To prevent skewed performance, brand-stratified sampling or brand interaction terms should be introduced in future modeling efforts.
Among the models we tested, Cubist gave the best fit (lowest RMSE and highest R²) by capturing nonlinear effects more effectively than the linear and tree-based alternatives. To improve on this, we could:
- Tune Cubist’s main settings (number of committees, neighbors, and rule-pruning levels).
- Check residuals for any consistent errors and use partial-dependence plots to see how each predictor really affects pH.
- Apply simple transformations (log or Box–Cox) to the most skewed inputs.
- Add key interaction terms (for example, Temperature × Filler Speed).
- Try fitting Cubist separately for each brand to guard against bias and improve overall accuracy.
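The residual check listed above can be sketched immediately (a minimal, hedged example assuming the cubist_fit, x_test, and y_test objects from earlier): plotting residuals against predictions to look for systematic error.

# Hedged sketch: residual diagnostics for the Cubist model on the test set
preds <- as.numeric(predict(cubist_fit, x_test))
resid_df <- tibble(Predicted = preds, Residual = y_test - preds)
ggplot(resid_df, aes(Predicted, Residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Cubist Residuals vs. Predicted pH",
       x = "Predicted pH", y = "Residual")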