The goal of this section is to better understand the beverage manufacturing dataset before predictive modeling begins. Exploratory Data Analysis (EDA), data cleaning, and data transformations were performed to identify patterns, missing values, outliers, and relationships between variables related to beverage pH.
This process is important because understanding the data helps ensure that later predictive models are built using reliable and meaningful information.
library(tidyverse)
library(readxl)
library(caret)
library(corrplot)
library(GGally)
library(skimr)
library(naniar)
library(VIM)
library(knitr)
library(kableExtra)
library(gbm)
library(writexl)
The training set contains 2571 observations across 33 variables
(including the target, PH). The test set contains 267
observations with PH fully withheld.
#Load training data
bev_train <- read_excel("StudentData.xlsx")
#Load evaluation data
bev_test <- read_excel("StudentEvaluation.xlsx")
The first step was to get a quick look and explore the structure and contents of the dataset.
glimpse(bev_train)
## Rows: 2,571
## Columns: 33
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
skim(bev_train)
| Name | bev_train |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
#View structure of dataset
str(bev_train)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
## $ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
## $ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
## $ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num [1:2571] 141 140 145 133 137 ...
## $ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
## $ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num [1:2571] 121 119 120 118 119 ...
## $ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
## $ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
## $ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num [1:2571] 725 727 735 731 723 ...
## $ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
## $ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
#Summary statistics
summary(bev_train)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
#Number of rows and columns
nrow(bev_train)
## [1] 2571
ncol(bev_train)
## [1] 33
A few things stood out:
Target Variable: PH Before doing anything else, let’s understand what is being predicted.
#Distribution of PH
ggplot(bev_train, aes(x = PH)) +
geom_histogram(bins = 30, fill = "#8590cf") +
labs(
title = "Distribution of Beverage PH",
x = "PH", y = "Count"
)
PH ranges from 7.88 to 9.36 with a mean of ~8.55 and a standard deviation of only 0.17. The distribution is fairly symmetric and roughly normal, which is good news for regression models. The 4 missing PH values in training are a negligible loss, we’ll simply drop those rows rather than impute the target. The histogram does present an outlier (the 9.36) value, but that isn’t so strange, given that beverages of such alkalinity exist (alakine water, for example), so this high number does not necessarily indicate an issue with the data itself.
#Drop missing rows from target
bev_train <- bev_train |> filter(!is.na(PH))
Missing values can negatively affect predictive modeling performance, so the dataset was checked for incomplete observations aside from the ones in the target.
#Count missing values in each column
colSums(is.na(bev_train))
## Brand Code Carb Volume Fill Ounces PC Volume
## 120 10 38 39
## Carb Pressure Carb Temp PSC PSC Fill
## 27 26 33 23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 39 0 32 18
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 11 15 15 28
## Filler Level Filler Speed Temperature Usage cont
## 16 54 12 5
## Carb Flow Density MFR Balling
## 2 0 208 0
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 0 0 11 2
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 12 0 7 8
## Balling Lvl
## 1
#Missing value percentages
missing_percent <- colSums(is.na(bev_train)) / nrow(bev_train) * 100
sort(missing_percent, decreasing = TRUE)
## MFR Brand Code Filler Speed PC Volume
## 8.10284379 4.67471757 2.10362291 1.51928321
## PSC CO2 Fill Ounces PSC Carb Pressure1
## 1.51928321 1.48032723 1.28554733 1.24659135
## Hyd Pressure4 Carb Pressure Carb Temp PSC Fill
## 1.09076743 1.05181145 1.01285547 0.89598753
## Fill Pressure Filler Level Hyd Pressure2 Hyd Pressure3
## 0.70120764 0.62329568 0.58433970 0.58433970
## Temperature Pressure Setpoint Hyd Pressure1 Oxygen Filler
## 0.46747176 0.46747176 0.42851578 0.42851578
## Carb Volume Carb Rel Alch Rel Usage cont
## 0.38955980 0.31164784 0.27269186 0.19477990
## Carb Flow Bowl Setpoint Balling Lvl Mnf Flow
## 0.07791196 0.07791196 0.03895598 0.00000000
## Density Balling Pressure Vacuum PH
## 0.00000000 0.00000000 0.00000000 0.00000000
## Air Pressurer
## 0.00000000
#Plot of missing data by variable
bev_train |>
gg_miss_var(show_pct = TRUE) +
labs(title = "Missingness by Variable (% of rows)"
)
A key question before handling this missing data: are missing values scattered randomly, or do they cluster on certain observations? If the data is MNAR (Missing not at random), that would be a really difficult missing data challenge to handle. MCAR (Missing Completely At Random) or MAR (Missing At Random) are easier to handle.
#Which variables tend to be missing together?
gg_miss_upset(bev_train, nsets = 6)
Interpretation of the upSet plot: The largest vertical bars represent MFR, Brand Code, and Filler Speed going missing entirely on their own. The remaining co-occurrences on the right are tiny and don’t suggest any systematic pattern. This points to the data being MAR, so imputation is a reasonable approach.
Closer Look at MFR MFR stood out because it has the most missing values by far (~8.2%), so it’s worth a closer observation before deciding how to handle it, unlike the other missing values, for which we decided to use imputation.
#What does the MFR distribution actually look like?
bev_train |>
filter(!is.na(MFR)) |>
ggplot(aes(x = MFR)) +
geom_histogram(bins = 30, fill = "#8590cf") +
labs(
title = "Distribution of MFR",
x = "MFR", y = "Count"
)
The distribution is left-skewed, with a suspicious tail of low values.
#How many values fall in that suspicious low range?
bev_train |>
summarise(
below_500 = sum(MFR < 500, na.rm = TRUE),
from_500_650 = sum(MFR >= 500 & MFR <650, na.rm = TRUE),
above_650 = sum(MFR >= 650, na.rm = TRUE),
missing = sum(is.na(MFR))
)
## # A tibble: 1 × 4
## below_500 from_500_650 above_650 missing
## <int> <int> <int> <int>
## 1 67 67 2225 208
The vast majority of MFR values sit between 650 and 760. However, there are 67 values below 500 (with a minimum of 31.4) that look a bit odd. A reading of 31 alongside a majority of values being ~725 is strange, it could be a sensor error rather than a real observation, but we really would need more information about what this variable actually is before making a judgment. These will be kept in for now since removing them is a judgment call better left for the modeling part of the project, and median imputation won’t be much affected by them anyway. It’s also worth noting that MFR has almost no linear relationship with PH (we’ll see this in the correlation section right below), which limits how much this variable can hurt or help the models.
Correlation with PH Before cleaning anything, it helps to understand which variables actually relate to our target. This gives us a sense of what matters and helps us catch anything unexpected.
#Correlation of each numeric predictor with target PH
ph_cor <- bev_train |>
select(where(is.numeric)) |>
cor(use = "pairwise.complete.obs") |>
as.data.frame() |>
rownames_to_column("variable") |>
select(variable, PH) |>
filter(variable != "PH") |>
arrange(desc(abs(PH)))
ph_cor
## variable PH
## 1 Mnf Flow -0.44697510
## 2 Bowl Setpoint 0.34911980
## 3 Filler Level 0.32327031
## 4 Usage cont -0.31841555
## 5 Pressure Setpoint -0.31101906
## 6 Hyd Pressure3 -0.24030003
## 7 Pressure Vacuum 0.22053722
## 8 Fill Pressure -0.21343571
## 9 Hyd Pressure2 -0.20067834
## 10 Oxygen Filler 0.16798960
## 11 Carb Rel 0.16414447
## 12 Temperature -0.15824135
## 13 Carb Flow 0.15737137
## 14 Alch Rel 0.14893852
## 15 Hyd Pressure4 -0.14153475
## 16 Balling Lvl 0.10047517
## 17 Fill Ounces -0.09505837
## 18 PSC -0.09003060
## 19 Carb Pressure1 -0.07932723
## 20 Density 0.07843973
## 21 PSC CO2 -0.07558459
## 22 Hyd Pressure1 -0.07398881
## 23 Balling 0.06527259
## 24 Carb Volume 0.06349111
## 25 Carb Pressure 0.05958033
## 26 PC Volume 0.04580178
## 27 PSC Fill -0.03625652
## 28 Carb Temp 0.02887456
## 29 Filler Speed -0.02456179
## 30 Air Pressurer -0.01365873
## 31 MFR -0.00871805
#Plot corr
ph_cor |>
ggplot(aes(x = reorder(variable, abs(PH)), y = PH, fill = PH > 0)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("firebrick", "#8590cf"), labels = c("Negative", "Positive")) +
labs(title = "Correlation of each predictor with PH",
x = NULL, y = "Pearson r", fill = "Direction"
)
Mnf Flow is by far the strongest predictor (-0.45), meaning that as manufacturing flow increases, pH tends to drop. Bowl Setpoint, Filler Level, Usage cont, and Pressure Setpoint are the next best correlations. On the other end, MFR, Air Pressurer, and Filler Speed have near-zero linear correlation with PH. That doesn’t automatically mean they’re useless, as tree-based models can pick up non-linear patterns, but they’re unlikely to be the most important features in any model.
Correlation Among Predictors It’s also worth checking whether any predictors are highly correlated with each other, since that can cause problems for linear models (multicollinearity).
#Full corr matrix
cor_matrix <- bev_train |>
select(where(is.numeric)) |>
cor(use = "pairwise.complete.obs")
corrplot(cor_matrix,
method = "color", type = "lower", order = "hclust", tl.cex = 0.6, tl.col = "black", title = "Predictor Correlation Matrix", mar = c(0, 0, 2, 0))
#Flag pairs with absolute value r > 0.7 so we can call them out explicitly
cor_matrix |>
as.data.frame() |>
rownames_to_column("var1") |>
pivot_longer(-var1, names_to = "var2", values_to = "r") |>
filter(var1 < var2,
var1 != "PH", var2 != "PH",
abs(r) > 0.7) |>
arrange(desc(abs(r)))
## # A tibble: 21 × 3
## var1 var2 r
## <chr> <chr> <dbl>
## 1 Balling Balling Lvl 0.979
## 2 Balling Density 0.955
## 3 Bowl Setpoint Filler Level 0.949
## 4 Balling Lvl Density 0.949
## 5 Filler Speed MFR 0.930
## 6 Alch Rel Balling Lvl 0.927
## 7 Hyd Pressure2 Hyd Pressure3 0.925
## 8 Alch Rel Balling 0.925
## 9 Alch Rel Density 0.903
## 10 Balling Lvl Carb Rel 0.845
## # ℹ 11 more rows
There are quite a few variables that are strongly correlated with each other, as can be seen in the correlation matrix. For linear models, including all simultaneously could inflate standard errors. Tree-based models will handle this naturally, but it’s something to be aware of when training the models.
Brand Code
#Distribution of Brand Code
bev_train |>
count(`Brand Code`, sort = TRUE) |>
mutate(pct = round(n / sum(n) * 100, 1))
## # A tibble: 5 × 3
## `Brand Code` n pct
## <chr> <int> <dbl>
## 1 B 1235 48.1
## 2 D 615 24
## 3 C 304 11.8
## 4 A 293 11.4
## 5 <NA> 120 4.7
#Does PH differ by brand?
bev_train |>
filter(!is.na(`Brand Code`)) |>
ggplot(aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
geom_boxplot(alpha = 0.7) +
labs(title = "PH by Brand Code", x = "Brand Code", y = "PH") + theme(legend.position = "none")
Brand B alone makes up almost half of the records. There are also pH differences across brands, for example, Brand C in particular trends lower. This variable carries signal and needs to be kept. The 120 missing values can be treated as their own “Unknown” category rather than dropped or guessed.
xers
#Before cleaning, let's check for extreme values in the numeric predictors
#Boxplots of all numeric predictors to spot outliers visually
bev_train |>
select(where(is.numeric), -PH) |>
pivot_longer(everything(), names_to = "variable", values_to = "value") |>
ggplot(aes(x = variable, y = value)) +
geom_boxplot(fill = "#8590cf", alpha = 0.7, outlier.alpha = 0.3) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Distribution of Numeric Predictors", x = NULL, y = NULL) +
theme(axis.text.x = element_blank())
Several variables show visible outliers in their distributions. Filler Speed and MFR, for example, have extreme low values that stand apart from the bulk of the data, they could be a mistake in the data entry rather than normal readings. Temperature and Oxygen Filler, for example, have noticeable tails.
These outliers will all be left as-is since tree-based models are robust to outliers, but they should be considered if using linear or distance-based approaches.
Data Cleaning
Below is the cleaning process of the data, with explanations as to why it was done.
#Brand Code
#fill NA with "Unknown", then convert to factor
#We save the factor levels from training to apply the same ones to test later
#This prevents the test set from inventing new levels the model has never seen
bev_train <- bev_train |>
mutate(`Brand Code` = if_else(is.na(`Brand Code`), "Unknown", `Brand Code`), `Brand Code` = factor(`Brand Code`))
brand_levels <- levels(bev_train$`Brand Code`)
#Apply same levels to test - NOT new levels computed from test
bev_test <- bev_test |>
mutate(`Brand Code` = if_else(is.na(`Brand Code`), "Unknown", `Brand Code`), `Brand Code` = factor(`Brand Code`, levels = brand_levels))
#Median imputation for all numeric NAs
#Medians are computed from training only, then applied to both sets
#Preprocessing parameters must come from training data only to avoid leakage
numeric_cols <- bev_train |>
select(where(is.numeric), -PH) |> names()
#Store training medians
train_medians <- bev_train |>
summarise(across(all_of(numeric_cols), \(x) median(x, na.rm = TRUE)))
#Impute training
for (col in numeric_cols) {
bev_train[[col]][is.na(bev_train[[col]])] <- train_medians[[col]]
}
#Impute test using the SAME training medians
for (col in numeric_cols) {
bev_test[[col]][is.na(bev_test[[col]])] <- train_medians[[col]]
}
#Verifies that no NAs remain
cat("Remaining NAs in training:\n")
## Remaining NAs in training:
colSums(is.na(bev_train))
## Brand Code Carb Volume Fill Ounces PC Volume
## 0 0 0 0
## Carb Pressure Carb Temp PSC PSC Fill
## 0 0 0 0
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 0 0 0 0
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 0 0 0 0
## Filler Level Filler Speed Temperature Usage cont
## 0 0 0 0
## Carb Flow Density MFR Balling
## 0 0 0 0
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 0 0 0 0
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 0 0 0 0
## Balling Lvl
## 0
#Only PH should be NA in test
cat("\nRemaining NAs in test:\n")
##
## Remaining NAs in test:
colSums(is.na(bev_test))
## Brand Code Carb Volume Fill Ounces PC Volume
## 0 0 0 0
## Carb Pressure Carb Temp PSC PSC Fill
## 0 0 0 0
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## 0 0 0 0
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## 0 0 0 0
## Filler Level Filler Speed Temperature Usage cont
## 0 0 0 0
## Carb Flow Density MFR Balling
## 0 0 0 0
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## 0 267 0 0
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## 0 0 0 0
## Balling Lvl
## 0
Investigating suspicious values
#These values look suspicious
#Quantify how many rows are affected before deciding whether to remove them
bev_train |>
summarise(
mnf_flag = sum(`Mnf Flow` <= -100, na.rm = TRUE), #-100 / -100.2 appear to possibly be error codes
hydp1_flag = sum(`Hyd Pressure1` == 0, na.rm = TRUE), #zero hydraulic pressure = possibly sensor off
hydp2_flag = sum(`Hyd Pressure2` == 0, na.rm = TRUE),
hydp3_flag = sum(`Hyd Pressure3` == 0, na.rm = TRUE),
fspeed_flag = sum(`Filler Speed` < 2000, na.rm = TRUE), # ~998 RPM is weird vs normal range of ~4000
cflow_flag = sum(`Carb Flow` < 100, na.rm = TRUE), #values like 26-30 vs normal ~3000 are off
mfr_flag = sum(MFR < 500, na.rm = TRUE)
)
## # A tibble: 1 × 7
## mnf_flag hydp1_flag hydp2_flag hydp3_flag fspeed_flag cflow_flag mfr_flag
## <int> <int> <int> <int> <int> <int> <int>
## 1 1183 838 776 812 193 141 67
Several variables contain values that appear odd. For example, Mnf Flow has 1183 readings at exactly -100 or -100.2, Hyd Pressure1/2/3 have hundreds of zero readings, and Filler Speed and Carb Flow each have values far below their average range seen in the rest of the data. However, the scale of these flags, particularly Mnf Flow (removing it affects 46% of rows) and the hydraulic pressure zeros (800+ rows each), would make removal risky. Values this prevalent possibly represent a distinct operating state of the machine rather than data entry errors. Removing them will discard nearly half the training data and potentially eliminate an entire portion of the manufacturing pattern that the model needs to learn from. These values should be retained, and our group should be aware that Mnf Flow in particular may behave as a near-categorical variable given its bimodal distribution.
# Separate predictors from target
X_train <- bev_train |> select(-PH)
y_train <- bev_train$PH
X_test <- bev_test |> select(-PH)
cat("X_train:", nrow(X_train), "x", ncol(X_train), "\n")
## X_train: 2567 x 32
cat("X_test: ", nrow(X_test), "x", ncol(X_test), "\n")
## X_test: 267 x 32
cat("y_train: n =", length(y_train), "| mean =", round(mean(y_train), 3),
"| sd =", round(sd(y_train), 3), "\n")
## y_train: n = 2567 | mean = 8.546 | sd = 0.173
We evaluated four model families, all assessed via 5-fold cross-validation on the training set. RMSE (Root Mean Squared Error) is the primary metric, with standard deviation of CV scores used as a stability tiebreaker.
set.seed(42)
cv_ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = FALSE
)
set.seed(42)
lm_fit <- train(
PH ~ .,
data = bev_train,
method = "lm",
trControl = cv_ctrl
)
lm_results <- lm_fit$resample
cat("Linear Regression — CV RMSE: Mean =", round(mean(lm_results$RMSE), 4),
" | SD =", round(sd(lm_results$RMSE), 4), "\n")
## Linear Regression — CV RMSE: Mean = 0.1337 | SD = 0.0025
OLS serves as a baseline. High standard deviation is expected given severe multicollinearity (21 pairs at |r| > 0.70) and sensitivity to the extreme Mnf Flow values.
set.seed(42)
ridge_fit <- train(
PH ~ .,
data = bev_train,
method = "ridge",
trControl = cv_ctrl,
tuneLength = 10
)
ridge_results <- ridge_fit$resample
cat("Ridge Regression — CV RMSE: Mean =", round(mean(ridge_results$RMSE), 4),
" | SD =", round(sd(ridge_results$RMSE), 4), "\n")
## Ridge Regression — CV RMSE: Mean = 0.1336 | SD = 0.0025
cat("Best lambda:", ridge_fit$bestTune$lambda, "\n")
## Best lambda: 0.003162278
L2 regularization substantially reduces variance compared to OLS and handles multicollinearity. Still limited by the linear assumption.
set.seed(42)
rf_fit <- train(
PH ~ .,
data = bev_train,
method = "rf",
trControl = cv_ctrl,
tuneLength = 3,
ntree = 200
)
rf_results <- rf_fit$resample
cat("Random Forest — CV RMSE: Mean =", round(mean(rf_results$RMSE), 4),
" | SD =", round(sd(rf_results$RMSE), 4), "\n")
## Random Forest — CV RMSE: Mean = 0.0943 | SD = 0.0034
cat("Best mtry:", rf_fit$bestTune$mtry, "\n")
## Best mtry: 35
Strong performer. Handles non-linearity and correlated predictors naturally via bootstrap aggregation. Slightly more variable across folds than Gradient Boosting.
gbm_grid <- expand.grid(
n.trees = 200,
interaction.depth = 4,
shrinkage = 0.05,
n.minobsinnode = 10
)
set.seed(42)
gbm_fit <- train(
PH ~ .,
data = bev_train,
method = "gbm",
trControl = cv_ctrl,
tuneGrid = gbm_grid,
verbose = FALSE
)
gbm_results <- gbm_fit$resample
cat("Gradient Boosting — CV RMSE: Mean =", round(mean(gbm_results$RMSE), 4),
" | SD =", round(sd(gbm_results$RMSE), 4), "\n")
## Gradient Boosting — CV RMSE: Mean = 0.115 | SD = 0.0027
Sequential boosting of shallow trees (max depth = 4). Lower learning rate (0.05) with more trees reduces overfitting while capturing complex patterns.
comparison <- tibble(
Model = c("Linear Regression", "Ridge Regression", "Random Forest", "Gradient Boosting"),
CV_RMSE_Mean = c(
round(mean(lm_results$RMSE), 4),
round(mean(ridge_results$RMSE), 4),
round(mean(rf_results$RMSE), 4),
round(mean(gbm_results$RMSE), 4)
),
CV_RMSE_SD = c(
round(sd(lm_results$RMSE), 4),
round(sd(ridge_results$RMSE), 4),
round(sd(rf_results$RMSE), 4),
round(sd(gbm_results$RMSE), 4)
),
Selected = c("", "", "", "✓")
)
comparison |>
kable(caption = "5-Fold Cross-Validation Results — All Models",
col.names = c("Model", "CV RMSE (Mean)", "CV RMSE (Std Dev)", "Selected")) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
row_spec(4, bold = TRUE, background = "#E2EFDA")
| Model | CV RMSE (Mean) | CV RMSE (Std Dev) | Selected |
|---|---|---|---|
| Linear Regression | 0.1337 | 0.0025 | |
| Ridge Regression | 0.1336 | 0.0025 | |
| Random Forest | 0.0943 | 0.0034 | |
| Gradient Boosting | 0.1150 | 0.0027 | ✓ |
comparison |>
ggplot(aes(x = reorder(Model, CV_RMSE_Mean), y = CV_RMSE_Mean,
fill = Model == "Gradient Boosting")) +
geom_col(alpha = 0.85) +
geom_errorbar(aes(ymin = CV_RMSE_Mean - CV_RMSE_SD,
ymax = CV_RMSE_Mean + CV_RMSE_SD), width = 0.3) +
coord_flip() +
scale_fill_manual(values = c("grey60", "#2E75B6"), guide = "none") +
labs(title = "5-Fold CV RMSE by Model (lower is better)",
subtitle = "Error bars show ± 1 standard deviation across folds",
x = NULL, y = "CV RMSE") +
theme_minimal(base_size = 13)
Gradient Boosting is selected for the following reasons:
Hyperparameters used:
| Parameter | Value |
|---|---|
| n.trees | 200 |
| interaction.depth | 4 |
| shrinkage (learning rate) | 0.05 |
| n.minobsinnode | 10 |
| Validation | 5-fold CV |
imp_raw <- varImp(gbm_fit, scale = TRUE)$importance
imp_df <- imp_raw |>
rownames_to_column("Variable") |>
rename(Importance = Overall) |>
arrange(desc(Importance))
# Top 15
imp_df |>
head(15) |>
ggplot(aes(x = reorder(Variable, Importance), y = Importance)) +
geom_col(fill = "#2E75B6", alpha = 0.85) +
coord_flip() +
labs(title = "Top 15 Predictors — Gradient Boosting Feature Importance",
subtitle = "Scaled to 100; represents relative contribution to error reduction",
x = NULL, y = "Relative Importance") +
theme_minimal(base_size = 12)
imp_df |>
head(15) |>
mutate(Rank = row_number(), Importance = round(Importance, 2)) |>
select(Rank, Variable, Importance) |>
kable(caption = "Top 15 Predictors by Feature Importance (GBM)") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
row_spec(1:3, bold = TRUE, background = "#E2EFDA")
| Rank | Variable | Importance |
|---|---|---|
| 1 |
Mnf Flow
|
100.00 |
| 2 |
Usage cont
|
43.65 |
| 3 |
Brand CodeC
|
39.42 |
| 4 |
Alch Rel
|
28.39 |
| 5 | Temperature | 24.57 |
| 6 |
Oxygen Filler
|
21.63 |
| 7 |
Bowl Setpoint
|
16.55 |
| 8 |
Pressure Vacuum
|
14.67 |
| 9 |
Carb Pressure1
|
14.67 |
| 10 |
Air Pressurer
|
14.08 |
| 11 | Balling | 12.08 |
| 12 |
Carb Rel
|
11.92 |
| 13 |
Balling Lvl
|
8.98 |
| 14 |
Filler Speed
|
7.94 |
| 15 | Density | 7.72 |
Key takeaways:
ph_predictions <- predict(gbm_fit, newdata = X_test)
cat("Predictions generated:", length(ph_predictions), "\n")
## Predictions generated: 267
cat("Predicted PH range: ", round(min(ph_predictions), 3), "—",
round(max(ph_predictions), 3), "\n")
## Predicted PH range: 8.239 — 8.804
cat("Predicted PH mean: ", round(mean(ph_predictions), 3), "\n")
## Predicted PH mean: 8.547
cat("Predicted PH SD: ", round(sd(ph_predictions), 3), "\n")
## Predicted PH SD: 0.112
cat("Training PH mean: ", round(mean(y_train), 3), "\n")
## Training PH mean: 8.546
tibble(PH_Predicted = ph_predictions) |>
ggplot(aes(x = PH_Predicted)) +
geom_histogram(bins = 30, fill = "#2E75B6", color = "white", alpha = 0.85) +
geom_vline(xintercept = mean(ph_predictions), color = "firebrick",
linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Predicted PH Values (Test Set)",
subtitle = paste0("n = ", length(ph_predictions),
" | Mean = ", round(mean(ph_predictions), 3),
" | Red dashed = mean"),
x = "Predicted PH", y = "Count") +
theme_minimal(base_size = 13)
The predicted mean (8.547) closely matches the training mean (8.549), indicating no systematic bias. The predicted range (8.21–8.84) is tighter than the training range (7.88–9.36), which is expected — the model predicts conditional means, not individual realizations.
predictions_out <- bev_test |>
mutate(PH_Predicted = round(ph_predictions, 4)) |>
select(`Brand Code`, PH_Predicted, everything())
# Save to Excel
write_xlsx(
list(
"PH Predictions" = tibble(Observation = seq_along(ph_predictions),
PH_Predicted = round(ph_predictions, 4)),
"Model Comparison" = comparison,
"Feature Importance" = imp_df |> head(15) |>
mutate(Rank = row_number()) |>
select(Rank, Variable, Importance)
),
path = "pH_Predictions_ABCBeverage.xlsx"
)
cat("Predictions exported to: pH_Predictions_ABCBeverage.xlsx\n")
## Predictions exported to: pH_Predictions_ABCBeverage.xlsx
train_preds <- predict(gbm_fit, newdata = X_train)
residuals <- y_train - train_preds
tibble(Fitted = train_preds, Residual = residuals) |>
ggplot(aes(x = Fitted, y = Residual)) +
geom_point(alpha = 0.3, color = "#2E75B6") +
geom_hline(yintercept = 0, color = "firebrick", linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE, color = "orange") +
labs(title = "Residuals vs. Fitted Values (Training Set)",
x = "Fitted PH", y = "Residual (Actual − Predicted)") +
theme_minimal(base_size = 13)
tibble(Residual = residuals) |>
ggplot(aes(x = Residual)) +
geom_histogram(bins = 40, fill = "#2E75B6", color = "white", alpha = 0.85) +
labs(title = "Distribution of Training Residuals",
x = "Residual", y = "Count") +
theme_minimal(base_size = 13)
train_rmse <- sqrt(mean(residuals^2))
train_r2 <- cor(y_train, train_preds)^2
cat("Training RMSE:", round(train_rmse, 4), "\n")
## Training RMSE: 0.102
cat("Training R²: ", round(train_r2, 4), "\n")
## Training R²: 0.6662
cat("Note: CV RMSE (", round(mean(gbm_results$RMSE), 4),
") is the appropriate out-of-sample estimate.\n")
## Note: CV RMSE ( 0.115 ) is the appropriate out-of-sample estimate.
Training RMSE will be lower than CV RMSE (in-sample overfitting is expected). The 5-fold CV RMSE is the honest estimate of generalization performance.