DATA624 Project 2 Brewing Models

library(caret)
library(tidyverse)

Introduction

New regulations are requiring us to understand our manufacturing process and the predictive factors with modeling. We also need to choose an optimal predictive model of PH to report to our boss. We have been given historical data regarding our manufacturing process that has already been split into a training set of “StudentData.xlsx” and a test set of “StudentEvaluation.xlsx”.

This will be a technical report that showcases our process of tidying the data received and exploring it.

Data Preparation and Exploration

The first part of the our workflow is tidying our data to prepare it for analysis. Tidying involves loading the data and then preparing it to be in the correct format to generate predictive models for. While we are tidying our data we are also visualizing and exploring our variables and their relationships in order to best inform the model building process.

Load

We begin our tidying journey by loading in the data from the Excel files utilizing readxl’s read_xlsx() function. We see that we are dealing with a dataset containing 33 different columns where all but the brand code seem to be numeric variables. There are additionally 2,571 different samples within our training dataset. A good proportion relative to the amount of features we have.

require(readxl)

file <- "StudentData.xlsx"

train <- readxl::read_xlsx(path = file)

file <- "StudentEvaluation.xlsx"

test <- readxl::read_xlsx(path = file)

glimpse(train)

## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…

Looking up some choice column names such as balling, mnf flow, and intuiting alcohol level we understand that we are attempting to model the pH of a batch of beer depending on its manufacturing parameters.

We notice that the column names do not conform to tibble naming standards and utilize the janitor package’s clean_names function to clean up the names for us. This will make things much simpler down the line.

require(janitor)

train <- train %>% 
  janitor::clean_names()

test <- test %>% 
  janitor::clean_names()

colnames(train)

##  [1] "brand_code"        "carb_volume"       "fill_ounces"      
##  [4] "pc_volume"         "carb_pressure"     "carb_temp"        
##  [7] "psc"               "psc_fill"          "psc_co2"          
## [10] "mnf_flow"          "carb_pressure1"    "fill_pressure"    
## [13] "hyd_pressure1"     "hyd_pressure2"     "hyd_pressure3"    
## [16] "hyd_pressure4"     "filler_level"      "filler_speed"     
## [19] "temperature"       "usage_cont"        "carb_flow"        
## [22] "density"           "mfr"               "balling"          
## [25] "pressure_vacuum"   "ph"                "oxygen_filler"    
## [28] "bowl_setpoint"     "pressure_setpoint" "air_pressurer"    
## [31] "alch_rel"          "carb_rel"          "balling_lvl"

We begin transformation of the data by factorizing brand code to more accurately represent the fact that this is a categorical variable of four possible levels: “A”,“B”,“C”, or “D”.

train$brand_code <- as_factor(train$brand_code)

test$brand_code <- as_factor(test$brand_code)

levels(train$brand_code)

## [1] "B" "A" "C" "D"

Skim

Our next step is taking a high level overview of the dataset utilizing the powerful skim() tool from the package skimr. We are able to see the missing values contained within each column, the frequencies of the factors, and the distributions of each categorical feature.

We see that overall most rows have a miniscule percentage of missing values besides MFR. MFR has 9% of values missing from the column, but this is not enough missing values to consider dropping the row. We want to impute these missing values utilizing MICE imputation, but first we need to make sure that there isn’t a pattern in the values that are missing.

Looking at the brand codes we see that the frequency distribution is not degenerate with one value being almost unrepresented in the data or the highest count being exponentially larger than the second highest.

If we take a peek at the distributions we see that half of our explanatory variables seem to be roughly normally distributed, most of the other half skew towards either the left or the right direction, but there are a curious few categories such as pressure set point and MNF flow that appear to have explicit bimodal distributions. We will want to take a proper look at what we might be able to do to these to shift them closer to normality down the line.

require(skimr)

train |>
  skimr::skim()

Data summary
Name	train
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
factor	1
numeric	32
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
brand_code	120	0.95	FALSE	4	B: 1239, D: 615, C: 304, A: 293

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carb_volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
fill_ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
pc_volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
carb_pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
carb_temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
psc	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
psc_fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
psc_co2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
mnf_flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
carb_pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
fill_pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
hyd_pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
hyd_pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
hyd_pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
hyd_pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
filler_level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
filler_speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
usage_cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
carb_flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
mfr	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
pressure_vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
ph	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
oxygen_filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
bowl_setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
pressure_setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
air_pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
alch_rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
carb_rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
balling_lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

Evaluate Missing Values

The next step for pre-processing our data is to look at the missing data to see if there is a significant relationship between multiple columns having missing values together. We see that the vast majority of missing values occur with two or less rows. So most of our data is going to be missing at random rather than in a way that gives us information about other columns.

tibble(NAs = rowSums(is.na(train))) |>
  filter(NAs != 0) |>
  group_by(NAs) |>
  summarise(n = n()) |>
  ggplot(aes(x = NAs, y = n)) +
    geom_bar(stat="identity") +
    labs(title = "3+ combined missing values are rare", subtitle = "# NAs per row", y = NULL, x = NULL) +
    scale_x_continuous(limits=c(0, 13)) +
    theme_minimal()

Still we wouldn’t know if the missing values are related to index position from just the previous plot. If there were missing values with index position-based patterns then we would assume that there was some information hidden such as the manufacturing process being down during a time period where the data was being collected.

Utilizing the naniar package’s vis_miss function we are able to see that there do not seem to be index-based patterns. Lending more credence to the fact that the majority of this data is missing completely at random.

require(naniar)

naniar::vis_miss(train)

Although the majority of missing data is segmented to an individual feature for the row, we want to look at the breakdown of how rows that have multiple missing values align with which columns the value is missing from. If the more than a hundred rows which had two missing values were all coming from the same combination of columns, we would know that we were missing some meaning with the missing values.

We do notice some information in the fact that whenever filler speed seems to be missing MFR is also missing, indicating that we can not have an MFR recording if we do not have a filler speed recording. Thankfully this interaction falls within a subset of about 1% of the data, so it is not significant information. Still, keeping in mind our missing values do have some information contained within them lends more credence to utilizing mice as the imputation method. As mice takes into account missing value interactions when imputing values.

naniar::gg_miss_upset(train, 
                      nsets = naniar::n_var_miss(train)/2,
                      nintersects = 20)

Impute Missing Values

With most of our missing values deemed being missing at random, we take the opportunity to impute said values through the mice function from the mice package. Mice is a multiple imputation strategy that relies heavily on finding rows that are geometrically close in distance to the rows with missing values and deriving the imputed value from the mean of these neighbors. This predictive mean matching algorithm is used to find the missing values for each column except brand code. Brand code utilized a polynomial regression model from the other columns to determine the missing values.

require(mice)
set.seed(123)

imp <- mice::mice(train,
                  printFlag = FALSE)
train <- complete(imp)

imp <- mice::mice(test,
                  printFlag = FALSE)
test <- complete(imp)

colSums(is.na(train))

##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 0                 0                 0                 0 
##      filler_level      filler_speed       temperature        usage_cont 
##                 0                 0                 0                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                 0                 0 
##   pressure_vacuum                ph     oxygen_filler     bowl_setpoint 
##                 0                 0                 0                 0 
## pressure_setpoint     air_pressurer          alch_rel          carb_rel 
##                 0                 0                 0                 0 
##       balling_lvl 
##                 0

Analyze Distribution

Now that we have our missing values dealt with, we want to deal with the odd distributions that we spotted within the skim. Ensuring that our distributions are close to normality allows for keeping the variance constant which leads to better regression fits for most models. We plot the distributions in better resolution and combine it with a scatterplot of each variable to the respoonse to analyze them further. The DataExplorer’s packages of plot_histogram and plot_scatterplot make this process not so mind numbing.

It is here that we can break down our different variables into different classes:

Those that have enough of a constant covariance with the response variable of ph where transformation is not needed:

This includes brand_code, carb_pressure, carb_temp, fill_ounces, pc_volume, and pressure_vacuum.

Those which have their distribution shifted in a way that means covariance is noticeably inconsistent:

This includes ph, carb_volume, psc, psc_co2, psc_fill, carb_pressure1, fill_pressure, filler_speed, hyd_pressure4, mfr, oxygen_filler, temperature, usage_cont, carb_rel.

Those which have multiple distinct distributions and can be dealt with by changing the variable from continuous to discrete:

This includes hyd_pressure1, hyd_pressure2, hyd_pressure3, mnf_flow, balling, bowl_setpoint, carb_flow, density, air_pressurer, alch_rel, balling_lvl, and pressure_setpoint.

require(DataExplorer)

DataExplorer::plot_histogram(train,
                             ggtheme = theme_minimal())

DataExplorer::plot_scatterplot(train, 
                               by = "ph",
                               ggtheme = theme_minimal(),
                               geom_point_args = list(alpha = 0.2),
                               theme_config = list(axis.text.x = element_text(size = 5)))

Remove Outlier(s)

Our scatterplot breakdown also highlights an outlier within ph that stands outside of the 1.5*IQR range of the category being above 9.5 pH and having no other points near it anywhere. There is no convincing reason for this outlier to exist based on the trends within each category. Thus, we remove it from the dataset by replacing it with the value of the 90th ph quantile.

train <- train %>% 
  mutate(ph = if_else(ph > 9, quantile(train$ph,.9), ph))

# Since ph is missing from test, this doesn't do anything. Seems like good practice to do it anyways though.
test <- test %>% 
  mutate(ph = if_else(ph > 9, quantile(train$ph,.9), ph))

max(train$ph)

## [1] 8.94

Bin

Some of our variables seem like they contain multiple distinct distributions within their data. The way we will deal with this is to bin the variables into bins that attempt to classify a sample into each individual distribution.

Zero Modal

One of the simpler ways to bin distributions is to bin into those where the value is zero and the value is not zero. This works especially well in variables where the mode is zero and there are only two distinct distributions. This includes hyd_pressure1, hyd_pressure2, and hyd_pressure3.

After binning we can begin to see a slight relationship where the median pH increases if the hydraulic pressure stays at 0.

train <- train %>% 
  mutate(
    across(c(hyd_pressure1, hyd_pressure2, hyd_pressure3),
           ~as_factor(if_else(.x < 0.01, "zero", "above_zero"))
           )
  ) %>% 
  rename(hyd_pressure1_bin = hyd_pressure1, hyd_pressure2_bin = hyd_pressure2, hyd_pressure3_bin = hyd_pressure3)

test <- test %>% 
  mutate(
    across(c(hyd_pressure1, hyd_pressure2, hyd_pressure3),
           ~as_factor(if_else(.x < 0.01, "zero", "above_zero"))
           )
  ) %>% 
  rename(hyd_pressure1_bin = hyd_pressure1, hyd_pressure2_bin = hyd_pressure2, hyd_pressure3_bin = hyd_pressure3)

plot_boxplot(train %>% select(ph, hyd_pressure1_bin, hyd_pressure2_bin, hyd_pressure3_bin), by = "hyd_pressure1_bin", ggtheme = theme_minimal())

Dual Distribution

Although not as simple we can break down two distributions where one is not a zero modal distribution by finding a value that visually maximizes separation and utilizing that as the divider. We will follow this process for balling, density, and air_pressurer. The values simply get classified as low or high depending on their position relative to the breakpoint between distributions.

After binning we can begin to see a slight relationship where the median pH increases if the density gets high.

train <- train %>% 
  mutate(
    balling = as_factor(if_else(balling < 2.26, "low", "high")),
    density = as_factor(if_else(density < 1.31, "low", "high")),
    air_pressurer = as_factor(if_else(air_pressurer < 144.7, "low", "high"))) %>% 
  rename(balling_bin = balling, density_bin = density, air_pressurer_bin = air_pressurer)

test <- test %>% 
  mutate(
    balling = as_factor(if_else(balling < 2.26, "low", "high")),
    density = as_factor(if_else(density < 1.31, "low", "high")),
    air_pressurer = as_factor(if_else(air_pressurer < 144.7, "low", "high"))) %>% 
  rename(balling_bin = balling, density_bin = density, air_pressurer_bin = air_pressurer)

plot_boxplot(train %>% select(ph, density_bin), by = "density_bin", ggtheme = theme_minimal())

Triple Distribution

For those with three distributions apparent, we once again utilize visual separation to best bin. We will follow this process for mnf_flow, carb_flow, balling_lvl, and alch_rel.

train <- train %>% 
  mutate(
    mnf_flow = as_factor(case_when(
      mnf_flow < -2 ~ "negative",
      mnf_flow <= 2 ~ "zero",
      mnf_flow > 2 ~ "positive"
    )),
    carb_flow = as_factor(case_when(
      carb_flow < 400 ~ "low",
      carb_flow <= 2500 ~ "mid",
      carb_flow > 2500 ~ "high"      
    )),
    balling_lvl = as_factor(case_when(
      balling_lvl == 0 ~ "zero",
      balling_lvl < 2.5 ~ "low",
      balling_lvl >= 2.5 ~ "high"            
    )),
    alch_rel =as_factor( case_when(
      alch_rel < 6.8 ~ "low",
      alch_rel < 7.5 ~ "mid",
      alch_rel >= 7.5 ~ "high"          
    ))) %>% 
  rename(mnf_flow_bin = mnf_flow, carb_flow_bin = carb_flow, balling_lvl_bin = balling_lvl, alch_rel_bin = alch_rel)

test <- test %>% 
  mutate(
    mnf_flow = as_factor(case_when(
      mnf_flow < -2 ~ "negative",
      mnf_flow <= 2 ~ "zero",
      mnf_flow > 2 ~ "positive"
    )),
    carb_flow = as_factor(case_when(
      carb_flow < 400 ~ "low",
      carb_flow <= 2500 ~ "mid",
      carb_flow > 2500 ~ "high"      
    )),
    balling_lvl = as_factor(case_when(
      balling_lvl == 0 ~ "zero",
      balling_lvl < 2.5 ~ "low",
      balling_lvl >= 2.5 ~ "high"            
    )),
    alch_rel =as_factor( case_when(
      alch_rel < 6.8 ~ "low",
      alch_rel < 7.5 ~ "mid",
      alch_rel >= 7.5 ~ "high"          
    ))) %>% 
  rename(mnf_flow_bin = mnf_flow, carb_flow_bin = carb_flow, balling_lvl_bin = balling_lvl, alch_rel_bin = alch_rel)

plot_boxplot(train %>% select(ph, mnf_flow_bin), by = "mnf_flow_bin", ggtheme = theme_minimal())

After binning we’re able to see that as the flow changes the median pH changes a decent amount as well.

Multiple Discrete

These variables with more than three “distributions” are a bit special in that that they seem to be true discrete variables following consistent breaks between values. There’s also a hint about the discrete nature of these variables with how they include “setpoint” in the name. There are specific set points that these variables can exist as. The only exception being values that were imputed in from mice which can simply be rounded to a closer true value. These variables are pressure_setpoint and bowl_setpoint.

Pressure bowl_setpoint has breaks of 10 from 70 to 140, while pressure_setpoint has breaks of value 2 from 44 to 52. We use some clever combination rounding with division in order to enforce these for our imputed values.

train <- train %>% 
  mutate(pressure_setpoint = as_factor(2 * round(pressure_setpoint/2)),
         bowl_setpoint = as_factor(10 * round(bowl_setpoint/10))) %>% 
  rename(pressure_setpoint_bin = pressure_setpoint, bowl_setpoint_bin = bowl_setpoint)
  
test <- test %>% 
  mutate(pressure_setpoint = as_factor(2 * round(pressure_setpoint/2)),
         bowl_setpoint = as_factor(10 * round(bowl_setpoint/10))) %>% 
  rename(pressure_setpoint_bin = pressure_setpoint, bowl_setpoint_bin = bowl_setpoint)
  
plot_boxplot(train %>% select(ph, bowl_setpoint_bin), by = "bowl_setpoint_bin", ggtheme = theme_minimal())

It’s now easier to visualize how pH changes with either one of these variables.

Box-Cox Transform

We tackle the variables that seem they can be nudged towards normality by finding the optimal lambda to box-cox transform them. This includes ph, carb_volume, psc, psc_co2, psc_fill, carb_pressure1, fill_pressure, filler_speed, hyd_pressure4, mfr, oxygen_filler, temperature, usage_cont, and carb_rel.

For psc_co2 and psc_fill, which have zero values, we need to shift these columns to the right with a an addition of 0.001.

We showcase a list of the lambdas that we will be using for transforming these columns below:

train %>% 
  select(ph, carb_volume, psc, psc_co2, psc_fill, carb_pressure1, fill_pressure, filler_speed, hyd_pressure4, mfr, oxygen_filler, temperature, usage_cont, carb_rel) %>% 
  mutate(psc_co2 = psc_co2 + 0.001, psc_fill = psc_fill + 0.001) %>% 
  lapply(BoxCoxTrans) -> lambdas

lapply(lambdas, `[[`, "lambda")

## $ph
## [1] 2
## 
## $carb_volume
## [1] -2
## 
## $psc
## [1] 0.5
## 
## $psc_co2
## [1] 0.4
## 
## $psc_fill
## [1] 0.5
## 
## $carb_pressure1
## [1] 0.5
## 
## $fill_pressure
## [1] -0.6
## 
## $filler_speed
## [1] 2
## 
## $hyd_pressure4
## [1] -0.3
## 
## $mfr
## [1] 2
## 
## $oxygen_filler
## [1] 0.2
## 
## $temperature
## [1] -2
## 
## $usage_cont
## [1] 2
## 
## $carb_rel
## [1] -2

We apply each of these lambdas towards box-cox transformations of our training set of data in order to bring the distributions closer to normality. We then visualize these newly transformed columns and compare them to our untransformed data from before.

train <- train %>% 
  mutate(psc_co2_trans = predict(lambdas$psc_co2, psc_co2 + 0.001),
         psc_fill_trans = predict(lambdas$psc_fill, psc_fill + 0.001),
         ph_trans = predict(lambdas$ph, ph),
         carb_volume_trans = predict(lambdas$carb_volume, carb_volume),
         psc_trans = predict(lambdas$psc, psc),
         carb_pressure1_trans = predict(lambdas$carb_pressure1, carb_pressure1),
         fill_pressure_trans = predict(lambdas$fill_pressure, fill_pressure),
         filler_speed_trans = predict(lambdas$filler_speed, filler_speed),
         hyd_pressure4_trans = predict(lambdas$hyd_pressure4, hyd_pressure4),
         mfr_trans = predict(lambdas$mfr, mfr),
         oxygen_filler_trans = predict(lambdas$oxygen_filler, oxygen_filler),
         temperature_trans = predict(lambdas$temperature, temperature),
         usage_cont_trans = predict(lambdas$usage_cont, usage_cont),
         carb_rel_trans = predict(lambdas$carb_rel, carb_rel))

plot_histogram(train %>% 
                 select(c(ph_trans, carb_volume_trans, psc_trans, psc_co2_trans, psc_fill_trans, carb_pressure1_trans, fill_pressure_trans, filler_speed_trans, hyd_pressure4_trans, mfr_trans, oxygen_filler_trans, temperature_trans, usage_cont_trans, carb_rel_trans)))

It is this visualization that allows us to see that some transformations have had no noticeable effect and thus we should not use them. These categories include: carb_pressure1_trans, carb_rel_trans, carb_volume_trans, fill_pressure_trans, filler_speed_trans, hyd_pressure4_trans, ph_trans, psc_co2_trans, psc_fill_trans, temperature_trans, and usage_cont_trans.

The columns where the transformation has led to a visible shift towards normality include: filler_speed_trans, mfr_trans, oxygen_filler_trans, and psc_trans. We remove the transformations that we do not want to apply from the training dataframe and then also apply the transformations that we have found to be helpful to testing dataframe.

train <- train %>% 
  select(-c(psc, mfr, oxygen_filler, filler_speed, ph_trans, carb_volume_trans, psc_co2_trans, psc_fill_trans, carb_pressure1_trans, fill_pressure_trans, hyd_pressure4_trans, temperature_trans, usage_cont_trans, carb_rel_trans))

test <- test %>% 
  mutate(filler_speed_trans = predict(lambdas$filler_speed, filler_speed),
         psc_trans = predict(lambdas$psc, psc),
         mfr_trans = predict(lambdas$mfr, mfr),
         oxygen_filler_trans = predict(lambdas$oxygen_filler, oxygen_filler)) %>% 
  select(-c(psc, mfr, oxygen_filler, filler_speed))

glimpse(train)

## Rows: 2,571
## Columns: 33
## $ brand_code            <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, …
## $ carb_volume           <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667…
## $ fill_ounces           <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333…
## $ pc_volume             <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.11…
## $ carb_pressure         <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, …
## $ carb_temp             <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8,…
## $ psc_fill              <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, …
## $ psc_co2               <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, …
## $ mnf_flow_bin          <fct> negative, negative, negative, negative, negative…
## $ carb_pressure1        <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2,…
## $ fill_pressure         <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, …
## $ hyd_pressure1_bin     <fct> zero, zero, zero, zero, zero, zero, zero, zero, …
## $ hyd_pressure2_bin     <fct> zero, zero, zero, zero, zero, zero, zero, zero, …
## $ hyd_pressure3_bin     <fct> zero, zero, zero, zero, zero, zero, zero, zero, …
## $ hyd_pressure4         <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94…
## $ filler_level          <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4,…
## $ temperature           <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, …
## $ usage_cont            <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74,…
## $ carb_flow_bin         <fct> high, high, high, high, high, high, low, mid, hi…
## $ density_bin           <fct> low, low, high, high, high, high, low, low, low,…
## $ balling_bin           <fct> low, low, high, high, high, high, low, low, low,…
## $ pressure_vacuum       <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, …
## $ ph                    <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, …
## $ bowl_setpoint_bin     <fct> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120…
## $ pressure_setpoint_bin <fct> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, …
## $ air_pressurer_bin     <fct> low, low, low, high, high, high, high, high, hig…
## $ alch_rel_bin          <fct> low, low, high, mid, mid, mid, low, low, low, lo…
## $ carb_rel              <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, …
## $ balling_lvl_bin       <fct> low, low, high, high, high, high, low, low, low,…
## $ psc_trans             <dbl> -1.355019, -1.295727, -1.400000, -1.821115, -1.6…
## $ filler_speed_trans    <dbl> 8008001.5, 7944097.5, 8080199.5, 8048071.5, 8040…
## $ mfr_trans             <dbl> 262812.00, 264118.62, 270112.00, 266887.68, 2612…
## $ oxygen_filler_trans   <dbl> -2.669471, -2.590291, -2.628559, -2.520328, -2.5…

We have now finished the processing portion of this project, and will save the processed data in .csv format so it can be shared between the team.

train %>% 
  write_csv("StudentDataProcessed.csv")

test %>% 
  write_csv("StudentEvaluationProcessed.csv")

Correlate

With our data promised we want to learn more about how the data interacts in other ways between features. The easiest way to do this is creating a correlation plot with the plot_correlation function from the DataExplorer package. We have to tweak the parameters to make the plot barely visible in a full screen view, but doing so gives us some insightful information:

For our response variable of pH we do not have any strong correlations, but we do discover that the filler level, usage cont, mnf flow, and bowl setpoint are more correlated with it than the rest. Indicating that these are varaibles that we should definitely keep in our models.

We also find a surprising amount of multicollinearity within our dataset. For example, our mfr feature is almost completely correlated to our filler speed feature. Additionally, we have the balling and density bins created also sharing extremely high correlation between each other. If either of these variable sets are used in our models then only one of them per set should be left in the end since they share quite similar information.

Multicollinearity also appears in features that are measuring different dimensions of something such as carbonation temperature to carbonation pressure. This hints at building interaction variables for our models that have interactions between each related feature.

DataExplorer::plot_correlation(train,
  ggtheme = theme_minimal(),
  theme_config = list(legend.position = "none", axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_text(angle = 15, size = 5), axis.text.y = element_text(size = 5)))

We also take a different approach to visualizing correlation with the correlationfunnel package. This package provides a function which lays out correlation for categorical variables and is much easier to read when only trying to analyze correlation to a single variable (in our case this variable is ph split into a two level factor). It becomes much more clear that a negative mnf flow tends to correlate with a higher pH, there are specific bowl (120) and pressure (46) set points that encourage a higher pH beverage, and the first three hyd_pressures correlate better to a higher pH than the fourth. These all become variables we should focus on in our model building process.

require(correlationfunnel)

train %>%
  binarize(n_bins = 2, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>% 
  correlate(`ph__8.54_Inf`) %>% 
  correlationfunnel::plot_correlation_funnel(alpha = 0.7, interactive = TRUE)

Review

Through our data preparation and exploration process here is what we have done:

Loaded the training and test data then cleaned the column names.
Removed missing values with multiple imputation.
Removed a singular pH outlier that was much higher above the variable’s distribution.
Binned 12 of our numerical variables that had multiple innate distributions to become categorical variables.
Box-cox transformed 4 of our numerical variables to reduce changes in variability over the distribution.
Exported our transformed training and test data for sharing.
Visualized correlation between variables in two different ways.

Model Building

In order to take advantage of our processed data, our team has chosen to utilize Python for its ease of modeling multiple models at once. Thus we load in the Python packages we will use to build our models below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline

Loading Data

We have transferred the data between teams and thus need to reload the data into a Pandas dataframe for variable selection, model building, and model predicting.

testingDataPath = r"StudentDataProcessed.csv"
testingData = pd.read_csv(testingDataPath)

Feature Selection

In this section we are doing variable selection because we want to reduce the number of predictors not relevant to the response variable pH. This will improve model prediction performance for predicting pH and more importantly, make it a little easier to interpret the model. As our end goal is to be able to provide an interpretable approximation on how our processes affect pH for regulators. Additionally, this will reduce the cost of computation when applied to a larger sample set which is also important for a business like ABC beverages.

First, we convert categorical variables to numeric using one-hot encoding.

testingDataNumeric = pd.get_dummies(testingData)

Next, we split our dataset into the target variable category of pH and the various explanatory variables that are part of the manufacturing process.

targetVariable = testingDataNumeric['ph']
features = testingDataNumeric.drop('ph', axis=1)

We utilize this feature split to generate a correlation matrix in python that we will want to individually breakdown.

correlationMatrix = testingDataNumeric.corr()
phCorrelations = correlationMatrix['ph'].sort_values(ascending=False)

We select the variables that are strong correlated to pH by setting a threshold of 0.5 correlation for strongly correlated on 0.3 for moderately correlated. As these are commonly accepted correlation thresholds in statistics. We then combine the the features that are mildly and strongly correlated for use as our selected features.

correlationThresholdStrong = 0.5 # means theres a strong correlation
correlationThresholdModerate = 0.3 # means there is a moderate correlation

selectedFeaturesStrong = phCorrelations[(phCorrelations >= correlationThresholdStrong) | (phCorrelations <= -correlationThresholdStrong)].index
selectedFeaturesModerate = phCorrelations[(phCorrelations >= correlationThresholdModerate) & (phCorrelations < correlationThresholdStrong) | 
                                          (phCorrelations <= -correlationThresholdModerate) & (phCorrelations > -correlationThresholdStrong)].index
                                          
selectedFeaturesWeak = phCorrelations[abs(phCorrelations) < correlationThresholdModerate].index

selectedFeatures = selectedFeaturesStrong.union(selectedFeaturesModerate).drop('ph')

Printing our features below, we see yet again that we have no strong predictors and only five moderate ones: The set points, the filler level, and mnf_flow.

print("Features with strong and moderate correlation to pH:")

## Features with strong and moderate correlation to pH:

for feature in selectedFeatures:
    correlation_value = phCorrelations[feature]
    correlation_category = "Strong" if abs(correlation_value) >= correlationThresholdStrong else "Moderate"
    print(f"{feature} ({correlation_category} Correlation): {correlation_value}")

## bowl_setpoint_bin (Moderate Correlation): 0.34829243501643675
## filler_level (Moderate Correlation): 0.32430482945545946
## mnf_flow_bin_negative (Moderate Correlation): 0.4685449173252741
## mnf_flow_bin_positive (Moderate Correlation): -0.43157431007045766
## pressure_setpoint_bin (Moderate Correlation): -0.3070616792918806
## usage_cont (Moderate Correlation): -0.3181072262308761

print("\nFeatures with weak or no correlation to pH:")

## 
## Features with weak or no correlation to pH:

for feature in selectedFeaturesWeak:
    if feature != 'ph':
        print(f"{feature}: {phCorrelations[feature]}")

## hyd_pressure2_bin_zero: 0.26620694288279967
## hyd_pressure1_bin_zero: 0.22494341678000382
## hyd_pressure3_bin_zero: 0.22377493976672705
## pressure_vacuum: 0.2205096730713491
## carb_flow_bin_high: 0.21088669708682595
## oxygen_filler_trans: 0.20153676544522764
## alch_rel_bin_high: 0.19227532498453223
## brand_code_D: 0.18923596866970802
## carb_rel: 0.1613853106502643
## balling_bin_high: 0.1048831949552594
## density_bin_high: 0.10431127780714758
## balling_lvl_bin_high: 0.10417738787211916
## brand_code_B: 0.10292672422256868
## carb_volume: 0.06501150608141244
## carb_pressure: 0.05994004541575252
## pc_volume: 0.038834777916828656
## carb_temp: 0.022564168837281105
## air_pressurer_bin_low: 0.011651116286033499
## carb_flow_bin_low: 0.007180262610724045
## balling_lvl_bin_zero: 0.005263160077082794
## filler_speed_trans: -0.006986984315991737
## mfr_trans: -0.01093539758478368
## air_pressurer_bin_high: -0.011651116286033504
## psc_fill: -0.03427413239093513
## carb_pressure1: -0.07331059235736752
## psc_co2: -0.08190252156363692
## psc_trans: -0.08723983278037595
## fill_ounces: -0.096366747216344
## alch_rel_bin_mid: -0.10088920799204425
## brand_code_A: -0.10253060430975168
## mnf_flow_bin_zero: -0.10255776356389108
## density_bin_low: -0.1043112778071476
## balling_lvl_bin_low: -0.10456875885313943
## balling_bin_low: -0.10488319495525913
## alch_rel_bin_low: -0.10527376868612105
## hyd_pressure4: -0.13862235628512254
## temperature: -0.1496250586303669
## fill_pressure: -0.21028379530359573
## hyd_pressure3_bin_above_zero: -0.22377493976672694
## hyd_pressure1_bin_above_zero: -0.22494341678000387
## carb_flow_bin_mid: -0.23041685408114532
## hyd_pressure2_bin_above_zero: -0.2662069428827997
## brand_code_C: -0.29337174939846905

We showcase this correlation in an easy to understand manner by plotting it as a bar graph below:

#graph to show a visual of the correlation of variables correlating to ph
plt.figure(figsize=(15, 8))
sns.barplot(x=phCorrelations.index, y=phCorrelations.values)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Correlation with pH')
plt.title('Correlation of Features with pH')
plt.savefig('corrfeature.png')

corrplot

Model Fitting

With our features selected we begin fitting multiple different models that come with their own set of assumptions about the data. We do know that our response variable of pH is normally distributed, but we don’t exactly have obviously assumable relationships between the explanatory variables and the response. Thus, multiple types of models will be used as a way to see what relationship best predicts pH and will be able to provide an explainable view of our pH to regulators.

We begin this process by filtering the dataset to only have our selected explanatory variables and splitting the dataset to a 80-20 train-test split for further evaluation.

selectedFeaturesData = testingDataNumeric[selectedFeatures]

#splitting the dataset into training and testing sets
XTrain, XTest, YTrain, YTest = train_test_split(selectedFeaturesData, targetVariable, test_size=0.2, random_state=42)

For our specific model types we chose: linear, random forest, decision tree, partial least squared, and polynomial regression:

Linear serves as a good baseline for us as it’s the simplest to interpret and calculate for a regression model.
Polynomial is also fairly easy to interpret and allows for fitting more complex relationships. Note that we keep a degree of 2 to not make interpretability impossible.
Decision trees for non linear relationships and very clear interpretability.
Random forest can be useful to capture relationships come from complex interaction in the data but at a sacrifice of more interpretability than variable importance.
Partial least squares to attempt to account for multicollinear relationships that build upon pH.

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=100, random_state=42),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=42),
    'PLSRegression': PLSRegression(n_components=2),
    'PolynomialRegression': make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
}

We fit our models and evaluate them against the test set. However, we get disappointing results on all models with our lowest RMSE being 0.147 which means our pH predictions will be off by 0.147 pH on average. Still, we need a model for regulatory reasons and can attempt to later improve on it down the line.

#training and evaluation data
modelStats = {}
for modelName, model in models.items():
    model.fit(XTrain, YTrain)
    predictions = model.predict(XTest)
    mse = mean_squared_error(YTest, predictions)
    rmse = np.sqrt(mse)  # Calculating RMSE
    r2 = r2_score(YTest, predictions)
    modelStats[modelName] = {'MSE': mse, 'RMSE': rmse, 'R2': r2}

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('linearregression', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

#model performance
for modelName, stats in modelStats.items():
    print(f"{modelName} - MSE: {stats['MSE']:.3f}, RMSE: {stats['RMSE']:.3f}, R²: {stats['R2']:.3f}")

## LinearRegression - MSE: 0.022, RMSE: 0.149, R²: 0.244
## RandomForestRegressor - MSE: 0.025, RMSE: 0.157, R²: 0.170
## DecisionTreeRegressor - MSE: 0.044, RMSE: 0.210, R²: -0.493
## PLSRegression - MSE: 0.022, RMSE: 0.150, R²: 0.241
## PolynomialRegression - MSE: 0.021, RMSE: 0.147, R²: 0.272

We visualize our model evaluation statistics in order to make it easy to see at a glance the best and worse models. The decision tree is particularly egregious with a negative R^2 indicating that it does worse than random predictions.

modelNames = list(modelStats.keys())
y_pos = np.arange(len(modelNames))
mseValues = [stats['MSE'] for stats in modelStats.values()]
rmseValues = [stats['RMSE'] for stats in modelStats.values()]
r2Values = [stats['R2'] for stats in modelStats.values()]

fig, ax = plt.subplots(1, 3, figsize=(18, 8))

ax[0].barh(y_pos, mseValues, align='center', color='skyblue')
ax[0].set_yticks(y_pos)
ax[0].set_yticklabels(modelNames)
ax[0].invert_yaxis()
ax[0].set_xlabel('MSE')
ax[0].set_title('Mean Squared Error (MSE) of Models')

ax[1].barh(y_pos, rmseValues, align='center', color='orange')
ax[1].set_yticks(y_pos)
ax[1].set_yticklabels(modelNames)
ax[1].invert_yaxis()
ax[1].set_xlabel('RMSE')
ax[1].set_title('Root Mean Squared Error (RMSE) of Models')

ax[2].barh(y_pos, r2Values, align='center', color='lightgreen')
ax[2].set_yticks(y_pos)
ax[2].set_yticklabels(modelNames)
ax[2].invert_yaxis()
ax[2].set_xlabel('R²')
ax[2].set_title('R-squared (R²) of Models')

plt.tight_layout()
plt.show()

Model Selection

Now that we have fit our models we need to select a model that is best suited to our needs. The primary need for us is a model that can be provided to regulators to see what factors within our manufacturing process build up to the end result of the pH of the drink. A model that is best suited for that would be a model that can be interpreted easily and also model the pH from the process parameters it is given more accurately than the others. We will test accuracy through MSE, RMSE, and R^2 while interpretability is going to be the highest for our linear and decision tree models.

#selecting the model
#taking in the consideration for MSE, RMSE, and R2 out of all the models
#and comparing them to find the optimal model to predict ph
#polynomial model out performed the other models in all 3 key metrics together.
bestModelName = None
bestModelStats = {'MSE': np.inf, 'RMSE': np.inf, 'R2': -np.inf}

for modelName, stats in modelStats.items():
    if stats['MSE'] < bestModelStats['MSE'] and stats['RMSE'] < bestModelStats['RMSE'] and stats['R2'] > bestModelStats['R2']:
        bestModelName = modelName
        bestModelStats = stats

We formally dub the polynomial regression model that we have fit as the most accurate of its contemporaries by beating the other models in each evaluation indicator.

# Print the best model's performance
print(f"Best Performing Model: {bestModelName}")

## Best Performing Model: PolynomialRegression

print(f"MSE: {bestModelStats['MSE']:.3f}, RMSE: {bestModelStats['RMSE']:.3f}, R²: {bestModelStats['R2']:.3f}")

## MSE: 0.021, RMSE: 0.147, R²: 0.272

However, considering the minuscule difference of RMSE between the polynomial regression and linear regression of 0.02 combined with the vast increase in simplicity and interpretability of the linear model. It is more appropriate to take our second best model to better satisfy business requirements and make sure it is easy for the boss to understand.

bestModelName = 'LinearRegression'
print(f"Most interpretable but still accurate model: {bestModelName}")

## Most interpretable but still accurate model: LinearRegression

Conclusion

We are now ready to showcase our final deliverables of the sample predictions and explanations of our most important manufacturing features to building pH.

Prediction

To give a sample of pH predictions we must load in our previously processed evaluation data that underwent the same transformations as the training data.

#lastly we use the evaluation data with the best performing model for the prediction of PH
evaluationDataPath = r"StudentEvaluationProcessed.csv"
evaluationData = pd.read_csv(evaluationDataPath)
evaluationDataNumeric = pd.get_dummies(evaluationData)
evaluationDataFeatures = evaluationDataNumeric[selectedFeatures]

We then programatically have the best model predict pH in a new column. This way we can revisit our work if we make model improvements down the line.

bestModel = models[bestModelName]
predictedPH = bestModel.predict(evaluationDataFeatures)

evaluationData['ph'] = predictedPH
print(evaluationData[['ph']])

##            ph
## 0    8.632885
## 1    8.645337
## 2    8.611071
## 3    8.642286
## 4    8.601599
## ..        ...
## 262  8.467599
## 263  8.473633
## 264  8.461278
## 265  8.464523
## 266  8.460319
## 
## [267 rows x 1 columns]

We make a line graph to showcase that most of our predictions do fall into the normal values of pH, although there are two particularly deviated values. Yet, that’s also something that happens in production sometimes.

#graph for ph
plt.figure(figsize=(10, 6))
plt.plot(evaluationData['ph'].head(50), marker='o', linestyle='-', color='blue')
plt.title('Predicted pH Values')
plt.xlabel('Sample Index')
plt.ylabel('Predicted pH')
plt.grid(True)
plt.show()

Finally we wrap up our predicting by saving the dataframe where our predictions are to a .xlsx file that can easily be opened in Excel.

#exporting to excel
outputCsvPath = r"PredictedPH.xlsx"
evaluationData.to_excel(outputCsvPath, index=False)
print(f"export done {outputCsvPath}")

Model Explanation

Finally, we want to explain pH in terms of our manufacturing processes through our model’s coefficients. We can do this by extracting the coefficients and coefficient names then zipping them together to start.

from pprint import PrettyPrinter

pp = PrettyPrinter()

pp.pprint(dict(zip(bestModel.feature_names_in_, bestModel.coef_)))

## {'bowl_setpoint_bin': 0.0022250393068531807,
##  'filler_level': -0.001387848716802194,
##  'mnf_flow_bin_negative': 0.12889777152498594,
##  'mnf_flow_bin_positive': 0.0169674080964651,
##  'pressure_setpoint_bin': -0.007361722797602103,
##  'usage_cont': -0.00533408580105737}

Our linear model that we have selected as the best model for explaining pH follows the following pattern:

We assume a baseline pH of 0 for a beverage that has zero mnf flow, the pressure and bowl set point at their minimum, and filler level along with usage cont at 0. When we set the bowl’s set point initially, every change of the set point from the baseline will increase the resulting drinks pH by 0.0022. When we set the pressure’s set point initially, every change of the set point from the baseline will decrease the resulting drinks pH by 0.0074. If the mnf flow is positive then the pH will increase by 0.0170, but if the mnf flow is negative then pH will instead increase by 0.1289. For each increase in filler level our drinks pH decreases by 0.0014. Finally, every time the usage cont increases our pH decreases by 0.0053.

It’s this explanation that we should provide regulators as proof that we do understand our manufacturing process and how different features play into the results.