library(DataExplorer)
library(dplyr)
library(psych)
library(ggplot2)
library(tidyverse)
library(janitor)
library(interactions)
library(kableExtra)
library(pracma)
library(randomForest)
library(caret)
library(partykit)
library(corrplot)

Introduction

The goal of this project is to understand the manufacturing process and the factors that predict the beverage's pH, and to build a predictive model of PH.

Data Exploration

Load Data

library(dplyr)
data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/StudentData.csv", na.strings = c("", "NA"))

s.eval <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/StudentEvaluation.csv", na.strings = c("", "NA"))
head(data)
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

There are 2,571 observations of 33 variables. A preliminary summary shows missing values in most columns.
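A quick dimension check backs up these counts; a minimal sketch:

# row and column counts quoted above
dim(data)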

PH ranges from 7.88 to 9.36 with a mean of about 8.55.

summary(data)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air.Pressurer      Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

Renaming Variables

Column names were cleaned and standardized to make them more consistent and easier to work with in R:

- Spaces and special characters were removed.

- Carb.Pressure and Carb.Pressure1 were renamed to CarbPressure1 and CarbPressure2 to follow the naming of HydPressure1 - HydPressure4.

names(data) <- c("BrandCode", "CarbVolume","FillOunces", "PCVolume","CarbPressure1","CarbTemp", "PSC", "PSCFill", "PSCCO2", "MnfFlow", "CarbPressure2","FillPressure", "HydPressure1", "HydPressure2",     "HydPressure3",     "HydPressure4",    
"FillerLevel",      "FillerSpeed" ,     "Temperature",       "Usagecont",       
"CarbFlow" ,        "Density"   ,        "MFR"       ,        "Balling" ,         
"PressureVacuum" ,  "PH"    ,            "OxygenFiller" ,    "BowlSetpoint"   , 
"PressureSetpoint", "AirPressurer",    "AlchRel" ,         "CarbRel"  ,       
"BallingLvl" )

names(s.eval) <- c("BrandCode", "CarbVolume","FillOunces", "PCVolume","CarbPressure1","CarbTemp", "PSC", "PSCFill", "PSCCO2", "MnfFlow", "CarbPressure2","FillPressure", "HydPressure1", "HydPressure2",     "HydPressure3",     "HydPressure4",    
"FillerLevel",      "FillerSpeed" ,     "Temperature",       "Usagecont",       
"CarbFlow" ,        "Density"   ,        "MFR"       ,        "Balling" ,         
"PressureVacuum" ,  "PH"    ,            "OxygenFiller" ,    "BowlSetpoint"   , 
"PressureSetpoint", "AirPressurer",    "AlchRel" ,         "CarbRel"  ,       
"BallingLvl" )

Missing Data

The plot below shows the percentage of data that is discrete versus continuous. BrandCode is the only discrete variable; the rest are continuous. Approximately 79% of the rows are complete, which means about 21% of the observations are missing at least one value. We should be cautious when dealing with missing values: it would be ideal to impute them rather than omit the affected observations.

library(DataExplorer)
plot_intro(data)
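The complete-rows figure can be confirmed directly; a minimal sketch:

# share of rows with no missing values (roughly 0.79 per the plot above)
mean(complete.cases(data))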

844 cells contain missing data.

na_summary<- data.frame(variable= names(data), na_count= colSums(is.na(data)),
                        percent_na= colMeans(is.na(data)) * 100) %>%
  arrange(desc(percent_na))
na_summary
##                          variable na_count percent_na
## MFR                           MFR      212 8.24581875
## BrandCode               BrandCode      120 4.66744457
## FillerSpeed           FillerSpeed       57 2.21703617
## PCVolume                 PCVolume       39 1.51691949
## PSCCO2                     PSCCO2       39 1.51691949
## FillOunces             FillOunces       38 1.47802412
## PSC                           PSC       33 1.28354726
## CarbPressure2       CarbPressure2       32 1.24465189
## HydPressure4         HydPressure4       30 1.16686114
## CarbPressure1       CarbPressure1       27 1.05017503
## CarbTemp                 CarbTemp       26 1.01127966
## PSCFill                   PSCFill       23 0.89459354
## FillPressure         FillPressure       22 0.85569817
## FillerLevel           FillerLevel       20 0.77790743
## HydPressure2         HydPressure2       15 0.58343057
## HydPressure3         HydPressure3       15 0.58343057
## Temperature           Temperature       14 0.54453520
## OxygenFiller         OxygenFiller       12 0.46674446
## PressureSetpoint PressureSetpoint       12 0.46674446
## HydPressure1         HydPressure1       11 0.42784909
## CarbVolume             CarbVolume       10 0.38895371
## CarbRel                   CarbRel       10 0.38895371
## AlchRel                   AlchRel        9 0.35005834
## Usagecont               Usagecont        5 0.19447686
## PH                             PH        4 0.15558149
## MnfFlow                   MnfFlow        2 0.07779074
## CarbFlow                 CarbFlow        2 0.07779074
## BowlSetpoint         BowlSetpoint        2 0.07779074
## Density                   Density        1 0.03889537
## Balling                   Balling        1 0.03889537
## BallingLvl             BallingLvl        1 0.03889537
## PressureVacuum     PressureVacuum        0 0.00000000
## AirPressurer         AirPressurer        0 0.00000000
na_summary %>%
  summarize(total_na= sum(na_count))
##   total_na
## 1      844

Every column has missing values except PressureVacuum and AirPressurer.

sapply(data, function(x) sum(is.na(x)))
##        BrandCode       CarbVolume       FillOunces         PCVolume 
##              120               10               38               39 
##    CarbPressure1         CarbTemp              PSC          PSCFill 
##               27               26               33               23 
##           PSCCO2          MnfFlow    CarbPressure2     FillPressure 
##               39                2               32               22 
##     HydPressure1     HydPressure2     HydPressure3     HydPressure4 
##               11               15               15               30 
##      FillerLevel      FillerSpeed      Temperature        Usagecont 
##               20               57               14                5 
##         CarbFlow          Density              MFR          Balling 
##                2                1              212                1 
##   PressureVacuum               PH     OxygenFiller     BowlSetpoint 
##                0                4               12                2 
## PressureSetpoint     AirPressurer          AlchRel          CarbRel 
##               12                0                9               10 
##       BallingLvl 
##                1

Below is the percentage of missing data within each variable. The MFR column has the highest percentage of missing data, but it accounts for only about 8.25%. The amount of missing data is not large enough to justify dropping observations, so we will impute the missing values later.

library(DataExplorer)
plot_missing(data,
             missing_only = T)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the DataExplorer package.
##   Please report the issue at
##   <https://github.com/boxuancui/DataExplorer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

data$BrandCode <- as.factor(data$BrandCode)

Distribution

Variables with a good range and a clear distribution are more likely to show a relationship with the dependent variable. The following variables appear normally distributed:

- Carb Pressure 1

- Carb Pressure 2

- Carb Temp

- Carb Volume

- Fill Ounces

- PC Volume

Variables that appear skewed will need to be transformed. The following variables appear right skewed:

  • Hyd Pressure 1 (there is a significant number of values at 0)

  • PSC

  • PSC Fill

  • PSC CO2

  • Pressure Vacuum

  • Temperature

  • Oxygen Filler

The following variables appear left skewed:

  • Hyd Pressure2 (There is a significant amount of values at 0)

  • Hyd Pressure3 (There is a significant amount of values at 0)

  • Hyd Pressure4

  • Mnf Flow (Nearly half of the data contains -100 and the other half of the data contains continuous values from 0 to 200.)

  • MFR

The following variables appear bimodal:

  • Balling

  • Balling Lvl

  • Carb Rel

  • Density

Variables that are highly multimodal may indicate that the process is operated at different settings or conditions. Later we may want to transform these discrete variables into factors so that the model treats them as distinct operational settings. The following variables appear multimodal:

  • Bowl Set Point
  • Pressure Set Point

plot_histogram(data)
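As a numeric cross-check of the visual assessment above, psych::describe() (psych is already loaded) reports a skewness estimate for each variable; a minimal sketch:

# skewness and kurtosis for the numeric predictors; large |skew| flags transformation candidates
describe(data %>% select(where(is.numeric)))[, c("mean", "sd", "skew", "kurtosis")]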

Below is a boxplot of each variable by BrandCode. It reveals:

  • Outliers are present in the dataset.
  • All features are consistent across brands except: CarbPressure1, CarbVolume, FillPressure, Balling, Density, HydPressure4, and Temperature.
  • Brands A and D have higher sugar content (Balling) than the other two brands, which results in a higher density for brands A and D.
  • Brand B requires higher temperatures, while Brand requires lower temperatures.
  • Brand D requires lower HydPressure4 than the other brands.
  • Brand D has higher alcohol content, followed by Brand A. Brands B and C have lower and similar alcohol content.

plot_boxplot(data, by = "BrandCode")
## Warning: Removed 302 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 372 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 50 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Almost half of the observations are Brand B, while Brands A and C have a similar number of observations.

library(ggplot2)
ggplot(data, 
       aes(x = BrandCode, fill = BrandCode)) +
  geom_bar() +
  theme_minimal() + 
  theme(legend.position = "none") +
  labs( title = "Brand Frequency",
        x = "Brand",
        y = "Count")

Brand C has a lower PH than the other brands, while Brand D has the highest PH level.

ggplot(data, 
       aes(x = BrandCode, y = PH, fill = BrandCode)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(legend.position = "none") +
  labs( title = "Boxplot of PH by Brand",
        x = "Brand",
        y = "PH")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Transforming Variables to factors

Variables that are highly multimodal may indicate that the process is operated at different settings or conditions. Here we transform these discrete variables into factors so that the models treat them as distinct operational settings.

train_data <- data %>%
  mutate(
    PressureSetpoint = as.factor(PressureSetpoint),
    BowlSetpoint = as.factor(BowlSetpoint),
    BrandCode = as.factor(BrandCode)
  )

seval <- s.eval %>%
  mutate(
    PressureSetpoint = as.factor(PressureSetpoint),
    BowlSetpoint = as.factor(BowlSetpoint),
    BrandCode = as.factor(BrandCode)
  )

Preprocessing

Summary of preprocessing step:

  • PH, the target variable, has missing values. We drop any observations that are missing PH, since we do not want to impute the target we are trying to predict from the other features.

  • HydPressure1 was dropped as it has near-zero variance.

  • Impute missing values using KNN imputation.

  • Perform a Box-Cox transformation to reduce skewness.

  • Center and scale the data, as some models such as KNN rely on distance metrics.

library(tidyr)
library(caret)
train_clean_data <- data %>%
  drop_na(PH)

train_filtered <- train_clean_data[, -nearZeroVar(train_clean_data)]
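To confirm which predictors the near-zero-variance filter removes (the list above attributes this to HydPressure1), caret::nearZeroVar() can return the flagged column names directly; a minimal sketch:

# names of the predictors flagged as near-zero variance under the default cutoffs
nearZeroVar(train_clean_data, names = TRUE)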

Imputing Missing Data

KNN imputation was chosen over other imputation methods because it uses similar observations to estimate missing values. It works well for datasets with low missingness and correlated predictors.

set.seed(1)
trainingRows <- createDataPartition(train_filtered$PH,
                                    p = .70,
                                    list= FALSE)
train1 <- train_filtered[trainingRows, ]
  
test1  <- train_filtered[-trainingRows, ]
library(forcats)

train1 <- train1%>%
  mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")))

test1 <- test1%>%
  mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")))
trans <- preProcess(train1 %>%
                      select(-PH), 
                    method = c("center", "scale", "knnImpute", "BoxCox"))

transformed_data <- predict(trans, train1%>%
                     select(-PH))
train_final <- cbind(transformed_data, PH = train1$PH)

test_transformed <- predict(trans, test1%>%
                     select(-PH))

test_final <- cbind(test_transformed, PH = test1$PH)
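A quick sanity check (a minimal sketch) that the KNN imputation left no missing values in the processed training and test sets:

# total remaining NA counts after imputation (factor NAs were recoded to "Unknown")
sum(is.na(train_final))
sum(is.na(test_final))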

Correlation

The following variables are significantly positively correlated with PH: BowlSetpoint, FillerLevel, and CarbFlow.

The following variables are significantly negatively correlated with PH: MnfFlow, Usagecont, and FillPressure.

Linear Model- Multicollinearity

Linear regression does not require the predictor variables to be normally distributed, which works here considering many of the variables' distributions are not normal. In this section we'll restrict which variables are fed into the model. Multicollinearity can affect the estimated coefficients and the model's interpretability, and we can't afford to lose either when reporting to our stakeholders.

To assess this, we’ll separate the response variable from the predictors.

predictors <- data[, names(data) != "PH"]
response   <- data$PH
ph_corr <- corr.test(x = predictors[,-1], y = response, "pairwise", "pearson")
correlations <- ph_corr$r[, 1]
p_values <- ph_corr$p[, 1]
ph_table <- data.frame(r= correlations, p= p_values)
ph_table %>% 
  arrange(desc(r))
##                            r             p
## BowlSetpoint      0.34911980  2.063873e-74
## FillerLevel       0.32327031  3.917540e-63
## PressureVacuum    0.22053722  1.201970e-29
## OxygenFiller      0.16798960  1.234045e-17
## CarbRel           0.16414447  6.477984e-17
## CarbFlow          0.15737137  1.089641e-15
## AlchRel           0.14893852  3.609012e-14
## BallingLvl        0.10047517  3.395271e-07
## Density           0.07843973  6.938385e-05
## Balling           0.06527259  9.363280e-04
## CarbVolume        0.06349111  1.317168e-03
## CarbPressure1     0.05958033  2.665036e-03
## PCVolume          0.04580178  2.128180e-02
## CarbTemp          0.02887456  1.456400e-01
## MFR              -0.00871805  6.721382e-01
## AirPressurer     -0.01365873  4.891116e-01
## FillerSpeed      -0.02456179  2.183784e-01
## PSCFill          -0.03625652  6.748775e-02
## HydPressure1     -0.07398881  1.811517e-04
## PSCCO2           -0.07558459  1.424376e-04
## CarbPressure2    -0.07932723  6.378245e-05
## PSC              -0.09003060  5.651603e-06
## FillOunces       -0.09505837  1.677065e-06
## HydPressure4     -0.14153475  7.818363e-13
## Temperature      -0.15824135  8.592733e-16
## HydPressure2     -0.20067834  1.352668e-24
## FillPressure     -0.21343571  1.199531e-27
## HydPressure3     -0.24030003  7.593496e-35
## PressureSetpoint -0.31101906  2.010989e-58
## Usagecont        -0.31841555  1.842646e-61
## MnfFlow          -0.44697510 2.557184e-126

The correlation table shows the strength (r) and direction of each variable's relationship with our response variable, PH. The pairwise calculations ignore NA values, which is useful here.

Next, we'll compute correlations among the predictors (without the response variable) to identify pairwise redundancies and multicollinearity. We'll use findCorrelation() from the caret package with a cutoff of 0.7 (Lin, 99) as the threshold to flag clusters of correlated predictors.

pred_numerical <- predictors[, sapply(predictors, is.numeric)]
ppcor <- cor(pred_numerical, use = "pairwise.complete.obs")
findCorrelation(ppcor, cutoff = 0.7)
##  [1] 23 14 31 21 29 13 30 16 17  5

The function returns column indices, so we'll map them to the column names.

colnames(pred_numerical)[findCorrelation(ppcor, cutoff = 0.7)]
##  [1] "Balling"      "HydPressure3" "BallingLvl"   "Density"      "AlchRel"     
##  [6] "HydPressure2" "CarbRel"      "FillerLevel"  "FillerSpeed"  "CarbTemp"

The named variables are flagged as members of highly correlated clusters above our threshold. Next, we'll examine the correlation plot.

corrplot(ppcor, title ="Predictor v Predictor Correlation", outline = TRUE, type = "full")

The correlation plot shows that variables related to liquid density, including CarbVolume, Density, Balling, BallingLvl, CarbRel, and AlchRel, have pairwise correlations above 0.7. Though they are different measurements, they all reflect density and would be redundant in a linear regression model. To preserve the model's interpretability, we'll compute the variance inflation factor with vif() from the car package to double-check.

Further, variables that share measurements of flow are also above the 0.7 threshold: HydPressure2, HydPressure3, FillerLevel, and FillerSpeed.
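If we wanted to act on this directly, the flagged columns could be dropped before fitting a linear model; a minimal sketch (high_corr and pred_reduced are hypothetical objects not used elsewhere in this report):

# drop the predictors flagged by findCorrelation at the 0.7 cutoff
high_corr <- colnames(pred_numerical)[findCorrelation(ppcor, cutoff = 0.7)]
pred_reduced <- pred_numerical %>% select(-all_of(high_corr))
ncol(pred_reduced)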

Correlation Plot Part 2

Previously, we showed a correlation plot of the data before imputation. Below is a correlation plot of the data after imputation, which continues to reveal that multicollinearity is present.

The following variables are significantly positively correlated with PH: BowlSetpoint, FillerLevel, and CarbFlow.

The following variables are significantly negatively correlated with PH: MnfFlow, Usagecont, and FillPressure.

library(corrplot)
corrplot(cor(train_final %>%
               select(where(is.numeric)),
             use = "pairwise.complete.obs"),
         type = "lower",
         order = "alphabet",
         tl.cex = 0.6,
         tl.col = "black")

Modeling

After performing EDA and preprocessing, the following models were tested to predict the pH level: linear regression, ridge, elastic net, random forest, SVM, KNN, GBM, and Cubist.
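All of the models below are tuned with 5-fold cross-validation through caret. A shared control object could replace the identical trainControl() calls repeated in each chunk; a minimal sketch (ctrl is not used by the chunks that follow, which keep the inline calls):

# shared 5-fold cross-validation settings for caret::train
ctrl <- trainControl(method = "cv", number = 5)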

Simple Linear Regression

This model is simple, interpretable, and provides a good baseline for predicting pH. It can help us identify which factors have the strongest direct influence on the pH level of the beverages.

Model 1

set.seed(1)

lm_model <- train(PH ~. , 
                  data = train_final,
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 5)
                  )

summary(lm_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46079 -0.07923  0.01022  0.08736  0.46151 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.487678   0.018169 467.146  < 2e-16 ***
## BrandCodeB        0.099086   0.027608   3.589 0.000341 ***
## BrandCodeC       -0.053634   0.027320  -1.963 0.049779 *  
## BrandCodeD        0.077452   0.017117   4.525 6.44e-06 ***
## BrandCodeUnknown  0.006268   0.030188   0.208 0.835545    
## CarbVolume       -0.018670   0.008444  -2.211 0.027152 *  
## FillOunces       -0.003858   0.003334  -1.157 0.247440    
## PCVolume         -0.007690   0.003754  -2.049 0.040653 *  
## CarbPressure1     0.009216   0.011876   0.776 0.437849    
## CarbTemp         -0.005042   0.010756  -0.469 0.639298    
## PSC              -0.006057   0.003331  -1.819 0.069152 .  
## PSCFill          -0.002760   0.003249  -0.850 0.395706    
## PSCCO2           -0.005118   0.003206  -1.596 0.110616    
## MnfFlow          -0.079799   0.006872 -11.611  < 2e-16 ***
## CarbPressure2     0.029535   0.003861   7.650 3.28e-14 ***
## FillPressure      0.010611   0.004495   2.361 0.018350 *  
## HydPressure2     -0.023168   0.009223  -2.512 0.012090 *  
## HydPressure3      0.055841   0.010908   5.119 3.41e-07 ***
## HydPressure4     -0.001821   0.004857  -0.375 0.707775    
## FillerLevel      -0.018972   0.009660  -1.964 0.049701 *  
## FillerSpeed      -0.004413   0.006205  -0.711 0.477086    
## Temperature      -0.018530   0.003714  -4.990 6.64e-07 ***
## Usagecont        -0.025279   0.004035  -6.264 4.70e-10 ***
## CarbFlow          0.014260   0.004504   3.166 0.001570 ** 
## Density          -0.013510   0.009782  -1.381 0.167414    
## MFR               0.003196   0.005036   0.635 0.525667    
## Balling          -0.083285   0.015060  -5.530 3.68e-08 ***
## PressureVacuum   -0.013057   0.004625  -2.823 0.004811 ** 
## OxygenFiller     -0.013864   0.004443  -3.121 0.001835 ** 
## BowlSetpoint      0.053542   0.009924   5.395 7.76e-08 ***
## PressureSetpoint -0.018384   0.004588  -4.007 6.41e-05 ***
## AirPressurer     -0.003677   0.003384  -1.086 0.277488    
## AlchRel           0.008680   0.012071   0.719 0.472202    
## CarbRel           0.002517   0.007070   0.356 0.721874    
## BallingLvl        0.103335   0.018677   5.533 3.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1306 on 1764 degrees of freedom
## Multiple R-squared:  0.4359, Adjusted R-squared:  0.425 
## F-statistic: 40.08 on 34 and 1764 DF,  p-value: < 2.2e-16
plot(lm_model$finalModel)

The model explains about 40% of the variance of the test data. This is not ideal.

lm_pred <- predict(lm_model, test_final)
postResample(pred = lm_pred, obs = test_final$PH)
##      RMSE  Rsquared       MAE 
## 0.1337891 0.4042534 0.1037784
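To make side-by-side comparison easier as more models are added, a small helper (collect_metrics is a hypothetical function, not part of the original analysis) could gather each model's test-set metrics into one row; a minimal sketch:

# collect RMSE, R-squared, and MAE on the held-out test set for a fitted caret model
collect_metrics <- function(model, name, test_df = test_final) {
  pred <- predict(model, test_df)
  data.frame(Model = name, t(postResample(pred = pred, obs = test_df$PH)))
}
collect_metrics(lm_model, "Linear Regression")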

Below are the top predictors of the pH level of the beverage.

plot(varImp(lm_model), top = 10)

Earlier, the correlation plot showed that multicollinearity is present. The variance inflation factor (VIF) can help identify the variables that cause it. We should drop high-VIF predictors.

library(car)
vif(lm_model$finalModel)
##       BrandCodeB       BrandCodeC       BrandCodeD BrandCodeUnknown 
##        20.034567         8.316839         5.709224         4.326955 
##       CarbVolume       FillOunces         PCVolume    CarbPressure1 
##         7.510812         1.162395         1.495236        14.931140 
##         CarbTemp              PSC          PSCFill           PSCCO2 
##        12.198951         1.158531         1.106716         1.073113 
##          MnfFlow    CarbPressure2     FillPressure     HydPressure2 
##         4.978779         1.565286         2.127598         8.957494 
##     HydPressure3     HydPressure4      FillerLevel      FillerSpeed 
##        12.510643         2.474103         9.824225         4.017513 
##      Temperature        Usagecont         CarbFlow          Density 
##         1.450003         1.716648         2.136822        10.085922 
##              MFR          Balling   PressureVacuum     OxygenFiller 
##         3.387248        23.909408         2.255014         2.073959 
##     BowlSetpoint PressureSetpoint     AirPressurer          AlchRel 
##        10.404015         2.216389         1.207473        15.344453 
##          CarbRel       BallingLvl 
##         5.264680        36.762426

BallingLvl has the highest VIF value. We drop it for the next models.

Model 2

train_final_reduced <- train_final %>%
  select(-BallingLvl)

set.seed(2)
lm_model2 <- train(PH ~. , 
                  data = train_final_reduced,
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 5)
                  )

summary(lm_model2)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45149 -0.07895  0.00835  0.08803  0.50398 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.545453   0.014993 569.967  < 2e-16 ***
## BrandCodeB        0.011594   0.022820   0.508 0.611463    
## BrandCodeC       -0.133222   0.023420  -5.688 1.50e-08 ***
## BrandCodeD        0.066257   0.017138   3.866 0.000115 ***
## BrandCodeUnknown -0.083310   0.025692  -3.243 0.001207 ** 
## CarbVolume       -0.020054   0.008510  -2.356 0.018564 *  
## FillOunces       -0.002789   0.003356  -0.831 0.406119    
## PCVolume         -0.008543   0.003782  -2.259 0.024021 *  
## CarbPressure1     0.007433   0.011970   0.621 0.534718    
## CarbTemp         -0.004047   0.010844  -0.373 0.709082    
## PSC              -0.005822   0.003358  -1.734 0.083166 .  
## PSCFill          -0.002347   0.003276  -0.716 0.473860    
## PSCCO2           -0.004954   0.003233  -1.532 0.125591    
## MnfFlow          -0.080071   0.006930 -11.555  < 2e-16 ***
## CarbPressure2     0.029956   0.003892   7.697 2.31e-14 ***
## FillPressure      0.011142   0.004531   2.459 0.014031 *  
## HydPressure2     -0.023363   0.009300  -2.512 0.012085 *  
## HydPressure3      0.054547   0.010997   4.960 7.72e-07 ***
## HydPressure4     -0.001115   0.004896  -0.228 0.819899    
## FillerLevel      -0.019416   0.009741  -1.993 0.046379 *  
## FillerSpeed      -0.005256   0.006255  -0.840 0.400889    
## Temperature      -0.018819   0.003744  -5.026 5.52e-07 ***
## Usagecont        -0.023918   0.004062  -5.889 4.65e-09 ***
## CarbFlow          0.011000   0.004502   2.443 0.014650 *  
## Density          -0.010117   0.009844  -1.028 0.304203    
## MFR               0.002894   0.005077   0.570 0.568804    
## Balling          -0.042580   0.013251  -3.213 0.001336 ** 
## PressureVacuum   -0.007930   0.004569  -1.736 0.082825 .  
## OxygenFiller     -0.013421   0.004479  -2.996 0.002771 ** 
## BowlSetpoint      0.053504   0.010007   5.347 1.01e-07 ***
## PressureSetpoint -0.020059   0.004617  -4.345 1.47e-05 ***
## AirPressurer     -0.003248   0.003412  -0.952 0.341295    
## AlchRel           0.024668   0.011818   2.087 0.037005 *  
## CarbRel           0.013337   0.006851   1.947 0.051734 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1317 on 1765 degrees of freedom
## Multiple R-squared:  0.4261, Adjusted R-squared:  0.4153 
## F-statistic:  39.7 on 33 and 1765 DF,  p-value: < 2.2e-16

The model explains about 43% of the variance of the training data (the reduced model is scored on the training set below). Removing one of the highly correlated variables changed the fit only slightly.

lm_pred2 <- predict(lm_model2, train_final_reduced)
postResample(pred = lm_pred2, obs = train_final_reduced$PH)
##      RMSE  Rsquared       MAE 
## 0.1304420 0.4260610 0.1015644
vif(lm_model2$finalModel)
##       BrandCodeB       BrandCodeC       BrandCodeD BrandCodeUnknown 
##        13.461440         6.010979         5.629432         3.082351 
##       CarbVolume       FillOunces         PCVolume    CarbPressure1 
##         7.504226         1.158495         1.492716        14.920149 
##         CarbTemp              PSC          PSCFill           PSCCO2 
##        12.195538         1.158343         1.106130         1.073021 
##          MnfFlow    CarbPressure2     FillPressure     HydPressure2 
##         4.978525         1.564677         2.126627         8.957363 
##     HydPressure3     HydPressure4      FillerLevel      FillerSpeed 
##        12.504899         2.472395         9.823544         4.015090 
##      Temperature        Usagecont         CarbFlow          Density 
##         1.449717         1.710271         2.100253        10.046288 
##              MFR          Balling   PressureVacuum     OxygenFiller 
##         3.386849        18.203599         2.164494         2.073287 
##     BowlSetpoint PressureSetpoint     AirPressurer          AlchRel 
##        10.404010         2.206743         1.206839        14.465085 
##          CarbRel 
##         4.861898
plot(varImp(lm_model2), top = 10)
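For a comparison on the same footing as Model 1, the reduced model could also be scored on the held-out test set rather than the training data; a minimal sketch:

# evaluate Model 2 on the test set (extra columns in test_final are ignored by predict)
lm_pred2_test <- predict(lm_model2, test_final)
postResample(pred = lm_pred2_test, obs = test_final$PH)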

Model 3

train_final_reduced <- train_final %>%
  select(-BallingLvl, -Balling)

set.seed(2)
lm_model3 <- train(PH ~. , 
                  data = train_final_reduced,
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 5)
                  )

summary(lm_model3)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48055 -0.07973  0.00769  0.08715  0.64869 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.5232252  0.0133369 639.069  < 2e-16 ***
## BrandCodeB        0.0448507  0.0203913   2.200 0.027971 *  
## BrandCodeC       -0.1053669  0.0218137  -4.830 1.48e-06 ***
## BrandCodeD        0.0729293  0.0170571   4.276 2.01e-05 ***
## BrandCodeUnknown -0.0559107  0.0243001  -2.301 0.021516 *  
## CarbVolume       -0.0206999  0.0085305  -2.427 0.015342 *  
## FillOunces       -0.0032468  0.0033622  -0.966 0.334343    
## PCVolume         -0.0074859  0.0037778  -1.982 0.047684 *  
## CarbPressure1     0.0086776  0.0119957   0.723 0.469535    
## CarbTemp         -0.0051076  0.0108680  -0.470 0.638436    
## PSC              -0.0060946  0.0033659  -1.811 0.070355 .  
## PSCFill          -0.0026262  0.0032832  -0.800 0.423882    
## PSCCO2           -0.0051091  0.0032411  -1.576 0.115121    
## MnfFlow          -0.0783970  0.0069284 -11.315  < 2e-16 ***
## CarbPressure2     0.0306438  0.0038965   7.864 6.40e-15 ***
## FillPressure      0.0109652  0.0045428   2.414 0.015891 *  
## HydPressure2     -0.0173041  0.0091304  -1.895 0.058226 .  
## HydPressure3      0.0477707  0.0108213   4.415 1.07e-05 ***
## HydPressure4     -0.0007159  0.0049075  -0.146 0.884036    
## FillerLevel      -0.0178218  0.0097537  -1.827 0.067840 .  
## FillerSpeed      -0.0062722  0.0062636  -1.001 0.316787    
## Temperature      -0.0165300  0.0036856  -4.485 7.76e-06 ***
## Usagecont        -0.0253782  0.0040468  -6.271 4.49e-10 ***
## CarbFlow          0.0124368  0.0044917   2.769 0.005684 ** 
## Density          -0.0281301  0.0081133  -3.467 0.000539 ***
## MFR               0.0012363  0.0050645   0.244 0.807173    
## PressureVacuum   -0.0029194  0.0043063  -0.678 0.497897    
## OxygenFiller     -0.0108268  0.0044174  -2.451 0.014345 *  
## BowlSetpoint      0.0503526  0.0099847   5.043 5.05e-07 ***
## PressureSetpoint -0.0204676  0.0046270  -4.424 1.03e-05 ***
## AirPressurer     -0.0031477  0.0034207  -0.920 0.357600    
## AlchRel           0.0143039  0.0113997   1.255 0.209729    
## CarbRel           0.0136197  0.0068686   1.983 0.047535 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.132 on 1766 degrees of freedom
## Multiple R-squared:  0.4227, Adjusted R-squared:  0.4122 
## F-statistic: 40.41 on 32 and 1766 DF,  p-value: < 2.2e-16

The model explains about 42% of the variance, again measured on the training data.

lm_pred3 <- predict(lm_model3, train_final_reduced)
postResample(pred = lm_pred3, obs = train_final_reduced$PH)
##      RMSE  Rsquared       MAE 
## 0.1308231 0.4227033 0.1018339

Ridge

Ridge regression can handle highly correlated predictors by shrinking their coefficients.

set.seed(11)

ridge_model <- train(PH ~. , 
                  data = train_final,
                  method = "ridge", 
                  trControl = trainControl(method = "cv", number = 5)
                  )

summary(ridge_model)
##             Length Class      Mode     
## call           4   -none-     call     
## actions       41   -none-     list     
## allset        34   -none-     numeric  
## beta.pure   1394   -none-     numeric  
## vn            34   -none-     character
## mu             1   -none-     numeric  
## normx         34   -none-     numeric  
## meanx         34   -none-     numeric  
## lambda         1   -none-     numeric  
## L1norm        41   -none-     numeric  
## penalty       41   -none-     numeric  
## df            41   -none-     numeric  
## Cp            41   -none-     numeric  
## sigma2         1   -none-     numeric  
## xNames        34   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list

The model explains about 40% of the variance of the test data.

ridge_pred <- predict(ridge_model, test_final)
postResample(pred = ridge_pred, obs = test_final$PH)
##      RMSE  Rsquared       MAE 
## 0.1337878 0.4042556 0.1037787
plot(varImp(ridge_model), top = 10)

Elastic net

Elastic Net combines the Lasso and Ridge penalties, shrinking the coefficients of less important and correlated variables. It is suitable for this dataset, which has correlated features and many predictors.

set.seed(111)

enet_model <- train(PH ~. , 
                  data = train_final,
                  method = "enet", 
                  trControl = trainControl(method = "cv", number = 5)
                  )

summary(enet_model)
##             Length Class      Mode     
## call           4   -none-     call     
## actions       41   -none-     list     
## allset        34   -none-     numeric  
## beta.pure   1394   -none-     numeric  
## vn            34   -none-     character
## mu             1   -none-     numeric  
## normx         34   -none-     numeric  
## meanx         34   -none-     numeric  
## lambda         1   -none-     numeric  
## L1norm        41   -none-     numeric  
## penalty       41   -none-     numeric  
## df            41   -none-     numeric  
## Cp            41   -none-     numeric  
## sigma2         1   -none-     numeric  
## xNames        34   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list

The model explains about 40% of the variance of the test data.

enet_pred <- predict(enet_model, test_final)
postResample(pred = enet_pred, obs = test_final$PH)
##      RMSE  Rsquared       MAE 
## 0.1337878 0.4042556 0.1037787

The top predictors here differ from those of the previous models and of the models we will see later. Elastic net gives more weight to predictors that provide more unique information; OxygenFiller is less correlated with the other top predictors.

plot(varImp(enet_model), top = 10)

Random Forest

Random Forest does not require handling of multicollinearity. It is robust to outliers and can capture complex patterns.

set.seed(11111)

rf_model <- train(PH ~. , 
                  data = train_final,
                  method = "rf", 
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 5
                  )

summary(rf_model)
##                 Length Class      Mode     
## call               4   -none-     call     
## type               1   -none-     character
## predicted       1799   -none-     numeric  
## mse              500   -none-     numeric  
## rsq              500   -none-     numeric  
## oob.times       1799   -none-     numeric  
## importance        34   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            11   -none-     list     
## coefs              0   -none-     NULL     
## y               1799   -none-     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            34   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          1   -none-     logical  
## param              0   -none-     list

The model explains about 65% of the variance of the test data. That’s higher than the previous models.

rf_pred <- predict(rf_model, test_final)
postResample(pred = rf_pred, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.10262764 0.65054981 0.07243542
plot(varImp(rf_model), top = 10)

SVM

Model 1

There are several types of SVM. We will use a radial basis kernel, as it handles nonlinear relationships well.

set.seed(11111)

svm_model <- train(PH ~. , 
                  data = train_final,
                  method = "svmRadial", 
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 10
                  )

summary(svm_model)
## Length  Class   Mode 
##      1   ksvm     S4

The model explains about 53% of the variance of the test data.

svm_pred <- predict(svm_model, test_final)
postResample(pred = svm_pred, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.12046363 0.52576294 0.08732194
plot(varImp(svm_model), top = 10)

KNN

KNN is a nonparametric model that predicts PH from the closest training observations. It relies on the idea that beverage observations produced under similar conditions should have similar pH levels.

Model 1

set.seed(111111)

knn_model <- train(PH ~. , 
                  data = train_final,
                  method = "knn", 
                  preProc = c("scale", "center"),
                  trControl = trainControl(method = "cv", number = 5),
                  tuneLength = 10
                  )

plot(knn_model)

The model explains about 46% of the variance of the test data.

knn_pred <- predict(knn_model, test_final)
postResample(pred = knn_pred, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.12807883 0.46381221 0.09304781
plot(varImp(knn_model), top = 10)

Model 2 (tuned KNN)

set.seed(111112)

knn_model2 <- train(PH ~. , 
                  data = train_final,
                  method = "kknn", 
                  preProc = c("scale", "center"),
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = expand.grid(kmax = seq(3,30,2), 
                                         distance = 2,
                                         kernel = c("rectangular", "triangular","epanechnikov")))
plot(knn_model2)

The model explains about 47% of the variance of the test data.

knn_pred2 <- predict(knn_model2, test_final)
postResample(pred = knn_pred2, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.12678369 0.47065301 0.09136588
plot(varImp(knn_model2), top = 10)

GBM

set.seed(1111111)

library(gbm)

gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)

gbm_model <- train(PH ~. , 
                  data = train_final,
                  method = "gbm",
                  tuneGrid = gbmGrid,
                  trControl = trainControl(method = "cv", number = 5),
                  verbose = FALSE)
gbm_model
## Stochastic Gradient Boosting 
## 
## 1799 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1438, 1440, 1440, 1440, 1438 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   0.01       1                   100     0.1526716  0.3110506  0.12089633
##   0.01       1                   150     0.1479276  0.3356400  0.11683472
##   0.01       1                   200     0.1444484  0.3538429  0.11393252
##   0.01       1                   250     0.1420269  0.3642708  0.11206031
##   0.01       1                   300     0.1401542  0.3722660  0.11064018
##   0.01       1                   350     0.1387538  0.3793035  0.10961411
##   0.01       1                   400     0.1377683  0.3834491  0.10889348
##   0.01       1                   450     0.1369799  0.3871227  0.10828709
##   0.01       1                   500     0.1361987  0.3921721  0.10764358
##   0.01       1                   550     0.1355715  0.3965752  0.10712704
##   0.01       1                   600     0.1350381  0.4005789  0.10666134
##   0.01       1                   650     0.1345111  0.4043147  0.10620434
##   0.01       1                   700     0.1339982  0.4082989  0.10575040
##   0.01       1                   750     0.1335351  0.4117917  0.10539175
##   0.01       1                   800     0.1331607  0.4143896  0.10507161
##   0.01       1                   850     0.1327793  0.4174006  0.10472626
##   0.01       1                   900     0.1324805  0.4196815  0.10445975
##   0.01       1                   950     0.1321540  0.4222049  0.10417078
##   0.01       1                  1000     0.1319238  0.4240040  0.10395858
##   0.01       3                   100     0.1420308  0.4195357  0.11243656
##   0.01       3                   150     0.1358908  0.4387302  0.10740419
##   0.01       3                   200     0.1320239  0.4555174  0.10426665
##   0.01       3                   250     0.1293036  0.4694097  0.10200626
##   0.01       3                   300     0.1273462  0.4805246  0.10025574
##   0.01       3                   350     0.1257711  0.4898313  0.09882291
##   0.01       3                   400     0.1245073  0.4980241  0.09766917
##   0.01       3                   450     0.1234258  0.5054874  0.09664414
##   0.01       3                   500     0.1224610  0.5120303  0.09579518
##   0.01       3                   550     0.1217547  0.5165851  0.09508706
##   0.01       3                   600     0.1210141  0.5213643  0.09433899
##   0.01       3                   650     0.1204283  0.5249132  0.09371316
##   0.01       3                   700     0.1197810  0.5289317  0.09308755
##   0.01       3                   750     0.1192274  0.5324658  0.09255388
##   0.01       3                   800     0.1187717  0.5352816  0.09210092
##   0.01       3                   850     0.1182306  0.5389154  0.09161094
##   0.01       3                   900     0.1177706  0.5421165  0.09111080
##   0.01       3                   950     0.1173610  0.5448257  0.09072929
##   0.01       3                  1000     0.1169994  0.5472374  0.09040695
##   0.01       5                   100     0.1376711  0.4662006  0.10868911
##   0.01       5                   150     0.1308896  0.4867918  0.10308993
##   0.01       5                   200     0.1266630  0.5023671  0.09938430
##   0.01       5                   250     0.1236531  0.5173602  0.09672082
##   0.01       5                   300     0.1215495  0.5286545  0.09478162
##   0.01       5                   350     0.1198559  0.5377687  0.09325770
##   0.01       5                   400     0.1184448  0.5464582  0.09196817
##   0.01       5                   450     0.1172838  0.5536412  0.09088995
##   0.01       5                   500     0.1163913  0.5585715  0.09002560
##   0.01       5                   550     0.1155298  0.5636367  0.08924964
##   0.01       5                   600     0.1149452  0.5668969  0.08862629
##   0.01       5                   650     0.1143239  0.5707675  0.08797669
##   0.01       5                   700     0.1138612  0.5731748  0.08744861
##   0.01       5                   750     0.1133267  0.5765383  0.08694201
##   0.01       5                   800     0.1128624  0.5792913  0.08644706
##   0.01       5                   850     0.1124426  0.5818099  0.08605303
##   0.01       5                   900     0.1120457  0.5842895  0.08561387
##   0.01       5                   950     0.1116794  0.5866407  0.08525002
##   0.01       5                  1000     0.1113540  0.5887569  0.08490736
##   0.01       7                   100     0.1346423  0.4983060  0.10597046
##   0.01       7                   150     0.1272336  0.5214649  0.09988506
##   0.01       7                   200     0.1226639  0.5372257  0.09591674
##   0.01       7                   250     0.1195964  0.5496794  0.09311565
##   0.01       7                   300     0.1174443  0.5596276  0.09109248
##   0.01       7                   350     0.1157710  0.5686058  0.08952236
##   0.01       7                   400     0.1144556  0.5756106  0.08829102
##   0.01       7                   450     0.1134112  0.5812129  0.08726622
##   0.01       7                   500     0.1124477  0.5868514  0.08632523
##   0.01       7                   550     0.1116563  0.5912931  0.08555519
##   0.01       7                   600     0.1109461  0.5955338  0.08481859
##   0.01       7                   650     0.1104165  0.5986444  0.08422732
##   0.01       7                   700     0.1098155  0.6023818  0.08369095
##   0.01       7                   750     0.1092773  0.6056837  0.08317200
##   0.01       7                   800     0.1088130  0.6083925  0.08269291
##   0.01       7                   850     0.1084959  0.6103164  0.08232206
##   0.01       7                   900     0.1081207  0.6125239  0.08199643
##   0.01       7                   950     0.1078943  0.6138359  0.08169589
##   0.01       7                  1000     0.1075900  0.6157899  0.08138941
##   0.10       1                   100     0.1319131  0.4225458  0.10402324
##   0.10       1                   150     0.1295430  0.4402645  0.10159004
##   0.10       1                   200     0.1283860  0.4485284  0.10013603
##   0.10       1                   250     0.1271134  0.4579697  0.09858150
##   0.10       1                   300     0.1268761  0.4590602  0.09782979
##   0.10       1                   350     0.1261415  0.4649237  0.09694621
##   0.10       1                   400     0.1256898  0.4686091  0.09656068
##   0.10       1                   450     0.1255754  0.4693183  0.09611852
##   0.10       1                   500     0.1257185  0.4682102  0.09612941
##   0.10       1                   550     0.1256349  0.4691346  0.09631462
##   0.10       1                   600     0.1255685  0.4696445  0.09620439
##   0.10       1                   650     0.1252473  0.4723108  0.09603512
##   0.10       1                   700     0.1256594  0.4691558  0.09639679
##   0.10       1                   750     0.1256859  0.4690126  0.09627809
##   0.10       1                   800     0.1256911  0.4687806  0.09635844
##   0.10       1                   850     0.1257223  0.4689103  0.09619762
##   0.10       1                   900     0.1258346  0.4683226  0.09628161
##   0.10       1                   950     0.1256558  0.4697625  0.09607405
##   0.10       1                  1000     0.1257889  0.4690216  0.09618164
##   0.10       3                   100     0.1180172  0.5357754  0.09100267
##   0.10       3                   150     0.1162398  0.5467989  0.08909552
##   0.10       3                   200     0.1154893  0.5521490  0.08808798
##   0.10       3                   250     0.1143535  0.5601735  0.08690190
##   0.10       3                   300     0.1135615  0.5659388  0.08611826
##   0.10       3                   350     0.1131042  0.5694696  0.08585456
##   0.10       3                   400     0.1125792  0.5733124  0.08516523
##   0.10       3                   450     0.1120382  0.5774842  0.08499649
##   0.10       3                   500     0.1120605  0.5771630  0.08492003
##   0.10       3                   550     0.1114812  0.5816984  0.08432906
##   0.10       3                   600     0.1108101  0.5867877  0.08384498
##   0.10       3                   650     0.1109249  0.5862988  0.08391750
##   0.10       3                   700     0.1110716  0.5853182  0.08414610
##   0.10       3                   750     0.1110870  0.5852790  0.08405958
##   0.10       3                   800     0.1107928  0.5876450  0.08392789
##   0.10       3                   850     0.1106285  0.5888121  0.08378337
##   0.10       3                   900     0.1104179  0.5906156  0.08357161
##   0.10       3                   950     0.1101881  0.5924394  0.08338117
##   0.10       3                  1000     0.1100001  0.5938758  0.08331743
##   0.10       5                   100     0.1135245  0.5688416  0.08654510
##   0.10       5                   150     0.1110916  0.5858883  0.08372426
##   0.10       5                   200     0.1091895  0.5993818  0.08238719
##   0.10       5                   250     0.1085578  0.6034431  0.08169342
##   0.10       5                   300     0.1081121  0.6066927  0.08129819
##   0.10       5                   350     0.1075667  0.6106145  0.08088793
##   0.10       5                   400     0.1071132  0.6137639  0.08046806
##   0.10       5                   450     0.1068919  0.6156710  0.08026733
##   0.10       5                   500     0.1067581  0.6167307  0.08011442
##   0.10       5                   550     0.1064039  0.6195400  0.07986743
##   0.10       5                   600     0.1062126  0.6209327  0.07952394
##   0.10       5                   650     0.1064096  0.6197658  0.07972856
##   0.10       5                   700     0.1062913  0.6207586  0.07975663
##   0.10       5                   750     0.1063730  0.6201867  0.07999251
##   0.10       5                   800     0.1064264  0.6199653  0.08001673
##   0.10       5                   850     0.1064123  0.6203343  0.08006138
##   0.10       5                   900     0.1063937  0.6204422  0.08006031
##   0.10       5                   950     0.1063072  0.6211054  0.08008623
##   0.10       5                  1000     0.1062267  0.6215856  0.07994096
##   0.10       7                   100     0.1102564  0.5920580  0.08324851
##   0.10       7                   150     0.1081958  0.6066781  0.08156698
##   0.10       7                   200     0.1070406  0.6145028  0.08029152
##   0.10       7                   250     0.1068700  0.6155508  0.08015693
##   0.10       7                   300     0.1065737  0.6174849  0.07990953
##   0.10       7                   350     0.1065386  0.6178456  0.07959712
##   0.10       7                   400     0.1060500  0.6211456  0.07916119
##   0.10       7                   450     0.1058310  0.6228049  0.07906718
##   0.10       7                   500     0.1057816  0.6230399  0.07912964
##   0.10       7                   550     0.1058380  0.6227831  0.07917679
##   0.10       7                   600     0.1056376  0.6242841  0.07901467
##   0.10       7                   650     0.1056838  0.6241685  0.07904418
##   0.10       7                   700     0.1057440  0.6238300  0.07908166
##   0.10       7                   750     0.1057747  0.6236412  0.07908521
##   0.10       7                   800     0.1058718  0.6230561  0.07924970
##   0.10       7                   850     0.1057685  0.6239175  0.07921355
##   0.10       7                   900     0.1057312  0.6241185  0.07924197
##   0.10       7                   950     0.1056625  0.6246802  0.07919209
##   0.10       7                  1000     0.1056601  0.6247255  0.07920151
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 600, interaction.depth =
##  7, shrinkage = 0.1 and n.minobsinnode = 10.

The GBM model explains about 58% of the variance in PH on the held-out test data.

gbm_pred <- predict(gbm_model, test_final)
postResample(pred = gbm_pred, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.11319662 0.57835611 0.08324744
plot(varImp(gbm_model), top = 10)

Cubist

Model 1

set.seed(1111111)

cubistGrid <- expand.grid(committees = c(5, 10, 20, 25, 30, 35),
                          neighbors = c(3, 5, 7, 9))

cubist_model1 <- train(PH ~. , 
                  data = train_final,
                  method = "cubist",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = cubistGrid)
cubist_model1
## Cubist 
## 
## 1799 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1438, 1440, 1440, 1440, 1438 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    5          3          0.10651405  0.6280792  0.07553919
##    5          5          0.10427750  0.6388212  0.07454844
##    5          7          0.10312447  0.6444448  0.07435491
##    5          9          0.10315129  0.6436535  0.07485335
##   10          3          0.10567219  0.6320322  0.07418010
##   10          5          0.10335037  0.6437949  0.07321619
##   10          7          0.10222761  0.6495983  0.07314955
##   10          9          0.10236388  0.6484574  0.07372072
##   20          3          0.10356567  0.6438584  0.07329638
##   20          5          0.10117473  0.6566584  0.07216201
##   20          7          0.10002646  0.6631450  0.07213475
##   20          9          0.10001195  0.6632622  0.07260272
##   25          3          0.10305475  0.6466722  0.07301136
##   25          5          0.10064012  0.6598529  0.07188693
##   25          7          0.09951892  0.6663046  0.07185559
##   25          9          0.09948580  0.6666422  0.07230071
##   30          3          0.10248416  0.6503911  0.07266282
##   30          5          0.10007265  0.6636138  0.07158986
##   30          7          0.09890083  0.6704014  0.07148600
##   30          9          0.09887405  0.6707215  0.07190151
##   35          3          0.10263845  0.6495841  0.07271320
##   35          5          0.10019371  0.6630012  0.07168085
##   35          7          0.09898371  0.6700067  0.07145740
##   35          9          0.09893666  0.6704626  0.07188585
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 30 and neighbors = 9.

The Cubist model explains about 66% of the variance in PH on the test data, outperforming all of the models fitted so far.

cubist_pred <- predict(cubist_model1, test_final)
postResample(pred = cubist_pred, obs = test_final$PH)
##      RMSE  Rsquared       MAE 
## 0.1005293 0.6637819 0.0709191
plot(varImp(cubist_model1), top = 10)

varImp(cubist_model1)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                  Overall
## MnfFlow           100.00
## BallingLvl         70.29
## PressureVacuum     65.22
## Balling            63.04
## AlchRel            63.04
## AirPressurer       58.70
## Density            56.52
## BowlSetpoint       52.90
## OxygenFiller       52.17
## CarbRel            47.10
## Temperature        47.10
## HydPressure3       44.93
## Usagecont          41.30
## HydPressure2       41.30
## BrandCodeC         39.86
## CarbFlow           36.96
## CarbPressure2      36.23
## FillerSpeed        26.81
## FillerLevel        24.64
## PressureSetpoint   18.12

Model 2

Since the Cubist model outperforms all of the other models, we attempt to improve it further with feature engineering: we create new nonlinear (squared) terms from the top predictors of the first Cubist model.

train_final <- train_final %>%
  mutate(MnfFlow_Sq = MnfFlow^2,
         PressureVacuum_Sq = PressureVacuum^2,
         AlchRel_Sq = AlchRel^2,
         Density_Sq = Density^2)

test_final <- test_final %>%
  mutate(MnfFlow_Sq = MnfFlow^2,
         PressureVacuum_Sq = PressureVacuum^2,
         AlchRel_Sq = AlchRel^2,
         Density_Sq = Density^2)

set.seed(22)

cubistGrid <- expand.grid(committees = c(5, 10, 20, 25, 30, 35, 40, 50),
                          neighbors = c(3, 5, 7, 9))

cubist_model2 <- train(PH ~. , 
                  data = train_final,
                  method = "cubist",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = cubistGrid)
cubist_model2
## Cubist 
## 
## 1799 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1438, 1440, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    5          3          0.10222737  0.6550678  0.07199884
##    5          5          0.09999068  0.6666936  0.07138812
##    5          7          0.09946298  0.6692960  0.07140892
##    5          9          0.09963699  0.6678709  0.07187855
##   10          3          0.09924323  0.6728691  0.07030428
##   10          5          0.09724085  0.6838643  0.06990586
##   10          7          0.09696206  0.6855904  0.06999404
##   10          9          0.09729341  0.6834314  0.07052296
##   20          3          0.09816791  0.6798086  0.06973650
##   20          5          0.09605854  0.6917082  0.06932720
##   20          7          0.09585064  0.6930323  0.06950136
##   20          9          0.09614020  0.6913812  0.07009527
##   25          3          0.09801813  0.6799763  0.06973876
##   25          5          0.09601050  0.6914750  0.06925161
##   25          7          0.09575630  0.6931236  0.06938101
##   25          9          0.09604980  0.6914764  0.07005069
##   30          3          0.09800589  0.6799511  0.06966948
##   30          5          0.09606406  0.6911393  0.06917148
##   30          7          0.09583391  0.6925819  0.06924104
##   30          9          0.09612170  0.6909541  0.06983818
##   35          3          0.09806063  0.6796317  0.06972117
##   35          5          0.09611731  0.6909274  0.06919588
##   35          7          0.09585513  0.6926111  0.06919878
##   35          9          0.09616431  0.6908873  0.06986622
##   40          3          0.09797819  0.6801156  0.06967103
##   40          5          0.09597889  0.6917402  0.06904851
##   40          7          0.09567743  0.6937086  0.06902613
##   40          9          0.09596047  0.6921242  0.06965484
##   50          3          0.09770026  0.6816777  0.06953448
##   50          5          0.09566340  0.6937337  0.06894769
##   50          7          0.09540904  0.6955355  0.06899666
##   50          9          0.09571220  0.6938826  0.06957323
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 7.

This model explains about 67% of the variance in PH on the test data, only a slight improvement over the first Cubist model.

cubist_pred2 <- predict(cubist_model2, test_final)
postResample(pred = cubist_pred2, obs = test_final$PH)
##       RMSE   Rsquared        MAE 
## 0.09975222 0.67001655 0.06823076

Below are the top predictors for this model.

plot(varImp(cubist_model2))

Model Comparison

Model                  RMSE        R^2         MAE
Cubist                 0.09975222  0.67001655  0.06823076
Random Forest          0.10262764  0.65054981  0.07243542
GBM                    0.11319662  0.57835611  0.08324744
SVM                    0.12046363  0.52576294  0.08732194
KNN                    0.12678369  0.47065301  0.09136588
Simple linear model    0.1304420   0.4260610   0.1015644
Elastic Net            0.1337878   0.4042556   0.1015644
Ridge                  0.1337878   0.4042556   0.1037787

The Cubist model outperforms all of the other models, with the highest R² and the lowest RMSE. However, it is still not an ideal model, as it explains only about 67% of the variance of the test data. The tree-based models all perform better than the remaining models because they are robust to the outliers and multicollinearity present in our data.
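For reference, a comparison table like the one above can be assembled directly from the held-out predictions with postResample(). This is a minimal sketch: the prediction vectors rf_pred, svm_pred, knn_pred, lm_pred, enet_pred, and ridge_pred are placeholder names for the objects created in the earlier model sections, alongside gbm_pred and cubist_pred2 from above.

# Assemble test-set metrics for each model (placeholder prediction names assumed)
results <- rbind(
  Cubist                = postResample(cubist_pred2, test_final$PH),
  `Random Forest`       = postResample(rf_pred,      test_final$PH),
  GBM                   = postResample(gbm_pred,     test_final$PH),
  SVM                   = postResample(svm_pred,     test_final$PH),
  KNN                   = postResample(knn_pred,     test_final$PH),
  `Simple linear model` = postResample(lm_pred,      test_final$PH),
  `Elastic Net`         = postResample(enet_pred,    test_final$PH),
  Ridge                 = postResample(ridge_pred,   test_final$PH)
)
results[order(results[, "RMSE"]), ]   # sort from best (lowest RMSE) to worst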

Prediction

We use the best model (Cubist Model 2) to predict PH for the student evaluation dataset.

eval_data <- s.eval %>%
  select(-PH)   # drop the empty PH column that we are asked to predict

eval_data$BrandCode <- as.factor(eval_data$BrandCode)

# Split numeric and factor columns so the numeric preprocessing can be applied
eval_numeric <- eval_data %>%
  select(where(is.numeric))

eval_factors <- eval_data %>%
  select(where(is.factor))

# Apply the preprocessing object `trans` (fitted earlier) to the evaluation numerics
eval_numeric_transformed <- predict(trans, eval_numeric)

eval_final <- bind_cols(eval_numeric_transformed, eval_factors)

# Recreate the engineered squared terms used in Cubist Model 2
eval_final <- eval_final %>%
  mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")),
         MnfFlow_Sq = MnfFlow^2,
         PressureVacuum_Sq = PressureVacuum^2,
         AlchRel_Sq = AlchRel^2,
         Density_Sq = Density^2
         )

# Predict PH with the final Cubist model and export the results
eval_final$eval_pred_ph <- predict(cubist_model2, eval_final)

write.csv(eval_final, file = "Cubist_PH_Predictions.csv")
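As a quick sanity check before delivering the file, we can confirm that the predicted values fall in a plausible range. A minimal sketch using the eval_final object created above:

# Summarize and plot the distribution of predicted PH values
summary(eval_final$eval_pred_ph)

ggplot(eval_final, aes(x = eval_pred_ph)) +
  geom_histogram(bins = 30) +
  labs(title = "Predicted PH on the evaluation set",
       x = "Predicted PH", y = "Count")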

Conclusion

In this project, we explored and modeled the features affecting pH in the beverage manufacturing process. After performing EDA and preprocessing, the following models were fitted to predict pH: linear regression, ridge, elastic net, random forest, SVM, KNN, GBM, and Cubist. Among these, the Cubist model outperformed all of the others, with the highest R² and the lowest RMSE.

Common key predictors across models include MnfFlow, OxygenFiller, the pressure settings, the Balling levels (sugar content), and Temperature. MnfFlow consistently emerged as the top predictor in most models. By monitoring and controlling these predictors, ABC Beverage can maintain more consistent pH levels in its beverages.

Further Enhancement

Although the Cubist model outperforms all of the other models, it is still not an ideal model, as it explains only about 67% of the variance of the test data. Given more time, we could attempt to improve it further by:

  • further hyperparameter tuning (a sketch follows this list)
  • feature engineering: creating new interaction and nonlinear terms
  • collecting additional features that may influence pH, such as ingredient concentrations
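To illustrate the first point, below is a minimal sketch of a wider Cubist search; the grid values, the 10-fold cross-validation, and the name cubist_model3 are illustrative choices that we have not run, and train_final is assumed to still contain the engineered features from Model 2.

set.seed(22)

# Hypothetical wider grid: more committees and the full range of neighbors
wider_grid <- expand.grid(committees = c(50, 75, 100),
                          neighbors = c(0, 1, 3, 5, 7, 9))

cubist_model3 <- train(PH ~ .,
                       data = train_final,
                       method = "cubist",
                       trControl = trainControl(method = "cv", number = 10),
                       tuneGrid = wider_grid)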