
Project #2 (Team) Assignment

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Loading the Data

We uploaded the provided data files to GitHub so that they are easier to access. We also replaced any blank values with NA so that the missing values can be imputed later.
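The code in this report assumes a handful of packages are loaded up front; a minimal setup chunk (the package list is inferred from the function calls used throughout):

library(tidyverse)    # dplyr, tidyr, purrr, ggplot2
library(knitr)        # kable
library(kableExtra)   # kable_styling, scroll_box
library(mice)         # imputation of missing values
library(caret)        # resampling and model training
library(randomForest) # random forest model
library(corrplot)     # correlation plot
library(openxlsx)     # exporting predictions to Excel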

df_StudentData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentData.csv',
                           na.strings = c("", NA))
df_EvalData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentEvaluation.csv',
                           na.strings = c("", NA))

Data Exploration

Summary

The training data has 33 variables and 2,571 observations. One variable, Brand.Code, is categorical; the rest are integer or numeric. PH is centered around 8.5. Some variables appear skewed, so it would be best to center and scale the data later on.

glimpse(df_StudentData)
## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B~
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.~
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23~
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333~
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2~
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141~
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,~
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12~
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14~
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100~
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124~
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0~
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86~
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118~
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, ~
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4~
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.~
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3~
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90~
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74~
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2~
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4~
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38~
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0~
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12~
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0~
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146~
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52~
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34~
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44~
summary(df_StudentData) %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | NA's |
|---|---|---|---|---|---|---|---|
| Brand.Code | character, length 2571 | | | | | | |
| Carb.Volume | 5.040 | 5.293 | 5.347 | 5.370 | 5.453 | 5.700 | 10 |
| Fill.Ounces | 23.63 | 23.92 | 23.97 | 23.97 | 24.03 | 24.32 | 38 |
| PC.Volume | 0.07933 | 0.23917 | 0.27133 | 0.27712 | 0.31200 | 0.47800 | 39 |
| Carb.Pressure | 57.00 | 65.60 | 68.20 | 68.19 | 70.60 | 79.40 | 27 |
| Carb.Temp | 128.6 | 138.4 | 140.8 | 141.1 | 143.8 | 154.0 | 26 |
| PSC | 0.00200 | 0.04800 | 0.07600 | 0.08457 | 0.11200 | 0.27000 | 33 |
| PSC.Fill | 0.0000 | 0.1000 | 0.1800 | 0.1954 | 0.2600 | 0.6200 | 23 |
| PSC.CO2 | 0.00000 | 0.02000 | 0.04000 | 0.05641 | 0.08000 | 0.24000 | 39 |
| Mnf.Flow | -100.20 | -100.00 | 65.20 | 24.57 | 140.80 | 229.40 | 2 |
| Carb.Pressure1 | 105.6 | 119.0 | 123.2 | 122.6 | 125.4 | 140.2 | 32 |
| Fill.Pressure | 34.60 | 46.00 | 46.40 | 47.92 | 50.00 | 60.40 | 22 |
| Hyd.Pressure1 | -0.80 | 0.00 | 11.40 | 12.44 | 20.20 | 58.00 | 11 |
| Hyd.Pressure2 | 0.00 | 0.00 | 28.60 | 20.96 | 34.60 | 59.40 | 15 |
| Hyd.Pressure3 | -1.20 | 0.00 | 27.60 | 20.46 | 33.40 | 50.00 | 15 |
| Hyd.Pressure4 | 52.00 | 86.00 | 96.00 | 96.29 | 102.00 | 142.00 | 30 |
| Filler.Level | 55.8 | 98.3 | 118.4 | 109.3 | 120.0 | 161.2 | 20 |
| Filler.Speed | 998 | 3888 | 3982 | 3687 | 3998 | 4030 | 57 |
| Temperature | 63.60 | 65.20 | 65.60 | 65.97 | 66.40 | 76.20 | 14 |
| Usage.cont | 12.08 | 18.36 | 21.79 | 20.99 | 23.75 | 25.90 | 5 |
| Carb.Flow | 26 | 1144 | 3028 | 2468 | 3186 | 5104 | 2 |
| Density | 0.240 | 0.900 | 0.980 | 1.174 | 1.620 | 1.920 | 1 |
| MFR | 31.4 | 706.3 | 724.0 | 704.0 | 731.0 | 868.6 | 212 |
| Balling | -0.170 | 1.496 | 1.648 | 2.198 | 3.292 | 4.012 | 1 |
| Pressure.Vacuum | -6.600 | -5.600 | -5.400 | -5.216 | -5.000 | -3.600 | 0 |
| PH | 7.880 | 8.440 | 8.540 | 8.546 | 8.680 | 9.360 | 4 |
| Oxygen.Filler | 0.00240 | 0.02200 | 0.03340 | 0.04684 | 0.06000 | 0.40000 | 12 |
| Bowl.Setpoint | 70.0 | 100.0 | 120.0 | 109.3 | 120.0 | 140.0 | 2 |
| Pressure.Setpoint | 44.00 | 46.00 | 46.00 | 47.62 | 50.00 | 52.00 | 12 |
| Air.Pressurer | 140.8 | 142.2 | 142.6 | 142.8 | 143.0 | 148.2 | 0 |
| Alch.Rel | 5.280 | 6.540 | 6.560 | 6.897 | 7.240 | 8.620 | 9 |
| Carb.Rel | 4.960 | 5.340 | 5.400 | 5.437 | 5.540 | 6.060 | 10 |
| Balling.Lvl | 0.00 | 1.38 | 1.48 | 2.05 | 3.14 | 3.66 | 1 |
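Before plotting, we can quantify the apparent skewness (a quick sketch, assuming the e1071 package is available for its skewness() function):

library(e1071)

# per-variable skewness, sorted by magnitude (NA values ignored for now)
df_StudentData %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))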
df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Data Conversion

Brand.Code is the only categorical variable and takes on the values A, B, C, or D. It is best converted to a factor. B seems to be the most common brand code, accounting for nearly half of the observations.

df_StudentData$Brand.Code <- as.factor(df_StudentData$Brand.Code)
df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)

df_StudentData %>%
  ggplot() + 
  geom_bar(aes(x = Brand.Code)) + 
  ggtitle("Distribution of the Brand Codes")

Missing Data

31 of the 33 variables have missing values. MFR is missing for roughly 8.2% of the observations (212 of 2,571), and Brand.Code has 120 missing values.

df_StudentData %>%
  summarise_all(list(~ sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0) %>%
  arrange(-value) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
variable value
MFR 212
Brand.Code 120
Filler.Speed 57
PC.Volume 39
PSC.CO2 39
Fill.Ounces 38
PSC 33
Carb.Pressure1 32
Hyd.Pressure4 30
Carb.Pressure 27
Carb.Temp 26
PSC.Fill 23
Fill.Pressure 22
Filler.Level 20
Hyd.Pressure2 15
Hyd.Pressure3 15
Temperature 14
Oxygen.Filler 12
Pressure.Setpoint 12
Hyd.Pressure1 11
Carb.Volume 10
Carb.Rel 10
Alch.Rel 9
Usage.cont 5
PH 4
Mnf.Flow 2
Carb.Flow 2
Bowl.Setpoint 2
Density 1
Balling 1
Balling.Lvl 1
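To put these counts in perspective, the same information can be expressed as a proportion of the 2,571 observations (a quick sketch):

# missing counts as a share of all observations
df_StudentData %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_missing") %>%
  filter(prop_missing > 0) %>%
  arrange(desc(prop_missing)) %>%
  head(5)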
df_StudentData %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","red"))

Transforming the Data

Next, we imputed the missing values with mice() from the mice package, using predictive mean matching. We also excluded any near-zero-variance predictors; in this case, only Hyd.Pressure1 was removed.

set.seed(100)

df_StudentData <- mice(df_StudentData, m = 1, method = 'pmm', print = FALSE) %>% complete()

# drop near-zero-variance predictors (only Hyd.Pressure1 is flagged)
df_StudentData <- df_StudentData[, -nearZeroVar(df_StudentData)]

There are no more missing values and the transformations are complete for now.

df_StudentData %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","red"))

Model Building

First, we split the data into training and test sets using an 80/20 split. Then we fit several types of regression models, including linear regressions, non-linear regressions, and tree-based models.

set.seed(100)


# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)

# train 
train_x <- df_StudentData[index, ] %>% select(-PH)
train_y <- df_StudentData[index, 'PH']

# test
test_x <- df_StudentData[-index, ] %>% select(-PH)
test_y <- df_StudentData[-index, 'PH']
set.seed(100)

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl) 

lmPred <- predict(lmModel, test_x)

postResample(lmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1327383 0.4079061 0.1018436
set.seed(100)

plsTune <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsPred <- predict(plsTune, test_x)

postResample(plsPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1328236 0.4072428 0.1021100
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12388891 0.48396139 0.09457591
# remove predictors so that the maximum absolute pairwise correlation is below 0.75
# (computed on the numeric predictors only, i.e. excluding Brand.Code in column 1)
tooHigh <- findCorrelation(cor(train_x[, -1]), cutoff = .75)

# drop the highly correlated predictors and the factor variable;
# tooHigh indexes train_x[, -1], so shift by one to index train_x
train_x_nnet <- train_x[, -c(1, tooHigh + 1)]
test_x_nnet <- test_x[, -c(1, tooHigh + 1)]

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


set.seed(100)

# tune
nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 84851,
                  maxit = 500)

nnPred <- predict(nnetTune, test_x_nnet)

postResample(nnPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12524194 0.48150890 0.09465761
set.seed(100)

# tune
svmRTune <- train(train_x[, -1], train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRPred <- predict(svmRTune, test_x[, -1])

postResample(svmRPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12516357 0.48167776 0.09019808
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1101787 0.5929795 0.0804736
set.seed(100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.09988699 0.67694042 0.07159998
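The random forest above uses randomForest's default mtry (one third of the predictors for regression). As a possible refinement, not run here, mtry could be tuned with the same 10-fold cross-validation used for the other models:

set.seed(100)

# optional refinement: tune mtry over a small grid with 10-fold CV
rfTune <- train(train_x, train_y,
                method = "rf",
                ntree = 1000,
                importance = TRUE,
                tuneLength = 5,
                trControl = ctrl)
rfTune$bestTune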
set.seed(100)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.10036568 0.66326235 0.07295523

Based on the results, the Random Forest model has the lowest RMSE and the highest R², giving the best resampling and test-set performance.

rbind(lm = postResample(lmPred, test_y),
      pls = postResample(plsPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                    RMSE  Rsquared        MAE
## lm           0.13273832 0.4079061 0.10184356
## pls          0.13282358 0.4072428 0.10210997
## nn           0.12524194 0.4815089 0.09465761
## mars         0.12388891 0.4839614 0.09457591
## svmR         0.12516357 0.4816778 0.09019808
## randomForest 0.09988699 0.6769404 0.07159998
## boosted      0.11017872 0.5929795 0.08047360
## cubist       0.10036568 0.6632624 0.07295523
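The table above compares test-set performance. As a cross-check, caret's resamples() can compare the cross-validated performance of the models tuned with 10-fold CV (a sketch; gbm and cubist are omitted because they were tuned with caret's default bootstrap resampling, and the random forest was fit outside of caret):

# compare 10-fold CV resampling distributions across models
cvResults <- resamples(list(lm = lmModel, pls = plsTune, mars = marsTune,
                            nn = nnetTune, svmR = svmRTune))
summary(cvResults)
bwplot(cvResults, metric = "RMSE")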

Model Evaluation

Random Forest was chosen as the best model.

%IncMSE is the mean decrease in accuracy: it measures how much the model's prediction error (MSE) increases when the values of that variable are randomly permuted. IncNodePurity is the mean decrease in node impurity: it measures how much the splits on that variable reduce the residual sum of squares across the forest (the role the Gini index plays in classification).

Based on the mean decrease in accuracy, Brand.Code is the most important variable and Mnf.Flow is the second most important. When node purity is used instead, Mnf.Flow ranks as the most important.
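Since varImp() reports a single Overall score, randomForest::importance() can show both measures side by side (a quick check; the forest was fit with importance = TRUE, so both columns are available):

# both importance measures: %IncMSE (permutation) and IncNodePurity (impurity)
imp <- importance(rfModel)
head(imp[order(-imp[, "%IncMSE"]), ], 10)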

rfImp <- varImp(rfModel, scale = TRUE) %>%
  as.data.frame()

rfImp %>%
  arrange(-Overall) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box()
Overall
Brand.Code 58.8442640
Mnf.Flow 56.0374329
Pressure.Vacuum 50.1349324
Oxygen.Filler 47.6382521
Air.Pressurer 44.9762020
Usage.cont 42.8680517
Carb.Rel 39.3444174
Temperature 39.2223143
Balling.Lvl 38.8899066
Carb.Flow 34.0526289
Filler.Speed 33.9436447
Alch.Rel 32.9337260
Balling 30.4978940
Density 30.2357809
Carb.Pressure1 29.0662078
Bowl.Setpoint 28.2376748
Filler.Level 27.2386656
Hyd.Pressure3 23.7186542
Carb.Volume 21.8981787
Hyd.Pressure4 19.1261479
Hyd.Pressure2 18.7971318
MFR 17.5430425
Fill.Pressure 17.4819435
PC.Volume 16.0625502
Pressure.Setpoint 14.6211954
Carb.Pressure 5.6790446
Fill.Ounces 3.7618006
Carb.Temp 3.5644779
PSC.CO2 2.3435247
PSC.Fill -0.7138685
PSC -1.3143510
varImpPlot(rfModel, sort = TRUE, n.var = 10)

Disregarding Brand.Code: Mnf.Flow, Usage.cont, Air.Pressurer, and Temperature correlate negatively with PH, with Mnf.Flow and Usage.cont showing the strongest negative relationships. Oxygen.Filler, Pressure.Vacuum, and Carb.Rel show the strongest positive relationships with PH.

# top 10 numeric predictors by importance (Brand.Code is excluded because
# cor() requires numeric inputs)
top10 <- rfImp %>%
  tibble::rownames_to_column("variable") %>%
  filter(variable != "Brand.Code") %>%
  arrange(-Overall) %>%
  head(10)


df_StudentData %>%
  select(c("PH", top10$variable)) %>%
  cor() %>%
  corrplot(mar = c(0, 0, 2, 0),
           title = "Correlation between PH and the Top 10 Numerical Variables")
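To read the direction of these relationships from the model itself rather than from raw correlations, partial dependence plots can be drawn with randomForest::partialPlot (a sketch for two of the strongest predictors after Brand.Code):

# partial dependence of predicted PH on a single predictor,
# averaging over the observed values of the other variables
partialPlot(rfModel, pred.data = train_x, x.var = "Mnf.Flow",
            main = "Partial Dependence of PH on Mnf.Flow")
partialPlot(rfModel, pred.data = train_x, x.var = "Oxygen.Filler",
            main = "Partial Dependence of PH on Oxygen.Filler")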

Grouping by Brand.Code, PH tends to be lowest for brand “C” and highest for brand “D”. Brand “B” accounts for nearly half of the data and has the second highest PH on average.

df_StudentData %>%
  ggplot(aes(x= PH)) +
  geom_histogram(bins = 15) +
  facet_wrap(~Brand.Code, ncol = 1)

df_StudentData %>%
  ggplot(aes(x= PH)) +
  geom_boxplot() +
  facet_wrap(~Brand.Code, ncol = 1) 

df_StudentData %>%
  group_by(Brand.Code) %>%
  summarise(`Average PH` = mean(PH))
## # A tibble: 4 x 2
##   Brand.Code `Average PH`
##   <fct>             <dbl>
## 1 A                  8.50
## 2 B                  8.56
## 3 C                  8.42
## 4 D                  8.60

Forecasting

The evaluation data has to be transformed in the same way before it can be used for forecasting.

set.seed(100)

df_EvalData <- df_EvalData %>%
  select(-PH) %>%
  mice(., m = 1, method = 'pmm', print = FALSE) %>% complete()

# remove Hyd.Pressure1, which was dropped during preprocessing of the training data,
# and add PH back as an empty numeric column to hold the predictions
df_EvalData <- df_EvalData %>%
  select(-Hyd.Pressure1) %>%
  mutate(PH = NA_real_)

# predict PH
prediction <- predict(rfModel, df_EvalData)

head(prediction)
##        1        2        3        4        5        6 
## 8.522206 8.474750 8.525824 8.612244 8.503590 8.550513
# put the PH back into the data frame
df_EvalData$PH <- prediction

#average ph
df_EvalData %>%
  group_by(Brand.Code) %>%
  summarise(`Average PH` = mean(PH))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   Brand.Code `Average PH`
##   <fct>             <dbl>
## 1 A                  8.51
## 2 B                  8.57
## 3 C                  8.43
## 4 D                  8.59
# export file
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData), file = 'predictions_DJO.xlsx')

Conclusion

The predicted PH values may look similar, as they all fall between 8 and 9, but the patterns from the training data still hold when the predictions are grouped by Brand.Code. We also highlighted the variables that have the most effect on PH. We hope that this deeper understanding of the manufacturing process helps with the new regulations in the beverage industry.

The Random Forest model captured the complexity of the data best, as it had the best R² and RMSE. As an ensemble of many decision trees, it handles non-linear relationships and interactions among the predictors, while remaining computationally manageable compared to the other models we tuned.