
Project #2 (Team) Assignment

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Loading the Data

We uploaded the provided data files to GitHub so that they are easier to access. We also replaced any blank values with NA so that the missing values can be imputed later.
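The code in this report assumes a handful of packages are loaded up front; a minimal setup chunk (the package list is inferred from the function calls used throughout):

library(tidyverse)    # dplyr, tidyr, purrr, ggplot2
library(knitr)        # kable
library(kableExtra)   # kable_styling, scroll_box
library(mice)         # imputation of missing values
library(caret)        # resampling and model training
library(randomForest) # random forest model
library(corrplot)     # correlation plot
library(openxlsx)     # exporting predictions to Excel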

df_StudentData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentData.csv',
                           na.strings = c("", NA))
df_EvalData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentEvaluation.csv',
                           na.strings = c("", NA))

Data Exploration

Summary

The training data has 33 variables and 2,571 observations. One variable, Brand.Code, is categorical; the rest are integer or numeric. PH is centered around 8.5. Some variables appear skewed, so it would be best to center and scale the data later on.

glimpse(df_StudentData)
## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B~
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.~
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23~
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333~
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2~
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141~
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,~
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12~
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14~
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100~
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124~
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0~
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86~
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118~
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, ~
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4~
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.~
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3~
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90~
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74~
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2~
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4~
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38~
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0~
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12~
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0~
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146~
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52~
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34~
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44~
summary(df_StudentData) %>% 
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
| Variable | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | NA's |
|---|---|---|---|---|---|---|---|
| Brand.Code | character, length 2571 | | | | | | |
| Carb.Volume | 5.040 | 5.293 | 5.347 | 5.370 | 5.453 | 5.700 | 10 |
| Fill.Ounces | 23.63 | 23.92 | 23.97 | 23.97 | 24.03 | 24.32 | 38 |
| PC.Volume | 0.07933 | 0.23917 | 0.27133 | 0.27712 | 0.31200 | 0.47800 | 39 |
| Carb.Pressure | 57.00 | 65.60 | 68.20 | 68.19 | 70.60 | 79.40 | 27 |
| Carb.Temp | 128.6 | 138.4 | 140.8 | 141.1 | 143.8 | 154.0 | 26 |
| PSC | 0.00200 | 0.04800 | 0.07600 | 0.08457 | 0.11200 | 0.27000 | 33 |
| PSC.Fill | 0.0000 | 0.1000 | 0.1800 | 0.1954 | 0.2600 | 0.6200 | 23 |
| PSC.CO2 | 0.00000 | 0.02000 | 0.04000 | 0.05641 | 0.08000 | 0.24000 | 39 |
| Mnf.Flow | -100.20 | -100.00 | 65.20 | 24.57 | 140.80 | 229.40 | 2 |
| Carb.Pressure1 | 105.6 | 119.0 | 123.2 | 122.6 | 125.4 | 140.2 | 32 |
| Fill.Pressure | 34.60 | 46.00 | 46.40 | 47.92 | 50.00 | 60.40 | 22 |
| Hyd.Pressure1 | -0.80 | 0.00 | 11.40 | 12.44 | 20.20 | 58.00 | 11 |
| Hyd.Pressure2 | 0.00 | 0.00 | 28.60 | 20.96 | 34.60 | 59.40 | 15 |
| Hyd.Pressure3 | -1.20 | 0.00 | 27.60 | 20.46 | 33.40 | 50.00 | 15 |
| Hyd.Pressure4 | 52.00 | 86.00 | 96.00 | 96.29 | 102.00 | 142.00 | 30 |
| Filler.Level | 55.8 | 98.3 | 118.4 | 109.3 | 120.0 | 161.2 | 20 |
| Filler.Speed | 998 | 3888 | 3982 | 3687 | 3998 | 4030 | 57 |
| Temperature | 63.60 | 65.20 | 65.60 | 65.97 | 66.40 | 76.20 | 14 |
| Usage.cont | 12.08 | 18.36 | 21.79 | 20.99 | 23.75 | 25.90 | 5 |
| Carb.Flow | 26 | 1144 | 3028 | 2468 | 3186 | 5104 | 2 |
| Density | 0.240 | 0.900 | 0.980 | 1.174 | 1.620 | 1.920 | 1 |
| MFR | 31.4 | 706.3 | 724.0 | 704.0 | 731.0 | 868.6 | 212 |
| Balling | -0.170 | 1.496 | 1.648 | 2.198 | 3.292 | 4.012 | 1 |
| Pressure.Vacuum | -6.600 | -5.600 | -5.400 | -5.216 | -5.000 | -3.600 | 0 |
| PH | 7.880 | 8.440 | 8.540 | 8.546 | 8.680 | 9.360 | 4 |
| Oxygen.Filler | 0.00240 | 0.02200 | 0.03340 | 0.04684 | 0.06000 | 0.40000 | 12 |
| Bowl.Setpoint | 70.0 | 100.0 | 120.0 | 109.3 | 120.0 | 140.0 | 2 |
| Pressure.Setpoint | 44.00 | 46.00 | 46.00 | 47.62 | 50.00 | 52.00 | 12 |
| Air.Pressurer | 140.8 | 142.2 | 142.6 | 142.8 | 143.0 | 148.2 | 0 |
| Alch.Rel | 5.280 | 6.540 | 6.560 | 6.897 | 7.240 | 8.620 | 9 |
| Carb.Rel | 4.960 | 5.340 | 5.400 | 5.437 | 5.540 | 6.060 | 10 |
| Balling.Lvl | 0.00 | 1.38 | 1.48 | 2.05 | 3.14 | 3.66 | 1 |
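Before plotting, we can quantify the apparent skewness (a quick sketch, assuming the e1071 package is available for its skewness() function):

library(e1071)

# per-variable skewness, sorted by magnitude (NA values ignored for now)
df_StudentData %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))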
df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Data Conversion

Brand.Code is the only categorical variable and takes on the values A, B, C, or D. It is best converted to a factor. B seems to be the most common brand code, accounting for nearly half of the observations.

df_StudentData$Brand.Code <- as.factor(df_StudentData$Brand.Code)
df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)

df_StudentData %>%
  ggplot() + 
  geom_bar(aes(x = Brand.Code)) + 
  ggtitle("Distribution of the Brand Codes")

Missing Data

31 of the 33 variables have missing values. MFR is missing for roughly 8.2% of the observations (212 of 2,571), and Brand.Code has 120 missing values.

df_StudentData %>%
  summarise_all(list(~ sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0) %>%
  arrange(-value) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
variable value
MFR 212
Brand.Code 120
Filler.Speed 57
PC.Volume 39
PSC.CO2 39
Fill.Ounces 38
PSC 33
Carb.Pressure1 32
Hyd.Pressure4 30
Carb.Pressure 27
Carb.Temp 26
PSC.Fill 23
Fill.Pressure 22
Filler.Level 20
Hyd.Pressure2 15
Hyd.Pressure3 15
Temperature 14
Oxygen.Filler 12
Pressure.Setpoint 12
Hyd.Pressure1 11
Carb.Volume 10
Carb.Rel 10
Alch.Rel 9
Usage.cont 5
PH 4
Mnf.Flow 2
Carb.Flow 2
Bowl.Setpoint 2
Density 1
Balling 1
Balling.Lvl 1
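To put these counts in perspective, the same information can be expressed as a proportion of the 2,571 observations (a quick sketch):

# missing counts as a share of all observations
df_StudentData %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prop_missing") %>%
  filter(prop_missing > 0) %>%
  arrange(desc(prop_missing)) %>%
  head(5)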
df_StudentData %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","red"))

Transforming the Data

Next, we imputed the missing values with mice() from the mice package, using predictive mean matching. We also excluded any near-zero-variance predictors; in this case, only Hyd.Pressure1 was removed.

set.seed(100)

df_StudentData <- mice(df_StudentData, m = 1, method = 'pmm', print = FALSE) %>% complete()

# drop near-zero-variance predictors (only Hyd.Pressure1 is flagged)
df_StudentData <- df_StudentData[, -nearZeroVar(df_StudentData)]

There are no more missing values and the transformations are complete for now.

df_StudentData %>%
  summarise_all(list(~is.na(.)))%>%
  pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","red"))

Model Building

First, we split the data into training and test sets using an 80/20 split. Then we fit several types of regression models, including linear regressions, non-linear regressions, and tree-based models.

set.seed(100)


# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)

# train 
train_x <- df_StudentData[index, ] %>% select(-PH)
train_y <- df_StudentData[index, 'PH']

# test
test_x <- df_StudentData[-index, ] %>% select(-PH)
test_y <- df_StudentData[-index, 'PH']
set.seed(100)

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl) 

lmPred <- predict(lmModel, test_x)

postResample(lmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1327383 0.4079061 0.1018436
set.seed(100)

plsTune <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsPred <- predict(plsTune, test_x)

postResample(plsPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1328236 0.4072428 0.1021100
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12388891 0.48396139 0.09457591
# remove predictors so that the maximum absolute pairwise correlation is below 0.75
# (computed on the numeric predictors only, i.e. excluding Brand.Code in column 1)
tooHigh <- findCorrelation(cor(train_x[, -1]), cutoff = .75)

# drop the highly correlated predictors and the factor variable;
# tooHigh indexes train_x[, -1], so shift by one to index train_x
train_x_nnet <- train_x[, -c(1, tooHigh + 1)]
test_x_nnet <- test_x[, -c(1, tooHigh + 1)]

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


set.seed(100)

# tune
nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 84851,
                  maxit = 500)

nnPred <- predict(nnetTune, test_x_nnet)

postResample(nnPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12524194 0.48150890 0.09465761
set.seed(100)

# tune
svmRTune <- train(train_x[, -1], train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRPred <- predict(svmRTune, test_x[, -1])

postResample(svmRPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12516357 0.48167776 0.09019808
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1101787 0.5929795 0.0804736
set.seed(100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.09988699 0.67694042 0.07159998
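The random forest above uses randomForest's default mtry (one third of the predictors for regression). As a possible refinement, not run here, mtry could be tuned with the same 10-fold cross-validation used for the other models:

set.seed(100)

# optional refinement: tune mtry over a small grid with 10-fold CV
rfTune <- train(train_x, train_y,
                method = "rf",
                ntree = 1000,
                importance = TRUE,
                tuneLength = 5,
                trControl = ctrl)
rfTune$bestTune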
set.seed(100)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.10036568 0.66326235 0.07295523

Based on the results, the Random Forest model has the lowest RMSE and the highest R², giving the best resampling and test-set performance.

rbind(lm = postResample(lmPred, test_y),
      pls = postResample(plsPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                    RMSE  Rsquared        MAE
## lm           0.13273832 0.4079061 0.10184356
## pls          0.13282358 0.4072428 0.10210997
## nn           0.12524194 0.4815089 0.09465761
## mars         0.12388891 0.4839614 0.09457591
## svmR         0.12516357 0.4816778 0.09019808
## randomForest 0.09988699 0.6769404 0.07159998
## boosted      0.11017872 0.5929795 0.08047360
## cubist       0.10036568 0.6632624 0.07295523
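The table above compares test-set performance. As a cross-check, caret's resamples() can compare the cross-validated performance of the models tuned with 10-fold CV (a sketch; gbm and cubist are omitted because they were tuned with caret's default bootstrap resampling, and the random forest was fit outside of caret):

# compare 10-fold CV resampling distributions across models
cvResults <- resamples(list(lm = lmModel, pls = plsTune, mars = marsTune,
                            nn = nnetTune, svmR = svmRTune))
summary(cvResults)
bwplot(cvResults, metric = "RMSE")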

Model Evaluation

Random Forest was chosen as the best model.

%IncMSE is the mean decrease in accuracy: it measures how much the model's prediction error (MSE) increases when the values of that variable are randomly permuted. IncNodePurity is the mean decrease in node impurity: it measures how much the splits on that variable reduce the residual sum of squares across the forest (the role the Gini index plays in classification).

Based on the mean decrease in accuracy, Brand.Code is the most important variable and Mnf.Flow is the second most important. When node purity is used instead, Mnf.Flow ranks as the most important.
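Since varImp() reports a single Overall score, randomForest::importance() can show both measures side by side (a quick check; the forest was fit with importance = TRUE, so both columns are available):

# both importance measures: %IncMSE (permutation) and IncNodePurity (impurity)
imp <- importance(rfModel)
head(imp[order(-imp[, "%IncMSE"]), ], 10)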

rfImp <- varImp(rfModel, scale = TRUE) %>%
  as.data.frame()

rfImp %>%
  arrange(-Overall) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box()
Overall
Brand.Code 58.8442640
Mnf.Flow 56.0374329
Pressure.Vacuum 50.1349324
Oxygen.Filler 47.6382521
Air.Pressurer 44.9762020
Usage.cont 42.8680517
Carb.Rel 39.3444174
Temperature 39.2223143
Balling.Lvl 38.8899066
Carb.Flow 34.0526289
Filler.Speed 33.9436447
Alch.Rel 32.9337260
Balling 30.4978940
Density 30.2357809
Carb.Pressure1 29.0662078
Bowl.Setpoint 28.2376748
Filler.Level 27.2386656
Hyd.Pressure3 23.7186542
Carb.Volume 21.8981787
Hyd.Pressure4 19.1261479
Hyd.Pressure2 18.7971318
MFR 17.5430425
Fill.Pressure 17.4819435
PC.Volume 16.0625502
Pressure.Setpoint 14.6211954
Carb.Pressure 5.6790446
Fill.Ounces 3.7618006
Carb.Temp 3.5644779
PSC.CO2 2.3435247
PSC.Fill -0.7138685
PSC -1.3143510
varImpPlot(rfModel, sort = TRUE, n.var = 10)

Disregarding Brand.Code: Mnf.Flow, Usage.cont, Air.Pressurer, and Temperature correlate negatively with PH, with Mnf.Flow and Usage.cont showing the strongest negative relationships. Oxygen.Filler, Pressure.Vacuum, and Carb.Rel show the strongest positive relationships with PH.

# top 10 numeric predictors by importance (Brand.Code is excluded because
# cor() requires numeric inputs)
top10 <- rfImp %>%
  tibble::rownames_to_column("variable") %>%
  filter(variable != "Brand.Code") %>%
  arrange(-Overall) %>%
  head(10)


df_StudentData %>%
  select(c("PH", top10$variable)) %>%
  cor() %>%
  corrplot(mar = c(0, 0, 2, 0),
           title = "Correlation between PH and the Top 10 Numerical Variables")
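To read the direction of these relationships from the model itself rather than from raw correlations, partial dependence plots can be drawn with randomForest::partialPlot (a sketch for two of the strongest predictors after Brand.Code):

# partial dependence of predicted PH on a single predictor,
# averaging over the observed values of the other variables
partialPlot(rfModel, pred.data = train_x, x.var = "Mnf.Flow",
            main = "Partial Dependence of PH on Mnf.Flow")
partialPlot(rfModel, pred.data = train_x, x.var = "Oxygen.Filler",
            main = "Partial Dependence of PH on Oxygen.Filler")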

Grouping by Brand.Code, PH tends to be lowest for brand “C” and highest for brand “D”. Brand “B” accounts for nearly half of the data and has the second highest PH on average.

df_StudentData %>%
  ggplot(aes(x= PH)) +
  geom_histogram(bins = 15) +
  facet_wrap(~Brand.Code, ncol = 1)

df_StudentData %>%
  ggplot(aes(x= PH)) +
  geom_boxplot() +
  facet_wrap(~Brand.Code, ncol = 1) 

df_StudentData %>%
  group_by(Brand.Code) %>%
  summarise(`Average PH` = mean(PH))
## # A tibble: 4 x 2
##   Brand.Code `Average PH`
##   <fct>             <dbl>
## 1 A                  8.50
## 2 B                  8.56
## 3 C                  8.42
## 4 D                  8.60

Forecasting

The evaluation data has to be transformed in the same way before it can be used for forecasting.

set.seed(100)

df_EvalData <- df_EvalData %>%
  select(-PH) %>%
  mice(., m = 1, method = 'pmm', print = FALSE) %>% complete()

# remove Hyd.Pressure1, which was dropped during preprocessing of the training data,
# and add PH back as an empty numeric column to hold the predictions
df_EvalData <- df_EvalData %>%
  select(-Hyd.Pressure1) %>%
  mutate(PH = NA_real_)

# predict PH
prediction <- predict(rfModel, df_EvalData)

head(prediction)
##        1        2        3        4        5        6 
## 8.522206 8.474750 8.525824 8.612244 8.503590 8.550513
# put the PH back into the data frame
df_EvalData$PH <- prediction

#average ph
df_EvalData %>%
  group_by(Brand.Code) %>%
  summarise(`Average PH` = mean(PH))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   Brand.Code `Average PH`
##   <fct>             <dbl>
## 1 A                  8.51
## 2 B                  8.57
## 3 C                  8.43
## 4 D                  8.59
# export file
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData), file = 'predictions_DJO.xlsx')

Conclusion

The predicted PH values may look similar, as they all fall between 8 and 9, but the patterns from the training data still hold when the predictions are grouped by Brand.Code. We also highlighted the variables that have the most effect on PH. We hope that this deeper understanding of the manufacturing process helps with the new regulations in the beverage industry.

The Random Forest model captured the complexity of the data best, as it had the best R² and RMSE. As an ensemble of many decision trees, it handles non-linear relationships and interactions among the predictors, while remaining computationally manageable compared to the other models we tuned.