DATA 624 Project 2

Author

Henock Montcho

Published

May 22, 2025

Instructions

Project #2 (Team) Assignment

This is role playing.  I am your new boss.  I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me.  My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.

Please use the historical data set I am providing.  Build and report the factors in BOTH a technical and non-technical report.  I like to use Word and Excel.  Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format.  The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files or other readable formats for the technical and non-technical reports.  Also submit the Excel file showing the predictions of your models for pH.

library(tsibble)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(openxlsx)
library(kableExtra)
library(purrr)

Loading the data and replacing blanks with NA.

studentsdata <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentData.xlsx", na.strings = c("", NA))

studentevaluation <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentEvaluation.xlsx", na.strings = c("", NA))

Exploratory Data Analysis

library(ggplot2)
glimpse(studentsdata)
Rows: 2,571
Columns: 33
$ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
$ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
$ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
$ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
$ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
$ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
$ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
$ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
$ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
$ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
$ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
$ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
$ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
$ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
$ Filler.Speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
$ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
$ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
$ Carb.Flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
$ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
$ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
$ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
$ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
$ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
$ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
$ Bowl.Setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
$ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
$ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
$ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
$ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
$ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
summary(studentsdata)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box(width = "100%", height = "200px")
(Scrollable summary table: minimum, first quartile, median, mean, third quartile, maximum, and NA count for each of the 33 variables. Notable points: PH ranges from 7.88 to 9.36 with a mean of about 8.55, Mnf.Flow spans -100.2 to 229.4, and MFR has the most missing values at 212.)
studentsdata  |>
  keep(is.numeric)  |>
  gather()  |>
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

studentsdata  |>
  keep(is.numeric)  |>
  gather()  |>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Comment: The training dataset contains 2,571 observations of 33 variables. Brand.Code is the only categorical variable; the remaining variables are numeric, continuous measurements. PH is centered around 8.5. Several predictors are noticeably skewed, so centering and scaling (and, where helpful, a skewness-reducing transformation) would be beneficial during preprocessing.
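A minimal sketch of that preprocessing idea, assuming the caret package is available (the object names num_vars and pp are illustrative and not used elsewhere in this report):

library(caret)

# estimate centering and scaling on the numeric predictors; missing values are
# ignored during estimation, and a Box-Cox or Yeo-Johnson step could be added
# to address the skewness directly
num_vars <- studentsdata  |>
  keep(is.numeric)
pp <- preProcess(num_vars, method = c("center", "scale"))
num_vars_scaled <- predict(pp, num_vars)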

Data processing

studentsdata$Brand.Code <- as.factor(studentsdata$Brand.Code)
studentevaluation$Brand.Code <- as.factor(studentevaluation$Brand.Code)

studentsdata  |>
  ggplot() + 
  geom_bar(aes(x = Brand.Code)) + 
  ggtitle("Distribution of the Brand Codes")

Comment: Brand.Code is the only categorical variable in the dataset, with possible values A, B, C, and D. It was converted to a factor for analysis. Brand B is the most frequent, representing nearly half of all observations.
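As a quick, illustrative check of the “nearly half” observation, the brand proportions can be tabulated directly (missing codes included):

prop.table(table(studentsdata$Brand.Code, useNA = "ifany"))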

Identifying missing data

studentsdata  |>
  summarise_all(list(~ sum(is.na(.))))  |>
  gather(variable, value)  |>
  filter(value != 0)  |>
  arrange(-value)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box(width = "100%", height = "200px")
variable value
MFR 212
Brand.Code 120
Filler.Speed 57
PC.Volume 39
PSC.CO2 39
Fill.Ounces 38
PSC 33
Carb.Pressure1 32
Hyd.Pressure4 30
Carb.Pressure 27
Carb.Temp 26
PSC.Fill 23
Fill.Pressure 22
Filler.Level 20
Hyd.Pressure2 15
Hyd.Pressure3 15
Temperature 14
Oxygen.Filler 12
Pressure.Setpoint 12
Hyd.Pressure1 11
Carb.Volume 10
Carb.Rel 10
Alch.Rel 9
Usage.cont 5
PH 4
Mnf.Flow 2
Carb.Flow 2
Bowl.Setpoint 2
Density 1
Balling 1
Balling.Lvl 1
studentsdata  |>
  summarise_all(list(~is.na(.)))  |>
  pivot_longer(everything(), names_to = "variables", values_to="missing")  |>
  count(variables, missing)  |>
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","blue"))

Comment: Overall, 30 of the 33 variables have missing values, with MFR missing the most (212 values), followed by Brand.Code (120).
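Two illustrative one-liners quantify the overall extent of missingness, namely the share of fully observed rows and the share of missing cells:

# proportion of rows with no missing values
mean(complete.cases(studentsdata))
# proportion of missing cells across the whole data frame
mean(is.na(studentsdata))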

Data Transformation

library(mice)
library(caret)
set.seed(001)

studentsdata <- mice(studentsdata, m = 1, method = 'pmm', print = FALSE)  |>
  complete()

# remove near-zero-variance (low-information) predictors
studentsdata <- studentsdata[, -nearZeroVar(studentsdata)]

studentsdata  |>
  summarise_all(list(~is.na(.)))  |>
  pivot_longer(everything(), names_to = "variables", values_to="missing")  |>
  count(variables, missing)  |>
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","blue"))

Comment: Missing values were imputed with the mice() function from the mice package, using predictive mean matching. Additionally, predictors with near-zero variance were removed; only Hyd.Pressure1 met this criterion and was excluded from the dataset. There are no remaining missing values, and the transformations are complete for now.
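As an illustrative check (it would need to be run before the filtering step above, since the flagged column has already been dropped at this point), nearZeroVar() can report the offending predictors by name rather than by column index:

# names of near-zero-variance predictors, rather than their positions
nearZeroVar(studentsdata, names = TRUE)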

Model building

set.seed(002)


# index for training
index <- createDataPartition(studentsdata$PH, p = .8, list = FALSE)

# train 
train_x <- studentsdata[index, ]  |>
  select(-PH)
train_y <- studentsdata[index, 'PH']

# test
test_x <- studentsdata[-index, ]  |>
  select(-PH)
test_y <- studentsdata[-index, 'PH']
  • LM
set.seed(200)

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl) 

lmPred <- predict(lmModel, test_x)

postResample(lmPred, test_y)
     RMSE  Rsquared       MAE 
0.1344341 0.4141462 0.1048348 
  • PLS
set.seed(201)

plsTune <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsPred <- predict(plsTune, test_x)

postResample(plsPred, test_y)
     RMSE  Rsquared       MAE 
0.1349151 0.4098758 0.1050237 
  • MARS
library(earth)
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(202)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
      RMSE   Rsquared        MAE 
0.12751890 0.47288667 0.09764549 
  • NNET
# remove predictors so the maximum absolute pairwise correlation is below 0.75;
# the correlation matrix excludes the Brand.Code factor in column 1
tooHigh <- findCorrelation(cor(train_x[, -1]), cutoff = .75)

# drop the factor column first so the findCorrelation indices line up, then
# drop the highly correlated predictors (9 variables plus the factor)
train_x_nnet <- train_x[, -1][, -tooHigh]
test_x_nnet <- test_x[, -1][, -tooHigh]

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


set.seed(203)

# tune
nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 84851,
                  maxit = 500)

nnPred <- predict(nnetTune, test_x_nnet)

postResample(nnPred, test_y)
      RMSE   Rsquared        MAE 
0.12987810 0.45389272 0.09936768 
  • Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(205)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
      RMSE   Rsquared        MAE 
0.10510880 0.64255828 0.07863323 
  • Random forest
library(randomForest)
set.seed(006)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
      RMSE   Rsquared        MAE 
0.09855003 0.71000923 0.07252539 
  • Cubist
set.seed(1207)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
      RMSE   Rsquared        MAE 
0.09518817 0.70991533 0.06945696 
  • SVM
set.seed(208)

# tune
svmRTune <- train(train_x[, -1], train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRPred <- predict(svmRTune, test_x[, -1])

postResample(svmRPred, test_y)
      RMSE   Rsquared        MAE 
0.13120264 0.45352417 0.09265127 
rbind(lm = postResample(lmPred, test_y),
      pls = postResample(plsPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
                   RMSE  Rsquared        MAE
lm           0.13443415 0.4141462 0.10483479
pls          0.13491505 0.4098758 0.10502367
nn           0.12987810 0.4538927 0.09936768
mars         0.12751890 0.4728867 0.09764549
svmR         0.13120264 0.4535242 0.09265127
randomForest 0.09855003 0.7100092 0.07252539
boosted      0.10510880 0.6425583 0.07863323
cubist       0.09518817 0.7099153 0.06945696

Comment: Based on these results, the Random Forest and Cubist models have the lowest RMSE and the highest R-squared, giving the best test-set performance. The two are effectively tied, and the Random Forest model will be used for the final predictions and for examining variable importance.
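As an additional, illustrative check on the chosen model, the random forest's out-of-bag (OOB) error provides an internal performance estimate that never touches the held-out test set:

# OOB mean squared error after all 1,000 trees, converted to an RMSE
sqrt(tail(rfModel$mse, 1))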

Model Evaluation

rfImp <- varImp(rfModel, scale = TRUE)  |>
  as.data.frame()

rfImp  |>
  arrange(-Overall)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box()
Overall
Brand.Code 59.2126310
Mnf.Flow 58.0568360
Pressure.Vacuum 51.6717195
Oxygen.Filler 47.6872227
Air.Pressurer 42.1720430
Usage.cont 40.8280956
Balling.Lvl 38.9237499
Temperature 34.7631044
Carb.Rel 33.7910318
Filler.Speed 33.5709748
Alch.Rel 32.8438373
Balling 31.3666204
Carb.Flow 30.1705578
Density 29.7646731
Carb.Pressure1 28.6769704
Bowl.Setpoint 27.0618028
Hyd.Pressure3 26.7971502
Filler.Level 26.4589246
Carb.Volume 21.5943882
MFR 20.6224943
Hyd.Pressure4 20.5112691
PC.Volume 19.6512612
Hyd.Pressure2 18.7329164
Fill.Pressure 14.9007860
Pressure.Setpoint 13.5043671
Carb.Pressure 4.9377452
Carb.Temp 2.9409780
Fill.Ounces 2.8775095
PSC 0.7304188
PSC.CO2 0.4561094
PSC.Fill -0.3723884
varImpPlot(rfModel, sort = TRUE, n.var = 10)

Comment:

  • %IncMSE (Percent Increase in Mean Squared Error) reflects how much prediction error increases when a variable's values are randomly permuted; higher values indicate greater importance. In contrast, IncNodePurity (Increase in Node Purity) measures how much each variable improves the purity of decision tree nodes; for a regression forest this is based on the reduction in residual sum of squares rather than the Gini index.
  • By mean decrease in accuracy (%IncMSE), Brand.Code emerges as the most influential variable, followed closely by Mnf.Flow. Ranked by node purity, Brand.Code again stands out as the single most important variable. A short sketch after this list shows how both measures can be extracted from the fitted forest.
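A minimal sketch of that extraction, assuming the rfModel object fitted above:

# matrix with columns %IncMSE and IncNodePurity for a regression forest
imp <- randomForest::importance(rfModel)

# top 10 predictors under each criterion
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), ], 10)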
library(corrplot)
top10 <- varImp(rfModel)  |>
  filter(Overall < 57)  |>
  arrange(-Overall)  |>
  head(10)

studentsdata  |>
  select(c("PH", row.names(top10)))  |>
  cor()  |>
  corrplot(title = "Correlation between pH and the Top 10 Numerical Variables",
           mar = c(0,0,1,0))

Comment: Pressure.Vacuum, Oxygen.Filler, Balling.Lvl, Carb.Rel, and Alch.Rel appear to be positively correlated with pH, while Usage.cont and Temperature are negatively correlated with it.
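The same correlations can also be read as a sorted, signed vector rather than off the plot (illustrative; this reuses the top10 object created above):

cor_mat <- studentsdata  |>
  select(c("PH", row.names(top10)))  |>
  cor()

# signed correlation of each top predictor with pH, lowest to highest
sort(cor_mat["PH", -1])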

studentsdata  |>
  ggplot(aes(x = PH)) +
  geom_histogram(bins = 15) +
  facet_wrap(~Brand.Code, ncol = 1)

studentsdata  |>
  ggplot(aes(x= PH)) +
  geom_boxplot() +
  facet_wrap(~Brand.Code, ncol = 1) 

studentsdata  |>
  group_by(Brand.Code)  |>
  summarise(`Average PH` = mean(PH))
# A tibble: 4 × 2
  Brand.Code `Average PH`
  <fct>             <dbl>
1 A                  8.50
2 B                  8.56
3 C                  8.41
4 D                  8.60

Comment: Grouping by Brand.Code shows that products labeled “C” have the lowest average pH, while those labeled “D” have the highest; brand “B” has the second highest average.

Forecasting

library(mice)
set.seed(100)

studentevaluation <- studentevaluation  |>
  select(-PH)  |>
  mice(m = 1, method = 'pmm', print = FALSE)  |>
  complete()

# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
studentevaluation <- studentevaluation  |>
  select(-Hyd.Pressure1)  |>
  mutate(PH = "")

# predict PH
prediction <- predict(rfModel, studentevaluation)

head(prediction)
       1        2        3        4        5        6 
8.584146 8.451713 8.517907 8.564688 8.529912 8.584629 
# put the PH back into the data frame
studentevaluation$PH <- prediction

# average pH by brand
studentevaluation  |>
  group_by(Brand.Code)  |>
  summarise(`Average PH` = mean(PH))
# A tibble: 4 × 2
  Brand.Code `Average PH`
  <fct>             <dbl>
1 A                  8.50
2 B                  8.57
3 C                  8.42
4 D                  8.59
write.xlsx(list('PH' = prediction, 'EvalData_complete' = studentevaluation), file = 'predictions_Henock_M.xlsx')

Summary

The pH values appear relatively consistent, generally falling within a narrow range of roughly 8 to 9. Despite this narrow range, distinct patterns emerge when the data is grouped by Brand.Code, and these patterns are reflected in our model's predictions. We also identified and emphasized the variables that have the strongest influence on pH levels. Our hope is that these insights contribute to a better understanding of the manufacturing process, particularly in light of new regulatory standards in the beverage industry.

Among the models evaluated, the Random Forest model demonstrated the best overall performance, achieving the highest R-squared and an RMSE essentially tied with the Cubist model. This suggests it was the most effective at capturing the complexity of the data. Its ensemble structure, which averages many decision trees, also makes it robust to noisy predictors and to nonlinear relationships that simpler models struggle to capture.