DATA 624 Project 2

Author

Henock Montcho

Published

May 22, 2025

Instructions

Project #2 (Team) Assignment

This is role playing.  I am your new boss.  I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me.  My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.

Please use the historical data set I am providing.  Build and report the factors in BOTH a technical and non-technical report.  I like to use Word and Excel.  Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format.  The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files or other readable formats for the technical and non-technical reports.  Also submit the Excel file showing the predictions of your models for pH.

library(tsibble)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(openxlsx)
library(kableExtra)
library(purrr)

Loading the data and replacing blanks with NA.

studentsdata <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentData.xlsx", na.strings = c("", NA))

studentevaluation <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentEvaluation.xlsx", na.strings = c("", NA))

Exploratory Data Analysis

library(ggplot2)
glimpse(studentsdata)
Rows: 2,571
Columns: 33
$ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
$ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
$ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
$ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
$ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
$ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
$ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
$ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
$ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
$ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
$ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
$ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
$ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
$ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
$ Filler.Speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
$ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
$ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
$ Carb.Flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
$ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
$ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
$ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
$ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
$ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
$ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
$ Bowl.Setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
$ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
$ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
$ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
$ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
$ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
summary(studentsdata)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box(width = "100%", height = "200px")
(Scrollable summary table: minimum, first quartile, median, mean, third quartile, maximum, and NA count for each of the 33 variables. Notable points: PH ranges from 7.88 to 9.36 with a mean of about 8.55, Mnf.Flow spans -100.2 to 229.4, and MFR has the most missing values at 212.)
studentsdata  |>
  keep(is.numeric)  |>
  gather()  |>
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

studentsdata  |>
  keep(is.numeric)  |>
  gather()  |>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Comment: The training dataset contains 2,571 observations of 33 variables. Brand.Code is the only categorical variable; the remaining variables are numeric, continuous measurements. PH is centered around 8.5. Several predictors are noticeably skewed, so centering and scaling (and, where helpful, a skewness-reducing transformation) would be beneficial during preprocessing.
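A minimal sketch of that preprocessing idea, assuming the caret package is available (the object names num_vars and pp are illustrative and not used elsewhere in this report):

library(caret)

# estimate centering and scaling on the numeric predictors; missing values are
# ignored during estimation, and a Box-Cox or Yeo-Johnson step could be added
# to address the skewness directly
num_vars <- studentsdata  |>
  keep(is.numeric)
pp <- preProcess(num_vars, method = c("center", "scale"))
num_vars_scaled <- predict(pp, num_vars)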

Data processing

studentsdata$Brand.Code <- as.factor(studentsdata$Brand.Code)
studentevaluation$Brand.Code <- as.factor(studentevaluation$Brand.Code)

studentsdata  |>
  ggplot() + 
  geom_bar(aes(x = Brand.Code)) + 
  ggtitle("Distribution of the Brand Codes")

Comment: Brand.Code is the only categorical variable in the dataset, with possible values A, B, C, and D. It was converted to a factor for analysis. Brand B is the most frequent, representing nearly half of all observations.
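As a quick, illustrative check of the “nearly half” observation, the brand proportions can be tabulated directly (missing codes included):

prop.table(table(studentsdata$Brand.Code, useNA = "ifany"))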

Identifying missing data

studentsdata  |>
  summarise_all(list(~ sum(is.na(.))))  |>
  gather(variable, value)  |>
  filter(value != 0)  |>
  arrange(-value)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box(width = "100%", height = "200px")
variable value
MFR 212
Brand.Code 120
Filler.Speed 57
PC.Volume 39
PSC.CO2 39
Fill.Ounces 38
PSC 33
Carb.Pressure1 32
Hyd.Pressure4 30
Carb.Pressure 27
Carb.Temp 26
PSC.Fill 23
Fill.Pressure 22
Filler.Level 20
Hyd.Pressure2 15
Hyd.Pressure3 15
Temperature 14
Oxygen.Filler 12
Pressure.Setpoint 12
Hyd.Pressure1 11
Carb.Volume 10
Carb.Rel 10
Alch.Rel 9
Usage.cont 5
PH 4
Mnf.Flow 2
Carb.Flow 2
Bowl.Setpoint 2
Density 1
Balling 1
Balling.Lvl 1
studentsdata  |>
  summarise_all(list(~is.na(.)))  |>
  pivot_longer(everything(), names_to = "variables", values_to="missing")  |>
  count(variables, missing)  |>
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","blue"))

Comment: Overall, 30 of the 33 variables have missing values, with MFR missing the most (212 values), followed by Brand.Code (120).
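Two illustrative one-liners quantify the overall extent of missingness, namely the share of fully observed rows and the share of missing cells:

# proportion of rows with no missing values
mean(complete.cases(studentsdata))
# proportion of missing cells across the whole data frame
mean(is.na(studentsdata))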

Data Transformation

library(mice)
library(caret)
set.seed(001)

studentsdata <- mice(studentsdata, m = 1, method = 'pmm', print = FALSE)  |>
  complete()

# remove near-zero-variance (low-information) predictors
studentsdata <- studentsdata[, -nearZeroVar(studentsdata)]

studentsdata  |>
  summarise_all(list(~is.na(.)))  |>
  pivot_longer(everything(), names_to = "variables", values_to="missing")  |>
  count(variables, missing)  |>
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","blue"))

Comment: Missing values were imputed with the mice() function from the mice package, using predictive mean matching. Additionally, predictors with near-zero variance were removed; only Hyd.Pressure1 met this criterion and was excluded from the dataset. There are no remaining missing values, and the transformations are complete for now.
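As an illustrative check (it would need to be run before the filtering step above, since the flagged column has already been dropped at this point), nearZeroVar() can report the offending predictors by name rather than by column index:

# names of near-zero-variance predictors, rather than their positions
nearZeroVar(studentsdata, names = TRUE)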

Model building

set.seed(002)


# index for training
index <- createDataPartition(studentsdata$PH, p = .8, list = FALSE)

# train 
train_x <- studentsdata[index, ]  |>
  select(-PH)
train_y <- studentsdata[index, 'PH']

# test
test_x <- studentsdata[-index, ]  |>
  select(-PH)
test_y <- studentsdata[-index, 'PH']
  • LM
set.seed(200)

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl) 

lmPred <- predict(lmModel, test_x)

postResample(lmPred, test_y)
     RMSE  Rsquared       MAE 
0.1344341 0.4141462 0.1048348 
  • PLS
set.seed(201)

plsTune <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsPred <- predict(plsTune, test_x)

postResample(plsPred, test_y)
     RMSE  Rsquared       MAE 
0.1349151 0.4098758 0.1050237 
  • MARS
library(earth)
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(202)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
      RMSE   Rsquared        MAE 
0.12751890 0.47288667 0.09764549 
  • NNET
# remove predictors so the maximum absolute pairwise correlation is below 0.75;
# the correlation matrix excludes the Brand.Code factor in column 1
tooHigh <- findCorrelation(cor(train_x[, -1]), cutoff = .75)

# drop the factor column first so the findCorrelation indices line up, then
# drop the highly correlated predictors (9 variables plus the factor)
train_x_nnet <- train_x[, -1][, -tooHigh]
test_x_nnet <- test_x[, -1][, -tooHigh]

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


set.seed(203)

# tune
nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 84851,
                  maxit = 500)

nnPred <- predict(nnetTune, test_x_nnet)

postResample(nnPred, test_y)
      RMSE   Rsquared        MAE 
0.12987810 0.45389272 0.09936768 
  • Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(205)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
      RMSE   Rsquared        MAE 
0.10510880 0.64255828 0.07863323 
  • Random forest
library(randomForest)
set.seed(006)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
      RMSE   Rsquared        MAE 
0.09855003 0.71000923 0.07252539 
  • Cubist
set.seed(1207)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
      RMSE   Rsquared        MAE 
0.09518817 0.70991533 0.06945696 
  • SVM
set.seed(208)

# tune
svmRTune <- train(train_x[, -1], train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRPred <- predict(svmRTune, test_x[, -1])

postResample(svmRPred, test_y)
      RMSE   Rsquared        MAE 
0.13120264 0.45352417 0.09265127 
rbind(lm = postResample(lmPred, test_y),
      pls = postResample(plsPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
                   RMSE  Rsquared        MAE
lm           0.13443415 0.4141462 0.10483479
pls          0.13491505 0.4098758 0.10502367
nn           0.12987810 0.4538927 0.09936768
mars         0.12751890 0.4728867 0.09764549
svmR         0.13120264 0.4535242 0.09265127
randomForest 0.09855003 0.7100092 0.07252539
boosted      0.10510880 0.6425583 0.07863323
cubist       0.09518817 0.7099153 0.06945696

Comment: Based on these results, the Random Forest and Cubist models have the lowest RMSE and the highest R-squared, giving the best test-set performance. The two are effectively tied, and the Random Forest model will be used for the final predictions and for examining variable importance.
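As an additional, illustrative check on the chosen model, the random forest's out-of-bag (OOB) error provides an internal performance estimate that never touches the held-out test set:

# OOB mean squared error after all 1,000 trees, converted to an RMSE
sqrt(tail(rfModel$mse, 1))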

Model Evaluation

rfImp <- varImp(rfModel, scale = TRUE)  |>
  as.data.frame()

rfImp  |>
  arrange(-Overall)  |>
  kable()  |>
  kable_styling()  |>
  scroll_box()
Overall
Brand.Code 59.2126310
Mnf.Flow 58.0568360
Pressure.Vacuum 51.6717195
Oxygen.Filler 47.6872227
Air.Pressurer 42.1720430
Usage.cont 40.8280956
Balling.Lvl 38.9237499
Temperature 34.7631044
Carb.Rel 33.7910318
Filler.Speed 33.5709748
Alch.Rel 32.8438373
Balling 31.3666204
Carb.Flow 30.1705578
Density 29.7646731
Carb.Pressure1 28.6769704
Bowl.Setpoint 27.0618028
Hyd.Pressure3 26.7971502
Filler.Level 26.4589246
Carb.Volume 21.5943882
MFR 20.6224943
Hyd.Pressure4 20.5112691
PC.Volume 19.6512612
Hyd.Pressure2 18.7329164
Fill.Pressure 14.9007860
Pressure.Setpoint 13.5043671
Carb.Pressure 4.9377452
Carb.Temp 2.9409780
Fill.Ounces 2.8775095
PSC 0.7304188
PSC.CO2 0.4561094
PSC.Fill -0.3723884
varImpPlot(rfModel, sort = TRUE, n.var = 10)

Comment:

  • %IncMSE (Percent Increase in Mean Squared Error) reflects how much prediction error increases when a variable's values are randomly permuted; higher values indicate greater importance. In contrast, IncNodePurity (Increase in Node Purity) measures how much each variable improves the purity of decision tree nodes; for a regression forest this is based on the reduction in residual sum of squares rather than the Gini index.
  • By mean decrease in accuracy (%IncMSE), Brand.Code emerges as the most influential variable, followed closely by Mnf.Flow. Ranked by node purity, Brand.Code again stands out as the single most important variable. A short sketch after this list shows how both measures can be extracted from the fitted forest.
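A minimal sketch of that extraction, assuming the rfModel object fitted above:

# matrix with columns %IncMSE and IncNodePurity for a regression forest
imp <- randomForest::importance(rfModel)

# top 10 predictors under each criterion
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), ], 10)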
library(corrplot)
top10 <- varImp(rfModel)  |>
  filter(Overall < 57)  |>
  arrange(-Overall)  |>
  head(10)

studentsdata  |>
  select(c("PH", row.names(top10)))  |>
  cor()  |>
  corrplot(title = "Correlation between pH and the Top 10 Numerical Variables",
           mar = c(0,0,1,0))

Comment: Pressure.Vacuum, Oxygen.Filler, Balling.Lvl, Carb.Rel, and Alch.Rel appear to be positively correlated with pH, while Usage.cont and Temperature are negatively correlated with it.
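The same correlations can also be read as a sorted, signed vector rather than off the plot (illustrative; this reuses the top10 object created above):

cor_mat <- studentsdata  |>
  select(c("PH", row.names(top10)))  |>
  cor()

# signed correlation of each top predictor with pH, lowest to highest
sort(cor_mat["PH", -1])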

studentsdata  |>
  ggplot(aes(x = PH)) +
  geom_histogram(bins = 15) +
  facet_wrap(~Brand.Code, ncol = 1)

studentsdata  |>
  ggplot(aes(x= PH)) +
  geom_boxplot() +
  facet_wrap(~Brand.Code, ncol = 1) 

studentsdata  |>
  group_by(Brand.Code)  |>
  summarise(`Average PH` = mean(PH))
# A tibble: 4 × 2
  Brand.Code `Average PH`
  <fct>             <dbl>
1 A                  8.50
2 B                  8.56
3 C                  8.41
4 D                  8.60

Comment: Grouping by Brand.Code shows that products labeled “C” have the lowest average pH, while those labeled “D” have the highest; brand “B” has the second highest average.

Forecasting

library(mice)
set.seed(100)

studentevaluation <- studentevaluation  |>
  select(-PH)  |>
  mice(m = 1, method = 'pmm', print = FALSE)  |>
  complete()

# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
studentevaluation <- studentevaluation  |>
  select(-Hyd.Pressure1)  |>
  mutate(PH = "")

# predict PH
prediction <- predict(rfModel, studentevaluation)

head(prediction)
       1        2        3        4        5        6 
8.584146 8.451713 8.517907 8.564688 8.529912 8.584629 
# put the PH back into the data frame
studentevaluation$PH <- prediction

# average pH by brand
studentevaluation  |>
  group_by(Brand.Code)  |>
  summarise(`Average PH` = mean(PH))
# A tibble: 4 × 2
  Brand.Code `Average PH`
  <fct>             <dbl>
1 A                  8.50
2 B                  8.57
3 C                  8.42
4 D                  8.59
write.xlsx(list('PH' = prediction, 'EvalData_complete' = studentevaluation), file = 'predictions_Henock_M.xlsx')

Summary

The pH values appear relatively consistent, generally falling within a narrow range of roughly 8 to 9. Despite this narrow range, distinct patterns emerge when the data is grouped by Brand.Code, and these patterns are reflected in our model's predictions. We also identified and emphasized the variables that have the strongest influence on pH levels. Our hope is that these insights contribute to a better understanding of the manufacturing process, particularly in light of new regulatory standards in the beverage industry.

Among the models evaluated, the Random Forest model demonstrated the best overall performance, achieving the highest R-squared and an RMSE essentially tied with the Cubist model. This suggests it was the most effective at capturing the complexity of the data. Its ensemble structure, which averages many decision trees, also makes it robust to noisy predictors and to nonlinear relationships that simpler models struggle to capture.