library(tsibble)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(openxlsx)
library(kableExtra)
library(purrr)DATA 624 Project 2
Instructions
Project #2 (Team) Assignment
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of pH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Loading the data and replacing blanks with NA.
studentsdata <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentData.xlsx", na.strings = c("", NA))
studentevaluation <- read.xlsx("https://raw.githubusercontent.com/henock93/Data_Science/main/StudentEvaluation.xlsx", na.strings = c("", NA))Exploratory Data Analysis
library(ggplot2)
glimpse(studentsdata)Rows: 2,571
Columns: 33
$ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
$ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
$ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
$ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
$ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
$ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
$ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
$ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
$ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
$ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
$ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
$ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
$ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Hyd.Pressure4 <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
$ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
$ Filler.Speed <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
$ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
$ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
$ Carb.Flow <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
$ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
$ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
$ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
$ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
$ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
$ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
$ Bowl.Setpoint <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
$ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
$ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
$ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
$ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
$ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
summary(studentsdata) |>
kable() |>
kable_styling() |>
scroll_box(width = "100%", height = "200px")| Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | PH | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:2571 | Min. :5.040 | Min. :23.63 | Min. :0.07933 | Min. :57.00 | Min. :128.6 | Min. :0.00200 | Min. :0.0000 | Min. :0.00000 | Min. :-100.20 | Min. :105.6 | Min. :34.60 | Min. :-0.80 | Min. : 0.00 | Min. :-1.20 | Min. : 52.00 | Min. : 55.8 | Min. : 998 | Min. :63.60 | Min. :12.08 | Min. : 26 | Min. :0.240 | Min. : 31.4 | Min. :-0.170 | Min. :-6.600 | Min. :7.880 | Min. :0.00240 | Min. : 70.0 | Min. :44.00 | Min. :140.8 | Min. :5.280 | Min. :4.960 | Min. :0.00 | |
| Class :character | 1st Qu.:5.293 | 1st Qu.:23.92 | 1st Qu.:0.23917 | 1st Qu.:65.60 | 1st Qu.:138.4 | 1st Qu.:0.04800 | 1st Qu.:0.1000 | 1st Qu.:0.02000 | 1st Qu.:-100.00 | 1st Qu.:119.0 | 1st Qu.:46.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 86.00 | 1st Qu.: 98.3 | 1st Qu.:3888 | 1st Qu.:65.20 | 1st Qu.:18.36 | 1st Qu.:1144 | 1st Qu.:0.900 | 1st Qu.:706.3 | 1st Qu.: 1.496 | 1st Qu.:-5.600 | 1st Qu.:8.440 | 1st Qu.:0.02200 | 1st Qu.:100.0 | 1st Qu.:46.00 | 1st Qu.:142.2 | 1st Qu.:6.540 | 1st Qu.:5.340 | 1st Qu.:1.38 | |
| Mode :character | Median :5.347 | Median :23.97 | Median :0.27133 | Median :68.20 | Median :140.8 | Median :0.07600 | Median :0.1800 | Median :0.04000 | Median : 65.20 | Median :123.2 | Median :46.40 | Median :11.40 | Median :28.60 | Median :27.60 | Median : 96.00 | Median :118.4 | Median :3982 | Median :65.60 | Median :21.79 | Median :3028 | Median :0.980 | Median :724.0 | Median : 1.648 | Median :-5.400 | Median :8.540 | Median :0.03340 | Median :120.0 | Median :46.00 | Median :142.6 | Median :6.560 | Median :5.400 | Median :1.48 | |
| NA | Mean :5.370 | Mean :23.97 | Mean :0.27712 | Mean :68.19 | Mean :141.1 | Mean :0.08457 | Mean :0.1954 | Mean :0.05641 | Mean : 24.57 | Mean :122.6 | Mean :47.92 | Mean :12.44 | Mean :20.96 | Mean :20.46 | Mean : 96.29 | Mean :109.3 | Mean :3687 | Mean :65.97 | Mean :20.99 | Mean :2468 | Mean :1.174 | Mean :704.0 | Mean : 2.198 | Mean :-5.216 | Mean :8.546 | Mean :0.04684 | Mean :109.3 | Mean :47.62 | Mean :142.8 | Mean :6.897 | Mean :5.437 | Mean :2.05 | |
| NA | 3rd Qu.:5.453 | 3rd Qu.:24.03 | 3rd Qu.:0.31200 | 3rd Qu.:70.60 | 3rd Qu.:143.8 | 3rd Qu.:0.11200 | 3rd Qu.:0.2600 | 3rd Qu.:0.08000 | 3rd Qu.: 140.80 | 3rd Qu.:125.4 | 3rd Qu.:50.00 | 3rd Qu.:20.20 | 3rd Qu.:34.60 | 3rd Qu.:33.40 | 3rd Qu.:102.00 | 3rd Qu.:120.0 | 3rd Qu.:3998 | 3rd Qu.:66.40 | 3rd Qu.:23.75 | 3rd Qu.:3186 | 3rd Qu.:1.620 | 3rd Qu.:731.0 | 3rd Qu.: 3.292 | 3rd Qu.:-5.000 | 3rd Qu.:8.680 | 3rd Qu.:0.06000 | 3rd Qu.:120.0 | 3rd Qu.:50.00 | 3rd Qu.:143.0 | 3rd Qu.:7.240 | 3rd Qu.:5.540 | 3rd Qu.:3.14 | |
| NA | Max. :5.700 | Max. :24.32 | Max. :0.47800 | Max. :79.40 | Max. :154.0 | Max. :0.27000 | Max. :0.6200 | Max. :0.24000 | Max. : 229.40 | Max. :140.2 | Max. :60.40 | Max. :58.00 | Max. :59.40 | Max. :50.00 | Max. :142.00 | Max. :161.2 | Max. :4030 | Max. :76.20 | Max. :25.90 | Max. :5104 | Max. :1.920 | Max. :868.6 | Max. : 4.012 | Max. :-3.600 | Max. :9.360 | Max. :0.40000 | Max. :140.0 | Max. :52.00 | Max. :148.2 | Max. :8.620 | Max. :6.060 | Max. :3.66 | |
| NA | NA's :10 | NA's :38 | NA's :39 | NA's :27 | NA's :26 | NA's :33 | NA's :23 | NA's :39 | NA's :2 | NA's :32 | NA's :22 | NA's :11 | NA's :15 | NA's :15 | NA's :30 | NA's :20 | NA's :57 | NA's :14 | NA's :5 | NA's :2 | NA's :1 | NA's :212 | NA's :1 | NA | NA's :4 | NA's :12 | NA's :2 | NA's :12 | NA | NA's :9 | NA's :10 | NA's :1 |
studentsdata |>
keep(is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = "free") +
ggtitle("Histograms of Numerical Predictors")studentsdata |>
keep(is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")Comment: The training dataset contains 33 variables and 2,571 observations. Among these, Brand.Code is a categorical variable, while the remaining variables are numeric, primarily integers. The pH variable is centered around a value of 8.5. Some variables exhibit skewness, so it would be beneficial to apply centering and scaling during preprocessing.
Data processing
studentsdata$Brand.Code <- as.factor(studentsdata$Brand.Code)
studentevaluation$Brand.Code <- as.factor(studentevaluation$Brand.Code)
studentsdata |>
ggplot() +
geom_bar(aes(x = Brand.Code)) +
ggtitle("Distribution of the Brand Codes")Comment: Brand.code is the only categorical variable in the dataset, with possible values of A, B, C, or D. It was converted into a factor for analysis purpose. B appears to be the most frequent, representing nearly half of all observations.
Identifying missing data
studentsdata |>
summarise_all(list(~ sum(is.na(.)))) |>
gather(variable, value) |>
filter(value != 0) |>
arrange(-value) |>
kable() |>
kable_styling() |>
scroll_box(width = "100%", height = "200px")| variable | value |
|---|---|
| MFR | 212 |
| Brand.Code | 120 |
| Filler.Speed | 57 |
| PC.Volume | 39 |
| PSC.CO2 | 39 |
| Fill.Ounces | 38 |
| PSC | 33 |
| Carb.Pressure1 | 32 |
| Hyd.Pressure4 | 30 |
| Carb.Pressure | 27 |
| Carb.Temp | 26 |
| PSC.Fill | 23 |
| Fill.Pressure | 22 |
| Filler.Level | 20 |
| Hyd.Pressure2 | 15 |
| Hyd.Pressure3 | 15 |
| Temperature | 14 |
| Oxygen.Filler | 12 |
| Pressure.Setpoint | 12 |
| Hyd.Pressure1 | 11 |
| Carb.Volume | 10 |
| Carb.Rel | 10 |
| Alch.Rel | 9 |
| Usage.cont | 5 |
| PH | 4 |
| Mnf.Flow | 2 |
| Carb.Flow | 2 |
| Bowl.Setpoint | 2 |
| Density | 1 |
| Balling | 1 |
| Balling.Lvl | 1 |
studentsdata |>
summarise_all(list(~is.na(.))) |>
pivot_longer(everything(), names_to = "variables", values_to="missing") |>
count(variables, missing) |>
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values=c("grey","blue"))Comment: Overall, 30 over 33 of the variables have missing values with MFR missing the most.
Data Transformation
library(mice)
library(caret)
set.seed(001)
studentsdata <- mice(studentsdata, m = 1, method = 'pmm', print = FALSE) |>
complete()
# filtering low frequencies
studentsdata <- studentsdata[, -nearZeroVar(studentsdata)]
studentsdata |>
summarise_all(list(~is.na(.))) |>
pivot_longer(everything(), names_to = "variables", values_to="missing") |>
count(variables, missing) |>
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values=c("grey","blue"))Comment: A data imputation was performed using the mice() function from the MICE package. Additionally, we removed any predictors with near-zero variance—only Hyd.Pessure1 met this criterion and was excluded from the dataset. There are no more missing values and the transformations are complete for now.
Model building
set.seed(002)
# index for training
index <- createDataPartition(studentsdata$PH, p = .8, list = FALSE)
# train
train_x <- studentsdata[index, ] |>
select(-PH)
train_y <- studentsdata[index, 'PH']
# test
test_x <- studentsdata[-index, ] |>
select(-PH)
test_y <- studentsdata[-index, 'PH']- LM
set.seed(200)
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl)
lmPred <- predict(lmModel, test_x)
postResample(lmPred, test_y) RMSE Rsquared MAE
0.1344341 0.4141462 0.1048348
- PLS
set.seed(201)
plsTune <- train(train_x, train_y,
method = "pls",
tuneLength = 20, trControl = ctrl,
preProc = c("center", "scale"))
plsPred <- predict(plsTune, test_x)
postResample(plsPred, test_y) RMSE Rsquared MAE
0.1349151 0.4098758 0.1050237
- MARS
library(earth)
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(202)
# tune
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y) RMSE Rsquared MAE
0.12751890 0.47288667 0.09764549
- NNET
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <- findCorrelation(cor(train_x[, -1]), cutoff = .75)
# removing 9 variables and the factored variable
train_x_nnet <- train_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]
# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
set.seed(203)
# tune
nnetTune <- train(train_x_nnet, train_y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 84851,
maxit = 500)
nnPred <- predict(nnetTune, test_x_nnet)
postResample(nnPred, test_y) RMSE Rsquared MAE
0.12987810 0.45389272 0.09936768
- SVM
set.seed(204)
# tune
svmRTune <- train(train_x[, -1], train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRPred <- predict(svmRTune, test_x[, -1])
postResample(svmRPred, test_y) RMSE Rsquared MAE
0.12952432 0.46907528 0.09122625
- Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(205)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y) RMSE Rsquared MAE
0.10510880 0.64255828 0.07863323
- Random forest
library(randomForest)
set.seed(006)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y) RMSE Rsquared MAE
0.09855003 0.71000923 0.07252539
- Cubist
set.seed(1207)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y) RMSE Rsquared MAE
0.09518817 0.70991533 0.06945696
- SVM
set.seed(208)
# tune
svmRTune <- train(train_x[, -1], train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRPred <- predict(svmRTune, test_x[, -1])
postResample(svmRPred, test_y) RMSE Rsquared MAE
0.13120264 0.45352417 0.09265127
rbind(lm = postResample(lmPred, test_y),
pls = postResample(plsPred, test_y),
nn = postResample(nnPred, test_y),
mars = postResample(marsPred, test_y),
svmR = postResample(svmRPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y)) RMSE Rsquared MAE
lm 0.13443415 0.4141462 0.10483479
pls 0.13491505 0.4098758 0.10502367
nn 0.12987810 0.4538927 0.09936768
mars 0.12751890 0.4728867 0.09764549
svmR 0.13120264 0.4535242 0.09265127
randomForest 0.09855003 0.7100092 0.07252539
boosted 0.10510880 0.6425583 0.07863323
cubist 0.09518817 0.7099153 0.06945696
Comment: From the results, the Random Forest and the Cubist models have the lowest RMSE and the highest R-squared, giving the best optimal resampling and test set performance. The Random Forest model will be used.
Model Evaluation
rfImp <- varImp(rfModel, scale = TRUE) |>
as.data.frame()
rfImp |>
arrange(-Overall) |>
kable() |>
kable_styling() |>
scroll_box()| Overall | |
|---|---|
| Brand.Code | 59.2126310 |
| Mnf.Flow | 58.0568360 |
| Pressure.Vacuum | 51.6717195 |
| Oxygen.Filler | 47.6872227 |
| Air.Pressurer | 42.1720430 |
| Usage.cont | 40.8280956 |
| Balling.Lvl | 38.9237499 |
| Temperature | 34.7631044 |
| Carb.Rel | 33.7910318 |
| Filler.Speed | 33.5709748 |
| Alch.Rel | 32.8438373 |
| Balling | 31.3666204 |
| Carb.Flow | 30.1705578 |
| Density | 29.7646731 |
| Carb.Pressure1 | 28.6769704 |
| Bowl.Setpoint | 27.0618028 |
| Hyd.Pressure3 | 26.7971502 |
| Filler.Level | 26.4589246 |
| Carb.Volume | 21.5943882 |
| MFR | 20.6224943 |
| Hyd.Pressure4 | 20.5112691 |
| PC.Volume | 19.6512612 |
| Hyd.Pressure2 | 18.7329164 |
| Fill.Pressure | 14.9007860 |
| Pressure.Setpoint | 13.5043671 |
| Carb.Pressure | 4.9377452 |
| Carb.Temp | 2.9409780 |
| Fill.Ounces | 2.8775095 |
| PSC | 0.7304188 |
| PSC.CO2 | 0.4561094 |
| PSC.Fill | -0.3723884 |
varImpPlot(rfModel, sort = TRUE, n.var = 10)Comment:
- %IncMSE (Percent Increase in Mean Squared Error) reflects the drop in model accuracy when a specific variable is excluded—higher values indicate greater importance. In contrast, IncNodePurity (Increase in Node Purity) measures variable importance based on the Gini impurity index, showing how much each variable contributes to improving the purity of decision tree nodes.
- According to the Mean Decrease in Accuracy, Brand.Code emerges as the most influential variable, followed by Mnf.Flow. However, when using the Gini Impurity Index as the criterion, Brand.Code is identified as the single most important variable.
library(corrplot)
top10 <- varImp(rfModel) |>
filter(Overall < 57) |>
arrange(-Overall) |>
head(10)
studentsdata |>
select(c("PH", row.names(top10))) |>
cor() |>
corrplot(title = "Correlation between pH and the Top 10 Numerical Variables",
mar = c(0,0,1,0))Comment: Pressure Vacuum, Oxygen FIller, Balling Lvl, Carb Rel and Alch Rel seem to positively affect the pH. On the other hand, Usage cont, and Temperature affect the pH negatively.
studentsdata |>
ggplot(aes(x = PH)) +
geom_histogram(bins = 15) +
facet_wrap(~Brand.Code, ncol = 1)studentsdata |>
ggplot(aes(x= PH)) +
geom_boxplot() +
facet_wrap(~Brand.Code, ncol = 1) studentsdata |>
group_by(Brand.Code) |>
summarise(`Average PH` = mean(PH))# A tibble: 4 × 2
Brand.Code `Average PH`
<fct> <dbl>
1 A 8.50
2 B 8.56
3 C 8.41
4 D 8.60
Comment: After transforming the Brand.Code, it appears that products labeled “C” are associated with the lowest average pH values, while those labeled “D” have the highest. Category “B” shows the second highest average PH.
Forecasting
library(mice)
set.seed(100)
studentevaluation <- studentevaluation |>
select(-PH) |>
mice(m = 1, method = 'pmm', print = FALSE) |>
complete()
# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
studentevaluation <- studentevaluation |>
select(-Hyd.Pressure1) |>
mutate(PH = "")
# predict PH
prediction <- predict(rfModel, studentevaluation)
head(prediction) 1 2 3 4 5 6
8.584146 8.451713 8.517907 8.564688 8.529912 8.584629
# put the PH back into the data frame
studentevaluation$PH <- prediction
#average ph
studentevaluation |>
group_by(Brand.Code) |>
summarise(`Average PH` = mean(PH))# A tibble: 4 × 2
Brand.Code `Average PH`
<fct> <dbl>
1 A 8.50
2 B 8.57
3 C 8.42
4 D 8.59
write.xlsx(list('PH' = prediction, 'EvalData_complete' = studentevaluation), file = 'predictions_Henock_M.xlsx')Summary
The PH values appear relatively consistent, generally falling within the range of 8 to 9. Despite this narrow range, distinct patterns emerge when the data is grouped by Brand.Code, and these patterns are reflected in our model’s predictions. I also identified and emphasized the variables that have the strongest influence on PH levels. My hope is that these insights contribute to a better understanding of the manufacturing process, particularly in light of new regulatory standards in the beverage industry.
Among the models evaluated, the Random Forest model demonstrated the best performance, achieving the highest R² and lowest RMSE. This suggests it was most effective at capturing the complexity of the data. Its ensemble structure—comprising multiple decision trees—also makes it computationally efficient compared to other modeling approaches.