This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
We uploaded the provided data files to GitHub so that they are easier to access. Blank values are read in as NA so that missing values can be imputed later.
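# Packages used throughout this report (openxlsx is assumed for write.xlsx())
library(tidyverse); library(knitr); library(kableExtra); library(corrplot)
library(mice); library(caret); library(randomForest); library(openxlsx)
# Blank fields and the string "NA" are both read in as missing values
df_StudentData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentData.csv',
na.strings = c("", "NA"))
df_EvalData <- read.csv('https://raw.githubusercontent.com/johnm1990/DATA624/main/StudentEvaluation.csv',
na.strings = c("", "NA"))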
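As a small illustration of how `na.strings` turns blank fields into missing values, the sketch below reads an inline CSV in which one brand code is blank. The data and the name `toy_csv` are made up for illustration; only `read.csv()` and its `na.strings` argument come from the report.

```r
# Hypothetical three-row CSV; the Brand.Code field in the second row is blank
toy_csv <- "Brand.Code,PH
B,8.36
,8.26
A,NA"

# Treat both empty strings and the literal "NA" as missing
toy <- read.csv(text = toy_csv, na.strings = c("", "NA"))

# Count missing values per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

Without `""` in `na.strings`, the blank brand code would be kept as an empty string and would not be picked up by the imputation step later on.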
The training data has 33 variables and 2,571 observations. One variable, Brand.Code, is categorical, while the others are numeric (integer or double). PH is centered around 8.5. Some variables seem to be skewed, and it would be best to center and scale them later on.
glimpse(df_StudentData)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B~
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.~
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23~
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333~
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2~
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141~
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,~
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12~
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14~
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100~
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124~
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0~
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86~
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118~
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, ~
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4~
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.~
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3~
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90~
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74~
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2~
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4~
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38~
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0~
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12~
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0~
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146~
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52~
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34~
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44~
summary(df_StudentData) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px")
Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | PH | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Length:2571 | Min. :5.040 | Min. :23.63 | Min. :0.07933 | Min. :57.00 | Min. :128.6 | Min. :0.00200 | Min. :0.0000 | Min. :0.00000 | Min. :-100.20 | Min. :105.6 | Min. :34.60 | Min. :-0.80 | Min. : 0.00 | Min. :-1.20 | Min. : 52.00 | Min. : 55.8 | Min. : 998 | Min. :63.60 | Min. :12.08 | Min. : 26 | Min. :0.240 | Min. : 31.4 | Min. :-0.170 | Min. :-6.600 | Min. :7.880 | Min. :0.00240 | Min. : 70.0 | Min. :44.00 | Min. :140.8 | Min. :5.280 | Min. :4.960 | Min. :0.00 | |
Class :character | 1st Qu.:5.293 | 1st Qu.:23.92 | 1st Qu.:0.23917 | 1st Qu.:65.60 | 1st Qu.:138.4 | 1st Qu.:0.04800 | 1st Qu.:0.1000 | 1st Qu.:0.02000 | 1st Qu.:-100.00 | 1st Qu.:119.0 | 1st Qu.:46.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 86.00 | 1st Qu.: 98.3 | 1st Qu.:3888 | 1st Qu.:65.20 | 1st Qu.:18.36 | 1st Qu.:1144 | 1st Qu.:0.900 | 1st Qu.:706.3 | 1st Qu.: 1.496 | 1st Qu.:-5.600 | 1st Qu.:8.440 | 1st Qu.:0.02200 | 1st Qu.:100.0 | 1st Qu.:46.00 | 1st Qu.:142.2 | 1st Qu.:6.540 | 1st Qu.:5.340 | 1st Qu.:1.38 | |
Mode :character | Median :5.347 | Median :23.97 | Median :0.27133 | Median :68.20 | Median :140.8 | Median :0.07600 | Median :0.1800 | Median :0.04000 | Median : 65.20 | Median :123.2 | Median :46.40 | Median :11.40 | Median :28.60 | Median :27.60 | Median : 96.00 | Median :118.4 | Median :3982 | Median :65.60 | Median :21.79 | Median :3028 | Median :0.980 | Median :724.0 | Median : 1.648 | Median :-5.400 | Median :8.540 | Median :0.03340 | Median :120.0 | Median :46.00 | Median :142.6 | Median :6.560 | Median :5.400 | Median :1.48 | |
NA | Mean :5.370 | Mean :23.97 | Mean :0.27712 | Mean :68.19 | Mean :141.1 | Mean :0.08457 | Mean :0.1954 | Mean :0.05641 | Mean : 24.57 | Mean :122.6 | Mean :47.92 | Mean :12.44 | Mean :20.96 | Mean :20.46 | Mean : 96.29 | Mean :109.3 | Mean :3687 | Mean :65.97 | Mean :20.99 | Mean :2468 | Mean :1.174 | Mean :704.0 | Mean : 2.198 | Mean :-5.216 | Mean :8.546 | Mean :0.04684 | Mean :109.3 | Mean :47.62 | Mean :142.8 | Mean :6.897 | Mean :5.437 | Mean :2.05 | |
NA | 3rd Qu.:5.453 | 3rd Qu.:24.03 | 3rd Qu.:0.31200 | 3rd Qu.:70.60 | 3rd Qu.:143.8 | 3rd Qu.:0.11200 | 3rd Qu.:0.2600 | 3rd Qu.:0.08000 | 3rd Qu.: 140.80 | 3rd Qu.:125.4 | 3rd Qu.:50.00 | 3rd Qu.:20.20 | 3rd Qu.:34.60 | 3rd Qu.:33.40 | 3rd Qu.:102.00 | 3rd Qu.:120.0 | 3rd Qu.:3998 | 3rd Qu.:66.40 | 3rd Qu.:23.75 | 3rd Qu.:3186 | 3rd Qu.:1.620 | 3rd Qu.:731.0 | 3rd Qu.: 3.292 | 3rd Qu.:-5.000 | 3rd Qu.:8.680 | 3rd Qu.:0.06000 | 3rd Qu.:120.0 | 3rd Qu.:50.00 | 3rd Qu.:143.0 | 3rd Qu.:7.240 | 3rd Qu.:5.540 | 3rd Qu.:3.14 | |
NA | Max. :5.700 | Max. :24.32 | Max. :0.47800 | Max. :79.40 | Max. :154.0 | Max. :0.27000 | Max. :0.6200 | Max. :0.24000 | Max. : 229.40 | Max. :140.2 | Max. :60.40 | Max. :58.00 | Max. :59.40 | Max. :50.00 | Max. :142.00 | Max. :161.2 | Max. :4030 | Max. :76.20 | Max. :25.90 | Max. :5104 | Max. :1.920 | Max. :868.6 | Max. : 4.012 | Max. :-3.600 | Max. :9.360 | Max. :0.40000 | Max. :140.0 | Max. :52.00 | Max. :148.2 | Max. :8.620 | Max. :6.060 | Max. :3.66 | |
NA | NA’s :10 | NA’s :38 | NA’s :39 | NA’s :27 | NA’s :26 | NA’s :33 | NA’s :23 | NA’s :39 | NA’s :2 | NA’s :32 | NA’s :22 | NA’s :11 | NA’s :15 | NA’s :15 | NA’s :30 | NA’s :20 | NA’s :57 | NA’s :14 | NA’s :5 | NA’s :2 | NA’s :1 | NA’s :212 | NA’s :1 | NA | NA’s :4 | NA’s :12 | NA’s :2 | NA’s :12 | NA | NA’s :9 | NA’s :10 | NA’s :1 |
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = "free") +
ggtitle("Histograms of Numerical Predictors")
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
Brand.Code is the only categorical variable and takes the values A, B, C, or D. It is best converted to a factor. B seems to be the most common brand code, accounting for nearly half of the observations.
df_StudentData$Brand.Code <- as.factor(df_StudentData$Brand.Code)
df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)
df_StudentData %>%
ggplot() +
geom_bar(aes(x = Brand.Code)) +
ggtitle("Distribution of the Brand Codes")
31 of the 33 variables have missing values. MFR is missing for roughly 8.25% of the observations (212 of 2,571). Brand.Code has 120 missing values.
df_StudentData %>%
summarise_all(list(~ sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0) %>%
arrange(-value) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px")
variable | value |
---|---|
MFR | 212 |
Brand.Code | 120 |
Filler.Speed | 57 |
PC.Volume | 39 |
PSC.CO2 | 39 |
Fill.Ounces | 38 |
PSC | 33 |
Carb.Pressure1 | 32 |
Hyd.Pressure4 | 30 |
Carb.Pressure | 27 |
Carb.Temp | 26 |
PSC.Fill | 23 |
Fill.Pressure | 22 |
Filler.Level | 20 |
Hyd.Pressure2 | 15 |
Hyd.Pressure3 | 15 |
Temperature | 14 |
Oxygen.Filler | 12 |
Pressure.Setpoint | 12 |
Hyd.Pressure1 | 11 |
Carb.Volume | 10 |
Carb.Rel | 10 |
Alch.Rel | 9 |
Usage.cont | 5 |
PH | 4 |
Mnf.Flow | 2 |
Carb.Flow | 2 |
Bowl.Setpoint | 2 |
Density | 1 |
Balling | 1 |
Balling.Lvl | 1 |
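The counts above can also be expressed as percentages, which is how the 8.25% figure for MFR is obtained. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical); the last line reproduces the MFR percentage from the counts reported in the table.

```r
# Toy data frame standing in for df_StudentData (values are made up)
toy <- data.frame(
  MFR  = c(725.0, NA, 735.0, NA),
  PH   = c(8.36, 8.26, NA, 8.24),
  Temp = c(66.0, 67.6, 67.0, 65.6)
)

# Share of missing values per column, as a percentage
pct_missing <- colMeans(is.na(toy)) * 100

# Keep only columns with at least one missing value, largest first
pct_missing <- sort(pct_missing[pct_missing > 0], decreasing = TRUE)
print(round(pct_missing, 2))

# The MFR figure quoted in the report: 212 missing out of 2,571 rows
mfr_pct <- 212 / 2571 * 100
print(round(mfr_pct, 2))
```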
df_StudentData %>%
summarise_all(list(~is.na(.)))%>%
pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values=c("grey","red"))
Next, we imputed the missing values using mice() from the mice package. We also excluded any near-zero-variance predictors; in this case, only Hyd.Pressure1 was removed.
set.seed(100)
df_StudentData <- mice(df_StudentData, m = 1, method = 'pmm', print = FALSE) %>% complete()
# filtering low frequencies
df_StudentData <- df_StudentData[, -nearZeroVar(df_StudentData)]
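To make the near-zero-variance filter more concrete, here is a simplified base R sketch of the idea. The helper `is_near_zero_var` is hypothetical and only approximates what `nearZeroVar()` does; the thresholds are assumed to match caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`). It also drops columns by name, because a negative index built from an empty result would remove every column.

```r
# Simplified stand-in for caret::nearZeroVar(); `is_near_zero_var` is a
# hypothetical helper, not a caret function
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)          # constant column
  freq_ratio <- counts[[1]] / counts[[2]]        # most common vs. runner-up
  pct_unique <- 100 * length(counts) / length(x) # distinct values as % of rows
  freq_ratio > freq_cut && pct_unique < unique_cut
}

set.seed(100)
toy <- data.frame(
  mostly_zero = c(rep(0, 97), 1, 2, 3),  # dominated by a single value
  varied      = rnorm(100)
)

# Names of the columns flagged as near-zero variance
nzv_cols <- names(toy)[vapply(toy, is_near_zero_var, logical(1))]

# Dropping by name stays safe even when no column is flagged
toy_clean <- toy[, setdiff(names(toy), nzv_cols), drop = FALSE]
print(nzv_cols)
```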
There are no more missing values and the transformations are complete for now.
df_StudentData %>%
summarise_all(list(~is.na(.)))%>%
pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
count(variables, missing) %>%
ggplot(aes(y = variables, x=n, fill = missing))+
geom_col(position = "fill") +
labs(title = "Proportion of Missing Values",
x = "Proportion") +
scale_fill_manual(values=c("grey","red"))
First, we split the data using an 80-20 split. Then we created various types of regression models that include linear regression, non-linear regression, and regression trees.
set.seed(100)
# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)
# train
train_x <- df_StudentData[index, ] %>% select(-PH)
train_y <- df_StudentData[index, 'PH']
# test
test_x <- df_StudentData[-index, ] %>% select(-PH)
test_y <- df_StudentData[-index, 'PH']
set.seed(100)
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl)
lmPred <- predict(lmModel, test_x)
postResample(lmPred, test_y)
## RMSE Rsquared MAE
## 0.1327383 0.4079061 0.1018436
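The three numbers reported for each model can be reproduced by hand. The sketch below is a base R approximation of what `postResample()` returns for a regression problem, assuming R squared is the squared correlation between predictions and observations; `regression_metrics` is a hypothetical helper and the four values are made up.

```r
# Hypothetical helper mirroring the three metrics reported for each model
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

# Made-up observed and predicted PH values
obs  <- c(8.36, 8.26, 8.94, 8.24)
pred <- c(8.40, 8.30, 8.80, 8.30)

metrics <- regression_metrics(pred, obs)
print(round(metrics, 4))
```

RMSE penalizes large errors more heavily than MAE, which is why both are worth reading alongside R squared when comparing the models below.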
set.seed(100)
plsTune <- train(train_x, train_y,
method = "pls",
tuneLength = 20, trControl = ctrl,
preProc = c("center", "scale"))
plsPred <- predict(plsTune, test_x)
postResample(plsPred, test_y)
## RMSE Rsquared MAE
## 0.1328236 0.4072428 0.1021100
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y)
## RMSE Rsquared MAE
## 0.12388891 0.48396139 0.09457591
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
# (Brand.Code is dropped first so the indices from findCorrelation() line up)
num_train_x <- train_x[, -1]
tooHigh <- findCorrelation(cor(num_train_x), cutoff = .75)
# removing 9 variables and the factored variable
train_x_nnet <- num_train_x[, -tooHigh]
test_x_nnet <- test_x[, -1][, -tooHigh]
# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
set.seed(100)
# tune
nnetTune <- train(train_x_nnet, train_y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 84851,
maxit = 500)
nnPred <- predict(nnetTune, test_x_nnet)
postResample(nnPred, test_y)
## RMSE Rsquared MAE
## 0.12524194 0.48150890 0.09465761
set.seed(100)
# tune
svmRTune <- train(train_x[, -1], train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRPred <- predict(svmRTune, test_x[, -1])
postResample(svmRPred, test_y)
## RMSE Rsquared MAE
## 0.12516357 0.48167776 0.09019808
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 0.1101787 0.5929795 0.0804736
set.seed(100)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 0.09988699 0.67694042 0.07159998
set.seed(100)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.10036568 0.66326235 0.07295523
Based on the test set results, the Random Forest model has the lowest RMSE (0.0999) and the highest R² (0.677), giving the best test set performance of the models we tried.
rbind(lm = postResample(lmPred, test_y),
pls = postResample(plsPred, test_y),
nn = postResample(nnPred, test_y),
mars = postResample(marsPred, test_y),
svmR = postResample(svmRPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## lm 0.13273832 0.4079061 0.10184356
## pls 0.13282358 0.4072428 0.10210997
## nn 0.12524194 0.4815089 0.09465761
## mars 0.12388891 0.4839614 0.09457591
## svmR 0.12516357 0.4816778 0.09019808
## randomForest 0.09988699 0.6769404 0.07159998
## boosted 0.11017872 0.5929795 0.08047360
## cubist 0.10036568 0.6632624 0.07295523
Random Forest was chosen as the best model.
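%IncMSE is the permutation-based importance (the regression counterpart of Mean Decrease Accuracy): it shows how much the out-of-bag mean squared error increases when the values of that variable are randomly permuted. IncNodePurity, on the other hand, is the total decrease in node impurity from splits on the variable, averaged over all trees; for a regression forest such as this one, impurity is measured by the residual sum of squares rather than the Gini index.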
Based on %IncMSE, Brand.Code seems to be the most important variable, and Mnf.Flow is the second most important. When IncNodePurity is used instead, Mnf.Flow is considered the most important.
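The idea behind permutation importance can be sketched in a few lines of base R. This is a simplified illustration on simulated data with a linear model, not what randomForest does internally (the package works tree by tree on out-of-bag samples and normalizes the result); the variable names and coefficients are made up.

```r
set.seed(100)
n <- 500

# Simulated data: `strong` drives the response, `noise` does not
toy <- data.frame(strong = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$strong + rnorm(n, sd = 0.5)

# Hold out the last 100 rows so importance is measured on unseen data
train <- toy[1:400, ]
test  <- toy[401:500, ]
fit   <- lm(y ~ strong + noise, data = train)

mse <- function(model, data) mean((data$y - predict(model, data))^2)
baseline <- mse(fit, test)

# Percent increase in MSE after shuffling one predictor at a time
perm_importance <- sapply(c("strong", "noise"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  100 * (mse(fit, shuffled) - baseline) / baseline
})
print(round(perm_importance, 1))
```

A predictor the model relies on produces a large increase in error when shuffled, while an irrelevant one leaves the error almost unchanged, which is also why values near zero or slightly negative can appear at the bottom of an importance table.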
rfImp <- varImp(rfModel, scale = TRUE) %>%
as.data.frame()
rfImp %>%
arrange(-Overall) %>%
kable() %>%
kable_styling() %>%
scroll_box()
Variable | Overall |
---|---|
Brand.Code | 58.8442640 |
Mnf.Flow | 56.0374329 |
Pressure.Vacuum | 50.1349324 |
Oxygen.Filler | 47.6382521 |
Air.Pressurer | 44.9762020 |
Usage.cont | 42.8680517 |
Carb.Rel | 39.3444174 |
Temperature | 39.2223143 |
Balling.Lvl | 38.8899066 |
Carb.Flow | 34.0526289 |
Filler.Speed | 33.9436447 |
Alch.Rel | 32.9337260 |
Balling | 30.4978940 |
Density | 30.2357809 |
Carb.Pressure1 | 29.0662078 |
Bowl.Setpoint | 28.2376748 |
Filler.Level | 27.2386656 |
Hyd.Pressure3 | 23.7186542 |
Carb.Volume | 21.8981787 |
Hyd.Pressure4 | 19.1261479 |
Hyd.Pressure2 | 18.7971318 |
MFR | 17.5430425 |
Fill.Pressure | 17.4819435 |
PC.Volume | 16.0625502 |
Pressure.Setpoint | 14.6211954 |
Carb.Pressure | 5.6790446 |
Fill.Ounces | 3.7618006 |
Carb.Temp | 3.5644779 |
PSC.CO2 | 2.3435247 |
PSC.Fill | -0.7138685 |
PSC | -1.3143510 |
varImpPlot(rfModel, sort = TRUE, n.var = 10)
Setting aside Brand.Code, the variables Mnf.Flow, Usage.cont, Air.Pressurer, and Temperature are negatively correlated with PH. Mnf.Flow and Usage.cont seem to affect it the most negatively. Oxygen.Filler, Pressure.Vacuum, and Carb.Rel seem to affect PH the most positively.
top10 <- varImp(rfModel) %>%
filter(Overall < 57) %>%
arrange(-Overall) %>%
head(10)
df_StudentData %>%
select(c("PH", row.names(top10))) %>%
cor() %>%
# corrplot() is a base graphics call, so the title is passed as an argument
corrplot(title = "Correlation between PH and the Top 10 Numerical Variables",
mar = c(0, 0, 2, 0))
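The sign and strength of each relationship can also be read directly from the PH column of the correlation matrix. Below is a minimal sketch on simulated data; `x_neg` and `x_pos` are hypothetical predictors with arbitrarily chosen coefficients, so the numbers say nothing about the real process.

```r
set.seed(100)
n <- 200

# Simulated predictors: one pushes PH down, the other pushes it up
x_neg <- rnorm(n)
x_pos <- rnorm(n)
toy <- data.frame(
  PH    = 8.5 - 0.10 * x_neg + 0.05 * x_pos + rnorm(n, sd = 0.05),
  x_neg = x_neg,
  x_pos = x_pos
)

# Correlation of every variable with PH, dropping PH itself
ph_cor <- cor(toy, use = "pairwise.complete.obs")[, "PH"]
ph_cor <- sort(ph_cor[names(ph_cor) != "PH"])  # most negative first
print(round(ph_cor, 2))
```

Sorting the vector puts the most negatively correlated predictors first and the most positively correlated ones last. These are pairwise linear correlations, so they describe association rather than the effect a variable has inside the Random Forest.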
Based on the transformed Brand.Code, PH tends to be the lowest for observations labeled “C” and the highest for those labeled “D”. “B” accounts for nearly half of the data and has the second highest PH on average.
df_StudentData %>%
ggplot(aes(x= PH)) +
geom_histogram(bins = 15) +
facet_wrap(~Brand.Code, ncol = 1)
df_StudentData %>%
ggplot(aes(x= PH)) +
geom_boxplot() +
facet_wrap(~Brand.Code, ncol = 1)
df_StudentData %>%
group_by(Brand.Code) %>%
summarise(`Average PH` = mean(PH))
## # A tibble: 4 x 2
## Brand.Code `Average PH`
## <fct> <dbl>
## 1 A 8.50
## 2 B 8.56
## 3 C 8.42
## 4 D 8.60
The evaluation data has to be transformed in the same way as the training data before it can be used for prediction.
set.seed(100)
df_EvalData <- df_EvalData %>%
select(-PH) %>%
mice(., m = 1, method = 'pmm', print = FALSE) %>% complete()
# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
df_EvalData <- df_EvalData %>%
select(-Hyd.Pressure1) %>%
mutate(PH = "")
# predict PH
prediction <- predict(rfModel, df_EvalData)
head(prediction)
## 1 2 3 4 5 6
## 8.522206 8.474750 8.525824 8.612244 8.503590 8.550513
# put the PH back into the data frame
df_EvalData$PH <- prediction
#average ph
df_EvalData %>%
group_by(Brand.Code) %>%
summarise(`Average PH` = mean(PH))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## Brand.Code `Average PH`
## <fct> <dbl>
## 1 A 8.51
## 2 B 8.57
## 3 C 8.43
## 4 D 8.59
# export file
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData), file = 'predictions_DJO.xlsx')
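Before predicting on new data, it can help to confirm that the evaluation set has the same predictors and factor levels as the training set, since a mismatch is a common source of prediction errors. The following is a base R sketch on made-up data frames (`toy_train` and `toy_eval` are hypothetical); it is one possible check, not the procedure used in this report.

```r
# Toy stand-ins for the training predictors and the evaluation set
toy_train <- data.frame(Brand.Code = factor(c("A", "B", "C")),
                        Mnf.Flow   = c(-100, 65, 140))
toy_eval <- data.frame(Mnf.Flow      = c(120, -100),
                       Hyd.Pressure1 = c(0, 11.4),
                       Brand.Code    = c("B", "A"))

# Reuse the training factor levels so both sets encode brands identically
toy_eval$Brand.Code <- factor(toy_eval$Brand.Code,
                              levels = levels(toy_train$Brand.Code))

# Fail early if a training predictor is missing from the evaluation set
stopifnot(all(names(toy_train) %in% names(toy_eval)))

# Keep only the training predictors, in the training order
eval_x <- toy_eval[, names(toy_train), drop = FALSE]
str(eval_x)
```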
The predicted PH values may look similar, as they all fall between 8 and 9, but the pattern across brands seen in the training data still holds when the predictions are grouped by Brand.Code. We also highlighted the variables that have the most effect on PH. We hope that a better understanding of the manufacturing process helps in meeting the new regulations in the beverage industry.
The Random Forest model captured the complexity of the data best, with the lowest test set RMSE and the highest R² of the models we tried. As an ensemble of many decision trees, it can pick up non-linear relationships and interactions among the predictors, and it is computationally efficient relative to several of the other models.