library(DataExplorer)
library(dplyr)
library(psych)
library(ggplot2)
library(tidyverse)
library(janitor)
library(interactions)
library(kableExtra)
library(pracma)
library(randomForest)
library(caret)
library(partykit)
library(corrplot)
The goal of this project is to understand the beverage manufacturing process and the factors that predict pH, and to build a predictive model of PH.
library(dplyr)
data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/StudentData.csv", na.strings = c("", "NA"))
s.eval <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/StudentEvaluation.csv", na.strings = c("", "NA"))
head(data)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2 0.104
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6 0.124
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8 0.090
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6 NA
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8 0.026
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4 0.090
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.26 0.04 -100 118.8 46.0 0
## 2 0.22 0.04 -100 121.6 46.0 0
## 3 0.34 0.16 -100 120.2 46.0 0
## 4 0.42 0.04 -100 115.2 46.4 0
## 5 0.16 0.12 -100 118.4 45.8 0
## 6 0.24 0.04 -100 119.6 45.6 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 118 121.2 4002
## 2 NA NA 106 118.6 3986
## 3 NA NA 82 120.0 4020
## 4 0 0 92 117.8 4012
## 5 0 0 92 118.6 4010
## 6 0 0 116 120.2 4014
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36
## 2 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26
## 3 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94
## 4 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24
## 5 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26
## 6 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 120 46.4 142.6 6.58 5.32
## 2 0.026 120 46.8 143.0 6.56 5.30
## 3 0.024 120 46.6 142.0 7.66 5.84
## 4 0.030 120 46.0 146.2 7.14 5.42
## 5 0.030 120 46.0 146.2 7.14 5.44
## 6 0.024 120 46.0 146.6 7.16 5.44
## Balling.Lvl
## 1 1.48
## 2 1.56
## 3 3.28
## 4 3.04
## 5 3.04
## 6 3.02
There are 2571 observations of 33 variables. A preliminary summary shows missing values in most columns.
PH ranges from 7.88 to 9.36 with a mean of about 8.55.
summary(data)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
Column names were cleaned and standardized to make them more consistent and easier to work with in R:
- Spaces and special characters were removed.
- Carb.Pressure and Carb.Pressure1 were renamed to CarbPressure1 and CarbPressure2, following the naming of HydPressure1 through HydPressure4.
names(data) <- c("BrandCode", "CarbVolume","FillOunces", "PCVolume","CarbPressure1","CarbTemp", "PSC", "PSCFill", "PSCCO2", "MnfFlow", "CarbPressure2","FillPressure", "HydPressure1", "HydPressure2", "HydPressure3", "HydPressure4",
"FillerLevel", "FillerSpeed" , "Temperature", "Usagecont",
"CarbFlow" , "Density" , "MFR" , "Balling" ,
"PressureVacuum" , "PH" , "OxygenFiller" , "BowlSetpoint" ,
"PressureSetpoint", "AirPressurer", "AlchRel" , "CarbRel" ,
"BallingLvl" )
names(s.eval) <- c("BrandCode", "CarbVolume","FillOunces", "PCVolume","CarbPressure1","CarbTemp", "PSC", "PSCFill", "PSCCO2", "MnfFlow", "CarbPressure2","FillPressure", "HydPressure1", "HydPressure2", "HydPressure3", "HydPressure4",
"FillerLevel", "FillerSpeed" , "Temperature", "Usagecont",
"CarbFlow" , "Density" , "MFR" , "Balling" ,
"PressureVacuum" , "PH" , "OxygenFiller" , "BowlSetpoint" ,
"PressureSetpoint", "AirPressurer", "AlchRel" , "CarbRel" ,
"BallingLvl" )
The plot below shows the share of discrete versus continuous variables.
BrandCode is the only discrete variable; the rest are continuous.
Approximately 79% of the rows are complete, which means about 21% of the
observations are missing at least one value. We should be cautious when
dealing with missing values: it is better to impute them than to drop the
affected rows.
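As a quick numeric check of the completeness figures above, here is a minimal sketch using base R on the data frame loaded earlier:
# Proportion of fully complete rows (should be roughly 0.79)
mean(complete.cases(data))
# Proportion of rows missing at least one value
mean(!complete.cases(data))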
library(DataExplorer)
plot_intro(data)
844 cells contain missing data.
na_summary <- data.frame(variable = names(data),
                         na_count = colSums(is.na(data)),
                         percent_na = colMeans(is.na(data)) * 100) %>%
  arrange(desc(percent_na))
na_summary
## variable na_count percent_na
## MFR MFR 212 8.24581875
## BrandCode BrandCode 120 4.66744457
## FillerSpeed FillerSpeed 57 2.21703617
## PCVolume PCVolume 39 1.51691949
## PSCCO2 PSCCO2 39 1.51691949
## FillOunces FillOunces 38 1.47802412
## PSC PSC 33 1.28354726
## CarbPressure2 CarbPressure2 32 1.24465189
## HydPressure4 HydPressure4 30 1.16686114
## CarbPressure1 CarbPressure1 27 1.05017503
## CarbTemp CarbTemp 26 1.01127966
## PSCFill PSCFill 23 0.89459354
## FillPressure FillPressure 22 0.85569817
## FillerLevel FillerLevel 20 0.77790743
## HydPressure2 HydPressure2 15 0.58343057
## HydPressure3 HydPressure3 15 0.58343057
## Temperature Temperature 14 0.54453520
## OxygenFiller OxygenFiller 12 0.46674446
## PressureSetpoint PressureSetpoint 12 0.46674446
## HydPressure1 HydPressure1 11 0.42784909
## CarbVolume CarbVolume 10 0.38895371
## CarbRel CarbRel 10 0.38895371
## AlchRel AlchRel 9 0.35005834
## Usagecont Usagecont 5 0.19447686
## PH PH 4 0.15558149
## MnfFlow MnfFlow 2 0.07779074
## CarbFlow CarbFlow 2 0.07779074
## BowlSetpoint BowlSetpoint 2 0.07779074
## Density Density 1 0.03889537
## Balling Balling 1 0.03889537
## BallingLvl BallingLvl 1 0.03889537
## PressureVacuum PressureVacuum 0 0.00000000
## AirPressurer AirPressurer 0 0.00000000
na_summary %>%
summarize(total_na= sum(na_count))
## total_na
## 1 844
Every column has missing values except PressureVacuum and AirPressurer.
sapply(data, function(x) sum(is.na(x)))
## BrandCode CarbVolume FillOunces PCVolume
## 120 10 38 39
## CarbPressure1 CarbTemp PSC PSCFill
## 27 26 33 23
## PSCCO2 MnfFlow CarbPressure2 FillPressure
## 39 2 32 22
## HydPressure1 HydPressure2 HydPressure3 HydPressure4
## 11 15 15 30
## FillerLevel FillerSpeed Temperature Usagecont
## 20 57 14 5
## CarbFlow Density MFR Balling
## 2 1 212 1
## PressureVacuum PH OxygenFiller BowlSetpoint
## 0 4 12 2
## PressureSetpoint AirPressurer AlchRel CarbRel
## 12 0 9 10
## BallingLvl
## 1
Below is the percentage of missing data within each variable. The
MFR column has the highest percentage of missing data, but it accounts
for only about 8.25%. The amount of missingness is not large enough to
justify dropping observations, so we will impute the missing values later.
library(DataExplorer)
plot_missing(data,
missing_only = T)
data$BrandCode <- as.factor(data$BrandCode)
Variables with a good range and a clear distribution are more likely to show a relationship with the dependent variable. The following variables appear approximately normally distributed:
- Carb Pressure 1
- Carb Pressure 2
- Carb Temp
- Carb Volume
- Fill Ounces
- PC Volume
Variables that appear skewed will need to be transformed. The following variables appear right skewed:
- Hyd Pressure 1 (a significant share of values sit at 0)
- PSC
- PSC Fill
- PSC CO2
- Pressure Vacuum
- Temperature
- Oxygen Filler
The following variables appear left skewed:
- Hyd Pressure 2 (a significant share of values sit at 0)
- Hyd Pressure 3 (a significant share of values sit at 0)
- Hyd Pressure 4
- Mnf Flow (nearly half of the values sit at -100; the rest are continuous from 0 to 200)
- MFR
The following variables appear bimodal:
- Balling
- Balling Lvl
- Carb Rel
- Density
Highly multimodal variables can indicate that the process is run at a few distinct settings or conditions; the setpoint variables, Bowl Setpoint and Pressure Setpoint, show this pattern. Later we may want to convert these discrete variables into factors so that a model treats them as distinct operational settings. A quick numeric check of skewness follows, then the histograms.
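To back the visual impressions above with numbers, we can compute the sample skewness of each numeric column. This is a minimal sketch assuming the e1071 package (a dependency of caret) is available; positive values indicate right skew, negative values left skew.
library(e1071)
# Skewness of every numeric predictor, sorted from most left- to most right-skewed
num_cols <- data[, sapply(data, is.numeric)]
sort(sapply(num_cols, skewness, na.rm = TRUE))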
plot_histogram(data)
Below is a boxplot of each variable by BrandCode.
plot_boxplot(data, by = "BrandCode")
## Warning: Removed 302 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 372 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 50 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Almost half of the observations are brand B, while brands A and C have similar counts.
library(ggplot2)
ggplot(data,
aes(x = BrandCode, fill = BrandCode)) +
geom_bar() +
theme_minimal() +
theme(legend.position = "none") +
labs( title = "Brand Frequency",
x = "Brand",
y = "Count")
Brand C has the lowest PH while Brand D has the highest.
ggplot(data,
aes(x = BrandCode, y = PH, fill = BrandCode)) +
geom_boxplot() +
theme_minimal() +
theme(legend.position = "none") +
labs( title = "Boxplot of PH by Brand",
x = "Brand",
y = "PH")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
As noted earlier, highly multimodal variables can reflect distinct operational settings, so we convert the setpoint variables (along with BrandCode) to factors so that the models treat them as such.
train_data <- data %>%
mutate(
PressureSetpoint = as.factor(PressureSetpoint),
BowlSetpoint = as.factor(BowlSetpoint),
BrandCode = as.factor(BrandCode)
)
seval <- s.eval %>%
mutate(
PressureSetpoint = as.factor(PressureSetpoint),
BowlSetpoint = as.factor(BowlSetpoint),
BrandCode = as.factor(BrandCode)
)
Summary of the preprocessing steps:
- PH, the target variable, has missing values. We drop any observations with missing PH rather than imputing them, since PH is what we will predict from the other features.
- HydPressure1 was dropped because it has near-zero variance.
- Missing predictor values are imputed with KNN imputation.
- A Box-Cox transformation is applied to standardize skewed predictors.
- Predictors are centered and scaled, since models such as KNN rely on distance metrics.
library(tidyr)
library(caret)
train_clean_data <- data %>%
drop_na(PH)
train_filtered <- train_clean_data[, -nearZeroVar(train_clean_data)]
KNN imputation was chosen over other imputation methods because it uses similar observations to estimate missing values. It works well for datasets with low missingness and correlated predictors, both of which apply here.
set.seed(1)
trainingRows <- createDataPartition(train_filtered$PH,
p = .70,
list= FALSE)
train1 <- train_filtered[trainingRows, ]
test1 <- train_filtered[-trainingRows, ]
library(forcats)
train1 <- train1%>%
mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")))
test1 <- test1%>%
mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")))
trans <- preProcess(train1 %>%
select(-PH),
method = c("center", "scale", "knnImpute", "BoxCox"))
transformed_data <- predict(trans, train1%>%
select(-PH))
train_final <- cbind(transformed_data, PH = train1$PH)
test_transformed <- predict(trans, test1%>%
select(-PH))
test_final <- cbind(test_transformed, PH = test1$PH)
The following variables are significantly positively correlated with PH: BowlSetpoint, FillerLevel, and CarbFlow.
The following variables are significantly negatively correlated with PH: MnfFlow, Usagecont, and FillPressure.
Linear regression does not require the predictor variables to be normally distributed, which suits this data since many of the variables are not normal. In this section we restrict which variables are fed into the model. Multicollinearity can distort the estimated coefficients and hurt the model's interpretability, and we cannot afford to lose either when reporting to our stakeholders.
To assess this, we separate the response variable from the predictors.
predictors <- data[, names(data) != "PH"]
response <- data$PH
ph_corr <- corr.test(x = predictors[,-1], y = response, "pairwise", "pearson")
correlations <- ph_corr$r[, 1]
p_values <- ph_corr$p[, 1]
ph_table <- data.frame(r= correlations, p= p_values)
ph_table %>%
arrange(desc(r))
## r p
## BowlSetpoint 0.34911980 2.063873e-74
## FillerLevel 0.32327031 3.917540e-63
## PressureVacuum 0.22053722 1.201970e-29
## OxygenFiller 0.16798960 1.234045e-17
## CarbRel 0.16414447 6.477984e-17
## CarbFlow 0.15737137 1.089641e-15
## AlchRel 0.14893852 3.609012e-14
## BallingLvl 0.10047517 3.395271e-07
## Density 0.07843973 6.938385e-05
## Balling 0.06527259 9.363280e-04
## CarbVolume 0.06349111 1.317168e-03
## CarbPressure1 0.05958033 2.665036e-03
## PCVolume 0.04580178 2.128180e-02
## CarbTemp 0.02887456 1.456400e-01
## MFR -0.00871805 6.721382e-01
## AirPressurer -0.01365873 4.891116e-01
## FillerSpeed -0.02456179 2.183784e-01
## PSCFill -0.03625652 6.748775e-02
## HydPressure1 -0.07398881 1.811517e-04
## PSCCO2 -0.07558459 1.424376e-04
## CarbPressure2 -0.07932723 6.378245e-05
## PSC -0.09003060 5.651603e-06
## FillOunces -0.09505837 1.677065e-06
## HydPressure4 -0.14153475 7.818363e-13
## Temperature -0.15824135 8.592733e-16
## HydPressure2 -0.20067834 1.352668e-24
## FillPressure -0.21343571 1.199531e-27
## HydPressure3 -0.24030003 7.593496e-35
## PressureSetpoint -0.31101906 2.010989e-58
## Usagecont -0.31841555 1.842646e-61
## MnfFlow -0.44697510 2.557184e-126
The correlation table r shows the strength and direction of each
variable's relationship with our response variable PH. The pairwise
calculations ignore NA values, which is useful given the missing data
in most columns.
Next, we compute correlations among the predictors (without the response) to identify pairwise redundancies and multicollinearity. We use findCorrelation() from the caret package with a cutoff of 0.7 (Lin, 99) as the threshold for flagging clusters.
pred_numerical <- predictors[, sapply(predictors, is.numeric)]
ppcor <- cor(pred_numerical, use = "pairwise.complete.obs")
findCorrelation(ppcor, cutoff = 0.7)
## [1] 23 14 31 21 29 13 30 16 17 5
The function returns column indices; below we map them back to the column names.
colnames(pred_numerical)[findCorrelation(ppcor, cutoff = 0.7)]
## [1] "Balling" "HydPressure3" "BallingLvl" "Density" "AlchRel"
## [6] "HydPressure2" "CarbRel" "FillerLevel" "FillerSpeed" "CarbTemp"
The named variables are flagged as members of highly correlated clusters above our threshold. We examine the correlation plot:
corrplot(ppcor, title ="Predictor v Predictor Correlation", outline = TRUE, type = "full")
The correlation plot shows that variables related to liquid density,
including CarbVolume, Density, Balling, BallingLvl, CarbRel, and AlchRel,
have pairwise correlations above 0.7. Though they are different
measurements, they all capture density and would be redundant in a linear
regression model. To preserve the model's interpretability, we will
double-check with the variance inflation factor, vif(), from the
car package.
Variables that measure flow are also flagged above the 0.7 threshold: HydPressure2, HydPressure3, FillerLevel, and FillerSpeed.
Previously, we showed a correlation plot of the data before imputation. Below is a correlation plot after imputation; it continues to reveal multicollinearity.
The same variables remain significantly positively correlated with PH (BowlSetpoint, FillerLevel, CarbFlow) and significantly negatively correlated with PH (MnfFlow, Usagecont, FillPressure).
library(corrplot)
corrplot(cor(train_final %>%
select(where(is.numeric)),
use = "pairwise.complete.obs"),
type = "lower",
order = "alphabet",
tl.cex = 0.6,
tl.col = "black")
After performing EDA and preprocessing, the following models were tested to predict pH: linear regression, ridge, elastic net, random forest, SVM, KNN, GBM, and Cubist.
Linear regression is simple, interpretable, and provides a good baseline for predicting pH. It can help us identify which factors have the strongest direct influence on the pH level of the beverages.
set.seed(1)
lm_model <- train(PH ~. ,
data = train_final,
method = "lm",
trControl = trainControl(method = "cv", number = 5)
)
summary(lm_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46079 -0.07923 0.01022 0.08736 0.46151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.487678 0.018169 467.146 < 2e-16 ***
## BrandCodeB 0.099086 0.027608 3.589 0.000341 ***
## BrandCodeC -0.053634 0.027320 -1.963 0.049779 *
## BrandCodeD 0.077452 0.017117 4.525 6.44e-06 ***
## BrandCodeUnknown 0.006268 0.030188 0.208 0.835545
## CarbVolume -0.018670 0.008444 -2.211 0.027152 *
## FillOunces -0.003858 0.003334 -1.157 0.247440
## PCVolume -0.007690 0.003754 -2.049 0.040653 *
## CarbPressure1 0.009216 0.011876 0.776 0.437849
## CarbTemp -0.005042 0.010756 -0.469 0.639298
## PSC -0.006057 0.003331 -1.819 0.069152 .
## PSCFill -0.002760 0.003249 -0.850 0.395706
## PSCCO2 -0.005118 0.003206 -1.596 0.110616
## MnfFlow -0.079799 0.006872 -11.611 < 2e-16 ***
## CarbPressure2 0.029535 0.003861 7.650 3.28e-14 ***
## FillPressure 0.010611 0.004495 2.361 0.018350 *
## HydPressure2 -0.023168 0.009223 -2.512 0.012090 *
## HydPressure3 0.055841 0.010908 5.119 3.41e-07 ***
## HydPressure4 -0.001821 0.004857 -0.375 0.707775
## FillerLevel -0.018972 0.009660 -1.964 0.049701 *
## FillerSpeed -0.004413 0.006205 -0.711 0.477086
## Temperature -0.018530 0.003714 -4.990 6.64e-07 ***
## Usagecont -0.025279 0.004035 -6.264 4.70e-10 ***
## CarbFlow 0.014260 0.004504 3.166 0.001570 **
## Density -0.013510 0.009782 -1.381 0.167414
## MFR 0.003196 0.005036 0.635 0.525667
## Balling -0.083285 0.015060 -5.530 3.68e-08 ***
## PressureVacuum -0.013057 0.004625 -2.823 0.004811 **
## OxygenFiller -0.013864 0.004443 -3.121 0.001835 **
## BowlSetpoint 0.053542 0.009924 5.395 7.76e-08 ***
## PressureSetpoint -0.018384 0.004588 -4.007 6.41e-05 ***
## AirPressurer -0.003677 0.003384 -1.086 0.277488
## AlchRel 0.008680 0.012071 0.719 0.472202
## CarbRel 0.002517 0.007070 0.356 0.721874
## BallingLvl 0.103335 0.018677 5.533 3.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1306 on 1764 degrees of freedom
## Multiple R-squared: 0.4359, Adjusted R-squared: 0.425
## F-statistic: 40.08 on 34 and 1764 DF, p-value: < 2.2e-16
plot(lm_model$finalModel)
The model explains about 40% of the variance of the test data. This is not ideal.
lm_pred <- predict(lm_model, test_final)
postResample(pred = lm_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.1337891 0.4042534 0.1037784
Below are the top predictors of the pH level of the beverage.
plot(varImp(lm_model), top = 10)
Earlier, the correlation plot showed that multicollinearity is present. The variance inflation factor (VIF) can help identify the variables that cause it, and we should drop high-VIF predictors.
library(car)
vif(lm_model$finalModel)
## BrandCodeB BrandCodeC BrandCodeD BrandCodeUnknown
## 20.034567 8.316839 5.709224 4.326955
## CarbVolume FillOunces PCVolume CarbPressure1
## 7.510812 1.162395 1.495236 14.931140
## CarbTemp PSC PSCFill PSCCO2
## 12.198951 1.158531 1.106716 1.073113
## MnfFlow CarbPressure2 FillPressure HydPressure2
## 4.978779 1.565286 2.127598 8.957494
## HydPressure3 HydPressure4 FillerLevel FillerSpeed
## 12.510643 2.474103 9.824225 4.017513
## Temperature Usagecont CarbFlow Density
## 1.450003 1.716648 2.136822 10.085922
## MFR Balling PressureVacuum OxygenFiller
## 3.387248 23.909408 2.255014 2.073959
## BowlSetpoint PressureSetpoint AirPressurer AlchRel
## 10.404015 2.216389 1.207473 15.344453
## CarbRel BallingLvl
## 5.264680 36.762426
BallingLvl has the highest VIF value, so we drop it for the next models.
train_final_reduced <- train_final %>%
select(-BallingLvl)
set.seed(2)
lm_model2 <- train(PH ~. ,
data = train_final_reduced,
method = "lm",
trControl = trainControl(method = "cv", number = 5)
)
summary(lm_model2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45149 -0.07895 0.00835 0.08803 0.50398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.545453 0.014993 569.967 < 2e-16 ***
## BrandCodeB 0.011594 0.022820 0.508 0.611463
## BrandCodeC -0.133222 0.023420 -5.688 1.50e-08 ***
## BrandCodeD 0.066257 0.017138 3.866 0.000115 ***
## BrandCodeUnknown -0.083310 0.025692 -3.243 0.001207 **
## CarbVolume -0.020054 0.008510 -2.356 0.018564 *
## FillOunces -0.002789 0.003356 -0.831 0.406119
## PCVolume -0.008543 0.003782 -2.259 0.024021 *
## CarbPressure1 0.007433 0.011970 0.621 0.534718
## CarbTemp -0.004047 0.010844 -0.373 0.709082
## PSC -0.005822 0.003358 -1.734 0.083166 .
## PSCFill -0.002347 0.003276 -0.716 0.473860
## PSCCO2 -0.004954 0.003233 -1.532 0.125591
## MnfFlow -0.080071 0.006930 -11.555 < 2e-16 ***
## CarbPressure2 0.029956 0.003892 7.697 2.31e-14 ***
## FillPressure 0.011142 0.004531 2.459 0.014031 *
## HydPressure2 -0.023363 0.009300 -2.512 0.012085 *
## HydPressure3 0.054547 0.010997 4.960 7.72e-07 ***
## HydPressure4 -0.001115 0.004896 -0.228 0.819899
## FillerLevel -0.019416 0.009741 -1.993 0.046379 *
## FillerSpeed -0.005256 0.006255 -0.840 0.400889
## Temperature -0.018819 0.003744 -5.026 5.52e-07 ***
## Usagecont -0.023918 0.004062 -5.889 4.65e-09 ***
## CarbFlow 0.011000 0.004502 2.443 0.014650 *
## Density -0.010117 0.009844 -1.028 0.304203
## MFR 0.002894 0.005077 0.570 0.568804
## Balling -0.042580 0.013251 -3.213 0.001336 **
## PressureVacuum -0.007930 0.004569 -1.736 0.082825 .
## OxygenFiller -0.013421 0.004479 -2.996 0.002771 **
## BowlSetpoint 0.053504 0.010007 5.347 1.01e-07 ***
## PressureSetpoint -0.020059 0.004617 -4.345 1.47e-05 ***
## AirPressurer -0.003248 0.003412 -0.952 0.341295
## AlchRel 0.024668 0.011818 2.087 0.037005 *
## CarbRel 0.013337 0.006851 1.947 0.051734 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1317 on 1765 degrees of freedom
## Multiple R-squared: 0.4261, Adjusted R-squared: 0.4153
## F-statistic: 39.7 on 33 and 1765 DF, p-value: < 2.2e-16
The model explains about 43% of the variance, though note this is measured in-sample on the training data (the code below predicts on train_final_reduced). Removing one of the highly correlated variables changed performance only slightly.
lm_pred2 <- predict(lm_model2, train_final_reduced)
postResample(pred = lm_pred2, obs = train_final_reduced$PH)
## RMSE Rsquared MAE
## 0.1304420 0.4260610 0.1015644
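The metrics above are in-sample. For a fair comparison with the other models, the reduced model can also be scored on the held-out test set; a sketch is below (extra columns in test_final, such as BallingLvl, are simply ignored by predict()).
# Held-out performance of the reduced linear model
lm_pred2_test <- predict(lm_model2, test_final)
postResample(pred = lm_pred2_test, obs = test_final$PH)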
vif(lm_model2$finalModel)
## BrandCodeB BrandCodeC BrandCodeD BrandCodeUnknown
## 13.461440 6.010979 5.629432 3.082351
## CarbVolume FillOunces PCVolume CarbPressure1
## 7.504226 1.158495 1.492716 14.920149
## CarbTemp PSC PSCFill PSCCO2
## 12.195538 1.158343 1.106130 1.073021
## MnfFlow CarbPressure2 FillPressure HydPressure2
## 4.978525 1.564677 2.126627 8.957363
## HydPressure3 HydPressure4 FillerLevel FillerSpeed
## 12.504899 2.472395 9.823544 4.015090
## Temperature Usagecont CarbFlow Density
## 1.449717 1.710271 2.100253 10.046288
## MFR Balling PressureVacuum OxygenFiller
## 3.386849 18.203599 2.164494 2.073287
## BowlSetpoint PressureSetpoint AirPressurer AlchRel
## 10.404010 2.206743 1.206839 14.465085
## CarbRel
## 4.861898
plot(varImp(lm_model2), top = 10)
train_final_reduced <- train_final %>%
select(-BallingLvl, -Balling)
set.seed(2)
lm_model3 <- train(PH ~. ,
data = train_final_reduced,
method = "lm",
trControl = trainControl(method = "cv", number = 5)
)
summary(lm_model3)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48055 -0.07973 0.00769 0.08715 0.64869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5232252 0.0133369 639.069 < 2e-16 ***
## BrandCodeB 0.0448507 0.0203913 2.200 0.027971 *
## BrandCodeC -0.1053669 0.0218137 -4.830 1.48e-06 ***
## BrandCodeD 0.0729293 0.0170571 4.276 2.01e-05 ***
## BrandCodeUnknown -0.0559107 0.0243001 -2.301 0.021516 *
## CarbVolume -0.0206999 0.0085305 -2.427 0.015342 *
## FillOunces -0.0032468 0.0033622 -0.966 0.334343
## PCVolume -0.0074859 0.0037778 -1.982 0.047684 *
## CarbPressure1 0.0086776 0.0119957 0.723 0.469535
## CarbTemp -0.0051076 0.0108680 -0.470 0.638436
## PSC -0.0060946 0.0033659 -1.811 0.070355 .
## PSCFill -0.0026262 0.0032832 -0.800 0.423882
## PSCCO2 -0.0051091 0.0032411 -1.576 0.115121
## MnfFlow -0.0783970 0.0069284 -11.315 < 2e-16 ***
## CarbPressure2 0.0306438 0.0038965 7.864 6.40e-15 ***
## FillPressure 0.0109652 0.0045428 2.414 0.015891 *
## HydPressure2 -0.0173041 0.0091304 -1.895 0.058226 .
## HydPressure3 0.0477707 0.0108213 4.415 1.07e-05 ***
## HydPressure4 -0.0007159 0.0049075 -0.146 0.884036
## FillerLevel -0.0178218 0.0097537 -1.827 0.067840 .
## FillerSpeed -0.0062722 0.0062636 -1.001 0.316787
## Temperature -0.0165300 0.0036856 -4.485 7.76e-06 ***
## Usagecont -0.0253782 0.0040468 -6.271 4.49e-10 ***
## CarbFlow 0.0124368 0.0044917 2.769 0.005684 **
## Density -0.0281301 0.0081133 -3.467 0.000539 ***
## MFR 0.0012363 0.0050645 0.244 0.807173
## PressureVacuum -0.0029194 0.0043063 -0.678 0.497897
## OxygenFiller -0.0108268 0.0044174 -2.451 0.014345 *
## BowlSetpoint 0.0503526 0.0099847 5.043 5.05e-07 ***
## PressureSetpoint -0.0204676 0.0046270 -4.424 1.03e-05 ***
## AirPressurer -0.0031477 0.0034207 -0.920 0.357600
## AlchRel 0.0143039 0.0113997 1.255 0.209729
## CarbRel 0.0136197 0.0068686 1.983 0.047535 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.132 on 1766 degrees of freedom
## Multiple R-squared: 0.4227, Adjusted R-squared: 0.4122
## F-statistic: 40.41 on 32 and 1766 DF, p-value: < 2.2e-16
The model explains about 42% of the variance, again measured in-sample on the training data.
lm_pred3 <- predict(lm_model3, train_final_reduced)
postResample(pred = lm_pred3, obs = train_final_reduced$PH)
## RMSE Rsquared MAE
## 0.1308231 0.4227033 0.1018339
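The same held-out check applies here (sketch):
# Held-out performance of the model without Balling and BallingLvl
lm_pred3_test <- predict(lm_model3, test_final)
postResample(pred = lm_pred3_test, obs = test_final$PH)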
Ridge regression can handle highly correlated predictors by shrinking their coefficients.
set.seed(11)
ridge_model <- train(PH ~. ,
data = train_final,
method = "ridge",
trControl = trainControl(method = "cv", number = 5)
)
summary(ridge_model)
## Length Class Mode
## call 4 -none- call
## actions 41 -none- list
## allset 34 -none- numeric
## beta.pure 1394 -none- numeric
## vn 34 -none- character
## mu 1 -none- numeric
## normx 34 -none- numeric
## meanx 34 -none- numeric
## lambda 1 -none- numeric
## L1norm 41 -none- numeric
## penalty 41 -none- numeric
## df 41 -none- numeric
## Cp 41 -none- numeric
## sigma2 1 -none- numeric
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
The model explains about 40% of the variance of the test data.
ridge_pred <- predict(ridge_model, test_final)
postResample(pred = ridge_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.1337878 0.4042556 0.1037787
plot(varImp(ridge_model), top = 10)
Elastic net combines the lasso and ridge penalties, shrinking the coefficients of less important and correlated variables. It is suitable for this dataset, which has many predictors and correlated features.
set.seed(111)
enet_model <- train(PH ~. ,
data = train_final,
method = "enet",
trControl = trainControl(method = "cv", number = 5)
)
summary(enet_model)
## Length Class Mode
## call 4 -none- call
## actions 41 -none- list
## allset 34 -none- numeric
## beta.pure 1394 -none- numeric
## vn 34 -none- character
## mu 1 -none- numeric
## normx 34 -none- numeric
## meanx 34 -none- numeric
## lambda 1 -none- numeric
## L1norm 41 -none- numeric
## penalty 41 -none- numeric
## df 41 -none- numeric
## Cp 41 -none- numeric
## sigma2 1 -none- numeric
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
The model explains about 40% of the variance of the test data.
enet_pred <- predict(enet_model, test_final)
postResample(pred = enet_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.1337878 0.4042556 0.1037787
The top predictors here differ from those of the previous models and of the models we will see later. Elastic net gives more weight to predictors that contribute unique information; OxygenFiller, for example, is less correlated with the other top predictors.
plot(varImp(enet_model), top = 10)
Random forest does not require special handling of multicollinearity. It is robust to outliers and can capture complex patterns.
set.seed(11111)
rf_model <- train(PH ~. ,
data = train_final,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 5
)
summary(rf_model)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1799 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 1799 -none- numeric
## importance 34 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 1799 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
The model explains about 65% of the variance of the test data. That’s higher than the previous models.
rf_pred <- predict(rf_model, test_final)
postResample(pred = rf_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.10262764 0.65054981 0.07243542
plot(varImp(rf_model), top = 10)
There are several types of SVM. We use a radial basis kernel, as it handles nonlinear relationships well.
set.seed(11111)
svm_model <- train(PH ~. ,
data = train_final,
method = "svmRadial",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10
)
summary(svm_model)
## Length Class Mode
## 1 ksvm S4
The model explains about 53% of the variance of the test data.
svm_pred <- predict(svm_model, test_final)
postResample(pred = svm_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.12046363 0.52576294 0.08732194
plot(varImp(svm_model), top = 10)
KNN is a nonparametric model that predicts PH from the closest training observations. It relies on the idea that beverages produced under similar conditions should have similar pH levels.
set.seed(111111)
knn_model <- train(PH ~. ,
data = train_final,
method = "knn",
preProc = c("scale", "center"),
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10
)
plot(knn_model)
The model explains about 46% of the variance of the test data.
knn_pred <- predict(knn_model, test_final)
postResample(pred = knn_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.12807883 0.46381221 0.09304781
plot(varImp(knn_model), top = 10)
Next, we tune a weighted KNN (caret method "kknn") over the maximum number of neighbors and the kernel type.
set.seed(111112)
knn_model2 <- train(PH ~. ,
data = train_final,
method = "kknn",
preProc = c("scale", "center"),
trControl = trainControl(method = "cv", number = 5),
tuneGrid = expand.grid(kmax = seq(3,30,2),
distance = 2,
kernel = c("rectangular", "triangular","epanechnikov")))
plot(knn_model2)
The model explains about 47% of the variance of the test data.
knn_pred2 <- predict(knn_model2, test_final)
postResample(pred = knn_pred2, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.12678369 0.47065301 0.09136588
plot(varImp(knn_model2), top = 10)
set.seed(1111111)
library(gbm)
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 10)
gbm_model <- train(PH ~. ,
data = train_final,
method = "gbm",
tuneGrid = gbmGrid,
trControl = trainControl(method = "cv", number = 5),
verbose = FALSE)
gbm_model
## Stochastic Gradient Boosting
##
## 1799 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1438, 1440, 1440, 1440, 1438
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 0.1526716 0.3110506 0.12089633
## 0.01 1 150 0.1479276 0.3356400 0.11683472
## 0.01 1 200 0.1444484 0.3538429 0.11393252
## 0.01 1 250 0.1420269 0.3642708 0.11206031
## 0.01 1 300 0.1401542 0.3722660 0.11064018
## 0.01 1 350 0.1387538 0.3793035 0.10961411
## 0.01 1 400 0.1377683 0.3834491 0.10889348
## 0.01 1 450 0.1369799 0.3871227 0.10828709
## 0.01 1 500 0.1361987 0.3921721 0.10764358
## 0.01 1 550 0.1355715 0.3965752 0.10712704
## 0.01 1 600 0.1350381 0.4005789 0.10666134
## 0.01 1 650 0.1345111 0.4043147 0.10620434
## 0.01 1 700 0.1339982 0.4082989 0.10575040
## 0.01 1 750 0.1335351 0.4117917 0.10539175
## 0.01 1 800 0.1331607 0.4143896 0.10507161
## 0.01 1 850 0.1327793 0.4174006 0.10472626
## 0.01 1 900 0.1324805 0.4196815 0.10445975
## 0.01 1 950 0.1321540 0.4222049 0.10417078
## 0.01 1 1000 0.1319238 0.4240040 0.10395858
## 0.01 3 100 0.1420308 0.4195357 0.11243656
## 0.01 3 150 0.1358908 0.4387302 0.10740419
## 0.01 3 200 0.1320239 0.4555174 0.10426665
## 0.01 3 250 0.1293036 0.4694097 0.10200626
## 0.01 3 300 0.1273462 0.4805246 0.10025574
## 0.01 3 350 0.1257711 0.4898313 0.09882291
## 0.01 3 400 0.1245073 0.4980241 0.09766917
## 0.01 3 450 0.1234258 0.5054874 0.09664414
## 0.01 3 500 0.1224610 0.5120303 0.09579518
## 0.01 3 550 0.1217547 0.5165851 0.09508706
## 0.01 3 600 0.1210141 0.5213643 0.09433899
## 0.01 3 650 0.1204283 0.5249132 0.09371316
## 0.01 3 700 0.1197810 0.5289317 0.09308755
## 0.01 3 750 0.1192274 0.5324658 0.09255388
## 0.01 3 800 0.1187717 0.5352816 0.09210092
## 0.01 3 850 0.1182306 0.5389154 0.09161094
## 0.01 3 900 0.1177706 0.5421165 0.09111080
## 0.01 3 950 0.1173610 0.5448257 0.09072929
## 0.01 3 1000 0.1169994 0.5472374 0.09040695
## 0.01 5 100 0.1376711 0.4662006 0.10868911
## 0.01 5 150 0.1308896 0.4867918 0.10308993
## 0.01 5 200 0.1266630 0.5023671 0.09938430
## 0.01 5 250 0.1236531 0.5173602 0.09672082
## 0.01 5 300 0.1215495 0.5286545 0.09478162
## 0.01 5 350 0.1198559 0.5377687 0.09325770
## 0.01 5 400 0.1184448 0.5464582 0.09196817
## 0.01 5 450 0.1172838 0.5536412 0.09088995
## 0.01 5 500 0.1163913 0.5585715 0.09002560
## 0.01 5 550 0.1155298 0.5636367 0.08924964
## 0.01 5 600 0.1149452 0.5668969 0.08862629
## 0.01 5 650 0.1143239 0.5707675 0.08797669
## 0.01 5 700 0.1138612 0.5731748 0.08744861
## 0.01 5 750 0.1133267 0.5765383 0.08694201
## 0.01 5 800 0.1128624 0.5792913 0.08644706
## 0.01 5 850 0.1124426 0.5818099 0.08605303
## 0.01 5 900 0.1120457 0.5842895 0.08561387
## 0.01 5 950 0.1116794 0.5866407 0.08525002
## 0.01 5 1000 0.1113540 0.5887569 0.08490736
## 0.01 7 100 0.1346423 0.4983060 0.10597046
## 0.01 7 150 0.1272336 0.5214649 0.09988506
## 0.01 7 200 0.1226639 0.5372257 0.09591674
## 0.01 7 250 0.1195964 0.5496794 0.09311565
## 0.01 7 300 0.1174443 0.5596276 0.09109248
## 0.01 7 350 0.1157710 0.5686058 0.08952236
## 0.01 7 400 0.1144556 0.5756106 0.08829102
## 0.01 7 450 0.1134112 0.5812129 0.08726622
## 0.01 7 500 0.1124477 0.5868514 0.08632523
## 0.01 7 550 0.1116563 0.5912931 0.08555519
## 0.01 7 600 0.1109461 0.5955338 0.08481859
## 0.01 7 650 0.1104165 0.5986444 0.08422732
## 0.01 7 700 0.1098155 0.6023818 0.08369095
## 0.01 7 750 0.1092773 0.6056837 0.08317200
## 0.01 7 800 0.1088130 0.6083925 0.08269291
## 0.01 7 850 0.1084959 0.6103164 0.08232206
## 0.01 7 900 0.1081207 0.6125239 0.08199643
## 0.01 7 950 0.1078943 0.6138359 0.08169589
## 0.01 7 1000 0.1075900 0.6157899 0.08138941
## 0.10 1 100 0.1319131 0.4225458 0.10402324
## 0.10 1 150 0.1295430 0.4402645 0.10159004
## 0.10 1 200 0.1283860 0.4485284 0.10013603
## 0.10 1 250 0.1271134 0.4579697 0.09858150
## 0.10 1 300 0.1268761 0.4590602 0.09782979
## 0.10 1 350 0.1261415 0.4649237 0.09694621
## 0.10 1 400 0.1256898 0.4686091 0.09656068
## 0.10 1 450 0.1255754 0.4693183 0.09611852
## 0.10 1 500 0.1257185 0.4682102 0.09612941
## 0.10 1 550 0.1256349 0.4691346 0.09631462
## 0.10 1 600 0.1255685 0.4696445 0.09620439
## 0.10 1 650 0.1252473 0.4723108 0.09603512
## 0.10 1 700 0.1256594 0.4691558 0.09639679
## 0.10 1 750 0.1256859 0.4690126 0.09627809
## 0.10 1 800 0.1256911 0.4687806 0.09635844
## 0.10 1 850 0.1257223 0.4689103 0.09619762
## 0.10 1 900 0.1258346 0.4683226 0.09628161
## 0.10 1 950 0.1256558 0.4697625 0.09607405
## 0.10 1 1000 0.1257889 0.4690216 0.09618164
## 0.10 3 100 0.1180172 0.5357754 0.09100267
## 0.10 3 150 0.1162398 0.5467989 0.08909552
## 0.10 3 200 0.1154893 0.5521490 0.08808798
## 0.10 3 250 0.1143535 0.5601735 0.08690190
## 0.10 3 300 0.1135615 0.5659388 0.08611826
## 0.10 3 350 0.1131042 0.5694696 0.08585456
## 0.10 3 400 0.1125792 0.5733124 0.08516523
## 0.10 3 450 0.1120382 0.5774842 0.08499649
## 0.10 3 500 0.1120605 0.5771630 0.08492003
## 0.10 3 550 0.1114812 0.5816984 0.08432906
## 0.10 3 600 0.1108101 0.5867877 0.08384498
## 0.10 3 650 0.1109249 0.5862988 0.08391750
## 0.10 3 700 0.1110716 0.5853182 0.08414610
## 0.10 3 750 0.1110870 0.5852790 0.08405958
## 0.10 3 800 0.1107928 0.5876450 0.08392789
## 0.10 3 850 0.1106285 0.5888121 0.08378337
## 0.10 3 900 0.1104179 0.5906156 0.08357161
## 0.10 3 950 0.1101881 0.5924394 0.08338117
## 0.10 3 1000 0.1100001 0.5938758 0.08331743
## 0.10 5 100 0.1135245 0.5688416 0.08654510
## 0.10 5 150 0.1110916 0.5858883 0.08372426
## 0.10 5 200 0.1091895 0.5993818 0.08238719
## 0.10 5 250 0.1085578 0.6034431 0.08169342
## 0.10 5 300 0.1081121 0.6066927 0.08129819
## 0.10 5 350 0.1075667 0.6106145 0.08088793
## 0.10 5 400 0.1071132 0.6137639 0.08046806
## 0.10 5 450 0.1068919 0.6156710 0.08026733
## 0.10 5 500 0.1067581 0.6167307 0.08011442
## 0.10 5 550 0.1064039 0.6195400 0.07986743
## 0.10 5 600 0.1062126 0.6209327 0.07952394
## 0.10 5 650 0.1064096 0.6197658 0.07972856
## 0.10 5 700 0.1062913 0.6207586 0.07975663
## 0.10 5 750 0.1063730 0.6201867 0.07999251
## 0.10 5 800 0.1064264 0.6199653 0.08001673
## 0.10 5 850 0.1064123 0.6203343 0.08006138
## 0.10 5 900 0.1063937 0.6204422 0.08006031
## 0.10 5 950 0.1063072 0.6211054 0.08008623
## 0.10 5 1000 0.1062267 0.6215856 0.07994096
## 0.10 7 100 0.1102564 0.5920580 0.08324851
## 0.10 7 150 0.1081958 0.6066781 0.08156698
## 0.10 7 200 0.1070406 0.6145028 0.08029152
## 0.10 7 250 0.1068700 0.6155508 0.08015693
## 0.10 7 300 0.1065737 0.6174849 0.07990953
## 0.10 7 350 0.1065386 0.6178456 0.07959712
## 0.10 7 400 0.1060500 0.6211456 0.07916119
## 0.10 7 450 0.1058310 0.6228049 0.07906718
## 0.10 7 500 0.1057816 0.6230399 0.07912964
## 0.10 7 550 0.1058380 0.6227831 0.07917679
## 0.10 7 600 0.1056376 0.6242841 0.07901467
## 0.10 7 650 0.1056838 0.6241685 0.07904418
## 0.10 7 700 0.1057440 0.6238300 0.07908166
## 0.10 7 750 0.1057747 0.6236412 0.07908521
## 0.10 7 800 0.1058718 0.6230561 0.07924970
## 0.10 7 850 0.1057685 0.6239175 0.07921355
## 0.10 7 900 0.1057312 0.6241185 0.07924197
## 0.10 7 950 0.1056625 0.6246802 0.07919209
## 0.10 7 1000 0.1056601 0.6247255 0.07920151
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 600, interaction.depth =
## 7, shrinkage = 0.1 and n.minobsinnode = 10.
The model explains about 58% of the variance of the test data.
gbm_pred <- predict(gbm_model, test_final)
postResample(pred = gbm_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.11319662 0.57835611 0.08324744
plot(varImp(gbm_model), top = 10)
set.seed(1111111)
cubistGrid <- expand.grid(committees = c(5, 10, 20, 25, 30, 35),
neighbors = c(3, 5, 7, 9))
cubist_model1 <- train(PH ~. ,
data = train_final,
method = "cubist",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = cubistGrid)
cubist_model1
## Cubist
##
## 1799 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1438, 1440, 1440, 1440, 1438
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 5 3 0.10651405 0.6280792 0.07553919
## 5 5 0.10427750 0.6388212 0.07454844
## 5 7 0.10312447 0.6444448 0.07435491
## 5 9 0.10315129 0.6436535 0.07485335
## 10 3 0.10567219 0.6320322 0.07418010
## 10 5 0.10335037 0.6437949 0.07321619
## 10 7 0.10222761 0.6495983 0.07314955
## 10 9 0.10236388 0.6484574 0.07372072
## 20 3 0.10356567 0.6438584 0.07329638
## 20 5 0.10117473 0.6566584 0.07216201
## 20 7 0.10002646 0.6631450 0.07213475
## 20 9 0.10001195 0.6632622 0.07260272
## 25 3 0.10305475 0.6466722 0.07301136
## 25 5 0.10064012 0.6598529 0.07188693
## 25 7 0.09951892 0.6663046 0.07185559
## 25 9 0.09948580 0.6666422 0.07230071
## 30 3 0.10248416 0.6503911 0.07266282
## 30 5 0.10007265 0.6636138 0.07158986
## 30 7 0.09890083 0.6704014 0.07148600
## 30 9 0.09887405 0.6707215 0.07190151
## 35 3 0.10263845 0.6495841 0.07271320
## 35 5 0.10019371 0.6630012 0.07168085
## 35 7 0.09898371 0.6700067 0.07145740
## 35 9 0.09893666 0.6704626 0.07188585
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 30 and neighbors = 9.
The model explains about 66% of the variance of the test data. It outperforms all other models.
cubist_pred <- predict(cubist_model1, test_final)
postResample(pred = cubist_pred, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.1005293 0.6637819 0.0709191
plot(varImp(cubist_model1), top = 10)
varImp(cubist_model1)
## cubist variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## MnfFlow 100.00
## BallingLvl 70.29
## PressureVacuum 65.22
## Balling 63.04
## AlchRel 63.04
## AirPressurer 58.70
## Density 56.52
## BowlSetpoint 52.90
## OxygenFiller 52.17
## CarbRel 47.10
## Temperature 47.10
## HydPressure3 44.93
## Usagecont 41.30
## HydPressure2 41.30
## BrandCodeC 39.86
## CarbFlow 36.96
## CarbPressure2 36.23
## FillerSpeed 26.81
## FillerLevel 24.64
## PressureSetpoint 18.12
Since the Cubist model outperforms all the other models, we attempt to improve it further with feature engineering: we create new nonlinear (squared) terms from the top predictors in the Cubist model.
train_final <- train_final %>%
mutate(MnfFlow_Sq = MnfFlow^2,
PressureVacuum_Sq = PressureVacuum^2,
AlchRel_Sq = AlchRel^2,
Density_Sq = Density^2)
test_final <- test_final %>%
mutate(MnfFlow_Sq = MnfFlow^2,
PressureVacuum_Sq = PressureVacuum^2,
AlchRel_Sq = AlchRel^2,
Density_Sq = Density^2)
set.seed(22)
cubistGrid <- expand.grid(committees = c(5, 10, 20, 25, 30, 35, 40, 50),
neighbors = c(3, 5, 7, 9))
cubist_model2 <- train(PH ~. ,
data = train_final,
method = "cubist",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = cubistGrid)
cubist_model2
## Cubist
##
## 1799 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1438, 1440, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 5 3 0.10222737 0.6550678 0.07199884
## 5 5 0.09999068 0.6666936 0.07138812
## 5 7 0.09946298 0.6692960 0.07140892
## 5 9 0.09963699 0.6678709 0.07187855
## 10 3 0.09924323 0.6728691 0.07030428
## 10 5 0.09724085 0.6838643 0.06990586
## 10 7 0.09696206 0.6855904 0.06999404
## 10 9 0.09729341 0.6834314 0.07052296
## 20 3 0.09816791 0.6798086 0.06973650
## 20 5 0.09605854 0.6917082 0.06932720
## 20 7 0.09585064 0.6930323 0.06950136
## 20 9 0.09614020 0.6913812 0.07009527
## 25 3 0.09801813 0.6799763 0.06973876
## 25 5 0.09601050 0.6914750 0.06925161
## 25 7 0.09575630 0.6931236 0.06938101
## 25 9 0.09604980 0.6914764 0.07005069
## 30 3 0.09800589 0.6799511 0.06966948
## 30 5 0.09606406 0.6911393 0.06917148
## 30 7 0.09583391 0.6925819 0.06924104
## 30 9 0.09612170 0.6909541 0.06983818
## 35 3 0.09806063 0.6796317 0.06972117
## 35 5 0.09611731 0.6909274 0.06919588
## 35 7 0.09585513 0.6926111 0.06919878
## 35 9 0.09616431 0.6908873 0.06986622
## 40 3 0.09797819 0.6801156 0.06967103
## 40 5 0.09597889 0.6917402 0.06904851
## 40 7 0.09567743 0.6937086 0.06902613
## 40 9 0.09596047 0.6921242 0.06965484
## 50 3 0.09770026 0.6816777 0.06953448
## 50 5 0.09566340 0.6937337 0.06894769
## 50 7 0.09540904 0.6955355 0.06899666
## 50 9 0.09571220 0.6938826 0.06957323
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 7.
The model explains about 67% of the variance of the test data, only a slight improvement.
cubist_pred2 <- predict(cubist_model2, test_final)
postResample(pred = cubist_pred2, obs = test_final$PH)
## RMSE Rsquared MAE
## 0.09975222 0.67001655 0.06823076
Below are the top predictors for this model.
plot(varImp(cubist_model2))
| Model | RMSE | R² | MAE |
|---|---|---|---|
| Cubist | 0.09975222 | 0.67001655 | 0.06823076 |
| Random Forest | 0.10262764 | 0.65054981 | 0.07243542 |
| GBM | 0.11319662 | 0.57835611 | 0.08324744 |
| SVM | 0.12046363 | 0.52576294 | 0.08732194 |
| KNN | 0.12678369 | 0.47065301 | 0.09136588 |
| Simple linear model | 0.1304420 | 0.4260610 | 0.1015644 |
| Elastic Net | 0.1337878 | 0.4042556 | 0.1037787 |
| Ridge | 0.1337878 | 0.4042556 | 0.1037787 |
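The table above was compiled by hand from the individual postResample() calls (the linear model row uses the in-sample numbers from lm_model2). Below is a sketch that collects the held-out test metrics programmatically, assuming the fitted model objects are still in the workspace:
models <- list(Cubist = cubist_model2,
               RandomForest = rf_model,
               GBM = gbm_model,
               SVM = svm_model,
               KNN = knn_model2,
               Linear = lm_model,
               ElasticNet = enet_model,
               Ridge = ridge_model)
# Score each model on the held-out test set and sort by RMSE
results <- t(sapply(models, function(m) postResample(predict(m, test_final), test_final$PH)))
results[order(results[, "RMSE"]), ]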
The Cubist model outperforms all the other models, with the highest R² and the lowest RMSE. It is still not ideal, however, as it explains only about 67% of the variance of the test data. The tree- and rule-based models (random forest, GBM, Cubist) performed better than the other models, as they are robust to the outliers and multicollinearity present in our data.
We use the best model (Cubist) to predict the PH values for the student evaluation dataset.
eval_data <- s.eval %>%
select(-PH)
eval_data$BrandCode <- as.factor(eval_data$BrandCode)
eval_numeric <- eval_data %>%
select(where(is.numeric))
eval_factors <- eval_data %>%
select(where(is.factor))
eval_numeric_transformed <- predict(trans, eval_numeric)
eval_final <-bind_cols(eval_numeric_transformed, eval_factors)
eval_final <- eval_final %>%
mutate(across(where(is.factor), ~fct_na_value_to_level(.x, "Unknown")),
MnfFlow_Sq = MnfFlow^2,
PressureVacuum_Sq = PressureVacuum^2,
AlchRel_Sq = AlchRel^2,
Density_Sq = Density^2
)
eval_final$eval_pred_ph <- predict(cubist_model2, eval_final)
write.csv(eval_final, file = "Cubist_PH_Predictions.csv")
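As a quick sanity check (a sketch), the predicted values should fall within the PH range seen in training, roughly 7.88 to 9.36:
summary(eval_final$eval_pred_ph)
# Visual check of the prediction distribution
hist(eval_final$eval_pred_ph,
     main = "Predicted PH on the Evaluation Set",
     xlab = "Predicted PH")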
In this project, we explored and modeled the features affecting pH in the beverage manufacturing process. After performing EDA and preprocessing, the following models were tested to predict pH: linear regression, ridge, elastic net, random forest, SVM, KNN, GBM, and Cubist. The Cubist model outperformed all the others, with the highest R² and the lowest RMSE.
Key predictors shared across models include MnfFlow, OxygenFiller, the pressure settings, the Balling (sugar) levels, and Temperature. MnfFlow consistently emerged as the top predictor. By controlling these factors, ABC Beverage can maintain more consistent pH levels in its beverages.
Although Cubist outperformed all the other models, it is still not an ideal model, as it explains only about 67% of the variance of the test data. With more time, we could attempt to improve the model further.