ABC Beverage Company is responding to new manufacturing regulations that require better understanding and control of beverage pH levels. As part of the Data Science team, I am analyzing historical production data to identify the key factors affecting pH and to develop a predictive model.
This final project provides:
- An exploration of the provided data
- Testing of multiple predictive models
- Selection of the best-performing model
- Final predictions of pH values based on the selected model
The results will be used to support regulatory compliance and optimize production processes.
pH background: In beverages, pH is a measure of acidity or alkalinity on a scale from 0 (very acidic) to 14 (very alkaline), with 7 being neutral (like pure water).
Why pH Matters in Beverages:
Taste: pH affects sourness/sharpness — lower pH = more acidic/sour
Shelf Life: Lower pH (more acidic) inhibits microbial growth
Regulatory Compliance: pH needs to be controlled to meet safety standards
Process Control: pH can indicate ingredient variability or equipment issues
I sourced the Beverage dataset from the website below to train the model, and the Evaluation dataset to generate the pH predictions.
Source:
https://pmc.ncbi.nlm.nih.gov/articles/PMC4808596/
Beverage dataset: 2571 rows x 33 columns.
Evaluation dataset: 267 rows x 33 columns.
The goal is to use the Beverage dataset, which has 32 predictive features and the target variable pH (potential of hydrogen), to train and compare several predictive modeling approaches and select the best-performing model. The best model is then used to predict pH for the Evaluation dataset.
This predictive model will enable ABC Beverage to generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.
We import the historical manufacturing dataset and perform an initial exploration to understand its structure and key characteristics, starting by loading the necessary libraries.
library(tidyverse)
library(readxl)
library(httr)
library(naniar)
library(caret)
library(pls)
library(earth)
library(nnet)
library(kernlab)
library(glmnet)
library(corrplot)
library(mice)
library(rpart) # single-tree-based models (CART)
library(rpart.plot) # nice plots for rpart
library(ipred) # Bagged Trees
library(randomForest)# randomForest
library(partykit) # cforest ..random forest with conditional inference
library(gbm) # boosted trees
library(Cubist) # cubist model
## Load training (student) data from GitHub
url <- "https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentData.xlsx"
temp_file <- tempfile(fileext = ".xlsx")
download.file(url, temp_file, mode = "wb")
data <- read_excel(temp_file)%>%
janitor::clean_names()
# Load test (student evaluation) data from GitHub
GET("https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentEvaluation.xlsx",
write_disk(tf <- tempfile(fileext = ".xlsx")))
## Response [https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentEvaluation.xlsx]
## Date: 2025-05-11 02:04
## Status: 200
## Content-Type: application/octet-stream
## Size: 68.2 kB
## <ON DISK> C:\Users\ajay2\AppData\Local\Temp\RtmpMdrPBY\file1d0c335c61d1.xlsx
test_data<- read_excel(tf, col_names = TRUE)%>%
janitor::clean_names()
Let’s check the first few rows of the data set to confirm that all the features and observations are loaded correctly.
# View the first few rows
knitr::kable(head(data))
brand_code | carb_volume | fill_ounces | pc_volume | carb_pressure | carb_temp | psc | psc_fill | psc_co2 | mnf_flow | carb_pressure1 | fill_pressure | hyd_pressure1 | hyd_pressure2 | hyd_pressure3 | hyd_pressure4 | filler_level | filler_speed | temperature | usage_cont | carb_flow | density | mfr | balling | pressure_vacuum | ph | oxygen_filler | bowl_setpoint | pressure_setpoint | air_pressurer | alch_rel | carb_rel | balling_lvl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B | 5.340000 | 23.96667 | 0.2633333 | 68.2 | 141.2 | 0.104 | 0.26 | 0.04 | -100 | 118.8 | 46.0 | 0 | NA | NA | 118 | 121.2 | 4002 | 66.0 | 16.18 | 2932 | 0.88 | 725.0 | 1.398 | -4.0 | 8.36 | 0.022 | 120 | 46.4 | 142.6 | 6.58 | 5.32 | 1.48 |
A | 5.426667 | 24.00667 | 0.2386667 | 68.4 | 139.6 | 0.124 | 0.22 | 0.04 | -100 | 121.6 | 46.0 | 0 | NA | NA | 106 | 118.6 | 3986 | 67.6 | 19.90 | 3144 | 0.92 | 726.8 | 1.498 | -4.0 | 8.26 | 0.026 | 120 | 46.8 | 143.0 | 6.56 | 5.30 | 1.56 |
B | 5.286667 | 24.06000 | 0.2633333 | 70.8 | 144.8 | 0.090 | 0.34 | 0.16 | -100 | 120.2 | 46.0 | 0 | NA | NA | 82 | 120.0 | 4020 | 67.0 | 17.76 | 2914 | 1.58 | 735.0 | 3.142 | -3.8 | 8.94 | 0.024 | 120 | 46.6 | 142.0 | 7.66 | 5.84 | 3.28 |
A | 5.440000 | 24.00667 | 0.2933333 | 63.0 | 132.6 | NA | 0.42 | 0.04 | -100 | 115.2 | 46.4 | 0 | 0 | 0 | 92 | 117.8 | 4012 | 65.6 | 17.42 | 3062 | 1.54 | 730.6 | 3.042 | -4.4 | 8.24 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.42 | 3.04 |
A | 5.486667 | 24.31333 | 0.1113333 | 67.2 | 136.8 | 0.026 | 0.16 | 0.12 | -100 | 118.4 | 45.8 | 0 | 0 | 0 | 92 | 118.6 | 4010 | 65.6 | 17.68 | 3054 | 1.54 | 722.8 | 3.042 | -4.4 | 8.26 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.04 |
A | 5.380000 | 23.92667 | 0.2693333 | 66.6 | 138.4 | 0.090 | 0.24 | 0.04 | -100 | 119.6 | 45.6 | 0 | 0 | 0 | 116 | 120.2 | 4014 | 66.2 | 23.82 | 2948 | 1.52 | 738.8 | 2.992 | -4.4 | 8.32 | 0.024 | 120 | 46.0 | 146.6 | 7.16 | 5.44 | 3.02 |
We can see that everything loaded correctly. Let's check the structure of the data set to confirm the data types of the features (Rows: 2,571; Columns: 33).
# Quick look at the structure of the training data
glimpse(data)
## Rows: 2,571
## Columns: 33
## $ brand_code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4 <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
We can see that all the predictors are numeric (double) except for Brand Code. The Brand Code variable is currently stored as a character type, and it may be more appropriate to convert it to a factor since it represents a category rather than free text. Additionally, there are missing values in several features, including MFR, Filler Speed, PSC, PSC CO2, and others. We will address these missing values during data pre-processing.
# summary
summary(data)
## brand_code carb_volume fill_ounces pc_volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## carb_pressure carb_temp psc psc_fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## filler_level filler_speed temperature usage_cont carb_flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## density mfr balling pressure_vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## ph oxygen_filler bowl_setpoint pressure_setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## air_pressurer alch_rel carb_rel balling_lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
The summary statistics show that most features are reasonably scaled with some outliers or extreme values, and a few variables have missing data that will need to be addressed during pre-processing.
dim(data)
## [1] 2571 33
data %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
filter(!is.na(Value)) %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
The histograms above show the distributions of the numerical features in the dataset. Several variables, such as Carb Pressure, Carb Temp, Carb Volume, Fill Ounces, Pressure Vacuum, and PC Volume, appear approximately normally distributed.
Other features, such as filler_speed, mfr, and oxygen_filler, show signs of skewness and/or contain a few significant outliers.
Our target variable, pH, also follows a roughly normal distribution overall. However, since the dataset includes a variety of branded beverages with differing fill levels, pressures, carbonation temperatures, and other characteristics, it is important to examine the pH distribution within each beverage group.
data |>
select(ph) |>
ggplot() +
aes(x = ph) +
geom_histogram(bins= 40,fill = "blue", color = "black") +
labs(title = "Distribution of pH", y = "Count") +
theme_minimal()
data |>
ggplot() +
aes(x = ph) +
geom_histogram(bins= 40, fill = "lightblue", color = "black") +
labs(title = "Distribution of pH by Brand", y = "Count" ) +
facet_wrap(~ `brand_code`, scales = 'free_y')
The overall pH distribution is left-skewed and bi-modal, which suggests grouping within the data, so we examine it by brand code.
When we group pH by brand code, only C and D look approximately normally distributed, whereas A and B are multi-modal. There are also many NA values. The pH distribution looks better behaved on a log scale, which could help with prediction.
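Since the log transform is only mentioned in passing, the short sketch below (assuming the natural log is intended; it is exploratory only and not part of the modeling pipeline) can be used to inspect the transformed distribution.
# Quick visual check of a log transform of pH (exploratory only)
data |>
  filter(!is.na(ph)) |>
  ggplot() +
  aes(x = log(ph)) +
  geom_histogram(bins = 40, fill = "lightblue", color = "black") +
  labs(title = "Distribution of log(pH)", y = "Count") +
  theme_minimal()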
Since Brand Code is a character right now but actually represents categories (A, B, C…), we should convert it to a factor before plotting or using it in modeling.
# Convert Brand Code to a factor
data <- data %>%
mutate(`brand_code` = as.factor(`brand_code`))
# Plot Brand Code distribution
ggplot(data, aes(x = `brand_code`)) +
geom_bar(fill = "skyblue", color = "black") +
ggtitle("Distribution of Brand Code") +
theme_minimal()
data %>%
select(`brand_code`, ph) %>%
ggplot() +
aes(x = `brand_code`, y = ph) +
geom_boxplot(color = 'skyblue', outlier.color = 'red')+
labs(title = 'Distribution of PH by Brand Code',
x = element_blank(),
y = 'pH')
The dataset includes four distinct Brand Codes: A, B, C, and D. Brand B has the highest number of observations, followed by Brand D, while Brands A and C have fewer observations. Additionally, there are some missing values (NA) in the Brand Code column that will need to be addressed during data pre-processing.
The box plot shows that pH varies by brand code. Brand C has the lowest pH, with one very high outlier. The NA group, whose median pH is around 8.5, most closely resembles Brand A. Brand D has the highest pH, with only one outlier.
As observed in the box plot and summary, pH ranges from roughly 7.9 to 9.4. This indicates the beverage is clearly basic (alkaline), which is very uncommon for traditional commercial beverages. The product could be alkaline water or an enhanced water used for specialty formulations.
Since PH is a continuous outcome, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers. If linear regression performance is inadequate, we may consider alternative regression techniques such as Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), Boosted Trees, Random Forests, or Neural Networks to improve predictive accuracy.
In the previous section, we observed that several columns contain missing values. In this section, we will explore different techniques to handle these missing values through imputation. Before proceeding with imputation, we first summarize the number of missing values present in each column.
data %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 31 × 2
## variable missing_count
## <chr> <int>
## 1 mfr 212
## 2 brand_code 120
## 3 filler_speed 57
## 4 pc_volume 39
## 5 psc_co2 39
## 6 fill_ounces 38
## 7 psc 33
## 8 carb_pressure1 32
## 9 hyd_pressure4 30
## 10 carb_pressure 27
## # ℹ 21 more rows
The MFR column has the highest number of missing values, accounting for approximately 8.24% of the total observations. Many practitioners recommend removing columns with more than 5% missing data to maintain model reliability. The Brand Code column also has missing values, representing about 4.6% of the total observations.
We also observe that 4 observations are missing the pH value; since pH is the target, these rows must be removed.
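The percentages quoted above can be reproduced with a quick per-column missing-rate calculation, shown in the small sketch below (same pipeline style as the count above).
# Percentage of missing values per column (supports the ~8.24% and ~4.6% figures)
data %>%
  summarise_all(~ round(mean(is.na(.)) * 100, 2)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "pct_missing") %>%
  filter(pct_missing > 0) %>%
  arrange(desc(pct_missing))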
To handle missing values in the remaining columns, we will use the mice package for imputation. The mice functions require syntactic variable names without spaces; the janitor::clean_names() call applied when loading the data already takes care of this.
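As a quick sanity check before imputation, the base-R one-liner below confirms that every column name is already syntactic and space-free.
# TRUE when all column names are syntactic (no spaces or special characters)
all(names(data) == make.names(names(data)))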
To address missing data in our dataset, we use the MICE (Multivariate Imputation by Chained Equations) technique. MICE is a flexible and widely adopted method that models each incomplete variable conditionally on the other variables, iteratively generating plausible values instead of relying on a single guess. This approach captures the uncertainty around missing data and helps preserve relationships between features.
For the imputation model, we select Random Forests (RF) as the method. Random Forest is a non-parametric, ensemble learning algorithm capable of modeling complex nonlinear relationships and interactions without requiring strict assumptions about the data. This makes it particularly well-suited for imputing both numeric and categorical features in real-world manufacturing datasets like ours.
Before performing imputation, we first visualize the extent and structure of missingness across the dataset.
# Missingness plot
gg_miss_var(data)
The missing-value plot highlights that MFR has the highest number of missing values, followed by Brand_Code and Filler_Speed. Several other variables exhibit moderate missingness, while many features are nearly complete.
There are also 4 observations with a missing pH value; these must be removed. This visualization confirms the need to apply an imputation strategy before proceeding to model development.
# Remove rows with missing target variable (pH)
data <- data[!is.na(data$ph), ]
set.seed(786)
# MICE imputation
imputed_data <- mice(data, m = 1, method = 'rf', print = FALSE)
# Extract the completed dataset
data <- complete(imputed_data)
# Check for any remaining missing values
colSums(is.na(data))
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## 0 0 0 0
## filler_level filler_speed temperature usage_cont
## 0 0 0 0
## carb_flow density mfr balling
## 0 0 0 0
## pressure_vacuum ph oxygen_filler bowl_setpoint
## 0 0 0 0
## pressure_setpoint air_pressurer alch_rel carb_rel
## 0 0 0 0
## balling_lvl
## 0
After performing Random Forest-based imputation using MICE, all missing values in the dataset have been successfully handled. We verified the success of the imputation by checking for any remaining missing values using colSums(is.na(data)), which confirmed that the dataset is now fully complete and ready for further modeling and analysis.
After handling missing data, it is important to address predictors that provide little to no useful information for modeling. Predictors with near-zero variance contain very little variability and do not contribute meaningfully to predictive power. Therefore, we identify and remove such predictors to streamline the modeling process and improve model performance.
nzv_cols <- nearZeroVar(data, names = TRUE)
nzv_cols
## [1] "hyd_pressure1"
#Remove only near-zero variance columns
if(length(nzv_cols) > 0) {
data <- data[, !colnames(data) %in% nzv_cols]
}
dim(data) # Rows and columns after applying -nearZeroVar
## [1] 2567 32
In this case, only the hyd_pressure1 variable was removed due to its near-zero variance. All other predictors were retained for further analysis.
There are 14 pH outliers by the within-brand IQR method. However, the values are not drastically higher or lower than the distribution we observed in the box plot, and most of the flagged observations sit near the extreme pH of 9.36. We removed this outlier during experimentation but saw a marked reduction in accuracy, so we decided to keep the value in the training set (a sketch of the removal appears after the output below).
# Calculate IQR-based outlier thresholds within each brand
outliers_by_brand <- data %>%
group_by(brand_code) %>%
mutate(
Q1 = quantile(ph, 0.25, na.rm = TRUE),
Q3 = quantile(ph, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
lower = Q1 - 1.5 * IQR,
upper = Q3 + 1.5 * IQR,
is_outlier = (ph < lower | ph > upper)
) %>%
filter(is_outlier == TRUE)
# View the outliers
print(outliers_by_brand)
## # A tibble: 14 × 38
## # Groups: brand_code [4]
## brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp psc
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C 5.22 23.9 0.391 63.6 139 0.002
## 2 A 5.55 24.0 0.389 72.4 144. 0.0560
## 3 D 5.31 24.0 0.317 66.8 141. 0.098
## 4 A 5.63 24.3 0.182 70.6 137. 0.068
## 5 A 5.45 24.0 0.212 71.2 143 0.116
## 6 A 5.4 24.0 0.219 62.4 133. 0.062
## 7 A 5.37 24.0 0.227 68.2 141. 0.074
## 8 B 5.43 24.0 0.258 63 134 0.072
## 9 B 5.43 24 0.227 70.8 144. 0.05
## 10 B 5.34 24 0.245 66.4 140. 0.116
## 11 B 5.34 24.1 0.294 68.4 142. 0.106
## 12 B 5.35 24.1 0.294 66.8 140. 0.086
## 13 C 5.27 24.0 0.281 65.4 139. 0.096
## 14 C 5.37 24.1 0.271 69.4 142 0.078
## # ℹ 31 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## # carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure2 <dbl>,
## # hyd_pressure3 <dbl>, hyd_pressure4 <dbl>, filler_level <dbl>,
## # filler_speed <dbl>, temperature <dbl>, usage_cont <dbl>, carb_flow <dbl>,
## # density <dbl>, mfr <dbl>, balling <dbl>, pressure_vacuum <dbl>, ph <dbl>,
## # oxygen_filler <dbl>, bowl_setpoint <dbl>, pressure_setpoint <dbl>,
## # air_pressurer <dbl>, alch_rel <dbl>, carb_rel <dbl>, balling_lvl <dbl>, …
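The removal experiment mentioned above is not part of the pipeline; a minimal sketch of what was tried (left commented out, since the value is ultimately retained) is shown below.
# Sketch of the outlier-removal experiment (not applied): dropping the extreme
# pH value of 9.36 reduced test accuracy, so the full training set is kept.
# data <- data %>% filter(ph != 9.36)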
A summary() of the evaluation dataset shows that many observations have NA values which need to be treated. randomForest, lm, and xgboost cannot handle NAs in predictors; if left untreated, predictions on data with missing values will be skewed or incomplete. For a fair evaluation, the test data must therefore undergo the same preprocessing steps as the training data.
summary(test_data)
## brand_code carb_volume fill_ounces pc_volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## carb_pressure carb_temp psc psc_fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## filler_level filler_speed temperature usage_cont carb_flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## density mfr balling pressure_vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## ph oxygen_filler bowl_setpoint pressure_setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03370 Median :120.0 Median :46.00
## Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :3 NA's :1 NA's :2
## air_pressurer alch_rel carb_rel balling_lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
# Convert brand code to a factor
test_data <- test_data %>%
mutate(`brand_code` = as.factor(`brand_code`))
test_completed <- test_data %>%
select(-ph) %>%
mice(., m = 1, method = 'rf', print = FALSE) %>% complete()
# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
test_eval <- test_completed %>%
select(-hyd_pressure1) %>%
mutate(PH = "")
# Check for any remaining missing values
colSums(is.na(test_eval))
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## 0 0 0 0
## filler_speed temperature usage_cont carb_flow
## 0 0 0 0
## density mfr balling pressure_vacuum
## 0 0 0 0
## oxygen_filler bowl_setpoint pressure_setpoint air_pressurer
## 0 0 0 0
## alch_rel carb_rel balling_lvl PH
## 0 0 0 0
With the dataset now fully cleaned and complete, the next step is to perform feature selection. Feature selection is a critical process to identify the most important predictors that influence the target variable, PH. By selecting the most relevant variables, we can improve model performance, reduce overfitting, and simplify the interpretation of the results. In this section, we will explore relationships between predictors and PH to guide our modeling choices.
Since correlation analysis requires numeric variables, we first selected only the numeric columns from the dataset before computing the correlation matrix.
# Numeric columns
data_numeric <- data %>%
select(where(is.numeric))
# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# find the high correlated predictors
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)
# Then this gives the actual column names:
colnames(data)[high_corr]
## [1] "mfr" "hyd_pressure2" "carb_rel" "carb_flow"
## [5] "hyd_pressure4"
# For interpretability we considered dropping these highly correlated predictors,
# but ultimately retained them; the removal below is left commented out.
#data <- data[, -high_corr]
The correlation plot shows that most numerical predictors have low to moderate relationships with each other. pH is moderately correlated with variables such as Oxygen_Filler, Pressure_Setpoint, Bowl_Setpoint, Alch_Rel, and Air_Pressurer, and a few predictor pairs show high correlation with one another. Overall, the correlations suggest that several variables may contribute useful information for predicting pH, supporting our next step of feature selection and model building.
Next, let's extract and visualize the Top 5 predictors most correlated with pH.
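As a minimal sketch (using the cor_matrix computed above; the exact ranking may shift slightly depending on the imputation seed), the five predictors with the largest absolute correlation with pH can be pulled out and plotted as follows.
# Rank predictors by absolute correlation with pH and keep the top 5
ph_corr <- cor_matrix[, "ph"]
ph_corr <- ph_corr[names(ph_corr) != "ph"]
top5 <- ph_corr[order(abs(ph_corr), decreasing = TRUE)][1:5]
round(top5, 3)
# Simple bar plot of the top 5 correlations (sign retained)
barplot(top5, las = 2, col = "skyblue",
        main = "Top 5 predictors most correlated with pH")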
To predict beverage pH levels accurately, we evaluate a range of machine learning models, including both linear and nonlinear approaches. Our goal is to compare multiple algorithms, identify the best-performing model based on predictive accuracy, and balance model complexity with interpretability. We begin by splitting the dataset into training and testing sets, and then proceed to fit and assess several models: Linear Regression, Boosted Trees, Random Forest, MARS, Cubist, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM).
Since the target variable (pH) differs by Brand Code, we stratify the train-test split by Brand Code rather than by pH. Stratifying by brand ensures that the training and test subsets contain similar brand distributions.
set.seed(786)
# index for training
index <- createDataPartition(data$brand_code, p = .8, list = FALSE)
# train
train_data <- data[index, ]
test_data<- data[-index,]
# Training predictors and response
trainingX <- train_data %>% select(-ph)
trainingY <- train_data$ph
# Test predictors and response
testX <- test_data %>% select(-ph)
testY <- test_data$ph
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)
#pre process
pre_process <- c("center","scale")
We evaluate multiple models to predict pH, spanning linear, non-linear, and tree-based approaches. Each model is trained with 5-fold CV on centered and scaled predictors and evaluated on the held-out test data.
Linear Regression (models linear relationships; no tuning) treats pH as a linear combination of predictors. It is simple but may struggle with complex, nonlinear relationships in the data. Implemented with caret's lm method, it is trained on train_data, evaluated with 5-fold CV, and its performance (RMSE, R-squared, MAE) is assessed on test_data using postResample.
Linear Regression Model
RMSE: 0.14
Rsquared: 0.37
MAE: 0.11
set.seed(786)
lm_model <- train(x = trainingX, y = trainingY,
method = "lm",
preProcess = pre_process,
trControl = ctrl)
lm_pred <- predict(lm_model, testX)
lm_resample <- postResample(lm_pred, testY)
lm_resample
## RMSE Rsquared MAE
## 0.1408669 0.3731524 0.1057965
Ridge
RMSE: 0.14
Rsquared: 0.38
MAE: 0.11
set.seed(786)
ridge_model <- train(ph ~ ., data = train_data,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 1, 0.1)))
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_resample
## RMSE Rsquared MAE
## 0.1403992 0.3783332 0.1068694
Elastic Net
RMSE: 0.14
Rsquared: 0.37
MAE: 0.11
set.seed(786)
# Elastic net
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(ph ~ ., data = train_data,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = pre_process)
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_resample
## RMSE Rsquared MAE
## 0.1406831 0.3746848 0.1058024
PCR
RMSE: 0.16
Rsquared: 0.17
MAE: 0.13
set.seed(786)
# Train PCR model
pcr_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "pcr",
trControl = ctrl,
preProcess = pre_process)
# Predict on test data
pcr_pred <- predict(pcr_model, select(test_data, -ph))
# Evaluate performance
pcr_resample <- postResample(pcr_pred, test_data$ph)
pcr_resample
## RMSE Rsquared MAE
## 0.1621590 0.1690306 0.1272482
Partial Least Squares (PLS):
RMSE: 0.14
Rsquared: 0.37
MAE: 0.11
set.seed(786)
pls_grid <- expand.grid(ncomp = seq(1, 15, by = 1))
pls_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "pls",
tuneGrid = pls_grid,
trControl = ctrl,
preProcess = pre_process)
pls_pred <- predict(pls_model, select(test_data, -ph))
pls_resample <- postResample(pls_pred, test_data$ph)
pls_resample
## RMSE Rsquared MAE
## 0.1401566 0.3793208 0.1058806
K-Nearest Neighbors (KNN):
RMSE: 0.127
Rsquared: 0.498
MAE: 0.094
set.seed(786)
knn_grid <- expand.grid(k = seq(1, 21, by = 2))
knn_model <- train(ph ~ ., data = train_data,
method = "knn",
tuneGrid = knn_grid,
trControl = ctrl,
preProcess = pre_process)
knn_pred <- predict(knn_model, select(test_data, -ph))
knn_resample <- postResample(knn_pred, test_data$ph)
knn_resample
## RMSE Rsquared MAE
## 0.12519337 0.51059567 0.09240191
Multivariate Adaptive Regression Splines (MARS):
RMSE: 0.124
Rsquared: 0.513
MAE: 0.089
set.seed(786)
mars_grid <- expand.grid(degree = 1:2, nprune = seq(10, 30, by = 5))
mars_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "earth",
tuneGrid = mars_grid,
trControl = ctrl,
preProcess = pre_process)
mars_pred <- predict(mars_model, select(test_data, -ph))
mars_resample <- postResample(mars_pred, test_data$ph)
mars_resample
## RMSE Rsquared MAE
## 0.12423616 0.51253767 0.08946004
Neural Network (NN):
RMSE: 7.545
Rsquared: 0.016
MAE: 7.543
set.seed(786)
nn_grid <- expand.grid(.decay = c(0.001, 0.01, 0.1), .size = c(3, 5, 7), .bag = FALSE)
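# Note (assumption): for a continuous target, linout = TRUE is normally passed
# through to nnet; without it the logistic output unit bounds predictions to
# (0, 1), which likely explains the very poor fit reported below.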
nn_model <- train(ph ~ ., data = train_data,
method = "avNNet",
tuneGrid = nn_grid,
trControl = ctrl,
preProcess = pre_process,
maxit = 200, trace = FALSE)
## Warning: executing %dopar% sequentially: no parallel backend registered
nn_pred <- predict(nn_model, select(test_data, -ph))
nn_resample <- postResample(nn_pred, test_data$ph)
nn_resample
## RMSE Rsquared MAE
## 7.54518307 0.01620162 7.54308630
Support Vector Machines (SVM):
RMSE: 0.123
Rsquared: 0.526
MAE: 0.087
set.seed(786)
svm_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(0.1, 1, 10))
svm_model <- train(ph ~ ., data = train_data,
method = "svmRadial",
tuneGrid = svm_grid,
trControl = ctrl,
preProcess = pre_process)
svm_pred <- predict(svm_model, select(test_data, -ph))
svm_resample <- postResample(svm_pred, test_data$ph)
svm_resample
## RMSE Rsquared MAE
## 0.12354270 0.52093686 0.08729071
Random Forest with 5-fold CV and 50 trees, evaluated on test_data
RMSE: 0.110
Rsquared: 0.631
MAE: 0.074
Using center and scale for pre-processing, the optimal tuning value, and the final value used for the model, was mtry = 14.
set.seed(786)
# mtry is the number of predictors sampled at each split; the default for
# regression is the number of predictors divided by 3, so around 10 here.
rfGrid <- expand.grid(mtry = seq(2, 20, by = 3))
rfModel <- train(x = select(train_data, -ph), y = train_data$ph,
method = "rf",
tuneGrid = rfGrid,
importance = TRUE,
trControl = ctrl,
ntree = 50,
preProcess=pre_process)
#validation plot
plot(rfModel)
rfModel
## Random Forest
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1645, 1644, 1643, 1645, 1643
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1128771 0.6009651 0.08604055
## 5 0.1058461 0.6387112 0.07906666
## 8 0.1030482 0.6532559 0.07614425
## 11 0.1021637 0.6557643 0.07537543
## 14 0.1019742 0.6543774 0.07500112
## 17 0.1019922 0.6515453 0.07484784
## 20 0.1023996 0.6478821 0.07481463
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
#predict the permeability response for the test data
rfPred <- predict(rfModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
rfResample <- postResample(rfPred, test_data$ph)
rfResample
## RMSE Rsquared MAE
## 0.10972929 0.63254966 0.07613698
Gradient Boosting Model(GBM)
RMSE: 0.127
Rsquared: 0.494
MAE: 0.092
For n.trees = 150, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10.
set.seed(786)
gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
n.trees=c(100,200),
shrinkage=c(0.01,0.1,0.2),
n.minobsinnode=5)
# Distribution defines the type of loss function optimized during boosting;
# for a continuous target it should be gaussian.
# Note: gbmGrid above is defined but not passed to train() (no tuneGrid argument),
# so caret's default grid is used, as reflected in the results below.
gbmModel <-train(x = select(train_data, -ph), y = train_data$ph,
method = "gbm",
verbose = FALSE,
trControl = ctrl,
preProcess=pre_process)
gbmModel
## Stochastic Gradient Boosting
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1645, 1644, 1643, 1645, 1643
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.1328660 0.4128697 0.10549819
## 1 100 0.1288159 0.4409861 0.10220537
## 1 150 0.1269491 0.4546319 0.10056028
## 2 50 0.1260576 0.4717380 0.09955882
## 2 100 0.1214173 0.5027783 0.09526266
## 2 150 0.1193871 0.5163155 0.09295076
## 3 50 0.1220633 0.5014799 0.09577564
## 3 100 0.1182051 0.5261283 0.09182823
## 3 150 0.1166537 0.5364596 0.08997871
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
#predict the permeability response for the test data
gbmPred <- predict(gbmModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
gbmResample<- postResample(gbmPred, test_data$ph)
gbmResample
## RMSE Rsquared MAE
## 0.12670901 0.49446004 0.09180879
Cubist
RMSE: 0.102
Rsquared: 0.673
MAE: 0.068
For committees: 20 and neighbors: 9
# committees is the number of boosted rule-based models combined (analogous to n.trees)
# neighbors can take an integer value between 0 and 9
# cubistGrid <- expand.grid(committees = c(20, 30,40,80,100),
# neighbors = c(0, 1, 3, 5, 7, 9))
set.seed(786)
cubistModel <- train(x = select(train_data, -ph), y = train_data$ph,
method = "cubist",
verbose = FALSE,
preProcess=pre_process)
cubistModel$bestTune
## committees neighbors
## 9 20 9
#predict the permeability response for the test data
cubistPred <- predict(cubistModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
cubistResample <-postResample(cubistPred, test_data$ph)
cubistResample
## RMSE Rsquared MAE
## 0.10159717 0.67458319 0.06798507
Bagged trees
RMSE: 0.128
Rsquared: 0.492
MAE: 0.096
set.seed(786)
baggedModel <- train(x = select(train_data, -ph),
y = train_data$ph,
method = "treebag",
verbose = FALSE,
# trControl = ctrl,
preProcess = pre_process)
baggedModel$bestTune
## parameter
## 1 none
# Predict the permeability response for the test data
baggedPred <- predict(baggedModel, select(test_data, -ph))
# Compare the variance of the predicted values to the test values
baggedResample <- postResample(baggedPred, test_data$ph)
baggedResample
## RMSE Rsquared MAE
## 0.1280664 0.4899209 0.0958947
model_results <- rbind(
lm_resample,
ridge_resample,
enet_resample,
pcr_resample,
pls_resample,
knn_resample,
mars_resample,
nn_resample,
svm_resample,
rfResample,
gbmResample,
baggedResample,
cubistResample
)
rownames(model_results) <- c("Linear Regression", "Ridge", "Elastic Net", "PCR","PLS", "KNN", "MARS", "Neural Network", "SVM","Random Forest", "GBM", "Bagged", "Cubist")
model_results
## RMSE Rsquared MAE
## Linear Regression 0.1408669 0.37315235 0.10579651
## Ridge 0.1403992 0.37833322 0.10686943
## Elastic Net 0.1406831 0.37468483 0.10580243
## PCR 0.1621590 0.16903059 0.12724820
## PLS 0.1401566 0.37932084 0.10588059
## KNN 0.1251934 0.51059567 0.09240191
## MARS 0.1242362 0.51253767 0.08946004
## Neural Network 7.5451831 0.01620162 7.54308630
## SVM 0.1235427 0.52093686 0.08729071
## Random Forest 0.1097293 0.63254966 0.07613698
## GBM 0.1267090 0.49446004 0.09180879
## Bagged 0.1280664 0.48992085 0.09589470
## Cubist 0.1015972 0.67458319 0.06798507
top_predictors <-data.frame(varImp(cubistModel,scale = TRUE)$importance)
top_predictors |>
arrange(desc(Overall))
## Overall
## mnf_flow 100.000000
## balling 72.519084
## balling_lvl 70.229008
## pressure_vacuum 64.122137
## alch_rel 63.358779
## density 61.832061
## carb_rel 51.145038
## air_pressurer 51.145038
## carb_flow 46.564885
## temperature 46.564885
## hyd_pressure3 44.274809
## oxygen_filler 42.748092
## filler_speed 40.458015
## carb_pressure1 40.458015
## bowl_setpoint 38.167939
## usage_cont 37.404580
## hyd_pressure2 29.007634
## brand_code 22.137405
## carb_volume 21.374046
## pc_volume 19.847328
## hyd_pressure4 19.083969
## filler_level 19.083969
## carb_pressure 19.083969
## pressure_setpoint 15.267176
## carb_temp 14.503817
## mfr 6.870229
## fill_pressure 6.870229
## fill_ounces 3.816794
## psc_fill 3.816794
## psc 0.000000
## psc_co2 0.000000
top_10_predictors <- top_predictors |>
arrange(desc(Overall)) |>
slice_head(n = 10)
# 1. Get variable importance
cubistImp <- varImp(cubistModel)
# 2. Plot simple barplot
plot(caret::varImp(cubistModel), main = "Cubist Regression Model Variable Importance")
After choosing the best-performing model, we apply it to the StudentEvaluation data, which was provided in a separate file.
We now predict pH for this data. This will help ABC Beverage Company generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.
The predicted distribution is close to the box plot shown earlier for the training (beverage) dataset.
ph_prediction <- predict(cubistModel,test_eval)
test_eval$PH <- ph_prediction
#ph predicted data
test_eval %>%
select(brand_code, PH) %>%
ggplot() +
aes(x = brand_code, y = PH) +
geom_boxplot(color = 'skyblue', outlier.color = 'red')+
labs(title = 'Distribution of predicted PH by Brand Code',
x = element_blank(),
y = 'pH')
library(openxlsx)
# Create a new workbook
wb <- createWorkbook()
# Add the sheets
addWorksheet(wb, "Predictions")
# Write data to sheet
writeData(wb,sheet = "Predictions" , test_eval)
# Save the workbook
saveWorkbook(wb, "StudentPrediction.xlsx", overwrite = TRUE)
Given the unique nature of this dataset, strong domain knowledge would be necessary to take this model to the next level. Feature engineering is the place to focus, and additional modeling gains could likely be made by creating new features from this rich dataset.
By using a more balanced dataset and by researching and analyzing combination features, we can aim for a higher R-squared and lower RMSE; one hypothetical example is sketched below.
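As one illustration of the kind of combination feature meant here (the ratios below are hypothetical examples, not variables used in the fitted models), derived features could be built and screened against pH before re-fitting.
# Hypothetical combination features for future work (illustrative only)
data_fe <- data %>%
  mutate(carb_ratio = carb_pressure / carb_temp,          # carbonation intensity proxy
         fill_efficiency = filler_level / filler_speed)   # filler behaviour proxy
cor(data_fe$carb_ratio, data_fe$ph)
cor(data_fe$fill_efficiency, data_fe$ph)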
The final predictions generated by the Cubist model for pH look plausible, as they all range between 8 and 9, and the patterns seen in the training data still hold when the predictions are grouped by Brand Code.
I also highlighted the variables that have the greatest effect on pH. I hope that understanding more about the manufacturing process helps the company meet the new regulations in the beverage industry.