ABC Beverage — pH Level Prediction

1. Introduction

ABC Beverage Company is responding to new manufacturing regulations that require better understanding and control of beverage pH levels. As part of the Data Science team, I analyze the historical production data, identify the key factors affecting pH, and develop a predictive model.

This final project provides:

  • An exploration of the provided data

  • Testing of multiple predictive models

  • Selection of the best-performing model

  • Final predictions of pH values based on the selected model

The results will be used to support regulatory compliance and optimize production processes.

pH background: In beverages, pH is a measure of acidity or alkalinity on a scale from 0 (very acidic) to 14 (very alkaline), with 7 being neutral (like pure water).
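For reference, pH is the negative base-10 logarithm of the hydrogen-ion concentration, pH = -log10[H+], so each one-unit drop in pH corresponds to a tenfold increase in acidity.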

Why pH Matters in Beverages:

  • Taste: pH affects sourness/sharpness — lower pH = more acidic/sour

  • Shelf Life: Lower pH (more acidic) inhibits microbial growth

  • Regulatory Compliance: pH needs to be controlled to meet safety standards

  • Process Control: pH can indicate ingredient variability or equipment issues

I sourced the Beverage dataset, used to train the model, and the Evaluation dataset, used to predict pH, from the source below.

Source:

https://pmc.ncbi.nlm.nih.gov/articles/PMC4808596/

Beverage dataset: 2571 rows x 33 columns.

Evaluation dataset: 267 rows x 33 columns.

2. Business Goal

The goal is to use the Beverage dataset, which contains 32 predictive features and the target variable pH (potential of hydrogen), to train and compare several predictive modeling approaches and select the best-performing model. The best model is then used to predict pH for the Evaluation dataset.

This predictive model will enable ABC Beverage to generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.

3. Load and Explore the Data

Importing the historical manufacturing dataset and performing an initial exploration to understand its structure and key characteristics, starting with loading the necessary libraries.

library(tidyverse)
library(readxl)
library(httr)
library(naniar)
library(caret)
library(pls)
library(earth)
library(nnet)
library(kernlab)
library(glmnet)
library(corrplot)
library(mice)

library(rpart)        # single-tree models (CART)
library(rpart.plot)   # nice plots for rpart
library(ipred)        # bagged trees
library(randomForest) # random forest
library(partykit)     # cforest: random forest with conditional inference trees
library(gbm)          # boosted trees
library(Cubist)       # Cubist model

3.1 Load and Read the data

## Load training (student) data from Github
url <- "https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentData.xlsx"
temp_file <- tempfile(fileext = ".xlsx")
download.file(url, temp_file, mode = "wb")
data <- read_excel(temp_file)%>%
  janitor::clean_names()

# Load test (student evaluation) data from GitHub

GET("https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentEvaluation.xlsx", 
  write_disk(tf <- tempfile(fileext = ".xlsx")))
## Response [https://raw.githubusercontent.com/datanerddhanya/DATA-624/main/StudentEvaluation.xlsx]
##   Date: 2025-05-11 02:04
##   Status: 200
##   Content-Type: application/octet-stream
##   Size: 68.2 kB
## <ON DISK>  C:\Users\ajay2\AppData\Local\Temp\RtmpMdrPBY\file1d0c335c61d1.xlsx
test_data<- read_excel(tf, col_names = TRUE)%>%
  janitor::clean_names()

Let’s check the first few rows of the data set to confirm that all the features and observations are loaded correctly.

# View the first few rows
knitr::kable(head(data))
brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp psc psc_fill psc_co2 mnf_flow carb_pressure1 fill_pressure hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level filler_speed temperature usage_cont carb_flow density mfr balling pressure_vacuum ph oxygen_filler bowl_setpoint pressure_setpoint air_pressurer alch_rel carb_rel balling_lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02

We can see that everything loaded correctly. Next, let’s check the structure of the dataset to confirm the data types of the features (Rows: 2,571; Columns: 33).

# Quick look at the structure of the training data 
glimpse(data)
## Rows: 2,571
## Columns: 33
## $ brand_code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

All the predictors are numeric (double) except Brand Code, which is currently stored as a character type; it is more appropriate to convert it to a factor since it represents a category rather than free text. Additionally, there are missing values in several features, including MFR, Filler Speed, PSC, PSC CO2, and others. We will address these missing values during data pre-processing.

# summary
summary(data)
##   brand_code         carb_volume     fill_ounces      pc_volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  carb_pressure     carb_temp          psc             psc_fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     psc_co2           mnf_flow       carb_pressure1  fill_pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  hyd_pressure1   hyd_pressure2   hyd_pressure3   hyd_pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   filler_level    filler_speed   temperature      usage_cont      carb_flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     density           mfr           balling       pressure_vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        ph        oxygen_filler     bowl_setpoint   pressure_setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  air_pressurer      alch_rel        carb_rel      balling_lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

The summary statistics show that most features are reasonably scaled with some outliers or extreme values, and a few variables have missing data that will need to be addressed during pre-processing.

dim(data)
## [1] 2571   33

3.2 Visual Analytics: Feature Distributions

data %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
  filter(!is.na(Value)) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  ggtitle("Histograms of Numerical Features") 

The histograms above show the distributions of the numerical features in the dataset. Several variables, such as Carb Pressure, Carb Temp, Carb Volume, Fill Ounces, Pressure Vacuum, and PC Volume, appear approximately normally distributed.

Other features, such as filler_speed, mfr, and oxygen_filler, show signs of skewness or contain a few significant outliers.

Our target variable, pH, also follows a roughly normal distribution overall. However, since the dataset includes a variety of branded beverages with differing fill levels, pressures, carbonation temperatures, and other characteristics, it’s important to examine the pH distribution within each beverage group.

data |>
  select(ph) |>
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(bins= 40,fill = "blue", color = "black") + 
  labs(title = "Distribution of pH", y = "Count") +  
  theme_minimal()

data |>
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(bins= 40, fill = "lightblue", color = "black") +
  labs(title = "Distribution of pH by Brand", y = "Count" ) + 
  facet_wrap(~ `brand_code`, scales = 'free_y')   

The overall pH distribution is left-skewed and bimodal, which suggests grouping within the data, so we examine the distribution by brand code.

When we group pH by brand code, only Brands C and D look approximately normally distributed, whereas A and B are multi-modal. There are also a lot of NA values in Brand Code. The pH distribution looks somewhat more symmetric on a log scale, which may help with prediction.
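
As a quick check of that note, here is a minimal exploratory sketch plotting the same histogram with pH on a log10 scale; this is purely illustrative, and the models below use pH on its original scale.

# Exploratory only: pH histogram on a log10 scale
data |>
  ggplot(aes(x = log10(ph))) +
  geom_histogram(bins = 40, fill = "lightblue", color = "black") +
  labs(title = "Distribution of log10(pH)", y = "Count") +
  theme_minimal()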

Since Brand Code is a character right now but actually represents categories (A, B, C…), we should convert it to a factor before plotting or using it in modeling.

# Convert Brand Code to a factor
data <- data %>%
  mutate(`brand_code` = as.factor(`brand_code`))

# Plot Brand Code distribution
ggplot(data, aes(x = `brand_code`)) +
  geom_bar(fill = "skyblue", color = "black") +
  ggtitle("Distribution of Brand Code") +
  theme_minimal()

data %>%
  select(`brand_code`, ph) %>%
  ggplot() + 
  aes(x = `brand_code`, y = ph) + 
  geom_boxplot(color = 'skyblue', outlier.color = 'red')+
  labs(title = 'Distribution of PH by Brand Code',
       x = element_blank(),
       y = 'pH')

The dataset includes four distinct Brand Codes: A, B, C, and D. Brand B has the highest number of observations, followed by Brand D, while Brands A and C have fewer observations. Additionally, there are some missing values (NA) in the Brand Code column that will need to be addressed during data pre-processing.

The box plot shows that pH varies by brand code. Brand C has the lowest pH, with one very high outlier. The NA group, centered around a pH of 8.5, most closely resembles Brand A. Brand D has the highest pH, with only one outlier.

As observed in the box plot and summary statistics, pH ranges from roughly 7.9 to 9.4. This indicates the beverage is clearly basic (alkaline), which is very uncommon for traditional commercial beverages. The product could be an alkaline water or an enhanced water used for specialty formulations.

Since PH is a continuous outcome, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers. If linear regression performance is inadequate, we may consider alternative regression techniques such as Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), Boosted Trees, Random Forests, or Neural Networks to improve predictive accuracy.

4. Pre-processing Dataset

In the previous section, we observed that several columns contain missing values. In this section, we will explore different techniques to handle these missing values through imputation. Before proceeding with imputation, we first summarize the number of missing values present in each column.

4.1 Check Missing Values:

data %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))
## # A tibble: 31 × 2
##    variable       missing_count
##    <chr>                  <int>
##  1 mfr                      212
##  2 brand_code               120
##  3 filler_speed              57
##  4 pc_volume                 39
##  5 psc_co2                   39
##  6 fill_ounces               38
##  7 psc                       33
##  8 carb_pressure1            32
##  9 hyd_pressure4             30
## 10 carb_pressure             27
## # ℹ 21 more rows

The MFR column has the highest number of missing values, accounting for approximately 8.24% of the total observations. Generally, many data scientists recommend removing columns with more than 5% missing data to maintain model reliability. The Brand Code column also has missing values, representing about 4.6% of the total observations.
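
For reference, the same missing counts can be expressed as percentages of the 2,571 rows; this is a small sketch using the data frame already loaded as data.

# Percentage of missing values per column
data %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  filter(pct_missing > 0) %>%
  arrange(desc(pct_missing))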

We also observe that 4 observations have no pH value; these rows must be removed, since a missing target cannot be used for training.

To handle missing values in the remaining columns, we will use the mice package for imputation. Note that the column names were already standardized with janitor::clean_names() at load time, so they contain no spaces, which the mice functions require to operate properly.

4.2 Imputing Missing values:

To address missing data in our dataset, we use the MICE (Multivariate Imputation by Chained Equations) technique. MICE is a flexible and widely adopted method that models each incomplete variable conditionally on the other variables, iteratively generating plausible values instead of relying on a single guess. This approach captures the uncertainty around missing data and helps preserve relationships between features.

For the imputation model, we select Random Forests (RF) as the method. Random Forest is a non-parametric, ensemble learning algorithm capable of modeling complex nonlinear relationships and interactions without requiring strict assumptions about the data. This makes it particularly well-suited for imputing both numeric and categorical features in real-world manufacturing datasets like ours.

Before performing imputation, we first visualize the extent and structure of missingness across the dataset.

# Missingness plot
gg_miss_var(data)

The missing value plot highlights that MFR had the highest number of missing values, followed by Brand_Code and Filler_Speed. Several other variables exhibited moderate missingness, while many features were nearly complete.

There are 4 observations with missing PH value - these must be removed. This visualization confirmed the need to apply an imputation strategy before proceeding to model development.

# Remove rows with missing target variable (pH)
data <- data[!is.na(data$ph), ]

set.seed(786)
# MICE imputation
imputed_data <- mice(data, m = 1, method = 'rf', print = FALSE)

# Extract the completed dataset
data <- complete(imputed_data)
# Check for any remaining missing values
colSums(is.na(data))
##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 0                 0                 0                 0 
##      filler_level      filler_speed       temperature        usage_cont 
##                 0                 0                 0                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                 0                 0 
##   pressure_vacuum                ph     oxygen_filler     bowl_setpoint 
##                 0                 0                 0                 0 
## pressure_setpoint     air_pressurer          alch_rel          carb_rel 
##                 0                 0                 0                 0 
##       balling_lvl 
##                 0

After performing Random Forest-based imputation using MICE, all missing values in the dataset have been successfully handled. We verified the success of the imputation by checking for any remaining missing values using colSums(is.na(data)), which confirmed that the dataset is now fully complete and ready for further modeling and analysis.

4.3 Check which columns have near-zero variance

After handling missing data, it is important to address predictors that provide little to no useful information for modeling. Predictors with near-zero variance contain very little variability and do not contribute meaningfully to predictive power. Therefore, we identify and remove such predictors to streamline the modeling process and improve model performance.

nzv_cols <- nearZeroVar(data, names = TRUE)
nzv_cols
## [1] "hyd_pressure1"
#Remove only near-zero variance columns
if(length(nzv_cols) > 0) {
  data <- data[, !colnames(data) %in% nzv_cols]
}
dim(data)  # Rows and columns after applying -nearZeroVar
## [1] 2567   32

In this case, only the hyd_pressure1 variable was removed due to its near-zero variance. All other predictors were retained for further analysis.

4.4 Check for outliers

There are 14 pH outliers flagged by the IQR method within each brand. However, the values are not drastically higher or lower than the rest of the distribution we observed in the box plot, and most of the extreme values sit near a pH of 9.36. We experimented with removing these rows but saw a marked reduction in accuracy, so we made the decision to keep them in the training set.

# Calculate IQR-based outlier thresholds within each brand
outliers_by_brand <- data %>%
  group_by(brand_code) %>%
  mutate(
    Q1 = quantile(ph, 0.25, na.rm = TRUE),
    Q3 = quantile(ph, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    lower = Q1 - 1.5 * IQR,
    upper = Q3 + 1.5 * IQR,
    is_outlier = (ph < lower | ph > upper)
  ) %>%
  filter(is_outlier == TRUE)

# View the outliers
print(outliers_by_brand)
## # A tibble: 14 × 38
## # Groups:   brand_code [4]
##    brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp    psc
##    <fct>            <dbl>       <dbl>     <dbl>         <dbl>     <dbl>  <dbl>
##  1 C                 5.22        23.9     0.391          63.6      139  0.002 
##  2 A                 5.55        24.0     0.389          72.4      144. 0.0560
##  3 D                 5.31        24.0     0.317          66.8      141. 0.098 
##  4 A                 5.63        24.3     0.182          70.6      137. 0.068 
##  5 A                 5.45        24.0     0.212          71.2      143  0.116 
##  6 A                 5.4         24.0     0.219          62.4      133. 0.062 
##  7 A                 5.37        24.0     0.227          68.2      141. 0.074 
##  8 B                 5.43        24.0     0.258          63        134  0.072 
##  9 B                 5.43        24       0.227          70.8      144. 0.05  
## 10 B                 5.34        24       0.245          66.4      140. 0.116 
## 11 B                 5.34        24.1     0.294          68.4      142. 0.106 
## 12 B                 5.35        24.1     0.294          66.8      140. 0.086 
## 13 C                 5.27        24.0     0.281          65.4      139. 0.096 
## 14 C                 5.37        24.1     0.271          69.4      142  0.078 
## # ℹ 31 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## #   carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure2 <dbl>,
## #   hyd_pressure3 <dbl>, hyd_pressure4 <dbl>, filler_level <dbl>,
## #   filler_speed <dbl>, temperature <dbl>, usage_cont <dbl>, carb_flow <dbl>,
## #   density <dbl>, mfr <dbl>, balling <dbl>, pressure_vacuum <dbl>, ph <dbl>,
## #   oxygen_filler <dbl>, bowl_setpoint <dbl>, pressure_setpoint <dbl>,
## #   air_pressurer <dbl>, alch_rel <dbl>, carb_rel <dbl>, balling_lvl <dbl>, …
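
For completeness, a minimal sketch of how the flagged rows could have been dropped; this step was not applied in the final pipeline, for the reasons above.

# Hypothetical removal of the IQR-flagged rows (not used in the final training set)
data_no_outliers <- data %>%
  group_by(brand_code) %>%
  mutate(
    Q1 = quantile(ph, 0.25, na.rm = TRUE),
    Q3 = quantile(ph, 0.75, na.rm = TRUE),
    lower = Q1 - 1.5 * (Q3 - Q1),
    upper = Q3 + 1.5 * (Q3 - Q1)
  ) %>%
  ungroup() %>%
  filter(ph >= lower, ph <= upper) %>%
  select(-Q1, -Q3, -lower, -upper)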

4.5 Evaluation data analysis

A summary() of the evaluation dataset shows that many observations contain NA values that need to be treated; randomForest, lm, and xgboost cannot handle NAs in predictors. If left untreated, predictions for rows with missing values will be skewed or missing entirely. For a fair evaluation, the test data must therefore undergo the same preprocessing steps as the training data.

summary(test_data)
##   brand_code         carb_volume     fill_ounces      pc_volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  carb_pressure     carb_temp          psc             psc_fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     psc_co2           mnf_flow       carb_pressure1  fill_pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  hyd_pressure1    hyd_pressure2    hyd_pressure3    hyd_pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   filler_level    filler_speed   temperature      usage_cont      carb_flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     density           mfr           balling      pressure_vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##     ph          oxygen_filler     bowl_setpoint   pressure_setpoint
##  Mode:logical   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  NA's:267       1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00    
##                 Median :0.03370   Median :120.0   Median :46.00    
##                 Mean   :0.04666   Mean   :109.6   Mean   :47.73    
##                 3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00    
##                 Max.   :0.39800   Max.   :130.0   Max.   :52.00    
##                 NA's   :3         NA's   :1       NA's   :2        
##  air_pressurer      alch_rel        carb_rel     balling_lvl   
##  Min.   :141.2   Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :142.6   Median :6.580   Median :5.40   Median :1.480  
##  Mean   :142.8   Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:142.8   3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :147.2   Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :1       NA's   :3       NA's   :2
# Convert brand_code to a factor
test_data <- test_data %>%
  mutate(`brand_code` = as.factor(`brand_code`))

test_completed <- test_data %>%
  select(-ph) %>%
  mice(., m = 1, method = 'rf', print = FALSE) %>% complete()

# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
test_eval <- test_completed %>%
  select(-hyd_pressure1) %>%
  mutate(PH = "")

# Check for any remaining missing values
colSums(is.na(test_eval))
##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure2     hyd_pressure3     hyd_pressure4      filler_level 
##                 0                 0                 0                 0 
##      filler_speed       temperature        usage_cont         carb_flow 
##                 0                 0                 0                 0 
##           density               mfr           balling   pressure_vacuum 
##                 0                 0                 0                 0 
##     oxygen_filler     bowl_setpoint pressure_setpoint     air_pressurer 
##                 0                 0                 0                 0 
##          alch_rel          carb_rel       balling_lvl                PH 
##                 0                 0                 0                 0

5. Feature Selection

With the dataset now fully cleaned and complete, the next step is to perform feature selection. Feature selection is a critical process to identify the most important predictors that influence the target variable, PH. By selecting the most relevant variables, we can improve model performance, reduce overfitting, and simplify the interpretation of the results. In this section, we will explore relationships between predictors and PH to guide our modeling choices.

Since correlation analysis requires numeric variables, we first selected only the numeric columns from the dataset before computing the correlation matrix.

# Numeric columns
data_numeric <- data %>%
  select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")

# Plot
corrplot(cor_matrix, 
         method = "circle", 
         type = "upper", 
         tl.cex = 0.5, 
         tl.srt = 45)  # Rotate text diagonally

par(mar = c(1, 1, 4, 1))  # top margin = 4 lines

# find the highly correlated predictors
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)

# Then this gives the actual column names:
colnames(data)[high_corr]
## [1] "mfr"           "hyd_pressure2" "carb_rel"      "carb_flow"    
## [5] "hyd_pressure4"
# We considered dropping these highly correlated predictors for interpretability,
# but ultimately retained them all (the drop below is left commented out).
#data <- data[, -high_corr]

The correlation plot shows that most numerical predictors have low to moderate relationships with each other, although a few pairs show high levels of correlation. pH is moderately correlated with variables such as Oxygen_Filler, Pressure_Setpoint, Bowl_Setpoint, Alch_Rel, and Air_Pressurer. Overall, the correlations suggest that several variables may contribute useful information for predicting pH, supporting our next step of feature selection and model building.

Next, let’s extract and visualize the Top 5 most correlated predictors with PH.

5.1 Top 5 Predictors Positively and Negatively Correlated with PH

# Numeric columns
data_numeric <- data %>%
  select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")

# Get correlations with PH
ph_correlation <- cor_matrix[, "ph"]

# Remove PH itself (correlation = 1)
ph_correlation <- ph_correlation[names(ph_correlation) != "ph"]

# Top 5 Positive correlations
top5_pos <- sort(ph_correlation, decreasing = TRUE) %>%
  head(5)

# Top 5 Negative correlations
top5_neg <- sort(ph_correlation, decreasing = FALSE) %>%
  head(5)

# Combine into one dataframe
combined_df <- data.frame(
  Feature = c(names(top5_pos), names(top5_neg)),
  Correlation = c(top5_pos, top5_neg)
)

# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")

# Plot
library(ggplot2)

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Positively and Negatively Correlated with PH",
       x = "Feature",
       y = "Correlation with PH",
       fill = "Direction") +
  theme_minimal()

The bar plot above shows the top features most positively and negatively correlated with PH. Positive correlations (in blue) indicate variables that tend to increase as PH increases, while negative correlations (in red) indicate variables that tend to decrease as PH rises. Identifying both types of relationships helps ensure that our predictive model captures the full range of influences on PH, not just variables that move in the same direction. This balanced feature selection improves model robustness and interpretability.

6. Model Building and Selection

To predict beverage pH levels accurately, we evaluate a range of machine learning models, including both linear and nonlinear approaches. Our goal is to compare multiple algorithms, identify the best-performing model based on predictive accuracy, and balance model complexity with interpretability. We begin by splitting the dataset into training and testing sets, and then proceed to fit and assess several models: Linear Regression, Boosted Trees, Random Forest, MARS, Cubist, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM).

6.1 Splitting the Data

Since the target variable (pH) differs by Brand Code, we stratify the train-test split by Brand Code rather than by pH. Stratifying by brand ensures the training and test subsets contain similar brand distributions.

set.seed(786)

# index for training
index <- createDataPartition(data$brand_code, p = .8, list = FALSE)

# train 
train_data <- data[index, ] 
test_data<- data[-index,]

# Training predictors and response
trainingX <- train_data %>% select(-ph)
trainingY <- train_data$ph

# Test predictors and response
testX <- test_data %>% select(-ph)
testY <- test_data$ph

# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)

#pre process
pre_process <- c("center","scale")

We evaluate multiple models to predict pH, spanning linear, non-linear, and tree-based approaches. Each model is trained with cross-validation (5-fold in most cases) on centered and scaled predictors and evaluated on the held-out test data.

6.2 Linear Regression Models: Linear, Ridge, Elastic Net, PLS, PCR

Linear regression models pH as a linear combination of predictors and requires no tuning. It is simple but may struggle with complex, nonlinear relationships in the data. Implemented using caret’s lm method, it is trained on train_data, evaluated with 5-fold CV, and its performance (RMSE, R-squared, MAE) is assessed on test_data using postResample.

Linear Regression Model

RMSE: 0.14
Rsquared: 0.37
MAE: 0.11

set.seed(786)
lm_model <- train(x = trainingX, y = trainingY,
                  method = "lm",
                  preProcess = pre_process,
                  trControl = ctrl)

lm_pred <- predict(lm_model, testX)
lm_resample <- postResample(lm_pred, testY)
lm_resample
##      RMSE  Rsquared       MAE 
## 0.1408669 0.3731524 0.1057965

Ridge

RMSE: 0.14
Rsquared: 0.38
MAE: 0.11

set.seed(786)
ridge_model <- train(ph ~ ., data = train_data,
                     method = "glmnet",
                     preProcess = c("center", "scale"),
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = seq(0, 1, 0.1)))

ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_resample
##      RMSE  Rsquared       MAE 
## 0.1403992 0.3783332 0.1068694

Elastic Net

RMSE: 0.14
Rsquared: 0.37
MAE: 0.11

set.seed(786)
# Elastic net
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))

enet_model <- train(ph ~ ., data = train_data,
                 method = "enet",
                 tuneGrid = enetGrid,
                 trControl = ctrl,
                 preProcess = pre_process)

enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_resample
##      RMSE  Rsquared       MAE 
## 0.1406831 0.3746848 0.1058024

PCR

RMSE: 0.16
Rsquared: 0.17
MAE: 0.13

set.seed(786)
# Train PCR model
pcr_model <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "pcr",
                   trControl = ctrl,
                   preProcess = pre_process)

# Predict on test data
pcr_pred <- predict(pcr_model, select(test_data, -ph))

# Evaluate performance
pcr_resample <- postResample(pcr_pred, test_data$ph)
pcr_resample
##      RMSE  Rsquared       MAE 
## 0.1621590 0.1690306 0.1272482

Partial Least Squares (PLS):

RMSE: 0.14
Rsquared: 0.37
MAE: 0.11

set.seed(786)
pls_grid <- expand.grid(ncomp = seq(1, 15, by = 1))
pls_model <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "pls",
                   tuneGrid = pls_grid,
                   trControl = ctrl,
                   preProcess = pre_process)
pls_pred <- predict(pls_model, select(test_data, -ph))
pls_resample <- postResample(pls_pred, test_data$ph)
pls_resample
##      RMSE  Rsquared       MAE 
## 0.1401566 0.3793208 0.1058806

6.3 Non-Linear Models: KNN, MARS, NN, SVM

K-Nearest Neighbors (KNN):

RMSE: 0.125
Rsquared: 0.511
MAE: 0.092

set.seed(786)
knn_grid <- expand.grid(k = seq(1, 21, by = 2))

knn_model <- train(ph ~ ., data = train_data,
                   method = "knn",
                   tuneGrid = knn_grid,
                   trControl = ctrl,
                   preProcess = pre_process)

knn_pred <- predict(knn_model, select(test_data, -ph))
knn_resample <- postResample(knn_pred, test_data$ph)
knn_resample
##       RMSE   Rsquared        MAE 
## 0.12519337 0.51059567 0.09240191

Multivariate Adaptive Regression Splines (MARS):

RMSE: 0.124
Rsquared: 0.513
MAE: 0.089

set.seed(786)
mars_grid <- expand.grid(degree = 1:2, nprune = seq(10, 30, by = 5))
mars_model <- train(x = select(train_data, -ph), y = train_data$ph,
                    method = "earth",
                    tuneGrid = mars_grid,
                    trControl = ctrl,
                    preProcess = pre_process)

mars_pred <- predict(mars_model, select(test_data, -ph))
mars_resample <- postResample(mars_pred, test_data$ph)
mars_resample
##       RMSE   Rsquared        MAE 
## 0.12423616 0.51253767 0.08946004

Neural Network (NN):

RMSE: 7.545
Rsquared: 0.016
MAE: 7.543

set.seed(786)

nn_grid <- expand.grid(.decay = c(0.001, 0.01, 0.1), .size = c(3, 5, 7), .bag = FALSE)

nn_model <- train(ph ~ ., data = train_data,
                  method = "avNNet",
                  tuneGrid = nn_grid,
                  trControl = ctrl,
                  preProcess = pre_process,
                  maxit = 200, trace = FALSE)
## Warning: executing %dopar% sequentially: no parallel backend registered
nn_pred <- predict(nn_model, select(test_data, -ph))
nn_resample <- postResample(nn_pred, test_data$ph)
nn_resample
##       RMSE   Rsquared        MAE 
## 7.54518307 0.01620162 7.54308630
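
The very large error suggests the network’s predictions are being squashed by a bounded output activation; for regression with nnet/avNNet, one usually passes linout = TRUE so the output unit is linear. Below is a minimal sketch of that adjustment; it was not rerun here, so no results are reported and any improvement is an assumption.

set.seed(786)
# Same avNNet specification, but with a linear output unit so predictions are not
# constrained to (0, 1); linout is passed through to nnet. Results not shown.
nn_model_linout <- train(ph ~ ., data = train_data,
                         method = "avNNet",
                         tuneGrid = nn_grid,
                         trControl = ctrl,
                         preProcess = pre_process,
                         linout = TRUE,
                         maxit = 200, trace = FALSE)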

Support Vector Machines (SVM):

RMSE: 0.124
Rsquared: 0.521
MAE: 0.087

set.seed(786)

svm_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(0.1, 1, 10))
svm_model <- train(ph ~ ., data = train_data,
                   method = "svmRadial",
                   tuneGrid = svm_grid,
                   trControl = ctrl,
                   preProcess = pre_process)

svm_pred <- predict(svm_model, select(test_data, -ph))
svm_resample <- postResample(svm_pred, test_data$ph)
svm_resample
##       RMSE   Rsquared        MAE 
## 0.12354270 0.52093686 0.08729071

6.4 Tree-Based Models: Random Forest, GBM, Cubist, Bagged Trees

Random Forest on test_data with 5-fold CV and 50 trees

RMSE: 0.110
Rsquared: 0.633
MAE: 0.076

Predictors were centered and scaled during preprocessing. The final value selected for the model was mtry = 14 (the number of predictors sampled at each split).

set.seed(786)

# mtry is the number of predictors sampled at each split.
# The default mtry for regression is the number of predictors divided by 3, so around 10.
rfGrid <- expand.grid(mtry = seq(2, 20, by = 3))


rfModel <- train(x = select(train_data, -ph), y = train_data$ph, 
                 method = "rf", 
                 tuneGrid = rfGrid, 
                 importance = TRUE, 
                 trControl = ctrl,
                 ntree = 50,
                 preProcess=pre_process)

#validation plot
plot(rfModel)

rfModel
## Random Forest 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1645, 1644, 1643, 1645, 1643 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1128771  0.6009651  0.08604055
##    5    0.1058461  0.6387112  0.07906666
##    8    0.1030482  0.6532559  0.07614425
##   11    0.1021637  0.6557643  0.07537543
##   14    0.1019742  0.6543774  0.07500112
##   17    0.1019922  0.6515453  0.07484784
##   20    0.1023996  0.6478821  0.07481463
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
#predict the permeability response for the test data
rfPred <- predict(rfModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
rfResample <- postResample(rfPred, test_data$ph)

rfResample
##       RMSE   Rsquared        MAE 
## 0.10972929 0.63254966 0.07613698

Gradient Boosting Model(GBM)

RMSE: 0.127
Rsquared: 0.494
MAE: 0.092
For n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

set.seed(786)

gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
                       n.trees=c(100,200),
                       shrinkage=c(0.01,0.1,0.2),
                       n.minobsinnode=5)


# Distribution defines the type of loss function optimized during boosting;
# for a continuous response it should be gaussian.
# Note: gbmGrid is defined above but is not passed via tuneGrid, so caret's
# default tuning grid is used (as reflected in the output below).

gbmModel <-train(x = select(train_data, -ph), y = train_data$ph,
                 method = "gbm",
                 verbose = FALSE, 
                 trControl = ctrl, 
                 preProcess=pre_process)

gbmModel
## Stochastic Gradient Boosting 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1645, 1644, 1643, 1645, 1643 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   1                   50      0.1328660  0.4128697  0.10549819
##   1                  100      0.1288159  0.4409861  0.10220537
##   1                  150      0.1269491  0.4546319  0.10056028
##   2                   50      0.1260576  0.4717380  0.09955882
##   2                  100      0.1214173  0.5027783  0.09526266
##   2                  150      0.1193871  0.5163155  0.09295076
##   3                   50      0.1220633  0.5014799  0.09577564
##   3                  100      0.1182051  0.5261283  0.09182823
##   3                  150      0.1166537  0.5364596  0.08997871
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
#predict the permeability response for the test data
gbmPred <- predict(gbmModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
gbmResample<- postResample(gbmPred, test_data$ph)

gbmResample
##       RMSE   Rsquared        MAE 
## 0.12670901 0.49446004 0.09180879

Cubist

RMSE: 0.102
Rsquared: 0.675
MAE: 0.068

For committees: 20 and neighbors: 9

# committees plays a role similar to the number of boosting iterations (n.trees)
# neighbors can take an integer value between 0 and 9
# cubistGrid <- expand.grid(committees = c(20, 30, 40, 80, 100),
#                           neighbors = c(0, 1, 3, 5, 7, 9))

set.seed(786)
cubistModel <- train(x = select(train_data, -ph), y = train_data$ph,
                 method = "cubist",
                 verbose = FALSE, 
                 preProcess=pre_process)

cubistModel$bestTune
##   committees neighbors
## 9         20         9
#predict the permeability response for the test data
cubistPred <- predict(cubistModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
cubistResample <-postResample(cubistPred, test_data$ph)

cubistResample
##       RMSE   Rsquared        MAE 
## 0.10159717 0.67458319 0.06798507

Bagged trees

RMSE: 0.128
Rsquared: 0.490
MAE: 0.096

set.seed(786)

baggedModel <- train(x = select(train_data, -ph), 
                     y = train_data$ph,
                     method = "treebag",
                     verbose = FALSE, 
#                     trControl = ctrl, 
                     preProcess = pre_process)

baggedModel$bestTune
##   parameter
## 1      none
# Predict the permeability response for the test data
baggedPred <- predict(baggedModel, select(test_data, -ph))

# Compare the variance of the predicted values to the test values
baggedResample <- postResample(baggedPred, test_data$ph)
baggedResample
##      RMSE  Rsquared       MAE 
## 0.1280664 0.4899209 0.0958947

7. Model Evaluation

  • Based on R-squared as the performance metric, the Cubist model performed best in both resampling and test-set performance, with the highest R-squared and the lowest RMSE.

  • Cubist model: on average, predictions are off by about 0.10 pH units (RMSE). The model explains ~67.5% of the variance in pH, a good result in most regression problems, and the mean absolute error is under 0.07, showing consistent accuracy.
model_results <- rbind(
  lm_resample,
  ridge_resample,
  enet_resample,
  pcr_resample,
  pls_resample,
  knn_resample,
  mars_resample,
  nn_resample,
  svm_resample,
  rfResample,
  gbmResample,
  baggedResample,
  cubistResample
)

rownames(model_results) <- c("Linear Regression", "Ridge", "Elastic Net", "PCR","PLS", "KNN", "MARS", "Neural Network", "SVM","Random Forest", "GBM", "Bagged", "Cubist")


model_results 
##                        RMSE   Rsquared        MAE
## Linear Regression 0.1408669 0.37315235 0.10579651
## Ridge             0.1403992 0.37833322 0.10686943
## Elastic Net       0.1406831 0.37468483 0.10580243
## PCR               0.1621590 0.16903059 0.12724820
## PLS               0.1401566 0.37932084 0.10588059
## KNN               0.1251934 0.51059567 0.09240191
## MARS              0.1242362 0.51253767 0.08946004
## Neural Network    7.5451831 0.01620162 7.54308630
## SVM               0.1235427 0.52093686 0.08729071
## Random Forest     0.1097293 0.63254966 0.07613698
## GBM               0.1267090 0.49446004 0.09180879
## Bagged            0.1280664 0.48992085 0.09589470
## Cubist            0.1015972 0.67458319 0.06798507
  • In the Cubist model, the top 10 predictors include Mnf_Flow, Balling, Balling_Lvl, Pressure_Vacuum, Alch_Rel, and other important features.
top_predictors <-data.frame(varImp(cubistModel,scale = TRUE)$importance)

top_predictors  |>
  arrange(desc(Overall))
##                      Overall
## mnf_flow          100.000000
## balling            72.519084
## balling_lvl        70.229008
## pressure_vacuum    64.122137
## alch_rel           63.358779
## density            61.832061
## carb_rel           51.145038
## air_pressurer      51.145038
## carb_flow          46.564885
## temperature        46.564885
## hyd_pressure3      44.274809
## oxygen_filler      42.748092
## filler_speed       40.458015
## carb_pressure1     40.458015
## bowl_setpoint      38.167939
## usage_cont         37.404580
## hyd_pressure2      29.007634
## brand_code         22.137405
## carb_volume        21.374046
## pc_volume          19.847328
## hyd_pressure4      19.083969
## filler_level       19.083969
## carb_pressure      19.083969
## pressure_setpoint  15.267176
## carb_temp          14.503817
## mfr                 6.870229
## fill_pressure       6.870229
## fill_ounces         3.816794
## psc_fill            3.816794
## psc                 0.000000
## psc_co2             0.000000
top_10_predictors <- top_predictors  |>
  arrange(desc(Overall)) |>
    slice_head(n = 10)

# 1. Get variable importance
cubistImp <- varImp(cubistModel)


# 2. Plot simple barplot

plot(caret::varImp(cubistModel), main = "Cubist Regression Model Variable Importance")

8. Final Model and Predictions

  • After choosing the best-performing model, we apply it to the StudentEvaluation data, which was provided in a separate file.

  • We now predict the pH for this data. This will help ABC beverage company to generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.

  • The distribution of predicted pH values is close to the box plot of pH by brand shown earlier for the training (Beverage) dataset.

ph_prediction <- predict(cubistModel,test_eval)

test_eval$PH <- ph_prediction

#ph predicted data
test_eval %>%
  select(brand_code, PH) %>%
  ggplot() + 
  aes(x = brand_code, y = PH) + 
  geom_boxplot(color = 'skyblue', outlier.color = 'red')+
  labs(title = 'Distribution of predicted PH by Brand Code',
       x = element_blank(),
       y = 'pH')

library(openxlsx)
# Create a new workbook
wb <- createWorkbook()


# Add the sheets
addWorksheet(wb, "Predictions")

# Write data to sheet
writeData(wb,sheet = "Predictions" , test_eval)


# Save the workbook
saveWorkbook(wb, "StudentPrediction.xlsx", overwrite = TRUE)

9. Conclusion

Given the unique nature of this dataset, strong domain knowledge would be necessary to take this model to the next level. Feature engineering would be the place to focus, and additional modeling gains could likely be made by creating new features from this rich dataset.

By using a more balanced dataset and by researching and analyzing combination features, as sketched below, we could aim for a higher R-squared and a lower RMSE.
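
As an illustration only, here is a minimal sketch of the kind of combination features that could be explored; the specific ratios and averages are hypothetical and have not been validated against pH.

# Hypothetical engineered features; the combinations below are illustrative only
data_fe <- data %>%
  mutate(
    carb_pressure_ratio = carb_pressure1 / fill_pressure,                       # relative pressure balance
    hyd_pressure_mean   = (hyd_pressure2 + hyd_pressure3 + hyd_pressure4) / 3,  # average hydraulic pressure
    fill_throughput     = filler_speed * fill_ounces                            # throughput proxy
  )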

The final pH predictions generated with the Cubist model fall in a similar range, roughly between 8 and 9, and the brand-level patterns observed in training hold up when the predictions are grouped by Brand Code.

I also highlighted the variables that have the most effect on pH. I hope that this improved understanding of the manufacturing process helps ABC Beverage meet the new regulations in the beverage industry.