1 Introduction

ABC Beverage Company is responding to new manufacturing regulations that require better understanding and control of beverage pH levels. As part of the Data Science team, our task is to analyze beverage data, identify the key factors affecting pH, and develop a predictive model.

This report provides:
- An exploration of the provided data
- Testing of multiple predictive models
- Selection of the best-performing model
- Final predictions of pH values based on the selected model

The results will be used to support regulatory compliance and optimize production processes.

PH background: In beverages, pH is a measure of acidity or alkalinity, on a scale from 0 (very acidic) to 14 (very alkaline), with 7 being neutral (like pure water).

pH matters in beverages due to following factors:
pH affects sourness/sharpness — lower pH = more acidic/sour
Lower pH (more acidic) inhibits microbial growth
pH needs to be controlled to meet safety standards
pH can indicate ingredient variability or equipment issues
https://www.jaidevelopment.com/blog/the-importance-of-ph-and-acidity-in-beverage-formulation#

2 Objective

Our primary goal is to develop an accurate predictive model for beverage pH levels using historical manufacturing data. We aim to identify the key production variables that influence pH, evaluate and compare multiple predictive modeling approaches, and select the model that provides the best balance of accuracy and interpretability. Using the selected model, we will generate reliable pH predictions to support production optimization and regulatory reporting. By achieving this goal, ABC Beverage Company will be better equipped to monitor, control, and maintain pH levels within required standards, ensuring both compliance and product consistency.

3 Data Exploration

We begin by importing this beverage dataset and performing an initial exploration to understand its structure and key characteristics, starting with loading the necessary libraries.

Source: Beverage dataset (StudentData.xlsx): 2571 rows x 33 columns. Evaluation dataset (StudentEvaluation.xlsx): 267 rows x 33 columns. Beverage dataset captures the various aspects of the beverage manufacturing process, including product characteristics, equipment readings, flow rates, pressures, and temperatures. The target variable for prediction is pH, a continuous measure reflecting the acidity or alkalinity of the beverage.

3.1 Load and Read the data

Let’s check the first few rows of the data set to confirm that all the features and observations are loaded correctly.

# View the first few rows
head(data)

## # A tibble: 6 × 33
##   brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp    psc
##   <chr>            <dbl>       <dbl>     <dbl>         <dbl>     <dbl>  <dbl>
## 1 B                 5.34        24.0     0.263          68.2      141.  0.104
## 2 A                 5.43        24.0     0.239          68.4      140.  0.124
## 3 B                 5.29        24.1     0.263          70.8      145.  0.09 
## 4 A                 5.44        24.0     0.293          63        133. NA    
## 5 A                 5.49        24.3     0.111          67.2      137.  0.026
## 6 A                 5.38        23.9     0.269          66.6      138.  0.09 
## # ℹ 26 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## #   carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure1 <dbl>,
## #   hyd_pressure2 <dbl>, hyd_pressure3 <dbl>, hyd_pressure4 <dbl>,
## #   filler_level <dbl>, filler_speed <dbl>, temperature <dbl>,
## #   usage_cont <dbl>, carb_flow <dbl>, density <dbl>, mfr <dbl>, balling <dbl>,
## #   pressure_vacuum <dbl>, ph <dbl>, oxygen_filler <dbl>, bowl_setpoint <dbl>,
## #   pressure_setpoint <dbl>, air_pressurer <dbl>, alch_rel <dbl>, …

We can see that everything loaded correctly let’s check the structure of the data set to confirm that data types of the features.

glimpse(data)

## Rows: 2,571
## Columns: 33
## $ brand_code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

We can see that all the predictors are numeric with double datatype except for the brand code. The Brand Code variable is currently stored as a character type, and it may be more appropriate to convert it to a factor since it represents a category rather than free text. Additionally, there are some missing values observed in several features, including mfr, filler_speed, psc, psc_co2, and others. We will address these missing values during data pre-processing.

summary(data)

##   brand_code         carb_volume     fill_ounces      pc_volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  carb_pressure     carb_temp          psc             psc_fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     psc_co2           mnf_flow       carb_pressure1  fill_pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  hyd_pressure1   hyd_pressure2   hyd_pressure3   hyd_pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   filler_level    filler_speed   temperature      usage_cont      carb_flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     density           mfr           balling       pressure_vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        ph        oxygen_filler     bowl_setpoint   pressure_setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  air_pressurer      alch_rel        carb_rel      balling_lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

The summary statistics show that most features are reasonably scaled with some outliers or extreme values, and a few variables have missing data that will need to be addressed during pre-processing.

dim(data)

## [1] 2571   33

The dataset provided contains 2,571 observations and 33 variables, including product characteristics, equipment readings, flow rates, pressures, and temperatures. The target variable for prediction is pH, a continuous measure reflecting the acidity or alkalinity of the beverage.

3.2 Visual Analytics: Feature Distributions

data %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
  filter(!is.na(Value)) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  ggtitle("Histograms of Numerical Features")

The histograms above show the distributions of the numerical features in the dataset. Several variables, such as Carb Pressure, Carb Temp, Carb Volume, Fill Ounces, Pressure Vacuum, and PC Volume, appear approximately normally distributed.

Other features like filler_speed, mfr, and oxygen_filler either show signs of skewness and contain a few significant outliers.

Our target variable, PH, also follows a roughly normal distribution overall. However, since the dataset includes a variety of branded beverages with differing fill levels, pressures, carbonation temperatures, and other characteristics, it’s important to examine the PH distribution within each beverage group

data |>
  select(ph) |>
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(bins= 40,fill = "blue", color = "black") + 
  labs(title = "Distribution of pH", y = "Count") +  
  theme_minimal()

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

  data |>
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(bins= 40, fill = "lightblue", color = "black") +
  labs(title = "Distribution of pH by Brand", y = "Count" ) + 
  facet_wrap(~ `brand_code`, scales = 'free_y')

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

The overall pH distribution is left skewed and is bi-modal. It indicates that there is grouping within it and hence we check by each brand code.

When we group the PH by the brand code, we see that C & D only look normally distributed whereas A, B are multi-modal. Also there are a lot of NA values.The distribution around PH is better as logged and will help us with the prediction.

Since brand_code is a character right now but actually represents categories (A, B, C…), we should convert it to a factor before plotting or using it in modeling.

# Convert Brand Code to a factor
data <- data %>%
  mutate(`brand_code` = as.factor(`brand_code`))

# Plot Brand Code distribution
ggplot(data, aes(x = `brand_code`)) +
  geom_bar(fill = "skyblue", color = "black") +
  ggtitle("Distribution of Brand Code") +
  theme_minimal()

data %>%
  select(`brand_code`, ph) %>%
  ggplot() + 
  aes(x = `brand_code`, y = ph) + 
  geom_boxplot(color = 'skyblue', outlier.color = 'red')+
  labs(title = 'Distribution of PH by Brand Code',
       x = element_blank(),
       y = 'pH')

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The dataset includes four distinct Brand Codes: A, B, C, and D. Brand B has the highest number of observations, followed by Brand D, while Brands A and C have fewer observations. Additionally, there are some missing values (NA) in the Brand Code column that will need to be addressed during data pre-processing.

The box plot shows that there is variation in pH based on brand code. C has the lowest PH with a very high outlier. The NA value showing PH of 8.5 is seen to be the data of A. D has the highest PH with only one outlier.

As observed by the box plot, pH varies from 7.8 to 9.6 . This indicates the beverage is clearly basic (alkaline), and that’s very uncommon for traditional commercial beverages. The product could be Alkaline water, some enhanced waters used for specialty or specific formulations.

Since pH is a continuous outcome, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers. If linear regression performance is inadequate, we may consider alternative regression techniques such as Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), Boosted Trees, Random Forests, or Neural Networks to improve predictive accuracy.

4 Data Preparation

In the previous section, we observed that several columns contain missing values. In this section, we will explore different techniques to handle these missing values through imputation. Before proceeding with imputation, we first summarize the number of missing values present in each column.

4.1 Check Missing Values

data %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))

## # A tibble: 31 × 2
##    variable       missing_count
##    <chr>                  <int>
##  1 mfr                      212
##  2 brand_code               120
##  3 filler_speed              57
##  4 pc_volume                 39
##  5 psc_co2                   39
##  6 fill_ounces               38
##  7 psc                       33
##  8 carb_pressure1            32
##  9 hyd_pressure4             30
## 10 carb_pressure             27
## # ℹ 21 more rows

The MFR column has the highest number of missing values, accounting for approximately 8.24% of the total observations. Generally, many data scientists recommend removing columns with more than 5% missing data to maintain model reliability. The brand_code column also has missing values, representing about 4.6% of the total observations.

Also we observe that 4 observations do not have a PH value. Hence we need to remove these observations.

To handle missing values in the remaining columns, we will use the mice package for imputation. However, before proceeding, we need to address the column names by removing any spaces, as the mice functions require variable names without spaces to operate properly.

4.2 Imputing Missing values

To address missing data in our dataset, we use the MICE (Multivariate Imputation by Chained Equations) technique. MICE is a flexible and widely adopted method that models each incomplete variable conditionally on the other variables, iteratively generating plausible values instead of relying on a single guess. This approach captures the uncertainty around missing data and helps preserve relationships between features.

For the imputation model, we select Random Forests (RF) as the method. Random Forest is a non-parametric, ensemble learning algorithm capable of modeling complex nonlinear relationships and interactions without requiring strict assumptions about the data. This makes it particularly well-suited for imputing both numeric and categorical features in real-world manufacturing datasets like ours.

Before performing imputation, we first visualize the extent and structure of missingness across the dataset.

# Missingness plot
gg_miss_var(data)

The missing value plot highlights that mfr had the highest number of missing values, followed by brand_code and filler_speed. Several other variables exhibited moderate missingness, while many features were nearly complete.

There are 4 observations with missing PH value - these must be removed. This visualization confirmed the need to apply an imputation strategy before proceeding to model development.

# Remove rows with missing target variable (pH)
data <- data[!is.na(data$ph), ]

set.seed(786)
# MICE imputation
imputed_data <- mice(data, m = 1, method = 'rf', print = FALSE)

# Extract the completed dataset
data <- complete(imputed_data)

# Check for any remaining missing values
colSums(is.na(data))

##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 0                 0                 0                 0 
##      filler_level      filler_speed       temperature        usage_cont 
##                 0                 0                 0                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                 0                 0 
##   pressure_vacuum                ph     oxygen_filler     bowl_setpoint 
##                 0                 0                 0                 0 
## pressure_setpoint     air_pressurer          alch_rel          carb_rel 
##                 0                 0                 0                 0 
##       balling_lvl 
##                 0

After performing Random Forest-based imputation using MICE, all missing values in the dataset have been successfully handled. We verified the success of the imputation by checking for any remaining missing values using colSums(is.na(data)), which confirmed that the dataset is now fully complete and ready for further modeling and analysis.

4.3 Check which columns have near-zero variance

After handling missing data, it is important to address predictors that provide little to no useful information for modeling. Predictors with near-zero variance contain very little variability and do not contribute meaningfully to predictive power. Therefore, we identify and remove such predictors to streamline the modeling process and improve model performance.

nzv_cols <- nearZeroVar(data, names = TRUE)
nzv_cols

## [1] "hyd_pressure1"

#Remove only near-zero variance columns
if(length(nzv_cols) > 0) {
  data <- data[, !colnames(data) %in% nzv_cols]
}

dim(data)  # Rows and columns after applying -nearZeroVar

## [1] 2567   32

In this case, only the hyd_pressure1 variable was removed due to its near-zero variance. All other predictors were retained for further analysis.

4.4 Check for Outliers

There are 13 outliers using IQR method. However the values are not drastically high or low than the mean distribution we observed in the box plot. Hence these observations do not have to be removed from the dataset.

# Calculate IQR-based outlier thresholds within each brand
outliers_by_brand <- data %>%
  group_by(brand_code) %>%
  mutate(
    Q1 = quantile(ph, 0.25, na.rm = TRUE),
    Q3 = quantile(ph, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    lower = Q1 - 1.5 * IQR,
    upper = Q3 + 1.5 * IQR,
    is_outlier = (ph < lower | ph > upper)
  ) %>%
  filter(is_outlier == TRUE)

# View the outliers
print(outliers_by_brand)

## # A tibble: 14 × 38
## # Groups:   brand_code [4]
##    brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp    psc
##    <fct>            <dbl>       <dbl>     <dbl>         <dbl>     <dbl>  <dbl>
##  1 C                 5.22        23.9     0.391          63.6      139  0.002 
##  2 A                 5.55        24.0     0.389          72.4      144. 0.0560
##  3 D                 5.31        24.0     0.317          66.8      141. 0.098 
##  4 A                 5.63        24.3     0.182          70.6      137. 0.068 
##  5 A                 5.45        24.0     0.212          71.2      143  0.116 
##  6 A                 5.4         24.0     0.219          62.4      133. 0.062 
##  7 A                 5.37        24.0     0.227          68.2      141. 0.074 
##  8 B                 5.43        24.0     0.258          63        134  0.072 
##  9 B                 5.43        24       0.227          70.8      144. 0.05  
## 10 B                 5.34        24       0.245          66.4      140. 0.116 
## 11 B                 5.34        24.1     0.294          68.4      142. 0.106 
## 12 B                 5.35        24.1     0.294          66.8      140. 0.086 
## 13 C                 5.27        24.0     0.281          65.4      139. 0.096 
## 14 C                 5.37        24.1     0.271          69.4      142  0.078 
## # ℹ 31 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## #   carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure2 <dbl>,
## #   hyd_pressure3 <dbl>, hyd_pressure4 <dbl>, filler_level <dbl>,
## #   filler_speed <dbl>, temperature <dbl>, usage_cont <dbl>, carb_flow <dbl>,
## #   density <dbl>, mfr <dbl>, balling <dbl>, pressure_vacuum <dbl>, ph <dbl>,
## #   oxygen_filler <dbl>, bowl_setpoint <dbl>, pressure_setpoint <dbl>,
## #   air_pressurer <dbl>, alch_rel <dbl>, carb_rel <dbl>, balling_lvl <dbl>, …

4.5 Test Data Exploration and Analysis

Analysis of the test data using the summary() function revealed that many observations contain NA values, which must be addressed. Models such as randomForest, lm, and xgboost cannot handle missing values in predictor variables. If these NAs are not properly treated, predictions on the test data may be inaccurate, skewed, or incomplete. Therefore, to ensure a fair and consistent evaluation, the test data should undergo the same preprocessing steps as the training data.

summary(test_data)

##   brand_code         carb_volume     fill_ounces      pc_volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  carb_pressure     carb_temp          psc             psc_fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     psc_co2           mnf_flow       carb_pressure1  fill_pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  hyd_pressure1    hyd_pressure2    hyd_pressure3    hyd_pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   filler_level    filler_speed   temperature      usage_cont      carb_flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     density           mfr           balling      pressure_vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##     ph          oxygen_filler     bowl_setpoint   pressure_setpoint
##  Mode:logical   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  NA's:267       1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00    
##                 Median :0.03370   Median :120.0   Median :46.00    
##                 Mean   :0.04666   Mean   :109.6   Mean   :47.73    
##                 3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00    
##                 Max.   :0.39800   Max.   :130.0   Max.   :52.00    
##                 NA's   :3         NA's   :1       NA's   :2        
##  air_pressurer      alch_rel        carb_rel     balling_lvl   
##  Min.   :141.2   Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :142.6   Median :6.580   Median :5.40   Median :1.480  
##  Mean   :142.8   Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:142.8   3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :147.2   Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :1       NA's   :3       NA's   :2

# factoring brand code
test_data <- test_data %>%
  mutate(`brand_code` = as.factor(`brand_code`))

test_completed <- test_data %>%
  select(-ph) %>%
  mice(., m = 1, method = 'rf', print = FALSE) %>% complete()

# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
test_eval <- test_completed %>%
  select(-hyd_pressure1) %>%
  mutate(PH = "")

# Check for any remaining missing values
colSums(is.na(test_eval))

##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure2     hyd_pressure3     hyd_pressure4      filler_level 
##                 0                 0                 0                 0 
##      filler_speed       temperature        usage_cont         carb_flow 
##                 0                 0                 0                 0 
##           density               mfr           balling   pressure_vacuum 
##                 0                 0                 0                 0 
##     oxygen_filler     bowl_setpoint pressure_setpoint     air_pressurer 
##                 0                 0                 0                 0 
##          alch_rel          carb_rel       balling_lvl                PH 
##                 0                 0                 0                 0

5 Feature Selection

With the dataset now fully cleaned and complete, the next step is to perform feature selection. Feature selection is a critical process to identify the most important predictors that influence the target variable, pH. By selecting the most relevant variables, we can improve model performance, reduce overfitting, and simplify the interpretation of the results. In this section, we will explore relationships between predictors and pH to guide our modeling choices.

Since correlation analysis requires numeric variables, we first selected only the numeric columns from the dataset before computing the correlation matrix.

# Numeric columns
data_numeric <- data %>%
  select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")

# Plot
corrplot(cor_matrix, 
         method = "circle", 
         type = "upper", 
         tl.cex = 0.5, 
         tl.srt = 45)  # Rotate text diagonally

par(mar = c(1, 1, 4, 1))  # top margin = 4 lines

# find the high correlated predictors
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)

# Then this gives the actual column names:
colnames(data)[high_corr]

## [1] "mfr"           "hyd_pressure2" "carb_rel"      "carb_flow"    
## [5] "hyd_pressure4"

# As we value interpretability, we will drop these highly correlated predictors.
#data <- data[, -high_corr]

The correlation plot shows that most numerical predictors have low to moderate relationships with each other. PH is moderately correlated with variables such as oxygen_filler, pressure_setpoint, bowl_setpoint, alch_rel, and air_Ppressurer. A few other predictors also show noticeable associations. Overall, the correlations suggest that several variables may contribute useful information for predicting PH, supporting our next step of feature selection and model building.

There is higher levels of correlation for the following variables:

mfr
hyd_pressure2
carb_rel
carb_flow
hyd_pressure4

Next, let’s extract and visualize the Top 5 most correlated predictors with pH.

5.1 Top 5 Predictors Positively and Negatively Correlated with pH

# Numeric columns
data_numeric <- data %>%
  select(where(is.numeric))

# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")

# Get correlations with PH
ph_correlation <- cor_matrix[, "ph"]

# Remove PH itself (correlation = 1)
ph_correlation <- ph_correlation[names(ph_correlation) != "ph"]

# Top 5 Positive correlations
top5_pos <- sort(ph_correlation, decreasing = TRUE) %>%
  head(5)

# Top 5 Negative correlations
top5_neg <- sort(ph_correlation, decreasing = FALSE) %>%
  head(5)

# Combine into one dataframe
combined_df <- data.frame(
  Feature = c(names(top5_pos), names(top5_neg)),
  Correlation = c(top5_pos, top5_neg)
)

# Add a column to show Positive or Negative
combined_df$Direction <- ifelse(combined_df$Correlation > 0, "Positive", "Negative")

# Plot
library(ggplot2)

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Positively and Negatively Correlated with PH",
       x = "Feature",
       y = "Correlation with PH",
       fill = "Direction") +
  theme_minimal()

The bar plot above shows the top features most positively and negatively correlated with pH. Positive correlations (in blue) indicate variables that tend to increase as pH increases, while negative correlations (in red) indicate variables that tend to decrease as pH rises. Identifying both types of relationships helps ensure that our predictive model captures the full range of influences on pH, not just variables that move in the same direction. This balanced feature selection improves model robustness and interpretability.

6 Data Pre-Processing

To predict beverage pH levels accurately, we evaluate a range of machine learning models, including both linear and nonlinear approaches. Our goal is to compare multiple algorithms, identify the best-performing model based on predictive accuracy, and balance model complexity with interpretability. We begin by splitting the dataset into training and testing sets, and then proceed to fit and assess several models to predict pH, including linear model, non-linear models, and tree-based models. Each model is trained with 5-fold CV, centered/scaled predictors, and evaluated on test data.

6.1 Splitting the Data

Since the target variable (pH) differs by Brand Code, you should stratify the train-test split by Brand Code. However, stratifying by pH produced a better R squared. Hence we did by pH. Stratifying by brand ensures each subset (train/test) contains similar brand distributions

set.seed(786)

# index for training
index <- createDataPartition(data$ph, p = .8, list = FALSE)

# train 
train_data <- data[index, ] 
test_data<- data[-index,]

# Training predictors and response
trainingX <- train_data %>% select(-ph)
trainingY <- train_data$ph

# Test predictors and response
testX <- test_data %>% select(-ph)
testY <- test_data$ph

# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)

#pre process
pre_process <- c("center","scale")

7 Model Building and Selection

Note: our machines were unable to knit when adding a tuning parameter grid for many of the models (particularly the tree models), even with parallel processing instated. For this reason, we left out a tuning grid for most models. We noticed generally a tuning grid did not majorly affect results of the model anyway.

7.1 Linear Regression Models: Ridge, Lasso, Elastic Net, PLS, PCR

Models linear relationships, no tuning - linear Regression models pH as a linear combination of predictors. It’s simple but may struggle with complex, nonlinear relationships in the data. Implemented using caret’s lm method, it’s trained on train_data, evaluated with 5-fold CV, and performance (RMSE, R-squared, MAE) is assessed on test_data using postResample.

7.1.1 Linear Regression Model

The model produced a lowest RMSE of 0.1321767 and an R² of 0.4186977, indicating that approximately 41.87% of the variance in pH is explained by the model. The MAE was 0.1041143, reflecting the average difference between predicted and actual values. These results suggest that this model may not be the most suitable for accurately predicting pH in this dataset.

set.seed(786)
lm_model <- train(x = trainingX, y = trainingY,
                  method = "lm",
                  preProcess = pre_process,
                  trControl = ctrl)

lm_pred <- predict(lm_model, testX)
lm_resample <- postResample(lm_pred, testY)

lm_model

## Linear Regression 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1347626  0.3897977  0.1037653
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

lm_resample

##      RMSE  Rsquared       MAE 
## 0.1321767 0.4186977 0.1041143

7.1.2 Ridge Model

The lowest RMSE for the model was 0.1323847. The R2 was 0.4188048, which means 41.88% of the variance in pH is explained by the model. The MAE was 0.1043342, which is the average error between prediction and actual values. These results suggest that the Ridge model performs similarly to the linear regression model and is likely not the most effective choice for predicting pH values in this dataset.

set.seed(786)
ridge_model <- train(ph ~ ., data = train_data,
                     method = "glmnet",
                     preProcess = c("center", "scale"),
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = seq(0, 1, 0.1)))

ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)

ridge_model

## glmnet 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1848, 1850, 1849, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.0     0.1353139  0.3855554  0.1051096
##   0.1     0.1412178  0.3472476  0.1115090
##   0.2     0.1448983  0.3259585  0.1146440
##   0.3     0.1475069  0.3120219  0.1168361
##   0.4     0.1495342  0.3019645  0.1185909
##   0.5     0.1511908  0.2942820  0.1200542
##   0.6     0.1525791  0.2882702  0.1212828
##   0.7     0.1537686  0.2834310  0.1223356
##   0.8     0.1548160  0.2793599  0.1232619
##   0.9     0.1557314  0.2760084  0.1240749
##   1.0     0.1565512  0.2731256  0.1248068
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.

ridge_resample

##      RMSE  Rsquared       MAE 
## 0.1323847 0.4188048 0.1043342

7.1.3 Lasso Model

The Lasso Regression model achieved an RMSE of 0.1321259, with an R2 of 0.4191752, meaning it explains approximately 41.92% of the variance in pH. The MAE was 0.1039544, indicating the average error between predicted and actual values. These results are nearly identical to those of the linear and ridge regression models, suggesting that Lasso also may not offer a substantial improvement in predictive performance for this dataset.

set.seed(786)

# For lasso, any value of lambda over 0.07 produced missing values for `lasso_resample`

lasso_model <- train(ph ~ ., data = train_data,
                     method = "glmnet",
                     preProcess = c("center", "scale"),
                     trControl = trainControl(method = "cv", number = 5),
                     tuneGrid = expand.grid(alpha = 1, # set alpha = 1 for lasso                
                                            lambda = seq(0, 0.07, 0.001)))  
lasso_pred <- predict(lasso_model, testX)

lasso_resample <- postResample(lasso_pred, testY)

lasso_model

## glmnet 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.000   0.1347278  0.3900439  0.1037887
##   0.001   0.1350921  0.3865110  0.1047001
##   0.002   0.1356942  0.3818617  0.1055293
##   0.003   0.1366319  0.3743770  0.1065373
##   0.004   0.1376423  0.3662409  0.1076105
##   0.005   0.1385823  0.3587458  0.1086049
##   0.006   0.1394813  0.3515588  0.1095324
##   0.007   0.1403887  0.3441997  0.1103737
##   0.008   0.1413814  0.3357357  0.1112458
##   0.009   0.1420956  0.3301624  0.1118688
##   0.010   0.1426631  0.3262410  0.1123622
##   0.011   0.1432582  0.3220253  0.1128583
##   0.012   0.1438945  0.3173256  0.1133863
##   0.013   0.1445506  0.3123278  0.1139229
##   0.014   0.1451287  0.3081980  0.1143886
##   0.015   0.1455359  0.3062830  0.1147313
##   0.016   0.1459428  0.3044821  0.1150929
##   0.017   0.1463754  0.3024450  0.1154851
##   0.018   0.1468021  0.3005715  0.1158742
##   0.019   0.1472497  0.2984907  0.1162890
##   0.020   0.1477167  0.2961973  0.1167113
##   0.021   0.1482068  0.2936107  0.1171539
##   0.022   0.1487004  0.2909843  0.1175958
##   0.023   0.1492057  0.2882040  0.1180446
##   0.024   0.1497022  0.2855700  0.1184904
##   0.025   0.1501952  0.2829984  0.1189358
##   0.026   0.1506687  0.2807129  0.1193560
##   0.027   0.1511353  0.2785308  0.1197611
##   0.028   0.1515846  0.2766910  0.1201492
##   0.029   0.1520415  0.2747582  0.1205429
##   0.030   0.1525055  0.2727470  0.1209377
##   0.031   0.1529751  0.2706672  0.1213455
##   0.032   0.1534588  0.2683388  0.1217809
##   0.033   0.1539564  0.2657345  0.1222325
##   0.034   0.1544677  0.2628354  0.1226968
##   0.035   0.1549926  0.2596045  0.1231714
##   0.036   0.1555286  0.2560463  0.1236570
##   0.037   0.1560667  0.2523276  0.1241449
##   0.038   0.1566176  0.2481871  0.1246462
##   0.039   0.1571809  0.2435741  0.1251532
##   0.040   0.1577350  0.2389731  0.1256436
##   0.041   0.1582872  0.2341996  0.1261301
##   0.042   0.1588411  0.2290904  0.1266136
##   0.043   0.1594011  0.2235296  0.1270959
##   0.044   0.1599610  0.2176431  0.1275721
##   0.045   0.1604526  0.2129899  0.1280007
##   0.046   0.1609256  0.2083556  0.1284216
##   0.047   0.1613980  0.2034130  0.1288547
##   0.048   0.1617791  0.2011384  0.1292122
##   0.049   0.1621464  0.1993388  0.1295516
##   0.050   0.1624906  0.1980878  0.1298733
##   0.051   0.1628354  0.1968377  0.1301959
##   0.052   0.1631714  0.1960430  0.1305112
##   0.053   0.1634926  0.1960430  0.1308102
##   0.054   0.1638193  0.1960430  0.1311113
##   0.055   0.1641513  0.1960430  0.1314128
##   0.056   0.1644888  0.1960430  0.1317170
##   0.057   0.1648317  0.1960430  0.1320242
##   0.058   0.1651799  0.1960430  0.1323354
##   0.059   0.1655334  0.1960430  0.1326482
##   0.060   0.1658922  0.1960430  0.1329641
##   0.061   0.1662562  0.1960430  0.1332807
##   0.062   0.1666255  0.1960430  0.1336021
##   0.063   0.1669999  0.1960430  0.1339279
##   0.064   0.1673794  0.1960430  0.1342582
##   0.065   0.1677640  0.1960430  0.1346034
##   0.066   0.1681537  0.1960430  0.1349551
##   0.067   0.1685485  0.1960430  0.1353104
##   0.068   0.1689482  0.1960430  0.1356712
##   0.069   0.1693529  0.1960430  0.1360432
##   0.070   0.1697625  0.1960430  0.1364226
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.

lasso_resample

##      RMSE  Rsquared       MAE 
## 0.1321259 0.4191752 0.1039544

7.1.4 Elastic Net Model

The Elastic Net model produced an RMSE of 0.1321355, an R2 of 0.4192278 (meaning it explains approximately 41.92% of the variance in pH.), and an MAE of 0.1037287. These results are also almost identical to those of the linear, ridge, and lasso models, indicating that Elastic Net also does not offer a significant improvement in predictive performance of pH for this dataset.

set.seed(786)
# Elastic net
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))


enet_model <- train(ph ~ ., data = train_data,
                 method = "enet",
                 tuneGrid = enetGrid,
                 trControl = ctrl,
                 preProcess = pre_process)

enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)

enet_model

## Elasticnet 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.00    0.05      0.1588794  0.2288224  0.1266329
##   0.00    0.10      0.1505693  0.2813210  0.1192215
##   0.00    0.15      0.1456149  0.3058632  0.1147940
##   0.00    0.20      0.1425201  0.3271558  0.1122306
##   0.00    0.25      0.1404272  0.3439818  0.1104047
##   0.00    0.30      0.1388669  0.3567170  0.1089558
##   0.00    0.35      0.1377952  0.3653294  0.1078436
##   0.00    0.40      0.1368919  0.3725047  0.1069213
##   0.00    0.45      0.1361626  0.3783575  0.1061608
##   0.00    0.50      0.1356221  0.3825993  0.1055265
##   0.00    0.55      0.1353130  0.3849023  0.1050746
##   0.00    0.60      0.1350796  0.3867118  0.1047481
##   0.00    0.65      0.1349173  0.3879683  0.1045237
##   0.00    0.70      0.1347928  0.3890431  0.1043445
##   0.00    0.75      0.1347135  0.3897824  0.1042154
##   0.00    0.80      0.1346481  0.3904005  0.1040677
##   0.00    0.85      0.1346132  0.3907657  0.1039328
##   0.00    0.90      0.1346328  0.3906763  0.1038394
##   0.00    0.95      0.1346826  0.3903497  0.1037878
##   0.00    1.00      0.1347626  0.3897977  0.1037653
##   0.01    0.05      0.1607846  0.2087278  0.1282753
##   0.01    0.10      0.1530667  0.2705221  0.1213974
##   0.01    0.15      0.1479276  0.2952410  0.1168996
##   0.01    0.20      0.1445946  0.3117723  0.1139686
##   0.01    0.25      0.1422488  0.3288366  0.1120031
##   0.01    0.30      0.1405741  0.3422756  0.1105516
##   0.01    0.35      0.1392121  0.3534268  0.1093021
##   0.01    0.40      0.1382412  0.3613703  0.1083058
##   0.01    0.45      0.1374554  0.3676011  0.1074878
##   0.01    0.50      0.1367639  0.3732067  0.1067738
##   0.01    0.55      0.1361699  0.3780356  0.1061409
##   0.01    0.60      0.1357258  0.3815199  0.1056252
##   0.01    0.65      0.1354209  0.3838913  0.1052247
##   0.01    0.70      0.1352016  0.3856100  0.1049274
##   0.01    0.75      0.1350272  0.3870194  0.1047040
##   0.01    0.80      0.1349046  0.3880363  0.1045354
##   0.01    0.85      0.1348097  0.3889060  0.1044055
##   0.01    0.90      0.1347357  0.3895964  0.1042708
##   0.01    0.95      0.1347044  0.3899290  0.1041656
##   0.01    1.00      0.1347133  0.3899311  0.1041032
##   0.10    0.05      0.1635442  0.1960430  0.1308649
##   0.10    0.10      0.1573206  0.2448613  0.1252548
##   0.10    0.15      0.1524106  0.2741302  0.1208467
##   0.10    0.20      0.1487372  0.2906560  0.1176366
##   0.10    0.25      0.1459229  0.3042547  0.1151179
##   0.10    0.30      0.1439724  0.3137046  0.1134609
##   0.10    0.35      0.1424246  0.3251681  0.1121838
##   0.10    0.40      0.1411967  0.3344805  0.1111417
##   0.10    0.45      0.1401822  0.3421473  0.1102258
##   0.10    0.50      0.1393853  0.3486653  0.1094415
##   0.10    0.55      0.1387347  0.3542242  0.1087638
##   0.10    0.60      0.1381668  0.3591746  0.1081490
##   0.10    0.65      0.1376905  0.3633251  0.1076347
##   0.10    0.70      0.1372644  0.3670379  0.1071834
##   0.10    0.75      0.1369121  0.3700691  0.1068051
##   0.10    0.80      0.1366150  0.3726553  0.1064724
##   0.10    0.85      0.1363606  0.3749123  0.1061794
##   0.10    0.90      0.1361602  0.3767190  0.1059482
##   0.10    0.95      0.1360220  0.3780369  0.1057840
##   0.10    1.00      0.1359264  0.3790202  0.1056542
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.85 and lambda = 0.

enet_resample

##      RMSE  Rsquared       MAE 
## 0.1321355 0.4192278 0.1037287

7.1.5 Principal Component Regression (PCR)

The PCR model yielded an RMSE of 0.1561616, an R2 of 0.1890207, and an MAE of 0.1243395. Compared to the previous linear-based models, PCR shows poorer performance, explaining only about 18.9% of the variance in pH and producing higher prediction errors. This suggests that PCR is not a suitable model for predicting pH.

set.seed(786)
# Train PCR model
pcr_model <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "pcr",
                   trControl = ctrl,
                   preProcess = pre_process)

# Predict on test data
pcr_pred <- predict(pcr_model, select(test_data, -ph))

# Evaluate performance
pcr_resample <- postResample(pcr_pred, test_data$ph)

pcr_model

## Principal Component Analysis 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##   1      0.1711486  0.01681885  0.1372668
##   2      0.1578973  0.16178552  0.1253505
##   3      0.1567185  0.17433906  0.1240842
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

pcr_resample

##      RMSE  Rsquared       MAE 
## 0.1561616 0.1890207 0.1243395

7.1.6 Partial Least Squares (PLS)

The PLS model achieved an RMSE of 0.1324806, an R2 of 0.4159277, and an MAE of 0.1037519. These results are very similar to those of the linear, ridge, lasso, and elastic net models, indicating that PLS captures a comparable amount of variance in pH and provides similar predictive accuracy. However, like the other models, the RMSE and R² values are not strong enough to consider this a reliable model for predicting pH in this dataset.

set.seed(786)
pls_grid <- expand.grid(ncomp = seq(1, 15, by = 1))
pls_model <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "pls",
                   tuneGrid = pls_grid,
                   trControl = ctrl,
                   preProcess = pre_process)
pls_pred <- predict(pls_model, select(test_data, -ph))
pls_resample <- postResample(pls_pred, test_data$ph)

pls_model

## Partial Least Squares 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1532790  0.2103613  0.1212352
##    2     0.1458741  0.2848802  0.1137035
##    3     0.1423544  0.3186094  0.1105476
##    4     0.1399725  0.3406841  0.1089077
##    5     0.1376588  0.3624354  0.1063880
##    6     0.1363722  0.3745964  0.1060233
##    7     0.1357017  0.3809953  0.1050317
##    8     0.1353048  0.3847330  0.1047653
##    9     0.1350521  0.3871062  0.1045406
##   10     0.1348994  0.3885238  0.1042537
##   11     0.1348595  0.3888304  0.1041934
##   12     0.1347463  0.3898319  0.1041101
##   13     0.1347837  0.3894784  0.1041348
##   14     0.1347082  0.3900727  0.1039312
##   15     0.1347405  0.3898509  0.1039722
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 14.

pls_resample

##      RMSE  Rsquared       MAE 
## 0.1324806 0.4159277 0.1037519

7.2 Non-Linear Models: KNN, MARS, NN, SVM, SVM Polynomial

7.2.1 K-Nearest Neighbors (KNN)

kNN Model: The lowest RMSE for the model is at k = 9. The lowest RMSE was 0.1172830. The R2 was 0.550085, which means 55.0% of the variance in pH is explained by the kNN model. The MAE was 0.0869197, which is the average error between prediction and actual values. From these results, compared to the models tested, kNN explains a larger proportion of the variance in pH and provides more accurate predictions compared to the linear, ridge, lasso, elastic net, and PLS models. However, while kNN shows improved performance, the RMSE and R2 values are still not strong enough to consider this a highly reliable model for predicting pH in this dataset.

set.seed(786)
knn_grid <- expand.grid(k = seq(1, 21, by = 2))

knn_model <- train(ph ~ ., data = train_data,
                   method = "knn",
                   tuneGrid = knn_grid,
                   trControl = ctrl,
                   preProcess = pre_process)

knn_pred <- predict(knn_model, select(test_data, -ph))
knn_resample <- postResample(knn_pred, test_data$ph)

knn_model

## k-Nearest Neighbors 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    1  0.1549193  0.3416685  0.10585840
##    3  0.1307828  0.4452436  0.09564591
##    5  0.1267261  0.4672217  0.09411415
##    7  0.1252270  0.4781626  0.09444017
##    9  0.1246390  0.4833880  0.09453652
##   11  0.1252757  0.4790679  0.09498111
##   13  0.1265625  0.4688718  0.09586493
##   15  0.1272809  0.4626442  0.09670261
##   17  0.1277351  0.4594474  0.09711296
##   19  0.1282146  0.4554591  0.09754619
##   21  0.1292802  0.4458702  0.09833988
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

knn_resample

##      RMSE  Rsquared       MAE 
## 0.1172830 0.5500853 0.0869197

7.2.2 Multivariate Adaptive Regression Splines (MARS)

The MARS model achieved an RMSE of 0.11790237, an R² of 0.53866890, and an MAE of 0.08951604, indicating that 53.87% of the variance in pH is explained by the MARS model. These results show that MARS provides a decent performance, with an RMSE and R2 value indicating a decent fit and predictive accuracy. The performance is better than the linear models, suggesting that MARS captures more complex relationships in the data. Although the model performs well, the RMSE and R2 values are still not sufficient to consider this a highly reliable model for pH prediction in this dataset.

set.seed(786)
mars_grid <- expand.grid(degree = 1:2, nprune = seq(10, 30, by = 5))
mars_model <- train(x = select(train_data, -ph), y = train_data$ph,
                    method = "earth",
                    tuneGrid = mars_grid,
                    trControl = ctrl,
                    preProcess = pre_process)

mars_pred <- predict(mars_model, select(test_data, -ph))
mars_resample <- postResample(mars_pred, test_data$ph)

mars_model

## Multivariate Adaptive Regression Spline 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE       
##   1       10      0.1346239  0.3902344  0.10381266
##   1       15      0.1326133  0.4087794  0.10180204
##   1       20      0.1323283  0.4111927  0.10075282
##   1       25      0.1323283  0.4111927  0.10075282
##   1       30      0.1323283  0.4111927  0.10075282
##   2       10      0.1347855  0.3933993  0.10296016
##   2       15      0.1287403  0.4457438  0.09732794
##   2       20      0.1260527  0.4683132  0.09531076
##   2       25      0.1250275  0.4769808  0.09390600
##   2       30      0.1244525  0.4819273  0.09328216
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 30 and degree = 2.

mars_resample

##       RMSE   Rsquared        MAE 
## 0.11790237 0.53866890 0.08951604

7.2.3 Neural Network (NN)

The Neural Network model achieved an RMSE of 0.11433728, an R² of 0.56840994, and an MAE of 0.08432463. These results show a stronger performance than the rest of the models so far, with the lowest RMSE and the highest R2, indicating that it explains a 56.84% of variance in the pH values. The MAE is also relatively low, suggesting pH predictions with minimal absolute errors. However, like the past few nonlinear models, the RMSE and R2 values still suggest that there is room for improvement.

num_cores <- parallel::detectCores() - 1  # leave one core free
num_cores

## [1] 11

cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)

nnetGrid <- expand.grid(decay = c(0, 0.02, .1),
                        size = c(7, 9, 11),
                        bag = FALSE)
set.seed(786)
nnetTune <- train(ph ~ ., data = train_data,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 13 * (ncol(trainingX) + 1) + 13 + 1,
                  maxit = 1000)
nnetTune

## Model Averaged Neural Network 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE       
##   0.00    7    0.2310935  0.4314668  0.10412066
##   0.00    9    0.1455993  0.4536324  0.09055703
##   0.00   11    0.1785133  0.4197705  0.09827147
##   0.02    7    0.1164756  0.5454294  0.08539212
##   0.02    9    0.1165408  0.5469831  0.08492737
##   0.02   11    0.1201854  0.5306942  0.08542269
##   0.10    7    0.1175644  0.5376699  0.08724167
##   0.10    9    0.1172844  0.5426627  0.08658979
##   0.10   11    0.1160258  0.5546873  0.08552272
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 11, decay = 0.1 and bag = FALSE.

nn_pred <- predict(nnetTune, select(test_data, -ph))
nn_resample <- postResample(nn_pred, test_data$ph)
nn_resample

##       RMSE   Rsquared        MAE 
## 0.11433728 0.56840994 0.08432463

# Stop cluster
stopCluster(cl)
registerDoSEQ()

7.2.4 Support Vector Machines (SVM)

The SVM-RBF model achieved an RMSE of 0.11620320, an R2 of 0.55228523, and an MAE of 0.08752177 - with final values used for the model being sigma = 0.05 and C = 10. The RMSE value indicates a low error in prediction, and the R2 suggests a good level of explanatory power compared to other models, since 55.2% of the variance in pH is explained by the model. However, the SVM-RBF model performs better than linear models and some non-linear models, it still falls short of the Neural Network for both RMSE and R2. Nevertheless, while it provides a somewhat reliable approach for predicting pH values in this dataset, it still could use some improvement.

set.seed(786)

svm_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(0.1, 1, 10))
svm_model <- train(ph ~ ., data = train_data,
                   method = "svmRadial",
                   tuneGrid = svm_grid,
                   trControl = ctrl,
                   preProcess = pre_process)

svm_pred <- predict(svm_model, select(test_data, -ph))
svm_resample <- postResample(svm_pred, test_data$ph)

svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   sigma  C     RMSE       Rsquared   MAE       
##   0.01    0.1  0.1363996  0.3956185  0.10441674
##   0.01    1.0  0.1253245  0.4757716  0.09179527
##   0.01   10.0  0.1218226  0.5050738  0.08849894
##   0.05    0.1  0.1331903  0.4226648  0.10004960
##   0.05    1.0  0.1202833  0.5149270  0.08783407
##   0.05   10.0  0.1185918  0.5309967  0.08834310
##   0.10    0.1  0.1396690  0.3791954  0.10594163
##   0.10    1.0  0.1230643  0.4946691  0.09073942
##   0.10   10.0  0.1218007  0.5004429  0.09133235
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05 and C = 10.

svm_resample

##       RMSE   Rsquared        MAE 
## 0.11620320 0.55228523 0.08752177

7.2.5 SVM Polynomial

The SVM-Polynomial model achieved an RMSE of 0.12636749, an R2 of 0.47478046, and an MAE of 0.09250217 -with final values used for the model being degree = 2, scale = 0.01, and C = 2. These results place this model on the lower end in terms of performance for this dataset. While it outperforms simpler linear models like PLS and ridge regression in terms of predictive accuracy and variance explained, it does not outperform models like SVM with RBF kernel or the Neural Network. Overall, the SVM-Poly is not among the most reliable for this dataset.

set.seed(786)
svmGrid <- expand.grid(degree = 1:2, 
                       scale = c(0.01, 0.005, 0.001), 
                       C = 2^(-2:5))
svmP_model <- train(ph ~ ., data = train_data,
                  method = "svmPoly",
                  preProc = pre_process,
                  tuneGrid = svmGrid,
                  trControl = ctrl)

svmP_pred <- predict(svmP_model, select(test_data, -ph))
svmP_resample <- postResample(svmP_pred, test_data$ph)

svmP_model

## Support Vector Machines with Polynomial Kernel 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   degree  scale  C      RMSE       Rsquared   MAE       
##   1       0.001   0.25  0.1500682  0.2902258  0.11825797
##   1       0.001   0.50  0.1454214  0.3172556  0.11392465
##   1       0.001   1.00  0.1418150  0.3413470  0.11050947
##   1       0.001   2.00  0.1392212  0.3585987  0.10767681
##   1       0.001   4.00  0.1375350  0.3695964  0.10567779
##   1       0.001   8.00  0.1367009  0.3762804  0.10451907
##   1       0.001  16.00  0.1364567  0.3784029  0.10389322
##   1       0.001  32.00  0.1365448  0.3781879  0.10353572
##   1       0.005   0.25  0.1408354  0.3478939  0.10950501
##   1       0.005   0.50  0.1385566  0.3629925  0.10686241
##   1       0.005   1.00  0.1372550  0.3718124  0.10525600
##   1       0.005   2.00  0.1365453  0.3776144  0.10423486
##   1       0.005   4.00  0.1365016  0.3781480  0.10376921
##   1       0.005   8.00  0.1366220  0.3778237  0.10347086
##   1       0.005  16.00  0.1366588  0.3775229  0.10332553
##   1       0.005  32.00  0.1365817  0.3785518  0.10318452
##   1       0.010   0.25  0.1385577  0.3629794  0.10686348
##   1       0.010   0.50  0.1372559  0.3718060  0.10525661
##   1       0.010   1.00  0.1365438  0.3776293  0.10423471
##   1       0.010   2.00  0.1365024  0.3781366  0.10376845
##   1       0.010   4.00  0.1366236  0.3778139  0.10347003
##   1       0.010   8.00  0.1366597  0.3775137  0.10332838
##   1       0.010  16.00  0.1365826  0.3785592  0.10318481
##   1       0.010  32.00  0.1367319  0.3775373  0.10308037
##   2       0.001   0.25  0.1452417  0.3195940  0.11376678
##   2       0.001   0.50  0.1414933  0.3449001  0.11022507
##   2       0.001   1.00  0.1386449  0.3643305  0.10714153
##   2       0.001   2.00  0.1365714  0.3793489  0.10464173
##   2       0.001   4.00  0.1349484  0.3927196  0.10272931
##   2       0.001   8.00  0.1333085  0.4065537  0.10068073
##   2       0.001  16.00  0.1310891  0.4258930  0.09812315
##   2       0.001  32.00  0.1290090  0.4434096  0.09551423
##   2       0.005   0.25  0.1357112  0.3910420  0.10408214
##   2       0.005   0.50  0.1326874  0.4152589  0.10052062
##   2       0.005   1.00  0.1298085  0.4382866  0.09714382
##   2       0.005   2.00  0.1276990  0.4553732  0.09436977
##   2       0.005   4.00  0.1265290  0.4640581  0.09263855
##   2       0.005   8.00  0.1257959  0.4698461  0.09161152
##   2       0.005  16.00  0.1259687  0.4701945  0.09114868
##   2       0.005  32.00  0.1264498  0.4689968  0.09132417
##   2       0.010   0.25  0.1303526  0.4361562  0.09788445
##   2       0.010   0.50  0.1279766  0.4543567  0.09479150
##   2       0.010   1.00  0.1265030  0.4645966  0.09271112
##   2       0.010   2.00  0.1257891  0.4701290  0.09170759
##   2       0.010   4.00  0.1259083  0.4704585  0.09115160
##   2       0.010   8.00  0.1263425  0.4697251  0.09130561
##   2       0.010  16.00  0.1278630  0.4628675  0.09224004
##   2       0.010  32.00  0.1305962  0.4485015  0.09392808
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.01 and C = 2.

svmP_resample

##       RMSE   Rsquared        MAE 
## 0.12636749 0.47478046 0.09250217

7.3 Tree-based models: Single Tree, Random Forest, Conditional Inference Random Forest, GBM, Cubist, Bagged Tree

7.3.1 Single Regression Tree

The CART model achieved an RMSE of 0.12794622, an R² of 0.45794133, and an MAE of 0.09678968, with the final value used for the model being cp = 0.01. These results indicate mediocore predictive performance, slightly better than linear and PLS models, but also falling short of more complex models like SVM-RBF and NN. The model captures some of the variance (45.79%) in pH but still leaves a considerable amount unexplained, making it only somewhat effective for prediction in this dataset. Therefore, this would not be the model of choice.

set.seed(786)
rpartFit <- train(ph ~ ., data = train_data,
                  method = "rpart",
                  trControl = ctrl,
                  tuneGrid = data.frame(cp = c(0.01, 0.05, 0.1)),
                  preProcess = pre_process)
rpartFit

## CART 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared   MAE       
##   0.01  0.1297540  0.4366423  0.09872245
##   0.05  0.1492260  0.2510414  0.11669494
##   0.10  0.1526919  0.2163947  0.11970553
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.

#predict the permeability response for the test data
rpartPred <- predict(rpartFit, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
rpartResample <- postResample(rpartPred, test_data$ph)

rpartResample

##       RMSE   Rsquared        MAE 
## 0.12794622 0.45794133 0.09678968

7.3.2 Random Forest

The Random Forest model achieved an RMSE of 0.09969844, an R2 of 0.67283982, and an MAE of 0.07037370, making it the most accurate model among all those evaluated so far. It outperformed linear, PLS, and CART in both error and variance explained. The high R2 (67.28%) indicates that Random Forest captured a significant portion of the variability in pH. The final value used for the model was mtry = 31. This result highlights Random Forest as a highly effective modeling approach for this dataset, for predictive accuracy and capturing variance.

set.seed(786)

# default mmtry is number of predictors  divided by 3 , so around 10.
#mtry is the number of predictors sampled for each split.
rfGrid <- expand.grid(mtry=seq(2,38,by=3) )


rfModel <- train(x = select(train_data, -ph), y = train_data$ph, 
                 method = "rf", 
#                 tuneGrid = rfGrid, 
                 importance = TRUE, 
                 trControl = ctrl,
                 ntree = 50,
                 preProcess=pre_process)

#validation plot
plot(rfModel)

rfModel

## Random Forest 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1158778  0.5846496  0.08693333
##   16    0.1049511  0.6383291  0.07587949
##   31    0.1036612  0.6423721  0.07401761
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 31.

#predict the permeability response for the test data
rfPred <- predict(rfModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
rfResample <- postResample(rfPred, test_data$ph)

rfResample

##       RMSE   Rsquared        MAE 
## 0.09969844 0.67283982 0.07037370

7.3.3 Random Forest with OOB Estimate

The Random Forest model with OOB estimates achieved an RMSE of 0.09926432, an R2 of 0.67786310, and an MAE of 0.07146070, making it slightly more accurate and reliable than the Random Forest, and the best overall so far. The final value used for the model was also mtry = 31, which is the maximum number of predictors available. The high R2 (67.79%) indicates that the model explained a good proportion of the variance in pH, while the low RMSE and MAE reflect high predictive accuracy. This reinforces Random Forest as the strongest model for predicting pH so far.

# Tune the model using the OOB estimates
ctrlOOB <- trainControl(method = "oob")
set.seed(786)
rfTuneOOB <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "rf",
                   # tuneGrid = mtryGrid,
                   ntree = 50,
                   importance = TRUE,
                   trControl = ctrlOOB,
                   preProcess=pre_process)
rfTuneOOB

## Random Forest 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared 
##    2    0.1168861  0.5396115
##   16    0.1030697  0.6420183
##   31    0.1026615  0.6448478
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 31.

plot(rfTuneOOB)

#predict the permeability response for the test data
rfOOBPred <- predict(rfTuneOOB, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
rfOOBResample <- postResample(rfOOBPred, test_data$ph)

rfOOBResample

##       RMSE   Rsquared        MAE 
## 0.09926432 0.67786310 0.07146070

7.3.4 Conditional Inference Random Forest

The Conditional Inference Random Forest model achieved an RMSE of 0.10561362, an R2 of 0.64000809, and an MAE of 0.07969956. These results demonstrate strong predictive performance, but fell slightly short of the regular Random Forest model, which produced lower error rates and higher R2. The final value used for the model was mtry = 16. Overall, this model performs relatively well but not the best performer so far.

# Conditional Inference Forests
mtryGrid <- data.frame(mtry = floor(seq(10, ncol(select(train_data, -ph)), length = 10)))
set.seed(786)
condrfTune <- train(x = select(train_data, -ph), y = train_data$ph,
                    method = "cforest",
                    # tuneGrid = mtryGrid,
                    controls = cforest_unbiased(ntree = 50),
                    trControl = ctrl,
                    preProcess=pre_process)
condrfTune

## Conditional Inference Random Forest 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1272338  0.5006288  0.09855281
##   16    0.1116897  0.5885541  0.08292328
##   31    0.1123024  0.5785856  0.08288324
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.

#predict the permeability response for the test data
condrfPred <- predict(condrfTune, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
condrfResample <- postResample(condrfPred, test_data$ph)

condrfResample

##       RMSE   Rsquared        MAE 
## 0.10561362 0.64000809 0.07969956

7.3.5 Conditional Inference OOB Estimate

The Conditional Inference Random Forest model with OOB achieved an RMSE of 0.10509848, an R2 of 0.64372448, and an MAE of 0.07930848. These values reflect similar predictive performance to the regular conditional inference - under performing the Random Foest. The final tuning parameter used was mtry = 16. While slightly behind the standard Random Forest, this model still demonstrates solid capability in predicting pH.

# Tune the model using the OOB estimates
set.seed(786)
condrfTuneOOB <- train(x = select(train_data, -ph), y = train_data$ph,
                   method = "cforest",
                   # tuneGrid = mtryGrid,
                   controls = cforest_unbiased(ntree = 50),
                   trControl = ctrlOOB,
                   preProcess = pre_process)
condrfTuneOOB

## Conditional Inference Random Forest 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1244314  0.5183411  0.09510416
##   16    0.1103118  0.5962083  0.08131875
##   31    0.1106121  0.5894080  0.08066653
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.

plot(condrfTuneOOB)

#predict the permeability response for the test data
condrfOOBPred <- predict(condrfTuneOOB, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
condrfOOBResample <- postResample(condrfOOBPred, test_data$ph)

condrfOOBResample

##       RMSE   Rsquared        MAE 
## 0.10509848 0.64372448 0.07930848

7.3.6 Gradient Boosting Model(GBM)

The final model selected was based on n.trees = 150 and interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. The Stochastic Gradient Boosting model achieved an RMSE of 0.1146917, indicating that the model’s predictions are off by this amount on average. The R2 value of 0.56474918 means that approximately 56.47% of the variance in the pH variable is explained by the model. The MAE was 0.09061361, indicating that, on average, the model’s predictions differ from the true values by about 0.09 units. Compared to the other models, the boosted model did not perform as good as the Random Forest and Neural Network models since the R-squared, especially, is somewhat moderate, meaning the model is not fully capturing the variability in the data. As a result, it may not be the most effective model for predicting pH in this dataset.

set.seed(786)

gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
                       n.trees=c(100,200),
                       shrinkage=c(0.01,0.1,0.2),
                       n.minobsinnode=5)


# Distribution defines the type of loss function that will be optimized during boosting.
# for continuous variable, it should be gaussian

gbmModel <-train(x = select(train_data, -ph), y = train_data$ph,
                 method = "gbm",
                 verbose = FALSE, 
                 trControl = ctrl, 
                 preProcess=pre_process)

gbmModel

## Stochastic Gradient Boosting 
## 
## 2055 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   1                   50      0.1366762  0.3900967  0.10765403
##   1                  100      0.1325738  0.4169541  0.10387050
##   1                  150      0.1310521  0.4264272  0.10206527
##   2                   50      0.1295610  0.4496512  0.10130993
##   2                  100      0.1254424  0.4757725  0.09687421
##   2                  150      0.1241387  0.4831551  0.09505848
##   3                   50      0.1262007  0.4726839  0.09761110
##   3                  100      0.1222503  0.4988562  0.09309484
##   3                  150      0.1204027  0.5123995  0.09138077
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

#predict the permeability response for the test data
gbmPred <- predict(gbmModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
gbmResample<- postResample(gbmPred, test_data$ph)

gbmResample

##       RMSE   Rsquared        MAE 
## 0.11469171 0.56474918 0.09061361

7.3.7 Cubist

The best performance for Cubist was obtained with committees = 20 and neighbors = 5. The final RMSE on the test set was 0.09238683, which means the model’s predictions are off by about this amount on average. The R2 value of 0.7208153 indicates that the model explains about 72.08% of the variance in pH, showing a relatively strong relationship between the predictors and the target variable. The MAE was 0.06862500, meaning that, on average, the predictions differ from the true values by approximately 0.069 units. Compared to the previous models, the Cubist model demonstrates the best predictive performance (out of the models tested) with the lowest RMSE and a highest R-squared. The R-squared value indicates that the model is able to capture a relatively decent portion of the variability in the data, and the low RMSE suggests that it is making relatively accurate predictions.

set.seed(786)
cubistModel <- train(x = select(train_data, -ph), y = train_data$ph,
                 method = "cubist",
                 verbose = FALSE, 
                 preProcess=pre_process)

cubistModel$bestTune

##   committees neighbors
## 9         20         9

#predict the permeability response for the test data
cubistPred <- predict(cubistModel, select(test_data, -ph))

#compare the variance of the predicted values to the test values.
cubistResample <-postResample(cubistPred, test_data$ph)

cubistResample

##       RMSE   Rsquared        MAE 
## 0.09238683 0.72081531 0.06862500

7.3.8 Bagged Trees

The Bagged Trees model achieved an RMSE of 0.1204170, an R² of 0.5211945, and an MAE of 0.0928338. While it outperforms simpler models like CART and linear regression models, it falls short compared to the Random Forest and Cubist model. Its moderate R² (52.12%) suggests it captures just over half of the variance in pH, and although the error metrics are reasonable, they are not low enough to consider this model highly accurate. Overall, Bagged Trees offer modest predictive performance but are not among the top-performing models for this dataset.

set.seed(786)

baggedModel <- train(x = select(train_data, -ph), 
                     y = train_data$ph,
                     method = "treebag",
                     verbose = FALSE, 
#                     trControl = ctrl, 
                     preProcess = pre_process)

baggedModel$bestTune

##   parameter
## 1      none

# Predict the permeability response for the test data
baggedPred <- predict(baggedModel, select(test_data, -ph))

# Compare the variance of the predicted values to the test values
baggedResample <- postResample(baggedPred, test_data$ph)
baggedResample

##      RMSE  Rsquared       MAE 
## 0.1204170 0.5211945 0.0928338

8 Model Evaluation

Linear Regression and its variants: Among the linear models, Elastic Net regression performed the best (~41.9% of the variance in the pH) and PCR performed the worst. Hence dimensionality reduction does not seem desirable. The data does not have a linear relationship as suggested by the outcome.
Non-Linear Regression: Among the Non-Linear models, Neural Network performed the best (~56.8% of the variance in the pH) - while also performing better than all of the linear regression models.
Tree/rule based: Among the Tree/rule-based models, the Cubist performed the best (~ 72% of the variance in the pH). All of the tree-based models performed better than both the linear and non-linear regression models.
Overall: Based on R squared as the performance metric, “Cubist” model performed the best in optimal resampling and test set performance. The model explains ~72% of the variance in the pH variable. The predictions are off by just 0.09 pH units. The average absolute error is less than 0.07, showing consistent accuracy.

# Find the best model
model_results <- rbind(
  lm_resample,
  ridge_resample,
  lasso_resample,
  enet_resample,
  pcr_resample,
  pls_resample,
  knn_resample,
  mars_resample,
  nn_resample,
  svm_resample,
  svmP_resample,
  rpartResample,
  rfResample,
  rfOOBResample,
  condrfResample,
  condrfOOBResample,
  gbmResample,
  baggedResample,
  cubistResample
)

rownames(model_results) <- c("Linear Regression", "Ridge", "Lasso", "Elastic Net", "PCR", "PLS", "KNN", "MARS", "Neural Network", "SVM", "SVM Polynomial", "Single Tree", "Random Forest", "Random Forest OOB", "Conditional Inference Random Forest", "Conditional Inference Random Forest OOB", "GBM", "Bagged", "Cubist")

# Sort by Rsquared
model_results[order(model_results[,2]), ]

##                                               RMSE  Rsquared        MAE
## PCR                                     0.15616156 0.1890207 0.12433952
## PLS                                     0.13248056 0.4159277 0.10375194
## Linear Regression                       0.13217668 0.4186977 0.10411425
## Ridge                                   0.13238471 0.4188048 0.10433420
## Lasso                                   0.13212593 0.4191752 0.10395444
## Elastic Net                             0.13213547 0.4192278 0.10372870
## Single Tree                             0.12794622 0.4579413 0.09678968
## SVM Polynomial                          0.12636749 0.4747805 0.09250217
## Bagged                                  0.12041701 0.5211945 0.09283380
## MARS                                    0.11790237 0.5386689 0.08951604
## KNN                                     0.11728297 0.5500853 0.08691970
## SVM                                     0.11620320 0.5522852 0.08752177
## GBM                                     0.11469171 0.5647492 0.09061361
## Neural Network                          0.11433728 0.5684099 0.08432463
## Conditional Inference Random Forest     0.10561362 0.6400081 0.07969956
## Conditional Inference Random Forest OOB 0.10509848 0.6437245 0.07930848
## Random Forest                           0.09969844 0.6728398 0.07037370
## Random Forest OOB                       0.09926432 0.6778631 0.07146070
## Cubist                                  0.09238683 0.7208153 0.06862500

In the cubist model, among the Top 10 predictors Mnf_Flow, Balling, Balling_Lvl, Pressure_Vacuum, Alch_Rel etc. are important features.

# Find the top predictors
top_predictors <- data.frame(varImp(cubistModel,scale = TRUE)$importance)

top_predictors |>
  arrange(desc(Overall))

##                      Overall
## mnf_flow          100.000000
## balling            85.000000
## alch_rel           74.166667
## density            68.333333
## balling_lvl        67.500000
## pressure_vacuum    66.666667
## air_pressurer      57.500000
## oxygen_filler      56.666667
## carb_rel           54.166667
## hyd_pressure3      50.000000
## bowl_setpoint      49.166667
## temperature        49.166667
## carb_pressure1     48.333333
## carb_flow          42.500000
## usage_cont         39.166667
## filler_speed       37.500000
## hyd_pressure2      37.500000
## filler_level       35.833333
## brand_code         32.500000
## carb_volume        23.333333
## hyd_pressure4      23.333333
## pressure_setpoint  15.833333
## pc_volume          15.833333
## carb_pressure      14.166667
## fill_pressure      14.166667
## carb_temp          11.666667
## fill_ounces         6.666667
## mfr                 5.833333
## psc                 2.500000
## psc_fill            2.500000
## psc_co2             0.000000

top_10_predictors <- top_predictors  |>
  arrange(desc(Overall)) |>
    slice_head(n = 10)

# 1. Get variable importance
cubistImp <- varImp(cubistModel)
# 2. Plot simple barplot
plot(caret::varImp(cubistModel), main = "Cubist Regression Model Variable Importance")

# Correlation 
data |>
  select(c(ph, row.names(top_10_predictors))) |>
  cor(use = "complete.obs") |>
  corrplot(method = "circle", type = "upper", tl.cex = 0.9, tl.srt = 45)

9 Final Model and Predictions

After the model is trained, we use it to evaluate against the StudentEvaluation data which was provided to us in a separate file.

ph_prediction <- predict(cubistModel,test_eval)

test_eval$PH <- ph_prediction

#Predicted pH vs train data
xyplot(test_eval$PH ~ data$ph | data$brand_code,
       xlab = "Actual pH",
       ylab = "Predicted pH",
       pch = 16,
       col = "darkgreen",
       main = "Predicted vs Actual pH by Brand Code"
)

library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.4.3

# Create a new workbook
wb <- createWorkbook()


# Add the sheets
addWorksheet(wb, "Predictions")

# Write data to sheet
writeData(wb,sheet = "Predictions" , test_eval)


# Save the workbook
saveWorkbook(wb, "StudentPrediction.xlsx", overwrite = TRUE)

10 Conclusion

Our goal is to use the Beverage dataset which has 32 feature variables and a target variable pH, train and compare several predictive modeling approaches, and select the best-performing model. By using the best model, predict the pH for the Evaluation dataset.

This predictive model will enable ABC Beverage to generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.

By diving into the beverage dataset, we explored 33 features, tackled missing values with MICE imputation, and removed low-variance predictors like hyd_pressure1 to streamline our dataset. Our feature selection revealed key drivers of pH, with Mnf_Flow, Balling, and Pressure_Vacuum topping the list, guiding our modeling choices.

We tested a wide range of models—Linear Regression, Ridge, Lasso, Elastic Net, PCR, PLS, KNN, MARS, Neural Networks, SVM, Random Forest, GBM, Cubist, and Bagged Trees—using 5-fold cross-validation and standardized preprocessing. While linear models like Elastic Net (RMSE: 0.1321, R-squared: 0.4192) handled multicollinearity well, and nonlinear models like Neural Network (RMSE: 0.1143 , R-squared: 0.5684) captured complex patterns, tree-based models shone. Random Forest performed strongly (RMSE: 0.0997, R-squared: 0.6728), but Cubist performed the best with an RMSE of 0.092, R-squared of 0.72, and MAE of 0.068 on the test set, balancing accuracy and interpretability.

Manufacturing flow rate(mnf_flow), Measurement of sugar content based on the Balling scale(balling), alcohol release, density, Sugar content level(balling_lvl) and pressure_vaccum are key to our model’s performance.

From a business perspective, a model explaining 72% of the pH variance may not be considered strong enough for predicting pH. We believe strong domain knowledge would be necessary to take this model to the next level. Additional modeling gains could be made by creating additional features and more feature engineering from this dataset.

By using a more balanced dataset, researching and analyzing combination features, we can aim to achieve better performance. The final predictions generated for the target variable pH using Cubist model, are all in the range between 8 and 9 with an exception at 9.19. The patterns are still upheld in our predictions when they are grouped by the Brand code. We also highlighted the variables that have the most effect on the PH. We hope that this information provides adequate details to address the new regulations in the beverage industry.

DATA624 - Final Project

ABC Beverage — pH Levels Prediction

Farhana Akther, Nakesha Fray, Gillian McGovern, Dhanya Nair, Said Naqwe

2025-05-18