Farhana Akther, Nakesha Fray, Gillian McGovern, Dhanya Nair, Said Naqwe
ABC Beverage Company is responding to new manufacturing regulations that require better understanding and control of beverage pH levels. As part of the Data Science team, our task is to analyze beverage data, identify the key factors affecting pH, and develop a predictive model.
This report provides:
- An exploration of the provided data
- Testing of multiple predictive models
- Selection of the best-performing model
- Final predictions of pH values based on the selected model
The results will be used to support regulatory compliance and optimize production processes.
PH background: In beverages, pH is a measure of acidity or alkalinity, on a scale from 0 (very acidic) to 14 (very alkaline), with 7 being neutral (like pure water).
pH matters in beverages due to following
factors:
pH affects sourness/sharpness — lower pH = more acidic/sour
Lower pH (more acidic) inhibits microbial growth
pH needs to be controlled to meet safety standards
pH can indicate ingredient variability or equipment issues
https://www.jaidevelopment.com/blog/the-importance-of-ph-and-acidity-in-beverage-formulation#
Our primary goal is to develop an accurate predictive model for beverage pH levels using historical manufacturing data. We aim to identify the key production variables that influence pH, evaluate and compare multiple predictive modeling approaches, and select the model that provides the best balance of accuracy and interpretability. Using the selected model, we will generate reliable pH predictions to support production optimization and regulatory reporting. By achieving this goal, ABC Beverage Company will be better equipped to monitor, control, and maintain pH levels within required standards, ensuring both compliance and product consistency.
We begin by importing this beverage dataset and performing an initial exploration to understand its structure and key characteristics, starting with loading the necessary libraries.
Source: Beverage dataset (StudentData.xlsx): 2571 rows x 33 columns. Evaluation dataset (StudentEvaluation.xlsx): 267 rows x 33 columns. Beverage dataset captures the various aspects of the beverage manufacturing process, including product characteristics, equipment readings, flow rates, pressures, and temperatures. The target variable for prediction is pH, a continuous measure reflecting the acidity or alkalinity of the beverage.
Let’s check the first few rows of the data set to confirm that all the features and observations are loaded correctly.
# View the first few rows
head(data)
## # A tibble: 6 × 33
## brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp psc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2 141. 0.104
## 2 A 5.43 24.0 0.239 68.4 140. 0.124
## 3 B 5.29 24.1 0.263 70.8 145. 0.09
## 4 A 5.44 24.0 0.293 63 133. NA
## 5 A 5.49 24.3 0.111 67.2 137. 0.026
## 6 A 5.38 23.9 0.269 66.6 138. 0.09
## # ℹ 26 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## # carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure1 <dbl>,
## # hyd_pressure2 <dbl>, hyd_pressure3 <dbl>, hyd_pressure4 <dbl>,
## # filler_level <dbl>, filler_speed <dbl>, temperature <dbl>,
## # usage_cont <dbl>, carb_flow <dbl>, density <dbl>, mfr <dbl>, balling <dbl>,
## # pressure_vacuum <dbl>, ph <dbl>, oxygen_filler <dbl>, bowl_setpoint <dbl>,
## # pressure_setpoint <dbl>, air_pressurer <dbl>, alch_rel <dbl>, …
We can see that everything loaded correctly let’s check the structure of the data set to confirm that data types of the features.
glimpse(data)
## Rows: 2,571
## Columns: 33
## $ brand_code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4 <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
We can see that all the predictors are numeric with double datatype
except for the brand code. The Brand Code variable is currently stored
as a character type, and it may be more appropriate to convert it to a
factor since it represents a category rather than free text.
Additionally, there are some missing values observed in several
features, including mfr
, filler_speed
,
psc
, psc_co2
, and others. We will address
these missing values during data pre-processing.
summary(data)
## brand_code carb_volume fill_ounces pc_volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## carb_pressure carb_temp psc psc_fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## filler_level filler_speed temperature usage_cont carb_flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## density mfr balling pressure_vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## ph oxygen_filler bowl_setpoint pressure_setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## air_pressurer alch_rel carb_rel balling_lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
The summary statistics show that most features are reasonably scaled with some outliers or extreme values, and a few variables have missing data that will need to be addressed during pre-processing.
dim(data)
## [1] 2571 33
The dataset provided contains 2,571 observations and 33 variables, including product characteristics, equipment readings, flow rates, pressures, and temperatures. The target variable for prediction is pH, a continuous measure reflecting the acidity or alkalinity of the beverage.
data %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
filter(!is.na(Value)) %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
The histograms above show the distributions of the numerical features in the dataset. Several variables, such as Carb Pressure, Carb Temp, Carb Volume, Fill Ounces, Pressure Vacuum, and PC Volume, appear approximately normally distributed.
Other features like filler_speed
, mfr
, and
oxygen_filler
either show signs of skewness and contain a
few significant outliers.
Our target variable, PH
, also follows a roughly normal
distribution overall. However, since the dataset includes a variety of
branded beverages with differing fill levels, pressures, carbonation
temperatures, and other characteristics, it’s important to examine the
PH
distribution within each beverage group
data |>
select(ph) |>
ggplot() +
aes(x = ph) +
geom_histogram(bins= 40,fill = "blue", color = "black") +
labs(title = "Distribution of pH", y = "Count") +
theme_minimal()
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
data |>
ggplot() +
aes(x = ph) +
geom_histogram(bins= 40, fill = "lightblue", color = "black") +
labs(title = "Distribution of pH by Brand", y = "Count" ) +
facet_wrap(~ `brand_code`, scales = 'free_y')
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
The overall pH distribution is left skewed and is bi-modal. It indicates that there is grouping within it and hence we check by each brand code.
When we group the PH
by the brand code, we see that C
& D only look normally distributed whereas A, B are multi-modal.
Also there are a lot of NA values.The distribution around
PH
is better as logged and will help us with the
prediction.
Since brand_code
is a character right now but actually
represents categories (A, B, C…), we should convert it to a factor
before plotting or using it in modeling.
# Convert Brand Code to a factor
data <- data %>%
mutate(`brand_code` = as.factor(`brand_code`))
# Plot Brand Code distribution
ggplot(data, aes(x = `brand_code`)) +
geom_bar(fill = "skyblue", color = "black") +
ggtitle("Distribution of Brand Code") +
theme_minimal()
data %>%
select(`brand_code`, ph) %>%
ggplot() +
aes(x = `brand_code`, y = ph) +
geom_boxplot(color = 'skyblue', outlier.color = 'red')+
labs(title = 'Distribution of PH by Brand Code',
x = element_blank(),
y = 'pH')
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The dataset includes four distinct Brand Codes: A, B, C, and D. Brand B has the highest number of observations, followed by Brand D, while Brands A and C have fewer observations. Additionally, there are some missing values (NA) in the Brand Code column that will need to be addressed during data pre-processing.
The box plot shows that there is variation in pH based on brand code.
C has the lowest PH
with a very high outlier. The NA value
showing PH
of 8.5 is seen to be the data of A. D has the
highest PH
with only one outlier.
As observed by the box plot, pH varies from 7.8 to 9.6 . This indicates the beverage is clearly basic (alkaline), and that’s very uncommon for traditional commercial beverages. The product could be Alkaline water, some enhanced waters used for specialty or specific formulations.
Since pH is a continuous outcome, this confirms that we are dealing with a regression problem. As a first approach, applying linear regression is appropriate and may provide reasonable performance. However, not all predictors are perfectly symmetric or linear, and some exhibit skewness or potential outliers. If linear regression performance is inadequate, we may consider alternative regression techniques such as Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), Boosted Trees, Random Forests, or Neural Networks to improve predictive accuracy.
In the previous section, we observed that several columns contain missing values. In this section, we will explore different techniques to handle these missing values through imputation. Before proceeding with imputation, we first summarize the number of missing values present in each column.
data %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 31 × 2
## variable missing_count
## <chr> <int>
## 1 mfr 212
## 2 brand_code 120
## 3 filler_speed 57
## 4 pc_volume 39
## 5 psc_co2 39
## 6 fill_ounces 38
## 7 psc 33
## 8 carb_pressure1 32
## 9 hyd_pressure4 30
## 10 carb_pressure 27
## # ℹ 21 more rows
The MFR
column has the highest number of missing values,
accounting for approximately 8.24% of the total observations. Generally,
many data scientists recommend removing columns with more than 5%
missing data to maintain model reliability. The brand_code
column also has missing values, representing about 4.6% of the total
observations.
Also we observe that 4 observations do not have a PH
value. Hence we need to remove these observations.
To handle missing values in the remaining columns, we will use the mice package for imputation. However, before proceeding, we need to address the column names by removing any spaces, as the mice functions require variable names without spaces to operate properly.
To address missing data in our dataset, we use the MICE (Multivariate Imputation by Chained Equations) technique. MICE is a flexible and widely adopted method that models each incomplete variable conditionally on the other variables, iteratively generating plausible values instead of relying on a single guess. This approach captures the uncertainty around missing data and helps preserve relationships between features.
For the imputation model, we select Random Forests (RF) as the method. Random Forest is a non-parametric, ensemble learning algorithm capable of modeling complex nonlinear relationships and interactions without requiring strict assumptions about the data. This makes it particularly well-suited for imputing both numeric and categorical features in real-world manufacturing datasets like ours.
Before performing imputation, we first visualize the extent and structure of missingness across the dataset.
# Missingness plot
gg_miss_var(data)
The missing value plot highlights that mfr
had the
highest number of missing values, followed by brand_code
and filler_speed
. Several other variables exhibited
moderate missingness, while many features were nearly complete.
There are 4 observations with missing PH
value - these
must be removed. This visualization confirmed the need to apply an
imputation strategy before proceeding to model development.
# Remove rows with missing target variable (pH)
data <- data[!is.na(data$ph), ]
set.seed(786)
# MICE imputation
imputed_data <- mice(data, m = 1, method = 'rf', print = FALSE)
# Extract the completed dataset
data <- complete(imputed_data)
# Check for any remaining missing values
colSums(is.na(data))
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## 0 0 0 0
## filler_level filler_speed temperature usage_cont
## 0 0 0 0
## carb_flow density mfr balling
## 0 0 0 0
## pressure_vacuum ph oxygen_filler bowl_setpoint
## 0 0 0 0
## pressure_setpoint air_pressurer alch_rel carb_rel
## 0 0 0 0
## balling_lvl
## 0
After performing Random Forest-based imputation using MICE, all missing values in the dataset have been successfully handled. We verified the success of the imputation by checking for any remaining missing values using colSums(is.na(data)), which confirmed that the dataset is now fully complete and ready for further modeling and analysis.
After handling missing data, it is important to address predictors that provide little to no useful information for modeling. Predictors with near-zero variance contain very little variability and do not contribute meaningfully to predictive power. Therefore, we identify and remove such predictors to streamline the modeling process and improve model performance.
nzv_cols <- nearZeroVar(data, names = TRUE)
nzv_cols
## [1] "hyd_pressure1"
#Remove only near-zero variance columns
if(length(nzv_cols) > 0) {
data <- data[, !colnames(data) %in% nzv_cols]
}
dim(data) # Rows and columns after applying -nearZeroVar
## [1] 2567 32
In this case, only the hyd_pressure1
variable was
removed due to its near-zero variance. All other predictors were
retained for further analysis.
There are 13 outliers using IQR method. However the values are not drastically high or low than the mean distribution we observed in the box plot. Hence these observations do not have to be removed from the dataset.
# Calculate IQR-based outlier thresholds within each brand
outliers_by_brand <- data %>%
group_by(brand_code) %>%
mutate(
Q1 = quantile(ph, 0.25, na.rm = TRUE),
Q3 = quantile(ph, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
lower = Q1 - 1.5 * IQR,
upper = Q3 + 1.5 * IQR,
is_outlier = (ph < lower | ph > upper)
) %>%
filter(is_outlier == TRUE)
# View the outliers
print(outliers_by_brand)
## # A tibble: 14 × 38
## # Groups: brand_code [4]
## brand_code carb_volume fill_ounces pc_volume carb_pressure carb_temp psc
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C 5.22 23.9 0.391 63.6 139 0.002
## 2 A 5.55 24.0 0.389 72.4 144. 0.0560
## 3 D 5.31 24.0 0.317 66.8 141. 0.098
## 4 A 5.63 24.3 0.182 70.6 137. 0.068
## 5 A 5.45 24.0 0.212 71.2 143 0.116
## 6 A 5.4 24.0 0.219 62.4 133. 0.062
## 7 A 5.37 24.0 0.227 68.2 141. 0.074
## 8 B 5.43 24.0 0.258 63 134 0.072
## 9 B 5.43 24 0.227 70.8 144. 0.05
## 10 B 5.34 24 0.245 66.4 140. 0.116
## 11 B 5.34 24.1 0.294 68.4 142. 0.106
## 12 B 5.35 24.1 0.294 66.8 140. 0.086
## 13 C 5.27 24.0 0.281 65.4 139. 0.096
## 14 C 5.37 24.1 0.271 69.4 142 0.078
## # ℹ 31 more variables: psc_fill <dbl>, psc_co2 <dbl>, mnf_flow <dbl>,
## # carb_pressure1 <dbl>, fill_pressure <dbl>, hyd_pressure2 <dbl>,
## # hyd_pressure3 <dbl>, hyd_pressure4 <dbl>, filler_level <dbl>,
## # filler_speed <dbl>, temperature <dbl>, usage_cont <dbl>, carb_flow <dbl>,
## # density <dbl>, mfr <dbl>, balling <dbl>, pressure_vacuum <dbl>, ph <dbl>,
## # oxygen_filler <dbl>, bowl_setpoint <dbl>, pressure_setpoint <dbl>,
## # air_pressurer <dbl>, alch_rel <dbl>, carb_rel <dbl>, balling_lvl <dbl>, …
Analysis of the test data using the summary() function revealed that many observations contain NA values, which must be addressed. Models such as randomForest, lm, and xgboost cannot handle missing values in predictor variables. If these NAs are not properly treated, predictions on the test data may be inaccurate, skewed, or incomplete. Therefore, to ensure a fair and consistent evaluation, the test data should undergo the same preprocessing steps as the training data.
summary(test_data)
## brand_code carb_volume fill_ounces pc_volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## carb_pressure carb_temp psc psc_fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## filler_level filler_speed temperature usage_cont carb_flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## density mfr balling pressure_vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## ph oxygen_filler bowl_setpoint pressure_setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03370 Median :120.0 Median :46.00
## Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :3 NA's :1 NA's :2
## air_pressurer alch_rel carb_rel balling_lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
# factoring brand code
test_data <- test_data %>%
mutate(`brand_code` = as.factor(`brand_code`))
test_completed <- test_data %>%
select(-ph) %>%
mice(., m = 1, method = 'rf', print = FALSE) %>% complete()
# remove Hyd.Pressure1 as it was removed in the preprocessing for Student Data
# add back in PH
test_eval <- test_completed %>%
select(-hyd_pressure1) %>%
mutate(PH = "")
# Check for any remaining missing values
colSums(is.na(test_eval))
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## 0 0 0 0
## filler_speed temperature usage_cont carb_flow
## 0 0 0 0
## density mfr balling pressure_vacuum
## 0 0 0 0
## oxygen_filler bowl_setpoint pressure_setpoint air_pressurer
## 0 0 0 0
## alch_rel carb_rel balling_lvl PH
## 0 0 0 0
With the dataset now fully cleaned and complete, the next step is to perform feature selection. Feature selection is a critical process to identify the most important predictors that influence the target variable, pH. By selecting the most relevant variables, we can improve model performance, reduce overfitting, and simplify the interpretation of the results. In this section, we will explore relationships between predictors and pH to guide our modeling choices.
Since correlation analysis requires numeric variables, we first selected only the numeric columns from the dataset before computing the correlation matrix.
# Numeric columns
data_numeric <- data %>%
select(where(is.numeric))
# Correlation matrix
cor_matrix <- cor(data_numeric, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# find the high correlated predictors
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9)
# Then this gives the actual column names:
colnames(data)[high_corr]
## [1] "mfr" "hyd_pressure2" "carb_rel" "carb_flow"
## [5] "hyd_pressure4"
# As we value interpretability, we will drop these highly correlated predictors.
#data <- data[, -high_corr]
The correlation plot shows that most numerical predictors have low to
moderate relationships with each other. PH
is moderately
correlated with variables such as oxygen_filler
,
pressure_setpoint
, bowl_setpoint
,
alch_rel
, and air_Ppressurer
. A few other
predictors also show noticeable associations. Overall, the correlations
suggest that several variables may contribute useful information for
predicting PH
, supporting our next step of feature
selection and model building.
There is higher levels of correlation for the following variables:
mfr
hyd_pressure2
carb_rel
carb_flow
hyd_pressure4
Next, let’s extract and visualize the Top 5 most correlated predictors with pH.
To predict beverage pH levels accurately, we evaluate a range of machine learning models, including both linear and nonlinear approaches. Our goal is to compare multiple algorithms, identify the best-performing model based on predictive accuracy, and balance model complexity with interpretability. We begin by splitting the dataset into training and testing sets, and then proceed to fit and assess several models to predict pH, including linear model, non-linear models, and tree-based models. Each model is trained with 5-fold CV, centered/scaled predictors, and evaluated on test data.
Since the target variable (pH) differs by Brand Code, you should stratify the train-test split by Brand Code. However, stratifying by pH produced a better R squared. Hence we did by pH. Stratifying by brand ensures each subset (train/test) contains similar brand distributions
set.seed(786)
# index for training
index <- createDataPartition(data$ph, p = .8, list = FALSE)
# train
train_data <- data[index, ]
test_data<- data[-index,]
# Training predictors and response
trainingX <- train_data %>% select(-ph)
trainingY <- train_data$ph
# Test predictors and response
testX <- test_data %>% select(-ph)
testY <- test_data$ph
# 5-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 5)
#pre process
pre_process <- c("center","scale")
Note: our machines were unable to knit when adding a tuning parameter grid for many of the models (particularly the tree models), even with parallel processing instated. For this reason, we left out a tuning grid for most models. We noticed generally a tuning grid did not majorly affect results of the model anyway.
Models linear relationships, no tuning - linear Regression models pH as a linear combination of predictors. It’s simple but may struggle with complex, nonlinear relationships in the data. Implemented using caret’s lm method, it’s trained on train_data, evaluated with 5-fold CV, and performance (RMSE, R-squared, MAE) is assessed on test_data using postResample.
The model produced a lowest RMSE of 0.1321767 and an R² of 0.4186977, indicating that approximately 41.87% of the variance in pH is explained by the model. The MAE was 0.1041143, reflecting the average difference between predicted and actual values. These results suggest that this model may not be the most suitable for accurately predicting pH in this dataset.
set.seed(786)
lm_model <- train(x = trainingX, y = trainingY,
method = "lm",
preProcess = pre_process,
trControl = ctrl)
lm_pred <- predict(lm_model, testX)
lm_resample <- postResample(lm_pred, testY)
lm_model
## Linear Regression
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1347626 0.3897977 0.1037653
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm_resample
## RMSE Rsquared MAE
## 0.1321767 0.4186977 0.1041143
The lowest RMSE for the model was 0.1323847. The R2 was 0.4188048, which means 41.88% of the variance in pH is explained by the model. The MAE was 0.1043342, which is the average error between prediction and actual values. These results suggest that the Ridge model performs similarly to the linear regression model and is likely not the most effective choice for predicting pH values in this dataset.
set.seed(786)
ridge_model <- train(ph ~ ., data = train_data,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 1, 0.1)))
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_model
## glmnet
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1850, 1849, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 0.1353139 0.3855554 0.1051096
## 0.1 0.1412178 0.3472476 0.1115090
## 0.2 0.1448983 0.3259585 0.1146440
## 0.3 0.1475069 0.3120219 0.1168361
## 0.4 0.1495342 0.3019645 0.1185909
## 0.5 0.1511908 0.2942820 0.1200542
## 0.6 0.1525791 0.2882702 0.1212828
## 0.7 0.1537686 0.2834310 0.1223356
## 0.8 0.1548160 0.2793599 0.1232619
## 0.9 0.1557314 0.2760084 0.1240749
## 1.0 0.1565512 0.2731256 0.1248068
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.
ridge_resample
## RMSE Rsquared MAE
## 0.1323847 0.4188048 0.1043342
The Lasso Regression model achieved an RMSE of 0.1321259, with an R2 of 0.4191752, meaning it explains approximately 41.92% of the variance in pH. The MAE was 0.1039544, indicating the average error between predicted and actual values. These results are nearly identical to those of the linear and ridge regression models, suggesting that Lasso also may not offer a substantial improvement in predictive performance for this dataset.
set.seed(786)
# For lasso, any value of lambda over 0.07 produced missing values for `lasso_resample`
lasso_model <- train(ph ~ ., data = train_data,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 5),
tuneGrid = expand.grid(alpha = 1, # set alpha = 1 for lasso
lambda = seq(0, 0.07, 0.001)))
lasso_pred <- predict(lasso_model, testX)
lasso_resample <- postResample(lasso_pred, testY)
lasso_model
## glmnet
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000 0.1347278 0.3900439 0.1037887
## 0.001 0.1350921 0.3865110 0.1047001
## 0.002 0.1356942 0.3818617 0.1055293
## 0.003 0.1366319 0.3743770 0.1065373
## 0.004 0.1376423 0.3662409 0.1076105
## 0.005 0.1385823 0.3587458 0.1086049
## 0.006 0.1394813 0.3515588 0.1095324
## 0.007 0.1403887 0.3441997 0.1103737
## 0.008 0.1413814 0.3357357 0.1112458
## 0.009 0.1420956 0.3301624 0.1118688
## 0.010 0.1426631 0.3262410 0.1123622
## 0.011 0.1432582 0.3220253 0.1128583
## 0.012 0.1438945 0.3173256 0.1133863
## 0.013 0.1445506 0.3123278 0.1139229
## 0.014 0.1451287 0.3081980 0.1143886
## 0.015 0.1455359 0.3062830 0.1147313
## 0.016 0.1459428 0.3044821 0.1150929
## 0.017 0.1463754 0.3024450 0.1154851
## 0.018 0.1468021 0.3005715 0.1158742
## 0.019 0.1472497 0.2984907 0.1162890
## 0.020 0.1477167 0.2961973 0.1167113
## 0.021 0.1482068 0.2936107 0.1171539
## 0.022 0.1487004 0.2909843 0.1175958
## 0.023 0.1492057 0.2882040 0.1180446
## 0.024 0.1497022 0.2855700 0.1184904
## 0.025 0.1501952 0.2829984 0.1189358
## 0.026 0.1506687 0.2807129 0.1193560
## 0.027 0.1511353 0.2785308 0.1197611
## 0.028 0.1515846 0.2766910 0.1201492
## 0.029 0.1520415 0.2747582 0.1205429
## 0.030 0.1525055 0.2727470 0.1209377
## 0.031 0.1529751 0.2706672 0.1213455
## 0.032 0.1534588 0.2683388 0.1217809
## 0.033 0.1539564 0.2657345 0.1222325
## 0.034 0.1544677 0.2628354 0.1226968
## 0.035 0.1549926 0.2596045 0.1231714
## 0.036 0.1555286 0.2560463 0.1236570
## 0.037 0.1560667 0.2523276 0.1241449
## 0.038 0.1566176 0.2481871 0.1246462
## 0.039 0.1571809 0.2435741 0.1251532
## 0.040 0.1577350 0.2389731 0.1256436
## 0.041 0.1582872 0.2341996 0.1261301
## 0.042 0.1588411 0.2290904 0.1266136
## 0.043 0.1594011 0.2235296 0.1270959
## 0.044 0.1599610 0.2176431 0.1275721
## 0.045 0.1604526 0.2129899 0.1280007
## 0.046 0.1609256 0.2083556 0.1284216
## 0.047 0.1613980 0.2034130 0.1288547
## 0.048 0.1617791 0.2011384 0.1292122
## 0.049 0.1621464 0.1993388 0.1295516
## 0.050 0.1624906 0.1980878 0.1298733
## 0.051 0.1628354 0.1968377 0.1301959
## 0.052 0.1631714 0.1960430 0.1305112
## 0.053 0.1634926 0.1960430 0.1308102
## 0.054 0.1638193 0.1960430 0.1311113
## 0.055 0.1641513 0.1960430 0.1314128
## 0.056 0.1644888 0.1960430 0.1317170
## 0.057 0.1648317 0.1960430 0.1320242
## 0.058 0.1651799 0.1960430 0.1323354
## 0.059 0.1655334 0.1960430 0.1326482
## 0.060 0.1658922 0.1960430 0.1329641
## 0.061 0.1662562 0.1960430 0.1332807
## 0.062 0.1666255 0.1960430 0.1336021
## 0.063 0.1669999 0.1960430 0.1339279
## 0.064 0.1673794 0.1960430 0.1342582
## 0.065 0.1677640 0.1960430 0.1346034
## 0.066 0.1681537 0.1960430 0.1349551
## 0.067 0.1685485 0.1960430 0.1353104
## 0.068 0.1689482 0.1960430 0.1356712
## 0.069 0.1693529 0.1960430 0.1360432
## 0.070 0.1697625 0.1960430 0.1364226
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.
lasso_resample
## RMSE Rsquared MAE
## 0.1321259 0.4191752 0.1039544
The Elastic Net model produced an RMSE of 0.1321355, an R2 of 0.4192278 (meaning it explains approximately 41.92% of the variance in pH.), and an MAE of 0.1037287. These results are also almost identical to those of the linear, ridge, and lasso models, indicating that Elastic Net also does not offer a significant improvement in predictive performance of pH for this dataset.
set.seed(786)
# Elastic net
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(ph ~ ., data = train_data,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProcess = pre_process)
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_model
## Elasticnet
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00 0.05 0.1588794 0.2288224 0.1266329
## 0.00 0.10 0.1505693 0.2813210 0.1192215
## 0.00 0.15 0.1456149 0.3058632 0.1147940
## 0.00 0.20 0.1425201 0.3271558 0.1122306
## 0.00 0.25 0.1404272 0.3439818 0.1104047
## 0.00 0.30 0.1388669 0.3567170 0.1089558
## 0.00 0.35 0.1377952 0.3653294 0.1078436
## 0.00 0.40 0.1368919 0.3725047 0.1069213
## 0.00 0.45 0.1361626 0.3783575 0.1061608
## 0.00 0.50 0.1356221 0.3825993 0.1055265
## 0.00 0.55 0.1353130 0.3849023 0.1050746
## 0.00 0.60 0.1350796 0.3867118 0.1047481
## 0.00 0.65 0.1349173 0.3879683 0.1045237
## 0.00 0.70 0.1347928 0.3890431 0.1043445
## 0.00 0.75 0.1347135 0.3897824 0.1042154
## 0.00 0.80 0.1346481 0.3904005 0.1040677
## 0.00 0.85 0.1346132 0.3907657 0.1039328
## 0.00 0.90 0.1346328 0.3906763 0.1038394
## 0.00 0.95 0.1346826 0.3903497 0.1037878
## 0.00 1.00 0.1347626 0.3897977 0.1037653
## 0.01 0.05 0.1607846 0.2087278 0.1282753
## 0.01 0.10 0.1530667 0.2705221 0.1213974
## 0.01 0.15 0.1479276 0.2952410 0.1168996
## 0.01 0.20 0.1445946 0.3117723 0.1139686
## 0.01 0.25 0.1422488 0.3288366 0.1120031
## 0.01 0.30 0.1405741 0.3422756 0.1105516
## 0.01 0.35 0.1392121 0.3534268 0.1093021
## 0.01 0.40 0.1382412 0.3613703 0.1083058
## 0.01 0.45 0.1374554 0.3676011 0.1074878
## 0.01 0.50 0.1367639 0.3732067 0.1067738
## 0.01 0.55 0.1361699 0.3780356 0.1061409
## 0.01 0.60 0.1357258 0.3815199 0.1056252
## 0.01 0.65 0.1354209 0.3838913 0.1052247
## 0.01 0.70 0.1352016 0.3856100 0.1049274
## 0.01 0.75 0.1350272 0.3870194 0.1047040
## 0.01 0.80 0.1349046 0.3880363 0.1045354
## 0.01 0.85 0.1348097 0.3889060 0.1044055
## 0.01 0.90 0.1347357 0.3895964 0.1042708
## 0.01 0.95 0.1347044 0.3899290 0.1041656
## 0.01 1.00 0.1347133 0.3899311 0.1041032
## 0.10 0.05 0.1635442 0.1960430 0.1308649
## 0.10 0.10 0.1573206 0.2448613 0.1252548
## 0.10 0.15 0.1524106 0.2741302 0.1208467
## 0.10 0.20 0.1487372 0.2906560 0.1176366
## 0.10 0.25 0.1459229 0.3042547 0.1151179
## 0.10 0.30 0.1439724 0.3137046 0.1134609
## 0.10 0.35 0.1424246 0.3251681 0.1121838
## 0.10 0.40 0.1411967 0.3344805 0.1111417
## 0.10 0.45 0.1401822 0.3421473 0.1102258
## 0.10 0.50 0.1393853 0.3486653 0.1094415
## 0.10 0.55 0.1387347 0.3542242 0.1087638
## 0.10 0.60 0.1381668 0.3591746 0.1081490
## 0.10 0.65 0.1376905 0.3633251 0.1076347
## 0.10 0.70 0.1372644 0.3670379 0.1071834
## 0.10 0.75 0.1369121 0.3700691 0.1068051
## 0.10 0.80 0.1366150 0.3726553 0.1064724
## 0.10 0.85 0.1363606 0.3749123 0.1061794
## 0.10 0.90 0.1361602 0.3767190 0.1059482
## 0.10 0.95 0.1360220 0.3780369 0.1057840
## 0.10 1.00 0.1359264 0.3790202 0.1056542
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.85 and lambda = 0.
enet_resample
## RMSE Rsquared MAE
## 0.1321355 0.4192278 0.1037287
The PCR model yielded an RMSE of 0.1561616, an R2 of 0.1890207, and an MAE of 0.1243395. Compared to the previous linear-based models, PCR shows poorer performance, explaining only about 18.9% of the variance in pH and producing higher prediction errors. This suggests that PCR is not a suitable model for predicting pH.
set.seed(786)
# Train PCR model
pcr_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "pcr",
trControl = ctrl,
preProcess = pre_process)
# Predict on test data
pcr_pred <- predict(pcr_model, select(test_data, -ph))
# Evaluate performance
pcr_resample <- postResample(pcr_pred, test_data$ph)
pcr_model
## Principal Component Analysis
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1711486 0.01681885 0.1372668
## 2 0.1578973 0.16178552 0.1253505
## 3 0.1567185 0.17433906 0.1240842
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
pcr_resample
## RMSE Rsquared MAE
## 0.1561616 0.1890207 0.1243395
The PLS model achieved an RMSE of 0.1324806, an R2 of 0.4159277, and an MAE of 0.1037519. These results are very similar to those of the linear, ridge, lasso, and elastic net models, indicating that PLS captures a comparable amount of variance in pH and provides similar predictive accuracy. However, like the other models, the RMSE and R² values are not strong enough to consider this a reliable model for predicting pH in this dataset.
set.seed(786)
pls_grid <- expand.grid(ncomp = seq(1, 15, by = 1))
pls_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "pls",
tuneGrid = pls_grid,
trControl = ctrl,
preProcess = pre_process)
pls_pred <- predict(pls_model, select(test_data, -ph))
pls_resample <- postResample(pls_pred, test_data$ph)
pls_model
## Partial Least Squares
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1532790 0.2103613 0.1212352
## 2 0.1458741 0.2848802 0.1137035
## 3 0.1423544 0.3186094 0.1105476
## 4 0.1399725 0.3406841 0.1089077
## 5 0.1376588 0.3624354 0.1063880
## 6 0.1363722 0.3745964 0.1060233
## 7 0.1357017 0.3809953 0.1050317
## 8 0.1353048 0.3847330 0.1047653
## 9 0.1350521 0.3871062 0.1045406
## 10 0.1348994 0.3885238 0.1042537
## 11 0.1348595 0.3888304 0.1041934
## 12 0.1347463 0.3898319 0.1041101
## 13 0.1347837 0.3894784 0.1041348
## 14 0.1347082 0.3900727 0.1039312
## 15 0.1347405 0.3898509 0.1039722
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 14.
pls_resample
## RMSE Rsquared MAE
## 0.1324806 0.4159277 0.1037519
kNN Model: The lowest RMSE for the model is at k = 9. The lowest RMSE was 0.1172830. The R2 was 0.550085, which means 55.0% of the variance in pH is explained by the kNN model. The MAE was 0.0869197, which is the average error between prediction and actual values. From these results, compared to the models tested, kNN explains a larger proportion of the variance in pH and provides more accurate predictions compared to the linear, ridge, lasso, elastic net, and PLS models. However, while kNN shows improved performance, the RMSE and R2 values are still not strong enough to consider this a highly reliable model for predicting pH in this dataset.
set.seed(786)
knn_grid <- expand.grid(k = seq(1, 21, by = 2))
knn_model <- train(ph ~ ., data = train_data,
method = "knn",
tuneGrid = knn_grid,
trControl = ctrl,
preProcess = pre_process)
knn_pred <- predict(knn_model, select(test_data, -ph))
knn_resample <- postResample(knn_pred, test_data$ph)
knn_model
## k-Nearest Neighbors
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.1549193 0.3416685 0.10585840
## 3 0.1307828 0.4452436 0.09564591
## 5 0.1267261 0.4672217 0.09411415
## 7 0.1252270 0.4781626 0.09444017
## 9 0.1246390 0.4833880 0.09453652
## 11 0.1252757 0.4790679 0.09498111
## 13 0.1265625 0.4688718 0.09586493
## 15 0.1272809 0.4626442 0.09670261
## 17 0.1277351 0.4594474 0.09711296
## 19 0.1282146 0.4554591 0.09754619
## 21 0.1292802 0.4458702 0.09833988
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
knn_resample
## RMSE Rsquared MAE
## 0.1172830 0.5500853 0.0869197
The MARS model achieved an RMSE of 0.11790237, an R² of 0.53866890, and an MAE of 0.08951604, indicating that 53.87% of the variance in pH is explained by the MARS model. These results show that MARS provides a decent performance, with an RMSE and R2 value indicating a decent fit and predictive accuracy. The performance is better than the linear models, suggesting that MARS captures more complex relationships in the data. Although the model performs well, the RMSE and R2 values are still not sufficient to consider this a highly reliable model for pH prediction in this dataset.
set.seed(786)
mars_grid <- expand.grid(degree = 1:2, nprune = seq(10, 30, by = 5))
mars_model <- train(x = select(train_data, -ph), y = train_data$ph,
method = "earth",
tuneGrid = mars_grid,
trControl = ctrl,
preProcess = pre_process)
mars_pred <- predict(mars_model, select(test_data, -ph))
mars_resample <- postResample(mars_pred, test_data$ph)
mars_model
## Multivariate Adaptive Regression Spline
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 10 0.1346239 0.3902344 0.10381266
## 1 15 0.1326133 0.4087794 0.10180204
## 1 20 0.1323283 0.4111927 0.10075282
## 1 25 0.1323283 0.4111927 0.10075282
## 1 30 0.1323283 0.4111927 0.10075282
## 2 10 0.1347855 0.3933993 0.10296016
## 2 15 0.1287403 0.4457438 0.09732794
## 2 20 0.1260527 0.4683132 0.09531076
## 2 25 0.1250275 0.4769808 0.09390600
## 2 30 0.1244525 0.4819273 0.09328216
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 30 and degree = 2.
mars_resample
## RMSE Rsquared MAE
## 0.11790237 0.53866890 0.08951604
The Neural Network model achieved an RMSE of 0.11433728, an R² of 0.56840994, and an MAE of 0.08432463. These results show a stronger performance than the rest of the models so far, with the lowest RMSE and the highest R2, indicating that it explains a 56.84% of variance in the pH values. The MAE is also relatively low, suggesting pH predictions with minimal absolute errors. However, like the past few nonlinear models, the RMSE and R2 values still suggest that there is room for improvement.
num_cores <- parallel::detectCores() - 1 # leave one core free
num_cores
## [1] 11
cl <- makePSOCKcluster(num_cores) # number of cores to use
registerDoParallel(cl)
nnetGrid <- expand.grid(decay = c(0, 0.02, .1),
size = c(7, 9, 11),
bag = FALSE)
set.seed(786)
nnetTune <- train(ph ~ ., data = train_data,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 13 * (ncol(trainingX) + 1) + 13 + 1,
maxit = 1000)
nnetTune
## Model Averaged Neural Network
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 7 0.2310935 0.4314668 0.10412066
## 0.00 9 0.1455993 0.4536324 0.09055703
## 0.00 11 0.1785133 0.4197705 0.09827147
## 0.02 7 0.1164756 0.5454294 0.08539212
## 0.02 9 0.1165408 0.5469831 0.08492737
## 0.02 11 0.1201854 0.5306942 0.08542269
## 0.10 7 0.1175644 0.5376699 0.08724167
## 0.10 9 0.1172844 0.5426627 0.08658979
## 0.10 11 0.1160258 0.5546873 0.08552272
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 11, decay = 0.1 and bag = FALSE.
nn_pred <- predict(nnetTune, select(test_data, -ph))
nn_resample <- postResample(nn_pred, test_data$ph)
nn_resample
## RMSE Rsquared MAE
## 0.11433728 0.56840994 0.08432463
# Stop cluster
stopCluster(cl)
registerDoSEQ()
The SVM-RBF model achieved an RMSE of 0.11620320, an R2 of 0.55228523, and an MAE of 0.08752177 - with final values used for the model being sigma = 0.05 and C = 10. The RMSE value indicates a low error in prediction, and the R2 suggests a good level of explanatory power compared to other models, since 55.2% of the variance in pH is explained by the model. However, the SVM-RBF model performs better than linear models and some non-linear models, it still falls short of the Neural Network for both RMSE and R2. Nevertheless, while it provides a somewhat reliable approach for predicting pH values in this dataset, it still could use some improvement.
set.seed(786)
svm_grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(0.1, 1, 10))
svm_model <- train(ph ~ ., data = train_data,
method = "svmRadial",
tuneGrid = svm_grid,
trControl = ctrl,
preProcess = pre_process)
svm_pred <- predict(svm_model, select(test_data, -ph))
svm_resample <- postResample(svm_pred, test_data$ph)
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## sigma C RMSE Rsquared MAE
## 0.01 0.1 0.1363996 0.3956185 0.10441674
## 0.01 1.0 0.1253245 0.4757716 0.09179527
## 0.01 10.0 0.1218226 0.5050738 0.08849894
## 0.05 0.1 0.1331903 0.4226648 0.10004960
## 0.05 1.0 0.1202833 0.5149270 0.08783407
## 0.05 10.0 0.1185918 0.5309967 0.08834310
## 0.10 0.1 0.1396690 0.3791954 0.10594163
## 0.10 1.0 0.1230643 0.4946691 0.09073942
## 0.10 10.0 0.1218007 0.5004429 0.09133235
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05 and C = 10.
svm_resample
## RMSE Rsquared MAE
## 0.11620320 0.55228523 0.08752177
The SVM-Polynomial model achieved an RMSE of 0.12636749, an R2 of 0.47478046, and an MAE of 0.09250217 -with final values used for the model being degree = 2, scale = 0.01, and C = 2. These results place this model on the lower end in terms of performance for this dataset. While it outperforms simpler linear models like PLS and ridge regression in terms of predictive accuracy and variance explained, it does not outperform models like SVM with RBF kernel or the Neural Network. Overall, the SVM-Poly is not among the most reliable for this dataset.
set.seed(786)
svmGrid <- expand.grid(degree = 1:2,
scale = c(0.01, 0.005, 0.001),
C = 2^(-2:5))
svmP_model <- train(ph ~ ., data = train_data,
method = "svmPoly",
preProc = pre_process,
tuneGrid = svmGrid,
trControl = ctrl)
svmP_pred <- predict(svmP_model, select(test_data, -ph))
svmP_resample <- postResample(svmP_pred, test_data$ph)
svmP_model
## Support Vector Machines with Polynomial Kernel
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 0.001 0.25 0.1500682 0.2902258 0.11825797
## 1 0.001 0.50 0.1454214 0.3172556 0.11392465
## 1 0.001 1.00 0.1418150 0.3413470 0.11050947
## 1 0.001 2.00 0.1392212 0.3585987 0.10767681
## 1 0.001 4.00 0.1375350 0.3695964 0.10567779
## 1 0.001 8.00 0.1367009 0.3762804 0.10451907
## 1 0.001 16.00 0.1364567 0.3784029 0.10389322
## 1 0.001 32.00 0.1365448 0.3781879 0.10353572
## 1 0.005 0.25 0.1408354 0.3478939 0.10950501
## 1 0.005 0.50 0.1385566 0.3629925 0.10686241
## 1 0.005 1.00 0.1372550 0.3718124 0.10525600
## 1 0.005 2.00 0.1365453 0.3776144 0.10423486
## 1 0.005 4.00 0.1365016 0.3781480 0.10376921
## 1 0.005 8.00 0.1366220 0.3778237 0.10347086
## 1 0.005 16.00 0.1366588 0.3775229 0.10332553
## 1 0.005 32.00 0.1365817 0.3785518 0.10318452
## 1 0.010 0.25 0.1385577 0.3629794 0.10686348
## 1 0.010 0.50 0.1372559 0.3718060 0.10525661
## 1 0.010 1.00 0.1365438 0.3776293 0.10423471
## 1 0.010 2.00 0.1365024 0.3781366 0.10376845
## 1 0.010 4.00 0.1366236 0.3778139 0.10347003
## 1 0.010 8.00 0.1366597 0.3775137 0.10332838
## 1 0.010 16.00 0.1365826 0.3785592 0.10318481
## 1 0.010 32.00 0.1367319 0.3775373 0.10308037
## 2 0.001 0.25 0.1452417 0.3195940 0.11376678
## 2 0.001 0.50 0.1414933 0.3449001 0.11022507
## 2 0.001 1.00 0.1386449 0.3643305 0.10714153
## 2 0.001 2.00 0.1365714 0.3793489 0.10464173
## 2 0.001 4.00 0.1349484 0.3927196 0.10272931
## 2 0.001 8.00 0.1333085 0.4065537 0.10068073
## 2 0.001 16.00 0.1310891 0.4258930 0.09812315
## 2 0.001 32.00 0.1290090 0.4434096 0.09551423
## 2 0.005 0.25 0.1357112 0.3910420 0.10408214
## 2 0.005 0.50 0.1326874 0.4152589 0.10052062
## 2 0.005 1.00 0.1298085 0.4382866 0.09714382
## 2 0.005 2.00 0.1276990 0.4553732 0.09436977
## 2 0.005 4.00 0.1265290 0.4640581 0.09263855
## 2 0.005 8.00 0.1257959 0.4698461 0.09161152
## 2 0.005 16.00 0.1259687 0.4701945 0.09114868
## 2 0.005 32.00 0.1264498 0.4689968 0.09132417
## 2 0.010 0.25 0.1303526 0.4361562 0.09788445
## 2 0.010 0.50 0.1279766 0.4543567 0.09479150
## 2 0.010 1.00 0.1265030 0.4645966 0.09271112
## 2 0.010 2.00 0.1257891 0.4701290 0.09170759
## 2 0.010 4.00 0.1259083 0.4704585 0.09115160
## 2 0.010 8.00 0.1263425 0.4697251 0.09130561
## 2 0.010 16.00 0.1278630 0.4628675 0.09224004
## 2 0.010 32.00 0.1305962 0.4485015 0.09392808
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.01 and C = 2.
svmP_resample
## RMSE Rsquared MAE
## 0.12636749 0.47478046 0.09250217
The CART model achieved an RMSE of 0.12794622, an R² of 0.45794133, and an MAE of 0.09678968, with the final value used for the model being cp = 0.01. These results indicate mediocore predictive performance, slightly better than linear and PLS models, but also falling short of more complex models like SVM-RBF and NN. The model captures some of the variance (45.79%) in pH but still leaves a considerable amount unexplained, making it only somewhat effective for prediction in this dataset. Therefore, this would not be the model of choice.
set.seed(786)
rpartFit <- train(ph ~ ., data = train_data,
method = "rpart",
trControl = ctrl,
tuneGrid = data.frame(cp = c(0.01, 0.05, 0.1)),
preProcess = pre_process)
rpartFit
## CART
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (33), scaled (33)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.1297540 0.4366423 0.09872245
## 0.05 0.1492260 0.2510414 0.11669494
## 0.10 0.1526919 0.2163947 0.11970553
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.
#predict the permeability response for the test data
rpartPred <- predict(rpartFit, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
rpartResample <- postResample(rpartPred, test_data$ph)
rpartResample
## RMSE Rsquared MAE
## 0.12794622 0.45794133 0.09678968
The Random Forest model achieved an RMSE of 0.09969844, an R2 of 0.67283982, and an MAE of 0.07037370, making it the most accurate model among all those evaluated so far. It outperformed linear, PLS, and CART in both error and variance explained. The high R2 (67.28%) indicates that Random Forest captured a significant portion of the variability in pH. The final value used for the model was mtry = 31. This result highlights Random Forest as a highly effective modeling approach for this dataset, for predictive accuracy and capturing variance.
set.seed(786)
# default mmtry is number of predictors divided by 3 , so around 10.
#mtry is the number of predictors sampled for each split.
rfGrid <- expand.grid(mtry=seq(2,38,by=3) )
rfModel <- train(x = select(train_data, -ph), y = train_data$ph,
method = "rf",
# tuneGrid = rfGrid,
importance = TRUE,
trControl = ctrl,
ntree = 50,
preProcess=pre_process)
#validation plot
plot(rfModel)
rfModel
## Random Forest
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1158778 0.5846496 0.08693333
## 16 0.1049511 0.6383291 0.07587949
## 31 0.1036612 0.6423721 0.07401761
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 31.
#predict the permeability response for the test data
rfPred <- predict(rfModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
rfResample <- postResample(rfPred, test_data$ph)
rfResample
## RMSE Rsquared MAE
## 0.09969844 0.67283982 0.07037370
The Random Forest model with OOB estimates achieved an RMSE of 0.09926432, an R2 of 0.67786310, and an MAE of 0.07146070, making it slightly more accurate and reliable than the Random Forest, and the best overall so far. The final value used for the model was also mtry = 31, which is the maximum number of predictors available. The high R2 (67.79%) indicates that the model explained a good proportion of the variance in pH, while the low RMSE and MAE reflect high predictive accuracy. This reinforces Random Forest as the strongest model for predicting pH so far.
# Tune the model using the OOB estimates
ctrlOOB <- trainControl(method = "oob")
set.seed(786)
rfTuneOOB <- train(x = select(train_data, -ph), y = train_data$ph,
method = "rf",
# tuneGrid = mtryGrid,
ntree = 50,
importance = TRUE,
trControl = ctrlOOB,
preProcess=pre_process)
rfTuneOOB
## Random Forest
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.1168861 0.5396115
## 16 0.1030697 0.6420183
## 31 0.1026615 0.6448478
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 31.
plot(rfTuneOOB)
#predict the permeability response for the test data
rfOOBPred <- predict(rfTuneOOB, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
rfOOBResample <- postResample(rfOOBPred, test_data$ph)
rfOOBResample
## RMSE Rsquared MAE
## 0.09926432 0.67786310 0.07146070
The Conditional Inference Random Forest model achieved an RMSE of 0.10561362, an R2 of 0.64000809, and an MAE of 0.07969956. These results demonstrate strong predictive performance, but fell slightly short of the regular Random Forest model, which produced lower error rates and higher R2. The final value used for the model was mtry = 16. Overall, this model performs relatively well but not the best performer so far.
# Conditional Inference Forests
mtryGrid <- data.frame(mtry = floor(seq(10, ncol(select(train_data, -ph)), length = 10)))
set.seed(786)
condrfTune <- train(x = select(train_data, -ph), y = train_data$ph,
method = "cforest",
# tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree = 50),
trControl = ctrl,
preProcess=pre_process)
condrfTune
## Conditional Inference Random Forest
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1272338 0.5006288 0.09855281
## 16 0.1116897 0.5885541 0.08292328
## 31 0.1123024 0.5785856 0.08288324
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.
#predict the permeability response for the test data
condrfPred <- predict(condrfTune, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
condrfResample <- postResample(condrfPred, test_data$ph)
condrfResample
## RMSE Rsquared MAE
## 0.10561362 0.64000809 0.07969956
The Conditional Inference Random Forest model with OOB achieved an RMSE of 0.10509848, an R2 of 0.64372448, and an MAE of 0.07930848. These values reflect similar predictive performance to the regular conditional inference - under performing the Random Foest. The final tuning parameter used was mtry = 16. While slightly behind the standard Random Forest, this model still demonstrates solid capability in predicting pH.
# Tune the model using the OOB estimates
set.seed(786)
condrfTuneOOB <- train(x = select(train_data, -ph), y = train_data$ph,
method = "cforest",
# tuneGrid = mtryGrid,
controls = cforest_unbiased(ntree = 50),
trControl = ctrlOOB,
preProcess = pre_process)
condrfTuneOOB
## Conditional Inference Random Forest
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1244314 0.5183411 0.09510416
## 16 0.1103118 0.5962083 0.08131875
## 31 0.1106121 0.5894080 0.08066653
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.
plot(condrfTuneOOB)
#predict the permeability response for the test data
condrfOOBPred <- predict(condrfTuneOOB, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
condrfOOBResample <- postResample(condrfOOBPred, test_data$ph)
condrfOOBResample
## RMSE Rsquared MAE
## 0.10509848 0.64372448 0.07930848
The final model selected was based on n.trees = 150 and interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. The Stochastic Gradient Boosting model achieved an RMSE of 0.1146917, indicating that the model’s predictions are off by this amount on average. The R2 value of 0.56474918 means that approximately 56.47% of the variance in the pH variable is explained by the model. The MAE was 0.09061361, indicating that, on average, the model’s predictions differ from the true values by about 0.09 units. Compared to the other models, the boosted model did not perform as good as the Random Forest and Neural Network models since the R-squared, especially, is somewhat moderate, meaning the model is not fully capturing the variability in the data. As a result, it may not be the most effective model for predicting pH in this dataset.
set.seed(786)
gbmGrid <- expand.grid(interaction.depth=seq(1,7,by=1),
n.trees=c(100,200),
shrinkage=c(0.01,0.1,0.2),
n.minobsinnode=5)
# Distribution defines the type of loss function that will be optimized during boosting.
# for continuous variable, it should be gaussian
gbmModel <-train(x = select(train_data, -ph), y = train_data$ph,
method = "gbm",
verbose = FALSE,
trControl = ctrl,
preProcess=pre_process)
gbmModel
## Stochastic Gradient Boosting
##
## 2055 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1644, 1643, 1644, 1645, 1644
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.1366762 0.3900967 0.10765403
## 1 100 0.1325738 0.4169541 0.10387050
## 1 150 0.1310521 0.4264272 0.10206527
## 2 50 0.1295610 0.4496512 0.10130993
## 2 100 0.1254424 0.4757725 0.09687421
## 2 150 0.1241387 0.4831551 0.09505848
## 3 50 0.1262007 0.4726839 0.09761110
## 3 100 0.1222503 0.4988562 0.09309484
## 3 150 0.1204027 0.5123995 0.09138077
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
#predict the permeability response for the test data
gbmPred <- predict(gbmModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
gbmResample<- postResample(gbmPred, test_data$ph)
gbmResample
## RMSE Rsquared MAE
## 0.11469171 0.56474918 0.09061361
The best performance for Cubist was obtained with committees = 20 and neighbors = 5. The final RMSE on the test set was 0.09238683, which means the model’s predictions are off by about this amount on average. The R2 value of 0.7208153 indicates that the model explains about 72.08% of the variance in pH, showing a relatively strong relationship between the predictors and the target variable. The MAE was 0.06862500, meaning that, on average, the predictions differ from the true values by approximately 0.069 units. Compared to the previous models, the Cubist model demonstrates the best predictive performance (out of the models tested) with the lowest RMSE and a highest R-squared. The R-squared value indicates that the model is able to capture a relatively decent portion of the variability in the data, and the low RMSE suggests that it is making relatively accurate predictions.
set.seed(786)
cubistModel <- train(x = select(train_data, -ph), y = train_data$ph,
method = "cubist",
verbose = FALSE,
preProcess=pre_process)
cubistModel$bestTune
## committees neighbors
## 9 20 9
#predict the permeability response for the test data
cubistPred <- predict(cubistModel, select(test_data, -ph))
#compare the variance of the predicted values to the test values.
cubistResample <-postResample(cubistPred, test_data$ph)
cubistResample
## RMSE Rsquared MAE
## 0.09238683 0.72081531 0.06862500
The Bagged Trees model achieved an RMSE of 0.1204170, an R² of 0.5211945, and an MAE of 0.0928338. While it outperforms simpler models like CART and linear regression models, it falls short compared to the Random Forest and Cubist model. Its moderate R² (52.12%) suggests it captures just over half of the variance in pH, and although the error metrics are reasonable, they are not low enough to consider this model highly accurate. Overall, Bagged Trees offer modest predictive performance but are not among the top-performing models for this dataset.
set.seed(786)
baggedModel <- train(x = select(train_data, -ph),
y = train_data$ph,
method = "treebag",
verbose = FALSE,
# trControl = ctrl,
preProcess = pre_process)
baggedModel$bestTune
## parameter
## 1 none
# Predict the permeability response for the test data
baggedPred <- predict(baggedModel, select(test_data, -ph))
# Compare the variance of the predicted values to the test values
baggedResample <- postResample(baggedPred, test_data$ph)
baggedResample
## RMSE Rsquared MAE
## 0.1204170 0.5211945 0.0928338
Linear Regression and its variants: Among the linear models, Elastic Net regression performed the best (~41.9% of the variance in the pH) and PCR performed the worst. Hence dimensionality reduction does not seem desirable. The data does not have a linear relationship as suggested by the outcome.
Non-Linear Regression: Among the Non-Linear models, Neural Network performed the best (~56.8% of the variance in the pH) - while also performing better than all of the linear regression models.
Tree/rule based: Among the Tree/rule-based models, the Cubist performed the best (~ 72% of the variance in the pH). All of the tree-based models performed better than both the linear and non-linear regression models.
Overall: Based on R squared as the performance metric, “Cubist” model performed the best in optimal resampling and test set performance. The model explains ~72% of the variance in the pH variable. The predictions are off by just 0.09 pH units. The average absolute error is less than 0.07, showing consistent accuracy.
# Find the best model
model_results <- rbind(
lm_resample,
ridge_resample,
lasso_resample,
enet_resample,
pcr_resample,
pls_resample,
knn_resample,
mars_resample,
nn_resample,
svm_resample,
svmP_resample,
rpartResample,
rfResample,
rfOOBResample,
condrfResample,
condrfOOBResample,
gbmResample,
baggedResample,
cubistResample
)
rownames(model_results) <- c("Linear Regression", "Ridge", "Lasso", "Elastic Net", "PCR", "PLS", "KNN", "MARS", "Neural Network", "SVM", "SVM Polynomial", "Single Tree", "Random Forest", "Random Forest OOB", "Conditional Inference Random Forest", "Conditional Inference Random Forest OOB", "GBM", "Bagged", "Cubist")
# Sort by Rsquared
model_results[order(model_results[,2]), ]
## RMSE Rsquared MAE
## PCR 0.15616156 0.1890207 0.12433952
## PLS 0.13248056 0.4159277 0.10375194
## Linear Regression 0.13217668 0.4186977 0.10411425
## Ridge 0.13238471 0.4188048 0.10433420
## Lasso 0.13212593 0.4191752 0.10395444
## Elastic Net 0.13213547 0.4192278 0.10372870
## Single Tree 0.12794622 0.4579413 0.09678968
## SVM Polynomial 0.12636749 0.4747805 0.09250217
## Bagged 0.12041701 0.5211945 0.09283380
## MARS 0.11790237 0.5386689 0.08951604
## KNN 0.11728297 0.5500853 0.08691970
## SVM 0.11620320 0.5522852 0.08752177
## GBM 0.11469171 0.5647492 0.09061361
## Neural Network 0.11433728 0.5684099 0.08432463
## Conditional Inference Random Forest 0.10561362 0.6400081 0.07969956
## Conditional Inference Random Forest OOB 0.10509848 0.6437245 0.07930848
## Random Forest 0.09969844 0.6728398 0.07037370
## Random Forest OOB 0.09926432 0.6778631 0.07146070
## Cubist 0.09238683 0.7208153 0.06862500
In the cubist model, among the Top 10 predictors
Mnf_Flow
, Balling
, Balling_Lvl
,
Pressure_Vacuum
, Alch_Rel
etc. are important
features.
# Find the top predictors
top_predictors <- data.frame(varImp(cubistModel,scale = TRUE)$importance)
top_predictors |>
arrange(desc(Overall))
## Overall
## mnf_flow 100.000000
## balling 85.000000
## alch_rel 74.166667
## density 68.333333
## balling_lvl 67.500000
## pressure_vacuum 66.666667
## air_pressurer 57.500000
## oxygen_filler 56.666667
## carb_rel 54.166667
## hyd_pressure3 50.000000
## bowl_setpoint 49.166667
## temperature 49.166667
## carb_pressure1 48.333333
## carb_flow 42.500000
## usage_cont 39.166667
## filler_speed 37.500000
## hyd_pressure2 37.500000
## filler_level 35.833333
## brand_code 32.500000
## carb_volume 23.333333
## hyd_pressure4 23.333333
## pressure_setpoint 15.833333
## pc_volume 15.833333
## carb_pressure 14.166667
## fill_pressure 14.166667
## carb_temp 11.666667
## fill_ounces 6.666667
## mfr 5.833333
## psc 2.500000
## psc_fill 2.500000
## psc_co2 0.000000
top_10_predictors <- top_predictors |>
arrange(desc(Overall)) |>
slice_head(n = 10)
# 1. Get variable importance
cubistImp <- varImp(cubistModel)
# 2. Plot simple barplot
plot(caret::varImp(cubistModel), main = "Cubist Regression Model Variable Importance")
# Correlation
data |>
select(c(ph, row.names(top_10_predictors))) |>
cor(use = "complete.obs") |>
corrplot(method = "circle", type = "upper", tl.cex = 0.9, tl.srt = 45)
After the model is trained, we use it to evaluate against the StudentEvaluation data which was provided to us in a separate file.
ph_prediction <- predict(cubistModel,test_eval)
test_eval$PH <- ph_prediction
#Predicted pH vs train data
xyplot(test_eval$PH ~ data$ph | data$brand_code,
xlab = "Actual pH",
ylab = "Predicted pH",
pch = 16,
col = "darkgreen",
main = "Predicted vs Actual pH by Brand Code"
)
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.3
# Create a new workbook
wb <- createWorkbook()
# Add the sheets
addWorksheet(wb, "Predictions")
# Write data to sheet
writeData(wb,sheet = "Predictions" , test_eval)
# Save the workbook
saveWorkbook(wb, "StudentPrediction.xlsx", overwrite = TRUE)
Our goal is to use the Beverage dataset which has 32 feature variables and a target variable pH, train and compare several predictive modeling approaches, and select the best-performing model. By using the best model, predict the pH for the Evaluation dataset.
This predictive model will enable ABC Beverage to generate reliable pH predictions that support production optimization, regulatory compliance, and overall product consistency.
By diving into the beverage dataset, we explored 33 features, tackled missing values with MICE imputation, and removed low-variance predictors like hyd_pressure1 to streamline our dataset. Our feature selection revealed key drivers of pH, with Mnf_Flow, Balling, and Pressure_Vacuum topping the list, guiding our modeling choices.
We tested a wide range of models—Linear Regression, Ridge, Lasso, Elastic Net, PCR, PLS, KNN, MARS, Neural Networks, SVM, Random Forest, GBM, Cubist, and Bagged Trees—using 5-fold cross-validation and standardized preprocessing. While linear models like Elastic Net (RMSE: 0.1321, R-squared: 0.4192) handled multicollinearity well, and nonlinear models like Neural Network (RMSE: 0.1143 , R-squared: 0.5684) captured complex patterns, tree-based models shone. Random Forest performed strongly (RMSE: 0.0997, R-squared: 0.6728), but Cubist performed the best with an RMSE of 0.092, R-squared of 0.72, and MAE of 0.068 on the test set, balancing accuracy and interpretability.
Manufacturing flow rate(mnf_flow
), Measurement of sugar
content based on the Balling scale(balling
), alcohol
release, density, Sugar content level(balling_lvl
) and
pressure_vaccum
are key to our model’s performance.
From a business perspective, a model explaining 72% of the pH variance may not be considered strong enough for predicting pH. We believe strong domain knowledge would be necessary to take this model to the next level. Additional modeling gains could be made by creating additional features and more feature engineering from this dataset.
By using a more balanced dataset, researching and analyzing
combination features, we can aim to achieve better performance. The
final predictions generated for the target variable pH using Cubist
model, are all in the range between 8 and 9 with an exception at 9.19.
The patterns are still upheld in our predictions when they are grouped
by the Brand code. We also highlighted the variables that have the most
effect on the PH
. We hope that this information provides
adequate details to address the new regulations in the beverage
industry.