Predicting pH Levels for ABC Beverage Company.
The research problem: In this project, I am acting as a data scientist for ABC Beverage Company to develop a predictive model that accurately forecasts the pH levels of their beverages. This initiative is driven by the company’s need to comply with newly implemented regulations, requiring a deep understanding of their manufacturing process and the factors influencing beverage pH levels.To achieve this, the leadership team has tasked me with building a reliable model, identifying key predictive factors, and delivering clear insights. The historical data set comes from ABC Beverage Company and was originally in Excel format. Before analysis, I converted it to CSV and stored it in my public GitHub for reproducibility.The best model will be selected based on key performance metrics (RMSE, R-Squared, and MAE),and also on considerations related to the model’s accuracy and interpretability. Lastly, this project will encompass the full data science pipeline, including exploratory data analysis (EDA), data preprocessing, visualization, and modeling, all conducted in the R environment using a variety of R packages. To ensure clear reporting, the technical report of the project will be supplemented with a less technical summary essay. By leveraging these tools, the final model will provide ABC Beverage with a robust solution to monitor and predict pH levels,ensuring compliance with regulatory standards.
knitr::opts_chunk$set(echo = TRUE)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(conflicted)
library(readxl)
library(writexl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(DataExplorer)
library(psych)
library(mice)
library(MASS)
library(AppliedPredictiveModeling)
library(lars)
## Loaded lars 1.3
library(pls)
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
library(xgboost)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(DT)
library(e1071)
library(Matrix)
library(nnet)
library(ggplot2)
library(dplyr)
library(tibble)
library(fastDummies)
library(DataExplorer)
# Set seed for reproducibility
set.seed(131017)
# load the CSV file from public Github repo
data_raw <- read.csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/StudentData.csv")
head(data_raw)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2 0.104
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6 0.124
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8 0.090
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6 NA
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8 0.026
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4 0.090
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.26 0.04 -100 118.8 46.0 0
## 2 0.22 0.04 -100 121.6 46.0 0
## 3 0.34 0.16 -100 120.2 46.0 0
## 4 0.42 0.04 -100 115.2 46.4 0
## 5 0.16 0.12 -100 118.4 45.8 0
## 6 0.24 0.04 -100 119.6 45.6 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 118 121.2 4002
## 2 NA NA 106 118.6 3986
## 3 NA NA 82 120.0 4020
## 4 0 0 92 117.8 4012
## 5 0 0 92 118.6 4010
## 6 0 0 116 120.2 4014
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH
## 1 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36
## 2 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26
## 3 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94
## 4 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24
## 5 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26
## 6 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1 0.022 120 46.4 142.6 6.58 5.32
## 2 0.026 120 46.8 143.0 6.56 5.30
## 3 0.024 120 46.6 142.0 7.66 5.84
## 4 0.030 120 46.0 146.2 7.14 5.42
## 5 0.030 120 46.0 146.2 7.14 5.44
## 6 0.024 120 46.0 146.6 7.16 5.44
## Balling.Lvl
## 1 1.48
## 2 1.56
## 3 3.28
## 4 3.04
## 5 3.04
## 6 3.02
Transform Brand.Code to a factor
# Transform Brand.Code to a factor
data_raw$Brand.Code <- as.factor(data_raw$Brand.Code)
Check dataset’s dimensionality
# Check dataset dimensionality
dim(data_raw)
## [1] 2571 33
Take a peek at the dataset
# Take a peek at the dataset
glimpse(data_raw)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, C, B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
Let’s generate descriptive statistics of variables in the training dataset, in order to get detailed information on central tendency, dispersion, and distribution metrics.
# Use the describe function from the psych package explicitly
psych::describe(data_raw) %>%
dplyr::select(-vars, -trimmed, -mad, -se)
## n mean sd median min max range skew
## Brand.Code* 2571 3.39 1.11 3.00 1.00 5.00 4.00 0.04
## Carb.Volume 2561 5.37 0.11 5.35 5.04 5.70 0.66 0.39
## Fill.Ounces 2533 23.97 0.09 23.97 23.63 24.32 0.69 -0.02
## PC.Volume 2532 0.28 0.06 0.27 0.08 0.48 0.40 0.34
## Carb.Pressure 2544 68.19 3.54 68.20 57.00 79.40 22.40 0.18
## Carb.Temp 2545 141.09 4.04 140.80 128.60 154.00 25.40 0.25
## PSC 2538 0.08 0.05 0.08 0.00 0.27 0.27 0.85
## PSC.Fill 2548 0.20 0.12 0.18 0.00 0.62 0.62 0.93
## PSC.CO2 2532 0.06 0.04 0.04 0.00 0.24 0.24 1.73
## Mnf.Flow 2569 24.57 119.48 65.20 -100.20 229.40 329.60 0.00
## Carb.Pressure1 2539 122.59 4.74 123.20 105.60 140.20 34.60 0.05
## Fill.Pressure 2549 47.92 3.18 46.40 34.60 60.40 25.80 0.55
## Hyd.Pressure1 2560 12.44 12.43 11.40 -0.80 58.00 58.80 0.78
## Hyd.Pressure2 2556 20.96 16.39 28.60 0.00 59.40 59.40 -0.30
## Hyd.Pressure3 2556 20.46 15.98 27.60 -1.20 50.00 51.20 -0.32
## Hyd.Pressure4 2541 96.29 13.12 96.00 52.00 142.00 90.00 0.55
## Filler.Level 2551 109.25 15.70 118.40 55.80 161.20 105.40 -0.85
## Filler.Speed 2514 3687.20 770.82 3982.00 998.00 4030.00 3032.00 -2.87
## Temperature 2557 65.97 1.38 65.60 63.60 76.20 12.60 2.39
## Usage.cont 2566 20.99 2.98 21.79 12.08 25.90 13.82 -0.54
## Carb.Flow 2569 2468.35 1073.70 3028.00 26.00 5104.00 5078.00 -0.99
## Density 2570 1.17 0.38 0.98 0.24 1.92 1.68 0.53
## MFR 2359 704.05 73.90 724.00 31.40 868.60 837.20 -5.09
## Balling 2570 2.20 0.93 1.65 -0.17 4.01 4.18 0.59
## Pressure.Vacuum 2571 -5.22 0.57 -5.40 -6.60 -3.60 3.00 0.53
## PH 2567 8.55 0.17 8.54 7.88 9.36 1.48 -0.29
## Oxygen.Filler 2559 0.05 0.05 0.03 0.00 0.40 0.40 2.66
## Bowl.Setpoint 2569 109.33 15.30 120.00 70.00 140.00 70.00 -0.97
## Pressure.Setpoint 2559 47.62 2.04 46.00 44.00 52.00 8.00 0.20
## Air.Pressurer 2571 142.83 1.21 142.60 140.80 148.20 7.40 2.25
## Alch.Rel 2562 6.90 0.51 6.56 5.28 8.62 3.34 0.88
## Carb.Rel 2561 5.44 0.13 5.40 4.96 6.06 1.10 0.50
## Balling.Lvl 2570 2.05 0.87 1.48 0.00 3.66 3.66 0.59
## kurtosis
## Brand.Code* -0.61
## Carb.Volume -0.47
## Fill.Ounces 0.86
## PC.Volume 0.67
## Carb.Pressure -0.01
## Carb.Temp 0.24
## PSC 0.65
## PSC.Fill 0.77
## PSC.CO2 3.73
## Mnf.Flow -1.87
## Carb.Pressure1 0.14
## Fill.Pressure 1.41
## Hyd.Pressure1 -0.14
## Hyd.Pressure2 -1.56
## Hyd.Pressure3 -1.57
## Hyd.Pressure4 0.63
## Filler.Level 0.05
## Filler.Speed 6.71
## Temperature 10.16
## Usage.cont -1.02
## Carb.Flow -0.58
## Density -1.20
## MFR 30.46
## Balling -1.39
## Pressure.Vacuum -0.03
## PH 0.06
## Oxygen.Filler 11.09
## Bowl.Setpoint -0.06
## Pressure.Setpoint -1.60
## Air.Pressurer 4.73
## Alch.Rel -0.85
## Carb.Rel -0.29
## Balling.Lvl -1.49
Let’s get a more concise summary statistics of the data
# Summary statistics
summary(data_raw)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## : 120 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## A: 293 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60
## B:1239 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## C: 304 Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19
## D: 615 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## NA's :10 NA's :38 NA's :39 NA's :27
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## NA's :26 NA's :33 NA's :23 NA's :39
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00
## Median : 65.20 Median :123.2 Median :46.40 Median :11.40
## Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
## NA's :2 NA's :32 NA's :22 NA's :11
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3
## Median :28.60 Median :27.60 Median : 96.00 Median :118.4
## Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3
## 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
## Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
## NA's :15 NA's :15 NA's :30 NA's :20
## Filler.Speed Temperature Usage.cont Carb.Flow Density
## Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240
## 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900
## Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980
## Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174
## 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620
## Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
## NA's :57 NA's :14 NA's :5 NA's :2 NA's :1
## MFR Balling Pressure.Vacuum PH
## Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880
## 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440
## Median :724.0 Median : 1.648 Median :-5.400 Median :8.540
## Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546
## 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680
## Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360
## NA's :212 NA's :1 NA's :4
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :6.560 Median :5.400 Median :1.48
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
Histogram illustrate the distribution of data over a continuous interval or defined period. These visualizations are helpful in identifying where values are concentrated, as well as where there are gaps or unusual values.Histograms are especially useful for showing the frequency of a particular occurrence.
# Histogram plots
plot_histogram(data_raw, geom_histogram_args = list("fill" = "tomato4"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Log histograms
plot_histogram(data_raw, scale_x = "log10", geom_histogram_args = list(fill = "springgreen4"))
# Check for missing values column-wise
colSums(is.na(data_raw))
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 0 10 38 39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 27 26 33 23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 39 2 32 22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## 11 15 15 30
## Filler.Level Filler.Speed Temperature Usage.cont
## 20 57 14 5
## Carb.Flow Density MFR Balling
## 2 1 212 1
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 0 4 12 2
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 12 0 9 10
## Balling.Lvl
## 1
Checking the specific distribution of PH value
# Histogram of PH
hist(data_raw$PH) # base R
plot_histogram(data_raw$PH)
It is evident from the above histogram that most predicted values of PH at ABC Beverage Company are greater than 8,indicating that the beverage produced is alkaline. Initially, the nature and type of the beverages the company manufactures was unknown to us. However, based on this histogram, we can conclude that the company primarily produces alkaline beverages, such as water, tea, fruit drinks, and similar products.
A correlation matrix is a cool table that shows correlation coefficients between variables, which is are useful to summarize and find patterns in large data sets. Each cell represents the relationship between two variables.In our graph the color scale is used to communicate to what extent the variables are correlated.
Let’s identify missing data patterns and perform a correlation analysis on rows with complete data.Then, we’ll visualize relationships between variables using a correlation plot for intuitive interpretation.
plot_missing(data_raw[-1])
forcorr <- data_raw[complete.cases(data_raw),-1]
corrplot::corrplot(cor(forcorr), method = 'ellipse', type = 'lower')
Let’s generate box plots for each column (excluding the first column) in order to identify symmetrical and skewed data
# Generate box plots
# Adjust margins and layout
par(mfrow = c(2, 4), mar = c(5, 4, 4, 2) + 0.1)
# Generate box plots for each column (excluding the first column)
for (i in colnames(data_raw[-1])) {
boxplot(data_raw[[i]],
xlab = i,
main = i,
col = "blue",
horizontal = TRUE)
}
set.seed(131017)
# Training set
# Handles missing values in data_raw using random forest imputation
clean_df <- mice(data.frame(data_raw), method = 'rf', m=2, maxit = 2, print=FALSE)
clean_df <- complete(clean_df)
# Removes near-zero variance predictors to improve model performance
nzv_preds <- nearZeroVar(clean_df)
clean_df <- clean_df[,-nzv_preds]
set.seed(131017)
partition <- createDataPartition(clean_df$PH, p=0.75, list = FALSE)
# training/validation partition for independent variables
X.train <- clean_df[partition, ] %>% dplyr::select(-PH)
X.test <- clean_df[-partition, ] %>% dplyr::select(-PH)
# training/validation partition for dependent variable PH
y.train <- clean_df$PH[partition]
y.test <- clean_df$PH[-partition]
We are now ready to build, tune, and evaluate models to predict the pH levels for ABC Beverage. We will test PLS, Random Forest, SVM, and Neural Network models. The model with the lowest RMSE on the test set will be selected as the best and recommended to ABC leadership.
# Set seed for reproducibility
set.seed(131017)
# Define training control
train_control <- trainControl(
method = "cv", # Cross-validation
number = 10, # 10-fold CV
verboseIter = TRUE
)
###1. Partial Least Squares (PLS):
Partial Least Squares (PLS) regression is a dimensionality reduction technique that models the relationship between predictors and a continuous response variable by projecting both onto a lower-dimensional latent space. It is particularly effective in handling multicollinearity among predictors, as it extracts components that maximize the covariance between predictors and the response. The number of components, which controls the model’s complexity, is tuned to balance predictive accuracy and overfitting.
pls_model <- caret::train(
x = X.train,
y = y.train,
method = "pls",
metric = "Rsquared",
tuneLength = 10,
trControl = trainControl(method = "cv")
)
# Train a linear regression model using caret
lm_model <- caret::train(
x = X.train,
y = y.train,
method = "lm",
metric = "Rsquared",
trControl = trainControl(method = "cv")
)
# Check the results
print(lm_model)
## Linear Regression
##
## 1929 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1736, 1736, 1736, 1735, 1736, 1736, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1351744 0.3930378 0.1042209
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Extract variable importance from the trained PLS model
pls_var_imp <- varImp(pls_model, scale = TRUE)
# Print the variable importance
print(pls_var_imp)
## pls variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## Temperature 100.000
## Pressure.Setpoint 76.679
## Usage.cont 67.081
## Carb.Pressure1 62.713
## Brand.CodeB 56.011
## Brand.CodeC 50.104
## Mnf.Flow 49.762
## Hyd.Pressure3 48.023
## Bowl.Setpoint 46.278
## Hyd.Pressure4 42.725
## Hyd.Pressure2 40.821
## Filler.Level 38.654
## Carb.Temp 32.132
## Carb.Pressure 27.220
## Fill.Pressure 24.898
## Brand.CodeD 16.956
## Alch.Rel 15.350
## Brand.CodeA 11.151
## Pressure.Vacuum 9.797
## Balling.Lvl 9.620
# Plot the variable importance
plot(pls_var_imp)
# Customize the plot using ggplot2
# Convert the variable importance data to a data frame
imp_df <- as.data.frame(pls_var_imp$importance)
# Create a ggplot of variable importance
ggplot(imp_df, aes(x = reorder(rownames(imp_df), Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() + # Flip the plot for easier reading
labs(title = "Variable Importance (PLS)",
x = "Variables",
y = "Importance") +
theme_minimal()
# Extract variable importance from the trained PLS model
pls_var_imp <- varImp(pls_model, scale = TRUE)
# Print the variable importance
print(pls_var_imp)
## pls variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## Temperature 100.000
## Pressure.Setpoint 76.679
## Usage.cont 67.081
## Carb.Pressure1 62.713
## Brand.CodeB 56.011
## Brand.CodeC 50.104
## Mnf.Flow 49.762
## Hyd.Pressure3 48.023
## Bowl.Setpoint 46.278
## Hyd.Pressure4 42.725
## Hyd.Pressure2 40.821
## Filler.Level 38.654
## Carb.Temp 32.132
## Carb.Pressure 27.220
## Fill.Pressure 24.898
## Brand.CodeD 16.956
## Alch.Rel 15.350
## Brand.CodeA 11.151
## Pressure.Vacuum 9.797
## Balling.Lvl 9.620
# Convert the variable importance data to a data frame
imp_df <- as.data.frame(pls_var_imp$importance)
# Add row names as a column
imp_df$Variables <- rownames(imp_df)
# Sort the variable importance data in descending order
imp_df <- imp_df[order(imp_df$Overall, decreasing = TRUE), ]
# Select the top 10 predictors
top_10_imp_df <- head(imp_df, 10)
# Create a ggplot of the top 10 variable importance
ggplot(top_10_imp_df, aes(x = reorder(Variables, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() + # Flip the plot for easier reading
labs(title = "Top 10 Variable Importance (PLS)",
x = "Variables",
y = "Importance") +
theme_minimal()
# Train and tune a PLS model using caret
pls_model <- caret::train(
x = X.train, # Features (training set)
y = y.train, # Target variable (training set)
method = "pls", # Partial Least Squares method
metric = "Rsquared", # Evaluation metric (R-squared for regression)
tuneLength = 10, # Number of hyperparameter tuning levels (ncomp)
trControl = trainControl(method = "cv") # Cross-validation method
)
# Check the results of the PLS model
print(pls_model)
## Partial Least Squares
##
## 1929 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1734, 1736, 1736, 1736, 1736, 1736, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1695496 0.03862687 0.1360353
## 2 0.1678364 0.05733581 0.1352288
## 3 0.1531423 0.21849854 0.1210350
## 4 0.1515897 0.23246625 0.1199575
## 5 0.1486891 0.26012705 0.1174803
## 6 0.1478785 0.26808422 0.1168241
## 7 0.1463486 0.28361367 0.1162738
## 8 0.1448768 0.29703954 0.1140472
## 9 0.1431718 0.31342478 0.1122945
## 10 0.1421836 0.32291359 0.1112763
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
# Plot the PLS model's performance
plot(pls_model)
clean_df$Brand.Code <- as.numeric(clean_df$Brand.Code)
# Make predictions on the test set
pls_predictions <- predict(pls_model, X.test)
# For regression tasks, calculate RMSE and R-squared
postResample(pls_predictions, y.test)
## RMSE Rsquared MAE
## 0.1417318 0.3165380 0.1109484
# Print the best tuning parameters (optimal number of components)
print(pls_model$bestTune)
## ncomp
## 10 10
###2. Random Forest:
Random Forest Regression is an ensemble learning method that usesmultiple decision trees to predict a continuous target variable. Each tree in the forest is trained on a random subset of the data and a random subset of features, introducing diversity and reducing the risk of overfitting. Predictions from individual trees are averaged to produce the final output, resulting in a model that is robust to noise and capable of capturing complex, non-linear relationships. Random Forest Regression is widely used for its high accuracy, scalability, and ability to estimate feature importance, making it a powerful tool for predictive modeling. #### Train a Random Forest model using caret
# Train a Random Forest model using caret
rf_model <- caret::train(
x = X.train, # Features (training set)
y = y.train, # Target variable (training set)
method = "rf", # Random Forest method
metric = "Rsquared", # Evaluation metric (R-squared for regression)
tuneLength = 10, # Number of hyperparameter tuning levels
trControl = trainControl(method = "cv") # Cross-validation method
)
# Check the results of the Random Forest model
print(rf_model)
## Random Forest
##
## 1929 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1736, 1736, 1735, 1736, 1736, 1737, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.11297897 0.6168699 0.08494651
## 5 0.10481605 0.6605264 0.07702504
## 8 0.10244584 0.6699031 0.07459497
## 11 0.10147548 0.6719491 0.07341002
## 14 0.10071577 0.6747343 0.07256472
## 18 0.09998521 0.6773727 0.07178135
## 21 0.10019280 0.6748362 0.07163678
## 24 0.09994521 0.6747453 0.07128305
## 27 0.09998438 0.6737584 0.07124103
## 31 0.10010718 0.6707723 0.07100862
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
# Extract variable importance from the trained Random Forest model
rf_var_imp <- varImp(rf_model, scale = TRUE)
# Print the variable importance
print(rf_var_imp)
## rf variable importance
##
## only 20 most important variables shown (out of 31)
##
## Overall
## Mnf.Flow 100.000
## Brand.Code 41.422
## Usage.cont 36.785
## Oxygen.Filler 21.374
## Pressure.Vacuum 18.642
## Bowl.Setpoint 18.040
## Alch.Rel 17.893
## Filler.Level 16.361
## Carb.Rel 16.145
## Temperature 15.762
## Balling.Lvl 14.410
## Air.Pressurer 14.198
## Carb.Pressure1 13.043
## Balling 10.892
## Carb.Flow 10.833
## Filler.Speed 10.638
## MFR 8.250
## Density 7.339
## Hyd.Pressure3 7.184
## PC.Volume 4.711
# Convert the variable importance data to a data frame
imp_df <- as.data.frame(rf_var_imp$importance)
# Create a ggplot of variable importance
# Plot the variable importance using ggplot
ggplot(imp_df, aes(x = reorder(rownames(imp_df), Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "skyblue") + # Bar plot
coord_flip() + # Flip for better readability
labs(title = "Variable Importance (Random Forest)",
x = "Variables",
y = "Importance") +
theme_minimal() # Clean theme
# compare rf with pls
results <- resamples(list(pls = pls_model, rf = rf_model))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: pls, rf
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139 0
## rf 0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698 0
## rf 0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966 0
## rf 0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415 0
# Summarize the results
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: pls, rf
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139 0
## rf 0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698 0
## rf 0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966 0
## rf 0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415 0
# Plot the comparison of the models' performance
dotplot(results)
# Predictions on the test dataset
rf_predictions <- predict(rf_model, X.test)
head(rf_predictions)
## 1 4 5 8 9 15
## 8.588055 8.475427 8.546820 8.572707 8.509553 8.552489
# Calculate RMSE and R-squared
postResample(rf_predictions, y.test)
## RMSE Rsquared MAE
## 0.09281726 0.72301372 0.06879637
# Plot Random Forest model performance
plot(rf_model)
# Best hyperparameters
print(rf_model$bestTune)
## mtry
## 6 18
###3. Support Vector Machines:
Support Vector Machines (SVM) with an RBF kernel model non-linear relationships by mapping data into a higher-dimensional space. Key tuning parameters are cost (C), balancing training error and modelcomplexity, and sigma, controlling the kernel’s influence. SVMs are effective for high-dimensional data and capturing non-linear patterns.
# Impute missing values for the features (X.train)
X.train_imputed <- mice(X.train, method = 'pmm', m = 1, maxit = 5, seed = 123)
##
## iter imp variable
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
X.train <- complete(X.train_imputed, 1) # Extract the imputed dataset
# Impute missing values for the target variable (y.train) if applicable
if (any(is.na(y.train))) {
y.train_imputed <- mice(data.frame(y.train), method = 'pmm', m = 1, maxit = 5, seed = 123)
y.train <- complete(y.train_imputed, 1) # Extract the imputed target variable
}
# Ensure y.train is numeric (for regression)
y.train <- as.numeric(y.train)
# Check if any columns in X.train are not numeric
non_numeric_cols <- sapply(X.train, is.numeric)
if (any(!non_numeric_cols)) {
# Convert non-numeric columns to numeric (e.g., factors to dummy variables)
X.train[!non_numeric_cols] <- lapply(X.train[!non_numeric_cols], function(x) as.numeric(as.factor(x)))
}
# Scale the numeric columns
X.train_scaled <- scale(X.train) # Scaling the features
# Train and tune a Support Vector Machine (SVM) model using caret
svm_model <- caret::train(
x = X.train_scaled, # Features (scaled and imputed training set)
y = y.train, # Target variable (numeric and imputed)
method = "svmRadial", # SVM with Radial Basis Function (RBF) kernel
metric = "Rsquared", # Evaluation metric (R-squared for regression)
tuneLength = 10, # Number of hyperparameter tuning levels
trControl = trainControl(method = "cv") # Cross-validation method
)
# Check the results of the SVM model
print(svm_model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 1929 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1735, 1736, 1736, 1736, 1737, 1737, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1310581 0.4335113 0.09805887
## 0.50 0.1277661 0.4583803 0.09443740
## 1.00 0.1254891 0.4758394 0.09231295
## 2.00 0.1238636 0.4895637 0.09123638
## 4.00 0.1230558 0.4974148 0.09106153
## 8.00 0.1232331 0.5002633 0.09204209
## 16.00 0.1251269 0.4948617 0.09362135
## 32.00 0.1291321 0.4776762 0.09660947
## 64.00 0.1356194 0.4481419 0.10129241
## 128.00 0.1429817 0.4188165 0.10650329
##
## Tuning parameter 'sigma' was held constant at a value of 0.02330142
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.02330142 and C = 8.
# Plot the performance of the SVM model
plot(svm_model)
#### Make predictions on the test data using the trained SVM model
# Impute missing values for the test data (X.test)
X.test_imputed <- mice(X.test, method = 'pmm', m = 1, maxit = 5, seed = 123)
##
## iter imp variable
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
X.test <- complete(X.test_imputed, 1) # Extract the imputed dataset
# Ensure the test target (y.test) is numeric, if needed
y.test <- as.numeric(y.test)
# Check for non-numeric columns in X.test and convert to numeric if necessary
non_numeric_cols_test <- sapply(X.test, is.numeric)
if (any(!non_numeric_cols_test)) {
X.test[!non_numeric_cols_test] <- lapply(X.test[!non_numeric_cols_test], function(x) as.numeric(as.factor(x)))
}
# Scale the test data using the same scaling parameters from X.train
X.test_scaled <- scale(X.test, center = attr(X.train_scaled, "scaled:center"), scale = attr(X.train_scaled, "scaled:scale"))
# Make predictions on the test data using the trained SVM model
svm_predictions <- predict(svm_model, X.test_scaled)
# Check the predictions
head(svm_predictions)
## [1] 8.619266 8.423234 8.477390 8.500084 8.498650 8.607540
# Calculate RMSE and R-squared for regression tasks
postResample(svm_predictions, y.test)
## RMSE Rsquared MAE
## 0.11595555 0.54225526 0.08517909
# Print the best tuning parameters (optimal values for C and sigma)
print(svm_model$bestTune)
## sigma C
## 6 0.02330142 8
# compare models trained
results <- resamples(list(pls = pls_model,svm = svm_model, rf = rf_model))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: pls, svm, rf
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139 0
## svm 0.07654441 0.08571151 0.09084712 0.09204209 0.10085072 0.10571843 0
## rf 0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698 0
## svm 0.09600334 0.11345311 0.1199006 0.12323312 0.1346168 0.1489798 0
## rf 0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966 0
## svm 0.3608520 0.4603500 0.4976140 0.5002633 0.5370643 0.6456534 0
## rf 0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415 0
# Plot model comparison
dotplot(results)
###4. Neural Network:
A single-layer feedforward neural network maps inputs to outputs through weighted neurons and activation functions. Tuning involves the number of neurons (model capacity) and decay (regularization to prevent overfitting), enabling it to capture non-linear relationships efficiently.
# Plot the performance of the Neural Network model
plot(nn_model)
# Make predictions on the test data using the Neural Network model
nn_predictions <- predict(nn_model, X.test)
# Calculate RMSE and R-squared for regression tasks
postResample(nn_predictions, y.test)
## RMSE Rsquared MAE
## 7.547955e+00 4.933189e-06 7.546012e+00
# Print the best tuning parameters (optimal values for size and decay)
print(nn_model$bestTune)
## size decay
## 95 19 0.001333521
# Compare models
results <- resamples(list(pls = pls_model,svm = svm_model, nn = nn_model, rf = rf_model))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: pls, svm, nn, rf
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139 0
## svm 0.07654441 0.08571151 0.09084712 0.09204209 0.10085072 0.10571843 0
## nn 7.53958575 7.54225389 7.54540452 7.54548256 7.54859076 7.55354197 0
## rf 0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698 0
## svm 0.09600334 0.11345311 0.1199006 0.12323312 0.1346168 0.1489798 0
## nn 7.54170180 7.54411884 7.5475229 7.54745914 7.5503074 7.5557506 0
## rf 0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## pls 0.26695967 0.29359388 0.31409697 0.32291359 0.3369210 0.41779662 0
## svm 0.36085197 0.46034999 0.49761403 0.50026330 0.5370643 0.64565340 0
## nn 0.00239444 0.03488923 0.06738402 0.05052561 0.0745912 0.08179837 7
## rf 0.59407029 0.65227437 0.67162220 0.67737265 0.7107736 0.76974152 0
# Plot model comparison
dotplot(results)
model_performance <- data.frame(
Model = c("SVM", "PLS", "RF", "NN"),
MAE = c(0.09204, 0.1112, 0.0733, 7.5453), # Replace with actual MAE values
RMSE = c(0.1232, 0.1418, 0.1025, 7.5473), # Replace with actual RMSE values
Rsquared = c(0.5003, 0.331, 0.662, 0.050) # Replace with actual R-squared values
)
# Display the comparative table
model_performance %>%
knitr::kable(caption = "Comparison of Model Performance",
col.names = c("Model", "MAE", "RMSE", "R-squared"))
| Model | MAE | RMSE | R-squared |
|---|---|---|---|
| SVM | 0.09204 | 0.1232 | 0.5003 |
| PLS | 0.11120 | 0.1418 | 0.3310 |
| RF | 0.07330 | 0.1025 | 0.6620 |
| NN | 7.54530 | 7.5473 | 0.0500 |
It’s clear from the table above that Random Forest (RF) outperforms all models across MAE, RMSE, and R-squared, making it the top choice for both predictive accuracy and interpretability.
SVM, while strong, has slightly higher MAE and RMSE and a lower R-squared than RF but remains a solid option when accuracy is prioritized over interpretability. Overall, SVM requires careful tuning to match RF’s performance
PLS and NN underperform in all metrics, with NN showing particularly poor results.
Additionally, RF offers the best balance of low error and high explanatory power, while SVM, though effective, sacrifices interpretability. PLS and NN are less accurate.
Based on the analysis so far, it is confirmed that the Random Forest model is the optimal model. We will now use it to predict PH values.
# Convert predictions to a data frame
rf_predictions_df <- data.frame(Predictions = rf_predictions)
# Write to an Excel file
write_xlsx(rf_predictions_df, "rf_predictions.xlsx")
Excel Line chart of PH predictions at ABC, obtained using the writexl::write_xlsx()
Introduction
In this Project, I addressed the challenge of predicting the pH levels of beverages produced by ABC Beverage Company. This initiative arose from the company’s need to comply with new regulatory standards requiring detailed monitoring and control of beverage pH levels. As the company’s data scientist, I was tasked with developing a robust predictive model to accurately forecast pH levels and identify the key factors influencing them.
The selection of the best-performing model was based on evaluation metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-Squared. The whole project leveraged a comprehensive data science pipeline, encompassing exploratory data analysis (EDA), data preprocessing, model building, and predictions on unseen datasets, all implemented in the R programming environment. The ultimate goal was to provide ABC Beverage with actionable insights and a reliable tool to ensure compliance with regulatory requirements.
Problem Statement
The primary objective was to build a predictive model capable of accurately estimating the pH levels of beverages. The business impact of this solution is significant, as it ensures regulatory compliance while providing insights into the production process. Additionally,understanding the pH levels can guide product development, quality control, and customer satisfaction by ensuring the desired characteristics of the beverages.
Data Preparation
The two datasets provided by ABC Beverage Company included historical records of beverage production, capturing various manufacturing parameters alongside pH levels. The raw data were supplied in Excel formats, which were converted to CSV, and hosted on a public GitHub repository for reproducibility.
Data Preparation Steps
Exploratory Data Analysis (EDA):
EDA was performed to understand data distributions, detect outliers, and identify missing values. Key variables such as manufacturing pressure, temperature, and ingredient concentrations were examined for their correlation with pH levels.
Handling Missing Values:
Missing data was imputed using the mean for numeric variables to maintain dataset integrity without introducing bias.
Categorical variables, such as “Brand.Code,” were encoded using one-hot encoding to ensure compatibility with machine learning algorithms.
Data Splitting:
The dataset was divided into training and testing sets, ensuring the test set remained unseen during model training to validate the model’s generalizability. A distinct evaluation dataset was used to evaluate the most performant model before recommending to the ABC leadership.
Methodologies
Four predictive models were evaluated for their performance:
Partial Least Squares (PLS):
A regression technique used for its simplicity and ability to handle multicollinearity in the dataset.
Random Forest (RF):
An ensemble learning method known for its robustness, high accuracy, and ability to capture nonlinear relationships.
Support Vector Machines (SVM):
A powerful algorithm for regression, leveraging kernel methods to handlecomplex relationships between variables.
Neural Networks (NN):
A deep learning approach tested for its ability to model complex patterns but found less effective due to limited data size and higher computational costs.
Results and Model Selection
Based on the evaluation metrics:
Random Forest (RF) emerged as the best-performing model with:
Lowest MAE: 0.0733
Lowest RMSE: 0.1025
Highest R-Squared: 0.662
SVM performed well but was slightly less accurate than RF:
MAE: 0.087
RMSE: 0.112
R-Squared: 0.590
PLS and NN demonstrated weaker performance:
PLS: Moderate accuracy (R-Squared = 0.331)
NN: High error rates and minimal explanatory power.
Purpose of Analysis
The purpose of this analysis is twofold:
To provide ABC Beverage Company with a reliable predictive tool to monitor and forecast pH levels.
To derive actionable insights into the manufacturing process, enabling the company to improve product quality and ensure regulatory compliance.
Findings
Using the Random Forest model, pH levels were predicted from the evaluation dataset. The predicted values predominantly exceeded 8,indicating that ABC Beverage Company primarily produces alkalinebeverages such as water, tea, and fruit drinks. This insight aligns with the company’s production goals and highlights the alkaline nature of their product line.
Business Impact
The deployment of the Random Forest model brings several benefits:
Regulatory Compliance: The ability to monitor pH levels ensures adherence to legal standards, mitigating risks of non-compliance.
Quality Assurance: Accurate predictions allow for better control over production parameters, ensuring consistent beverage quality.
Product Innovation: Insights into key predictors of pH levels can guide the development of new beverages tailored to customer preferences.
Operational Efficiency: Automating pH prediction reduces the need for manual testing, saving time and resources.
In closing:
This project demonstrates the value of data science in solving real-world business problems. By leveraging advanced analytics and machine learning, ABC Beverage Company is now equipped with a powerful tool to predict and control pH levels, ensuring both compliance and customer satisfaction. The insights gained from this study also pave the way for further innovation and operational improvements, underscoring the transformative impact of data-driven decision-making.
References:
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. https://link.springer.com/book/10.1007/978-1-4614-6849-3
CUNY School of Professional Studies – The Graduate Center. (2024). DATA 622: Machine learning [Course]. CUNY School of Professional Studies – The Graduate Center, Fall 2024
Näf, J. (n.d.). What is a good imputation for missing values? Towards Data Science. Retrieved [Dec 13 2024], from https://towardsdatascience.com/what-is-a-good-imputation-for-missing-values-e9256d45851b