Predicting pH Levels for ABC Beverage Company.

The research problem: In this project, I am acting as a data scientist for ABC Beverage Company to develop a predictive model that accurately forecasts the pH levels of their beverages. This initiative is driven by the company’s need to comply with newly implemented regulations, requiring a deep understanding of their manufacturing process and the factors influencing beverage pH levels.To achieve this, the leadership team has tasked me with building a reliable model, identifying key predictive factors, and delivering clear insights. The historical data set comes from ABC Beverage Company and was originally in Excel format. Before analysis, I converted it to CSV and stored it in my public GitHub for reproducibility.The best model will be selected based on key performance metrics (RMSE, R-Squared, and MAE),and also on considerations related to the model’s accuracy and interpretability. Lastly, this project will encompass the full data science pipeline, including exploratory data analysis (EDA), data preprocessing, visualization, and modeling, all conducted in the R environment using a variety of R packages. To ensure clear reporting, the technical report of the project will be supplemented with a less technical summary essay. By leveraging these tools, the final model will provide ABC Beverage with a robust solution to monitor and predict pH levels,ensuring compliance with regulatory standards.

Getting started

Load necessary libraries

knitr::opts_chunk$set(echo = TRUE)

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(conflicted)
library(readxl)
library(writexl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

library(doParallel)

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loading required package: iterators

## Loading required package: parallel

library(DataExplorer)
library(psych)
library(mice)
library(MASS)
library(AppliedPredictiveModeling)
library(lars)

## Loaded lars 1.3

library(pls)
library(earth)

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

## 
## Attaching package: 'plotrix'

## The following object is masked from 'package:psych':
## 
##     rescale

library(xgboost)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(DT)
library(e1071)
library(Matrix)
library(nnet)
library(ggplot2)
library(dplyr)
library(tibble)
library(fastDummies)
library(DataExplorer)
# Set seed for reproducibility
set.seed(131017)

The data:

Import and read the training dataset

# load the CSV file from public Github repo
data_raw <- read.csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/StudentData.csv")
head(data_raw)

##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

Exploratory Data Analysis (EDA)

Data preprocessing

Transform Brand.Code to a factor

# Transform Brand.Code to a factor
data_raw$Brand.Code <- as.factor(data_raw$Brand.Code)

Check dataset’s dimensionality

# Check dataset dimensionality
dim(data_raw)

## [1] 2571   33

Take a peek at the dataset

# Take a peek at the dataset
glimpse(data_raw)

## Rows: 2,571
## Columns: 33
## $ Brand.Code        <fct> B, A, B, A, A, A, A, B, B, B, B, B, B, B, B, B, C, B…
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

Summary statistics

Let’s generate descriptive statistics of variables in the training dataset, in order to get detailed information on central tendency, dispersion, and distribution metrics.

# Use the describe function from the psych package explicitly
psych::describe(data_raw) %>%
  dplyr::select(-vars, -trimmed, -mad, -se)

##                      n    mean      sd  median     min     max   range  skew
## Brand.Code*       2571    3.39    1.11    3.00    1.00    5.00    4.00  0.04
## Carb.Volume       2561    5.37    0.11    5.35    5.04    5.70    0.66  0.39
## Fill.Ounces       2533   23.97    0.09   23.97   23.63   24.32    0.69 -0.02
## PC.Volume         2532    0.28    0.06    0.27    0.08    0.48    0.40  0.34
## Carb.Pressure     2544   68.19    3.54   68.20   57.00   79.40   22.40  0.18
## Carb.Temp         2545  141.09    4.04  140.80  128.60  154.00   25.40  0.25
## PSC               2538    0.08    0.05    0.08    0.00    0.27    0.27  0.85
## PSC.Fill          2548    0.20    0.12    0.18    0.00    0.62    0.62  0.93
## PSC.CO2           2532    0.06    0.04    0.04    0.00    0.24    0.24  1.73
## Mnf.Flow          2569   24.57  119.48   65.20 -100.20  229.40  329.60  0.00
## Carb.Pressure1    2539  122.59    4.74  123.20  105.60  140.20   34.60  0.05
## Fill.Pressure     2549   47.92    3.18   46.40   34.60   60.40   25.80  0.55
## Hyd.Pressure1     2560   12.44   12.43   11.40   -0.80   58.00   58.80  0.78
## Hyd.Pressure2     2556   20.96   16.39   28.60    0.00   59.40   59.40 -0.30
## Hyd.Pressure3     2556   20.46   15.98   27.60   -1.20   50.00   51.20 -0.32
## Hyd.Pressure4     2541   96.29   13.12   96.00   52.00  142.00   90.00  0.55
## Filler.Level      2551  109.25   15.70  118.40   55.80  161.20  105.40 -0.85
## Filler.Speed      2514 3687.20  770.82 3982.00  998.00 4030.00 3032.00 -2.87
## Temperature       2557   65.97    1.38   65.60   63.60   76.20   12.60  2.39
## Usage.cont        2566   20.99    2.98   21.79   12.08   25.90   13.82 -0.54
## Carb.Flow         2569 2468.35 1073.70 3028.00   26.00 5104.00 5078.00 -0.99
## Density           2570    1.17    0.38    0.98    0.24    1.92    1.68  0.53
## MFR               2359  704.05   73.90  724.00   31.40  868.60  837.20 -5.09
## Balling           2570    2.20    0.93    1.65   -0.17    4.01    4.18  0.59
## Pressure.Vacuum   2571   -5.22    0.57   -5.40   -6.60   -3.60    3.00  0.53
## PH                2567    8.55    0.17    8.54    7.88    9.36    1.48 -0.29
## Oxygen.Filler     2559    0.05    0.05    0.03    0.00    0.40    0.40  2.66
## Bowl.Setpoint     2569  109.33   15.30  120.00   70.00  140.00   70.00 -0.97
## Pressure.Setpoint 2559   47.62    2.04   46.00   44.00   52.00    8.00  0.20
## Air.Pressurer     2571  142.83    1.21  142.60  140.80  148.20    7.40  2.25
## Alch.Rel          2562    6.90    0.51    6.56    5.28    8.62    3.34  0.88
## Carb.Rel          2561    5.44    0.13    5.40    4.96    6.06    1.10  0.50
## Balling.Lvl       2570    2.05    0.87    1.48    0.00    3.66    3.66  0.59
##                   kurtosis
## Brand.Code*          -0.61
## Carb.Volume          -0.47
## Fill.Ounces           0.86
## PC.Volume             0.67
## Carb.Pressure        -0.01
## Carb.Temp             0.24
## PSC                   0.65
## PSC.Fill              0.77
## PSC.CO2               3.73
## Mnf.Flow             -1.87
## Carb.Pressure1        0.14
## Fill.Pressure         1.41
## Hyd.Pressure1        -0.14
## Hyd.Pressure2        -1.56
## Hyd.Pressure3        -1.57
## Hyd.Pressure4         0.63
## Filler.Level          0.05
## Filler.Speed          6.71
## Temperature          10.16
## Usage.cont           -1.02
## Carb.Flow            -0.58
## Density              -1.20
## MFR                  30.46
## Balling              -1.39
## Pressure.Vacuum      -0.03
## PH                    0.06
## Oxygen.Filler        11.09
## Bowl.Setpoint        -0.06
## Pressure.Setpoint    -1.60
## Air.Pressurer         4.73
## Alch.Rel             -0.85
## Carb.Rel             -0.29
## Balling.Lvl          -1.49

Let’s get a more concise summary statistics of the data

# Summary statistics
summary(data_raw)

##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##   : 120     Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
##  A: 293     1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917   1st Qu.:65.60  
##  B:1239     Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
##  C: 304     Mean   :5.370   Mean   :23.97   Mean   :0.27712   Mean   :68.19  
##  D: 615     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200   3rd Qu.:70.60  
##             Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
##             NA's   :10      NA's   :38      NA's   :39        NA's   :27     
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.1   Mean   :0.08457   Mean   :0.1954   Mean   :0.05641  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
##  Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
##  NA's   :26      NA's   :33        NA's   :23       NA's   :39       
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1  
##  Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   :-0.80  
##  1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00   1st Qu.: 0.00  
##  Median :  65.20   Median :123.2   Median :46.40   Median :11.40  
##  Mean   :  24.57   Mean   :122.6   Mean   :47.92   Mean   :12.44  
##  3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:20.20  
##  Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :58.00  
##  NA's   :2         NA's   :32      NA's   :22      NA's   :11     
##  Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4     Filler.Level  
##  Min.   : 0.00   Min.   :-1.20   Min.   : 52.00   Min.   : 55.8  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 98.3  
##  Median :28.60   Median :27.60   Median : 96.00   Median :118.4  
##  Mean   :20.96   Mean   :20.46   Mean   : 96.29   Mean   :109.3  
##  3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00   3rd Qu.:120.0  
##  Max.   :59.40   Max.   :50.00   Max.   :142.00   Max.   :161.2  
##  NA's   :15      NA's   :15      NA's   :30       NA's   :20     
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.240  
##  1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144   1st Qu.:0.900  
##  Median :3982   Median :65.60   Median :21.79   Median :3028   Median :0.980  
##  Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468   Mean   :1.174  
##  3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186   3rd Qu.:1.620  
##  Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920  
##  NA's   :57     NA's   :14      NA's   :5       NA's   :2      NA's   :1      
##       MFR           Balling       Pressure.Vacuum        PH       
##  Min.   : 31.4   Min.   :-0.170   Min.   :-6.600   Min.   :7.880  
##  1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600   1st Qu.:8.440  
##  Median :724.0   Median : 1.648   Median :-5.400   Median :8.540  
##  Mean   :704.0   Mean   : 2.198   Mean   :-5.216   Mean   :8.546  
##  3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000   3rd Qu.:8.680  
##  Max.   :868.6   Max.   : 4.012   Max.   :-3.600   Max.   :9.360  
##  NA's   :212     NA's   :1                         NA's   :4      
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04684   Mean   :109.3   Mean   :47.62     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :12        NA's   :2       NA's   :12                       
##     Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :6.560   Median :5.400   Median :1.48  
##  Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :8.620   Max.   :6.060   Max.   :3.66  
##  NA's   :9       NA's   :10      NA's   :1

Histograms:

Histogram illustrate the distribution of data over a continuous interval or defined period. These visualizations are helpful in identifying where values are concentrated, as well as where there are gaps or unusual values.Histograms are especially useful for showing the frequency of a particular occurrence.

# Histogram plots
plot_histogram(data_raw, geom_histogram_args = list("fill" = "tomato4"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Log histograms
plot_histogram(data_raw, scale_x = "log10", geom_histogram_args = list(fill = "springgreen4"))

# Check for missing values column-wise
colSums(is.na(data_raw))

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                10                38                39 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                27                26                33                23 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                39                 2                32                22 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                11                15                15                30 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                20                57                14                 5 
##         Carb.Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 4                12                 2 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                12                 0                 9                10 
##       Balling.Lvl 
##                 1

Checking the specific distribution of PH value

# Histogram of PH
hist(data_raw$PH) # base R

plot_histogram(data_raw$PH)

It is evident from the above histogram that most predicted values of PH at ABC Beverage Company are greater than 8,indicating that the beverage produced is alkaline. Initially, the nature and type of the beverages the company manufactures was unknown to us. However, based on this histogram, we can conclude that the company primarily produces alkaline beverages, such as water, tea, fruit drinks, and similar products.

Correlation matrix

A correlation matrix is a cool table that shows correlation coefficients between variables, which is are useful to summarize and find patterns in large data sets. Each cell represents the relationship between two variables.In our graph the color scale is used to communicate to what extent the variables are correlated.

Let’s identify missing data patterns and perform a correlation analysis on rows with complete data.Then, we’ll visualize relationships between variables using a correlation plot for intuitive interpretation.

plot_missing(data_raw[-1])

forcorr <- data_raw[complete.cases(data_raw),-1]
corrplot::corrplot(cor(forcorr), method = 'ellipse', type = 'lower')

Box plots

Let’s generate box plots for each column (excluding the first column) in order to identify symmetrical and skewed data

# Generate box plots
# Adjust margins and layout 
par(mfrow = c(2, 4), mar = c(5, 4, 4, 2) + 0.1)

# Generate box plots for each column (excluding the first column)
for (i in colnames(data_raw[-1])) {
  boxplot(data_raw[[i]], 
          xlab = i, 
          main = i, 
          col = "blue", 
          horizontal = TRUE)
}

Preparing the data for modeling

Handling missing values in the raw data set using random forest imputation

set.seed(131017)

# Training set
# Handles missing values in data_raw using random forest imputation
clean_df <- mice(data.frame(data_raw), method = 'rf', m=2, maxit = 2, print=FALSE)
clean_df <- complete(clean_df)

Removing near-zero variance predictors to improve model performance

# Removes near-zero variance predictors to improve model performance
nzv_preds <- nearZeroVar(clean_df)
clean_df <- clean_df[,-nzv_preds]

Data train/test split

set.seed(131017)
partition <- createDataPartition(clean_df$PH, p=0.75, list = FALSE)

# training/validation partition for independent variables
X.train <- clean_df[partition, ] %>% dplyr::select(-PH)
X.test <- clean_df[-partition, ] %>% dplyr::select(-PH)

# training/validation partition for dependent variable PH
y.train <- clean_df$PH[partition]
y.test <- clean_df$PH[-partition]

We are now ready to build, tune, and evaluate models to predict the pH levels for ABC Beverage. We will test PLS, Random Forest, SVM, and Neural Network models. The model with the lowest RMSE on the test set will be selected as the best and recommended to ABC leadership.

Building the Models

# Set seed for reproducibility
set.seed(131017)

# Define training control
train_control <- trainControl(
  method = "cv",  # Cross-validation
  number = 10,    # 10-fold CV
  verboseIter = TRUE
)

###1. Partial Least Squares (PLS):

Partial Least Squares (PLS) regression is a dimensionality reduction technique that models the relationship between predictors and a continuous response variable by projecting both onto a lower-dimensional latent space. It is particularly effective in handling multicollinearity among predictors, as it extracts components that maximize the covariance between predictors and the response. The number of components, which controls the model’s complexity, is tuned to balance predictive accuracy and overfitting.

Train a linear regression model using caret

pls_model <- caret::train(
  x = X.train,
  y = y.train,
  method = "pls",
  metric = "Rsquared",
  tuneLength = 10,
  trControl = trainControl(method = "cv")
)
# Train a linear regression model using caret
lm_model <- caret::train(
  x = X.train,
  y = y.train,
  method = "lm",
  metric = "Rsquared",
  trControl = trainControl(method = "cv")
)

# Check the results
print(lm_model)

## Linear Regression 
## 
## 1929 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1736, 1736, 1736, 1735, 1736, 1736, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1351744  0.3930378  0.1042209
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Extract variable importance from the trained PLS model

# Extract variable importance from the trained PLS model
pls_var_imp <- varImp(pls_model, scale = TRUE)

# Print the variable importance
print(pls_var_imp)

## pls variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                   Overall
## Temperature       100.000
## Pressure.Setpoint  76.679
## Usage.cont         67.081
## Carb.Pressure1     62.713
## Brand.CodeB        56.011
## Brand.CodeC        50.104
## Mnf.Flow           49.762
## Hyd.Pressure3      48.023
## Bowl.Setpoint      46.278
## Hyd.Pressure4      42.725
## Hyd.Pressure2      40.821
## Filler.Level       38.654
## Carb.Temp          32.132
## Carb.Pressure      27.220
## Fill.Pressure      24.898
## Brand.CodeD        16.956
## Alch.Rel           15.350
## Brand.CodeA        11.151
## Pressure.Vacuum     9.797
## Balling.Lvl         9.620

# Plot the variable importance
plot(pls_var_imp)

# Customize the plot using ggplot2


# Convert the variable importance data to a data frame
imp_df <- as.data.frame(pls_var_imp$importance)

# Create a ggplot of variable importance
ggplot(imp_df, aes(x = reorder(rownames(imp_df), Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +                     # Flip the plot for easier reading
  labs(title = "Variable Importance (PLS)",
       x = "Variables",
       y = "Importance") +
  theme_minimal()

# Extract variable importance from the trained PLS model
pls_var_imp <- varImp(pls_model, scale = TRUE)

# Print the variable importance
print(pls_var_imp)

## pls variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                   Overall
## Temperature       100.000
## Pressure.Setpoint  76.679
## Usage.cont         67.081
## Carb.Pressure1     62.713
## Brand.CodeB        56.011
## Brand.CodeC        50.104
## Mnf.Flow           49.762
## Hyd.Pressure3      48.023
## Bowl.Setpoint      46.278
## Hyd.Pressure4      42.725
## Hyd.Pressure2      40.821
## Filler.Level       38.654
## Carb.Temp          32.132
## Carb.Pressure      27.220
## Fill.Pressure      24.898
## Brand.CodeD        16.956
## Alch.Rel           15.350
## Brand.CodeA        11.151
## Pressure.Vacuum     9.797
## Balling.Lvl         9.620

# Convert the variable importance data to a data frame
imp_df <- as.data.frame(pls_var_imp$importance)

# Add row names as a column
imp_df$Variables <- rownames(imp_df)

# Sort the variable importance data in descending order
imp_df <- imp_df[order(imp_df$Overall, decreasing = TRUE), ]

# Select the top 10 predictors
top_10_imp_df <- head(imp_df, 10)

# Create a ggplot of the top 10 variable importance
ggplot(top_10_imp_df, aes(x = reorder(Variables, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +                     # Flip the plot for easier reading
  labs(title = "Top 10 Variable Importance (PLS)",
       x = "Variables",
       y = "Importance") +
  theme_minimal()

# Train and tune a PLS model using caret
pls_model <- caret::train(
  x = X.train,                     # Features (training set)
  y = y.train,                     # Target variable (training set)
  method = "pls",                  # Partial Least Squares method
  metric = "Rsquared",             # Evaluation metric (R-squared for regression)
  tuneLength = 10,                 # Number of hyperparameter tuning levels (ncomp)
  trControl = trainControl(method = "cv")  # Cross-validation method
)

# Check the results of the PLS model
print(pls_model)

## Partial Least Squares 
## 
## 1929 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1734, 1736, 1736, 1736, 1736, 1736, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     0.1695496  0.03862687  0.1360353
##    2     0.1678364  0.05733581  0.1352288
##    3     0.1531423  0.21849854  0.1210350
##    4     0.1515897  0.23246625  0.1199575
##    5     0.1486891  0.26012705  0.1174803
##    6     0.1478785  0.26808422  0.1168241
##    7     0.1463486  0.28361367  0.1162738
##    8     0.1448768  0.29703954  0.1140472
##    9     0.1431718  0.31342478  0.1122945
##   10     0.1421836  0.32291359  0.1112763
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.

# Plot the PLS model's performance
plot(pls_model)

clean_df$Brand.Code <- as.numeric(clean_df$Brand.Code)

# Make predictions on the test set
pls_predictions <- predict(pls_model, X.test)

# For regression tasks, calculate RMSE and R-squared
postResample(pls_predictions, y.test)

##      RMSE  Rsquared       MAE 
## 0.1417318 0.3165380 0.1109484

# Print the best tuning parameters (optimal number of components)
print(pls_model$bestTune)

##    ncomp
## 10    10

###2. Random Forest:

Random Forest Regression is an ensemble learning method that usesmultiple decision trees to predict a continuous target variable. Each tree in the forest is trained on a random subset of the data and a random subset of features, introducing diversity and reducing the risk of overfitting. Predictions from individual trees are averaged to produce the final output, resulting in a model that is robust to noise and capable of capturing complex, non-linear relationships. Random Forest Regression is widely used for its high accuracy, scalability, and ability to estimate feature importance, making it a powerful tool for predictive modeling. #### Train a Random Forest model using caret

# Train a Random Forest model using caret
rf_model <- caret::train(
  x = X.train,                     # Features (training set)
  y = y.train,                     # Target variable (training set)
  method = "rf",                   # Random Forest method
  metric = "Rsquared",             # Evaluation metric (R-squared for regression)
  tuneLength = 10,                 # Number of hyperparameter tuning levels
  trControl = trainControl(method = "cv")  # Cross-validation method
)

# Check the results of the Random Forest model
print(rf_model)

## Random Forest 
## 
## 1929 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1736, 1736, 1735, 1736, 1736, 1737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    2    0.11297897  0.6168699  0.08494651
##    5    0.10481605  0.6605264  0.07702504
##    8    0.10244584  0.6699031  0.07459497
##   11    0.10147548  0.6719491  0.07341002
##   14    0.10071577  0.6747343  0.07256472
##   18    0.09998521  0.6773727  0.07178135
##   21    0.10019280  0.6748362  0.07163678
##   24    0.09994521  0.6747453  0.07128305
##   27    0.09998438  0.6737584  0.07124103
##   31    0.10010718  0.6707723  0.07100862
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.

Extract variable importance from the trained Random Forest model

# Extract variable importance from the trained Random Forest model
rf_var_imp <- varImp(rf_model, scale = TRUE)

# Print the variable importance
print(rf_var_imp)

## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##                 Overall
## Mnf.Flow        100.000
## Brand.Code       41.422
## Usage.cont       36.785
## Oxygen.Filler    21.374
## Pressure.Vacuum  18.642
## Bowl.Setpoint    18.040
## Alch.Rel         17.893
## Filler.Level     16.361
## Carb.Rel         16.145
## Temperature      15.762
## Balling.Lvl      14.410
## Air.Pressurer    14.198
## Carb.Pressure1   13.043
## Balling          10.892
## Carb.Flow        10.833
## Filler.Speed     10.638
## MFR               8.250
## Density           7.339
## Hyd.Pressure3     7.184
## PC.Volume         4.711

# Convert the variable importance data to a data frame
imp_df <- as.data.frame(rf_var_imp$importance)

# Create a ggplot of variable importance


# Plot the variable importance using ggplot
ggplot(imp_df, aes(x = reorder(rownames(imp_df), Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "skyblue") +    # Bar plot
  coord_flip() +                                     # Flip for better readability
  labs(title = "Variable Importance (Random Forest)",
       x = "Variables",
       y = "Importance") +
  theme_minimal()                                    # Clean theme

Compare Random Forest with PLS

# compare rf with pls
results <- resamples(list(pls = pls_model,  rf = rf_model))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: pls, rf 
## Number of resamples: 10 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139    0
## rf  0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388    0
## 
## RMSE 
##           Min.    1st Qu.    Median       Mean   3rd Qu.      Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698    0
## rf  0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966    0
## rf  0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415    0

# Summarize the results
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: pls, rf 
## Number of resamples: 10 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139    0
## rf  0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388    0
## 
## RMSE 
##           Min.    1st Qu.    Median       Mean   3rd Qu.      Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698    0
## rf  0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966    0
## rf  0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415    0

# Plot the comparison of the models' performance
dotplot(results)

# Predictions on the test dataset

rf_predictions <- predict(rf_model, X.test)
head(rf_predictions)

##        1        4        5        8        9       15 
## 8.588055 8.475427 8.546820 8.572707 8.509553 8.552489

# Calculate RMSE and R-squared
postResample(rf_predictions, y.test)

##       RMSE   Rsquared        MAE 
## 0.09281726 0.72301372 0.06879637

# Plot Random Forest model performance
plot(rf_model)

# Best hyperparameters
print(rf_model$bestTune)

##   mtry
## 6   18

###3. Support Vector Machines:

Support Vector Machines (SVM) with an RBF kernel model non-linear relationships by mapping data into a higher-dimensional space. Key tuning parameters are cost (C), balancing training error and modelcomplexity, and sigma, controlling the kernel’s influence. SVMs are effective for high-dimensional data and capturing non-linear patterns.

Train and tune a Support Vector Machine (SVM) model using caret

# Impute missing values for the features (X.train)
X.train_imputed <- mice(X.train, method = 'pmm', m = 1, maxit = 5, seed = 123)

## 
##  iter imp variable
##   1   1
##   2   1
##   3   1
##   4   1
##   5   1

X.train <- complete(X.train_imputed, 1)  # Extract the imputed dataset

# Impute missing values for the target variable (y.train) if applicable
if (any(is.na(y.train))) {
  y.train_imputed <- mice(data.frame(y.train), method = 'pmm', m = 1, maxit = 5, seed = 123)
  y.train <- complete(y.train_imputed, 1)  # Extract the imputed target variable
}

# Ensure y.train is numeric (for regression)
y.train <- as.numeric(y.train)

# Check if any columns in X.train are not numeric
non_numeric_cols <- sapply(X.train, is.numeric)
if (any(!non_numeric_cols)) {
  # Convert non-numeric columns to numeric (e.g., factors to dummy variables)
  X.train[!non_numeric_cols] <- lapply(X.train[!non_numeric_cols], function(x) as.numeric(as.factor(x)))
}

# Scale the numeric columns
X.train_scaled <- scale(X.train)  # Scaling the features

# Train and tune a Support Vector Machine (SVM) model using caret
svm_model <- caret::train(
  x = X.train_scaled,             # Features (scaled and imputed training set)
  y = y.train,                    # Target variable (numeric and imputed)
  method = "svmRadial",           # SVM with Radial Basis Function (RBF) kernel
  metric = "Rsquared",            # Evaluation metric (R-squared for regression)
  tuneLength = 10,                # Number of hyperparameter tuning levels
  trControl = trainControl(method = "cv")  # Cross-validation method
)

# Check the results of the SVM model
print(svm_model)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1929 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1735, 1736, 1736, 1736, 1737, 1737, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1310581  0.4335113  0.09805887
##     0.50  0.1277661  0.4583803  0.09443740
##     1.00  0.1254891  0.4758394  0.09231295
##     2.00  0.1238636  0.4895637  0.09123638
##     4.00  0.1230558  0.4974148  0.09106153
##     8.00  0.1232331  0.5002633  0.09204209
##    16.00  0.1251269  0.4948617  0.09362135
##    32.00  0.1291321  0.4776762  0.09660947
##    64.00  0.1356194  0.4481419  0.10129241
##   128.00  0.1429817  0.4188165  0.10650329
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02330142
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.02330142 and C = 8.

Plot the performance of the SVM model

# Plot the performance of the SVM model
plot(svm_model)

#### Make predictions on the test data using the trained SVM model

# Impute missing values for the test data (X.test)
X.test_imputed <- mice(X.test, method = 'pmm', m = 1, maxit = 5, seed = 123)

## 
##  iter imp variable
##   1   1
##   2   1
##   3   1
##   4   1
##   5   1

X.test <- complete(X.test_imputed, 1)  # Extract the imputed dataset

# Ensure the test target (y.test) is numeric, if needed
y.test <- as.numeric(y.test)

# Check for non-numeric columns in X.test and convert to numeric if necessary
non_numeric_cols_test <- sapply(X.test, is.numeric)
if (any(!non_numeric_cols_test)) {
  X.test[!non_numeric_cols_test] <- lapply(X.test[!non_numeric_cols_test], function(x) as.numeric(as.factor(x)))
}

# Scale the test data using the same scaling parameters from X.train
X.test_scaled <- scale(X.test, center = attr(X.train_scaled, "scaled:center"), scale = attr(X.train_scaled, "scaled:scale"))

# Make predictions on the test data using the trained SVM model
svm_predictions <- predict(svm_model, X.test_scaled)

# Check the predictions
head(svm_predictions)

## [1] 8.619266 8.423234 8.477390 8.500084 8.498650 8.607540

# Calculate RMSE and R-squared for regression tasks
postResample(svm_predictions, y.test)

##       RMSE   Rsquared        MAE 
## 0.11595555 0.54225526 0.08517909

# Print the best tuning parameters (optimal values for C and sigma)
print(svm_model$bestTune)

##        sigma C
## 6 0.02330142 8

Comparing the PLS, SVM and RF models

# compare models trained
results <- resamples(list(pls = pls_model,svm = svm_model, rf = rf_model))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: pls, svm, rf 
## Number of resamples: 10 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139    0
## svm 0.07654441 0.08571151 0.09084712 0.09204209 0.10085072 0.10571843    0
## rf  0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388    0
## 
## RMSE 
##           Min.    1st Qu.    Median       Mean   3rd Qu.      Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698    0
## svm 0.09600334 0.11345311 0.1199006 0.12323312 0.1346168 0.1489798    0
## rf  0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## pls 0.2669597 0.2935939 0.3140970 0.3229136 0.3369210 0.4177966    0
## svm 0.3608520 0.4603500 0.4976140 0.5002633 0.5370643 0.6456534    0
## rf  0.5940703 0.6522744 0.6716222 0.6773727 0.7107736 0.7697415    0

# Plot model comparison
dotplot(results)

###4. Neural Network:

A single-layer feedforward neural network maps inputs to outputs through weighted neurons and activation functions. Tuning involves the number of neurons (model capacity) and decay (regularization to prevent overfitting), enabling it to capture non-linear relationships efficiently.

Training and tuning a Neural Network model using caret

Plotting the performance of the Neural Network model

# Plot the performance of the Neural Network model
plot(nn_model)

# Make predictions on the test data using the Neural Network model
nn_predictions <- predict(nn_model, X.test)

Evaluate the NN performance

# Calculate RMSE and R-squared for regression tasks
postResample(nn_predictions, y.test)

##         RMSE     Rsquared          MAE 
## 7.547955e+00 4.933189e-06 7.546012e+00

# Print the best tuning parameters (optimal values for size and decay)
print(nn_model$bestTune)

##    size       decay
## 95   19 0.001333521

Comparing the performance of PLS, SVM, NN and RF models based on 3 key metrics:

R-Square, which shows the variance explained by given model.
RMSE (Root Mean Squared Error), which is the standard deviation of the residuals.
MAE (Mean Absolute Error), which is avg of all absoulte errors.

# Compare models 
results <- resamples(list(pls = pls_model,svm = svm_model, nn = nn_model, rf = rf_model))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: pls, svm, nn, rf 
## Number of resamples: 10 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## pls 0.10576763 0.10720637 0.11031603 0.11127634 0.11297706 0.12480139    0
## svm 0.07654441 0.08571151 0.09084712 0.09204209 0.10085072 0.10571843    0
## nn  7.53958575 7.54225389 7.54540452 7.54548256 7.54859076 7.55354197    0
## rf  0.06183891 0.06987536 0.07173758 0.07178135 0.07564709 0.07927388    0
## 
## RMSE 
##           Min.    1st Qu.    Median       Mean   3rd Qu.      Max. NA's
## pls 0.13150411 0.13476402 0.1411010 0.14218357 0.1448357 0.1599698    0
## svm 0.09600334 0.11345311 0.1199006 0.12323312 0.1346168 0.1489798    0
## nn  7.54170180 7.54411884 7.5475229 7.54745914 7.5503074 7.5557506    0
## rf  0.08135241 0.09896228 0.1022579 0.09998521 0.1049413 0.1109210    0
## 
## Rsquared 
##           Min.    1st Qu.     Median       Mean   3rd Qu.       Max. NA's
## pls 0.26695967 0.29359388 0.31409697 0.32291359 0.3369210 0.41779662    0
## svm 0.36085197 0.46034999 0.49761403 0.50026330 0.5370643 0.64565340    0
## nn  0.00239444 0.03488923 0.06738402 0.05052561 0.0745912 0.08179837    7
## rf  0.59407029 0.65227437 0.67162220 0.67737265 0.7107736 0.76974152    0

# Plot model comparison
dotplot(results)

model_performance <- data.frame(
  Model = c("SVM", "PLS", "RF", "NN"),
  MAE = c(0.09204, 0.1112, 0.0733, 7.5453),    # Replace with actual MAE values
  RMSE = c(0.1232, 0.1418, 0.1025, 7.5473),    # Replace with actual RMSE values
  Rsquared = c(0.5003, 0.331, 0.662, 0.050)   # Replace with actual R-squared values
)

# Display the comparative table
model_performance %>%
  knitr::kable(caption = "Comparison of Model Performance", 
               col.names = c("Model", "MAE", "RMSE", "R-squared"))

Comparison of Model Performance
Model	MAE	RMSE	R-squared
SVM	0.09204	0.1232	0.5003
PLS	0.11120	0.1418	0.3310
RF	0.07330	0.1025	0.6620
NN	7.54530	7.5473	0.0500

Findings:

It’s clear from the table above that Random Forest (RF) outperforms all models across MAE, RMSE, and R-squared, making it the top choice for both predictive accuracy and interpretability.

SVM, while strong, has slightly higher MAE and RMSE and a lower R-squared than RF but remains a solid option when accuracy is prioritized over interpretability. Overall, SVM requires careful tuning to match RF’s performance

PLS and NN underperform in all metrics, with NN showing particularly poor results.

Additionally, RF offers the best balance of low error and high explanatory power, while SVM, though effective, sacrifices interpretability. PLS and NN are less accurate.

Making Predictions on the evaluation data set

Based on the analysis so far, it is confirmed that the Random Forest model is the optimal model. We will now use it to predict PH values.

Random Forest predictions of PH value levels at ABC Beverage Company

# Convert predictions to a data frame
rf_predictions_df <- data.frame(Predictions = rf_predictions)

# Write to an Excel file
write_xlsx(rf_predictions_df, "rf_predictions.xlsx")

Excel Line chart of PH predictions at ABC, obtained using the writexl::write_xlsx()

Summary Essay (of the Technical Report)

Introduction

In this Project, I addressed the challenge of predicting the pH levels of beverages produced by ABC Beverage Company. This initiative arose from the company’s need to comply with new regulatory standards requiring detailed monitoring and control of beverage pH levels. As the company’s data scientist, I was tasked with developing a robust predictive model to accurately forecast pH levels and identify the key factors influencing them.

The selection of the best-performing model was based on evaluation metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-Squared. The whole project leveraged a comprehensive data science pipeline, encompassing exploratory data analysis (EDA), data preprocessing, model building, and predictions on unseen datasets, all implemented in the R programming environment. The ultimate goal was to provide ABC Beverage with actionable insights and a reliable tool to ensure compliance with regulatory requirements.

Problem Statement

The primary objective was to build a predictive model capable of accurately estimating the pH levels of beverages. The business impact of this solution is significant, as it ensures regulatory compliance while providing insights into the production process. Additionally,understanding the pH levels can guide product development, quality control, and customer satisfaction by ensuring the desired characteristics of the beverages.

Data Preparation

The two datasets provided by ABC Beverage Company included historical records of beverage production, capturing various manufacturing parameters alongside pH levels. The raw data were supplied in Excel formats, which were converted to CSV, and hosted on a public GitHub repository for reproducibility.

Data Preparation Steps

Exploratory Data Analysis (EDA):

EDA was performed to understand data distributions, detect outliers, and identify missing values. Key variables such as manufacturing pressure, temperature, and ingredient concentrations were examined for their correlation with pH levels.

Handling Missing Values:

Missing data was imputed using the mean for numeric variables to maintain dataset integrity without introducing bias.

Categorical variables, such as “Brand.Code,” were encoded using one-hot encoding to ensure compatibility with machine learning algorithms.

Data Splitting:

The dataset was divided into training and testing sets, ensuring the test set remained unseen during model training to validate the model’s generalizability. A distinct evaluation dataset was used to evaluate the most performant model before recommending to the ABC leadership.

Methodologies

Four predictive models were evaluated for their performance:

Partial Least Squares (PLS):

A regression technique used for its simplicity and ability to handle multicollinearity in the dataset.

Random Forest (RF):

An ensemble learning method known for its robustness, high accuracy, and ability to capture nonlinear relationships.

Support Vector Machines (SVM):

A powerful algorithm for regression, leveraging kernel methods to handlecomplex relationships between variables.

Neural Networks (NN):

A deep learning approach tested for its ability to model complex patterns but found less effective due to limited data size and higher computational costs.

Results and Model Selection

Based on the evaluation metrics:

Random Forest (RF) emerged as the best-performing model with:

Lowest MAE: 0.0733

Lowest RMSE: 0.1025

Highest R-Squared: 0.662

SVM performed well but was slightly less accurate than RF:

MAE: 0.087

RMSE: 0.112

R-Squared: 0.590

PLS and NN demonstrated weaker performance:

PLS: Moderate accuracy (R-Squared = 0.331)

NN: High error rates and minimal explanatory power.

Purpose of Analysis

The purpose of this analysis is twofold:

To provide ABC Beverage Company with a reliable predictive tool to monitor and forecast pH levels.

To derive actionable insights into the manufacturing process, enabling the company to improve product quality and ensure regulatory compliance.

Findings

Using the Random Forest model, pH levels were predicted from the evaluation dataset. The predicted values predominantly exceeded 8,indicating that ABC Beverage Company primarily produces alkalinebeverages such as water, tea, and fruit drinks. This insight aligns with the company’s production goals and highlights the alkaline nature of their product line.

Business Impact

The deployment of the Random Forest model brings several benefits:

Regulatory Compliance: The ability to monitor pH levels ensures adherence to legal standards, mitigating risks of non-compliance.

Quality Assurance: Accurate predictions allow for better control over production parameters, ensuring consistent beverage quality.

Product Innovation: Insights into key predictors of pH levels can guide the development of new beverages tailored to customer preferences.

Operational Efficiency: Automating pH prediction reduces the need for manual testing, saving time and resources.

In closing:

This project demonstrates the value of data science in solving real-world business problems. By leveraging advanced analytics and machine learning, ABC Beverage Company is now equipped with a powerful tool to predict and control pH levels, ensuring both compliance and customer satisfaction. The insights gained from this study also pave the way for further innovation and operational improvements, underscoring the transformative impact of data-driven decision-making.

References:

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. https://link.springer.com/book/10.1007/978-1-4614-6849-3

CUNY School of Professional Studies – The Graduate Center. (2024). DATA 622: Machine learning [Course]. CUNY School of Professional Studies – The Graduate Center, Fall 2024

Näf, J. (n.d.). What is a good imputation for missing values? Towards Data Science. Retrieved [Dec 13 2024], from https://towardsdatascience.com/what-is-a-good-imputation-for-missing-values-e9256d45851b

Final Project - Data 622

Heleine Fouda

2024-12-15