library(readxl)
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(writexl)
library(ggplot2)
library(ggridges)
library(DataExplorer)
library(skimr)
library(caret)
library(VIM)
library(gridExtra)
library(smotefamily)
library(fastDummies)
library(rsample)
library(earth)
theme_set(theme_minimal())
train_raw <- read_excel("StudentData.xlsx")
eval_raw <- read_excel("StudentEvaluation.xlsx")
names(train_raw) <- str_replace_all(names(train_raw), " ", "_")
names(eval_raw) <- str_replace_all(names(eval_raw), " ", "_")
Overview

The dataset, loaded from StudentData.xlsx, contains 2,571 observations and 33 variables capturing detailed measurements from ABC Beverage’s manufacturing process. These variables span both process metrics (pressure, flow, temperature) and chemical indicators (alcohol content, density, carbonation), with pH as the modeling target. The Brand Code column, a categorical identifier with values A, B, C, and D, includes 120 missing entries; it is encoded numerically (1–4) later in the analysis. Core production variables like Carb Volume (5.04–5.70) and Fill Ounces (23.63–24.32) show tight operational ranges, whereas others like PC Volume and PSC exhibit wider variability and higher missingness (up to 39 missing values).
More complex or sensor-derived features, such as Hyd Pressure1–4, Balling, Oxygen Filler, and Carb Rel, also show notable gaps and fluctuations, suggesting areas for equipment calibration or process standardization. A structural diagnostic shows that 96.97% of the columns are continuous and just 3.03% discrete, making the dataset well suited to numerical modeling. Importantly, 79.27% of rows are complete, and only 0.99% of all data points are missing, indicating minimal and scattered missingness. With a footprint of just 670.4 Kb, the dataset is lightweight and efficient to work with in R or Python. In summary, the data is mostly clean, structurally sound, and ready for predictive modeling with only minor imputation required.
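As a quick sanity check (a minimal sketch, not part of the original script), the figures quoted above can be reproduced directly from the raw frame:

# Hedged check of the summary figures quoted above
dim(train_raw)                  # 2571 rows, 33 columns
mean(complete.cases(train_raw)) # share of complete rows (~0.79)
mean(is.na(train_raw))          # share of missing cells (~0.0099)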
#take a look at the column and data types
# Advanced data descriptions
skim(train_raw)
Name | train_raw |
Number of rows | 2571 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 32 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand_Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Carb_Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
Fill_Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
PC_Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
Carb_Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
Carb_Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
PSC_Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
PSC_CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
Mnf_Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
Carb_Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
Fill_Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
Hyd_Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
Hyd_Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
Hyd_Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
Hyd_Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
Filler_Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
Filler_Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
Usage_cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
Carb_Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
Pressure_Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
Oxygen_Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
Bowl_Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
Pressure_Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
Air_Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
Alch_Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
Carb_Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
Balling_Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
summary(train_raw)
## Brand_Code Carb_Volume Fill_Ounces PC_Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb_Pressure Carb_Temp PSC PSC_Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC_CO2 Mnf_Flow Carb_Pressure1 Fill_Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd_Pressure1 Hyd_Pressure2 Hyd_Pressure3 Hyd_Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler_Level Filler_Speed Temperature Usage_cont Carb_Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure_Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen_Filler Bowl_Setpoint Pressure_Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air_Pressurer Alch_Rel Carb_Rel Balling_Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
# give an idea of what the data set is
plot_intro(train_raw)
Missing Data
# plot missing data
plot_missing(train_raw)
Variables
# Observations with the target variables
colnames(train_raw)
## [1] "Brand_Code" "Carb_Volume" "Fill_Ounces"
## [4] "PC_Volume" "Carb_Pressure" "Carb_Temp"
## [7] "PSC" "PSC_Fill" "PSC_CO2"
## [10] "Mnf_Flow" "Carb_Pressure1" "Fill_Pressure"
## [13] "Hyd_Pressure1" "Hyd_Pressure2" "Hyd_Pressure3"
## [16] "Hyd_Pressure4" "Filler_Level" "Filler_Speed"
## [19] "Temperature" "Usage_cont" "Carb_Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure_Vacuum" "PH" "Oxygen_Filler"
## [28] "Bowl_Setpoint" "Pressure_Setpoint" "Air_Pressurer"
## [31] "Alch_Rel" "Carb_Rel" "Balling_Lvl"
# Plot the only categorical column (Brand_Code)
ggplot(train_raw, aes(x = Brand_Code, fill = Brand_Code)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
vjust = 1.4, colour = "white") +
labs(title = "Brand_Code frequency", x = NULL, y = "Count") +
theme(legend.position = "none")
Brand B is the most frequently produced product in this dataset, with 1,239 observations. This imbalance suggests that Brand B dominates production, which could influence the predictive model for pH due to its high representation. It’s important to verify whether pH behavior differs by brand and, if so, account for it in modeling or stratify the analysis accordingly.
# plot average PH by brand code
train_raw %>% group_by(Brand_Code) %>%
summarise(Mean_PH = mean(PH, na.rm = T)) %>%
ggplot(aes(x = Brand_Code, y = Mean_PH, fill = Brand_Code)) + geom_col() +
geom_text(aes(label = round(Mean_PH, 3)), vjust = 1.5, colour = "white") +
labs(title = "Average pH by Brand Code", y = "Mean pH")
We can see that average pH differs across brand codes. This provides further evidence that a more balanced representation of Brand Code will be important in reducing bias.
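To test this difference formally (a hedged sketch, not part of the original analysis), a one-way ANOVA on the raw data can confirm whether mean pH varies across brands:

# Hedged sketch: one-way ANOVA of pH across Brand_Code, dropping rows
# where either value is missing (Brand_Code is still character here)
brand_ph <- drop_na(train_raw, PH, Brand_Code)
summary(aov(PH ~ Brand_Code, data = brand_ph))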
# Convert categorical columns to numeric codes before computing correlations
categorical_cols <- sapply(train_raw, is.character) | sapply(train_raw, is.factor)
train_raw[categorical_cols] <- lapply(train_raw[categorical_cols], function(x) as.numeric(as.factor(x)))
target <- "PH"
# Initializing an empty data frame
results <- data.frame(Feature = character(), Correlation = numeric(), P_value = numeric(), stringsAsFactors = FALSE)
for (col in names(train_raw)) {
if (col != target) {
correlation <- cor(train_raw[[col]], train_raw[[target]], use = "complete.obs", method = 'pearson')
p_value <- cor.test(train_raw[[col]], train_raw[[target]], method = 'pearson')$p.value
results <- rbind(results, data.frame(Feature = col, Correlation = correlation, P_value = p_value))
}
}
print(results)
## Feature Correlation P_value
## 1 Brand_Code 0.09668448 1.655458e-06
## 2 Carb_Volume 0.06349111 1.317168e-03
## 3 Fill_Ounces -0.09505837 1.677065e-06
## 4 PC_Volume 0.04580178 2.128180e-02
## 5 Carb_Pressure 0.05958033 2.665036e-03
## 6 Carb_Temp 0.02887456 1.456400e-01
## 7 PSC -0.09003060 5.651603e-06
## 8 PSC_Fill -0.03625652 6.748775e-02
## 9 PSC_CO2 -0.07558459 1.424376e-04
## 10 Mnf_Flow -0.44697510 2.557184e-126
## 11 Carb_Pressure1 -0.07932723 6.378245e-05
## 12 Fill_Pressure -0.21343571 1.199531e-27
## 13 Hyd_Pressure1 -0.07398881 1.811517e-04
## 14 Hyd_Pressure2 -0.20067834 1.352668e-24
## 15 Hyd_Pressure3 -0.24030003 7.593496e-35
## 16 Hyd_Pressure4 -0.14153475 7.818363e-13
## 17 Filler_Level 0.32327031 3.917540e-63
## 18 Filler_Speed -0.02456179 2.183784e-01
## 19 Temperature -0.15824135 8.592733e-16
## 20 Usage_cont -0.31841555 1.842646e-61
## 21 Carb_Flow 0.15737137 1.089641e-15
## 22 Density 0.07843973 6.938385e-05
## 23 MFR -0.00871805 6.721382e-01
## 24 Balling 0.06527259 9.363280e-04
## 25 Pressure_Vacuum 0.22053722 1.201970e-29
## 26 Oxygen_Filler 0.16798960 1.234045e-17
## 27 Bowl_Setpoint 0.34911980 2.063873e-74
## 28 Pressure_Setpoint -0.31101906 2.010989e-58
## 29 Air_Pressurer -0.01365873 4.891116e-01
## 30 Alch_Rel 0.14893852 3.609012e-14
## 31 Carb_Rel 0.16414447 6.477984e-17
## 32 Balling_Lvl 0.10047517 3.395271e-07
The correlations between pH and the other variables reveal that Bowl Setpoint, Filler Level, and Pressure Vacuum have the strongest positive associations, while Mnf Flow, Pressure Setpoint, and Usage Cont show the strongest negative correlations; all of these are statistically significant with very low p-values, indicating that pH is influenced by key operational variables. Variables like Carb Temp, Filler Speed, and Air Pressurer show weak or non-significant relationships, suggesting minimal individual influence on pH.
# Drop non-numeric columns and rows with missing PH
data_clean <- train_raw %>%
select_if(is.numeric) %>%
drop_na(PH)
# Compute correlation with PH
correlations <- data_clean %>%
select(-PH) %>%
summarise_all(~cor(.x, data_clean$PH, use = "complete.obs")) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Correlation") %>%
arrange(desc(abs(Correlation)))
# Barplot of correlations
ggplot(correlations, aes(x = reorder(Variable, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Correlation of Variables with PH",
x = "Variable",
y = "Correlation Coefficient") +
theme_minimal()
The bar plot visualizes the strength and direction of correlation between manufacturing variables and pH. At a high level, Bowl Setpoint, Filler Level, and Pressure Vacuum have the strongest positive correlations with pH, suggesting that higher values in these variables are typically associated with higher pH. Conversely, variables like Mnf Flow, Pressure Setpoint, and the hydraulic pressures show the most negative correlations, indicating that increases in these are associated with lower pH values. Most other variables exhibit weak or negligible correlations, implying limited individual influence on pH. This insight helps prioritize which process parameters to monitor or adjust when targeting pH control.
Relationships, Distributions and Checking Outliers
# Keep only numeric columns and drop rows with missing PH
data_numeric <- train_raw %>%
select(where(is.numeric)) %>%
drop_na(PH)
# Get all column names except PH
predictors <- setdiff(names(data_numeric), "PH")
# Create scatter plots for all numeric variables against PH (excluding PH itself)
# Generate scatter plots where color is mapped to PH
plots <- map(predictors, function(var) {
ggplot(data_numeric, aes(x = .data[[var]], y = PH, color = PH)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
scale_color_gradient(low = "blue", high = "red") +
labs(title = paste("PH vs", var), x = var, y = "PH") +
theme_minimal()
})
# Display first 6 plots
grid.arrange(grobs = plots[1:6], ncol = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# plot histogram for only continuous data
plot_histogram(train_raw)
Some features, like Carb Pressure, Fill Ounces, and PC Volume, display near-normal distributions, indicating stable process control. Others, such as Oxygen Filler, PSC CO2, Carb Flow, and Usage Cont, are heavily right-skewed, with long tails suggesting sporadic high values or potential outliers.
# Select only numeric columns
numeric_data <- train_raw %>% select(where(is.numeric))
# Convert data to long format for ggplot
long_data <- numeric_data %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Generate boxplots with facets
ggplot(long_data, aes(x = "", y = Value)) +
geom_boxplot(fill = "skyblue", outlier.alpha = 0.3) +
facet_wrap(~ Variable, scales = "free_y") +
labs(title = "Boxplots of All Numeric Variables", x = NULL, y = "Value") +
theme_minimal() +
theme(axis.text.x = element_blank(),
strip.text = element_text(size = 8))
## Warning: Removed 844 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot grid highlights the distribution and outliers across all numeric variables. Several features, such as MFR, Hydraulic Pressures, Filler Speed, Oxygen Filler, and PSC CO2, display a high number of outliers, indicating measurement variability or extreme values. In contrast, variables like pH, Carb Volume, and Fill Ounces are tightly clustered with minimal deviation, suggesting strong process control.
The findCorrelation() function from caret computes the absolute pairwise correlations between predictors and flags those exceeding the cutoff value. For each highly correlated pair, it considers each predictor’s mean absolute correlation with all other predictors and recommends removing the one with the larger mean. We don’t have to eliminate these predictors, but it gives an idea of which ones carry the most redundant information.
# Correlation coefficients on complete cases (computed before imputation)
cor_coef <- cor(train_raw %>% select(-Brand_Code), use = "complete.obs")
## Create a corplot to visualize the correlations
## order the correlations into clusters or regions of high correlations
corrplot::corrplot(cor_coef, order = "hclust")
# Find high correlations
## Returns column numbers denoting the predictors that are recommended for
## deletion using the findCorrelation() function in the caret package
highCorr <- findCorrelation(cor_coef, cutoff = .80, exact = TRUE)
## Get names of the flagged columns. Index into cor_coef rather than
## train_raw, since Brand_Code was dropped before computing the matrix
## (indexing train_raw directly would shift every name by one column).
colnames(cor_coef)[highCorr]
Another way of understanding the distribution of the data is to calculate the skewness and kurtosis. Skewness measures the asymmetry of a distribution, while kurtosis captures how heavy the tails are (i.e how many extreme values reside in the tails) relative to a normal curve. Highly skewed variables are more likely to harbor extreme values that can distort model training.
library(e1071) # for skewness / kurtosis
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
num_only <- select(train_raw, where(is.numeric))
# compute skewness / kurtosis for all numeric predictors except PH
sk_kurt <- tibble(
Feature = setdiff(names(num_only), c("PH")),
Skewness = apply(select(num_only, -PH), 2, skewness, na.rm = TRUE),
Kurtosis = apply(select(num_only, -PH), 2, kurtosis, na.rm = TRUE)
) %>% arrange(desc(Skewness))
head(sk_kurt, 5) # top positively skewed
## # A tibble: 5 × 3
## Feature Skewness Kurtosis
## <chr> <dbl> <dbl>
## 1 Oxygen_Filler 2.66 11.1
## 2 Temperature 2.39 10.2
## 3 Air_Pressurer 2.25 4.73
## 4 PSC_CO2 1.73 3.73
## 5 PSC_Fill 0.933 0.769
tail(sk_kurt, 5) # most negatively skewed
## # A tibble: 5 × 3
## Feature Skewness Kurtosis
## <chr> <dbl> <dbl>
## 1 Filler_Level -0.848 0.0460
## 2 Bowl_Setpoint -0.974 -0.0564
## 3 Carb_Flow -0.988 -0.583
## 4 Filler_Speed -2.87 6.71
## 5 MFR -5.09 30.5
We see Oxygen_Filler, Temperature, and Air_Pressurer have the largest positive skew; Filler_Speed and MFR are the most negatively skewed.
An observation beyond \(Q_1 - 1.5 \times IQR\) or \(Q_3 + 1.5 \times IQR\) is flagged as an extreme value.
# Count values falling outside the 1.5*IQR whiskers of a numeric vector
count_outliers <- function(x) {
x <- na.omit(x)
iqr <- IQR(x)
q1 <- quantile(x, .25)
q3 <- quantile(x, .75)
sum(x < q1 - 1.5*iqr | x > q3 + 1.5*iqr)
}
# Tabulate outlier counts per predictor (one row per feature, sorted descending)
outlier_tbl <- tibble(
  Feature = names(select(num_only, -PH)),
  Outlier_Count = map_int(select(num_only, -PH), count_outliers)
) %>%
  arrange(desc(Outlier_Count))
head(outlier_tbl, 10)
The same trio (MFR, Filler_Speed, Oxygen_Filler) tops the outlier‑count list, validating our earlier focus.
Rather than deleting rows, we convert extreme tails (beyond 1.5×IQR) in MFR, Filler_Speed, and Oxygen_Filler to NA. The focus was on handling outliers only in these three features, as they had the most extreme outliers. The recoded NA values will be imputed in the next step.
trim_cols <- c("MFR", "Filler_Speed", "Oxygen_Filler")
for (col in trim_cols) {
iqr <- IQR(train_raw[[col]], na.rm = TRUE)
bounds <- quantile(train_raw[[col]], c(.25, .75), na.rm = TRUE) + c(-1, 1) * 1.5 * iqr
train_raw[[col]] <- ifelse(between(train_raw[[col]], bounds[1], bounds[2]),
train_raw[[col]], NA_real_)
}
studentData_outliers_rm <- train_raw # keep legacy name for downstream code
# 5a· kNN on outlier‑removed frame ---------------------------
set.seed(100)
students_cleaned <- studentData_outliers_rm %>%
kNN(k = 5) %>%
select(1:33) # drop VIM appended columns
# 5b· kNN on original raw frame ------------------------------
set.seed(101)
students_naFilled <- train_raw %>%
kNN(k = 5) %>%
select(1:33)
set.seed(102)
smote_res <- SMOTE(students_naFilled[,-1], target = students_naFilled$Brand_Code,
K = 5, dup_size = 2)
students_balanced <- smote_res$data %>%
rename(Brand_Code = class)
ggplot(students_balanced, aes(Brand_Code, fill = Brand_Code)) +
  geom_bar() +
  labs(title = "Balanced Brand_Code after SMOTE", y = "Count", x = NULL) +
  geom_text(stat = "count", aes(label = after_stat(count)),
            vjust = 1.4, colour = "white") +
  theme(legend.position = "none")
Using SMOTE partially rebalanced the data, adding synthetic observations for Brand_Code A (recoded as 1) and Brand_Code C (recoded as 3).
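As a quick check (a minimal sketch using the objects created above), comparing class counts before and after SMOTE makes the partial rebalancing explicit:

# Hedged check: Brand_Code class counts before vs. after SMOTE
table(students_naFilled$Brand_Code)
table(students_balanced$Brand_Code)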
students_balanced <- dummy_cols(students_balanced, select_columns = "Brand_Code") %>%
select(-Brand_Code)
set.seed(200)
train_test_split <- initial_split(students_balanced, prop = 0.7)
train_df <- training(train_test_split) %>% drop_na()
test_df <- testing(train_test_split) %>% drop_na()
x_train <- select(train_df, -PH)
y_train <- train_df$PH
x_test <- select(test_df, -PH)
y_test <- test_df$PH
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
The exploratory data analysis revealed several factors that guided model selection for predicting pH. The weak linear correlations and the presence of heavy outliers argued against linear models, as these would not capture the true relationship between the predictors and pH. With the goal of understanding which predictors most influence pH, a strategy that narrows the predictors through feature selection was also an important consideration. An initial round of base models was fitted to evaluate and confirm which models were better suited to the data and our goal.
run_models <- function(data, target = "ph") {
library(caret)
library(randomForest)
library(xgboost)
library(e1071)
library(dplyr)
set.seed(123)
# Step 1: Impute missing values and scale data
preproc <- preProcess(data, method = c("medianImpute", "center", "scale"))
data_clean <- predict(preproc, newdata = data)
# Step 2: Train/test split
train_index <- createDataPartition(data_clean[[target]], p = 0.8, list = FALSE)
train_data <- data_clean[train_index, ]
test_data <- data_clean[-train_index, ]
# Extract x and y for models that need matrix input
x_train <- train_data %>% select(-all_of(target)) %>% as.matrix()
y_train <- train_data[[target]]
x_test <- test_data %>% select(-all_of(target)) %>% as.matrix()
y_test <- test_data[[target]]
results <- list()
## 1. Linear Regression
lm_model <- lm(as.formula(paste(target, "~ .")), data = train_data)
lm_preds <- predict(lm_model, newdata = test_data)
results$Linear_Regression <- postResample(lm_preds, y_test)
## 2. Random Forest
rf_model <- randomForest(as.formula(paste(target, "~ .")), data = train_data)
rf_preds <- predict(rf_model, newdata = test_data)
results$Random_Forest <- postResample(rf_preds, y_test)
## 3. XGBoost
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest <- xgb.DMatrix(data = x_test, label = y_test)
xgb_model <- xgboost(data = dtrain, nrounds = 100, objective = "reg:squarederror", verbose = 0)
xgb_preds <- predict(xgb_model, newdata = dtest)
results$XGBoost <- postResample(xgb_preds, y_test)
## 4. SVR
svr_model <- svm(x = x_train, y = y_train, type = "eps-regression", kernel = "radial")
svr_preds <- predict(svr_model, x_test)
results$SVR <- postResample(svr_preds, y_test)
## Print all results
for (model in names(results)) {
cat("Model:", model, "\n")
print(results[[model]])
cat("\n")
}
return(invisible(results))
}
cleaned_data <- janitor::clean_names(students_balanced)
model_results <- run_models(cleaned_data)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'xgboost' was built under R version 4.3.3
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
## Model: Linear_Regression
## RMSE Rsquared MAE
## 0.7861197 0.3636575 0.6120384
##
## Model: Random_Forest
## RMSE Rsquared MAE
## 0.5486348 0.7026078 0.3833599
##
## Model: XGBoost
## RMSE Rsquared MAE
## 0.5877401 0.6449252 0.3906336
##
## Model: SVR
## RMSE Rsquared MAE
## 0.6580015 0.5538357 0.4538674
# Convert results to a tidy data frame
performance_df <- bind_rows(model_results, .id = "Model") %>%
pivot_longer(cols = -Model, names_to = "Metric", values_to = "Value")
# Plot
ggplot(performance_df, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison: RMSE, R², MAE",
y = "Metric Value",
x = "Model") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
As expected, the linear model had the poorest performance, with the lowest R² and the highest RMSE. Non-linear models were selected to move forward, with the aim of improving accuracy through cross-validation and hyperparameter tuning. The final set of models tested was MARS, SVM, Random Forest, XGBoost, and Cubist. The model with the highest R² was used to predict pH for the new data (eval_raw).
A flexible spline‑based regression that automatically selects knots and interactions.
set.seed(201)
mars_grid <- expand.grid(degree = 1:4, nprune = seq(5,50,5))
mars_fit <- train(x_train, y_train,
method = "earth",
tuneGrid = mars_grid,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
Captures non‑linear relationships by projecting data into a high‑dimensional feature space.
set.seed(202)
svm_fit <- train(x_train, y_train,
method = "svmRadial",
tuneLength = 14,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
set.seed(203)
rf_grid <- expand.grid(mtry = c(12,14,16)) # Number of predictors sampled at each split
rf_fit <- train(x_train, y_train,
method = "rf",
tuneLength = 10,
tuneGrid = rf_grid,
trControl = ctrl,
metric = "Rsquared")
set.seed(204)
xgb_grid <- expand.grid(
nrounds = seq(from = 200, to = 500, by = 50),
eta = c(0.1, 0.3),
max_depth = c(6, 8),
gamma = c(0, 0.5,1),
colsample_bytree = c(0.6, 0.8, 1),
min_child_weight = 1,
subsample = 1
)
xgb_fit <- train(x = x_train,
y = y_train,
method = "xgbTree",
# Use R-squared to select the best model
metric = "Rsquared",
# Use 10 fold cross validation and Grid search
trControl = ctrl,
tuneGrid = xgb_grid,
verbose = FALSE,
verbosity = 0
)
set.seed(205)
cub_grid <- expand.grid(committees = c(1,10,50,100), neighbors = c(0,1,5,9))
cubist_fit <- train(x_train, y_train,
method = "cubist",
tuneGrid = cub_grid,
trControl = ctrl,
metric = "Rsquared",
preProcess = c("center","scale"))
get_metrics <- function(fit, x, y) {
p <- as.numeric(predict(fit, x))
tibble(RMSE = RMSE(p, y),
Rsq = R2(p, y),
MAE = MAE(p, y))
}
metrics_tbl <- bind_rows(
get_metrics(mars_fit, x_test, y_test),
get_metrics(svm_fit, x_test, y_test),
get_metrics(rf_fit, x_test, y_test),
get_metrics(xgb_fit, x_test, y_test),
get_metrics(cubist_fit,x_test, y_test)
) %>%
mutate(Model = c("MARS","SVM","Random Forest","XGB","Cubist"), .before = 1)
#knitr::kable(metrics_tbl, digits = 3)
metrics_tbl
## # A tibble: 5 × 4
## Model RMSE Rsq MAE
## <chr> <dbl> <dbl> <dbl>
## 1 MARS 1.39 0.000614 0.153
## 2 SVM 0.112 0.588 0.0757
## 3 Random Forest 0.0994 0.695 0.0685
## 4 XGB 0.101 0.669 0.0699
## 5 Cubist 0.0930 0.717 0.0607
performance_plot_tbl <- metrics_tbl %>%
pivot_longer(cols = c(RMSE, Rsq, MAE), names_to = "Metric", values_to = "Value")
# Plot
ggplot(performance_plot_tbl, aes(x = Model, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(
title = "Model Performance Metrics on Held-Out Test Data",
subtitle = "Comparison of RMSE, R², and MAE for 5 Regression Models",
y = "Metric Value",
x = "Model"
) +
theme_minimal() +
scale_fill_brewer(palette = "Set1")
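Beyond held-out metrics, the cross-validated resampling distributions can also be compared (a hedged sketch using caret’s resamples(), assuming the five fitted objects above; the folds were not generated from a shared seed, so this is an approximate comparison):

# Hedged sketch: compare CV resampling distributions across the tuned models
res <- resamples(list(MARS = mars_fit, SVM = svm_fit, RF = rf_fit,
                      XGB = xgb_fit, Cubist = cubist_fit))
summary(res)
bwplot(res, metric = "Rsquared")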
cubist_imp <- varImp(cubist_fit)$importance %>%
tibble::rownames_to_column("Feature") %>%
arrange(desc(Overall)) %>%
slice_head(n = 20)
ggplot(cubist_imp, aes(reorder(Feature, Overall), Overall)) +
geom_col(fill = "steelblue") + coord_flip() +
labs(title = "Cubist – Top 20", x = NULL, y = "Importance")
PH in the evaluation file is entirely NA, so we exclude it before imputation to avoid VIM’s subscript error. We then ensure that the dummy columns match the training matrix, adding any missing columns as zeroes in the correct order.
# 12a· kNN‑impute predictors (drop PH first)
eval_knn <- eval_raw %>%
select(-PH) %>% # remove all‑NA response
kNN(k = 5) %>%
select(-matches("(_imp$|_NAflag$)"))
# 12b· dummy‑encode Brand_Code
eval_dummy <- dummy_cols(eval_knn, select_columns = "Brand_Code") %>%
select(-Brand_Code)
# 12c· Align columns with training design matrix
missing_cols <- setdiff(names(x_train), names(eval_dummy))
if (length(missing_cols) > 0) {
eval_dummy[missing_cols] <- 0
}
# reorder to match training column order
eval_dummy <- eval_dummy[, names(x_train)]
# 12d· Predict pH using the Cubist model
pred_eval <- predict(cubist_fit, eval_dummy)
# 12e· Save predictions
write_xlsx(bind_cols(eval_raw, Predicted_PH = pred_eval), "ph_predictions.xlsx")
To conclude, the exploratory analysis of the ABC Beverage dataset demonstrates a complex relationship between process variables and pH, a crucial quality indicator in beverage production. While the majority of variables exhibit minimal missingness and stable distributions, a few features, such as Temperature, MFR, Oxygen Filler, and Filler Speed, show substantial skewness and outliers. This suggests that these features might be candidates for process improvement or sensor recalibration.
Both correlation coefficients and scatter plots support stronger linear relationships between pH and Bowl Setpoint, Filler Level, and Pressure Vacuum than with other predictors. On the other hand, factors like Air Pressurer and Filler Speed have little predictive value, which might make it possible to simplify the model without noticeably sacrificing accuracy.
It’s interesting to note that Brand Code’s categorical imbalance, in which Brand B dominates the dataset, raises questions about possible modeling bias. Testing whether pH behaves differently between brands is essential, particularly when applying the model to diverse production lines. To prevent skewed performance, brand-stratified sampling or brand interaction terms should be introduced in future modeling efforts.
Among the models we tested, Cubist gave the best fit (lowest RMSE and highest R²) by capturing nonlinear effects more effectively than the linear and tree-based alternatives. To improve on this, we could:
- Tune Cubist’s main settings (number of committees, neighbors, and rule-pruning levels).
- Check residuals for any consistent errors and use partial-dependence plots to see how each predictor really affects pH.
- Apply simple transformations (log or Box–Cox) to the most skewed inputs.
- Add key interaction terms (for example, Temperature × Filler Speed).
- Try fitting Cubist separately for each brand to guard against bias and improve overall accuracy.
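The residual check listed above can be sketched immediately (a minimal, hedged example assuming the cubist_fit, x_test, and y_test objects from earlier): plotting residuals against predictions to look for systematic error.

# Hedged sketch: residual diagnostics for the Cubist model on the test set
preds <- as.numeric(predict(cubist_fit, x_test))
resid_df <- tibble(Predicted = preds, Residual = y_test - preds)
ggplot(resid_df, aes(Predicted, Residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Cubist Residuals vs. Predicted pH",
       x = "Predicted pH", y = "Residual")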