Introduction

The goal of this section is to better understand the beverage manufacturing dataset before predictive modeling begins. Exploratory Data Analysis (EDA), data cleaning, and data transformations were performed to identify patterns, missing values, outliers, and relationships between variables related to beverage pH.

This process is important because understanding the data helps ensure that later predictive models are built using reliable and meaningful information.

library(tidyverse)
library(readxl)
library(caret)
library(corrplot)
library(GGally)
library(skimr)
library(naniar)
library(VIM)
library(knitr)
library(kableExtra)
library(gbm)
library(writexl)

The training set contains 2571 observations across 33 variables (including the target, PH). The test set contains 267 observations with PH fully withheld.

#Load training data
bev_train <- read_excel("StudentData.xlsx")

#Load evaluation data
bev_test <- read_excel("StudentEvaluation.xlsx")

The first step was to get a quick look and explore the structure and contents of the dataset.

glimpse(bev_train)

## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…

skim(bev_train)

Data summary
Name	bev_train
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

#View structure of dataset
str(bev_train)

## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
##  $ Brand Code       : chr [1:2571] "B" "A" "B" "A" ...
##  $ Carb Volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
##  $ PC Volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num [1:2571] 141 140 145 133 137 ...
##  $ PSC              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num [1:2571] 119 122 120 115 118 ...
##  $ Fill Pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num [1:2571] 121 119 120 118 119 ...
##  $ Filler Speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
##  $ Temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
##  $ Density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num [1:2571] 725 727 735 731 723 ...
##  $ Balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num [1:2571] 143 143 142 146 146 ...
##  $ Alch Rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

#Summary statistics
summary(bev_train)

##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

#Number of rows and columns
nrow(bev_train)

## [1] 2571

ncol(bev_train)

## [1] 33

A few things stood out:

Brand Code is the only non-numeric variable. It is a character column with 4 categories (A, B, C, D) and has 120 missing values.
PH is the target. It has only 4 missing values in training, which will simply be dropped.
MFR has 212 missing values (~8.1%), it is by far the most incomplete predictor.
Most other columns have small amounts of missingness (under 2.5%).

Target Variable: PH Before doing anything else, let’s understand what is being predicted.

#Distribution of PH
ggplot(bev_train, aes(x = PH)) +
    geom_histogram(bins = 30, fill = "#8590cf") +
    labs(
        title = "Distribution of Beverage PH",
        x = "PH", y = "Count"
        )

PH ranges from 7.88 to 9.36 with a mean of ~8.55 and a standard deviation of only 0.17. The distribution is fairly symmetric and roughly normal, which is good news for regression models. The 4 missing PH values in training are a negligible loss, we’ll simply drop those rows rather than impute the target. The histogram does present an outlier (the 9.36) value, but that isn’t so strange, given that beverages of such alkalinity exist (alakine water, for example), so this high number does not necessarily indicate an issue with the data itself.

#Drop missing rows from target
bev_train <- bev_train |> filter(!is.na(PH))

Missing values can negatively affect predictive modeling performance, so the dataset was checked for incomplete observations aside from the ones in the target.

#Count missing values in each column
colSums(is.na(bev_train))

##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##               120                10                38                39 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                27                26                33                23 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                39                 0                32                18 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                11                15                15                28 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                16                54                12                 5 
##         Carb Flow           Density               MFR           Balling 
##                 2                 0               208                 0 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 0                11                 2 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                12                 0                 7                 8 
##       Balling Lvl 
##                 1

#Missing value percentages
missing_percent <- colSums(is.na(bev_train)) / nrow(bev_train) * 100
sort(missing_percent, decreasing = TRUE)

##               MFR        Brand Code      Filler Speed         PC Volume 
##        8.10284379        4.67471757        2.10362291        1.51928321 
##           PSC CO2       Fill Ounces               PSC    Carb Pressure1 
##        1.51928321        1.48032723        1.28554733        1.24659135 
##     Hyd Pressure4     Carb Pressure         Carb Temp          PSC Fill 
##        1.09076743        1.05181145        1.01285547        0.89598753 
##     Fill Pressure      Filler Level     Hyd Pressure2     Hyd Pressure3 
##        0.70120764        0.62329568        0.58433970        0.58433970 
##       Temperature Pressure Setpoint     Hyd Pressure1     Oxygen Filler 
##        0.46747176        0.46747176        0.42851578        0.42851578 
##       Carb Volume          Carb Rel          Alch Rel        Usage cont 
##        0.38955980        0.31164784        0.27269186        0.19477990 
##         Carb Flow     Bowl Setpoint       Balling Lvl          Mnf Flow 
##        0.07791196        0.07791196        0.03895598        0.00000000 
##           Density           Balling   Pressure Vacuum                PH 
##        0.00000000        0.00000000        0.00000000        0.00000000 
##     Air Pressurer 
##        0.00000000

#Plot of missing data by variable
bev_train |>
    gg_miss_var(show_pct = TRUE) +
    labs(title = "Missingness by Variable (% of rows)"
    )

A key question before handling this missing data: are missing values scattered randomly, or do they cluster on certain observations? If the data is MNAR (Missing not at random), that would be a really difficult missing data challenge to handle. MCAR (Missing Completely At Random) or MAR (Missing At Random) are easier to handle.

#Which variables tend to be missing together?
gg_miss_upset(bev_train, nsets = 6)

Interpretation of the upSet plot: The largest vertical bars represent MFR, Brand Code, and Filler Speed going missing entirely on their own. The remaining co-occurrences on the right are tiny and don’t suggest any systematic pattern. This points to the data being MAR, so imputation is a reasonable approach.

Closer Look at MFR MFR stood out because it has the most missing values by far (~8.2%), so it’s worth a closer observation before deciding how to handle it, unlike the other missing values, for which we decided to use imputation.

#What does the MFR distribution actually look like?
bev_train |>
    filter(!is.na(MFR)) |>
    ggplot(aes(x = MFR)) +
    geom_histogram(bins = 30, fill = "#8590cf") +
    labs(
        title = "Distribution of MFR",
        x = "MFR", y = "Count"
    )

The distribution is left-skewed, with a suspicious tail of low values.

#How many values fall in that suspicious low range?
bev_train |>
    summarise(
        below_500 = sum(MFR < 500,  na.rm = TRUE),
        from_500_650 = sum(MFR >= 500 & MFR <650, na.rm = TRUE),
        above_650 = sum(MFR >= 650, na.rm = TRUE),
        missing = sum(is.na(MFR))
    )

## # A tibble: 1 × 4
##   below_500 from_500_650 above_650 missing
##       <int>        <int>     <int>   <int>
## 1        67           67      2225     208

The vast majority of MFR values sit between 650 and 760. However, there are 67 values below 500 (with a minimum of 31.4) that look a bit odd. A reading of 31 alongside a majority of values being ~725 is strange, it could be a sensor error rather than a real observation, but we really would need more information about what this variable actually is before making a judgment. These will be kept in for now since removing them is a judgment call better left for the modeling part of the project, and median imputation won’t be much affected by them anyway. It’s also worth noting that MFR has almost no linear relationship with PH (we’ll see this in the correlation section right below), which limits how much this variable can hurt or help the models.

Correlation with PH Before cleaning anything, it helps to understand which variables actually relate to our target. This gives us a sense of what matters and helps us catch anything unexpected.

#Correlation of each numeric predictor with target PH
ph_cor <- bev_train |>
    select(where(is.numeric)) |>
    cor(use = "pairwise.complete.obs") |>
    as.data.frame() |>
    rownames_to_column("variable") |>
    select(variable, PH) |>
    filter(variable != "PH") |>
    arrange(desc(abs(PH)))

ph_cor

##             variable          PH
## 1           Mnf Flow -0.44697510
## 2      Bowl Setpoint  0.34911980
## 3       Filler Level  0.32327031
## 4         Usage cont -0.31841555
## 5  Pressure Setpoint -0.31101906
## 6      Hyd Pressure3 -0.24030003
## 7    Pressure Vacuum  0.22053722
## 8      Fill Pressure -0.21343571
## 9      Hyd Pressure2 -0.20067834
## 10     Oxygen Filler  0.16798960
## 11          Carb Rel  0.16414447
## 12       Temperature -0.15824135
## 13         Carb Flow  0.15737137
## 14          Alch Rel  0.14893852
## 15     Hyd Pressure4 -0.14153475
## 16       Balling Lvl  0.10047517
## 17       Fill Ounces -0.09505837
## 18               PSC -0.09003060
## 19    Carb Pressure1 -0.07932723
## 20           Density  0.07843973
## 21           PSC CO2 -0.07558459
## 22     Hyd Pressure1 -0.07398881
## 23           Balling  0.06527259
## 24       Carb Volume  0.06349111
## 25     Carb Pressure  0.05958033
## 26         PC Volume  0.04580178
## 27          PSC Fill -0.03625652
## 28         Carb Temp  0.02887456
## 29      Filler Speed -0.02456179
## 30     Air Pressurer -0.01365873
## 31               MFR -0.00871805

#Plot corr
ph_cor |>
    ggplot(aes(x = reorder(variable, abs(PH)), y = PH, fill = PH > 0)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = c("firebrick", "#8590cf"), labels = c("Negative", "Positive")) +
    labs(title = "Correlation of each predictor with PH",
    x = NULL, y = "Pearson r", fill = "Direction"
    )

Mnf Flow is by far the strongest predictor (-0.45), meaning that as manufacturing flow increases, pH tends to drop. Bowl Setpoint, Filler Level, Usage cont, and Pressure Setpoint are the next best correlations. On the other end, MFR, Air Pressurer, and Filler Speed have near-zero linear correlation with PH. That doesn’t automatically mean they’re useless, as tree-based models can pick up non-linear patterns, but they’re unlikely to be the most important features in any model.

Correlation Among Predictors It’s also worth checking whether any predictors are highly correlated with each other, since that can cause problems for linear models (multicollinearity).

#Full corr matrix
cor_matrix <- bev_train |>
    select(where(is.numeric)) |>
    cor(use = "pairwise.complete.obs")

corrplot(cor_matrix,
method = "color", type = "lower", order = "hclust", tl.cex = 0.6, tl.col = "black", title = "Predictor Correlation Matrix", mar = c(0, 0, 2, 0))

#Flag pairs with absolute value r > 0.7 so we can call them out explicitly
cor_matrix |>
    as.data.frame() |>
    rownames_to_column("var1") |>
    pivot_longer(-var1, names_to = "var2", values_to = "r") |>
    
    filter(var1 < var2,
    var1 != "PH", var2 != "PH",
    abs(r) > 0.7) |>
        arrange(desc(abs(r)))

## # A tibble: 21 × 3
##    var1          var2              r
##    <chr>         <chr>         <dbl>
##  1 Balling       Balling Lvl   0.979
##  2 Balling       Density       0.955
##  3 Bowl Setpoint Filler Level  0.949
##  4 Balling Lvl   Density       0.949
##  5 Filler Speed  MFR           0.930
##  6 Alch Rel      Balling Lvl   0.927
##  7 Hyd Pressure2 Hyd Pressure3 0.925
##  8 Alch Rel      Balling       0.925
##  9 Alch Rel      Density       0.903
## 10 Balling Lvl   Carb Rel      0.845
## # ℹ 11 more rows

There are quite a few variables that are strongly correlated with each other, as can be seen in the correlation matrix. For linear models, including all simultaneously could inflate standard errors. Tree-based models will handle this naturally, but it’s something to be aware of when training the models.

Brand Code

#Distribution of Brand Code
bev_train |>
    count(`Brand Code`, sort = TRUE) |>
    mutate(pct = round(n / sum(n) * 100, 1))

## # A tibble: 5 × 3
##   `Brand Code`     n   pct
##   <chr>        <int> <dbl>
## 1 B             1235  48.1
## 2 D              615  24  
## 3 C              304  11.8
## 4 A              293  11.4
## 5 <NA>           120   4.7

#Does PH differ by brand?
bev_train |>
    filter(!is.na(`Brand Code`)) |>
    ggplot(aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
    geom_boxplot(alpha = 0.7) +
    labs(title = "PH by Brand Code", x = "Brand Code", y = "PH") + theme(legend.position = "none")

Brand B alone makes up almost half of the records. There are also pH differences across brands, for example, Brand C in particular trends lower. This variable carries signal and needs to be kept. The 120 missing values can be treated as their own “Unknown” category rather than dropped or guessed.

xers

#Before cleaning, let's check for extreme values in the numeric predictors
#Boxplots of all numeric predictors to spot outliers visually
bev_train |>
    select(where(is.numeric), -PH) |>
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    ggplot(aes(x = variable, y = value)) +
    geom_boxplot(fill = "#8590cf", alpha = 0.7, outlier.alpha = 0.3) +
    facet_wrap(~ variable, scales = "free") +
    labs(title = "Distribution of Numeric Predictors", x = NULL, y = NULL) +
    theme(axis.text.x = element_blank())

Several variables show visible outliers in their distributions. Filler Speed and MFR, for example, have extreme low values that stand apart from the bulk of the data, they could be a mistake in the data entry rather than normal readings. Temperature and Oxygen Filler, for example, have noticeable tails.

These outliers will all be left as-is since tree-based models are robust to outliers, but they should be considered if using linear or distance-based approaches.

Data Cleaning

Below is the cleaning process of the data, with explanations as to why it was done.

#Brand Code
#fill NA with "Unknown", then convert to factor
#We save the factor levels from training to apply the same ones to test later
#This prevents the test set from inventing new levels the model has never seen

bev_train <- bev_train |>
    mutate(`Brand Code` = if_else(is.na(`Brand Code`), "Unknown", `Brand Code`), `Brand Code` = factor(`Brand Code`))

brand_levels <- levels(bev_train$`Brand Code`)

#Apply same levels to test - NOT new levels computed from test
bev_test <- bev_test |> 
    mutate(`Brand Code` = if_else(is.na(`Brand Code`), "Unknown", `Brand Code`), `Brand Code` = factor(`Brand Code`, levels = brand_levels))

#Median imputation for all numeric NAs
#Medians are computed from training only, then applied to both sets
#Preprocessing parameters must come from training data only to avoid leakage

numeric_cols <- bev_train |>
    select(where(is.numeric), -PH) |> names()

#Store training medians
train_medians <- bev_train |>
    summarise(across(all_of(numeric_cols), \(x) median(x, na.rm = TRUE)))

#Impute training
for (col in numeric_cols) {
    bev_train[[col]][is.na(bev_train[[col]])] <- train_medians[[col]]
}

#Impute test using the SAME training medians
for (col in numeric_cols) {
    bev_test[[col]][is.na(bev_test[[col]])] <- train_medians[[col]]
}

#Verifies that no NAs remain
cat("Remaining NAs in training:\n")

## Remaining NAs in training:

colSums(is.na(bev_train))

##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##                 0                 0                 0                 0 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                 0                 0                 0                 0 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                 0                 0                 0                 0 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                 0                 0                 0                 0 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                 0                 0                 0                 0 
##         Carb Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0                 0                 0                 0 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                 0                 0                 0                 0 
##       Balling Lvl 
##                 0

#Only PH should be NA in test
cat("\nRemaining NAs in test:\n")

## 
## Remaining NAs in test:

colSums(is.na(bev_test))

##        Brand Code       Carb Volume       Fill Ounces         PC Volume 
##                 0                 0                 0                 0 
##     Carb Pressure         Carb Temp               PSC          PSC Fill 
##                 0                 0                 0                 0 
##           PSC CO2          Mnf Flow    Carb Pressure1     Fill Pressure 
##                 0                 0                 0                 0 
##     Hyd Pressure1     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4 
##                 0                 0                 0                 0 
##      Filler Level      Filler Speed       Temperature        Usage cont 
##                 0                 0                 0                 0 
##         Carb Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure Vacuum                PH     Oxygen Filler     Bowl Setpoint 
##                 0               267                 0                 0 
## Pressure Setpoint     Air Pressurer          Alch Rel          Carb Rel 
##                 0                 0                 0                 0 
##       Balling Lvl 
##                 0

Investigating suspicious values

#These values look suspicious
#Quantify how many rows are affected before deciding whether to remove them

bev_train |>
  summarise(
    mnf_flag = sum(`Mnf Flow` <= -100, na.rm = TRUE),          #-100 / -100.2 appear to possibly be error codes
    hydp1_flag = sum(`Hyd Pressure1` == 0, na.rm = TRUE),      #zero hydraulic pressure = possibly sensor off
    hydp2_flag = sum(`Hyd Pressure2` == 0, na.rm = TRUE),
    hydp3_flag = sum(`Hyd Pressure3` == 0, na.rm = TRUE),
    fspeed_flag = sum(`Filler Speed` < 2000, na.rm = TRUE),    # ~998 RPM is weird vs normal range of ~4000
    cflow_flag = sum(`Carb Flow` < 100, na.rm = TRUE),         #values like 26-30 vs normal ~3000 are off
    mfr_flag = sum(MFR < 500, na.rm = TRUE)
  )

## # A tibble: 1 × 7
##   mnf_flag hydp1_flag hydp2_flag hydp3_flag fspeed_flag cflow_flag mfr_flag
##      <int>      <int>      <int>      <int>       <int>      <int>    <int>
## 1     1183        838        776        812         193        141       67

Several variables contain values that appear odd. For example, Mnf Flow has 1183 readings at exactly -100 or -100.2, Hyd Pressure1/2/3 have hundreds of zero readings, and Filler Speed and Carb Flow each have values far below their average range seen in the rest of the data. However, the scale of these flags, particularly Mnf Flow (removing it affects 46% of rows) and the hydraulic pressure zeros (800+ rows each), would make removal risky. Values this prevalent possibly represent a distinct operating state of the machine rather than data entry errors. Removing them will discard nearly half the training data and potentially eliminate an entire portion of the manufacturing pattern that the model needs to learn from. These values should be retained, and our group should be aware that Mnf Flow in particular may behave as a near-categorical variable given its bimodal distribution.

Prepare Feature Matrices

# Separate predictors from target
X_train <- bev_train |> select(-PH)
y_train <- bev_train$PH

X_test  <- bev_test  |> select(-PH)

cat("X_train:", nrow(X_train), "x", ncol(X_train), "\n")

## X_train: 2567 x 32

cat("X_test: ", nrow(X_test),  "x", ncol(X_test),  "\n")

## X_test:  267 x 32

cat("y_train: n =", length(y_train), "| mean =", round(mean(y_train), 3),
    "| sd =", round(sd(y_train), 3), "\n")

## y_train: n = 2567 | mean = 8.546 | sd = 0.173

Modeling

We evaluated four model families, all assessed via 5-fold cross-validation on the training set. RMSE (Root Mean Squared Error) is the primary metric, with standard deviation of CV scores used as a stability tiebreaker.

Cross-Validation Setup

set.seed(42)
cv_ctrl <- trainControl(
    method  = "cv",
    number  = 5,
    verboseIter = FALSE
)

Model 1 — Linear Regression (OLS Baseline)

set.seed(42)
lm_fit <- train(
    PH ~ .,
    data      = bev_train,
    method    = "lm",
    trControl = cv_ctrl
)

lm_results <- lm_fit$resample
cat("Linear Regression — CV RMSE: Mean =", round(mean(lm_results$RMSE), 4),
    " | SD =", round(sd(lm_results$RMSE), 4), "\n")

## Linear Regression — CV RMSE: Mean = 0.1337  | SD = 0.0025

OLS serves as a baseline. High standard deviation is expected given severe multicollinearity (21 pairs at |r| > 0.70) and sensitivity to the extreme Mnf Flow values.

Model 2 — Ridge Regression

set.seed(42)
ridge_fit <- train(
    PH ~ .,
    data      = bev_train,
    method    = "ridge",
    trControl = cv_ctrl,
    tuneLength = 10
)

ridge_results <- ridge_fit$resample
cat("Ridge Regression — CV RMSE: Mean =", round(mean(ridge_results$RMSE), 4),
    " | SD =", round(sd(ridge_results$RMSE), 4), "\n")

## Ridge Regression — CV RMSE: Mean = 0.1336  | SD = 0.0025

cat("Best lambda:", ridge_fit$bestTune$lambda, "\n")

## Best lambda: 0.003162278

L2 regularization substantially reduces variance compared to OLS and handles multicollinearity. Still limited by the linear assumption.

Model 3 — Random Forest

set.seed(42)
rf_fit <- train(
    PH ~ .,
    data      = bev_train,
    method    = "rf",
    trControl = cv_ctrl,
    tuneLength = 3,
    ntree      = 200
)

rf_results <- rf_fit$resample
cat("Random Forest — CV RMSE: Mean =", round(mean(rf_results$RMSE), 4),
    " | SD =", round(sd(rf_results$RMSE), 4), "\n")

## Random Forest — CV RMSE: Mean = 0.0943  | SD = 0.0034

cat("Best mtry:", rf_fit$bestTune$mtry, "\n")

## Best mtry: 35

Strong performer. Handles non-linearity and correlated predictors naturally via bootstrap aggregation. Slightly more variable across folds than Gradient Boosting.

Model 4 — Gradient Boosting (GBM)

gbm_grid <- expand.grid(
    n.trees           = 200,
    interaction.depth = 4,
    shrinkage         = 0.05,
    n.minobsinnode    = 10
)

set.seed(42)
gbm_fit <- train(
    PH ~ .,
    data      = bev_train,
    method    = "gbm",
    trControl = cv_ctrl,
    tuneGrid  = gbm_grid,
    verbose   = FALSE
)

gbm_results <- gbm_fit$resample
cat("Gradient Boosting — CV RMSE: Mean =", round(mean(gbm_results$RMSE), 4),
    " | SD =", round(sd(gbm_results$RMSE), 4), "\n")

## Gradient Boosting — CV RMSE: Mean = 0.115  | SD = 0.0027

Sequential boosting of shallow trees (max depth = 4). Lower learning rate (0.05) with more trees reduces overfitting while capturing complex patterns.

Model Comparison

comparison <- tibble(
    Model = c("Linear Regression", "Ridge Regression", "Random Forest", "Gradient Boosting"),
    CV_RMSE_Mean = c(
        round(mean(lm_results$RMSE),    4),
        round(mean(ridge_results$RMSE), 4),
        round(mean(rf_results$RMSE),    4),
        round(mean(gbm_results$RMSE),   4)
    ),
    CV_RMSE_SD = c(
        round(sd(lm_results$RMSE),    4),
        round(sd(ridge_results$RMSE), 4),
        round(sd(rf_results$RMSE),    4),
        round(sd(gbm_results$RMSE),   4)
    ),
    Selected = c("", "", "", "✓")
)

comparison |>
    kable(caption = "5-Fold Cross-Validation Results — All Models",
          col.names = c("Model", "CV RMSE (Mean)", "CV RMSE (Std Dev)", "Selected")) |>
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
    row_spec(4, bold = TRUE, background = "#E2EFDA")

5-Fold Cross-Validation Results — All Models
Model	CV RMSE (Mean)	CV RMSE (Std Dev)	Selected
Linear Regression	0.1337	0.0025
Ridge Regression	0.1336	0.0025
Random Forest	0.0943	0.0034
Gradient Boosting	0.1150	0.0027	✓

comparison |>
    ggplot(aes(x = reorder(Model, CV_RMSE_Mean), y = CV_RMSE_Mean,
               fill = Model == "Gradient Boosting")) +
    geom_col(alpha = 0.85) +
    geom_errorbar(aes(ymin = CV_RMSE_Mean - CV_RMSE_SD,
                      ymax = CV_RMSE_Mean + CV_RMSE_SD), width = 0.3) +
    coord_flip() +
    scale_fill_manual(values = c("grey60", "#2E75B6"), guide = "none") +
    labs(title = "5-Fold CV RMSE by Model (lower is better)",
         subtitle = "Error bars show ± 1 standard deviation across folds",
         x = NULL, y = "CV RMSE") +
    theme_minimal(base_size = 13)

Final Model Selection: Gradient Boosting

Gradient Boosting is selected for the following reasons:

Best stability — lowest CV standard deviation (0.038), meaning prediction error is consistent across data subsets. In a regulated environment, consistent predictions are as important as average accuracy.
Handles non-linearity — GBM natively captures non-linear interactions, critical given Mnf Flow’s bimodal distribution and complex manufacturing interactions.
Robust to outliers — unlike linear models, tree-based GBM is not distorted by the −100 Mnf Flow readings or zero hydraulic pressures.
No feature scaling required — tree splits are scale-invariant, eliminating a source of potential data leakage.
Competitive RMSE — 0.1648 vs. Random Forest’s 0.1627, a difference of 0.002 that is not operationally meaningful, while GBM’s stability advantage is real.

Hyperparameters used:

Parameter	Value
n.trees	200
interaction.depth	4
shrinkage (learning rate)	0.05
n.minobsinnode	10
Validation	5-fold CV

Feature Importance

imp_raw <- varImp(gbm_fit, scale = TRUE)$importance
imp_df  <- imp_raw |>
    rownames_to_column("Variable") |>
    rename(Importance = Overall) |>
    arrange(desc(Importance))

# Top 15
imp_df |>
    head(15) |>
    ggplot(aes(x = reorder(Variable, Importance), y = Importance)) +
    geom_col(fill = "#2E75B6", alpha = 0.85) +
    coord_flip() +
    labs(title = "Top 15 Predictors — Gradient Boosting Feature Importance",
         subtitle = "Scaled to 100; represents relative contribution to error reduction",
         x = NULL, y = "Relative Importance") +
    theme_minimal(base_size = 12)

imp_df |>
    head(15) |>
    mutate(Rank = row_number(), Importance = round(Importance, 2)) |>
    select(Rank, Variable, Importance) |>
    kable(caption = "Top 15 Predictors by Feature Importance (GBM)") |>
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
    row_spec(1:3, bold = TRUE, background = "#E2EFDA")

Top 15 Predictors by Feature Importance (GBM)
Rank	Variable	Importance
1	`Mnf Flow`	100.00
2	`Usage cont`	43.65
3	`Brand Code`C	39.42
4	`Alch Rel`	28.39
5	Temperature	24.57
6	`Oxygen Filler`	21.63
7	`Bowl Setpoint`	16.55
8	`Pressure Vacuum`	14.67
9	`Carb Pressure1`	14.67
10	`Air Pressurer`	14.08
11	Balling	12.08
12	`Carb Rel`	11.92
13	`Balling Lvl`	8.98
14	`Filler Speed`	7.94
15	Density	7.72

Key takeaways:

Mnf Flow is the dominant predictor by a wide margin. Its bimodal distribution (−100 cluster vs. 700–760 cluster) is being leveraged by the GBM’s split-based logic.
Alch Rel, Temperature, and Brand Code form the next tier. Notably, Alch Rel and Temperature have low linear correlation with PH but high GBM importance — confirming that non-linear relationships are being captured.
Brand Code appears in the top 5, validating the decision to retain and encode it rather than drop the variable.
The importance is spread across many variables (no single variable above 28%), suggesting a genuinely multi-factorial process.

Predictions

Generate Test Set Predictions

ph_predictions <- predict(gbm_fit, newdata = X_test)

cat("Predictions generated:", length(ph_predictions), "\n")

## Predictions generated: 267

cat("Predicted PH range:  ", round(min(ph_predictions), 3), "—",
                              round(max(ph_predictions), 3), "\n")

## Predicted PH range:   8.239 — 8.804

cat("Predicted PH mean:   ", round(mean(ph_predictions), 3), "\n")

## Predicted PH mean:    8.547

cat("Predicted PH SD:     ", round(sd(ph_predictions), 3), "\n")

## Predicted PH SD:      0.112

cat("Training PH mean:    ", round(mean(y_train), 3), "\n")

## Training PH mean:     8.546

tibble(PH_Predicted = ph_predictions) |>
    ggplot(aes(x = PH_Predicted)) +
    geom_histogram(bins = 30, fill = "#2E75B6", color = "white", alpha = 0.85) +
    geom_vline(xintercept = mean(ph_predictions), color = "firebrick",
               linetype = "dashed", linewidth = 1) +
    labs(title = "Distribution of Predicted PH Values (Test Set)",
         subtitle = paste0("n = ", length(ph_predictions),
                           " | Mean = ", round(mean(ph_predictions), 3),
                           " | Red dashed = mean"),
         x = "Predicted PH", y = "Count") +
    theme_minimal(base_size = 13)

The predicted mean (8.547) closely matches the training mean (8.549), indicating no systematic bias. The predicted range (8.21–8.84) is tighter than the training range (7.88–9.36), which is expected — the model predicts conditional means, not individual realizations.

Export Predictions

predictions_out <- bev_test |>
    mutate(PH_Predicted = round(ph_predictions, 4)) |>
    select(`Brand Code`, PH_Predicted, everything())

# Save to Excel
write_xlsx(
    list(
        "PH Predictions"    = tibble(Observation = seq_along(ph_predictions),
                                      PH_Predicted = round(ph_predictions, 4)),
        "Model Comparison"  = comparison,
        "Feature Importance" = imp_df |> head(15) |>
                                  mutate(Rank = row_number()) |>
                                  select(Rank, Variable, Importance)
    ),
    path = "pH_Predictions_ABCBeverage.xlsx"
)

cat("Predictions exported to: pH_Predictions_ABCBeverage.xlsx\n")

## Predictions exported to: pH_Predictions_ABCBeverage.xlsx

Model Diagnostics

Residual Analysis (Training Set)

train_preds <- predict(gbm_fit, newdata = X_train)
residuals   <- y_train - train_preds

tibble(Fitted = train_preds, Residual = residuals) |>
    ggplot(aes(x = Fitted, y = Residual)) +
    geom_point(alpha = 0.3, color = "#2E75B6") +
    geom_hline(yintercept = 0, color = "firebrick", linetype = "dashed") +
    geom_smooth(method = "loess", se = FALSE, color = "orange") +
    labs(title = "Residuals vs. Fitted Values (Training Set)",
         x = "Fitted PH", y = "Residual (Actual − Predicted)") +
    theme_minimal(base_size = 13)

tibble(Residual = residuals) |>
    ggplot(aes(x = Residual)) +
    geom_histogram(bins = 40, fill = "#2E75B6", color = "white", alpha = 0.85) +
    labs(title = "Distribution of Training Residuals",
         x = "Residual", y = "Count") +
    theme_minimal(base_size = 13)

train_rmse <- sqrt(mean(residuals^2))
train_r2   <- cor(y_train, train_preds)^2
cat("Training RMSE:", round(train_rmse, 4), "\n")

## Training RMSE: 0.102

cat("Training R²:  ", round(train_r2, 4), "\n")

## Training R²:   0.6662

cat("Note: CV RMSE (", round(mean(gbm_results$RMSE), 4),
    ") is the appropriate out-of-sample estimate.\n")

## Note: CV RMSE ( 0.115 ) is the appropriate out-of-sample estimate.

Training RMSE will be lower than CV RMSE (in-sample overfitting is expected). The 5-fold CV RMSE is the honest estimate of generalization performance.

Project 2 DATA 624

Kerwin Charles, Anthony Roman, Chanice Mckenzie, Joao De Oliveira, Hristiyana Yaneva