ABC Beverage Company

Introduction and Abstract

In response to new regulations requiring ABC Beverage to better understand our manufacturing process and predictive factors for PH levels, our data science team analyzed historical production data to build a reliable forecasting model. The goal was to identify the key factors influencing PH and create accurate predictions to support compliance and process optimization.

Methodology

We started by cleaning the data, addressing missing values, removing duplicates, and encoding categorical variables. Numerical features were standardized to ensure consistency. For exploratory data analysis, we visualized the data, checked for outliers, and analyzed correlations between features.

Next, we trained and tested several models including linear regression, decision trees, neural networks, random forests, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). Model performance was evaluated using RMSE and R-squared metrics.

Final Model and Key Findings

The random forest model delivered the best performance with an R-squared of 0.65 and an RMSE of 0.10. The top five features driving PH predictions were usage_cont, filler_level, temperature, brand_codeC, and carb_flow. SHAP analysis confirmed their importance and provided deeper insights into their impact on PH levels.

PH forecasts have been generated using the final model and added to the StudentEvaluation.xlsx file for further review and action.

Packages and Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readxl)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggcorrplot)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(nnet)
library(fastshap)
## 
## Attaching package: 'fastshap'
## 
## The following object is masked from 'package:dplyr':
## 
##     explain
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:gridExtra':
## 
##     combine
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
set.seed(12345) 
knitr::opts_chunk$set(
  warning = FALSE,   
  message = FALSE,   
  echo = TRUE,       
  fig.width = 8,    
  fig.height = 6     
)

Data Cleaning

# Load the specific sheet from the first Excel file
training_data <- read_excel("StudentData.xlsx", sheet = "Subset")

# Load the specific sheet from the second Excel file
testing_data <- read_excel("StudentEvaluation.xlsx", sheet = "Subset (2)")
# right away, remove any observations with no PH value.

training_data <-training_data[!is.na(training_data[['PH']]), ]

X <- select(training_data, -PH)
y <- training_data$PH

# no PH values are provided in studentEvaluation.xlsx, so we cannot use it
# for testing. We can only use it to make predictions.
eval_X <- select(testing_data, -PH)
eval_X <- eval_X %>% drop_na()
# Inspect the structure of the data
glimpse(X)
## Rows: 2,567
## Columns: 32
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
glimpse(eval_X)
## Rows: 200
## Columns: 32
## $ `Brand Code`        <chr> "A", "B", "B", "A", "A", "B", "B", "B", "C", "B", …
## $ `Carb Volume`       <dbl> 5.393333, 5.293333, 5.406667, 5.480000, 5.406667, …
## $ `Fill Ounces`       <dbl> 23.95333, 23.92000, 24.20000, 23.93333, 23.92000, …
## $ `PC Volume`         <dbl> 0.2266667, 0.3033333, 0.1600000, 0.2433333, 0.3326…
## $ `Carb Pressure`     <dbl> 63.2, 66.4, 69.4, 65.2, 66.8, 63.2, 65.0, 63.8, 64…
## $ `Carb Temp`         <dbl> 135.0, 140.4, 142.2, 134.6, 138.0, 139.6, 138.8, 1…
## $ PSC                 <dbl> 0.042, 0.068, 0.040, 0.088, 0.246, 0.184, 0.152, 0…
## $ `PSC Fill`          <dbl> 0.22, 0.10, 0.30, 0.14, 0.48, 0.26, 0.12, 0.18, 0.…
## $ `PSC CO2`           <dbl> 0.08, 0.02, 0.06, 0.00, 0.04, 0.20, 0.00, 0.02, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 120.2, 115.0, 117.6, 136.0, 117.2, 117.0, 1…
## $ `Fill Pressure`     <dbl> 46.2, 45.8, 51.4, 46.2, 43.8, 46.2, 45.8, 46.4, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 112, 98, 94, 108, 110, 96, 100, 100, 92, 90, 90, 7…
## $ `Filler Level`      <dbl> 120.0, 119.4, 116.0, 119.6, 121.0, 118.4, 119.6, 1…
## $ `Filler Speed`      <dbl> 4012, 4010, 4018, 4010, 4010, 4010, 4010, 4016, 40…
## $ Temperature         <dbl> 65.6, 65.6, 66.4, 66.8, 65.8, 65.8, 65.4, 65.6, 67…
## $ `Usage cont`        <dbl> 17.60, 24.18, 21.32, 17.68, 17.70, 17.16, 20.52, 2…
## $ `Carb Flow`         <dbl> 2916, 3056, 3214, 3042, 2502, 3100, 2926, 2954, 30…
## $ Density             <dbl> 1.50, 0.90, 0.88, 1.48, 1.52, 0.86, 0.92, 0.94, 0.…
## $ MFR                 <dbl> 735.8, 734.8, 752.0, 729.8, 741.2, 735.8, 735.6, 7…
## $ Balling             <dbl> 2.942, 1.448, 1.398, 2.894, 2.992, 1.348, 1.498, 1…
## $ `Pressure Vacuum`   <dbl> -4.4, -4.2, -4.0, -4.2, -4.4, -4.2, -4.8, -4.8, -4…
## $ `Oxygen Filler`     <dbl> 0.030, 0.046, 0.082, 0.042, 0.046, 0.048, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46, 46, 50, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46…
## $ `Air Pressurer`     <dbl> 147.2, 146.6, 145.8, 145.0, 146.2, 147.0, 147.0, 1…
## $ `Alch Rel`          <dbl> 7.14, 6.52, 6.50, 7.18, 7.14, 6.50, 6.54, 6.54, 6.…
## $ `Carb Rel`          <dbl> 5.58, 5.34, 5.38, 5.46, 5.44, 5.38, 5.28, 5.22, 5.…
## $ `Balling Lvl`       <dbl> 3.04, 1.46, 1.46, 3.02, 3.10, 1.42, 1.46, 1.44, 1.…
# Check for missing values
sum(is.na(X))
## [1] 812
sum(is.na(eval_X))
## [1] 0
# View a summary of the data
summary(X)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2567        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23933  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27724  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08464   Mean   :0.1953  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  70.20   Median :123.2   Median :46.40  
##  Mean   :0.05644   Mean   :  24.63   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39                          NA's   :32      NA's   :18     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 62.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.46   Mean   :20.99   Mean   :20.48   Mean   : 96.31  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :28      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1166  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3030  
##  Mean   :109.3   Mean   :3688   Mean   :65.96   Mean   :20.99   Mean   :2472  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3188  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :16      NA's   :54     NA's   :12      NA's   :5       NA's   :2     
##     Density           MFR           Balling      Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :0.160   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.:1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median :1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   :2.200   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.:3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   :4.012   Max.   :-3.600  
##                  NA's   :208                                     
##  Oxygen Filler     Bowl Setpoint   Pressure Setpoint Air Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04643   Mean   :109.3   Mean   :47.61     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :11        NA's   :2       NA's   :12                       
##     Alch Rel        Carb Rel      Balling Lvl   
##  Min.   :5.280   Min.   :4.960   Min.   :0.000  
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.380  
##  Median :6.560   Median :5.400   Median :1.480  
##  Mean   :6.898   Mean   :5.437   Mean   :2.052  
##  3rd Qu.:7.240   3rd Qu.:5.550   3rd Qu.:3.140  
##  Max.   :8.620   Max.   :6.060   Max.   :3.660  
##  NA's   :7       NA's   :8       NA's   :1
summary(eval_X)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:200         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23467  
##  Mode  :character   Median :5.343   Median :23.97   Median :0.27533  
##                     Mean   :5.367   Mean   :23.97   Mean   :0.27681  
##                     3rd Qu.:5.453   3rd Qu.:24.01   3rd Qu.:0.32150  
##                     Max.   :5.633   Max.   :24.20   Max.   :0.46400  
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.55   1st Qu.:138.6   1st Qu.:0.04200   1st Qu.:0.1200  
##  Median :67.80   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.15   Mean   :141.2   Mean   :0.08157   Mean   :0.1857  
##  3rd Qu.:70.05   3rd Qu.:143.4   3rd Qu.:0.10450   3rd Qu.:0.2400  
##  Max.   :77.40   Max.   :154.0   Max.   :0.24600   Max.   :0.5600  
##     PSC CO2          Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.0000   Min.   :-100.20   Min.   :113.0   Min.   :43.60  
##  1st Qu.:0.0200   1st Qu.:-100.00   1st Qu.:119.4   1st Qu.:46.00  
##  Median :0.0400   Median : 107.50   Median :123.4   Median :48.00  
##  Mean   :0.0519   Mean   :  31.25   Mean   :122.7   Mean   :48.37  
##  3rd Qu.:0.0600   3rd Qu.: 144.10   3rd Qu.:125.2   3rd Qu.:50.20  
##  Max.   :0.2400   Max.   : 199.00   Max.   :136.0   Max.   :54.80  
##  Hyd Pressure1   Hyd Pressure2  Hyd Pressure3   Hyd Pressure4   
##  Min.   : 0.00   Min.   : 0.0   Min.   : 0.00   Min.   : 72.00  
##  1st Qu.: 0.00   1st Qu.: 0.0   1st Qu.: 0.00   1st Qu.: 89.50  
##  Median :11.90   Median :30.7   Median :30.30   Median : 96.00  
##  Mean   :13.35   Mean   :23.8   Mean   :23.09   Mean   : 95.15  
##  3rd Qu.:21.00   3rd Qu.:35.4   3rd Qu.:34.60   3rd Qu.:100.50  
##  Max.   :45.40   Max.   :61.4   Max.   :49.20   Max.   :128.00  
##   Filler Level     Filler Speed   Temperature     Usage cont      Carb Flow   
##  Min.   : 69.20   Min.   :1008   Min.   :63.8   Min.   :14.04   Min.   :1048  
##  1st Qu.: 99.35   1st Qu.:3898   1st Qu.:65.4   1st Qu.:18.11   1st Qu.:2692  
##  Median :113.80   Median :3984   Median :65.8   Median :21.67   Median :3066  
##  Mean   :108.28   Mean   :3851   Mean   :65.9   Mean   :21.09   Mean   :2659  
##  3rd Qu.:119.85   3rd Qu.:3996   3rd Qu.:66.4   3rd Qu.:23.79   3rd Qu.:3245  
##  Max.   :137.00   Max.   :4020   Max.   :71.6   Max.   :24.34   Max.   :3824  
##     Density           MFR           Balling      Pressure Vacuum 
##  Min.   :0.460   Min.   : 95.0   Min.   :1.048   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:709.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :726.0   Median :1.648   Median :-5.400  
##  Mean   :1.185   Mean   :707.0   Mean   :2.219   Mean   :-5.229  
##  3rd Qu.:1.620   3rd Qu.:731.5   3rd Qu.:3.290   3rd Qu.:-5.000  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  Oxygen Filler     Bowl Setpoint   Pressure Setpoint Air Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :141.2  
##  1st Qu.:0.01570   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03350   Median :120.0   Median :48.00     Median :142.6  
##  Mean   :0.04585   Mean   :108.6   Mean   :47.87     Mean   :142.8  
##  3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8  
##  Max.   :0.35400   Max.   :130.0   Max.   :52.00     Max.   :147.2  
##     Alch Rel       Carb Rel      Balling Lvl   
##  Min.   :6.40   Min.   :5.220   Min.   :1.260  
##  1st Qu.:6.54   1st Qu.:5.340   1st Qu.:1.380  
##  Median :6.58   Median :5.400   Median :1.460  
##  Mean   :6.89   Mean   :5.435   Mean   :2.056  
##  3rd Qu.:7.16   3rd Qu.:5.540   3rd Qu.:3.080  
##  Max.   :7.82   Max.   :5.740   Max.   :3.420
head(X)
## # A tibble: 6 × 32
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # ℹ 27 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   `Pressure Vacuum` <dbl>, `Oxygen Filler` <dbl>, `Bowl Setpoint` <dbl>, …
head(eval_X)
## # A tibble: 6 × 32
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 A                     5.39          24.0       0.227            63.2
## 2 B                     5.29          23.9       0.303            66.4
## 3 B                     5.41          24.2       0.16             69.4
## 4 A                     5.48          23.9       0.243            65.2
## 5 A                     5.41          23.9       0.333            66.8
## 6 B                     5.26          24.1       0.22             63.2
## # ℹ 27 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   `Pressure Vacuum` <dbl>, `Oxygen Filler` <dbl>, `Bowl Setpoint` <dbl>, …
# Remove rows with missing data
y <- y[complete.cases(X)] 
X <- X %>% drop_na()

eval_X <- eval_X %>% drop_na()
# Standardize column names
X <- X %>% rename_all(tolower) %>% 
  rename_all(gsub, pattern = " ", replacement = "_")
eval_X <- eval_X %>% rename_all(tolower) %>% 
  rename_all(gsub, pattern = " ", replacement = "_")
# remove duplicates
X <- X %>% distinct()
eval_X <- eval_X %>% distinct()
# remove the columns that only have 1 unique value

single_value_cols <- sapply(X, function(col) length(unique(col)) == 1)
X <- X[, !single_value_cols, drop = FALSE]
eval_X <- eval_X[, !single_value_cols, drop = FALSE]
missing_train <- sapply(X, function(col) sum(is.na(col)))
missing_test <- sapply(eval_X, function(col) sum(is.na(col)))

print(missing_train)
##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 0                 0                 0                 0 
##      filler_level      filler_speed       temperature        usage_cont 
##                 0                 0                 0                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                 0                 0 
##   pressure_vacuum     oxygen_filler     bowl_setpoint pressure_setpoint 
##                 0                 0                 0                 0 
##     air_pressurer          alch_rel          carb_rel       balling_lvl 
##                 0                 0                 0                 0
print(missing_test)
##        brand_code       carb_volume       fill_ounces         pc_volume 
##                 0                 0                 0                 0 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 0                 0                 0                 0 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 0                 0                 0                 0 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 0                 0                 0                 0 
##      filler_level      filler_speed       temperature        usage_cont 
##                 0                 0                 0                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                 0                 0 
##   pressure_vacuum     oxygen_filler     bowl_setpoint pressure_setpoint 
##                 0                 0                 0                 0 
##     air_pressurer          alch_rel          carb_rel       balling_lvl 
##                 0                 0                 0                 0
# Convert character columns to factors and ensure date columns are correctly formatted
X <- X %>%
  mutate(across(where(is.character), as.factor))

eval_X <- eval_X %>%
  mutate(across(where(is.character), as.factor))
# one-hot-encode the brand_code field with dummy variables
encode_var <- function(df, col){
  ohm <- model.matrix(~ . - 1, data = df[, col, drop = FALSE])
  ohm <- as.data.frame(ohm)
  ohm <- lapply(ohm, as.factor)
  return(cbind(df[ , !names(df) %in% col, drop = FALSE], ohm))
}

X <- encode_var(X, 'brand_code')
eval_X <- encode_var(eval_X, 'brand_code')
# Data types
str(X)
## 'data.frame':    2038 obs. of  35 variables:
##  $ carb_volume      : num  5.49 5.38 5.25 5.27 5.32 ...
##  $ fill_ounces      : num  24.3 23.9 24 24 23.9 ...
##  $ pc_volume        : num  0.111 0.269 0.263 0.231 0.259 ...
##  $ carb_pressure    : num  67.2 66.6 64.2 72 66.2 61.6 71.6 72.6 68 63.8 ...
##  $ carb_temp        : num  137 138 140 147 139 ...
##  $ psc              : num  0.026 0.09 0.132 0.014 0.078 0.11 0.096 0.16 0.034 0.124 ...
##  $ psc_fill         : num  0.16 0.24 0.12 0.24 0.18 ...
##  $ psc_co2          : num  0.12 0.04 0.14 0.06 0.04 ...
##  $ mnf_flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ carb_pressure1   : num  118 120 121 120 120 ...
##  $ fill_pressure    : num  45.8 45.6 46 45.2 46.6 46.6 46 46.6 47.4 48.8 ...
##  $ hyd_pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hyd_pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hyd_pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hyd_pressure4    : num  92 116 90 108 94 86 94 92 96 92 ...
##  $ filler_level     : num  119 120 120 121 120 ...
##  $ filler_speed     : num  4010 4014 4014 4028 4020 ...
##  $ temperature      : num  65.6 66.2 65.4 66.6 65 65.4 65 65 65.2 68.8 ...
##  $ usage_cont       : num  17.7 23.8 18.4 13.5 19 ...
##  $ carb_flow        : num  3054 2948 2902 3038 3056 ...
##  $ density          : num  1.54 1.52 0.9 0.9 0.9 0.92 0.92 0.9 0.9 0.46 ...
##  $ mfr              : num  723 739 740 692 727 ...
##  $ balling          : num  3.04 2.99 1.45 1.45 1.45 ...
##  $ pressure_vacuum  : num  -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.2 ...
##  $ oxygen_filler    : num  0.03 0.024 0.064 0.022 0.03 0.058 0.046 0.066 0.046 0.164 ...
##  $ bowl_setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ pressure_setpoint: num  46 46 46 46 46 46 46 46 46 46 ...
##  $ air_pressurer    : num  146 147 147 146 146 ...
##  $ alch_rel         : num  7.14 7.16 6.52 6.54 6.52 6.52 6.52 6.52 6.52 6.52 ...
##  $ carb_rel         : num  5.44 5.44 5.34 5.34 5.34 5.34 5.34 5.34 5.34 5.34 ...
##  $ balling_lvl      : num  3.04 3.02 1.44 1.38 1.44 1.44 1.44 1.44 1.42 1.46 ...
##  $ brand_codeA      : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
##  $ brand_codeB      : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 1 ...
##  $ brand_codeC      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ brand_codeD      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

EDA

The cell below creates a function that can be used to count the number of outliers in each column of a dataframe:

count_outliers <- function(dataframe) {
  outlier_counts <- sapply(dataframe, function(column) {
    if (is.numeric(column)) {
      Q1 <- quantile(column, 0.25, na.rm = TRUE)
      Q3 <- quantile(column, 0.75, na.rm = TRUE)
      IQR <- Q3 - Q1
      lower_bound <- Q1 - 1.5 * IQR
      upper_bound <- Q3 + 1.5 * IQR
      sum(column < lower_bound | column > upper_bound, na.rm = TRUE)
    } else {
      NA
    }
  })
  return(outlier_counts)
}

The count_outliers function is used below to plot the number of outliers present in each predictor field:

outlier_counts <- count_outliers(X)
outlier_counts <- data.frame(
  Column = names(outlier_counts),
  Outliers = as.numeric(outlier_counts)
)
outlier_counts <- na.omit(outlier_counts)


ggplot(outlier_counts, aes(x = reorder(Column, -Outliers), y = Outliers)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "# of Outliers Present in Predictor Fields",
    x = "Variable Name",
    y = "Number of Outliers"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Next, the cell below includes function that tests whether or not each column within the dataframe is normal.

test_normality <- function(dataframe) {
  results <- lapply(dataframe, function(column) {
    if (is.numeric(column)) {
      test_result <- tryCatch(
        shapiro.test(column),
        error = function(e) NULL # handle errors (e.g., small sample size)
      )
      if (!is.null(test_result)) {
        return(data.frame(
          Statistic = test_result$statistic,
          P_Value = test_result$p.value
        ))
      } else {
        return(data.frame(Statistic = NA, P_Value = NA))
      }
    } else {
      # return NA for non-numeric columns
      return(data.frame(Statistic = NA, P_Value = NA))
    }
  })
  
  # combine results into a dataframe
  results_df <- do.call(rbind, results)
  rownames(results_df) <- names(dataframe)
  return(results_df)
}

normality_results <- test_normality(X)
normality_results
##                   Statistic      P_Value
## carb_volume       0.9570379 4.605382e-24
## fill_ounces       0.9946490 1.030640e-06
## pc_volume         0.9852127 1.110007e-13
## carb_pressure     0.9964220 9.849891e-05
## carb_temp         0.9960234 3.279974e-05
## psc               0.9547630 1.204310e-24
## psc_fill          0.9393717 4.303840e-28
## psc_co2           0.8376258 1.905078e-41
## mnf_flow          0.7261226 1.232213e-49
## carb_pressure1    0.9866239 7.233734e-13
## fill_pressure     0.8638292 7.322913e-39
## hyd_pressure1     0.9032089 4.276136e-34
## hyd_pressure2     0.7994928 1.168334e-44
## hyd_pressure3     0.8109674 9.495383e-44
## hyd_pressure4     0.9567037 3.769116e-24
## filler_level      0.8375014 1.855578e-41
## filler_speed      0.3512008 8.071603e-65
## temperature       0.9397270 5.074408e-28
## usage_cont        0.8693097 2.854520e-38
## carb_flow         0.7042791 6.671293e-51
## density           0.8204577 5.819635e-43
## mfr               0.4344164 2.904769e-62
## balling           0.7700461 8.205418e-47
## pressure_vacuum   0.9618916 9.729409e-23
## oxygen_filler     0.7951776 5.449870e-45
## bowl_setpoint     0.7916412 2.945908e-45
## pressure_setpoint 0.7390249 7.562504e-49
## air_pressurer     0.7021706 5.082926e-51
## alch_rel          0.7078074 1.055353e-50
## carb_rel          0.9392944 4.152571e-28
## balling_lvl       0.7081517 1.103931e-50
## brand_codeA              NA           NA
## brand_codeB              NA           NA
## brand_codeC              NA           NA
## brand_codeD              NA           NA

The cell below produces a correlation matrix to show the correlations between all pairs of predictor variables, helping to assess the level of multicollinearity.

corr_matrix <- cor(select(X, where(is.numeric)))
ggcorrplot(corr_matrix, lab = FALSE, title = "Correlation Matrix") +
  theme(
    axis.text.x = element_text(size = 6),  
    axis.text.y = element_text(size = 6)  
  )

The cell below produces scatterplots of each explanatory field with the predictor variable PH. Each scatterplot includes the correlation coefficent and a best fit line relating the two fields.

plot_scatter_with_fit <- function(X, y) {
  plots <- list() 
  
  for (col in names(X)) {
    correlation <- cor(X[[col]], y)  
    plot <- ggplot(data = data.frame(x = X[[col]], y = y), aes(x = x, y = y)) +
      geom_point(alpha=.02) +
      geom_smooth(method = "lm", se = FALSE, color = "red") +  
      labs(
        subtitle = paste("Correlation:", round(correlation, 2)),
        x = col,
        y = "PH"
      ) +
      theme_minimal()
    plots[[col]] <- plot
  }
  
  return(plots)
}

scatter_plots <- plot_scatter_with_fit(select(X, where(is.numeric)), y)

for (plot in scatter_plots) {
  print(plot)
}

The cell below plots the distributions of the categorical features in X:

plot_categorical_distributions <- function(df) {
  # Identify categorical columns
  categorical_columns <- 
    names(df)[sapply(df, is.factor) | sapply(df, is.character)]
  
  # Create bar plots for each categorical column
  plots <- lapply(categorical_columns, function(column) {
    ggplot(df, aes_string(x = column)) +
      geom_bar(fill = "steelblue", color = "black") +
      labs(
        title = paste("Distribution of", column),
        x = column,
        y = "Count"
      ) +
      theme_minimal()
  })
  
  return(plots)
}

cat_plots <- plot_categorical_distributions(select(X, where(is.factor)))

for (plot in cat_plots) {
  print(plot)
}

Data Pre-Processing

Split the data into testing and training sets.

train_index <- createDataPartition(y, p = 0.75, list = FALSE)

X_train <- X[train_index, ]
X_test <- X[-train_index, ]

y_train <- y[train_index]
y_test <- y[-train_index]

Apply the Yeo-Johnson transformation.

numeric_features <- sapply(X, is.numeric)

# fit the Yeo-Johnson transformation using training data
preprocess_params <- preProcess(X_train[, numeric_features],
                                method = "YeoJohnson")

# pply the transformation to training and testing data
X_train[, numeric_features] <- predict(preprocess_params,
                                       X_train[, numeric_features])
X_test[, numeric_features] <- predict(preprocess_params,
                                    X_test[, numeric_features])

Use robust scaling to scale the data:

# only use medians and iqrs from training data
medians <- apply(X_train[, numeric_features], 2, median)
iqrs <- apply(X_train[, numeric_features], 2, IQR)

# performs the robust scaling using the IQRs and medians
robust_scale <- function(data, medians, iqrs) {
  scaled_data <- sweep(data, 2, medians, "-")  
  scaled_data <- sweep(scaled_data, 2, iqrs, "/") 
  return(scaled_data)
}

# transform the training and testing data
X_train[, numeric_features] <- robust_scale(X_train[, numeric_features],
                                           medians, iqrs)
X_test[, numeric_features] <- robust_scale(X_test[, numeric_features], 
                                          medians, iqrs)

Finds all pairs of highly correlated variables (\(|r| > 0.7\)) and randomly removes one variable from each pair.

high_corr_indices <- findCorrelation(corr_matrix, cutoff = 0.7, names = TRUE)
X_train <- X_train[, !names(X_train) %in% high_corr_indices]
X_test <- X_test[, !names(X_test) %in% high_corr_indices]

Make new variables (and update old ones) using the transformed data:

X <- rbind(X_train, X_test)
train <- X_train
train$PH <- y_train
test <- X_test
test$PH <- y_test

Check again for outliers:

outlier_counts <- count_outliers(X)
outlier_counts <- data.frame(
  Column = names(outlier_counts),
  Outliers = as.numeric(outlier_counts)
)
outlier_counts <- na.omit(outlier_counts)


ggplot(outlier_counts, aes(x = reorder(Column, -Outliers), y = Outliers)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "# of Outliers Present in Predictor Fields",
    x = "Variable Name",
    y = "Number of Outliers"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Check again for normal distributions:

normality_results <- test_normality(X)
normality_results
##                 Statistic      P_Value
## carb_volume     0.9570379 4.605382e-24
## fill_ounces     0.9946490 1.030640e-06
## pc_volume       0.9892472 3.426857e-11
## carb_temp       0.9980432 1.438409e-02
## psc             0.9547630 1.204310e-24
## psc_fill        0.9887001 1.463930e-11
## psc_co2         0.8376258 1.905078e-41
## carb_pressure1  0.9873377 1.961692e-12
## fill_pressure   0.8638292 7.322913e-39
## hyd_pressure1   0.8740895 9.712845e-38
## hyd_pressure4   0.9570190 4.553325e-24
## filler_level    0.8375014 1.855578e-41
## filler_speed    0.3512008 8.071603e-65
## temperature     0.9397270 5.074408e-28
## usage_cont      0.8693097 2.854520e-38
## carb_flow       0.7919069 3.084296e-45
## pressure_vacuum 0.9802522 3.481334e-16
## oxygen_filler   0.7951776 5.449870e-45
## air_pressurer   0.7021706 5.082926e-51
## brand_codeA            NA           NA
## brand_codeB            NA           NA
## brand_codeC            NA           NA
## brand_codeD            NA           NA

Check again for multicollinearity:

corr_matrix <- cor(select(X, where(is.numeric)))
ggcorrplot(corr_matrix, lab = FALSE, title = "Correlation Matrix") +
  theme(
    axis.text.x = element_text(size = 6),  
    axis.text.y = element_text(size = 6)  
  )

We now have scaled, transformed and removed mutlti collenaer pairs from our data, we can proceed to building models.

Modeling

The first models used is a simple Linear Regression

model_lm1 <- lm(train$PH ~ ., train)
summary(model_lm1)
## 
## Call:
## lm(formula = train$PH ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51212 -0.07911  0.01401  0.09560  0.44786 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.6571515  0.0169592 510.468  < 2e-16 ***
## carb_volume     -0.0211829  0.0111656  -1.897  0.05800 .  
## fill_ounces     -0.0069887  0.0044412  -1.574  0.11578    
## pc_volume       -0.0030710  0.0053001  -0.579  0.56240    
## carb_temp       -0.0019803  0.0048417  -0.409  0.68258    
## psc             -0.0042448  0.0048281  -0.879  0.37944    
## psc_fill        -0.0065565  0.0055376  -1.184  0.23660    
## psc_co2         -0.0062008  0.0035136  -1.765  0.07780 .  
## carb_pressure1   0.0355027  0.0061499   5.773 9.45e-09 ***
## fill_pressure   -0.0243515  0.0062108  -3.921 9.22e-05 ***
## hyd_pressure1    0.0003830  0.0090633   0.042  0.96630    
## hyd_pressure4    0.0074237  0.0078570   0.945  0.34489    
## filler_level     0.0979577  0.0090507  10.823  < 2e-16 ***
## filler_speed    -0.0036012  0.0011146  -3.231  0.00126 ** 
## temperature     -0.0302962  0.0047424  -6.388 2.23e-10 ***
## usage_cont      -0.0601566  0.0083179  -7.232 7.54e-13 ***
## carb_flow        0.0543009  0.0102074   5.320 1.20e-07 ***
## pressure_vacuum  0.0021836  0.0050273   0.434  0.66409    
## oxygen_filler    0.0001932  0.0032632   0.059  0.95279    
## air_pressurer   -0.0006421  0.0024088  -0.267  0.78985    
## brand_codeA1    -0.1031997  0.0173165  -5.960 3.14e-09 ***
## brand_codeB1    -0.0473982  0.0192697  -2.460  0.01402 *  
## brand_codeC1    -0.1860677  0.0227281  -8.187 5.66e-16 ***
## brand_codeD1            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1369 on 1506 degrees of freedom
## Multiple R-squared:  0.3715, Adjusted R-squared:  0.3623 
## F-statistic: 40.46 on 22 and 1506 DF,  p-value: < 2.2e-16

Decision Tree

model_tree1 <- rpart(train$PH ~., data = train)

predict(model_tree1, test)
##        2        4        5       10       12       14       15       23 
## 8.532000 8.586820 8.508000 8.447879 8.508000 8.447879 8.586820 8.604404 
##       24       26       30       33       34       37       48       51 
## 8.508000 8.675564 8.447879 8.586820 8.532000 8.586820 8.586820 8.675564 
##       59       60       62       66       68       69       72       74 
## 8.586820 8.586820 8.586820 8.675564 8.532000 8.508000 8.586820 8.532000 
##       82       83       84       87       91       95       97       98 
## 8.675564 8.675564 8.586820 8.586820 8.532000 8.675564 8.745688 8.745688 
##      102      106      119      121      123      131      133      137 
## 8.675564 8.586820 8.447879 8.447879 8.447879 8.675564 8.586820 8.532000 
##      147      152      153      159      165      166      183      191 
## 8.586820 8.532000 8.675564 8.532000 8.586820 8.675564 8.675564 8.675564 
##      193      195      201      202      203      209      210      212 
## 8.745688 8.745688 8.586820 8.586820 8.586820 8.675564 8.586820 8.675564 
##      221      224      225      229      240      244      246      247 
## 8.675564 8.745688 8.675564 8.447879 8.586820 8.532000 8.675564 8.675564 
##      250      252      253      255      264      265      274      279 
## 8.745688 8.675564 8.675564 8.532000 8.675564 8.675564 8.745688 8.745688 
##      282      284      287      293      295      299      307      310 
## 8.745688 8.745688 8.586820 8.745688 8.586820 8.745688 8.447879 8.447879 
##      318      320      322      336      337      339      344      347 
## 8.675564 8.675564 8.745688 8.586820 8.586820 8.586820 8.586820 8.675564 
##      349      357      358      359      363      371      377      379 
## 8.675564 8.675564 8.586820 8.532000 8.532000 8.532000 8.675564 8.675564 
##      382      384      386      388      398      402      406      407 
## 8.586820 8.675564 8.745688 8.745688 8.675564 8.586820 8.447879 8.447879 
##      412      415      416      417      418      419      423      425 
## 8.447879 8.745688 8.675564 8.745688 8.604404 8.675564 8.675564 8.675564 
##      429      430      434      435      436      440      444      445 
## 8.675564 8.675564 8.586820 8.586820 8.586820 8.586820 8.532000 8.532000 
##      446      449      450      453      461      471      480      481 
## 8.586820 8.675564 8.675564 8.675564 8.675564 8.675564 8.745688 8.745688 
##      485      492      495      499      500      506      512      518 
## 8.675564 8.745688 8.447879 8.447879 8.447879 8.447879 8.447879 8.532000 
##      524      525      529      532      537      540      550      552 
## 8.745688 8.532000 8.675564 8.745688 8.675564 8.745688 8.675564 8.532000 
##      553      557      562      563      565      577      578      579 
## 8.532000 8.675564 8.675564 8.745688 8.586820 8.586820 8.586820 8.586820 
##      586      587      593      594      596      597      598      605 
## 8.532000 8.586820 8.586820 8.586820 8.532000 8.586820 8.586820 8.447879 
##      607      608      621      622      623      628      639      640 
## 8.447879 8.532000 8.604404 8.604404 8.604404 8.675564 8.675564 8.675564 
##      643      649      650      653      654      657      660      667 
## 8.675564 8.675564 8.675564 8.675564 8.675564 8.532000 8.675564 8.675564 
##      673      678      679      686      687      689      691      695 
## 8.745688 8.745688 8.745688 8.745688 8.745688 8.586820 8.745688 8.447879 
##      699      705      706      718      720      721      723      731 
## 8.447879 8.586820 8.745688 8.675564 8.675564 8.745688 8.476812 8.604404 
##      740      745      747      757      762      765      768      782 
## 8.675564 8.586820 8.586820 8.586820 8.675564 8.586820 8.675564 8.447879 
##      786      789      795      796      797      805      808      821 
## 8.447879 8.675564 8.532000 8.675564 8.675564 8.675564 8.675564 8.675564 
##      822      828      836      838      842      845      847      853 
## 8.586820 8.675564 8.586820 8.675564 8.675564 8.675564 8.532000 8.675564 
##      854      855      857      859      861      868      869      872 
## 8.675564 8.675564 8.675564 8.675564 8.675564 8.745688 8.745688 8.447879 
##      877      881      889      892      896      900      901      904 
## 8.532000 8.447879 8.447879 8.745688 8.586820 8.586820 8.586820 8.586820 
##      906      912      923      927      928      932      933      946 
## 8.675564 8.675564 8.586820 8.675564 8.675564 8.675564 8.675564 8.532000 
##      950      951      958      960      962      965      967      968 
## 8.675564 8.675564 8.745688 8.675564 8.532000 8.532000 8.586820 8.586820 
##      970      972      976      977      978      984      986      992 
## 8.447879 8.447879 8.586820 8.675564 8.586820 8.532000 8.586820 8.675564 
##      996     1001     1002     1003     1007     1009     1011     1012 
## 8.586820 8.586820 8.586820 8.675564 8.586820 8.586820 8.675564 8.675564 
##     1014     1017     1029     1030     1039     1052     1060     1062 
## 8.447879 8.586820 8.675564 8.745688 8.586820 8.604404 8.604404 8.586820 
##     1067     1075     1078     1081     1082     1086     1095     1097 
## 8.586820 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404 
##     1103     1107     1110     1118     1122     1132     1135     1138 
## 8.745688 8.745688 8.447879 8.447879 8.415833 8.586820 8.586820 8.586820 
##     1139     1141     1143     1146     1148     1162     1164     1170 
## 8.604404 8.604404 8.586820 8.604404 8.586820 8.586820 8.604404 8.604404 
##     1176     1182     1184     1191     1195     1202     1205     1207 
## 8.604404 8.586820 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404 
##     1212     1213     1214     1216     1225     1227     1229     1234 
## 8.604404 8.586820 8.532000 8.586820 8.586820 8.745688 8.604404 8.745688 
##     1237     1247     1250     1254     1261     1265     1266     1267 
## 8.532000 8.586820 8.604404 8.604404 8.604404 8.390638 8.390638 8.390638 
##     1274     1278     1280     1281     1284     1285     1290     1296 
## 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 
##     1308     1309     1312     1313     1314     1319     1321     1327 
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 
##     1328     1331     1332     1333     1337     1342     1343     1347 
## 8.532000 8.532000 8.532000 8.532000 8.390638 8.390638 8.390638 8.390638 
##     1348     1350     1358     1361     1367     1372     1377     1379 
## 8.390638 8.390638 8.390638 8.415833 8.532000 8.532000 8.532000 8.532000 
##     1381     1383     1389     1397     1398     1400     1411     1412 
## 8.532000 8.390638 8.390638 8.390638 8.390638 8.476812 8.390638 8.390638 
##     1413     1414     1417     1424     1425     1427     1428     1445 
## 8.390638 8.532000 8.390638 8.390638 8.390638 8.390638 8.532000 8.532000 
##     1457     1459     1466     1468     1472     1477     1482     1486 
## 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.532000 
##     1488     1491     1492     1493     1494     1496     1499     1504 
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.415833 8.390638 
##     1505     1516     1521     1522     1523     1524     1528     1529 
## 8.390638 8.476812 8.390638 8.390638 8.390638 8.390638 8.532000 8.532000 
##     1530     1532     1534     1538     1541     1546     1560     1562 
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.390638 8.390638 8.390638 
##     1567     1569     1577     1582     1587     1588     1599     1602 
## 8.390638 8.390638 8.542353 8.415833 8.415833 8.415833 8.415833 8.476812 
##     1607     1613     1620     1622     1633     1639     1645     1649 
## 8.476812 8.532000 8.390638 8.390638 8.532000 8.532000 8.204444 8.204444 
##     1650     1655     1657     1658     1661     1662     1664     1671 
## 8.204444 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.204444 
##     1672     1674     1676     1681     1687     1688     1689     1694 
## 8.204444 8.476812 8.476812 8.476812 8.476812 8.476812 8.476812 8.204444 
##     1697     1701     1706     1707     1710     1716     1718     1720 
## 8.476812 8.476812 8.415833 8.415833 8.415833 8.415833 8.415833 8.532000 
##     1725     1726     1727     1731     1735     1743     1747     1749 
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.476812 8.476812 
##     1756     1768     1784     1786     1792     1797     1806     1808 
## 8.476812 8.532000 8.532000 8.476812 8.532000 8.476812 8.542353 8.542353 
##     1812     1813     1816     1819     1824     1826     1843     1844 
## 8.476812 8.476812 8.476812 8.476812 8.532000 8.532000 8.532000 8.532000 
##     1846     1847     1850     1853     1859     1860     1866     1868 
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.532000 8.532000 
##     1869     1870     1872     1884     1887     1891     1901     1910 
## 8.542353 8.532000 8.532000 8.532000 8.476812 8.476812 8.532000 8.476812 
##     1911     1919     1924     1926     1927     1929     1932     1947 
## 8.542353 8.476812 8.542353 8.476812 8.532000 8.476812 8.532000 8.532000 
##     1949     1952     1957     1972     1976     1979     1983     1988 
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.476812 8.532000 
##     1993     1996     2002     2006     2011     2012     2019     2024 
## 8.476812 8.542353 8.532000 8.532000 8.532000 8.415833 8.415833 8.154545 
##     2025     2027     2030     2035     2038 
## 8.154545 8.154545 8.154545 8.154545 8.476812
RMSE(predict(model_tree1, test), test$PH)
## [1] 0.1349601

Plotting Tree1

plot(model_tree1, uniform=TRUE, compress=FALSE, margin=.015)
text(model_tree1, all=TRUE, cex=.5)

Neural Network

my.grid <- expand.grid(decay = c(0.5,0,1), size=c(5,6,7))

model_nnet <- train(PH ~ ., data=train, method="nnet",
      maxit=500, tuneGrid = my.grid, trace =F, linout = 1)
print(model_nnet)
## Neural Network 
## 
## 1529 samples
##   23 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1529, 1529, 1529, 1529, 1529, 1529, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE       
##   0.0    5     0.2586472  0.3048120  0.12498432
##   0.0    6     0.5222105  0.3000944  0.16361182
##   0.0    7     1.2898967  0.2696562  0.26220503
##   0.5    5     0.1306745  0.4284715  0.10059015
##   0.5    6     0.1305440  0.4299680  0.10004188
##   0.5    7     0.1295583  0.4372411  0.09956567
##   1.0    5     0.1302672  0.4243708  0.10052455
##   1.0    6     0.1301176  0.4256484  0.10037720
##   1.0    7     0.1292306  0.4328996  0.09958846
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7 and decay = 1.

Random Forest

library(caret)

RMSE(predict(model_nnet, test), test$PH)
## [1] 0.1250552
RMSE(predict(model_tree1, test), test$PH)
## [1] 0.1349601
RMSE(predict(model_lm1, test), test$PH)
## [1] 0.1351587
# Load Necessary Libraries
library(caret)
library(randomForest)



# 1. Combine Features and Target for Training
train$PH <- y_train

# 2. Training the Random Forest Model
rfModel <- randomForest(
  x = train[, setdiff(names(train), "PH")],  # Features (exclude the target variable)
  y = train$PH,                             # Target variable
  importance = TRUE,                        # Enable variable importance
  ntree = 1500                               # Number of trees
)

# 3. Combine Features and Target for Testing
test$PH <- y_test

# 4. Making Predictions on the Test Set
rfPred <- predict(rfModel, test[, setdiff(names(test), "PH")])

# 5. Evaluating Model Performance
testY_vec <- test$PH
performance <- postResample(pred = rfPred, obs = testY_vec)

# Display the Performance Metrics
print(performance)
##       RMSE   Rsquared        MAE 
## 0.10235056 0.65131395 0.07790556
#  Define Cross-Validation Control
train_control <- trainControl(
  method = "cv",          
  number = 5,             
  verboseIter = FALSE,    
  search = "grid"         
)

#   Grid of Hyperparameters for k-Values
k_values <- expand.grid(k = c(3, 5, 7, 9, 11))  

#  Train the KNN Model

knn_model <- train(
  x = X_train,            
  y = y_train,            
  method = "knn",         
  tuneGrid = k_values,    
  trControl = train_control  
)

#  Plot the KNN Model Performance
plot(knn_model)

# Train SVM with radial kernel
svm_model <- train(
  PH ~ ., 
  data = train, 
  method = "svmRadial",
  trControl = trainControl(method = "cv", number = 5),
  tuneLength = 5
)
svm_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1529 samples
##   23 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1224, 1222, 1223, 1223, 1224 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared   MAE       
##   0.25  0.1277438  0.4614226  0.09552216
##   0.50  0.1255571  0.4780651  0.09325780
##   1.00  0.1232271  0.4946128  0.09145115
##   2.00  0.1218984  0.5040928  0.09088473
##   4.00  0.1216624  0.5062811  0.09097519
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02956161
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02956161 and C = 4.
# Predict and evaluate
svm_pred <- predict(svm_model, test)
svm_rmse <- RMSE(svm_pred, test$PH)
svm_r2 <- R2(svm_pred, test$PH)

# Display metrics
cat("SVM Model RMSE:", svm_rmse, "\n")
## SVM Model RMSE: 0.1113341
cat("SVM Model R-Squared:", svm_r2, "\n")
## SVM Model R-Squared: 0.5731233

Model Analysis and Explanatory features

# Calculate RMSE and R-squared for Neural Network Model
rmse_nnet <- RMSE(predict(model_nnet, test), test$PH)
rsq_nnet <- R2(predict(model_nnet, test), test$PH)

# Calculate RMSE and R-squared for Tree-Based Model
rmse_tree1 <- RMSE(predict(model_tree1, test), test$PH)
rsq_tree1 <- R2(predict(model_tree1, test), test$PH)

# Calculate RMSE and R-squared for Linear Model
rmse_lm1 <- RMSE(predict(model_lm1, test), test$PH)
rsq_lm1 <- R2(predict(model_lm1, test), test$PH)

# Calculate RMSE and R-squared for Random Forest Model
rmse_rf <- RMSE(rfPred, test$PH)
rsq_rf <- R2(rfPred, test$PH)

# Calculate RMSE and R-squared for KNN Model
rmse_knn <- RMSE(predict(knn_model, test), test$PH)
rsq_knn <- R2(predict(knn_model, test), test$PH)


# Create a Data Frame of RMSE and R-squared Values
model_performance <- data.frame(
  Model = c("Neural Network", "Tree-Based", "Linear Model", "Random Forest", "KNN", "SVM"),
  RMSE = c(rmse_nnet, rmse_tree1, rmse_lm1, rmse_rf, rmse_knn,svm_rmse),
  R_Squared = c(rsq_nnet, rsq_tree1, rsq_lm1, rsq_rf, rsq_knn,svm_r2)
)

# Display the Performance Data Frame
model_performance <- model_performance %>%
  arrange(RMSE, desc(R_Squared))

# Print the sorted model_performance data frame
print(model_performance)
##            Model      RMSE R_Squared
## 1  Random Forest 0.1023506 0.6513140
## 2            SVM 0.1113341 0.5731233
## 3            KNN 0.1187727 0.5122407
## 4 Neural Network 0.1250552 0.4540662
## 5     Tree-Based 0.1349601 0.3675322
## 6   Linear Model 0.1351587 0.3599708

After Tuning the different models we conclude that best preforming model is the Random Forrest Model with an R_Squared of 0.65 and RMSE of 0.10

# Extract variable of importance
var_importance <- importance(rfModel)

importance_df <- data.frame(
  Variable = rownames(var_importance),
  Importance = var_importance[, "IncNodePurity"]
)
# Order

importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
#ploting


ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Importance Variables in Random Forest",
       x = "Variable",
       y = "Importance (IncNodePurity)") +
  theme_minimal()

# Define the prediction function for the Random Forest model
pred_wrapper <- function(object, newdata) {
  predict(object, newdata = newdata)
}

# Compute SHAP values for the test set
shap_values <- fastshap::explain(
  object = rfModel,
  X = as.data.frame(test[, -which(names(test) == "PH")]),
  pred_wrapper = pred_wrapper
)

# Convert SHAP values to a data frame
shap_df <- as.data.frame(shap_values)
shap_df$Observation <- 1:nrow(shap_df)

# Calculate mean absolute SHAP values for each feature to determine importance
shap_importance <- colMeans(abs(shap_df[, -ncol(shap_df)]))
shap_importance <- sort(shap_importance, decreasing = TRUE)
ordered_features <- names(shap_importance)

# Melt the SHAP values data frame to long format for ggplot
shap_long <- melt(shap_df, id.vars = "Observation", variable.name = "Feature", value.name = "SHAP")

# Convert Feature to a factor ordered by importance (flipped)
shap_long$Feature <- factor(shap_long$Feature, levels = rev(ordered_features))

# Plot the SHAP values with important features at the top
ggplot(shap_long, aes(x = Feature, y = SHAP, fill = Feature)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal() +
  labs(title = "SHAP Values for Features Ordered by Importance",
       y = "SHAP Value",
       x = "Features") +
  theme(legend.position = "none")

# top num vars for plotting correlations 
numerical_vars <- c("usage_cont", "filler_level", "temperature","fill_pressure", "carb_flow")

# Calculate correlation coefficients for numeric variables
correlations <- sapply(numerical_vars, function(var) cor(train[[var]], train$PH, use = "complete.obs"))
print("Correlations of PH and top numerical variables")
## [1] "Correlations of PH and top numerical variables"
correlations
##    usage_cont  filler_level   temperature fill_pressure     carb_flow 
##    -0.3580374     0.3399455    -0.1971432    -0.3304814     0.2047809
# Select the relevant columns
data_subset <- train[, c("PH", numerical_vars)]

# Pivot the data to long format for faceting
data_long <- pivot_longer(data_subset, cols = all_of(numerical_vars), names_to = "Variable", values_to = "Value")

# Create the facet plot
ggplot(data_long, aes(x = Value, y = PH)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  facet_wrap(~ Variable, scales = "free", ncol = 2) +
  labs(title = "Relationships Between PH and Selected Variables",
       x = "Feature Value",
       y = "PH",
        ) +
  theme_minimal()

# Boxplot for the categorical variable 'brand_codeC'

ggplot(train, aes(x = as.factor(brand_codeC), y = PH)) +
    geom_boxplot(fill = "lightblue") +
    labs(title = "Relationship between PH and brand_codeC",
         x = "brand_codeC",
         y = "PH") +
    theme_minimal()

Final Thoughts on the Top Variables for Predicting PH

Combining insights from the variable importance plot, SHAP values, and correlation plots, we can make informed conclusions about how our top 5 variables influence the Random Forest model’s predictions for PH:

usage_cont

  • Importance: Ranked the most important variable in the Random Forest model.

  • SHAP Insight: SHAP values show both positive and negative impacts on PH, suggesting variability in its effect.

  • Correlation Plot: Shows a negative correlation with PH. Higher usage_cont values lead to lower PH levels.

  • Conclusion: usage_cont is a strong predictor, likely capturing flow rate and system efficiency, which impact acidity levels.

filler_level

  • Importance: The second most important variable.

  • SHAP Insight: SHAP values suggest it has a consistent positive influence on PH.

  • Correlation Plot: Indicates a slight positive correlation. Higher filler levels tend to increase PH.

  • Conclusion: Variations in filler levels affect ingredient concentration, influencing PH balance.

temperature

  • Importance: The third most important variable.

  • SHAP Insight: SHAP values indicate a negative influence on PH.

  • Correlation Plot: Shows a clear negative correlation. Higher temperature leads to lower PH.

  • Conclusion: Temperature impacts carbonation and chemical reactions, affecting the acidity and lowering PH.

brand_codeC

  • Importance: The fourth most important variable.

  • SHAP Insight: SHAP values suggest a strong negative impact when Brand C is present.

  • Boxplot: PH is consistently lower when brand_codeC = 1 (Brand C is present).

  • Conclusion: The formulation or process used by Brand C leads to lower PH levels compared to other brands.

fill_pressure

  • Importance: The fifth most important variable.

  • SHAP Insight: SHAP values show a negative impact on PH.

  • Correlation Plot: Displays a negative correlation. Higher fill pressure tends to decrease PH levels.

  • Conclusion: Fill pressure affects the pressure conditions during filling, which can influence ingredient concentration and acidity, thereby lowering PH.

carb_flow

  • Importance: The sixth most important variable.

  • SHAP Insight: SHAP values show a positive impact on PH.

  • Correlation Plot: Displays a slight positive correlation. Higher carb_flow increases PH.

  • Conclusion: Carb flow affects the amount of dissolved CO₂, altering acidity and thereby increasing PH.

Summary of Insights:

usage_cont and temperature have strong negative relationships with PH.

filler_level and carb_flow exhibit positive influences on PH.

fill_pressure negatively influences PH.

brand_codeC distinctly lowers PH.

Conclusion:

In conclusion, while the Random Forest model is the best performer for predicting PH, we need to be clear about its limitations. The model’s R-squared of 0.65 and RMSE of 0.10 are the best scores among the models we trained and tested. But with 35% of the variance in PH left unexplained, there’s room for improvement.

Future analysis should focus on filling the gaps in the data or applying imputation techniques for missing values. We should also consider adding more features or refining existing ones to capture more of what drives PH levels. Right now, the forecasts are based on a model that doesn’t include imputations, and that’s something stakeholders should keep in mind.

The team is confident in the insights delivered so far, but with more complete data and some adjustments, we can make even better predictions in the future.

Forecasts

# Ensure preprocessing is applied to eval_X

# Apply Yeo-Johnson transformation to eval_X
eval_X[, numeric_features] <- predict(preprocess_params, eval_X[, numeric_features])

# Apply robust scaling to eval_X
eval_X[, numeric_features] <- robust_scale(eval_X[, numeric_features], medians, iqrs)
# Remove highly correlated variables from eval_X
eval_X <- eval_X[, !names(eval_X) %in% high_corr_indices]

# Make predictions on the eval_X dataset using the Random Forest model
eval_predictions <- predict(rfModel, eval_X)

# Display the predicted PH values
print("Predicted PH values:")
## [1] "Predicted PH values:"
print(eval_predictions)
##        1        2        3        4        5        6        7        8 
## 8.531851 8.603143 8.496703 8.523643 8.609380 8.529564 8.556028 8.746530 
##        9       10       11       12       13       14       15       16 
## 8.328026 8.581438 8.581967 8.601740 8.710108 8.614099 8.597041 8.615836 
##       17       18       19       20       21       22       23       24 
## 8.487977 8.449442 8.633737 8.682109 8.580535 8.565949 8.553090 8.557087 
##       25       26       27       28       29       30       31       32 
## 8.674184 8.706720 8.321299 8.387354 8.671264 8.719526 8.701226 8.701277 
##       33       34       35       36       37       38       39       40 
## 8.641984 8.730761 8.692891 8.761330 8.682994 8.454105 8.516469 8.654464 
##       41       42       43       44       45       46       47       48 
## 8.755156 8.734625 8.750944 8.735522 8.721475 8.757144 8.571764 8.555226 
##       49       50       51       52       53       54       55       56 
## 8.575807 8.596961 8.584895 8.496598 8.415022 8.533208 8.555514 8.542948 
##       57       58       59       60       61       62       63       64 
## 8.502970 8.655768 8.710757 8.693966 8.767340 8.642309 8.626228 8.607960 
##       65       66       67       68       69       70       71       72 
## 8.708350 8.642886 8.583602 8.706785 8.548307 8.684580 8.632432 8.611212 
##       73       74       75       76       77       78       79       80 
## 8.601411 8.708010 8.705296 8.748947 8.744084 8.428039 8.459278 8.498614 
##       81       82       83       84       85       86       87       88 
## 8.644689 8.648746 8.597674 8.671324 8.674667 8.676715 8.683344 8.675282 
##       89       90       91       92       93       94       95       96 
## 8.610223 8.652668 8.698168 8.601261 8.569489 8.538041 8.521243 8.525146 
##       97       98       99      100      101      102      103      104 
## 8.573202 8.485654 8.540701 8.476891 8.424081 8.511464 8.505408 8.588440 
##      105      106      107      108      109      110      111      112 
## 8.575082 8.576334 8.584466 8.527812 8.652756 8.450518 8.465418 8.439398 
##      113      114      115      116      117      118      119      120 
## 8.490966 8.537927 8.559260 8.626806 8.608619 8.638556 8.717568 8.646098 
##      121      122      123      124      125      126      127      128 
## 8.701178 8.438174 8.542954 8.535612 8.651901 8.448982 8.535418 8.447995 
##      129      130      131      132      133      134      135      136 
## 8.399496 8.507557 8.520241 8.471886 8.354592 8.372130 8.547954 8.400976 
##      137      138      139      140      141      142      143      144 
## 8.382844 8.368475 8.476426 8.457102 8.476027 8.333137 8.339997 8.313790 
##      145      146      147      148      149      150      151      152 
## 8.318392 8.431888 8.467044 8.450848 8.415168 8.378575 8.487884 8.486926 
##      153      154      155      156      157      158      159      160 
## 8.348944 8.407049 8.450529 8.468483 8.372240 8.411245 8.367988 8.332852 
##      161      162      163      164      165      166      167      168 
## 8.545659 8.505944 8.493745 8.492170 8.456895 8.395450 8.555494 8.475673 
##      169      170      171      172      173      174      175      176 
## 8.449899 8.485641 8.463428 8.485945 8.486793 8.533447 8.525938 8.625628 
##      177      178      179      180      181      182      183      184 
## 8.634585 8.492647 8.501467 8.468431 8.555540 8.477963 8.438225 8.452320 
##      185      186      187      188      189      190      191      192 
## 8.456068 8.518920 8.505006 8.445432 8.528754 8.549645 8.559218 8.500270 
##      193      194      195      196      197      198      199      200 
## 8.509977 8.602137 8.572722 8.470956 8.601771 8.300192 8.352504 8.311770
# Add the predictions to the eval_X dataset
eval_X$PH_Predicted <- eval_predictions
# Export the dataset with predictions to a new Excel file
write.csv(eval_X, "StudentEvaluation_WithForecasts.csv", row.names = FALSE)

# Confirm export completion
cat("Forecasts successfully generated and exported to 'StudentEvaluation_WithForecasts.csv'.\n")
## Forecasts successfully generated and exported to 'StudentEvaluation_WithForecasts.csv'.