Introduction and Abstract
In response to new regulations requiring ABC Beverage to better understand our manufacturing process and predictive factors for PH levels, our data science team analyzed historical production data to build a reliable forecasting model. The goal was to identify the key factors influencing PH and create accurate predictions to support compliance and process optimization.
Methodology
We started by cleaning the data, addressing missing values, removing duplicates, and encoding categorical variables. Numerical features were standardized to ensure consistency. For exploratory data analysis, we visualized the data, checked for outliers, and analyzed correlations between features.
Next, we trained and tested several models including linear regression, decision trees, neural networks, random forests, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). Model performance was evaluated using RMSE and R-squared metrics.
Final Model and Key Findings
The random forest model delivered the best performance with an R-squared of 0.65 and an RMSE of 0.10. The top five features driving PH predictions were usage_cont, filler_level, temperature, brand_codeC, and carb_flow. SHAP analysis confirmed their importance and provided deeper insights into their impact on PH levels.
PH forecasts have been generated using the final model and added to the StudentEvaluation.xlsx file for further review and action.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readxl)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggcorrplot)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(nnet)
library(fastshap)
##
## Attaching package: 'fastshap'
##
## The following object is masked from 'package:dplyr':
##
## explain
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:gridExtra':
##
## combine
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
set.seed(12345)
knitr::opts_chunk$set(
warning = FALSE,
message = FALSE,
echo = TRUE,
fig.width = 8,
fig.height = 6
)
# Load the specific sheet from the first Excel file
training_data <- read_excel("StudentData.xlsx", sheet = "Subset")
# Load the specific sheet from the second Excel file
testing_data <- read_excel("StudentEvaluation.xlsx", sheet = "Subset (2)")
# right away, remove any observations with no PH value.
training_data <-training_data[!is.na(training_data[['PH']]), ]
X <- select(training_data, -PH)
y <- training_data$PH
# no PH values are provided in studentEvaluation.xlsx, so we cannot use it
# for testing. We can only use it to make predictions.
eval_X <- select(testing_data, -PH)
eval_X <- eval_X %>% drop_na()
# Inspect the structure of the data
glimpse(X)
## Rows: 2,567
## Columns: 32
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
glimpse(eval_X)
## Rows: 200
## Columns: 32
## $ `Brand Code` <chr> "A", "B", "B", "A", "A", "B", "B", "B", "C", "B", …
## $ `Carb Volume` <dbl> 5.393333, 5.293333, 5.406667, 5.480000, 5.406667, …
## $ `Fill Ounces` <dbl> 23.95333, 23.92000, 24.20000, 23.93333, 23.92000, …
## $ `PC Volume` <dbl> 0.2266667, 0.3033333, 0.1600000, 0.2433333, 0.3326…
## $ `Carb Pressure` <dbl> 63.2, 66.4, 69.4, 65.2, 66.8, 63.2, 65.0, 63.8, 64…
## $ `Carb Temp` <dbl> 135.0, 140.4, 142.2, 134.6, 138.0, 139.6, 138.8, 1…
## $ PSC <dbl> 0.042, 0.068, 0.040, 0.088, 0.246, 0.184, 0.152, 0…
## $ `PSC Fill` <dbl> 0.22, 0.10, 0.30, 0.14, 0.48, 0.26, 0.12, 0.18, 0.…
## $ `PSC CO2` <dbl> 0.08, 0.02, 0.06, 0.00, 0.04, 0.20, 0.00, 0.02, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 120.2, 115.0, 117.6, 136.0, 117.2, 117.0, 1…
## $ `Fill Pressure` <dbl> 46.2, 45.8, 51.4, 46.2, 43.8, 46.2, 45.8, 46.4, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 112, 98, 94, 108, 110, 96, 100, 100, 92, 90, 90, 7…
## $ `Filler Level` <dbl> 120.0, 119.4, 116.0, 119.6, 121.0, 118.4, 119.6, 1…
## $ `Filler Speed` <dbl> 4012, 4010, 4018, 4010, 4010, 4010, 4010, 4016, 40…
## $ Temperature <dbl> 65.6, 65.6, 66.4, 66.8, 65.8, 65.8, 65.4, 65.6, 67…
## $ `Usage cont` <dbl> 17.60, 24.18, 21.32, 17.68, 17.70, 17.16, 20.52, 2…
## $ `Carb Flow` <dbl> 2916, 3056, 3214, 3042, 2502, 3100, 2926, 2954, 30…
## $ Density <dbl> 1.50, 0.90, 0.88, 1.48, 1.52, 0.86, 0.92, 0.94, 0.…
## $ MFR <dbl> 735.8, 734.8, 752.0, 729.8, 741.2, 735.8, 735.6, 7…
## $ Balling <dbl> 2.942, 1.448, 1.398, 2.894, 2.992, 1.348, 1.498, 1…
## $ `Pressure Vacuum` <dbl> -4.4, -4.2, -4.0, -4.2, -4.4, -4.2, -4.8, -4.8, -4…
## $ `Oxygen Filler` <dbl> 0.030, 0.046, 0.082, 0.042, 0.046, 0.048, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46, 46, 50, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46…
## $ `Air Pressurer` <dbl> 147.2, 146.6, 145.8, 145.0, 146.2, 147.0, 147.0, 1…
## $ `Alch Rel` <dbl> 7.14, 6.52, 6.50, 7.18, 7.14, 6.50, 6.54, 6.54, 6.…
## $ `Carb Rel` <dbl> 5.58, 5.34, 5.38, 5.46, 5.44, 5.38, 5.28, 5.22, 5.…
## $ `Balling Lvl` <dbl> 3.04, 1.46, 1.46, 3.02, 3.10, 1.42, 1.46, 1.44, 1.…
# Check for missing values
sum(is.na(X))
## [1] 812
sum(is.na(eval_X))
## [1] 0
# View a summary of the data
summary(X)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2567 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27724
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08464 Mean :0.1953
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 70.20 Median :123.2 Median :46.40
## Mean :0.05644 Mean : 24.63 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :32 NA's :18
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 62.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.46 Mean :20.99 Mean :20.48 Mean : 96.31
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :28
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1166
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3030
## Mean :109.3 Mean :3688 Mean :65.96 Mean :20.99 Mean :2472
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3188
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :16 NA's :54 NA's :12 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :0.160 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.:1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median :1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean :2.200 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.:3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. :4.012 Max. :-3.600
## NA's :208
## Oxygen Filler Bowl Setpoint Pressure Setpoint Air Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04643 Mean :109.3 Mean :47.61 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :11 NA's :2 NA's :12
## Alch Rel Carb Rel Balling Lvl
## Min. :5.280 Min. :4.960 Min. :0.000
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.380
## Median :6.560 Median :5.400 Median :1.480
## Mean :6.898 Mean :5.437 Mean :2.052
## 3rd Qu.:7.240 3rd Qu.:5.550 3rd Qu.:3.140
## Max. :8.620 Max. :6.060 Max. :3.660
## NA's :7 NA's :8 NA's :1
summary(eval_X)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:200 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23467
## Mode :character Median :5.343 Median :23.97 Median :0.27533
## Mean :5.367 Mean :23.97 Mean :0.27681
## 3rd Qu.:5.453 3rd Qu.:24.01 3rd Qu.:0.32150
## Max. :5.633 Max. :24.20 Max. :0.46400
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.55 1st Qu.:138.6 1st Qu.:0.04200 1st Qu.:0.1200
## Median :67.80 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.15 Mean :141.2 Mean :0.08157 Mean :0.1857
## 3rd Qu.:70.05 3rd Qu.:143.4 3rd Qu.:0.10450 3rd Qu.:0.2400
## Max. :77.40 Max. :154.0 Max. :0.24600 Max. :0.5600
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.0000 Min. :-100.20 Min. :113.0 Min. :43.60
## 1st Qu.:0.0200 1st Qu.:-100.00 1st Qu.:119.4 1st Qu.:46.00
## Median :0.0400 Median : 107.50 Median :123.4 Median :48.00
## Mean :0.0519 Mean : 31.25 Mean :122.7 Mean :48.37
## 3rd Qu.:0.0600 3rd Qu.: 144.10 3rd Qu.:125.2 3rd Qu.:50.20
## Max. :0.2400 Max. : 199.00 Max. :136.0 Max. :54.80
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 72.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 89.50
## Median :11.90 Median :30.7 Median :30.30 Median : 96.00
## Mean :13.35 Mean :23.8 Mean :23.09 Mean : 95.15
## 3rd Qu.:21.00 3rd Qu.:35.4 3rd Qu.:34.60 3rd Qu.:100.50
## Max. :45.40 Max. :61.4 Max. :49.20 Max. :128.00
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 69.20 Min. :1008 Min. :63.8 Min. :14.04 Min. :1048
## 1st Qu.: 99.35 1st Qu.:3898 1st Qu.:65.4 1st Qu.:18.11 1st Qu.:2692
## Median :113.80 Median :3984 Median :65.8 Median :21.67 Median :3066
## Mean :108.28 Mean :3851 Mean :65.9 Mean :21.09 Mean :2659
## 3rd Qu.:119.85 3rd Qu.:3996 3rd Qu.:66.4 3rd Qu.:23.79 3rd Qu.:3245
## Max. :137.00 Max. :4020 Max. :71.6 Max. :24.34 Max. :3824
## Density MFR Balling Pressure Vacuum
## Min. :0.460 Min. : 95.0 Min. :1.048 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:709.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :726.0 Median :1.648 Median :-5.400
## Mean :1.185 Mean :707.0 Mean :2.219 Mean :-5.229
## 3rd Qu.:1.620 3rd Qu.:731.5 3rd Qu.:3.290 3rd Qu.:-5.000
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## Oxygen Filler Bowl Setpoint Pressure Setpoint Air Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :141.2
## 1st Qu.:0.01570 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03350 Median :120.0 Median :48.00 Median :142.6
## Mean :0.04585 Mean :108.6 Mean :47.87 Mean :142.8
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:142.8
## Max. :0.35400 Max. :130.0 Max. :52.00 Max. :147.2
## Alch Rel Carb Rel Balling Lvl
## Min. :6.40 Min. :5.220 Min. :1.260
## 1st Qu.:6.54 1st Qu.:5.340 1st Qu.:1.380
## Median :6.58 Median :5.400 Median :1.460
## Mean :6.89 Mean :5.435 Mean :2.056
## 3rd Qu.:7.16 3rd Qu.:5.540 3rd Qu.:3.080
## Max. :7.82 Max. :5.740 Max. :3.420
head(X)
## # A tibble: 6 × 32
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # ℹ 27 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # `Pressure Vacuum` <dbl>, `Oxygen Filler` <dbl>, `Bowl Setpoint` <dbl>, …
head(eval_X)
## # A tibble: 6 × 32
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A 5.39 24.0 0.227 63.2
## 2 B 5.29 23.9 0.303 66.4
## 3 B 5.41 24.2 0.16 69.4
## 4 A 5.48 23.9 0.243 65.2
## 5 A 5.41 23.9 0.333 66.8
## 6 B 5.26 24.1 0.22 63.2
## # ℹ 27 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # `Pressure Vacuum` <dbl>, `Oxygen Filler` <dbl>, `Bowl Setpoint` <dbl>, …
# Remove rows with missing data
y <- y[complete.cases(X)]
X <- X %>% drop_na()
eval_X <- eval_X %>% drop_na()
# Standardize column names
X <- X %>% rename_all(tolower) %>%
rename_all(gsub, pattern = " ", replacement = "_")
eval_X <- eval_X %>% rename_all(tolower) %>%
rename_all(gsub, pattern = " ", replacement = "_")
# remove duplicates
X <- X %>% distinct()
eval_X <- eval_X %>% distinct()
# remove the columns that only have 1 unique value
single_value_cols <- sapply(X, function(col) length(unique(col)) == 1)
X <- X[, !single_value_cols, drop = FALSE]
eval_X <- eval_X[, !single_value_cols, drop = FALSE]
missing_train <- sapply(X, function(col) sum(is.na(col)))
missing_test <- sapply(eval_X, function(col) sum(is.na(col)))
print(missing_train)
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## 0 0 0 0
## filler_level filler_speed temperature usage_cont
## 0 0 0 0
## carb_flow density mfr balling
## 0 0 0 0
## pressure_vacuum oxygen_filler bowl_setpoint pressure_setpoint
## 0 0 0 0
## air_pressurer alch_rel carb_rel balling_lvl
## 0 0 0 0
print(missing_test)
## brand_code carb_volume fill_ounces pc_volume
## 0 0 0 0
## carb_pressure carb_temp psc psc_fill
## 0 0 0 0
## psc_co2 mnf_flow carb_pressure1 fill_pressure
## 0 0 0 0
## hyd_pressure1 hyd_pressure2 hyd_pressure3 hyd_pressure4
## 0 0 0 0
## filler_level filler_speed temperature usage_cont
## 0 0 0 0
## carb_flow density mfr balling
## 0 0 0 0
## pressure_vacuum oxygen_filler bowl_setpoint pressure_setpoint
## 0 0 0 0
## air_pressurer alch_rel carb_rel balling_lvl
## 0 0 0 0
# Convert character columns to factors and ensure date columns are correctly formatted
X <- X %>%
mutate(across(where(is.character), as.factor))
eval_X <- eval_X %>%
mutate(across(where(is.character), as.factor))
# one-hot-encode the brand_code field with dummy variables
encode_var <- function(df, col){
ohm <- model.matrix(~ . - 1, data = df[, col, drop = FALSE])
ohm <- as.data.frame(ohm)
ohm <- lapply(ohm, as.factor)
return(cbind(df[ , !names(df) %in% col, drop = FALSE], ohm))
}
X <- encode_var(X, 'brand_code')
eval_X <- encode_var(eval_X, 'brand_code')
# Data types
str(X)
## 'data.frame': 2038 obs. of 35 variables:
## $ carb_volume : num 5.49 5.38 5.25 5.27 5.32 ...
## $ fill_ounces : num 24.3 23.9 24 24 23.9 ...
## $ pc_volume : num 0.111 0.269 0.263 0.231 0.259 ...
## $ carb_pressure : num 67.2 66.6 64.2 72 66.2 61.6 71.6 72.6 68 63.8 ...
## $ carb_temp : num 137 138 140 147 139 ...
## $ psc : num 0.026 0.09 0.132 0.014 0.078 0.11 0.096 0.16 0.034 0.124 ...
## $ psc_fill : num 0.16 0.24 0.12 0.24 0.18 ...
## $ psc_co2 : num 0.12 0.04 0.14 0.06 0.04 ...
## $ mnf_flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ carb_pressure1 : num 118 120 121 120 120 ...
## $ fill_pressure : num 45.8 45.6 46 45.2 46.6 46.6 46 46.6 47.4 48.8 ...
## $ hyd_pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hyd_pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hyd_pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hyd_pressure4 : num 92 116 90 108 94 86 94 92 96 92 ...
## $ filler_level : num 119 120 120 121 120 ...
## $ filler_speed : num 4010 4014 4014 4028 4020 ...
## $ temperature : num 65.6 66.2 65.4 66.6 65 65.4 65 65 65.2 68.8 ...
## $ usage_cont : num 17.7 23.8 18.4 13.5 19 ...
## $ carb_flow : num 3054 2948 2902 3038 3056 ...
## $ density : num 1.54 1.52 0.9 0.9 0.9 0.92 0.92 0.9 0.9 0.46 ...
## $ mfr : num 723 739 740 692 727 ...
## $ balling : num 3.04 2.99 1.45 1.45 1.45 ...
## $ pressure_vacuum : num -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.2 ...
## $ oxygen_filler : num 0.03 0.024 0.064 0.022 0.03 0.058 0.046 0.066 0.046 0.164 ...
## $ bowl_setpoint : num 120 120 120 120 120 120 120 120 120 120 ...
## $ pressure_setpoint: num 46 46 46 46 46 46 46 46 46 46 ...
## $ air_pressurer : num 146 147 147 146 146 ...
## $ alch_rel : num 7.14 7.16 6.52 6.54 6.52 6.52 6.52 6.52 6.52 6.52 ...
## $ carb_rel : num 5.44 5.44 5.34 5.34 5.34 5.34 5.34 5.34 5.34 5.34 ...
## $ balling_lvl : num 3.04 3.02 1.44 1.38 1.44 1.44 1.44 1.44 1.42 1.46 ...
## $ brand_codeA : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ brand_codeB : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 1 ...
## $ brand_codeC : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ brand_codeD : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
The cell below creates a function that can be used to count the number of outliers in each column of a dataframe:
count_outliers <- function(dataframe) {
outlier_counts <- sapply(dataframe, function(column) {
if (is.numeric(column)) {
Q1 <- quantile(column, 0.25, na.rm = TRUE)
Q3 <- quantile(column, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
sum(column < lower_bound | column > upper_bound, na.rm = TRUE)
} else {
NA
}
})
return(outlier_counts)
}
The count_outliers
function is used below to plot the
number of outliers present in each predictor field:
outlier_counts <- count_outliers(X)
outlier_counts <- data.frame(
Column = names(outlier_counts),
Outliers = as.numeric(outlier_counts)
)
outlier_counts <- na.omit(outlier_counts)
ggplot(outlier_counts, aes(x = reorder(Column, -Outliers), y = Outliers)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(
title = "# of Outliers Present in Predictor Fields",
x = "Variable Name",
y = "Number of Outliers"
) +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Next, the cell below includes function that tests whether or not each column within the dataframe is normal.
test_normality <- function(dataframe) {
results <- lapply(dataframe, function(column) {
if (is.numeric(column)) {
test_result <- tryCatch(
shapiro.test(column),
error = function(e) NULL # handle errors (e.g., small sample size)
)
if (!is.null(test_result)) {
return(data.frame(
Statistic = test_result$statistic,
P_Value = test_result$p.value
))
} else {
return(data.frame(Statistic = NA, P_Value = NA))
}
} else {
# return NA for non-numeric columns
return(data.frame(Statistic = NA, P_Value = NA))
}
})
# combine results into a dataframe
results_df <- do.call(rbind, results)
rownames(results_df) <- names(dataframe)
return(results_df)
}
normality_results <- test_normality(X)
normality_results
## Statistic P_Value
## carb_volume 0.9570379 4.605382e-24
## fill_ounces 0.9946490 1.030640e-06
## pc_volume 0.9852127 1.110007e-13
## carb_pressure 0.9964220 9.849891e-05
## carb_temp 0.9960234 3.279974e-05
## psc 0.9547630 1.204310e-24
## psc_fill 0.9393717 4.303840e-28
## psc_co2 0.8376258 1.905078e-41
## mnf_flow 0.7261226 1.232213e-49
## carb_pressure1 0.9866239 7.233734e-13
## fill_pressure 0.8638292 7.322913e-39
## hyd_pressure1 0.9032089 4.276136e-34
## hyd_pressure2 0.7994928 1.168334e-44
## hyd_pressure3 0.8109674 9.495383e-44
## hyd_pressure4 0.9567037 3.769116e-24
## filler_level 0.8375014 1.855578e-41
## filler_speed 0.3512008 8.071603e-65
## temperature 0.9397270 5.074408e-28
## usage_cont 0.8693097 2.854520e-38
## carb_flow 0.7042791 6.671293e-51
## density 0.8204577 5.819635e-43
## mfr 0.4344164 2.904769e-62
## balling 0.7700461 8.205418e-47
## pressure_vacuum 0.9618916 9.729409e-23
## oxygen_filler 0.7951776 5.449870e-45
## bowl_setpoint 0.7916412 2.945908e-45
## pressure_setpoint 0.7390249 7.562504e-49
## air_pressurer 0.7021706 5.082926e-51
## alch_rel 0.7078074 1.055353e-50
## carb_rel 0.9392944 4.152571e-28
## balling_lvl 0.7081517 1.103931e-50
## brand_codeA NA NA
## brand_codeB NA NA
## brand_codeC NA NA
## brand_codeD NA NA
The cell below produces a correlation matrix to show the correlations between all pairs of predictor variables, helping to assess the level of multicollinearity.
corr_matrix <- cor(select(X, where(is.numeric)))
ggcorrplot(corr_matrix, lab = FALSE, title = "Correlation Matrix") +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6)
)
The cell below produces scatterplots of each explanatory field with the predictor variable PH. Each scatterplot includes the correlation coefficent and a best fit line relating the two fields.
plot_scatter_with_fit <- function(X, y) {
plots <- list()
for (col in names(X)) {
correlation <- cor(X[[col]], y)
plot <- ggplot(data = data.frame(x = X[[col]], y = y), aes(x = x, y = y)) +
geom_point(alpha=.02) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
subtitle = paste("Correlation:", round(correlation, 2)),
x = col,
y = "PH"
) +
theme_minimal()
plots[[col]] <- plot
}
return(plots)
}
scatter_plots <- plot_scatter_with_fit(select(X, where(is.numeric)), y)
for (plot in scatter_plots) {
print(plot)
}
The cell below plots the distributions of the categorical features in
X
:
plot_categorical_distributions <- function(df) {
# Identify categorical columns
categorical_columns <-
names(df)[sapply(df, is.factor) | sapply(df, is.character)]
# Create bar plots for each categorical column
plots <- lapply(categorical_columns, function(column) {
ggplot(df, aes_string(x = column)) +
geom_bar(fill = "steelblue", color = "black") +
labs(
title = paste("Distribution of", column),
x = column,
y = "Count"
) +
theme_minimal()
})
return(plots)
}
cat_plots <- plot_categorical_distributions(select(X, where(is.factor)))
for (plot in cat_plots) {
print(plot)
}
Split the data into testing and training sets.
train_index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
Apply the Yeo-Johnson transformation.
numeric_features <- sapply(X, is.numeric)
# fit the Yeo-Johnson transformation using training data
preprocess_params <- preProcess(X_train[, numeric_features],
method = "YeoJohnson")
# pply the transformation to training and testing data
X_train[, numeric_features] <- predict(preprocess_params,
X_train[, numeric_features])
X_test[, numeric_features] <- predict(preprocess_params,
X_test[, numeric_features])
Use robust scaling to scale the data:
# only use medians and iqrs from training data
medians <- apply(X_train[, numeric_features], 2, median)
iqrs <- apply(X_train[, numeric_features], 2, IQR)
# performs the robust scaling using the IQRs and medians
robust_scale <- function(data, medians, iqrs) {
scaled_data <- sweep(data, 2, medians, "-")
scaled_data <- sweep(scaled_data, 2, iqrs, "/")
return(scaled_data)
}
# transform the training and testing data
X_train[, numeric_features] <- robust_scale(X_train[, numeric_features],
medians, iqrs)
X_test[, numeric_features] <- robust_scale(X_test[, numeric_features],
medians, iqrs)
Finds all pairs of highly correlated variables (\(|r| > 0.7\)) and randomly removes one variable from each pair.
high_corr_indices <- findCorrelation(corr_matrix, cutoff = 0.7, names = TRUE)
X_train <- X_train[, !names(X_train) %in% high_corr_indices]
X_test <- X_test[, !names(X_test) %in% high_corr_indices]
Make new variables (and update old ones) using the transformed data:
X <- rbind(X_train, X_test)
train <- X_train
train$PH <- y_train
test <- X_test
test$PH <- y_test
Check again for outliers:
outlier_counts <- count_outliers(X)
outlier_counts <- data.frame(
Column = names(outlier_counts),
Outliers = as.numeric(outlier_counts)
)
outlier_counts <- na.omit(outlier_counts)
ggplot(outlier_counts, aes(x = reorder(Column, -Outliers), y = Outliers)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(
title = "# of Outliers Present in Predictor Fields",
x = "Variable Name",
y = "Number of Outliers"
) +
theme(axis.text.x = element_text(angle = 45, hjust=1))
Check again for normal distributions:
normality_results <- test_normality(X)
normality_results
## Statistic P_Value
## carb_volume 0.9570379 4.605382e-24
## fill_ounces 0.9946490 1.030640e-06
## pc_volume 0.9892472 3.426857e-11
## carb_temp 0.9980432 1.438409e-02
## psc 0.9547630 1.204310e-24
## psc_fill 0.9887001 1.463930e-11
## psc_co2 0.8376258 1.905078e-41
## carb_pressure1 0.9873377 1.961692e-12
## fill_pressure 0.8638292 7.322913e-39
## hyd_pressure1 0.8740895 9.712845e-38
## hyd_pressure4 0.9570190 4.553325e-24
## filler_level 0.8375014 1.855578e-41
## filler_speed 0.3512008 8.071603e-65
## temperature 0.9397270 5.074408e-28
## usage_cont 0.8693097 2.854520e-38
## carb_flow 0.7919069 3.084296e-45
## pressure_vacuum 0.9802522 3.481334e-16
## oxygen_filler 0.7951776 5.449870e-45
## air_pressurer 0.7021706 5.082926e-51
## brand_codeA NA NA
## brand_codeB NA NA
## brand_codeC NA NA
## brand_codeD NA NA
Check again for multicollinearity:
corr_matrix <- cor(select(X, where(is.numeric)))
ggcorrplot(corr_matrix, lab = FALSE, title = "Correlation Matrix") +
theme(
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6)
)
We now have scaled, transformed and removed mutlti collenaer pairs from our data, we can proceed to building models.
The first models used is a simple Linear Regression
model_lm1 <- lm(train$PH ~ ., train)
summary(model_lm1)
##
## Call:
## lm(formula = train$PH ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51212 -0.07911 0.01401 0.09560 0.44786
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6571515 0.0169592 510.468 < 2e-16 ***
## carb_volume -0.0211829 0.0111656 -1.897 0.05800 .
## fill_ounces -0.0069887 0.0044412 -1.574 0.11578
## pc_volume -0.0030710 0.0053001 -0.579 0.56240
## carb_temp -0.0019803 0.0048417 -0.409 0.68258
## psc -0.0042448 0.0048281 -0.879 0.37944
## psc_fill -0.0065565 0.0055376 -1.184 0.23660
## psc_co2 -0.0062008 0.0035136 -1.765 0.07780 .
## carb_pressure1 0.0355027 0.0061499 5.773 9.45e-09 ***
## fill_pressure -0.0243515 0.0062108 -3.921 9.22e-05 ***
## hyd_pressure1 0.0003830 0.0090633 0.042 0.96630
## hyd_pressure4 0.0074237 0.0078570 0.945 0.34489
## filler_level 0.0979577 0.0090507 10.823 < 2e-16 ***
## filler_speed -0.0036012 0.0011146 -3.231 0.00126 **
## temperature -0.0302962 0.0047424 -6.388 2.23e-10 ***
## usage_cont -0.0601566 0.0083179 -7.232 7.54e-13 ***
## carb_flow 0.0543009 0.0102074 5.320 1.20e-07 ***
## pressure_vacuum 0.0021836 0.0050273 0.434 0.66409
## oxygen_filler 0.0001932 0.0032632 0.059 0.95279
## air_pressurer -0.0006421 0.0024088 -0.267 0.78985
## brand_codeA1 -0.1031997 0.0173165 -5.960 3.14e-09 ***
## brand_codeB1 -0.0473982 0.0192697 -2.460 0.01402 *
## brand_codeC1 -0.1860677 0.0227281 -8.187 5.66e-16 ***
## brand_codeD1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1369 on 1506 degrees of freedom
## Multiple R-squared: 0.3715, Adjusted R-squared: 0.3623
## F-statistic: 40.46 on 22 and 1506 DF, p-value: < 2.2e-16
Decision Tree
model_tree1 <- rpart(train$PH ~., data = train)
predict(model_tree1, test)
## 2 4 5 10 12 14 15 23
## 8.532000 8.586820 8.508000 8.447879 8.508000 8.447879 8.586820 8.604404
## 24 26 30 33 34 37 48 51
## 8.508000 8.675564 8.447879 8.586820 8.532000 8.586820 8.586820 8.675564
## 59 60 62 66 68 69 72 74
## 8.586820 8.586820 8.586820 8.675564 8.532000 8.508000 8.586820 8.532000
## 82 83 84 87 91 95 97 98
## 8.675564 8.675564 8.586820 8.586820 8.532000 8.675564 8.745688 8.745688
## 102 106 119 121 123 131 133 137
## 8.675564 8.586820 8.447879 8.447879 8.447879 8.675564 8.586820 8.532000
## 147 152 153 159 165 166 183 191
## 8.586820 8.532000 8.675564 8.532000 8.586820 8.675564 8.675564 8.675564
## 193 195 201 202 203 209 210 212
## 8.745688 8.745688 8.586820 8.586820 8.586820 8.675564 8.586820 8.675564
## 221 224 225 229 240 244 246 247
## 8.675564 8.745688 8.675564 8.447879 8.586820 8.532000 8.675564 8.675564
## 250 252 253 255 264 265 274 279
## 8.745688 8.675564 8.675564 8.532000 8.675564 8.675564 8.745688 8.745688
## 282 284 287 293 295 299 307 310
## 8.745688 8.745688 8.586820 8.745688 8.586820 8.745688 8.447879 8.447879
## 318 320 322 336 337 339 344 347
## 8.675564 8.675564 8.745688 8.586820 8.586820 8.586820 8.586820 8.675564
## 349 357 358 359 363 371 377 379
## 8.675564 8.675564 8.586820 8.532000 8.532000 8.532000 8.675564 8.675564
## 382 384 386 388 398 402 406 407
## 8.586820 8.675564 8.745688 8.745688 8.675564 8.586820 8.447879 8.447879
## 412 415 416 417 418 419 423 425
## 8.447879 8.745688 8.675564 8.745688 8.604404 8.675564 8.675564 8.675564
## 429 430 434 435 436 440 444 445
## 8.675564 8.675564 8.586820 8.586820 8.586820 8.586820 8.532000 8.532000
## 446 449 450 453 461 471 480 481
## 8.586820 8.675564 8.675564 8.675564 8.675564 8.675564 8.745688 8.745688
## 485 492 495 499 500 506 512 518
## 8.675564 8.745688 8.447879 8.447879 8.447879 8.447879 8.447879 8.532000
## 524 525 529 532 537 540 550 552
## 8.745688 8.532000 8.675564 8.745688 8.675564 8.745688 8.675564 8.532000
## 553 557 562 563 565 577 578 579
## 8.532000 8.675564 8.675564 8.745688 8.586820 8.586820 8.586820 8.586820
## 586 587 593 594 596 597 598 605
## 8.532000 8.586820 8.586820 8.586820 8.532000 8.586820 8.586820 8.447879
## 607 608 621 622 623 628 639 640
## 8.447879 8.532000 8.604404 8.604404 8.604404 8.675564 8.675564 8.675564
## 643 649 650 653 654 657 660 667
## 8.675564 8.675564 8.675564 8.675564 8.675564 8.532000 8.675564 8.675564
## 673 678 679 686 687 689 691 695
## 8.745688 8.745688 8.745688 8.745688 8.745688 8.586820 8.745688 8.447879
## 699 705 706 718 720 721 723 731
## 8.447879 8.586820 8.745688 8.675564 8.675564 8.745688 8.476812 8.604404
## 740 745 747 757 762 765 768 782
## 8.675564 8.586820 8.586820 8.586820 8.675564 8.586820 8.675564 8.447879
## 786 789 795 796 797 805 808 821
## 8.447879 8.675564 8.532000 8.675564 8.675564 8.675564 8.675564 8.675564
## 822 828 836 838 842 845 847 853
## 8.586820 8.675564 8.586820 8.675564 8.675564 8.675564 8.532000 8.675564
## 854 855 857 859 861 868 869 872
## 8.675564 8.675564 8.675564 8.675564 8.675564 8.745688 8.745688 8.447879
## 877 881 889 892 896 900 901 904
## 8.532000 8.447879 8.447879 8.745688 8.586820 8.586820 8.586820 8.586820
## 906 912 923 927 928 932 933 946
## 8.675564 8.675564 8.586820 8.675564 8.675564 8.675564 8.675564 8.532000
## 950 951 958 960 962 965 967 968
## 8.675564 8.675564 8.745688 8.675564 8.532000 8.532000 8.586820 8.586820
## 970 972 976 977 978 984 986 992
## 8.447879 8.447879 8.586820 8.675564 8.586820 8.532000 8.586820 8.675564
## 996 1001 1002 1003 1007 1009 1011 1012
## 8.586820 8.586820 8.586820 8.675564 8.586820 8.586820 8.675564 8.675564
## 1014 1017 1029 1030 1039 1052 1060 1062
## 8.447879 8.586820 8.675564 8.745688 8.586820 8.604404 8.604404 8.586820
## 1067 1075 1078 1081 1082 1086 1095 1097
## 8.586820 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404
## 1103 1107 1110 1118 1122 1132 1135 1138
## 8.745688 8.745688 8.447879 8.447879 8.415833 8.586820 8.586820 8.586820
## 1139 1141 1143 1146 1148 1162 1164 1170
## 8.604404 8.604404 8.586820 8.604404 8.586820 8.586820 8.604404 8.604404
## 1176 1182 1184 1191 1195 1202 1205 1207
## 8.604404 8.586820 8.604404 8.604404 8.604404 8.604404 8.604404 8.604404
## 1212 1213 1214 1216 1225 1227 1229 1234
## 8.604404 8.586820 8.532000 8.586820 8.586820 8.745688 8.604404 8.745688
## 1237 1247 1250 1254 1261 1265 1266 1267
## 8.532000 8.586820 8.604404 8.604404 8.604404 8.390638 8.390638 8.390638
## 1274 1278 1280 1281 1284 1285 1290 1296
## 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638
## 1308 1309 1312 1313 1314 1319 1321 1327
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000
## 1328 1331 1332 1333 1337 1342 1343 1347
## 8.532000 8.532000 8.532000 8.532000 8.390638 8.390638 8.390638 8.390638
## 1348 1350 1358 1361 1367 1372 1377 1379
## 8.390638 8.390638 8.390638 8.415833 8.532000 8.532000 8.532000 8.532000
## 1381 1383 1389 1397 1398 1400 1411 1412
## 8.532000 8.390638 8.390638 8.390638 8.390638 8.476812 8.390638 8.390638
## 1413 1414 1417 1424 1425 1427 1428 1445
## 8.390638 8.532000 8.390638 8.390638 8.390638 8.390638 8.532000 8.532000
## 1457 1459 1466 1468 1472 1477 1482 1486
## 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.390638 8.532000
## 1488 1491 1492 1493 1494 1496 1499 1504
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.415833 8.390638
## 1505 1516 1521 1522 1523 1524 1528 1529
## 8.390638 8.476812 8.390638 8.390638 8.390638 8.390638 8.532000 8.532000
## 1530 1532 1534 1538 1541 1546 1560 1562
## 8.532000 8.532000 8.532000 8.532000 8.532000 8.390638 8.390638 8.390638
## 1567 1569 1577 1582 1587 1588 1599 1602
## 8.390638 8.390638 8.542353 8.415833 8.415833 8.415833 8.415833 8.476812
## 1607 1613 1620 1622 1633 1639 1645 1649
## 8.476812 8.532000 8.390638 8.390638 8.532000 8.532000 8.204444 8.204444
## 1650 1655 1657 1658 1661 1662 1664 1671
## 8.204444 8.532000 8.532000 8.532000 8.532000 8.532000 8.532000 8.204444
## 1672 1674 1676 1681 1687 1688 1689 1694
## 8.204444 8.476812 8.476812 8.476812 8.476812 8.476812 8.476812 8.204444
## 1697 1701 1706 1707 1710 1716 1718 1720
## 8.476812 8.476812 8.415833 8.415833 8.415833 8.415833 8.415833 8.532000
## 1725 1726 1727 1731 1735 1743 1747 1749
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.476812 8.476812
## 1756 1768 1784 1786 1792 1797 1806 1808
## 8.476812 8.532000 8.532000 8.476812 8.532000 8.476812 8.542353 8.542353
## 1812 1813 1816 1819 1824 1826 1843 1844
## 8.476812 8.476812 8.476812 8.476812 8.532000 8.532000 8.532000 8.532000
## 1846 1847 1850 1853 1859 1860 1866 1868
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.532000 8.532000
## 1869 1870 1872 1884 1887 1891 1901 1910
## 8.542353 8.532000 8.532000 8.532000 8.476812 8.476812 8.532000 8.476812
## 1911 1919 1924 1926 1927 1929 1932 1947
## 8.542353 8.476812 8.542353 8.476812 8.532000 8.476812 8.532000 8.532000
## 1949 1952 1957 1972 1976 1979 1983 1988
## 8.532000 8.532000 8.532000 8.476812 8.476812 8.476812 8.476812 8.532000
## 1993 1996 2002 2006 2011 2012 2019 2024
## 8.476812 8.542353 8.532000 8.532000 8.532000 8.415833 8.415833 8.154545
## 2025 2027 2030 2035 2038
## 8.154545 8.154545 8.154545 8.154545 8.476812
RMSE(predict(model_tree1, test), test$PH)
## [1] 0.1349601
Plotting Tree1
plot(model_tree1, uniform=TRUE, compress=FALSE, margin=.015)
text(model_tree1, all=TRUE, cex=.5)
Neural Network
my.grid <- expand.grid(decay = c(0.5,0,1), size=c(5,6,7))
model_nnet <- train(PH ~ ., data=train, method="nnet",
maxit=500, tuneGrid = my.grid, trace =F, linout = 1)
print(model_nnet)
## Neural Network
##
## 1529 samples
## 23 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1529, 1529, 1529, 1529, 1529, 1529, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.0 5 0.2586472 0.3048120 0.12498432
## 0.0 6 0.5222105 0.3000944 0.16361182
## 0.0 7 1.2898967 0.2696562 0.26220503
## 0.5 5 0.1306745 0.4284715 0.10059015
## 0.5 6 0.1305440 0.4299680 0.10004188
## 0.5 7 0.1295583 0.4372411 0.09956567
## 1.0 5 0.1302672 0.4243708 0.10052455
## 1.0 6 0.1301176 0.4256484 0.10037720
## 1.0 7 0.1292306 0.4328996 0.09958846
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 7 and decay = 1.
Random Forest
library(caret)
RMSE(predict(model_nnet, test), test$PH)
## [1] 0.1250552
RMSE(predict(model_tree1, test), test$PH)
## [1] 0.1349601
RMSE(predict(model_lm1, test), test$PH)
## [1] 0.1351587
# Load Necessary Libraries
library(caret)
library(randomForest)
# 1. Combine Features and Target for Training
train$PH <- y_train
# 2. Training the Random Forest Model
rfModel <- randomForest(
x = train[, setdiff(names(train), "PH")], # Features (exclude the target variable)
y = train$PH, # Target variable
importance = TRUE, # Enable variable importance
ntree = 1500 # Number of trees
)
# 3. Combine Features and Target for Testing
test$PH <- y_test
# 4. Making Predictions on the Test Set
rfPred <- predict(rfModel, test[, setdiff(names(test), "PH")])
# 5. Evaluating Model Performance
testY_vec <- test$PH
performance <- postResample(pred = rfPred, obs = testY_vec)
# Display the Performance Metrics
print(performance)
## RMSE Rsquared MAE
## 0.10235056 0.65131395 0.07790556
# Define Cross-Validation Control
train_control <- trainControl(
method = "cv",
number = 5,
verboseIter = FALSE,
search = "grid"
)
# Grid of Hyperparameters for k-Values
k_values <- expand.grid(k = c(3, 5, 7, 9, 11))
# Train the KNN Model
knn_model <- train(
x = X_train,
y = y_train,
method = "knn",
tuneGrid = k_values,
trControl = train_control
)
# Plot the KNN Model Performance
plot(knn_model)
# Train SVM with radial kernel
svm_model <- train(
PH ~ .,
data = train,
method = "svmRadial",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 5
)
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 1529 samples
## 23 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1224, 1222, 1223, 1223, 1224
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1277438 0.4614226 0.09552216
## 0.50 0.1255571 0.4780651 0.09325780
## 1.00 0.1232271 0.4946128 0.09145115
## 2.00 0.1218984 0.5040928 0.09088473
## 4.00 0.1216624 0.5062811 0.09097519
##
## Tuning parameter 'sigma' was held constant at a value of 0.02956161
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02956161 and C = 4.
# Predict and evaluate
svm_pred <- predict(svm_model, test)
svm_rmse <- RMSE(svm_pred, test$PH)
svm_r2 <- R2(svm_pred, test$PH)
# Display metrics
cat("SVM Model RMSE:", svm_rmse, "\n")
## SVM Model RMSE: 0.1113341
cat("SVM Model R-Squared:", svm_r2, "\n")
## SVM Model R-Squared: 0.5731233
# Calculate RMSE and R-squared for Neural Network Model
rmse_nnet <- RMSE(predict(model_nnet, test), test$PH)
rsq_nnet <- R2(predict(model_nnet, test), test$PH)
# Calculate RMSE and R-squared for Tree-Based Model
rmse_tree1 <- RMSE(predict(model_tree1, test), test$PH)
rsq_tree1 <- R2(predict(model_tree1, test), test$PH)
# Calculate RMSE and R-squared for Linear Model
rmse_lm1 <- RMSE(predict(model_lm1, test), test$PH)
rsq_lm1 <- R2(predict(model_lm1, test), test$PH)
# Calculate RMSE and R-squared for Random Forest Model
rmse_rf <- RMSE(rfPred, test$PH)
rsq_rf <- R2(rfPred, test$PH)
# Calculate RMSE and R-squared for KNN Model
rmse_knn <- RMSE(predict(knn_model, test), test$PH)
rsq_knn <- R2(predict(knn_model, test), test$PH)
# Create a Data Frame of RMSE and R-squared Values
model_performance <- data.frame(
Model = c("Neural Network", "Tree-Based", "Linear Model", "Random Forest", "KNN", "SVM"),
RMSE = c(rmse_nnet, rmse_tree1, rmse_lm1, rmse_rf, rmse_knn,svm_rmse),
R_Squared = c(rsq_nnet, rsq_tree1, rsq_lm1, rsq_rf, rsq_knn,svm_r2)
)
# Display the Performance Data Frame
model_performance <- model_performance %>%
arrange(RMSE, desc(R_Squared))
# Print the sorted model_performance data frame
print(model_performance)
## Model RMSE R_Squared
## 1 Random Forest 0.1023506 0.6513140
## 2 SVM 0.1113341 0.5731233
## 3 KNN 0.1187727 0.5122407
## 4 Neural Network 0.1250552 0.4540662
## 5 Tree-Based 0.1349601 0.3675322
## 6 Linear Model 0.1351587 0.3599708
After Tuning the different models we conclude that best preforming model is the Random Forrest Model with an R_Squared of 0.65 and RMSE of 0.10
# Extract variable of importance
var_importance <- importance(rfModel)
importance_df <- data.frame(
Variable = rownames(var_importance),
Importance = var_importance[, "IncNodePurity"]
)
# Order
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
#ploting
ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Importance Variables in Random Forest",
x = "Variable",
y = "Importance (IncNodePurity)") +
theme_minimal()
# Define the prediction function for the Random Forest model
pred_wrapper <- function(object, newdata) {
predict(object, newdata = newdata)
}
# Compute SHAP values for the test set
shap_values <- fastshap::explain(
object = rfModel,
X = as.data.frame(test[, -which(names(test) == "PH")]),
pred_wrapper = pred_wrapper
)
# Convert SHAP values to a data frame
shap_df <- as.data.frame(shap_values)
shap_df$Observation <- 1:nrow(shap_df)
# Calculate mean absolute SHAP values for each feature to determine importance
shap_importance <- colMeans(abs(shap_df[, -ncol(shap_df)]))
shap_importance <- sort(shap_importance, decreasing = TRUE)
ordered_features <- names(shap_importance)
# Melt the SHAP values data frame to long format for ggplot
shap_long <- melt(shap_df, id.vars = "Observation", variable.name = "Feature", value.name = "SHAP")
# Convert Feature to a factor ordered by importance (flipped)
shap_long$Feature <- factor(shap_long$Feature, levels = rev(ordered_features))
# Plot the SHAP values with important features at the top
ggplot(shap_long, aes(x = Feature, y = SHAP, fill = Feature)) +
geom_boxplot() +
coord_flip() +
theme_minimal() +
labs(title = "SHAP Values for Features Ordered by Importance",
y = "SHAP Value",
x = "Features") +
theme(legend.position = "none")
# top num vars for plotting correlations
numerical_vars <- c("usage_cont", "filler_level", "temperature","fill_pressure", "carb_flow")
# Calculate correlation coefficients for numeric variables
correlations <- sapply(numerical_vars, function(var) cor(train[[var]], train$PH, use = "complete.obs"))
print("Correlations of PH and top numerical variables")
## [1] "Correlations of PH and top numerical variables"
correlations
## usage_cont filler_level temperature fill_pressure carb_flow
## -0.3580374 0.3399455 -0.1971432 -0.3304814 0.2047809
# Select the relevant columns
data_subset <- train[, c("PH", numerical_vars)]
# Pivot the data to long format for faceting
data_long <- pivot_longer(data_subset, cols = all_of(numerical_vars), names_to = "Variable", values_to = "Value")
# Create the facet plot
ggplot(data_long, aes(x = Value, y = PH)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
facet_wrap(~ Variable, scales = "free", ncol = 2) +
labs(title = "Relationships Between PH and Selected Variables",
x = "Feature Value",
y = "PH",
) +
theme_minimal()
# Boxplot for the categorical variable 'brand_codeC'
ggplot(train, aes(x = as.factor(brand_codeC), y = PH)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Relationship between PH and brand_codeC",
x = "brand_codeC",
y = "PH") +
theme_minimal()
Combining insights from the variable importance plot, SHAP values, and correlation plots, we can make informed conclusions about how our top 5 variables influence the Random Forest model’s predictions for PH:
usage_cont
Importance: Ranked the most important variable in the Random Forest model.
SHAP Insight: SHAP values show both positive and negative impacts on PH, suggesting variability in its effect.
Correlation Plot: Shows a negative correlation with PH. Higher usage_cont values lead to lower PH levels.
Conclusion: usage_cont is a strong predictor, likely capturing flow rate and system efficiency, which impact acidity levels.
filler_level
Importance: The second most important variable.
SHAP Insight: SHAP values suggest it has a consistent positive influence on PH.
Correlation Plot: Indicates a slight positive correlation. Higher filler levels tend to increase PH.
Conclusion: Variations in filler levels affect ingredient concentration, influencing PH balance.
temperature
Importance: The third most important variable.
SHAP Insight: SHAP values indicate a negative influence on PH.
Correlation Plot: Shows a clear negative correlation. Higher temperature leads to lower PH.
Conclusion: Temperature impacts carbonation and chemical reactions, affecting the acidity and lowering PH.
brand_codeC
Importance: The fourth most important variable.
SHAP Insight: SHAP values suggest a strong negative impact when Brand C is present.
Boxplot: PH is consistently lower when brand_codeC = 1 (Brand C is present).
Conclusion: The formulation or process used by Brand C leads to lower PH levels compared to other brands.
fill_pressure
Importance: The fifth most important variable.
SHAP Insight: SHAP values show a negative impact on PH.
Correlation Plot: Displays a negative correlation. Higher fill pressure tends to decrease PH levels.
Conclusion: Fill pressure affects the pressure conditions during filling, which can influence ingredient concentration and acidity, thereby lowering PH.
carb_flow
Importance: The sixth most important variable.
SHAP Insight: SHAP values show a positive impact on PH.
Correlation Plot: Displays a slight positive correlation. Higher carb_flow increases PH.
Conclusion: Carb flow affects the amount of dissolved CO₂, altering acidity and thereby increasing PH.
Summary of Insights:
usage_cont and temperature have strong negative relationships with PH.
filler_level and carb_flow exhibit positive influences on PH.
fill_pressure negatively influences PH.
brand_codeC distinctly lowers PH.
Conclusion:
In conclusion, while the Random Forest model is the best performer for predicting PH, we need to be clear about its limitations. The model’s R-squared of 0.65 and RMSE of 0.10 are the best scores among the models we trained and tested. But with 35% of the variance in PH left unexplained, there’s room for improvement.
Future analysis should focus on filling the gaps in the data or applying imputation techniques for missing values. We should also consider adding more features or refining existing ones to capture more of what drives PH levels. Right now, the forecasts are based on a model that doesn’t include imputations, and that’s something stakeholders should keep in mind.
The team is confident in the insights delivered so far, but with more complete data and some adjustments, we can make even better predictions in the future.
# Ensure preprocessing is applied to eval_X
# Apply Yeo-Johnson transformation to eval_X
eval_X[, numeric_features] <- predict(preprocess_params, eval_X[, numeric_features])
# Apply robust scaling to eval_X
eval_X[, numeric_features] <- robust_scale(eval_X[, numeric_features], medians, iqrs)
# Remove highly correlated variables from eval_X
eval_X <- eval_X[, !names(eval_X) %in% high_corr_indices]
# Make predictions on the eval_X dataset using the Random Forest model
eval_predictions <- predict(rfModel, eval_X)
# Display the predicted PH values
print("Predicted PH values:")
## [1] "Predicted PH values:"
print(eval_predictions)
## 1 2 3 4 5 6 7 8
## 8.531851 8.603143 8.496703 8.523643 8.609380 8.529564 8.556028 8.746530
## 9 10 11 12 13 14 15 16
## 8.328026 8.581438 8.581967 8.601740 8.710108 8.614099 8.597041 8.615836
## 17 18 19 20 21 22 23 24
## 8.487977 8.449442 8.633737 8.682109 8.580535 8.565949 8.553090 8.557087
## 25 26 27 28 29 30 31 32
## 8.674184 8.706720 8.321299 8.387354 8.671264 8.719526 8.701226 8.701277
## 33 34 35 36 37 38 39 40
## 8.641984 8.730761 8.692891 8.761330 8.682994 8.454105 8.516469 8.654464
## 41 42 43 44 45 46 47 48
## 8.755156 8.734625 8.750944 8.735522 8.721475 8.757144 8.571764 8.555226
## 49 50 51 52 53 54 55 56
## 8.575807 8.596961 8.584895 8.496598 8.415022 8.533208 8.555514 8.542948
## 57 58 59 60 61 62 63 64
## 8.502970 8.655768 8.710757 8.693966 8.767340 8.642309 8.626228 8.607960
## 65 66 67 68 69 70 71 72
## 8.708350 8.642886 8.583602 8.706785 8.548307 8.684580 8.632432 8.611212
## 73 74 75 76 77 78 79 80
## 8.601411 8.708010 8.705296 8.748947 8.744084 8.428039 8.459278 8.498614
## 81 82 83 84 85 86 87 88
## 8.644689 8.648746 8.597674 8.671324 8.674667 8.676715 8.683344 8.675282
## 89 90 91 92 93 94 95 96
## 8.610223 8.652668 8.698168 8.601261 8.569489 8.538041 8.521243 8.525146
## 97 98 99 100 101 102 103 104
## 8.573202 8.485654 8.540701 8.476891 8.424081 8.511464 8.505408 8.588440
## 105 106 107 108 109 110 111 112
## 8.575082 8.576334 8.584466 8.527812 8.652756 8.450518 8.465418 8.439398
## 113 114 115 116 117 118 119 120
## 8.490966 8.537927 8.559260 8.626806 8.608619 8.638556 8.717568 8.646098
## 121 122 123 124 125 126 127 128
## 8.701178 8.438174 8.542954 8.535612 8.651901 8.448982 8.535418 8.447995
## 129 130 131 132 133 134 135 136
## 8.399496 8.507557 8.520241 8.471886 8.354592 8.372130 8.547954 8.400976
## 137 138 139 140 141 142 143 144
## 8.382844 8.368475 8.476426 8.457102 8.476027 8.333137 8.339997 8.313790
## 145 146 147 148 149 150 151 152
## 8.318392 8.431888 8.467044 8.450848 8.415168 8.378575 8.487884 8.486926
## 153 154 155 156 157 158 159 160
## 8.348944 8.407049 8.450529 8.468483 8.372240 8.411245 8.367988 8.332852
## 161 162 163 164 165 166 167 168
## 8.545659 8.505944 8.493745 8.492170 8.456895 8.395450 8.555494 8.475673
## 169 170 171 172 173 174 175 176
## 8.449899 8.485641 8.463428 8.485945 8.486793 8.533447 8.525938 8.625628
## 177 178 179 180 181 182 183 184
## 8.634585 8.492647 8.501467 8.468431 8.555540 8.477963 8.438225 8.452320
## 185 186 187 188 189 190 191 192
## 8.456068 8.518920 8.505006 8.445432 8.528754 8.549645 8.559218 8.500270
## 193 194 195 196 197 198 199 200
## 8.509977 8.602137 8.572722 8.470956 8.601771 8.300192 8.352504 8.311770
# Add the predictions to the eval_X dataset
eval_X$PH_Predicted <- eval_predictions
# Export the dataset with predictions to a new Excel file
write.csv(eval_X, "StudentEvaluation_WithForecasts.csv", row.names = FALSE)
# Confirm export completion
cat("Forecasts successfully generated and exported to 'StudentEvaluation_WithForecasts.csv'.\n")
## Forecasts successfully generated and exported to 'StudentEvaluation_WithForecasts.csv'.