This report documents the methodology and findings for the ABC Beverage pH prediction project. The full project prompt as given:
Project #2 (Team) Assignment
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
This document is the technical report deliverable. The companion
non-technical executive summary is submitted separately, as is
ABC Beverage Final Sheet.xlsx, the formal Excel deliverable
that contains the predictions, out-of-fold residuals, model comparison
metrics, and a ranked top-factors table with business interpretations.
The Rmd also auto-generates a minimal
abc_beverage_PH_predictions.csv on each knit, which
contains the same row-level predictions in CSV form.
The data consists of 2,571 historical production records (training set) and 267 records where pH is to be predicted (evaluation set). Each row has 33 process and brand measurements. The sections that follow cover the data exploration, the cleaning steps, the modeling pipeline, and a side-by-side comparison of six modeling approaches: linear regression, elastic net, random forest, gradient boosting, support vector machine with radial kernel, and Cubist.
Recap located at the end of the section. Skip there to jump past the code/workflow.
library(tidyverse)
library(VIM)
test_data <- read.csv(file.path(projPath, "testdataproj2.csv"), na.strings = c("", "NA", "null", "NULL"), header = TRUE)
training_data <- read.csv(file.path(projPath, "trainingdataproj2.csv"), na.strings = c("", "NA", "null", "NULL"), header = TRUE)
glimpse(test_data)
## Rows: 267
## Columns: 33
## $ Brand.Code <chr> "D", "A", "B", "B", "B", "B", "A", "B", "A", "D", "B…
## $ Carb.Volume <dbl> 5.480000, 5.393333, 5.293333, 5.266667, 5.406667, 5.…
## $ Fill.Ounces <dbl> 24.03333, 23.95333, 23.92000, 23.94000, 24.20000, 24…
## $ PC.Volume <dbl> 0.2700000, 0.2266667, 0.3033333, 0.1860000, 0.160000…
## $ Carb.Pressure <dbl> 65.4, 63.2, 66.4, 64.8, 69.4, 73.4, 65.2, 67.4, 66.8…
## $ Carb.Temp <dbl> 134.6, 135.0, 140.4, 139.0, 142.2, 147.2, 134.6, 139…
## $ PSC <dbl> 0.236, 0.042, 0.068, 0.004, 0.040, 0.078, 0.088, 0.0…
## $ PSC.Fill <dbl> 0.40, 0.22, 0.10, 0.20, 0.30, 0.22, 0.14, 0.10, 0.48…
## $ PSC.CO2 <dbl> 0.04, 0.08, 0.02, 0.02, 0.06, NA, 0.00, 0.04, 0.04, …
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 116.6, 118.8, 120.2, 124.8, 115.0, 118.6, 117.6, 121…
## $ Fill.Pressure <dbl> 46.0, 46.2, 45.8, 40.0, 51.4, 46.4, 46.2, 40.0, 43.8…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure3 <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Hyd.Pressure4 <int> 96, 112, 98, 132, 94, 94, 108, 108, 110, 106, 98, 96…
## $ Filler.Level <dbl> 129.4, 120.0, 119.4, 120.2, 116.0, 120.4, 119.6, 131…
## $ Filler.Speed <int> 3986, 4012, 4010, NA, 4018, 4010, 4010, NA, 4010, 10…
## $ Temperature <dbl> 66.0, 65.6, 65.6, 74.4, 66.4, 66.6, 66.8, NA, 65.8, …
## $ Usage.cont <dbl> 21.66, 17.60, 24.18, 18.12, 21.32, 18.00, 17.68, 12.…
## $ Carb.Flow <int> 2950, 2916, 3056, 28, 3214, 3064, 3042, 1972, 2502, …
## $ Density <dbl> 0.88, 1.50, 0.90, 0.74, 0.88, 0.84, 1.48, 1.60, 1.52…
## $ MFR <dbl> 727.6, 735.8, 734.8, NA, 752.0, 732.0, 729.8, NA, 74…
## $ Balling <dbl> 1.398, 2.942, 1.448, 1.056, 1.398, 1.298, 2.894, 3.3…
## $ Pressure.Vacuum <dbl> -3.8, -4.4, -4.2, -4.0, -4.0, -3.8, -4.2, -4.4, -4.4…
## $ PH <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Oxygen.Filler <dbl> 0.022, 0.030, 0.046, NA, 0.082, 0.064, 0.042, 0.096,…
## $ Bowl.Setpoint <int> 130, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 45.2, 46.0, 46.0, 46.0, 50.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 147.2, 146.6, 146.4, 145.8, 146.0, 145.0, 146…
## $ Alch.Rel <dbl> 6.56, 7.14, 6.52, 6.48, 6.50, 6.50, 7.18, 7.16, 7.14…
## $ Carb.Rel <dbl> 5.34, 5.58, 5.34, 5.50, 5.38, 5.42, 5.46, 5.42, 5.44…
## $ Balling.Lvl <dbl> 1.48, 3.04, 1.46, 1.48, 1.46, 1.44, 3.02, 3.00, 3.10…
glimpse(training_data)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
str(training_data)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : chr "B" "A" "B" "A" ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
All the rows with numbers are formatted as numbers. All the variables are numeric except for brand code, likely the only categorical variable.
glimpse() and
str() to confirm everything imported correctly.Brand.Code,
which is the only categorical predictor.Recap located at the end of the section. Skip there to jump past the code/workflow.
We’ll look at:
The number of cases
Missing values
Data types (numerical vs. categorical)
nrow(training_data)
## [1] 2571
nrow(test_data)
## [1] 267
The training data has 2571 cases and the test data has 267.
Counting the number of columns/variables:
ncol(test_data)
## [1] 33
ncol(training_data)
## [1] 33
Both data sets have 33 columns.
Are all the variables represented in both data sets?
colnames(test_data)
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "PH" "Oxygen.Filler"
## [28] "Bowl.Setpoint" "Pressure.Setpoint" "Air.Pressurer"
## [31] "Alch.Rel" "Carb.Rel" "Balling.Lvl"
colnames(training_data)
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "PH" "Oxygen.Filler"
## [28] "Bowl.Setpoint" "Pressure.Setpoint" "Air.Pressurer"
## [31] "Alch.Rel" "Carb.Rel" "Balling.Lvl"
Both data sets have the same columns, so we can proceed.
The response we’re predicting is PH. A quick look at its
distribution in the training data:
ggplot(training_data %>% filter(!is.na(PH)), aes(x = PH)) +
geom_histogram(bins = 40, fill = "#3b6e8c", color = "white") +
geom_vline(aes(xintercept = mean(PH, na.rm = TRUE)),
color = "#8c3b3b", linetype = "dashed") +
labs(
title = "PH distribution (training set)",
x = "PH",
y = "Count"
) +
theme_minimal(base_size = 11)
Most batches sit between roughly 8.4 and 8.7. The distribution is unimodal, slightly left-skewed, with a long thin left tail. The dashed line marks the mean. The full observed range is about 7.88 to 9.36, but the bulk of the data is in a narrow band, so model errors on the order of 0.1 PH unit represent a meaningful fraction of the typical batch-to-batch variation.
How many missing values are there? Are there any trends? How many complete cases are there?
test_data)Note: test_data here is the evaluation file used for the
project deliverable. PH is intentionally blank in this file because
that’s what the model predicts. It’s not a held-out test split from
training.
Count nulls/missing values in the evaluation set:
sort(colSums(is.na(test_data)), decreasing = TRUE)
## PH MFR Filler.Speed Brand.Code
## 267 31 10 8
## Fill.Ounces PSC PSC.CO2 PC.Volume
## 6 5 5 4
## Carb.Pressure1 Hyd.Pressure4 PSC.Fill Oxygen.Filler
## 4 4 3 3
## Alch.Rel Fill.Pressure Filler.Level Temperature
## 3 2 2 2
## Usage.cont Pressure.Setpoint Carb.Rel Carb.Volume
## 2 2 2 1
## Carb.Temp Hyd.Pressure2 Hyd.Pressure3 Density
## 1 1 1 1
## Balling Pressure.Vacuum Bowl.Setpoint Air.Pressurer
## 1 1 1 1
## Carb.Pressure Mnf.Flow Hyd.Pressure1 Carb.Flow
## 0 0 0 0
## Balling.Lvl
## 0
First off, all the pH values are missing. This is not a test set -it’s what we will predict. The training data will have to be split on its own into a training and test set.
Most columns are missing a few (under 10) values or none at all. MFR is missing the most values - 31 out of 267, or about 12%. Filler speed is next, missing only 10 values.
Next, we’ll look at the data row-wise and see if there are full or close-to-full missing cases.
rowSums(is.na(test_data))
## [1] 3 1 1 4 1 2 1 4 1 2 3 1 1 1 1 4 1 1 1 1 1 2 3 1 1
## [26] 1 1 1 2 1 1 2 2 2 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1
## [51] 3 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [76] 1 2 1 1 1 1 1 1 1 5 2 1 1 1 1 2 15 1 2 2 1 3 1 2 1
## [101] 2 1 1 1 1 1 1 1 1 2 1 1 1 1 3 1 1 2 1 1 2 1 1 1 1
## [126] 1 2 2 1 1 2 2 2 3 1 1 3 2 2 2 1 1 1 2 1 1 1 1 2 1
## [151] 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
## [176] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [201] 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [226] 1 3 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 3 1
## [251] 2 1 3 1 1 9 2 1 1 1 1 1 1 1 1 1 1
#totals by row (1 does not count)
table(rowSums(is.na(test_data)))
##
## 1 2 3 4 5 9 15
## 200 50 11 3 1 1 1
#total NAs (subtracting 267 for the missing pH values)
sum(is.na(test_data)) - 267
## [1] 107
Overall, a data set with 267 rows and 33 columns is missing 107 values, or about 1.3%. 67 out of 267 cases are missing at least one value (about 25%).
One case has 14 missing values, so it may make sense to drop that; another is missing 8. Otherwise, most rows with missing values are missing only 1-3.
Which variables are missing data?
sort(colSums(is.na(training_data)), decreasing = TRUE)
## MFR Brand.Code Filler.Speed PC.Volume
## 212 120 57 39
## PSC.CO2 Fill.Ounces PSC Carb.Pressure1
## 39 38 33 32
## Hyd.Pressure4 Carb.Pressure Carb.Temp PSC.Fill
## 30 27 26 23
## Fill.Pressure Filler.Level Hyd.Pressure2 Hyd.Pressure3
## 22 20 15 15
## Temperature Oxygen.Filler Pressure.Setpoint Hyd.Pressure1
## 14 12 12 11
## Carb.Volume Carb.Rel Alch.Rel Usage.cont
## 10 10 9 5
## PH Mnf.Flow Carb.Flow Bowl.Setpoint
## 4 2 2 2
## Density Balling Balling.Lvl Pressure.Vacuum
## 1 1 1 0
## Air.Pressurer
## 0
MFR is missing 212 out of 2571 cases, about 8.2%. Brand code, the only categorical variable, is missing 120 times, or about 4.7% of the time. Filler speed is missing 57 times, about 2.2%, and everything else is under 2%.
Let’s see if anything strange is happening across rows:
#nothing out of the ordinary across rows. Most rows are complete.
#Rows with NAs typically 1 or 2 NAs, wich
rowSums(is.na(training_data))
## [1] 2 2 2 1 0 0 2 1 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0
## [25] 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1
## [49] 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [73] 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
## [97] 0 1 0 0 0 0 0 0 0 0 0 2 5 0 0 0 0 0 1 0 0 0 0 0
## [121] 0 0 0 0 2 0 0 1 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0
## [145] 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [169] 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0
## [193] 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0
## [217] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 1
## [241] 0 1 6 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 2
## [265] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [289] 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 3 0 0 0 0 0 2
## [313] 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 2 0
## [337] 0 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [361] 0 0 0 0 2 2 0 2 0 0 0 1 0 2 1 0 1 0 0 0 0 0 0 0
## [385] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0
## [409] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 1 1 0
## [433] 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [457] 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 1 0 0 4 1 1 1 1 1
## [481] 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 3 0 1 1 0 0 0 0 0
## [505] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
## [529] 2 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0
## [553] 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## [577] 1 0 0 0 0 0 0 1 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0
## [601] 0 0 0 0 0 0 0 0 2 4 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [625] 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0
## [649] 1 0 0 0 0 0 1 0 2 2 1 0 0 0 0 0 0 0 0 0 0 0 1 0
## [673] 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 5 2 0 0 1 2 3
## [697] 3 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0
## [721] 0 2 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 9 4 0 0 0 0 0
## [745] 0 0 0 0 0 1 0 0 1 0 3 0 0 1 0 0 0 0 0 0 1 0 0 0
## [769] 0 0 2 2 3 3 3 3 4 3 3 3 3 3 3 3 4 0 0 0 0 0 0 0
## [793] 0 0 2 0 0 0 1 2 4 1 0 0 0 0 0 0 2 1 0 0 0 0 0 3
## [817] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 1 0 0 0 1 0 0 0
## [841] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [865] 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 0
## [889] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0
## [913] 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0
## [937] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1
## [961] 2 1 1 2 2 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## [985] 0 0 0 0 0 0 0 0 0 0 2 1 1 1 0 3 0 0 0 0 0 0 0 0
## [1009] 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0
## [1033] 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
## [1057] 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [1081] 0 0 0 0 0 0 0 1 0 0 0 0 0 8 2 0 1 0 0 0 0 1 0 0
## [1105] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [1129] 0 0 0 0 0 0 0 0 3 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0
## [1153] 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [1177] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
## [1201] 1 0 1 0 0 0 0 0 1 1 1 1 3 2 3 2 4 1 1 1 1 1 0 0
## [1225] 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 4 2 0 0 1 0 0
## [1249] 0 0 0 0 3 0 0 0 0 0 0 2 0 0 0 0 1 0 0 1 1 1 0 0
## [1273] 1 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 0 0 0 2 0 2 1
## [1297] 0 1 0 4 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 2 1 1
## [1321] 1 1 1 2 1 1 0 0 0 0 1 1 1 1 1 1 2 1 1 1 2 1 1 1
## [1345] 1 1 2 1 1 1 1 1 1 1 2 1 2 0 1 1 1 1 1 1 3 1 1 2
## [1369] 2 1 0 0 0 0 0 2 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
## [1393] 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1417] 0 0 0 0 3 2 0 0 0 0 0 1 0 0 3 0 0 0 0 0 0 0 0 0
## [1441] 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
## [1465] 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [1489] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
## [1513] 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 4 1 0 0 0 0 0 6 0
## [1537] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1
## [1561] 2 3 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0
## [1585] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1609] 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
## [1633] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1657] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0
## [1681] 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [1705] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 12 1 0 0 1 0 3 1 2 0
## [1729] 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
## [1753] 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 0 1 0
## [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 2 1 1
## [1801] 4 1 1 1 1 1 3 1 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0
## [1825] 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 5 0 0 0
## [1849] 0 0 1 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
## [1873] 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0
## [1897] 0 0 0 2 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
## [1921] 1 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0
## [1945] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 1 1 1 1 1 0 0
## [1969] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
## [1993] 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [2017] 3 1 1 1 3 1 1 3 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [2041] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2065] 0 2 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0
## [2089] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
## [2113] 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 1 0 0 0 0 0 0 0
## [2137] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0
## [2161] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
## [2185] 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [2209] 0 0 0 0 0 5 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0
## [2233] 0 0 1 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2257] 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 2 0
## [2281] 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 0 0 2 0 0
## [2305] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 2
## [2329] 0 1 1 0 1 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0
## [2353] 0 0 0 0 2 0 3 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0
## [2377] 0 0 0 2 0 0 0 0 2 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0
## [2401] 0 0 0 0 0 0 0 0 1 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0
## [2425] 0 0 1 1 1 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 1 0
## [2449] 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1
## [2473] 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [2497] 0 0 0 0 0 0 0 1 0 0 0 7 1 0 0 0 0 0 2 0 0 0 0 0
## [2521] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [2545] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 1 1
## [2569] 0 0 3
table(rowSums(is.na(training_data)))
##
## 0 1 2 3 4 5 6 7 8 9 12
## 2038 345 119 45 13 4 3 1 1 1 1
sum(is.na(test_data))
## [1] 374
Overall, 844 values are missing across the training set, about 1% of all cells. 2038/2571 cases are complete. That’s about 79% complete, 21% incomplete. Let’s also check to see if the data is MCAR.
library(naniar)
numeric_only <- training_data %>%
select(-Brand.Code)
numeric_only <- numeric_only %>%
dplyr::select(where(is.numeric)) %>%
mutate(across(everything(), as.double))
mcar_test(numeric_only)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 10791. 2871 0 99
The p-value is 0, which means the data most likely isn’t missing completely at random (there are patterns in the missing data).
The training set has 120 rows where Brand.Code is blank (about 4.7% of cases). Before deciding how to handle them in modeling, we check whether those rows look different from the rest. If the missingness carries signal about PH, we should preserve it rather than impute it away.
# How does PH look in rows where Brand.Code is missing vs known?
training_data %>%
mutate(brand_missing = is.na(Brand.Code) | Brand.Code == "") %>%
filter(!is.na(PH)) %>%
group_by(brand_missing) %>%
summarise(
n = n(),
ph_mean = mean(PH),
ph_median = median(PH),
ph_sd = sd(PH),
ph_min = min(PH),
ph_max = max(PH)
)
## # A tibble: 2 × 7
## brand_missing n ph_mean ph_median ph_sd ph_min ph_max
## <lgl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 2447 8.55 8.54 0.172 7.88 9.36
## 2 TRUE 120 8.49 8.51 0.177 8.12 8.88
Two visible differences. Missing-Brand rows have a mean PH about 0.06 units below known-Brand rows (8.49 vs 8.55), and a noticeably narrower PH range (8.12 to 8.88 vs 7.88 to 9.36). They cluster toward the middle of the distribution rather than the tails.
That looks like what Kuhn and Johnson calls informative missingness:
the absence of a Brand.Code value is itself associated with a specific
PH pattern. Imputing the missing values into one of A/B/C/D would roll
these rows into whichever brand the imputer picks, distorting that
brand’s signal and erasing the mean shift. The cleaner approach is to
keep the missingness as its own level ("Unknown") when we
model.
PH values in the evaluation set are missing. This
is the file we are predicting on, not a test setMFR (~8% training, ~12% eval)Brand.Code (~5% training)Filler.Speed (~2% training)naniar::mcar_test) which returned p ~ 0, so the
missingness is not completely at randomRecap located at the end of the section. Skip there to jump past the code/workflow.
In this section:
Variables with zero or near-zero variance
Highly correlated variables
Outliers/range of variables/scaling (if needed)
This data set has 33 variables. Let’s check to see if they’re all (potentially) providing useful predictive information by testing for near-zero or zero variance.
library(caret)
nzv_results <- nearZeroVar(training_data, saveMetrics = TRUE)
nzv_results
## freqRatio percentUnique zeroVar nzv
## Brand.Code 2.014634 0.1555815 FALSE FALSE
## Carb.Volume 1.055556 3.9284325 FALSE FALSE
## Fill.Ounces 1.163043 3.5783742 FALSE FALSE
## PC.Volume 1.100000 17.6584986 FALSE FALSE
## Carb.Pressure 1.016393 4.1229094 FALSE FALSE
## Carb.Temp 1.016667 4.7841307 FALSE FALSE
## PSC 1.211538 5.0175029 FALSE FALSE
## PSC.Fill 1.091892 1.2446519 FALSE FALSE
## PSC.CO2 1.078303 0.5056398 FALSE FALSE
## Mnf.Flow 1.048443 18.9420459 FALSE FALSE
## Carb.Pressure1 1.031746 5.4453520 FALSE FALSE
## Fill.Pressure 1.762931 4.2007001 FALSE FALSE
## Hyd.Pressure1 31.111111 9.5293660 FALSE TRUE
## Hyd.Pressure2 7.271028 8.0513419 FALSE FALSE
## Hyd.Pressure3 11.450704 7.4679113 FALSE FALSE
## Hyd.Pressure4 1.008264 1.5558149 FALSE FALSE
## Filler.Level 1.086957 11.2018670 FALSE FALSE
## Filler.Speed 1.045161 9.4904706 FALSE FALSE
## Temperature 1.040000 2.1781408 FALSE FALSE
## Usage.cont 1.105263 18.7086737 FALSE FALSE
## Carb.Flow 1.586207 20.7312330 FALSE FALSE
## Density 1.108374 3.0338390 FALSE FALSE
## MFR 1.222222 22.8315830 FALSE FALSE
## Balling 1.193182 8.4402956 FALSE FALSE
## Pressure.Vacuum 1.389728 0.6223259 FALSE FALSE
## PH 1.078125 2.0225593 FALSE FALSE
## Oxygen.Filler 1.266667 13.1466356 FALSE FALSE
## Bowl.Setpoint 2.990847 0.4278491 FALSE FALSE
## Pressure.Setpoint 1.319361 0.3111630 FALSE FALSE
## Air.Pressurer 1.093407 1.2446519 FALSE FALSE
## Alch.Rel 1.098315 2.0614547 FALSE FALSE
## Carb.Rel 1.011583 1.6336056 FALSE FALSE
## Balling.Lvl 1.294872 3.1894205 FALSE FALSE
Hyd.Pressure1 has a near-zero variance, so we could
eliminate that one.
Removing some of the 33 variables will help simplify this data set. We will remove the variable with NZV and highly colinear variables (greater than or about .90, as defined above).
the findcorrelation function says to drop
Filler.Speed instead of MFR. However, we will
opt to drop MFR instead, given that it has the highest
number of missing values (Filler.Speed has 57 as opposed to
MFR’s 212). Having to impute all these values will makes
the variable less reliable, and dropping it would mean having to drop
more cases.
#removing the variable with NZV
training_data_new <- training_data %>%
select(-Hyd.Pressure1)
#removing correlated variables
training_data_new <- training_data_new %>%
select(-Pressure.Setpoint, -MFR, -Hyd.Pressure3,
-Pressure.Vacuum, -Hyd.Pressure4, -Bowl.Setpoint, -Carb.Rel,
-Balling, -Alch.Rel, - Balling.Lvl, - Density)
#donig the same for the test set
test_data_new <- test_data %>%
select(-Hyd.Pressure1, -Pressure.Setpoint, -MFR, -Hyd.Pressure3,
-Pressure.Vacuum, -Hyd.Pressure4, -Bowl.Setpoint, -Carb.Rel,
-Balling, -Alch.Rel, - Balling.Lvl, - Density)
Before we deal with missing values, let’s see if there are any glaring outliers or data entry errors.
#may be an outlier in oxygen filler - the max calues is much higher than the
summary(test_data_new)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd.Pressure2 Filler.Level Filler.Speed Temperature
## Min. :-50.00 Min. : 69.2 Min. :1006 Min. :63.80
## 1st Qu.: 0.00 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40
## Median : 26.80 Median :118.6 Median :3978 Median :65.80
## Mean : 20.11 Mean :110.3 Mean :3581 Mean :66.23
## 3rd Qu.: 34.80 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60
## Max. : 61.40 Max. :153.2 Max. :4020 Max. :75.40
## NA's :1 NA's :2 NA's :10 NA's :2
## Usage.cont Carb.Flow PH Oxygen.Filler
## Min. :12.90 Min. : 0 Mode:logical Min. :0.00240
## 1st Qu.:18.12 1st Qu.:1083 NA's:267 1st Qu.:0.01960
## Median :21.44 Median :3038 Median :0.03370
## Mean :20.90 Mean :2409 Mean :0.04666
## 3rd Qu.:23.74 3rd Qu.:3215 3rd Qu.:0.05440
## Max. :24.60 Max. :3858 Max. :0.39800
## NA's :2 NA's :3
## Air.Pressurer
## Min. :141.2
## 1st Qu.:142.2
## Median :142.6
## Mean :142.8
## 3rd Qu.:142.8
## Max. :147.2
## NA's :1
#there's also at least one really large value on oxygen filler here
summary(training_data_new)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure2 Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 0.00 Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08
## 1st Qu.: 0.00 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36
## Median :28.60 Median :118.4 Median :3982 Median :65.60 Median :21.79
## Mean :20.96 Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99
## 3rd Qu.:34.60 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75
## Max. :59.40 Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90
## NA's :15 NA's :20 NA's :57 NA's :14 NA's :5
## Carb.Flow PH Oxygen.Filler Air.Pressurer
## Min. : 26 Min. :7.880 Min. :0.00240 Min. :140.8
## 1st Qu.:1144 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:142.2
## Median :3028 Median :8.540 Median :0.03340 Median :142.6
## Mean :2468 Mean :8.546 Mean :0.04684 Mean :142.8
## 3rd Qu.:3186 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:143.0
## Max. :5104 Max. :9.360 Max. :0.40000 Max. :148.2
## NA's :2 NA's :4 NA's :12
After quickly browsing the remaining variables, while some of the fields have a wide range of values, none of the values look like errors (there are clusters of similar low/high values for each variable). While they may count as outliers (we’ll calculate later) they are likely legitimate data ant not input errors.
Let’s check histograms to see if anything stands out:
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Brand.Code n
## 1 A 293
## 2 B 1239
## 3 C 304
## 4 D 615
## 5 <NA> 120
Not all the data is normally distributed. pH, the response variable
is normally distributed. Other variables close to a normal distribution
include Carb.Pressure, Carb.Pressure1
(slightly bimodal), Carb.Temp, Fill.Ounces,
Carb.Volume (also slightly bimodal), PSC
(skewed slightly right), PC.Volume,
Temperature (skewed slightly right), and
PSC.Fill (skewed right with a large peak).
A few are more clearly bimodal/have very different peaks on either
side of the graph: Air.Pressurer, Carb.Flow,
Fill.Pressure, Filler.Level,
Filler.Speed, Hyd.Pressure2,
Mnf.Flow, and Usage.cont.
Mnf.Flow is worth a little extra attention. About 40% of training records share a single Mnf.Flow value of exactly -100, with the rest spread between roughly 120 and 229. The IQR-based outlier check doesn’t flag this (it’s not a single rare point), so the records stay in. We can’t tell from the data alone whether -100 represents an idle state, a cleaning cycle, a placeholder for a missing reading, or something else. We kept all records and flagged it as an open question. The Limitations section comes back to this.
Finally, some are skewed right: in addition to the varaibles
mentioned above, Oxygen.Filler looks like it may have some
outliers, and PSC.CO2 has a strange distribution, where the
variable clusters around certain values, almost like a discrete
variable.
#histograms
for (col in names(training_data_new %>% select(-Brand.Code))) {
hist(training_data_new[[col]], main=col)
}
#box plots
for (col in names(training_data_new %>% select(-Brand.Code))) {
boxplot(training_data_new[[col]], main=col)
}
Calculating the number of outliers in each column based on the IQR method
outliers <- sapply(training_data_new %>% select(-Brand.Code), function(x) {
Q1 <- quantile(x, 0.25, na.rm=TRUE)
Q3 <- quantile(x, 0.75, na.rm=TRUE)
IQR <- Q3 - Q1
sum(x < Q1 - 1.5*IQR | x > Q3 + 1.5*IQR, na.rm=TRUE)
})
outliers
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 5 55 86 14 36
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1
## 52 54 77 0 24
## Fill.Pressure Hyd.Pressure2 Filler.Level Filler.Speed Temperature
## 79 0 9 439 123
## Usage.cont Carb.Flow PH Oxygen.Filler Air.Pressurer
## 0 0 18 189 220
Cross-checking the graphs with the data set, we could remove the one
Filler.Level value that appears on the top of the box plot,
161.2 when the next-highest is 154. pH has one comparatively high value
(9.36) that’s nowhere close to the next-highest value. We can remove
that row since it’s missing 4 other variables. PH at that level isn’t
“impossible” here, but with 4 missing predictor values in the same row
and the value isolated far from the rest of the distribution, it seems
more like a data entry issue than a real measurement we want to be
using. Other outlier values are not one-offs, but instead there are a
handful of data points at the high/low potential outlier values, so
we’re leaving the other alone. (For example, it looks like there’s one
particularly high value for Oxygen.Filler, but ther when we
double checked, there 5 cases with the same value. Some light notes on
that are written below.)
#removing the one outliers
training_data_new <- training_data_new %>%
filter(PH < 9 | is.na(PH)) %>%
filter(Filler.Level < 160 | is.na(Filler.Level))
#removing pH NAs because we shouldn't impute the response variable
training_data_no_na <- training_data_new %>%
filter(!is.na(PH))
#leaving out the categorical and blank response variable
for (col in names(test_data_new %>% select(-Brand.Code, -PH))) {
boxplot(test_data_new[[col]], main=col)
}
test_data_new %>% count(Brand.Code)
## Brand.Code n
## 1 A 35
## 2 B 129
## 3 C 31
## 4 D 64
## 5 <NA> 8
Oxygen.Filler shows up again with a few outliers, which
just seems to be the case for this variable; the outliers in filler
speed are all clustered; filler level has outliers on both ends, as does
fill pressure. Carb.Pressure1 has one high value,
PSC.CO2 has a few higher values designated outliers due to
the clustering on the lower end, same is true for PSC.Fill,
and the rest are similar.
We won’t remove any of these because they don’t look like errors, and in some cases represent a different, smaller cluster of values.
The correlation prune above takes us to 20 predictors. This “pruned” or “narrowed” set is what we can use for linear models, PLS, and penalized regression. However, when it comes to tree-based models, we know that those are okay with collinearity, so they will be given the “unpruned” or “wider” set of predictors.
So:
training_data_new / test_data_new: 20
predictors after Near-zero-variance (NZV) removal and r > 0.90
correlation prune. For use with linear models, PLS, and ridge / lasso /
elastic net.training_data_wide / test_data_wide: 31
predictors after just NZV removal. For use with random forest, gradient
boosting, and Cubist.# wide feature set: still did NZV removal, plus has the same outlier filters as the narrow set
training_data_wide <- training_data %>%
select(-Hyd.Pressure1) %>%
filter(PH < 9 | is.na(PH)) %>%
filter(Filler.Level < 160 | is.na(Filler.Level))
training_data_wide_no_na <- training_data_wide %>%
filter(!is.na(PH))
# same column drop on the evaluation set
test_data_wide <- test_data %>%
select(-Hyd.Pressure1)
Hyd.Pressure1 for near-zero varianceMFR instead of
Filler.Speed, since MFR had more missingness
and would have been less reliable to keep.PH = 9.36, one with
Filler.Level = 161.2). Everything else looked liked like
legitimate data.PH from the training data
since we can’t impute the responsePH)training_data_wide_no_na, 31 predictors) by skipping the
correlation prune, for tree-based models that are fine with there being
collinearityRecap located at the end of the section. Skip there to jump past the code/workflow.
sum(is.na(training_data_no_na))
## [1] 515
sort(colSums(is.na(training_data_no_na)), decreasing = TRUE)
## Brand.Code Filler.Speed PC.Volume PSC.CO2 Fill.Ounces
## 120 52 39 39 38
## PSC Carb.Pressure1 Carb.Pressure Carb.Temp PSC.Fill
## 33 32 27 26 23
## Fill.Pressure Hyd.Pressure2 Filler.Level Temperature Oxygen.Filler
## 17 15 15 11 11
## Carb.Volume Usage.cont Carb.Flow Mnf.Flow PH
## 10 5 2 0 0
## Air.Pressurer
## 0
#checking the number of complete cases
table(rowSums(is.na(training_data_no_na)))
##
## 0 1 2 3 4 5 6
## 2169 309 64 18 2 2 1
516 missing values across 2566 cases and 21 variables. This means 516/53886 is about .9% overall missingness, which is very low. Worth noting, brand code is missing in 120 cases.
2169 cases our of 2566 are complete, or about 85%. If we dropped all NAs, we would be dropping 15% of the data. So instead, we’ll be using imputation.
However! the imputation we’re doing here is mostly
an inspection step, plus test_data_imputed gets used by the
eval prediction chunk much later in our worflow, and that’s only if
narrow-set model wins (like a linear model). In our actual model
training, we’re going to let caret do the imputation per CV
fold to avoid data leakage.
Now for test_data_imputed, we’re using VIM
package for simple KNN imputation. This package just fills in the
nearest neighbor; it doesn’t change the other variables. Vim can handle
categorical variables, like brand code.
training_data_imputed <- kNN(training_data_no_na, k = 5)
#making sure it worked
sum(is.na(training_data_imputed))
## [1] 0
First, let’s check how many missing values there are after we’ve removed columns in the evaluation set.
table(rowSums(is.na(test_data_new)))
##
## 1 2 3 4 5 9
## 225 33 6 1 1 1
colSums(is.na(test_data_new))
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## 8 1 6 4 0
## Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow
## 1 5 3 5 0
## Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Filler.Level Filler.Speed
## 4 2 1 2 10
## Temperature Usage.cont Carb.Flow PH Oxygen.Filler
## 2 2 0 267 3
## Air.Pressurer
## 1
#total NAs (subtracting 267 for the missing pH values)
sum(is.na(test_data_new)) - 267
## [1] 60
Now there are only 60 missing values total. 42 rows have missing values. One row has 8 missing values (the one that had 14 before).
Imputing a row that sparse means the filled-in predictor values will be pretty noisy, but the deliverable expects a prediction for every row in the evaluation set. For that reason, we’re opting to keep the row in and make a note that the prediction is low-confidence, but we aren’t going to drop the row.
#keeping all 267 rows
count(test_data_new)
## n
## 1 267
Imputing with the same method we used for the training data:
#list all columns except PH
cols_to_impute <- setdiff(colnames(test_data_new), "PH")
test_data_imputed <- kNN(test_data_new, variable = cols_to_impute, k = 5)
#we should get 267 cases
sum(is.na(test_data_imputed))
## [1] 267
kNN(k = 5) to get training_data_imputedkNN(k = 5)
to get test_data_imputed. The sparse row gets imputed too
but we’re flgging this as a low-confidence prediction.PH in the
eval set) and ready for modelingRecap located at the end of the section. Skip there to jump past the code/workflow.
Originally we made this section to produce an 80/20 split for a quick
sanity check on the data. When we updated our modeling pipeline to use
10-fold CV on the full training_data_no_na data instead, it
meant that we weren’t using these train_x /
test_x objects downstream anymore. For that reason, the
chunk is set to eval=FALSE so it doesn’t run on knit, but
we wanted to keep the code here both for reference, and for the option
to revert if we want to.
set.seed(624)
train_index <- sample(nrow(training_data_imputed),
0.8 * nrow(training_data_imputed))
#create a training set
train_y <- training_data_imputed$PH[train_index]
train_x <- training_data_imputed[train_index, ] %>% dplyr::select(-PH)
#test set
test_y <- training_data_imputed$PH[-train_index]
test_x <- training_data_imputed[-train_index, ] %>% dplyr::select(-PH)
We didn’t scale anything individually in this section, since every package should have an option to center/scale automatically, and use a Box-Cox for the data with less normal distributions.
eval=FALSE and is kept
only as a historical reference. The modeling pipeline uses 10-fold CV on
training_data_no_na directly, so train_x /
test_x etc. are not created on knit.preProc options rather than included in
here.Recap located at the end of the section. Skip there to jump past the workflow.
We fit six models in total: linear regression and elastic net (linear family), random forest and gradient boosting (tree-based ensembles), SVM with a radial kernel, and Cubist (rule-based). The rationale for each individual model is in its own section. The setup decisions below apply to all six.
The datasets that actually get used for fitting. Out
of the several training_data_* variants we built earlier,
only two actually feed the modeling: training_data_no_na
(20 predictors, narrow) and training_data_wide_no_na (31
predictors, wide). The other variants are inspection artifacts or
downstream prep for the eval set.
caret In class and in Kuhn and Johnson,
we used caret::preProcess for imputation, transformations,
and scaling inside the resampling loop. We’re sticking with that for the
sake of sticking with the class standard and what we’ve been learning
how to use effectively. However we did stumble upon the
recipes package made by the same author later. Potentially
worth looking into in the future.
Per-model feature sets. We know we discussed this in
the previous section but saying again. Linear-family models use the
correlation-pruned 20-predictor set (training_data_no_na).
Tree-based and rule-based models (where collinearity is fine) use the
wider 31-predictor set (training_data_wide_no_na).
About R-squared. R-squared values reported throughout this document are out-of-fold averages from cross-validation, not in-sample fits. They reflect how well each model generalizes to unseen data.
Imputation inside resampling. For model fitting,
we’re starting from training_data_no_na (which has not been
imputed) and we’re letting caret handle imputation inside each CV.
preProcess(method = c("YeoJohnson", "center", "scale", "knnImpute"))trainControl(method = "cv", number = 10).We did this so that the imputation parameters are estimated per fold rather than once on the full training set, which avoids bias or data leakage from imputing before the split.
Brand.Code handling. Rather than imputing the 120
missing brand values, we opted to treat missing as its own factor level
called "Unknown". In our EDA section, we saw that the rows
missing a brand did have a bit of a different PH profile from
known-Brand rows, so we’re treating the missingness itself carries
signal worth preserving.
Shared infrastructure for the model comparison.
Every model we fit shares the same controlObject (10-fold
CV) and is trained immediately after set.seed(624). This
way, we’re getting the same CV folds across models, which is what
resamples() requires for side-by-side comparisons of the
models at the end. Each fitted model also gets appended to a
models list along the way.
10-fold CV. We started this work with repeated
10-fold CV (5 repeats) but switched to plain 10-fold once random
forest’s runtime made repeated CV impractical. Plain 10-fold still gives
us the variance estimates resamples() needs for the
cross-model comparison. Applied to all six models for consistency.
Hyperparameter tuning. Each model has parameters
caret can tune over (e.g., mtry for random forest,
alpha and lambda for elastic net,
committees and neighbors for Cubist). Where we
set tuneLength = N, caret tries N values for each tunable
parameter using its own default ranges and picks the combination with
the lowest CV RMSE. We used tuneLength = 10 for most models
and tuneLength = 5 for random forest to keep runtime
manageable.
set.seed(624)
# treating Brand.Code NAs as their own factor level (I guess this is called informative-missingness)
training_for_modeling <- training_data_no_na %>%
mutate(Brand.Code = factor(ifelse(is.na(Brand.Code) | Brand.Code == "", "Unknown", Brand.Code)))
# shared control object for every model.
# We're using this object= and using set.seed(624) right before each train() call
# This way we can compare our CV folds to one another
controlObject <- trainControl(method = "cv", number = 10,
savePredictions = "final")
# simple list to accumulate fitted models.
# goal was to make the resamples() comparison at the end easy
models <- list()
# made a running table of CV metrics
# the set up is each model section appends one row after fitting
model_metrics <- tibble(model = character(),
cv_RMSE = numeric(),
cv_Rsquared = numeric(),
cv_MAE = numeric())
Per-model output pattern. To keep model sections comparable across the rest of this report, each one produces the same chunks in the same order. Anyone adding a new model, here’s the pattern we’re using:
set.seed(624) and the shared
controlObject. Print the trained object so caret shows CV
RMSE / R-squared / MAE. Append getTrainPerf(model) to
model_metrics to add to the final comparison table for the
end.plot(varImp(model), top = 20): to get predictors matter
most.training_data_imputed is an inspection
/ sanity-check artifact for imputation. The actual model fitting
startswith data that has not been imputed yet
(training_data_no_na) and lets caret impute inside each CV
fold to avoid data leakage.Brand.Code missing values have been kept as an
“Unknown” factor level rather than imputed.controlObject and are trained after
set.seed(624). This was so the CV folds would be the same
for all models, which is what resamples() needs to compare
the models at the end.model_metrics so the final comparison table
builds itself as we go.Recap located at the end of the section. Skip there to jump past the code/workflow.
First model: ordinary linear regression. It assumes a straight-line relationship between each predictor and pH and gives us a baseline that more flexible models need to beat. It also gives us interpretable coefficients to compare against the variable-importance rankings from later models. Uses the narrow 20-predictor set with the full preProcess pipeline (YeoJohnson, center, scale, knnImpute).
Numeric pre-processing. Inside the
train() call, we’re using YeoJohnson rather than Box-Cox
because four predictors in the narrow set (Mnf.Flow,
Hyd.Pressure2, PSC.Fill, PSC.CO2)
had non-positive values. YeoJohnson is able to handle that data. After
YeoJohnson, predictors are centered, scaled, and KNN-imputed inside each
CV fold.
set.seed(624)
lm_model <- train(
PH ~ .,
data = training_for_modeling,
method = "lm",
trControl = controlObject,
preProcess = c("YeoJohnson", "center", "scale", "knnImpute"),
na.action = na.pass
)
models$lm <- lm_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(lm_model) %>%
transmute(model = "lm",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
lm_model
## Linear Regression
##
## 2565 samples
## 20 predictor
##
## Pre-processing: Yeo-Johnson transformation (23), centered (23), scaled
## (23), nearest neighbor imputation (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.135323 0.3819844 0.1060805
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Variable importance. Which predictors matter most for pH under linear assumptions.
plot(varImp(lm_model), top = 20)
Coefficients. Each coefficient tells us how a predictor moves pH: positive pushes pH up, negative pushes it down, and bigger magnitudes mean stronger effects. Sorted by absolute t-statistic. (The t-statistic is used to sort by becuase it ranks the predictors by how confidently we can say their effect is real).
lm_coefs <- summary(lm_model$finalModel)$coefficients %>%
as.data.frame() %>%
tibble::rownames_to_column("predictor") %>%
arrange(desc(abs(`t value`)))
kable(lm_coefs, digits = 4)
| predictor | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 8.5458 | 0.0027 | 3204.4593 | 0.0000 |
| Mnf.Flow | -0.0783 | 0.0051 | -15.3528 | 0.0000 |
| Carb.Pressure1 | 0.0291 | 0.0033 | 8.8721 | 0.0000 |
| Brand.CodeD | 0.0382 | 0.0046 | 8.3867 | 0.0000 |
| Filler.Level | 0.0314 | 0.0039 | 8.1312 | 0.0000 |
| Brand.CodeC | -0.0285 | 0.0041 | -6.9551 | 0.0000 |
| Usage.cont | -0.0236 | 0.0034 | -6.9504 | 0.0000 |
| Hyd.Pressure2 | 0.0229 | 0.0040 | 5.7547 | 0.0000 |
| Temperature | -0.0172 | 0.0031 | -5.5098 | 0.0000 |
| Brand.CodeB | 0.0259 | 0.0051 | 5.0266 | 0.0000 |
| Oxygen.Filler | -0.0141 | 0.0033 | -4.3139 | 0.0000 |
| Carb.Flow | 0.0123 | 0.0038 | 3.2128 | 0.0013 |
| Fill.Ounces | -0.0065 | 0.0028 | -2.3031 | 0.0214 |
| PC.Volume | -0.0069 | 0.0032 | -2.1764 | 0.0296 |
| PSC.CO2 | -0.0055 | 0.0028 | -1.9824 | 0.0475 |
| Fill.Pressure | 0.0056 | 0.0031 | 1.7896 | 0.0736 |
| PSC.Fill | -0.0045 | 0.0028 | -1.6172 | 0.1060 |
| PSC | -0.0037 | 0.0029 | -1.3111 | 0.1900 |
| Brand.CodeUnknown | -0.0041 | 0.0034 | -1.2032 | 0.2290 |
| Carb.Volume | -0.0080 | 0.0073 | -1.0976 | 0.2725 |
| Filler.Speed | -0.0035 | 0.0035 | -0.9768 | 0.3288 |
| Air.Pressurer | -0.0026 | 0.0028 | -0.9043 | 0.3659 |
| Carb.Temp | 0.0031 | 0.0097 | 0.3154 | 0.7525 |
| Carb.Pressure | 0.0005 | 0.0107 | 0.0475 | 0.9621 |
Residual summary. A quick look at how big the model’s errors are (using out-of-fold predictions) and how well its predictions cover the observed pH range.
lm_preds <- lm_model$pred
lm_preds$residual <- lm_preds$obs - lm_preds$pred
residual_summary <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(lm_preds$residual)),
sprintf("%.4f", sd(lm_preds$residual)),
sprintf("%.4f", max(abs(lm_preds$residual))),
sprintf("%.4f", quantile(abs(lm_preds$residual), 0.99)),
sprintf("%.4f", cor(lm_preds$pred, lm_preds$residual)),
sprintf("%.3f to %.3f", min(lm_preds$pred), max(lm_preds$pred)),
sprintf("%.3f to %.3f", min(lm_preds$obs), max(lm_preds$obs))
)
)
kable(residual_summary)
| metric | value |
|---|---|
| Mean residual | 0.00010 |
| SD residual | 0.1354 |
| Max |residual| | 0.5222 |
| 99th pct |residual| | 0.3819 |
| Cor(predicted, residual) | -0.0122 |
| Predicted PH range | 8.243 to 8.821 |
| Observed PH range | 7.880 to 8.940 |
models$lm, and the CV metrics got appended to
model_metrics.Mnf.Flow is the dominant one, then
Carb.Pressure1, Brand.CodeD,
Filler.Level, Brand.CodeC,
Usage.cont, Hyd.Pressure2, and
Temperature.Mnf.Flow,
Usage.cont, Temperature, and
Oxygen.Filler are associated with pushing pH down. Higher
Carb.Pressure1, Filler.Level, and
Hyd.Pressure2 are associated with pushing pH up. Brand D
has the highest baseline pH, Brand C has the lowest, with a spread of
about 0.07 pH units between them.Recap located at the end of the section. Skip there to jump past the code/workflow.
Next model: elastic net via glmnet. It’s a regularized
version of linear regression that blends ridge (L2) and lasso (L1)
penalties. caret tunes both alpha (0 = pure ridge, 1 = pure lasso, in
between = blended) and lambda (penalty strength), so the tuned model
lands wherever on the ridge-to-lasso spectrum the data prefers. We’re
using elastic net rather than pure ridge or pure lasso so the data can
choose. Uses the narrow 20-predictor set.
set.seed(624)
elastic_model <- train(
PH ~ .,
data = training_for_modeling,
method = "glmnet",
trControl = controlObject,
preProcess = c("YeoJohnson", "center", "scale", "knnImpute"),
na.action = na.pass,
tuneLength = 10
)
models$elastic_net <- elastic_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(elastic_model) %>%
transmute(model = "elastic_net",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
elastic_model
## glmnet
##
## 2565 samples
## 20 predictor
##
## Pre-processing: Yeo-Johnson transformation (23), centered (23), scaled
## (23), nearest neighbor imputation (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 8.205501e-05 0.1352969 0.3821881 0.1060736
## 0.1 1.895577e-04 0.1352969 0.3821881 0.1060736
## 0.1 4.379029e-04 0.1352967 0.3821888 0.1060734
## 0.1 1.011613e-03 0.1352888 0.3822386 0.1060833
## 0.1 2.336956e-03 0.1352949 0.3822044 0.1061194
## 0.1 5.398672e-03 0.1354210 0.3814225 0.1062983
## 0.1 1.247163e-02 0.1359829 0.3781338 0.1069120
## 0.1 2.881109e-02 0.1378184 0.3676227 0.1087592
## 0.1 6.655735e-02 0.1417918 0.3503984 0.1124650
## 0.2 8.205501e-05 0.1352949 0.3821971 0.1060705
## 0.2 1.895577e-04 0.1352949 0.3821971 0.1060705
## 0.2 4.379029e-04 0.1352918 0.3822272 0.1060711
## 0.2 1.011613e-03 0.1352845 0.3822709 0.1060817
## 0.2 2.336956e-03 0.1353236 0.3819867 0.1061460
## 0.2 5.398672e-03 0.1355763 0.3803218 0.1064271
## 0.2 1.247163e-02 0.1364983 0.3748519 0.1074226
## 0.2 2.881109e-02 0.1393419 0.3578141 0.1101886
## 0.2 6.655735e-02 0.1452406 0.3306459 0.1155264
## 0.3 8.205501e-05 0.1352948 0.3821931 0.1060698
## 0.3 1.895577e-04 0.1352948 0.3821931 0.1060698
## 0.3 4.379029e-04 0.1352884 0.3822508 0.1060692
## 0.3 1.011613e-03 0.1352888 0.3822301 0.1060889
## 0.3 2.336956e-03 0.1353692 0.3816357 0.1061863
## 0.3 5.398672e-03 0.1357395 0.3792223 0.1065789
## 0.3 1.247163e-02 0.1371641 0.3702854 0.1080726
## 0.3 2.881109e-02 0.1409657 0.3469195 0.1117011
## 0.3 6.655735e-02 0.1482228 0.3176529 0.1180306
## 0.4 8.205501e-05 0.1352939 0.3822017 0.1060665
## 0.4 1.895577e-04 0.1352936 0.3822045 0.1060664
## 0.4 4.379029e-04 0.1352861 0.3822642 0.1060680
## 0.4 1.011613e-03 0.1352978 0.3821529 0.1060979
## 0.4 2.336956e-03 0.1354221 0.3812355 0.1062318
## 0.4 5.398672e-03 0.1359219 0.3780308 0.1067725
## 0.4 1.247163e-02 0.1379422 0.3645416 0.1088196
## 0.4 2.881109e-02 0.1426845 0.3342049 0.1131936
## 0.4 6.655735e-02 0.1516498 0.2956408 0.1209188
## 0.5 8.205501e-05 0.1352950 0.3821991 0.1060671
## 0.5 1.895577e-04 0.1352930 0.3822194 0.1060659
## 0.5 4.379029e-04 0.1352847 0.3822703 0.1060679
## 0.5 1.011613e-03 0.1353115 0.3820366 0.1061091
## 0.5 2.336956e-03 0.1354835 0.3807707 0.1062798
## 0.5 5.398672e-03 0.1361376 0.3765940 0.1069982
## 0.5 1.247163e-02 0.1387742 0.3581356 0.1096054
## 0.5 2.881109e-02 0.1440213 0.3261374 0.1143598
## 0.5 6.655735e-02 0.1547604 0.2760646 0.1236015
## 0.6 8.205501e-05 0.1352950 0.3822063 0.1060660
## 0.6 1.895577e-04 0.1352916 0.3822337 0.1060648
## 0.6 4.379029e-04 0.1352859 0.3822556 0.1060711
## 0.6 1.011613e-03 0.1353280 0.3818988 0.1061223
## 0.6 2.336956e-03 0.1355465 0.3803100 0.1063358
## 0.6 5.398672e-03 0.1363875 0.3748844 0.1072553
## 0.6 1.247163e-02 0.1394883 0.3528928 0.1102807
## 0.6 2.881109e-02 0.1451476 0.3210245 0.1153005
## 0.6 6.655735e-02 0.1576892 0.2544858 0.1260112
## 0.7 8.205501e-05 0.1352953 0.3822031 0.1060662
## 0.7 1.895577e-04 0.1352901 0.3822439 0.1060639
## 0.7 4.379029e-04 0.1352879 0.3822350 0.1060745
## 0.7 1.011613e-03 0.1353454 0.3817574 0.1061353
## 0.7 2.336956e-03 0.1356063 0.3798965 0.1063980
## 0.7 5.398672e-03 0.1366529 0.3730540 0.1075254
## 0.7 1.247163e-02 0.1401931 0.3477693 0.1109172
## 0.7 2.881109e-02 0.1464599 0.3134222 0.1163966
## 0.7 6.655735e-02 0.1606661 0.2229432 0.1286070
## 0.8 8.205501e-05 0.1352955 0.3822027 0.1060658
## 0.8 1.895577e-04 0.1352885 0.3822534 0.1060632
## 0.8 4.379029e-04 0.1352905 0.3822109 0.1060781
## 0.8 1.011613e-03 0.1353639 0.3816088 0.1061501
## 0.8 2.336956e-03 0.1356712 0.3794523 0.1064668
## 0.8 5.398672e-03 0.1369502 0.3709247 0.1078254
## 0.8 1.247163e-02 0.1409764 0.3416701 0.1116031
## 0.8 2.881109e-02 0.1479698 0.3023956 0.1176199
## 0.8 6.655735e-02 0.1630316 0.2026599 0.1307093
## 0.9 8.205501e-05 0.1352944 0.3822071 0.1060651
## 0.9 1.895577e-04 0.1352876 0.3822582 0.1060628
## 0.9 4.379029e-04 0.1352940 0.3821791 0.1060822
## 0.9 1.011613e-03 0.1353852 0.3814395 0.1061664
## 0.9 2.336956e-03 0.1357447 0.3789511 0.1065437
## 0.9 5.398672e-03 0.1372838 0.3684240 0.1081469
## 0.9 1.247163e-02 0.1418270 0.3345759 0.1123365
## 0.9 2.881109e-02 0.1495628 0.2889571 0.1189366
## 0.9 6.655735e-02 0.1649994 0.2025497 0.1324476
## 1.0 8.205501e-05 0.1352943 0.3822066 0.1060644
## 1.0 1.895577e-04 0.1352870 0.3822609 0.1060626
## 1.0 4.379029e-04 0.1352986 0.3821371 0.1060865
## 1.0 1.011613e-03 0.1354075 0.3812623 0.1061845
## 1.0 2.336956e-03 0.1358259 0.3783950 0.1066304
## 1.0 5.398672e-03 0.1376515 0.3655664 0.1085026
## 1.0 1.247163e-02 0.1425809 0.3284122 0.1130134
## 1.0 2.881109e-02 0.1509621 0.2776060 0.1201083
## 1.0 6.655735e-02 0.1673016 0.2025497 0.1344976
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.2 and lambda = 0.001011613.
Variable importance.
plot(varImp(elastic_model), top = 20)
Coefficients at the tuned lambda. glmnet doesn’t expose t-statistics (penalized regression invalidates classical inference), so we report just the coefficients sorted by absolute size. Coefficients exactly equal to zero have been shrunk out by the lasso component.
elastic_coefs <- coef(elastic_model$finalModel,
s = elastic_model$bestTune$lambda) %>%
as.matrix() %>%
as.data.frame()
colnames(elastic_coefs) <- "coefficient"
elastic_coefs <- elastic_coefs %>%
tibble::rownames_to_column("predictor") %>%
arrange(desc(abs(coefficient)))
kable(elastic_coefs, digits = 4)
| predictor | coefficient |
|---|---|
| (Intercept) | 8.5458 |
| Mnf.Flow | -0.0765 |
| Brand.CodeD | 0.0370 |
| Filler.Level | 0.0308 |
| Brand.CodeC | -0.0285 |
| Carb.Pressure1 | 0.0283 |
| Brand.CodeB | 0.0254 |
| Usage.cont | -0.0234 |
| Hyd.Pressure2 | 0.0215 |
| Temperature | -0.0172 |
| Oxygen.Filler | -0.0132 |
| Carb.Flow | 0.0118 |
| Carb.Volume | -0.0067 |
| Fill.Ounces | -0.0066 |
| PC.Volume | -0.0063 |
| PSC.CO2 | -0.0053 |
| Fill.Pressure | 0.0049 |
| PSC.Fill | -0.0043 |
| Brand.CodeUnknown | -0.0039 |
| PSC | -0.0039 |
| Carb.Temp | 0.0034 |
| Filler.Speed | -0.0029 |
| Air.Pressurer | -0.0025 |
| Carb.Pressure | 0.0000 |
Residual summary.
elastic_preds <- elastic_model$pred
elastic_preds$residual <- elastic_preds$obs - elastic_preds$pred
residual_summary_elastic <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(elastic_preds$residual)),
sprintf("%.4f", sd(elastic_preds$residual)),
sprintf("%.4f", max(abs(elastic_preds$residual))),
sprintf("%.4f", quantile(abs(elastic_preds$residual), 0.99)),
sprintf("%.4f", cor(elastic_preds$pred, elastic_preds$residual)),
sprintf("%.3f to %.3f", min(elastic_preds$pred), max(elastic_preds$pred)),
sprintf("%.3f to %.3f", min(elastic_preds$obs), max(elastic_preds$obs))
)
)
kable(residual_summary_elastic)
| metric | value |
|---|---|
| Mean residual | 0.00009 |
| SD residual | 0.1354 |
| Max |residual| | 0.5211 |
| 99th pct |residual| | 0.3830 |
| Cor(predicted, residual) | -0.0025 |
| Predicted PH range | 8.247 to 8.814 |
| Observed PH range | 7.880 to 8.940 |
models$elastic_net, with CV metrics appended to
model_metrics.Carb.Pressure got zeroed
out by the lasso component (in lm it had p = 0.96 and a near-zero
estimate, so this seemed like the right call).Recap located at the end of the section. Skip there to jump past the code/workflow.
First tree-based model: random forest via the
randomForest engine, wrapped by caret. Random forest fits
many decorrelated trees on bootstrap samples and averages their
predictions, which tends to reduce variance significantly compared to a
single tree. Trees don’t need Box-Cox or scaling, and they handle
collinearity natively, so this is where the wider 31-predictor set earns
its keep. We’re using random forest as the standard tree-based starting
point for tabular regression problems.
Wide-set setup. We mirror the Brand.Code “Unknown”
treatment we used for the linear-family models, but on the wider feature
set (training_data_wide_no_na).
training_for_modeling_wide <- training_data_wide_no_na %>%
mutate(Brand.Code = factor(ifelse(is.na(Brand.Code) | Brand.Code == "", "Unknown", Brand.Code)))
set.seed(624)
rf_model <- train(
PH ~ .,
data = training_for_modeling_wide,
method = "rf",
trControl = controlObject,
preProcess = c("knnImpute"),
na.action = na.pass,
tuneLength = 5,
ntree = 200,
importance = TRUE
)
models$random_forest <- rf_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(rf_model) %>%
transmute(model = "random_forest",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
rf_model
## Random Forest
##
## 2565 samples
## 31 predictor
##
## Pre-processing: nearest neighbor imputation (34), centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.10836944 0.6475309 0.08201382
## 10 0.09602037 0.7051168 0.07021546
## 18 0.09352415 0.7164101 0.06814411
## 26 0.09262867 0.7176686 0.06724975
## 34 0.09264205 0.7145657 0.06679322
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
Variable importance.
plot(varImp(rf_model), top = 20)
Residual summary.
rf_preds <- rf_model$pred
rf_preds$residual <- rf_preds$obs - rf_preds$pred
residual_summary_rf <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(rf_preds$residual)),
sprintf("%.4f", sd(rf_preds$residual)),
sprintf("%.4f", max(abs(rf_preds$residual))),
sprintf("%.4f", quantile(abs(rf_preds$residual), 0.99)),
sprintf("%.4f", cor(rf_preds$pred, rf_preds$residual)),
sprintf("%.3f to %.3f", min(rf_preds$pred), max(rf_preds$pred)),
sprintf("%.3f to %.3f", min(rf_preds$obs), max(rf_preds$obs))
)
)
kable(residual_summary_rf)
| metric | value |
|---|---|
| Mean residual | 0.00016 |
| SD residual | 0.0927 |
| Max |residual| | 0.4936 |
| 99th pct |residual| | 0.3129 |
| Cor(predicted, residual) | 0.1688 |
| Predicted PH range | 8.106 to 8.825 |
| Observed PH range | 7.880 to 8.940 |
models$random_forest,
CV metrics appended to model_metrics.mtry = 26 (out of candidates 2, 10, 18, 26, 34).
caret’s default heuristic for regression is p/3 (about 11
for our wider set), so the data preferred sampling more predictors per
split than the default. Performance was nearly flat between mtry = 26
and 34.Recap located at the end of the section. Skip there to jump past the code/workflow.
Next tree-based model: gradient boosted regression trees via
gbm, wrapped by caret. gbm fits a sequence of shallow trees
where each tree corrects the previous tree’s residuals. The additive
ensemble often edges out random forest on tabular data with moderate
complexity. We’re trying gbm to test that pattern on this dataset. Like
RF, gbm doesn’t need scaling or Box-Cox, and it uses the wider
31-predictor feature set.
set.seed(624)
gbm_model <- train(
PH ~ .,
data = training_for_modeling_wide,
method = "gbm",
trControl = controlObject,
preProcess = c("knnImpute"),
na.action = na.pass,
tuneLength = 5,
verbose = FALSE
)
models$gbm <- gbm_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(gbm_model) %>%
transmute(model = "gbm",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
gbm_model
## Stochastic Gradient Boosting
##
## 2565 samples
## 31 predictor
##
## Pre-processing: nearest neighbor imputation (34), centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 0.1347471 0.4054205 0.10671216
## 1 100 0.1301928 0.4381286 0.10291059
## 1 150 0.1278529 0.4565340 0.10070481
## 1 200 0.1266412 0.4640929 0.09913740
## 1 250 0.1259596 0.4677972 0.09834841
## 2 50 0.1269948 0.4699914 0.10037830
## 2 100 0.1218318 0.5060873 0.09517020
## 2 150 0.1190883 0.5255693 0.09221965
## 2 200 0.1172401 0.5384435 0.09033028
## 2 250 0.1161839 0.5455670 0.08926316
## 3 50 0.1224024 0.5075129 0.09612938
## 3 100 0.1169598 0.5433132 0.09064043
## 3 150 0.1141460 0.5619903 0.08800436
## 3 200 0.1127828 0.5712665 0.08656881
## 3 250 0.1115276 0.5798266 0.08533088
## 4 50 0.1190138 0.5335019 0.09268078
## 4 100 0.1131954 0.5726500 0.08728230
## 4 150 0.1106681 0.5895382 0.08459785
## 4 200 0.1089566 0.6001547 0.08293913
## 4 250 0.1076794 0.6088125 0.08172274
## 5 50 0.1168357 0.5493290 0.09055053
## 5 100 0.1115768 0.5836692 0.08555954
## 5 150 0.1100452 0.5925252 0.08377814
## 5 200 0.1084427 0.6035992 0.08200728
## 5 250 0.1072945 0.6118269 0.08100282
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 5, shrinkage = 0.1 and n.minobsinnode = 10.
Variable importance.
plot(varImp(gbm_model), top = 20)
Residual summary.
gbm_preds <- gbm_model$pred
gbm_preds$residual <- gbm_preds$obs - gbm_preds$pred
residual_summary_gbm <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(gbm_preds$residual)),
sprintf("%.4f", sd(gbm_preds$residual)),
sprintf("%.4f", max(abs(gbm_preds$residual))),
sprintf("%.4f", quantile(abs(gbm_preds$residual), 0.99)),
sprintf("%.4f", cor(gbm_preds$pred, gbm_preds$residual)),
sprintf("%.3f to %.3f", min(gbm_preds$pred), max(gbm_preds$pred)),
sprintf("%.3f to %.3f", min(gbm_preds$obs), max(gbm_preds$obs))
)
)
kable(residual_summary_gbm)
| metric | value |
|---|---|
| Mean residual | -0.00001 |
| SD residual | 0.1074 |
| Max |residual| | 0.5212 |
| 99th pct |residual| | 0.3347 |
| Cor(predicted, residual) | 0.0277 |
| Predicted PH range | 8.147 to 8.880 |
| Observed PH range | 7.880 to 8.940 |
models$gbm, with CV metrics appended to
model_metrics.Mnf.Flow (100, top),
Usage.cont (34), Brand.CodeC (33),
Alch.Rel (31), Pressure.Vacuum (24),
Temperature (23), Oxygen.Filler (22),
Carb.Pressure1 (18). The Mnf.Flow + brand +
Pressure.Vacuum + Alch.Rel +
Oxygen + Temperature cluster stays consistent
with random forest.Recap located at the end of the section. Skip there to jump past the code/workflow.
Next model: support vector machine with a radial basis function (RBF)
kernel via kernlab, wrapped by caret. SVM-radial fits a
nonlinear decision boundary, which lets it capture curved relationships
that linear models can’t. We’re using it as a nonlinear, non-tree
alternative to round out the methodological mix. SVM-radial is sensitive
to scaling, so it uses the narrow feature set with the same preProcess
pipeline as the linear-family models (YeoJohnson, center, scale,
knnImpute). SVM doesn’t natively output variable importance. caret falls
back to a filter-based metric (LOESS pseudo-R-squared on each
predictor).
set.seed(624)
svm_model <- train(
PH ~ .,
data = training_for_modeling,
method = "svmRadial",
trControl = controlObject,
preProcess = c("YeoJohnson", "center", "scale", "knnImpute"),
na.action = na.pass,
tuneLength = 10
)
models$svm_radial <- svm_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(svm_model) %>%
transmute(model = "svm_radial",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 2565 samples
## 20 predictor
##
## Pre-processing: Yeo-Johnson transformation (23), centered (23), scaled
## (23), nearest neighbor imputation (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1270926 0.4615164 0.09621241
## 0.50 0.1254609 0.4735196 0.09454896
## 1.00 0.1240362 0.4845785 0.09310004
## 2.00 0.1228184 0.4944277 0.09201413
## 4.00 0.1228893 0.4963252 0.09200250
## 8.00 0.1235185 0.4952756 0.09258064
## 16.00 0.1259486 0.4829499 0.09452523
## 32.00 0.1309150 0.4589153 0.09847563
## 64.00 0.1377398 0.4296198 0.10351639
## 128.00 0.1481310 0.3896024 0.11151409
##
## Tuning parameter 'sigma' was held constant at a value of 0.02806381
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02806381 and C = 2.
Variable importance. (Filter-based since SVM doesn’t have native importance.)
plot(varImp(svm_model), top = 20)
Residual summary.
svm_preds <- svm_model$pred
svm_preds$residual <- svm_preds$obs - svm_preds$pred
residual_summary_svm <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(svm_preds$residual)),
sprintf("%.4f", sd(svm_preds$residual)),
sprintf("%.4f", max(abs(svm_preds$residual))),
sprintf("%.4f", quantile(abs(svm_preds$residual), 0.99)),
sprintf("%.4f", cor(svm_preds$pred, svm_preds$residual)),
sprintf("%.3f to %.3f", min(svm_preds$pred), max(svm_preds$pred)),
sprintf("%.3f to %.3f", min(svm_preds$obs), max(svm_preds$obs))
)
)
kable(residual_summary_svm)
| metric | value |
|---|---|
| Mean residual | -0.00785 |
| SD residual | 0.1227 |
| Max |residual| | 0.5218 |
| 99th pct |residual| | 0.3696 |
| Cor(predicted, residual) | -0.0362 |
| Predicted PH range | 8.181 to 8.858 |
| Observed PH range | 7.880 to 8.940 |
models$svm_radial,
with CV metrics appended to model_metrics.Mnf.Flow (100),
Usage.cont (92), Oxygen.Filler (87),
Filler.Level (66), Filler.Speed (61),
PC.Volume (47), Hyd.Pressure2 (44),
Fill.Pressure (43), Brand.Code (31),
Carb.Pressure1 (30). Worth noting this measures individual
predictor strength against pH, not contribution to the SVM model
itself.Recap located at the end of the section. Skip there to jump past the code/workflow.
Last model: Cubist, a rule-based regression model wrapped by caret. Cubist partitions the data into IF-THEN rules (similar to a decision tree) but fits a linear regression model inside each rule’s terminal node. Predictions are then averaged across committees of rules. We’re using Cubist as a hybrid between tree-based and linear approaches: the rules give us tree-like flexibility, and the linear models inside each rule give us interpretable local fits. Uses the wider 31-predictor feature set. The preProcess pipeline includes YeoJohnson and center/scale alongside knnImpute since the linear models inside each rule can benefit from transformed and scaled inputs.
set.seed(624)
cubist_model <- train(
PH ~ .,
data = training_for_modeling_wide,
method = "cubist",
trControl = controlObject,
preProcess = c("YeoJohnson", "center", "scale", "knnImpute"),
na.action = na.pass,
tuneLength = 10
)
models$cubist <- cubist_model
# append CV metrics to the running comparison table
model_metrics <- bind_rows(
model_metrics,
getTrainPerf(cubist_model) %>%
transmute(model = "cubist",
cv_RMSE = TrainRMSE,
cv_Rsquared = TrainRsquared,
cv_MAE = TrainMAE)
)
# Save the fitted model so the executive summary Rmd can load it
# (one source of truth for the chosen final model)
saveRDS(cubist_model, file.path(projPath, "cubist_model.rds"))
cubist_model
## Cubist
##
## 2565 samples
## 31 predictor
##
## Pre-processing: Yeo-Johnson transformation (34), centered (34), scaled
## (34), nearest neighbor imputation (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2309, 2308, 2308, 2309, 2308, 2309, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.11709102 0.5663388 0.07956496
## 1 5 0.11397926 0.5970794 0.07691098
## 1 9 0.11341373 0.5955684 0.07676179
## 10 0 0.09788118 0.6797203 0.07145907
## 10 5 0.09304572 0.7073057 0.06653415
## 10 9 0.09301498 0.7071579 0.06697869
## 20 0 0.09712281 0.6859118 0.07090767
## 20 5 0.09278729 0.7082375 0.06621061
## 20 9 0.09259569 0.7093332 0.06663032
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 9.
Variable importance.
plot(varImp(cubist_model), top = 20)
Residual summary.
cubist_preds <- cubist_model$pred
cubist_preds$residual <- cubist_preds$obs - cubist_preds$pred
residual_summary_cubist <- tibble(
metric = c("Mean residual",
"SD residual",
"Max |residual|",
"99th pct |residual|",
"Cor(predicted, residual)",
"Predicted PH range",
"Observed PH range"),
value = c(
sprintf("%.5f", mean(cubist_preds$residual)),
sprintf("%.4f", sd(cubist_preds$residual)),
sprintf("%.4f", max(abs(cubist_preds$residual))),
sprintf("%.4f", quantile(abs(cubist_preds$residual), 0.99)),
sprintf("%.4f", cor(cubist_preds$pred, cubist_preds$residual)),
sprintf("%.3f to %.3f", min(cubist_preds$pred), max(cubist_preds$pred)),
sprintf("%.3f to %.3f", min(cubist_preds$obs), max(cubist_preds$obs))
)
)
kable(residual_summary_cubist)
| metric | value |
|---|---|
| Mean residual | -0.00184 |
| SD residual | 0.0927 |
| Max |residual| | 0.5606 |
| 99th pct |residual| | 0.3092 |
| Cor(predicted, residual) | 0.0125 |
| Predicted PH range | 8.089 to 9.050 |
| Observed PH range | 7.880 to 8.940 |
models$cubist, CV metrics appended
to model_metrics.Recap located at the end of the section. Skip there to jump past the code/workflow.
All six models trained on the same shared CV folds. Comparing on
identical resampling structure via resamples():
resamps <- resamples(models)
# Per-model distribution of CV RMSE, R-squared, MAE
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: lm, elastic_net, random_forest, gbm, svm_radial, cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lm 0.09944406 0.10344348 0.10579847 0.10608045 0.10801644 0.11503317
## elastic_net 0.09931834 0.10366849 0.10565713 0.10608166 0.10814000 0.11490515
## random_forest 0.06301973 0.06535915 0.06826452 0.06724975 0.06913352 0.07021189
## gbm 0.07716085 0.08009642 0.08069963 0.08100282 0.08199387 0.08410312
## svm_radial 0.08474276 0.08916845 0.09162930 0.09201413 0.09451886 0.10020877
## cubist 0.06348532 0.06501563 0.06592527 0.06663032 0.06758502 0.07097344
## NA's
## lm 0
## elastic_net 0
## random_forest 0
## gbm 0
## svm_radial 0
## cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lm 0.12545563 0.13435216 0.13577290 0.13532304 0.13683712 0.14307621
## elastic_net 0.12534418 0.13441468 0.13555477 0.13528448 0.13693033 0.14305028
## random_forest 0.08598685 0.09052627 0.09304523 0.09262867 0.09548760 0.09804908
## gbm 0.10395866 0.10536129 0.10651967 0.10729448 0.10913545 0.11301913
## svm_radial 0.11475325 0.11915590 0.12344822 0.12281843 0.12541262 0.13163894
## cubist 0.08348632 0.08981811 0.09280274 0.09259569 0.09716919 0.09864226
## NA's
## lm 0
## elastic_net 0
## random_forest 0
## gbm 0
## svm_radial 0
## cubist 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 0.2899607 0.3565501 0.3810568 0.3819844 0.4171481 0.4590614 0
## elastic_net 0.2893673 0.3570577 0.3813374 0.3822709 0.4166551 0.4597328 0
## random_forest 0.6549461 0.6918054 0.7207900 0.7176686 0.7334283 0.7828948 0
## gbm 0.5533038 0.5802150 0.6141933 0.6118269 0.6337196 0.6765891 0
## svm_radial 0.4614281 0.4713815 0.4818061 0.4944277 0.5115837 0.5712570 0
## cubist 0.6180778 0.6944906 0.7108700 0.7093332 0.7304141 0.7711805 0
# Side-by-side boxplots of per-fold RMSE; lower median is better, narrow box is more stable
bwplot(resamps, metric = "RMSE")
# Means with confidence intervals for RMSE and R-squared
dotplot(resamps, metric = c("RMSE", "Rsquared"))
# Sorted comparison table from the running metrics
kable(model_metrics %>% arrange(cv_RMSE))
| model | cv_RMSE | cv_Rsquared | cv_MAE |
|---|---|---|---|
| cubist | 0.0925957 | 0.7093332 | 0.0666303 |
| random_forest | 0.0926287 | 0.7176686 | 0.0672497 |
| gbm | 0.1072945 | 0.6118269 | 0.0810028 |
| svm_radial | 0.1228184 | 0.4944277 | 0.0920141 |
| elastic_net | 0.1352845 | 0.3822709 | 0.1060817 |
| lm | 0.1353230 | 0.3819844 | 0.1060805 |
Recap located at the end of the section. Skip there to jump past the code/workflow.
The deliverable requires predictions on the evaluation set. We set
this up to pick the best model from model_metrics based on
the lowest CV RMSE and write the predictions to
abc_beverage_PH_predictions.csv. The formal Excel
deliverable, ABC Beverage Final Sheet.xlsx, is built from
these same predictions and adds out-of-fold residuals, the model
comparison table, and ranked top factors with business interpretations.
If we add new models in the future, just make sure to add the new
model’s name to the model_feature_set mapping inside the
chunk so the eval set gets prepared with the right feature set.
# Find the best model by cross-validated RMSE
best_model_name <- model_metrics %>%
arrange(cv_RMSE) %>%
slice(1) %>%
pull(model)
best_model <- models[[best_model_name]]
cat("Best model selected:", best_model_name, "\n")
## Best model selected: cubist
cat("CV RMSE:", round(model_metrics %>% filter(model == best_model_name) %>% pull(cv_RMSE), 4), "\n")
## CV RMSE: 0.0926
# Map each model to which feature set it was trained on. If we ADD A NEW MODEL
# must also add an entry here so the eval set gets prepared with the right column
model_feature_set <- c(
lm = "narrow",
elastic_net = "narrow",
svm_radial = "narrow",
random_forest = "wide",
gbm = "wide",
cubist = "wide"
)
if (!best_model_name %in% names(model_feature_set)) {
warning("Best model '", best_model_name, "' is not in the model_feature_set mapping. ",
"Defaulting to wide; if that's wrong, update the mapping in this chunk.")
feature_set_choice <- "wide"
} else {
feature_set_choice <- model_feature_set[[best_model_name]]
}
# Prepare the eval set with the right features and Brand.Code level alignment
if (feature_set_choice == "narrow") {
eval_for_prediction <- test_data_imputed %>%
mutate(Brand.Code = factor(
ifelse(is.na(Brand.Code) | Brand.Code == "", "Unknown", as.character(Brand.Code)),
levels = levels(training_for_modeling$Brand.Code)
))
} else {
eval_for_prediction <- test_data_wide %>%
mutate(Brand.Code = factor(
ifelse(is.na(Brand.Code) | Brand.Code == "", "Unknown", as.character(Brand.Code)),
levels = levels(training_for_modeling_wide$Brand.Code)
))
}
# Predict
eval_predictions <- predict(best_model, newdata = eval_for_prediction, na.action = na.pass)
# Flag rows that were sparse in the original eval data (predictor NAs only)
original_predictor_nas <- rowSums(is.na(test_data %>% select(-PH)))
# Build the output data frame
predictions_output <- tibble(
row_id = seq_along(eval_predictions),
Brand.Code = as.character(eval_for_prediction$Brand.Code),
predicted_PH = round(eval_predictions, 4),
notes = ifelse(original_predictor_nas >= 5,
paste0("low confidence: original row had ", original_predictor_nas, " predictor NAs"),
"")
)
# Write CSV alongside the rmd
write_csv(predictions_output, file.path(projPath, "abc_beverage_PH_predictions.csv"))
# Show the first few rows for confirmation
head(predictions_output)
## # A tibble: 6 × 4
## row_id Brand.Code predicted_PH notes
## <int> <chr> <dbl> <chr>
## 1 1 D 8.60 ""
## 2 2 A 8.42 ""
## 3 3 B 8.53 ""
## 4 4 B 8.59 ""
## 5 5 B 8.44 ""
## 6 6 B 8.56 ""
abc_beverage_PH_predictions.csv on each knit. The output
has four columns: row_id, Brand.Code,
predicted_PH, and a notes column that flags
rows where the original evaluation data had 5+ predictor NAs (since
those predictions lean heavily on imputed values). The formal Excel
deliverable is ABC Beverage Final Sheet.xlsx, which is
built from these predictions and adds out-of-fold residuals, the model
comparison table, and ranked top factors with interpretations.model_metrics. Right now that’s cubist.model_feature_set inside the chunk (narrow or wide); the
chunk picks up whichever model has the lowest CV RMSE.Six modeling approaches were tested with shared 10-fold cross-validation folds: linear regression, elastic net, random forest, gradient boosting, support vector machine with radial kernel, and Cubist. Cubist edged out random forest on cross-validated RMSE (0.0926 vs 0.0926, with Cubist ahead by about 3e-5) and was selected as the final model. Random forest came in essentially tied and is a defensible alternative. Both substantially outperformed the linear-family models (RMSE 0.135, R-squared 0.38).
Top predictors across the lineup
Mnf.Flow was the top predictor in five of the six
models, and a close second in random forest (basically tied with Brand C
at the top).Pressure.Vacuum,
Alch.Rel, Balling.Lvl,
Oxygen.Filler, Air.Pressurer, and
Temperature. Some of these were dropped from the linear
models’ feature set during correlation pruning, which is why they don’t
show up in those rankings.Model performance
The chosen Cubist model explains about 71% of the variation in pH on out-of-fold predictions, with typical predictions falling within roughly 0.067 pH units of the true value (the MAE). That makes it useful for prioritizing operational levers but not precise enough to replace per-batch pH measurement.
A few caveats are worth calling out before acting on these findings.
Investigate what the low Mnf.Flow state represents before acting on it. Mnf.Flow was the top predictor in nearly every model, but the bimodal data pattern means “tighten Mnf.Flow tolerances” may not be the right action. ABC Beverage’s process engineering team should confirm what the low value means. If it represents a controllable operational choice, controlling it is likely the highest-impact action available.
Set brand-specific pH targets. Each brand has its own baseline pH, with about 0.19 pH units of spread between the lowest (Brand C) and highest (Brand D). A single universal target will systematically miss for some brands. Treat each brand as its own production specification.
Add secondary monitoring on the next-tier
predictors. Pressure.Vacuum,
Alch.Rel, Balling.Lvl,
Oxygen.Filler, Air.Pressurer, and
Temperature each had a real effect on pH and surfaced
consistently in our top models. They aren’t the primary levers to
control, but tracking them alongside Mnf.Flow and brand would give
operations more context for diagnosing what changed when pH goes out of
spec.
Pilot before broad rollout. Choose one production line, tighten monitoring on the top predictors above, and compare resulting pH consistency to the historical baseline over two to four weeks. Use the pilot to calibrate brand-specific targets before scaling across all lines.
Retrain the model quarterly, or after any major process change. The relationships above reflect the current equipment, recipes, and raw materials. They will drift over time and especially after any significant operational change. A quarterly refresh plus an ad-hoc retrain tied to major changes will keep the recommendations aligned with reality.
A few directions to consider once the current findings have been acted on: