Project instructions

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.2
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.3
library(funModeling)
## Warning: package 'funModeling' was built under R version 4.4.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.4.3
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## funModeling v.1.9.6 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
library(skimr)
## Warning: package 'skimr' was built under R version 4.4.3
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.2
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.4.3
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.3
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.3 ──
## ✔ tsibble     1.2.0     ✔ feasts      0.5.0
## ✔ tsibbledata 0.4.1     ✔ fable       0.5.0
## ✔ ggtime      0.2.0
## Warning: package 'tsibble' was built under R version 4.4.3
## Warning: package 'tsibbledata' was built under R version 4.4.3
## Warning: package 'ggtime' was built under R version 4.4.3
## Warning: package 'feasts' was built under R version 4.4.3
## Warning: package 'fabletools' was built under R version 4.4.3
## Warning: package 'fable' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ Hmisc::src()         masks dplyr::src()
## ✖ Hmisc::summarize()   masks dplyr::summarize()
## ✖ tsibble::union()     masks base::union()
library(readxl)
library(curl)
## Warning: package 'curl' was built under R version 4.4.3
## Using libcurl 8.14.1 with Schannel
## 
## Attaching package: 'curl'
## 
## The following object is masked from 'package:readr':
## 
##     parse_date
library(latex2exp)
## Warning: package 'latex2exp' was built under R version 4.4.3
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.4.3
## 
## Attaching package: 'seasonal'
## 
## The following object is masked from 'package:tibble':
## 
##     view
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## 
## Attaching package: 'GGally'
## 
## The following object is masked from 'package:funModeling':
## 
##     range01
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.4.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.4.2
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.4.2
## Loading required package: parallel
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(Hmisc)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate
## 
## The following object is masked from 'package:Hmisc':
## 
##     impute
## 
## The following object is masked from 'package:ggplot2':
## 
##     element
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(VIM)
## Warning: package 'VIM' was built under R version 4.4.3
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## 
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## 
## The following object is masked from 'package:datasets':
## 
##     sleep
library(rpart)
library(forecast)
## Warning: package 'forecast' was built under R version 4.4.3
library(urca)
## Warning: package 'urca' was built under R version 4.4.3
library(earth)
## Warning: package 'earth' was built under R version 4.4.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.4.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.4.3
## 
## Attaching package: 'plotmo'
## 
## The following object is masked from 'package:urca':
## 
##     plotres
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-10
library(cluster)
## Warning: package 'cluster' was built under R version 4.4.3
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(aTSA)
## 
## Attaching package: 'aTSA'
## 
## The following object is masked from 'package:forecast':
## 
##     forecast
## 
## The following objects are masked from 'package:fabletools':
## 
##     estimate, forecast
## 
## The following object is masked from 'package:graphics':
## 
##     identify
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:gridExtra':
## 
##     combine
## 
## The following object is masked from 'package:seasonal':
## 
##     outlier
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.4.3
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## 
## The following object is masked from 'package:kernlab':
## 
##     prior
## 
## The following object is masked from 'package:fabletools':
## 
##     refit
## 
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
## 
## Attaching package: 'zoo'
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.3
## 
## Attaching package: 'strucchange'
## 
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## 
## Attaching package: 'party'
## 
## The following object is masked from 'package:fabletools':
## 
##     response
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
library(partykit)
## Warning: package 'partykit' was built under R version 4.4.3
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.4.3
## Registered S3 method overwritten by 'inum':
##   method          from   
##   format.interval tsibble
## 
## Attaching package: 'partykit'
## 
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome to factoextra!
## Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.4.3
library(naniar)
## Warning: package 'naniar' was built under R version 4.4.3
## 
## Attaching package: 'naniar'
## 
## The following object is masked from 'package:tsibble':
## 
##     pedestrian
## 
## The following object is masked from 'package:skimr':
## 
##     n_complete
library(mice)
## Warning: package 'mice' was built under R version 4.4.3
## Registered S3 method overwritten by 'lme4':
##   method           from
##   na.action.merMod car 
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:kernlab':
## 
##     convergence
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
 library(writexl)
## Warning: package 'writexl' was built under R version 4.4.3

Introduction

pH is a measure of acidity and alkalinity on a scale from 0 to 14, where a pH of 7 is considered neutral. Values below 7 are acidic, while values above 7 are alkaline. Maintaining proper pH levels is an important part of beverage manufacturing because it can affect product taste, quality, consistency, and safety. Even small changes in ingredients or production conditions can influence the final pH of a beverage.

This analysis explores the relationships between production variables and pH in order to identify the most important predictors and build accurate predictive models.

Load data

training_df <- read_excel("StudentData- Training data.xlsx")
test_df <-read_excel("StudentEvaluation-Test data.xlsx")

Data exploration

Summary stats

# Review first few rows
head(training_df)
head(test_df)
# Check summary stats
summary(training_df)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
summary(test_df)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd Pressure1    Hyd Pressure2    Hyd Pressure3    Hyd Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##     PH          Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Mode:logical   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  NA's:267       1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00    
##                 Median :0.03370   Median :120.0   Median :46.00    
##                 Mean   :0.04666   Mean   :109.6   Mean   :47.73    
##                 3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00    
##                 Max.   :0.39800   Max.   :130.0   Max.   :52.00    
##                 NA's   :3         NA's   :1       NA's   :2        
##  Air Pressurer      Alch Rel        Carb Rel     Balling Lvl   
##  Min.   :141.2   Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :142.6   Median :6.580   Median :5.40   Median :1.480  
##  Mean   :142.8   Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:142.8   3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :147.2   Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :1       NA's   :3       NA's   :2
# Check data dimensions
dims <- data.frame("Train"= dim(training_df), 
                   "Test"= dim(test_df))
rownames(dims) <- c("Observations", "Predictors")
dims
# Check variable names 
colnames(training_df)
##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"

EDA

ggplot(training_df, aes(x = PH)) +
  geom_histogram(binwidth = 0.1) +
  theme_minimal() +
  labs(title = "Distribution of pH", x = "pH", y = "Frequency")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

The plot is slightly left-skewed. The peak is around pH of 8.5. There is also an outlier around 9.4 but it is very minor.

#Distribution Brand Code plot
ggplot(training_df, aes(x = `Brand Code`)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Distribution by brand code", x = "Brand code", y = "Count")

Brand B appears the most often in this dataset.

# Boxplot of pH by brand

brand_df <- training_df %>%
  filter(!is.na(`Brand Code`))

ggplot(brand_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of pH by brand code", x = "Brand code", y = "pH")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Brand B has the highest pH while brand C has the lowest.

Missing values

gg_miss_var(training_df, show_pct = TRUE)

There are several variables with missing data. MFR is the highest at 8%. The other variables are much lower and can be imputed.

Distribution of variables

# Histograms of distribution (numeric variables)

plot_num(training_df, bins = 10)

There are several variables with outliers. The outliers in Mnf Flow, Hyd Pressure 1, Hyd Pressure 2 and Hyd Pressure 3 look particularly significant so those will be investigated further.

Mnf flow exploration

This variable appears interesting based on the summary stats. It seems odd to me that manufacturing flow would yield a negative value and it seems significant based on the histogram.

# Mnf Flow

mnf_df <- training_df |>
  select(PH, `Mnf Flow`) |>
  drop_na() |>
  filter(`Mnf Flow` < 0)

skim(mnf_df)
Data summary
Name mnf_df
Number of rows 1183
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PH 0 1 8.63 0.16 8.08 8.54 8.66 8.74 9.36 ▁▃▇▁▁
Mnf Flow 0 1 -100.10 0.10 -100.20 -100.20 -100.00 -100.00 -100.00 ▇▁▁▁▇
# % of values <0
training_df |>
  select(`Mnf Flow`) |>
  drop_na() |>
  mutate(group = ifelse(`Mnf Flow` < 0, "< 0", "> 0")) |>
  group_by(group) |>
  summarise(
    n = n(),
    percentage = round(n() / nrow(drop_na(select(training_df, `Mnf Flow`))) * 100, 1))
# Unique values
training_df |>
  filter(`Mnf Flow` < 0) |>
  distinct(`Mnf Flow`) |>
  arrange(`Mnf Flow`)

Based on the additional statistics, we can see that values below 0 consist only of -100 and -100.2. These values also make up a significant proportion of the dataset, accounting for about half of all observations. This is unusual compared to the rest of the data, suggesting that these values likely represent missing or invalid data. Therefore, for the purposes of this assignment, they should be imputed.

Hyd Pressure 1-3

These variables are also a little odd based on the distribution histogram with many variables near 0.

# Percent of values <0
training_df |>
  select(`Hyd Pressure1`, `Hyd Pressure2`, `Hyd Pressure3`) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    zeros      = sum(Value == 0, na.rm = TRUE),
    total      = n(),
    pct_zero   = round(zeros / total * 100, 1)
  )
# Check unique values of Hyd Pressure 1
training_df |>
  filter(`Hyd Pressure1` < 0) |>
  distinct(`Hyd Pressure1`) |>
  arrange(`Hyd Pressure1`)
# Check unique values of Hyd Pressure 2
training_df |>
  filter(`Hyd Pressure2` < 0) |>
  distinct(`Hyd Pressure2`) |>
  arrange(`Hyd Pressure2`)
# Check unique values of Hyd Pressure 3
training_df |>
  filter(`Hyd Pressure3` < 0) |>
  distinct(`Hyd Pressure3`) |>
  arrange(`Hyd Pressure3`)

Over 30% of the values in these 3 predictors are < 0 and there are very few unique values <0 in each of these variables. Since I will be exploring several different models, it may be best to recode these values to test multiple models.

Correlation

num <- training_df |>
  select(where(is.numeric))

# pH correlation
cor_table <- cor(drop_na(num))[, "PH"] |>
  as.data.frame() |>
  rownames_to_column(var = "Variable") |>
  setNames(c("Variable", "Coefficient")) |> 
  arrange(desc(Coefficient))

kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) |>
  kable_styling("striped", full_width = F) |>
  column_spec(1, bold = T)
Variable Coefficient
PH 1.0000000
Bowl Setpoint 0.3615875
Filler Level 0.3520440
Carb Flow 0.2335937
Pressure Vacuum 0.2197355
Carb Rel 0.1960515
Alch Rel 0.1666822
Oxygen Filler 0.1644854
Balling Lvl 0.1093712
PC Volume 0.0988667
Density 0.0955469
Balling 0.0767002
Carb Pressure 0.0762134
Carb Volume 0.0721325
Carb Temp 0.0322794
Air Pressurer -0.0079972
PSC Fill -0.0238098
Filler Speed -0.0408830
MFR -0.0451965
Hyd Pressure1 -0.0470664
PSC -0.0698730
PSC CO2 -0.0852599
Fill Ounces -0.1183359
Carb Pressure1 -0.1187642
Hyd Pressure4 -0.1714340
Temperature -0.1826596
Hyd Pressure2 -0.2226600
Hyd Pressure3 -0.2681018
Pressure Setpoint -0.3116639
Fill Pressure -0.3165145
Usage cont -0.3576120
Mnf Flow -0.4592313

Key correlations include bowl setpoint, filler level, carb flow, mnf flow, fill pressure, usage cont. The correlation of Hyd Pressures 1-3 which was identified earlier as having an abnormal distribution are lower. Although it may not be entirely necessary to impute these, I will do so for the purposes of exploring different models and maximizing accuracy.

Data prep

During the EDA, I found that many of the variables had missing data. I will also be transforming the MNF Flow and Hyd Pressure 1-3 since they had so many values at either 0 or <0. While the Hyd Pressure variables are not highly correlated and may not make a big impact, I would like to adjust them because I will be exploring models like neural network and SVM which are much more sensitive to abnormal distributions and outliers.

# Standardize variable names 
training_df <- clean_names(training_df)
test_df <- clean_names(test_df)

kNN (k-Nearest Neighbors) imputation finds the k most similar rows based on other variable values and uses their average to fill in missing data. I will be using this method to impute missing values.

set.seed(111)
# KNN 
# Training dataset
training_knn <- training_df |> 
  mutate(
    across(c(hyd_pressure2, hyd_pressure3, hyd_pressure1),
           ~ ifelse(. == 0, NA, .)),
    mnf_flow = ifelse(mnf_flow < 0, NA, mnf_flow))

# Impute all NAs 
training_knn <- kNN(
  training_knn[, !names(training_knn) %in% c("brand_code", "ph")],
  k = 5) |>
  select(!ends_with("_imp"))

# Add brand_code and ph back
training_knn$brand_code <- training_df$brand_code
training_knn$ph <- training_df$ph   

# Test dataset
test_knn <- test_df |> 
  mutate(
    across(c(hyd_pressure2, hyd_pressure3, hyd_pressure1),
           ~ ifelse(. == 0, NA, .)),
    mnf_flow = ifelse(mnf_flow < 0, NA, mnf_flow))

# Impute all NAs
test_knn <- kNN(
  test_knn[, !names(test_knn) %in% c("brand_code", "ph")],
  k = 5) |>
  select(!ends_with("_imp"))

# Add brand_code and ph back
test_knn$brand_code <- test_df$brand_code
test_knn$ph <- test_df$ph   
# Encode brand name 

training_knn <- training_knn |>
  mutate(brand_code = as.numeric(as.factor(brand_code)))

test_knn <- test_knn |>
  mutate(brand_code = as.numeric(as.factor(brand_code)))

# Mutate NAs
training_knn <- training_knn |>
  mutate(brand_code = ifelse(is.na(brand_code), 
                             names(which.max(table(brand_code))), 
                             brand_code)) |>
  mutate(brand_code = as.numeric(as.factor(brand_code)))

test_knn <- test_knn |>
  mutate(brand_code = ifelse(is.na(brand_code), 
                             names(which.max(table(brand_code))), 
                             brand_code)) |>
  mutate(brand_code = as.numeric(as.factor(brand_code)))
# check for other NAs
colSums(is.na(training_knn))
##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##          mnf_flow    carb_pressure1     fill_pressure     hyd_pressure1 
##                 0                 0                 0                 0 
##     hyd_pressure2     hyd_pressure3     hyd_pressure4      filler_level 
##                 0                 0                 0                 0 
##      filler_speed       temperature        usage_cont         carb_flow 
##                 0                 0                 0                 0 
##           density               mfr           balling   pressure_vacuum 
##                 0                 0                 0                 0 
##     oxygen_filler     bowl_setpoint pressure_setpoint     air_pressurer 
##                 0                 0                 0                 0 
##          alch_rel          carb_rel       balling_lvl        brand_code 
##                 0                 0                 0                 0 
##                ph 
##                 4
colSums(is.na(test_knn))
##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##          mnf_flow    carb_pressure1     fill_pressure     hyd_pressure1 
##                 0                 0                 0                 0 
##     hyd_pressure2     hyd_pressure3     hyd_pressure4      filler_level 
##                 0                 0                 0                 0 
##      filler_speed       temperature        usage_cont         carb_flow 
##                 0                 0                 0                 0 
##           density               mfr           balling   pressure_vacuum 
##                 0                 0                 0                 0 
##     oxygen_filler     bowl_setpoint pressure_setpoint     air_pressurer 
##                 0                 0                 0                 0 
##          alch_rel          carb_rel       balling_lvl        brand_code 
##                 0                 0                 0                 0 
##                ph 
##               267
# drop 4 missing ph rows in training data
training_knn <- training_knn |> drop_na(ph)

Model Exploration

For this assignment, I will exploring the use of random forest, gradient boost, neural network, support vector machine, and cubist models.

Split train/test set

The training data must be split first. For this exploration I will use the standard 80/20 split.

set.seed(111)
train_split <- createDataPartition(training_knn$ph, p = 0.8, list = FALSE)
train_setx <- training_knn[train_split, ]
train_sety <- training_knn[-train_split, ]

# Cross validation to reduce variance when testing models 
control <- trainControl(method = "cv", number = 10)

Random Forest

RF builds multiple decision trees, each trained on a random subset of data and variables and the final prediction is the average across all trees. The results of RF used in tuned training set and the validation set are below:

set.seed(111)
rf_grid <- expand.grid(.mtry = c(2, 4, 6, 8, 10))

rf_model <- train(ph ~ ., 
                  data = train_setx, 
                  method = "rf",
                  tune_Grid = rf_grid,
                  preProc = c("center", "scale"),
                  trControl = control,
                  ntree = 500)

rf_tune <- rf_model$results[rf_model$results$mtry == rf_model$bestTune$mtry, ]

rf_tune
# Performance metrics using validation set
rf_pred    <- predict(rf_model, newdata = train_sety)
rf_results <- postResample(pred = rf_pred, obs = train_sety$ph)

rf_results
##       RMSE   Rsquared        MAE 
## 0.10722156 0.65006303 0.07749993
# actual vs predicted values
res <- data.frame(actual = train_sety$ph, predicted = rf_pred)

# Scatter plot
ggplot(res, aes(x = actual, y = predicted)) +
    geom_point(color = "gray20", 
               alpha = 0.7) +
    geom_abline(slope = 1, intercept = 0, 
                color = "tomato", 
                linewidth = 1) +
    labs(title = "Actual and Predicted pH values", 
         x = "Actual pH", y = "Predicted pH") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

rf_importance <- varImp(rf_model)$importance |>
  as.data.frame() |>
  arrange(desc(Overall))

rf_importance

The importance of features was identified above. Usage_cont was the highest importance followed by bowl_stpoint, temperature, and carb_rel.

Gradient Boost

GBM builds trees sequentially where each tree tries to correct the errors of the previous one. The results of GBM used in tuned training set and the validation set are below:

set.seed(111)

# Grid with parameters
gbm_grid <- expand.grid(
  .n.trees          = c(100, 500, 1000),
  .interaction.depth = c(3, 5, 7),
  .shrinkage        = 0.1,
  .n.minobsinnode   = c(5, 10))

gbm_model <- train(ph ~ ., data = train_setx,
                   method = "gbm",
                   preProc= c("center", "scale"),
                   tuneGrid = gbm_grid,
                   trControl = control,
                   verbose = FALSE)

gbm_tune <- merge(gbm_model$results, gbm_model$bestTune)

gbm_tune
# Performance metrics using validation set
gbm_pred    <- predict(gbm_model, newdata = train_sety)
gbm_results <- postResample(pred = gbm_pred, obs = train_sety$ph)

gbm_results
##       RMSE   Rsquared        MAE 
## 0.11260069 0.59689326 0.08365995
gbm_importance <- varImp(gbm_model)$importance |>
  as.data.frame() |>
  arrange(desc(Overall))

gbm_importance

In GBM model, usage_cont had the highest importance, followed by oxygen_filler. temperature, and bowl_setpoint. These are within the top 5 predictors in the random forest model.

Neural Network

Neural networks utilize layers of interconnected nodes to learn complex patterns by adjusting weights through repeated passes of the data. The results of NN used in tuned training set and the validation set are below:

set.seed(111)
#Grid with parameters
nngrid <- expand.grid(
  .decay = c(0, 0.01, 0.1), 
  .size = 1:10, 
  .bag = FALSE)

#Train avNNet
nn <- train(ph ~ ., data = train_setx, 
                   method = "avNNet",
                   preProc = c("center", "scale"),
                   tuneGrid = nngrid,
                   trControl = control,
                   linout = TRUE,
                   trace = FALSE)  
## Warning: executing %dopar% sequentially: no parallel backend registered
nnet_tune <- nn$results[nn$results$size == nn$bestTune$size & nn$results$decay == nn$bestTune$decay, ]
nnet_tune
# Performance metrics using validation set
nn_pred    <- predict(nn, newdata = train_sety)
nn_results <- postResample(pred = nn_pred, obs = train_sety$ph)

nn_results
##       RMSE   Rsquared        MAE 
## 0.12181929 0.52758475 0.09097902
nn_importance <- varImp(nn)$importance |>
  as.data.frame() |>
  arrange(desc(Overall))

nn_importance

In the neural network model, usage_cont had the highest importance followed by bowl_setpoint, filler_level, and pressure setpoint. This is similar to the previous models.

SVM

SVM finds the optimal boundary that best separates data points by maximizing the margin between groups and uses kernel functions to handle non-linear relationships. The results of SVM used in tuned training set and the validation set are below:

set.seed(111)
#Grid with parameters
svmgrid <- expand.grid(sigma = c(0.001, 0.01, 0.1), 
                       C = c(0.1, 1, 10, 100))
#Train SVM
svm_model <- train(ph ~ ., data = train_setx, 
                  method = "svmRadial", 
                  preProc = c("center", "scale"),
                  tuneGrid = svmgrid,
                  trControl = control)


svm_tune <- svm_model$results[svm_model$results$sigma == svm_model$bestTune$sigma & svm_model$results$C == svm_model$bestTune$C, ]
svm_tune
# Performance metrics using validation set
svm_pred    <- predict(svm_model, newdata = train_sety)
svm_results <- postResample(pred = svm_pred, obs = train_sety$ph)

# Final PH predictions on test set
svm_pred_test <- predict(svm_model, newdata = test_knn)

svm_results
##       RMSE   Rsquared        MAE 
## 0.12067508 0.54041820 0.09007167
svm_importance <- varImp(svm_model)$importance |>
  as.data.frame() |>
  arrange(desc(Overall))

svm_importance

Like the previous models, usage_cont has the highest importance followed by bowl_setpoint, filler_level, pressure_setpoint, and carb_flow.

Cubist

Cubist model is a rule-based model that creates if/then rules combined with linear regression equations at each endpoint. The results of the cubist model used in tuned training set and the validation set are below:

set.seed(111)

# Grid with parameters
cubist_grid <- expand.grid(committees = c(1, 10, 50, 100),
                          neighbors  = c(0, 1, 5, 9))

# Train Cubist
cubist_model <- train(ph ~ ., data = train_setx,
                     method   = "cubist",
                     tuneGrid = cubist_grid,
                     trControl = control)

cubist_tune <- cubist_model$results[
  cubist_model$results$committees == cubist_model$bestTune$committees &
  cubist_model$results$neighbors  == cubist_model$bestTune$neighbors, ]

cubist_tune
# Performance metrics using validation set
cubist_pred    <- predict(cubist_model, newdata = train_sety)
cubist_results <- postResample(pred = cubist_pred, obs = train_sety$ph)

# Final PH predictions on test set
cubist_pred_test <- predict(cubist_model, newdata = test_knn)

cubist_results
##       RMSE   Rsquared        MAE 
## 0.10594912 0.64490210 0.07490204
cubist_importance <- varImp(cubist_model)$importance |>
  as.data.frame() |>
  arrange(desc(Overall))

cubist_importance

Interestingly, the cubist model is the only model where bowl_setpoint is the highest importance. Usage_cont is not within the top 5 important predictors.

Final Model Selection

Based on the results, we can see that all these models performed relatively similarly without a major outlier. The cubist model marginally had the lowest RMSE at 0.106 while RMSE of random forest was 0.107. The Rsquared was highest in random forest at 0.650, just a little higher than cubist (0.645). Cubist also has the lowest MAE score, indicating that predictions are close to the actual values. The MAE value was similar to the random forest model. These results suggest that cubist model and random forest are performing very similarly since their performance values are nearly the same.

# Summary of all results 
all_tune <- bind_rows(
  rf_tune |> mutate(Model = "Random Forest"),
  gbm_tune |> mutate(Model = "Gradient Boost"),
  svm_tune |> mutate(Model = "Support Vector Machine"),
  nnet_tune |> mutate(Model = "Neural Network"),
  cubist_tune |> mutate(Model = "Cubist")
) |>
  select(Model, RMSE, Rsquared, MAE)

kable(all_tune, "html", digits = 3) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "center") %>%
  row_spec(0, bold = TRUE, color = "black", background = "gray") %>%
  add_header_above(c("Model Tuning Comparison" = 4), bold = TRUE)
Model Tuning Comparison
Model RMSE Rsquared MAE
Random Forest 0.107 0.631 0.078
Gradient Boost 0.111 0.586 0.081
Support Vector Machine 0.124 0.478 0.093
Neural Network 0.120 0.514 0.090
Cubist 0.102 0.649 0.074
all_results <- bind_rows(
  as.data.frame(t(rf_results)) |> mutate(Model = "Random Forest"),
  as.data.frame(t(gbm_results)) |> mutate(Model = "Gradient Boost"),
  as.data.frame(t(svm_results)) |> mutate(Model = "Support Vector Machine"),
  as.data.frame(t(nn_results)) |> mutate(Model = "Neural Network"),
  as.data.frame(t(cubist_results)) |> mutate(Model = "Cubist")
) |>
  select(Model, RMSE, Rsquared, MAE)

kable(all_results, "html", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "center") %>%
  row_spec(0, bold = TRUE, color = "black", background = "gray") %>%
  add_header_above(c("Model Performance Comparison" = 4), bold = TRUE)
Model Performance Comparison
Model RMSE Rsquared MAE
Random Forest 0.107 0.650 0.077
Gradient Boost 0.113 0.597 0.084
Support Vector Machine 0.121 0.540 0.090
Neural Network 0.122 0.528 0.091
Cubist 0.106 0.645 0.075

Although these 2 models are performing similarly, they have different strengths and weaknesses:

Random Forest: - Strengths: Handles non-linear relationships well, resistant to overfitting, can handle large numbers of predictors, robust to outliers and noisy data - Weaknesses: Computationally expensive and slow with large datasets, overfitting can still occur with too many trees, can be harder to interpret

Cubist: - Strengths: Can handle non-linear relationships, easy to interpret, usually faster than ensemble tree methods - Weaknesses: Primarily designed for regression, not classification, may require more preprocessing due to sensitivity to noise

From reviewing the results and strengths and weaknesses of both of these models, I would recommend using the Cubist model. It performed the best, although marginally, and it will be more easily interpretable and faster for continued use with more data.

An interesting result to note from identifying the important variables is that mnf_flow which was one of the highest correlated variables to pH was considered of low importance compared to other variables. This may have occurred due to signal loss of imputing so many values in that variable. It will be important for future use of this model to investigate those negative numbers again and determine the what they represent. Further feature engineering should be done to maximize predictive value of this model.

Predictions for the test dataset

# pH predictions on test dataset
test_pred <- predict(cubist_model, newdata = test_knn)

test_results <- data.frame(
  ph_pred = test_pred)

test_df_pred <- test_knn
test_results$ph <- test_pred

par(mfrow = c(1, 2))
hist(training_knn$ph, main = "Training Data Distribution", xlab = "pH")
hist(test_pred, main = "Predicted pH Distribution", xlab = "pH")

The training data and preidcted test data pH levels compared and the distributions look similar.

# Save to Excel
final_output <- test_knn |>
  select(-ph) |>                    
  mutate(ph_predicted = test_pred)

write_xlsx(final_output, "ph_predictions.xlsx")