This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.2
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.3
library(funModeling)
## Warning: package 'funModeling' was built under R version 4.4.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.4.3
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## funModeling v.1.9.6 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(skimr)
## Warning: package 'skimr' was built under R version 4.4.3
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.4.3
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.3
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.3 ──
## ✔ tsibble 1.2.0 ✔ feasts 0.5.0
## ✔ tsibbledata 0.4.1 ✔ fable 0.5.0
## ✔ ggtime 0.2.0
## Warning: package 'tsibble' was built under R version 4.4.3
## Warning: package 'tsibbledata' was built under R version 4.4.3
## Warning: package 'ggtime' was built under R version 4.4.3
## Warning: package 'feasts' was built under R version 4.4.3
## Warning: package 'fabletools' was built under R version 4.4.3
## Warning: package 'fable' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ✖ tsibble::union() masks base::union()
library(readxl)
library(curl)
## Warning: package 'curl' was built under R version 4.4.3
## Using libcurl 8.14.1 with Schannel
##
## Attaching package: 'curl'
##
## The following object is masked from 'package:readr':
##
## parse_date
library(latex2exp)
## Warning: package 'latex2exp' was built under R version 4.4.3
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.4.3
##
## Attaching package: 'seasonal'
##
## The following object is masked from 'package:tibble':
##
## view
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
##
## Attaching package: 'GGally'
##
## The following object is masked from 'package:funModeling':
##
## range01
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.4.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.4.2
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.4.2
## Loading required package: parallel
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(Hmisc)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:fabletools':
##
## interpolate
##
## The following object is masked from 'package:Hmisc':
##
## impute
##
## The following object is masked from 'package:ggplot2':
##
## element
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
##
## The following object is masked from 'package:purrr':
##
## lift
library(VIM)
## Warning: package 'VIM' was built under R version 4.4.3
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
##
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
##
## The following object is masked from 'package:datasets':
##
## sleep
library(rpart)
library(forecast)
## Warning: package 'forecast' was built under R version 4.4.3
library(urca)
## Warning: package 'urca' was built under R version 4.4.3
library(earth)
## Warning: package 'earth' was built under R version 4.4.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.4.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.4.3
##
## Attaching package: 'plotmo'
##
## The following object is masked from 'package:urca':
##
## plotres
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-10
library(cluster)
## Warning: package 'cluster' was built under R version 4.4.3
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(aTSA)
##
## Attaching package: 'aTSA'
##
## The following object is masked from 'package:forecast':
##
## forecast
##
## The following objects are masked from 'package:fabletools':
##
## estimate, forecast
##
## The following object is masked from 'package:graphics':
##
## identify
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:gridExtra':
##
## combine
##
## The following object is masked from 'package:seasonal':
##
## outlier
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.4.3
## Loading required package: stats4
##
## Attaching package: 'modeltools'
##
## The following object is masked from 'package:kernlab':
##
## prior
##
## The following object is masked from 'package:fabletools':
##
## refit
##
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:tsibble':
##
## index
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.3
##
## Attaching package: 'strucchange'
##
## The following object is masked from 'package:stringr':
##
## boundary
##
##
## Attaching package: 'party'
##
## The following object is masked from 'package:fabletools':
##
## response
##
## The following object is masked from 'package:dplyr':
##
## where
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
library(partykit)
## Warning: package 'partykit' was built under R version 4.4.3
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.4.3
## Registered S3 method overwritten by 'inum':
## method from
## format.interval tsibble
##
## Attaching package: 'partykit'
##
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.4.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome to factoextra!
## Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.4.3
library(naniar)
## Warning: package 'naniar' was built under R version 4.4.3
##
## Attaching package: 'naniar'
##
## The following object is masked from 'package:tsibble':
##
## pedestrian
##
## The following object is masked from 'package:skimr':
##
## n_complete
library(mice)
## Warning: package 'mice' was built under R version 4.4.3
## Registered S3 method overwritten by 'lme4':
## method from
## na.action.merMod car
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:kernlab':
##
## convergence
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(writexl)
## Warning: package 'writexl' was built under R version 4.4.3
pH is a measure of acidity and alkalinity on a scale from 0 to 14, where a pH of 7 is considered neutral. Values below 7 are acidic, while values above 7 are alkaline. Maintaining proper pH levels is an important part of beverage manufacturing because it can affect product taste, quality, consistency, and safety. Even small changes in ingredients or production conditions can influence the final pH of a beverage.
This analysis explores the relationships between production variables and pH in order to identify the most important predictors and build accurate predictive models.
training_df <- read_excel("StudentData- Training data.xlsx")
test_df <-read_excel("StudentEvaluation-Test data.xlsx")
# Review first few rows
head(training_df)
head(test_df)
# Check summary stats
summary(training_df)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
summary(test_df)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Mode:logical Min. :0.00240 Min. : 70.0 Min. :44.00
## NA's:267 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :0.03370 Median :120.0 Median :46.00
## Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :3 NA's :1 NA's :2
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
# Check data dimensions
dims <- data.frame("Train"= dim(training_df),
"Test"= dim(test_df))
rownames(dims) <- c("Observations", "Predictors")
dims
# Check variable names
colnames(training_df)
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
ggplot(training_df, aes(x = PH)) +
geom_histogram(binwidth = 0.1) +
theme_minimal() +
labs(title = "Distribution of pH", x = "pH", y = "Frequency")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
The plot is slightly left-skewed. The peak is around pH of 8.5. There is
also an outlier around 9.4 but it is very minor.
#Distribution Brand Code plot
ggplot(training_df, aes(x = `Brand Code`)) +
geom_bar() +
theme_minimal() +
labs(title = "Distribution by brand code", x = "Brand code", y = "Count")
Brand B appears the most often in this dataset.
# Boxplot of pH by brand
brand_df <- training_df %>%
filter(!is.na(`Brand Code`))
ggplot(brand_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of pH by brand code", x = "Brand code", y = "pH")
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Brand B has the highest pH while brand C has the lowest.
gg_miss_var(training_df, show_pct = TRUE)
There are several variables with missing data. MFR is the highest at 8%.
The other variables are much lower and can be imputed.
# Histograms of distribution (numeric variables)
plot_num(training_df, bins = 10)
There are several variables with outliers. The outliers in Mnf Flow, Hyd
Pressure 1, Hyd Pressure 2 and Hyd Pressure 3 look particularly
significant so those will be investigated further.
This variable appears interesting based on the summary stats. It seems odd to me that manufacturing flow would yield a negative value and it seems significant based on the histogram.
# Mnf Flow
mnf_df <- training_df |>
select(PH, `Mnf Flow`) |>
drop_na() |>
filter(`Mnf Flow` < 0)
skim(mnf_df)
| Name | mnf_df |
| Number of rows | 1183 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PH | 0 | 1 | 8.63 | 0.16 | 8.08 | 8.54 | 8.66 | 8.74 | 9.36 | ▁▃▇▁▁ |
| Mnf Flow | 0 | 1 | -100.10 | 0.10 | -100.20 | -100.20 | -100.00 | -100.00 | -100.00 | ▇▁▁▁▇ |
# % of values <0
training_df |>
select(`Mnf Flow`) |>
drop_na() |>
mutate(group = ifelse(`Mnf Flow` < 0, "< 0", "> 0")) |>
group_by(group) |>
summarise(
n = n(),
percentage = round(n() / nrow(drop_na(select(training_df, `Mnf Flow`))) * 100, 1))
# Unique values
training_df |>
filter(`Mnf Flow` < 0) |>
distinct(`Mnf Flow`) |>
arrange(`Mnf Flow`)
Based on the additional statistics, we can see that values below 0 consist only of -100 and -100.2. These values also make up a significant proportion of the dataset, accounting for about half of all observations. This is unusual compared to the rest of the data, suggesting that these values likely represent missing or invalid data. Therefore, for the purposes of this assignment, they should be imputed.
These variables are also a little odd based on the distribution histogram with many variables near 0.
# Percent of values <0
training_df |>
select(`Hyd Pressure1`, `Hyd Pressure2`, `Hyd Pressure3`) |>
pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
group_by(Variable) |>
summarise(
zeros = sum(Value == 0, na.rm = TRUE),
total = n(),
pct_zero = round(zeros / total * 100, 1)
)
# Check unique values of Hyd Pressure 1
training_df |>
filter(`Hyd Pressure1` < 0) |>
distinct(`Hyd Pressure1`) |>
arrange(`Hyd Pressure1`)
# Check unique values of Hyd Pressure 2
training_df |>
filter(`Hyd Pressure2` < 0) |>
distinct(`Hyd Pressure2`) |>
arrange(`Hyd Pressure2`)
# Check unique values of Hyd Pressure 3
training_df |>
filter(`Hyd Pressure3` < 0) |>
distinct(`Hyd Pressure3`) |>
arrange(`Hyd Pressure3`)
Over 30% of the values in these 3 predictors are < 0 and there are very few unique values <0 in each of these variables. Since I will be exploring several different models, it may be best to recode these values to test multiple models.
num <- training_df |>
select(where(is.numeric))
# pH correlation
cor_table <- cor(drop_na(num))[, "PH"] |>
as.data.frame() |>
rownames_to_column(var = "Variable") |>
setNames(c("Variable", "Coefficient")) |>
arrange(desc(Coefficient))
kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) |>
kable_styling("striped", full_width = F) |>
column_spec(1, bold = T)
| Variable | Coefficient |
|---|---|
| PH | 1.0000000 |
| Bowl Setpoint | 0.3615875 |
| Filler Level | 0.3520440 |
| Carb Flow | 0.2335937 |
| Pressure Vacuum | 0.2197355 |
| Carb Rel | 0.1960515 |
| Alch Rel | 0.1666822 |
| Oxygen Filler | 0.1644854 |
| Balling Lvl | 0.1093712 |
| PC Volume | 0.0988667 |
| Density | 0.0955469 |
| Balling | 0.0767002 |
| Carb Pressure | 0.0762134 |
| Carb Volume | 0.0721325 |
| Carb Temp | 0.0322794 |
| Air Pressurer | -0.0079972 |
| PSC Fill | -0.0238098 |
| Filler Speed | -0.0408830 |
| MFR | -0.0451965 |
| Hyd Pressure1 | -0.0470664 |
| PSC | -0.0698730 |
| PSC CO2 | -0.0852599 |
| Fill Ounces | -0.1183359 |
| Carb Pressure1 | -0.1187642 |
| Hyd Pressure4 | -0.1714340 |
| Temperature | -0.1826596 |
| Hyd Pressure2 | -0.2226600 |
| Hyd Pressure3 | -0.2681018 |
| Pressure Setpoint | -0.3116639 |
| Fill Pressure | -0.3165145 |
| Usage cont | -0.3576120 |
| Mnf Flow | -0.4592313 |
Key correlations include bowl setpoint, filler level, carb flow, mnf flow, fill pressure, usage cont. The correlation of Hyd Pressures 1-3 which was identified earlier as having an abnormal distribution are lower. Although it may not be entirely necessary to impute these, I will do so for the purposes of exploring different models and maximizing accuracy.
During the EDA, I found that many of the variables had missing data. I will also be transforming the MNF Flow and Hyd Pressure 1-3 since they had so many values at either 0 or <0. While the Hyd Pressure variables are not highly correlated and may not make a big impact, I would like to adjust them because I will be exploring models like neural network and SVM which are much more sensitive to abnormal distributions and outliers.
# Standardize variable names
training_df <- clean_names(training_df)
test_df <- clean_names(test_df)
kNN (k-Nearest Neighbors) imputation finds the k most similar rows based on other variable values and uses their average to fill in missing data. I will be using this method to impute missing values.
set.seed(111)
# KNN
# Training dataset
training_knn <- training_df |>
mutate(
across(c(hyd_pressure2, hyd_pressure3, hyd_pressure1),
~ ifelse(. == 0, NA, .)),
mnf_flow = ifelse(mnf_flow < 0, NA, mnf_flow))
# Impute all NAs
training_knn <- kNN(
training_knn[, !names(training_knn) %in% c("brand_code", "ph")],
k = 5) |>
select(!ends_with("_imp"))
# Add brand_code and ph back
training_knn$brand_code <- training_df$brand_code
training_knn$ph <- training_df$ph
# Test dataset
test_knn <- test_df |>
mutate(
across(c(hyd_pressure2, hyd_pressure3, hyd_pressure1),
~ ifelse(. == 0, NA, .)),
mnf_flow = ifelse(mnf_flow < 0, NA, mnf_flow))
# Impute all NAs
test_knn <- kNN(
test_knn[, !names(test_knn) %in% c("brand_code", "ph")],
k = 5) |>
select(!ends_with("_imp"))
# Add brand_code and ph back
test_knn$brand_code <- test_df$brand_code
test_knn$ph <- test_df$ph
# Encode brand name
training_knn <- training_knn |>
mutate(brand_code = as.numeric(as.factor(brand_code)))
test_knn <- test_knn |>
mutate(brand_code = as.numeric(as.factor(brand_code)))
# Mutate NAs
training_knn <- training_knn |>
mutate(brand_code = ifelse(is.na(brand_code),
names(which.max(table(brand_code))),
brand_code)) |>
mutate(brand_code = as.numeric(as.factor(brand_code)))
test_knn <- test_knn |>
mutate(brand_code = ifelse(is.na(brand_code),
names(which.max(table(brand_code))),
brand_code)) |>
mutate(brand_code = as.numeric(as.factor(brand_code)))
# check for other NAs
colSums(is.na(training_knn))
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## mnf_flow carb_pressure1 fill_pressure hyd_pressure1
## 0 0 0 0
## hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## 0 0 0 0
## filler_speed temperature usage_cont carb_flow
## 0 0 0 0
## density mfr balling pressure_vacuum
## 0 0 0 0
## oxygen_filler bowl_setpoint pressure_setpoint air_pressurer
## 0 0 0 0
## alch_rel carb_rel balling_lvl brand_code
## 0 0 0 0
## ph
## 4
colSums(is.na(test_knn))
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## mnf_flow carb_pressure1 fill_pressure hyd_pressure1
## 0 0 0 0
## hyd_pressure2 hyd_pressure3 hyd_pressure4 filler_level
## 0 0 0 0
## filler_speed temperature usage_cont carb_flow
## 0 0 0 0
## density mfr balling pressure_vacuum
## 0 0 0 0
## oxygen_filler bowl_setpoint pressure_setpoint air_pressurer
## 0 0 0 0
## alch_rel carb_rel balling_lvl brand_code
## 0 0 0 0
## ph
## 267
# drop 4 missing ph rows in training data
training_knn <- training_knn |> drop_na(ph)
For this assignment, I will exploring the use of random forest, gradient boost, neural network, support vector machine, and cubist models.
The training data must be split first. For this exploration I will use the standard 80/20 split.
set.seed(111)
train_split <- createDataPartition(training_knn$ph, p = 0.8, list = FALSE)
train_setx <- training_knn[train_split, ]
train_sety <- training_knn[-train_split, ]
# Cross validation to reduce variance when testing models
control <- trainControl(method = "cv", number = 10)
RF builds multiple decision trees, each trained on a random subset of data and variables and the final prediction is the average across all trees. The results of RF used in tuned training set and the validation set are below:
set.seed(111)
rf_grid <- expand.grid(.mtry = c(2, 4, 6, 8, 10))
rf_model <- train(ph ~ .,
data = train_setx,
method = "rf",
tune_Grid = rf_grid,
preProc = c("center", "scale"),
trControl = control,
ntree = 500)
rf_tune <- rf_model$results[rf_model$results$mtry == rf_model$bestTune$mtry, ]
rf_tune
# Performance metrics using validation set
rf_pred <- predict(rf_model, newdata = train_sety)
rf_results <- postResample(pred = rf_pred, obs = train_sety$ph)
rf_results
## RMSE Rsquared MAE
## 0.10722156 0.65006303 0.07749993
# actual vs predicted values
res <- data.frame(actual = train_sety$ph, predicted = rf_pred)
# Scatter plot
ggplot(res, aes(x = actual, y = predicted)) +
geom_point(color = "gray20",
alpha = 0.7) +
geom_abline(slope = 1, intercept = 0,
color = "tomato",
linewidth = 1) +
labs(title = "Actual and Predicted pH values",
x = "Actual pH", y = "Predicted pH") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
rf_importance <- varImp(rf_model)$importance |>
as.data.frame() |>
arrange(desc(Overall))
rf_importance
The importance of features was identified above. Usage_cont was the highest importance followed by bowl_stpoint, temperature, and carb_rel.
GBM builds trees sequentially where each tree tries to correct the errors of the previous one. The results of GBM used in tuned training set and the validation set are below:
set.seed(111)
# Grid with parameters
gbm_grid <- expand.grid(
.n.trees = c(100, 500, 1000),
.interaction.depth = c(3, 5, 7),
.shrinkage = 0.1,
.n.minobsinnode = c(5, 10))
gbm_model <- train(ph ~ ., data = train_setx,
method = "gbm",
preProc= c("center", "scale"),
tuneGrid = gbm_grid,
trControl = control,
verbose = FALSE)
gbm_tune <- merge(gbm_model$results, gbm_model$bestTune)
gbm_tune
# Performance metrics using validation set
gbm_pred <- predict(gbm_model, newdata = train_sety)
gbm_results <- postResample(pred = gbm_pred, obs = train_sety$ph)
gbm_results
## RMSE Rsquared MAE
## 0.11260069 0.59689326 0.08365995
gbm_importance <- varImp(gbm_model)$importance |>
as.data.frame() |>
arrange(desc(Overall))
gbm_importance
In GBM model, usage_cont had the highest importance, followed by oxygen_filler. temperature, and bowl_setpoint. These are within the top 5 predictors in the random forest model.
Neural networks utilize layers of interconnected nodes to learn complex patterns by adjusting weights through repeated passes of the data. The results of NN used in tuned training set and the validation set are below:
set.seed(111)
#Grid with parameters
nngrid <- expand.grid(
.decay = c(0, 0.01, 0.1),
.size = 1:10,
.bag = FALSE)
#Train avNNet
nn <- train(ph ~ ., data = train_setx,
method = "avNNet",
preProc = c("center", "scale"),
tuneGrid = nngrid,
trControl = control,
linout = TRUE,
trace = FALSE)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnet_tune <- nn$results[nn$results$size == nn$bestTune$size & nn$results$decay == nn$bestTune$decay, ]
nnet_tune
# Performance metrics using validation set
nn_pred <- predict(nn, newdata = train_sety)
nn_results <- postResample(pred = nn_pred, obs = train_sety$ph)
nn_results
## RMSE Rsquared MAE
## 0.12181929 0.52758475 0.09097902
nn_importance <- varImp(nn)$importance |>
as.data.frame() |>
arrange(desc(Overall))
nn_importance
In the neural network model, usage_cont had the highest importance followed by bowl_setpoint, filler_level, and pressure setpoint. This is similar to the previous models.
SVM finds the optimal boundary that best separates data points by maximizing the margin between groups and uses kernel functions to handle non-linear relationships. The results of SVM used in tuned training set and the validation set are below:
set.seed(111)
#Grid with parameters
svmgrid <- expand.grid(sigma = c(0.001, 0.01, 0.1),
C = c(0.1, 1, 10, 100))
#Train SVM
svm_model <- train(ph ~ ., data = train_setx,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmgrid,
trControl = control)
svm_tune <- svm_model$results[svm_model$results$sigma == svm_model$bestTune$sigma & svm_model$results$C == svm_model$bestTune$C, ]
svm_tune
# Performance metrics using validation set
svm_pred <- predict(svm_model, newdata = train_sety)
svm_results <- postResample(pred = svm_pred, obs = train_sety$ph)
# Final PH predictions on test set
svm_pred_test <- predict(svm_model, newdata = test_knn)
svm_results
## RMSE Rsquared MAE
## 0.12067508 0.54041820 0.09007167
svm_importance <- varImp(svm_model)$importance |>
as.data.frame() |>
arrange(desc(Overall))
svm_importance
Like the previous models, usage_cont has the highest importance followed by bowl_setpoint, filler_level, pressure_setpoint, and carb_flow.
Cubist model is a rule-based model that creates if/then rules combined with linear regression equations at each endpoint. The results of the cubist model used in tuned training set and the validation set are below:
set.seed(111)
# Grid with parameters
cubist_grid <- expand.grid(committees = c(1, 10, 50, 100),
neighbors = c(0, 1, 5, 9))
# Train Cubist
cubist_model <- train(ph ~ ., data = train_setx,
method = "cubist",
tuneGrid = cubist_grid,
trControl = control)
cubist_tune <- cubist_model$results[
cubist_model$results$committees == cubist_model$bestTune$committees &
cubist_model$results$neighbors == cubist_model$bestTune$neighbors, ]
cubist_tune
# Performance metrics using validation set
cubist_pred <- predict(cubist_model, newdata = train_sety)
cubist_results <- postResample(pred = cubist_pred, obs = train_sety$ph)
# Final PH predictions on test set
cubist_pred_test <- predict(cubist_model, newdata = test_knn)
cubist_results
## RMSE Rsquared MAE
## 0.10594912 0.64490210 0.07490204
cubist_importance <- varImp(cubist_model)$importance |>
as.data.frame() |>
arrange(desc(Overall))
cubist_importance
Interestingly, the cubist model is the only model where bowl_setpoint is the highest importance. Usage_cont is not within the top 5 important predictors.
Based on the results, we can see that all these models performed relatively similarly without a major outlier. The cubist model marginally had the lowest RMSE at 0.106 while RMSE of random forest was 0.107. The Rsquared was highest in random forest at 0.650, just a little higher than cubist (0.645). Cubist also has the lowest MAE score, indicating that predictions are close to the actual values. The MAE value was similar to the random forest model. These results suggest that cubist model and random forest are performing very similarly since their performance values are nearly the same.
# Summary of all results
all_tune <- bind_rows(
rf_tune |> mutate(Model = "Random Forest"),
gbm_tune |> mutate(Model = "Gradient Boost"),
svm_tune |> mutate(Model = "Support Vector Machine"),
nnet_tune |> mutate(Model = "Neural Network"),
cubist_tune |> mutate(Model = "Cubist")
) |>
select(Model, RMSE, Rsquared, MAE)
kable(all_tune, "html", digits = 3) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "center") %>%
row_spec(0, bold = TRUE, color = "black", background = "gray") %>%
add_header_above(c("Model Tuning Comparison" = 4), bold = TRUE)
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Random Forest | 0.107 | 0.631 | 0.078 |
| Gradient Boost | 0.111 | 0.586 | 0.081 |
| Support Vector Machine | 0.124 | 0.478 | 0.093 |
| Neural Network | 0.120 | 0.514 | 0.090 |
| Cubist | 0.102 | 0.649 | 0.074 |
all_results <- bind_rows(
as.data.frame(t(rf_results)) |> mutate(Model = "Random Forest"),
as.data.frame(t(gbm_results)) |> mutate(Model = "Gradient Boost"),
as.data.frame(t(svm_results)) |> mutate(Model = "Support Vector Machine"),
as.data.frame(t(nn_results)) |> mutate(Model = "Neural Network"),
as.data.frame(t(cubist_results)) |> mutate(Model = "Cubist")
) |>
select(Model, RMSE, Rsquared, MAE)
kable(all_results, "html", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "center") %>%
row_spec(0, bold = TRUE, color = "black", background = "gray") %>%
add_header_above(c("Model Performance Comparison" = 4), bold = TRUE)
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Random Forest | 0.107 | 0.650 | 0.077 |
| Gradient Boost | 0.113 | 0.597 | 0.084 |
| Support Vector Machine | 0.121 | 0.540 | 0.090 |
| Neural Network | 0.122 | 0.528 | 0.091 |
| Cubist | 0.106 | 0.645 | 0.075 |
Although these 2 models are performing similarly, they have different strengths and weaknesses:
Random Forest: - Strengths: Handles non-linear relationships well, resistant to overfitting, can handle large numbers of predictors, robust to outliers and noisy data - Weaknesses: Computationally expensive and slow with large datasets, overfitting can still occur with too many trees, can be harder to interpret
Cubist: - Strengths: Can handle non-linear relationships, easy to interpret, usually faster than ensemble tree methods - Weaknesses: Primarily designed for regression, not classification, may require more preprocessing due to sensitivity to noise
From reviewing the results and strengths and weaknesses of both of these models, I would recommend using the Cubist model. It performed the best, although marginally, and it will be more easily interpretable and faster for continued use with more data.
An interesting result to note from identifying the important variables is that mnf_flow which was one of the highest correlated variables to pH was considered of low importance compared to other variables. This may have occurred due to signal loss of imputing so many values in that variable. It will be important for future use of this model to investigate those negative numbers again and determine the what they represent. Further feature engineering should be done to maximize predictive value of this model.
# pH predictions on test dataset
test_pred <- predict(cubist_model, newdata = test_knn)
test_results <- data.frame(
ph_pred = test_pred)
test_df_pred <- test_knn
test_results$ph <- test_pred
par(mfrow = c(1, 2))
hist(training_knn$ph, main = "Training Data Distribution", xlab = "pH")
hist(test_pred, main = "Predicted pH Distribution", xlab = "pH")
The training data and preidcted test data pH levels compared and the
distributions look similar.
# Save to Excel
final_output <- test_knn |>
select(-ph) |>
mutate(ph_predicted = test_pred)
write_xlsx(final_output, "ph_predictions.xlsx")