This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
New Regulations by ABC beverage company leadership requires the company’s production unit to better understand the manufacturing process, the predictive factors and their relationship to the PH of the beverages.
The research is an effort to find the predictive variables related to the ph of the beverages and build the predictive model for ph of beverages
The data set is a historic data containing predictors associated to the PH and is provided in an excel file. We will utilize this historic dataset to analyze and predict the PH of beverages. Two excel files are provided:
student_evaluation_gh_file <- "https://github.com/hovig/Team5-Data624-Project2/raw/master/StudentEvaluation.xlsx"
student_data_gh_file <- "https://github.com/hovig/Team5-Data624-Project2/raw/master/StudentData.xlsx"
se_temp_file <- tempfile(fileext = ".xlsx")
sd_temp_file <- tempfile(fileext = ".xlsx")
se_data <- GET(student_evaluation_gh_file, authenticate(Sys.getenv("GITHUB_PAT"), ""), write_disk(path = se_temp_file))
df_data <- GET(student_data_gh_file, authenticate(Sys.getenv("GITHUB_PAT"), ""), write_disk(path = sd_temp_file))
se_data <- readxl::read_excel(se_temp_file)
sd_data <- readxl::read_excel(sd_temp_file)
dim(sd_data)
## [1] 2571 33
colnames(sd_data)
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
str(sd_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2571 obs. of 33 variables:
## $ Brand Code : chr "B" "A" "B" "A" ...
## $ Carb Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num 24 24 24.1 24 24.3 ...
## $ PC Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num 119 122 120 115 118 ...
## $ Fill Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num 121 119 120 118 119 ...
## $ Filler Speed : num 4002 3986 4020 4012 4010 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num 2932 3144 2914 3062 3054 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num 143 143 142 146 146 ...
## $ Alch Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
dim(se_data)
## [1] 267 33
colnames(se_data)
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
str(se_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 267 obs. of 33 variables:
## $ Brand Code : chr "D" "A" "B" "B" ...
## $ Carb Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC Volume : num 0.27 0.227 0.303 0.186 0.16 ...
## $ Carb Pressure : num 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
## $ Carb Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC Fill : num 0.4 0.22 0.1 0.2 0.3 ...
## $ PSC CO2 : num 0.04 0.08 0.02 0.02 0.06 ...
## $ Mnf Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num 117 119 120 125 115 ...
## $ Fill Pressure : num 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
## $ Hyd Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num 96 112 98 132 94 94 108 108 110 106 ...
## $ Filler Level : num 129 120 119 120 116 ...
## $ Filler Speed : num 3986 4012 4010 NA 4018 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
## $ Usage cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb Flow : num 2950 2916 3056 28 3214 ...
## $ Density : num 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
## $ MFR : num 728 736 735 NA 752 ...
## $ Balling : num 1.4 2.94 1.45 1.06 1.4 ...
## $ Pressure Vacuum : num -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
## $ PH : logi NA NA NA NA NA NA ...
## $ Oxygen Filler : num 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl Setpoint : num 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num 45.2 46 46 46 50 46 46 46 46 46 ...
## $ Air Pressurer : num 143 147 147 146 146 ...
## $ Alch Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling Lvl : num 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...
bev_raw <- read_csv('https://raw.githubusercontent.com/hovig/Team5-Data624-Project2/master/StudentData.csv')
head(bev_raw)
## # A tibble: 6 x 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # … with 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC
## # Fill` <dbl>, `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb
## # Pressure1` <dbl>, `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd
## # Pressure2` <dbl>, `Hyd Pressure3` <dbl>, `Hyd Pressure4` <int>,
## # `Filler Level` <dbl>, `Filler Speed` <int>, Temperature <dbl>, `Usage
## # cont` <dbl>, `Carb Flow` <int>, Density <dbl>, MFR <dbl>,
## # Balling <dbl>, `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen
## # Filler` <dbl>, `Bowl Setpoint` <int>, `Pressure Setpoint` <dbl>, `Air
## # Pressurer` <dbl>, `Alch Rel` <dbl>, `Carb Rel` <dbl>, `Balling
## # Lvl` <dbl>
bev_raw %>%
ggplot(aes(PH, fill=PH > 9)) +
geom_histogram(bins=30) +
theme_bw() +
theme(legend.position='center') +
labs(y='Count', title='PH Levels in Dataset')
bev_raw <- bev_raw %>%
filter(!is.na(bev_raw$PH), bev_raw$PH < 9)
dim(bev_raw)
## [1] 2566 33
str(bev_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2566 obs. of 33 variables:
## $ Brand Code : chr "B" "A" "B" "A" ...
## $ Carb Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num 24 24 24.1 24 24.3 ...
## $ PC Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num 119 122 120 115 118 ...
## $ Fill Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num 121 119 120 118 119 ...
## $ Filler Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num 143 143 142 146 146 ...
## $ Alch Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 33
## .. ..$ Brand Code : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Carb Volume : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Fill Ounces : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PC Volume : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Carb Pressure : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Carb Temp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PSC : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PSC Fill : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PSC CO2 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Mnf Flow : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Carb Pressure1 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Fill Pressure : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Hyd Pressure1 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Hyd Pressure2 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Hyd Pressure3 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Hyd Pressure4 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Filler Level : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Filler Speed : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Temperature : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Usage cont : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Carb Flow : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Density : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ MFR : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Balling : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Pressure Vacuum : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PH : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Oxygen Filler : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Bowl Setpoint : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Pressure Setpoint: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Air Pressurer : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Alch Rel : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Carb Rel : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Balling Lvl : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
hist.data.frame(bev_raw)
table(bev_raw$`Brand Code`)
##
## A B C D
## 293 1235 303 615
summary(bev_raw)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2566 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27719
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08468 Mean :0.1953
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 70.70 Median :123.2 Median :46.40
## Mean :0.05645 Mean : 24.68 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :32 NA's :17
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.0 Min. :-1.20 Min. : 62.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.6 Median :27.60 Median : 96.00
## Mean :12.46 Mean :21.0 Mean :20.49 Mean : 96.31
## 3rd Qu.:20.20 3rd Qu.:34.6 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.4 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :27
## Filler Level Filler Speed Temperature Usage cont
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.38
## Median :118.4 Median :3982 Median :65.60 Median :21.80
## Mean :109.3 Mean :3688 Mean :65.96 Mean :21.00
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.76
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90
## NA's :15 NA's :53 NA's :11 NA's :5
## Carb Flow Density MFR Balling
## Min. : 26 Min. :0.300 Min. : 31.4 Min. :0.346
## 1st Qu.:1170 1st Qu.:0.900 1st Qu.:706.3 1st Qu.:1.496
## Median :3030 Median :0.980 Median :724.0 Median :1.648
## Mean :2473 Mean :1.175 Mean :704.0 Mean :2.201
## 3rd Qu.:3188 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.:3.292
## Max. :5104 Max. :1.920 Max. :868.6 Max. :4.012
## NA's :2 NA's :207
## Pressure Vacuum PH Oxygen Filler Bowl Setpoint
## Min. :-6.600 Min. :7.880 Min. :0.00240 Min. : 70.0
## 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0
## Median :-5.400 Median :8.540 Median :0.03340 Median :120.0
## Mean :-5.216 Mean :8.545 Mean :0.04635 Mean :109.3
## 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.05960 3rd Qu.:120.0
## Max. :-3.600 Max. :8.940 Max. :0.40000 Max. :140.0
## NA's :11 NA's :2
## Pressure Setpoint Air Pressurer Alch Rel Carb Rel
## Min. :44.00 Min. :140.8 Min. :5.280 Min. :4.960
## 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340
## Median :46.00 Median :142.6 Median :6.560 Median :5.400
## Mean :47.61 Mean :142.8 Mean :6.898 Mean :5.437
## 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.550
## Max. :52.00 Max. :148.2 Max. :8.620 Max. :6.060
## NA's :12 NA's :6 NA's :7
## Balling Lvl
## Min. :0.000
## 1st Qu.:1.380
## Median :1.480
## Mean :2.052
## 3rd Qu.:3.140
## Max. :3.660
## NA's :1
describe(bev_raw %>% select(-`Brand Code`))
## vars n mean sd median trimmed mad min
## Carb Volume 1 2556 5.37 0.11 5.35 5.37 0.11 5.04
## Fill Ounces 2 2528 23.97 0.09 23.97 23.98 0.08 23.63
## PC Volume 3 2527 0.28 0.06 0.27 0.27 0.05 0.08
## Carb Pressure 4 2539 68.19 3.54 68.20 68.12 3.56 57.00
## Carb Temp 5 2540 141.09 4.03 140.80 140.99 3.85 128.60
## PSC 6 2533 0.08 0.05 0.08 0.08 0.05 0.00
## PSC Fill 7 2543 0.20 0.12 0.18 0.18 0.12 0.00
## PSC CO2 8 2527 0.06 0.04 0.04 0.05 0.03 0.00
## Mnf Flow 9 2566 24.68 119.50 70.70 21.20 160.57 -100.20
## Carb Pressure1 10 2534 122.56 4.72 123.20 122.52 4.45 105.60
## Fill Pressure 11 2549 47.92 3.18 46.40 47.71 2.37 34.60
## Hyd Pressure1 12 2555 12.46 12.43 11.40 10.87 16.90 -0.80
## Hyd Pressure2 13 2551 21.00 16.38 28.60 21.10 13.34 0.00
## Hyd Pressure3 14 2551 20.49 15.97 27.60 20.54 13.64 -1.20
## Hyd Pressure4 15 2539 96.31 13.10 96.00 95.46 11.86 62.00
## Filler Level 16 2551 109.25 15.70 118.40 111.04 9.19 55.80
## Filler Speed 17 2513 3688.11 769.63 3982.00 3920.25 47.44 998.00
## Temperature 18 2555 65.96 1.38 65.60 65.80 0.89 63.60
## Usage cont 19 2561 21.00 2.98 21.80 21.25 3.17 12.08
## Carb Flow 20 2564 2473.00 1069.56 3030.00 2604.91 323.21 26.00
## Density 21 2566 1.17 0.38 0.98 1.15 0.15 0.30
## MFR 22 2359 704.05 73.90 724.00 718.16 15.42 31.40
## Balling 23 2566 2.20 0.93 1.65 2.13 0.37 0.35
## Pressure Vacuum 24 2566 -5.22 0.57 -5.40 -5.25 0.59 -6.60
## PH 25 2566 8.55 0.17 8.54 8.55 0.18 7.88
## Oxygen Filler 26 2555 0.05 0.04 0.03 0.04 0.02 0.00
## Bowl Setpoint 27 2564 109.34 15.29 120.00 111.36 0.00 70.00
## Pressure Setpoint 28 2554 47.61 2.04 46.00 47.60 0.00 44.00
## Air Pressurer 29 2566 142.83 1.21 142.60 142.58 0.59 140.80
## Alch Rel 30 2560 6.90 0.51 6.56 6.84 0.06 5.28
## Carb Rel 31 2559 5.44 0.13 5.40 5.43 0.12 4.96
## Balling Lvl 32 2565 2.05 0.87 1.48 1.98 0.21 0.00
## max range skew kurtosis se
## Carb Volume 5.70 0.66 0.39 -0.47 0.00
## Fill Ounces 24.32 0.69 -0.02 0.87 0.00
## PC Volume 0.48 0.40 0.35 0.67 0.00
## Carb Pressure 79.40 22.40 0.18 -0.01 0.07
## Carb Temp 154.00 25.40 0.24 0.23 0.08
## PSC 0.27 0.27 0.85 0.65 0.00
## PSC Fill 0.62 0.62 0.93 0.77 0.00
## PSC CO2 0.24 0.24 1.73 3.71 0.00
## Mnf Flow 229.40 329.60 0.00 -1.87 2.36
## Carb Pressure1 140.20 34.60 0.03 0.10 0.09
## Fill Pressure 60.40 25.80 0.55 1.41 0.06
## Hyd Pressure1 58.00 58.80 0.78 -0.15 0.25
## Hyd Pressure2 59.40 59.40 -0.31 -1.56 0.32
## Hyd Pressure3 50.00 51.20 -0.32 -1.57 0.32
## Hyd Pressure4 142.00 80.00 0.56 0.61 0.26
## Filler Level 161.20 105.40 -0.85 0.05 0.31
## Filler Speed 4030.00 3032.00 -2.88 6.75 15.35
## Temperature 76.20 12.60 2.39 10.25 0.03
## Usage cont 25.90 13.82 -0.54 -1.02 0.06
## Carb Flow 5104.00 5078.00 -0.99 -0.57 21.12
## Density 1.92 1.62 0.54 -1.22 0.01
## MFR 868.60 837.20 -5.09 30.46 1.52
## Balling 4.01 3.67 0.60 -1.41 0.02
## Pressure Vacuum -3.60 3.00 0.53 -0.03 0.01
## PH 8.94 1.06 -0.33 -0.08 0.00
## Oxygen Filler 0.40 0.40 2.41 8.89 0.00
## Bowl Setpoint 140.00 70.00 -0.97 -0.06 0.30
## Pressure Setpoint 52.00 8.00 0.20 -1.60 0.04
## Air Pressurer 148.20 7.40 2.25 4.72 0.02
## Alch Rel 8.62 3.34 0.88 -0.85 0.01
## Carb Rel 6.06 1.10 0.50 -0.30 0.00
## Balling Lvl 3.66 3.66 0.60 -1.51 0.02
To filter for near-zero variance predictors, the caret package function nearZeroVar will return the column numbers of any predictors that fulfill the conditions outlined. A zero variance predictor will never be chosen for a split since it offers no possible predictive information.
df1 <- bev_raw %>% select(-`Brand Code`) %>% mutate_each(funs(as.numeric(.)))%>%complete.cases()%>%
as.data.frame()
names(df1)[nearZeroVar(df1)]
## character(0)
nzv <- nearZeroVar(bev_raw,saveMetrics= TRUE)
nzv[nzv$nzv,]
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31 9.547935 FALSE TRUE
Box plots for the variables reveal, that besides having the outliers in the variables, most of the variables are skewed. For example: Variables density, carb flow, filler speed and oxygen filler are skewed providing us an opportunity to further check their distribution.
df.m <- melt(bev_raw %>% select(-MFR, -`Filler Speed`, -`Carb Flow`,-`Bowl Setpoint`,`Carb Pressure1`,
`Hyd Pressure4`, `Air Pressurer`, `Carb Temp`, `Filler Level`, `Mnf Flow`), id.var = "Brand Code")
p <-ggplot(data = df.m, aes(x=variable, y=value)) +
geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4,aes(fill=variable)) +
scale_y_continuous(name = "Predictors for PH", breaks = seq(0, 2, 0.5)) +
coord_flip()
p
Normality is one of the most widely used technique to understand the continuous predictors. In the below plot we can see normal distribution behavior of the given dataset for different features.
bev_raw%>%
select(-`Brand Code`) %>%
select(2:20) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
bev_raw%>%
select(-`Brand Code`) %>%
select(`Carb Flow`, Density, MFR, Balling, `Pressure Vacuum`, PH, `Oxygen Filler`, `Bowl Setpoint`, `Pressure Setpoint`, `Air Pressurer`, `Alch Rel`,`Carb Rel`,`Balling Lvl`) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
We perform 3 data preparation steps - Remove the near-zero variance variables we previously mentioned - Impute missing values with a Random Forest Regression with the MICE package - Create dummy variables for the categorical variable, brand
Random Forest was chosen as the regression method in imputation because it requires very little tuning, allowing and there is no interface to tune the imputation model in MICE.
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(42)
bev <- select(bev_raw, -PH) %>%
union_all(se_data) #since we're not using the labels (y values), combining the sets for data clearning is fine
cols <- str_replace(colnames(bev), '\\s', '')
colnames(bev) <- cols
bev <- mutate(bev, BrandCode = ifelse(is.na(BrandCode), 'Unknown', BrandCode))
y <- bev_raw$PH
zero_vars <- nearZeroVar(bev[1:2566, ])
bev_new <- bev[, -zero_vars]
pred <- mice(data = bev_new, m = 3, method = 'rf', maxit = 3)
bev_imputed <- complete(pred)
form <- dummyVars(~ ., data = bev_imputed)
bev_imputed <- predict(form, bev_imputed) %>% data.frame() %>% as.matrix()
We will split the data into train/test sets before running cross validation. This will allow us to make sure that predictions are consistent at all levels of PH.
bev_eval <- bev_imputed[2567:2833,]
bev_train <- bev_imputed[1:2566,]
samples <- createDataPartition(y, p = .75, list = F)
x_train <- bev_train[samples, ]
x_test<- bev_train[-samples, ]
y_train <- y[samples]
y_test <- y[-samples]
For state-of-the-art prediction quality, we will use a model stack. This will consist of tuning models separately and then combining the candidate models into a metamodel that will formulate predictions with a linear combination of our tuned models.
From the perspective of understanding the manufacturing process, the model stack will also provide benefits. The stack is like a panel of experts, each looking at the data through slightly different lenses to form their diagnoses. By looking at the predictors each model uses, we can gather assemble a complete picture of the factors that affect our manufacturing process.
Four of the five models will be tree-based ensembles because this model type can handle non-linear relationships, is robust to outliers and skewed distributions, and can model complex interactions. One non-tree modle is included for divsersity in the stack.
xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
rf_grid <- expand.grid(mtry = 25)
tuning_list <-list(
caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
caretModelSpec(method="cubist", tuneGrid = cub_grid),
caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
)
my_control <- trainControl(method = 'cv',
number = 5,
savePredictions = 'final',
index = createFolds(y_train, 5),
allowParallel = TRUE)
mod_list <- caretList(
x = x_train,
y = y_train,
preProcess = c('center', 'scale'),
trControl = my_control,
tuneList = tuning_list
)
The training grid is commented out because of the time required to run, but can be uncommented for reproducibility of our model optimization
#helper function
show_results <- function(model){
rslts <- model$results %>%
arrange(RMSE)
head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
}
show_results(mod_list$bagEarth)
degree | nprune | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|
2 | 24 | 0.1452284 | 0.3442843 | 0.0974178 | 0.0224892 | 0.0981137 | 0.0023522 |
predict(mod_list$bagEarth, x_test) %>%
postResample(y_test)
## RMSE Rsquared MAE
## 0.12291518 0.53214584 0.09257829
show_results(mod_list$rf)
mtry | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
25 | 0.1201781 | 0.508212 | 0.0916387 | 0.0022317 | 0.0173091 | 0.0018639 |
predict(mod_list$rf, x_test) %>%
postResample(y_test)
## RMSE Rsquared MAE
## 0.09838320 0.71234554 0.06991311
show_results(mod_list$cubist)
committees | neighbors | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|
5 | 7 | 0.1230572 | 0.4793229 | 0.0922648 | 0.0026294 | 0.0169731 | 0.0013288 |
predict(mod_list$cubist, x_test) %>%
postResample(y_test)
## RMSE Rsquared MAE
## 0.10363209 0.66533091 0.07230681
show_results(mod_list$xgbTree)
eta | nrounds | max_depth | gamma | colsample_bytree | min_child_weight | subsample | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 1000 | 6 | 0 | 0.8 | 0.8 | 0.8 | 0.1211406 | 0.4911182 | 0.0917093 | 0.001546 | 0.0146323 | 0.001122 |
predict(mod_list$xgbTree, x_test) %>%
postResample(y_test)
## RMSE Rsquared MAE
## 0.10330553 0.67674199 0.07522457
show_results(mod_list$xgbDART)
eta | nrounds | gamma | skip_drop | rate_drop | max_depth | colsample_bytree | min_child_weight | subsample | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 1000 | 0.1 | 0.6 | 0.4 | 6 | 0.6 | 0.6 | 0.6 | 0.1296332 | 0.4451329 | 0.1028912 | 0.0017605 | 0.0167511 | 0.0013675 |
predict(mod_list$xgbDART, x_test) %>%
postResample(y_test)
## RMSE Rsquared MAE
## 0.11611208 0.61036059 0.09047387
Assessment
While the XGBTee model offers the lowest RMSE, it still only explains about 50% of the variance in PH. There is plenty of room for the other models to offer value. We will use the hyperparameters yielding the best RMSE in our final ensemble that will be built later.
p1 <- varImp(mod_list$cubist) %>%
plot(top = 10, main = 'Cubist')
p2 <- varImp(mod_list$xgbTree) %>%
plot(top = 10, main = 'XGBTrees')
p3 <- varImp(mod_list$xgbDART) %>%
plot(top = 10, main = 'XGBDart')
p4 <- varImp(mod_list$bagEarth) %>%
plot(top = 10, main = 'MARS')
p5 <- varImp(mod_list$rf) %>%
plot(top = 10, main = 'RF')
grid.arrange(p1,p2,p3,p4,p5, nrow = 3, ncol = 2)
Observations:
Based on what we see, there are more factors contributing to PH with the ensemble than with a single model. We’ll confirm this by looking at the count by predictor among each top 10:
Variable Counts among the Models
top_10 <- function(model) {
varImp(model)$importance %>%
rownames_to_column('var') %>%
arrange(desc(Overall)) %>%
head(10) %>%
select(var)
}
rbind(top_10(mod_list$xgbDART), top_10(mod_list$xgbTree), top_10(mod_list$cubist), top_10(mod_list$bagEarth), top_10(mod_list$rf)) %>%
group_by(var) %>%
summarise(ModelCount = n()) %>%
arrange(desc(ModelCount)) %>%
kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
var | ModelCount |
---|---|
AirPressurer | 5 |
AlchRel | 5 |
MnfFlow | 5 |
PressureVacuum | 5 |
BrandCodeC | 4 |
OxygenFiller | 4 |
Temperature | 4 |
BowlSetpoint | 3 |
CarbRel | 3 |
Usagecont | 3 |
Balling | 2 |
BallingLvl | 2 |
HydPressure3 | 2 |
CarbFlow | 1 |
Density | 1 |
FillerSpeed | 1 |
Best Single Tree
Because 4 of the 5 models in the ensemble are tree-based, looking a single decision tree model will give us an idea of not just what predictors are important, but how they are being used to form predictions.
rp_mod <- rpart::rpart(ph ~ ., data = data.frame(bev_train, ph = y), method = 'anova')
rpart.plot(rp_mod, uniform=TRUE)
Now that our models are optimized and understood, we will form the ensemble and test the results. We first look at their prediction correlations to confirm that the stack, intuitively, be effective. The lower the correlation between the models, the more effective the stack will be.
resamples <- resamples(mod_list)
modelCor(resamples) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
xgbTree | xgbDART | cubist | rf | bagEarth | |
---|---|---|---|---|---|
xgbTree | 1.0000000 | 0.0359329 | 0.5686530 | 0.9384567 | 0.2858379 |
xgbDART | 0.0359329 | 1.0000000 | 0.4606575 | 0.1812077 | 0.2035874 |
cubist | 0.5686530 | 0.4606575 | 1.0000000 | 0.4387601 | 0.0939014 |
rf | 0.9384567 | 0.1812077 | 0.4387601 | 1.0000000 | 0.1553078 |
bagEarth | 0.2858379 | 0.2035874 | 0.0939014 | 0.1553078 | 1.0000000 |
Our model list has negative correlations, indicating that it will be effective.
ensemble1 <- caretStack(
mod_list,
method="glmnet",
metric="RMSE",
trControl=trainControl(
method="cv",
number=5,
savePredictions="final"
)
)
ensemble1$ens_model$results %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
alpha | lambda | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|
0.10 | 0.0002408 | 0.1163325 | 0.5291532 | 0.0869919 | 0.0024546 | 0.0235280 | 0.0018167 |
0.10 | 0.0024077 | 0.1164400 | 0.5283258 | 0.0873434 | 0.0022805 | 0.0230699 | 0.0017847 |
0.10 | 0.0240768 | 0.1174627 | 0.5216820 | 0.0888550 | 0.0020787 | 0.0230548 | 0.0018040 |
0.55 | 0.0002408 | 0.1163292 | 0.5291801 | 0.0870144 | 0.0024430 | 0.0234748 | 0.0018136 |
0.55 | 0.0024077 | 0.1167687 | 0.5257197 | 0.0879296 | 0.0021121 | 0.0225295 | 0.0017564 |
0.55 | 0.0240768 | 0.1180667 | 0.5238896 | 0.0897646 | 0.0021940 | 0.0226635 | 0.0018279 |
1.00 | 0.0002408 | 0.1163299 | 0.5291771 | 0.0870469 | 0.0024256 | 0.0233859 | 0.0018068 |
1.00 | 0.0024077 | 0.1170109 | 0.5238458 | 0.0883080 | 0.0021029 | 0.0223825 | 0.0018120 |
1.00 | 0.0240768 | 0.1195952 | 0.5235662 | 0.0915518 | 0.0022784 | 0.0222044 | 0.0018043 |
We see a several percent improvement over the best single model.
preds <- predict(ensemble1, x_test)
preds %>% postResample(y_test)
## RMSE Rsquared MAE
## 0.09345040 0.73030404 0.06499146
Predicted Vs Actual
preds_df <- data.frame(predicted = preds, actual = y_test)
ggplot(preds_df) + geom_point(aes(x = predicted, y = actual)) + geom_smooth(aes(x = predicted, y = actual))
On the test set, the ensemble’s predictions are consistent throughout the range of PH. Consistent quality is, theoretically, a feature of ensemble models because the models are weighted such that predictions are maximized. If one model predicts better at lower PH, it will be upweighted over that range. There are several outliers, but when not all of the variance is explainable, this is unavoidable.
With our hyperparameters of the individual models optimized, we want to provide each model the entire training set and build the meta-model based on these optimized single models.
xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
rf_grid <- expand.grid(mtry = 25)
tuning_list <-list(
caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
caretModelSpec(method="cubist", tuneGrid = cub_grid),
caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
)
my_control <- trainControl(method = 'cv',
number = 5,
savePredictions = 'final',
index = createFolds(y, 5),
allowParallel = TRUE)
mod_list <- caretList(
x = bev_train,
y = y,
preProcess = c('center', 'scale'),
trControl = my_control,
tuneList = tuning_list
)
final_ensemble <- caretStack(
mod_list,
method="glmnet",
metric="RMSE",
trControl=trainControl(
method="cv",
number=5,
savePredictions="final"
)
)
final_ensemble$ens_model$results %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
alpha | lambda | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|---|
0.10 | 0.0002475 | 0.1152983 | 0.5496251 | 0.0852560 | 0.0014370 | 0.0099183 | 0.0011410 |
0.10 | 0.0024746 | 0.1154572 | 0.5484537 | 0.0856576 | 0.0013007 | 0.0096406 | 0.0011113 |
0.10 | 0.0247464 | 0.1166370 | 0.5409228 | 0.0874485 | 0.0011971 | 0.0112153 | 0.0011077 |
0.55 | 0.0002475 | 0.1153058 | 0.5495683 | 0.0852897 | 0.0014150 | 0.0099203 | 0.0011290 |
0.55 | 0.0024746 | 0.1159065 | 0.5450050 | 0.0863737 | 0.0012400 | 0.0099376 | 0.0011478 |
0.55 | 0.0247464 | 0.1172433 | 0.5435591 | 0.0883593 | 0.0013210 | 0.0100228 | 0.0010285 |
1.00 | 0.0002475 | 0.1153181 | 0.5494758 | 0.0853263 | 0.0013953 | 0.0099031 | 0.0011202 |
1.00 | 0.0024746 | 0.1161103 | 0.5435126 | 0.0867137 | 0.0012249 | 0.0099660 | 0.0010985 |
1.00 | 0.0247464 | 0.1188490 | 0.5434207 | 0.0902560 | 0.0014307 | 0.0099946 | 0.0010162 |
With more data available, the ensemble improve around 6%. Our model is considerably more predictive than a “baseline” single model.
Ensemble Model Importances
plot(varImp(final_ensemble$ens_model))
The MARS model is excluded, but all four tree models contribute the the final prediction
Export Predictions
predictions <- predict(final_ensemble, bev_eval)
predictions %>% tibble::enframe(name = NULL) %>% datatable()
write.csv(predictions, file = "predictions.csv")
After working on extracting the data from the given files, we processed data cleansing and handle the missing values along with the NAs. We trained and tested the data 75% to 25% respectively. Our models were able to produce for us predicted values for pH which are also saved in predictions.csv file separately. We notice that all the values predicted are greater than 7 and more specifically greater than 8. This scale translates into saying that the beverage made is alkaline. At the beginning of this study, we were not informed about the nature of the ABC Beverage company, meaning of what type of beverage manufacturer it was. But from our studies we can conclude that this company produces alkaline beverages like water, dairy, tea, fruit drinks, etc.