This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both RPubs links and .Rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' predictions for pH.
library(tidyverse)
library(kableExtra)
library(xgboost)
library(plyr)
library(e1071)
library(corrplot)
library(ggplot2)
library(tidyr)
library(dplyr)
library(caret)
library(Matrix)
library(writexl)
library(psych)
The evaluation data set (the set for which we must predict PH) contains 267 observations and 33 variables. Brand.Code is a character variable; the remaining variables are numeric. PH is the response variable.
## [1] 2571 33
## [1] 2038
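For reference, the two outputs above can be reproduced with checks like the following (a sketch; `df` denotes the training data frame):

```r
dim(df)                  # 2571 rows, 33 columns
sum(complete.cases(df))  # 2038 rows with no missing values
```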
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## A : 293 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## B :1239 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60
## C : 304 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## D : 615 Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19
## NA's: 120 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## NA's :10 NA's :38 NA's :39 NA's :27
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## NA's :26 NA's :33 NA's :23 NA's :39
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00
## Median : 65.20 Median :123.2 Median :46.40 Median :11.40
## Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
## NA's :2 NA's :32 NA's :22 NA's :11
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3
## Median :28.60 Median :27.60 Median : 96.00 Median :118.4
## Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3
## 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
## Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
## NA's :15 NA's :15 NA's :30 NA's :20
## Filler.Speed Temperature Usage.cont Carb.Flow Density
## Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240
## 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900
## Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980
## Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174
## 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620
## Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
## NA's :57 NA's :14 NA's :5 NA's :2 NA's :1
## MFR Balling Pressure.Vacuum PH
## Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880
## 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440
## Median :724.0 Median : 1.648 Median :-5.400 Median :8.540
## Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546
## 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680
## Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360
## NA's :212 NA's :1 NA's :4
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :6.560 Median :5.400 Median :1.48
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
The training data set contains 2571 observations and 33 variables. Brand.Code is a character variable; the remaining variables are numeric. PH is the response variable.
## [1] "Brand.Code" "Carb.Volume" "Fill.Ounces"
## [4] "PC.Volume" "Carb.Pressure" "Carb.Temp"
## [7] "PSC" "PSC.Fill" "PSC.CO2"
## [10] "Mnf.Flow" "Carb.Pressure1" "Fill.Pressure"
## [13] "Hyd.Pressure1" "Hyd.Pressure2" "Hyd.Pressure3"
## [16] "Hyd.Pressure4" "Filler.Level" "Filler.Speed"
## [19] "Temperature" "Usage.cont" "Carb.Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure.Vacuum" "PH" "Oxygen.Filler"
## [28] "Bowl.Setpoint" "Pressure.Setpoint" "Air.Pressurer"
## [31] "Alch.Rel" "Carb.Rel" "Balling.Lvl"
Variable | n | mean | sd | median | min | max | range | skew | kurtosis |
---|---|---|---|---|---|---|---|---|---|
Brand.Code* | 2451 | 2.51 | 1.00 | 2.00 | 1.00 | 4.00 | 3.00 | 0.38 | -1.06 |
Carb.Volume | 2561 | 5.37 | 0.11 | 5.35 | 5.04 | 5.70 | 0.66 | 0.39 | -0.47 |
Fill.Ounces | 2533 | 23.97 | 0.09 | 23.97 | 23.63 | 24.32 | 0.69 | -0.02 | 0.86 |
PC.Volume | 2532 | 0.28 | 0.06 | 0.27 | 0.08 | 0.48 | 0.40 | 0.34 | 0.67 |
Carb.Pressure | 2544 | 68.19 | 3.54 | 68.20 | 57.00 | 79.40 | 22.40 | 0.18 | -0.01 |
Carb.Temp | 2545 | 141.09 | 4.04 | 140.80 | 128.60 | 154.00 | 25.40 | 0.25 | 0.24 |
PSC | 2538 | 0.08 | 0.05 | 0.08 | 0.00 | 0.27 | 0.27 | 0.85 | 0.65 |
PSC.Fill | 2548 | 0.20 | 0.12 | 0.18 | 0.00 | 0.62 | 0.62 | 0.93 | 0.77 |
PSC.CO2 | 2532 | 0.06 | 0.04 | 0.04 | 0.00 | 0.24 | 0.24 | 1.73 | 3.73 |
Mnf.Flow | 2569 | 24.57 | 119.48 | 65.20 | -100.20 | 229.40 | 329.60 | 0.00 | -1.87 |
Carb.Pressure1 | 2539 | 122.59 | 4.74 | 123.20 | 105.60 | 140.20 | 34.60 | 0.05 | 0.14 |
Fill.Pressure | 2549 | 47.92 | 3.18 | 46.40 | 34.60 | 60.40 | 25.80 | 0.55 | 1.41 |
Hyd.Pressure1 | 2560 | 12.44 | 12.43 | 11.40 | -0.80 | 58.00 | 58.80 | 0.78 | -0.14 |
Hyd.Pressure2 | 2556 | 20.96 | 16.39 | 28.60 | 0.00 | 59.40 | 59.40 | -0.30 | -1.56 |
Hyd.Pressure3 | 2556 | 20.46 | 15.98 | 27.60 | -1.20 | 50.00 | 51.20 | -0.32 | -1.57 |
Hyd.Pressure4 | 2541 | 96.29 | 13.12 | 96.00 | 52.00 | 142.00 | 90.00 | 0.55 | 0.63 |
Filler.Level | 2551 | 109.25 | 15.70 | 118.40 | 55.80 | 161.20 | 105.40 | -0.85 | 0.05 |
Filler.Speed | 2514 | 3687.20 | 770.82 | 3982.00 | 998.00 | 4030.00 | 3032.00 | -2.87 | 6.71 |
Temperature | 2557 | 65.97 | 1.38 | 65.60 | 63.60 | 76.20 | 12.60 | 2.39 | 10.16 |
Usage.cont | 2566 | 20.99 | 2.98 | 21.79 | 12.08 | 25.90 | 13.82 | -0.54 | -1.02 |
Carb.Flow | 2569 | 2468.35 | 1073.70 | 3028.00 | 26.00 | 5104.00 | 5078.00 | -0.99 | -0.58 |
Density | 2570 | 1.17 | 0.38 | 0.98 | 0.24 | 1.92 | 1.68 | 0.53 | -1.20 |
MFR | 2359 | 704.05 | 73.90 | 724.00 | 31.40 | 868.60 | 837.20 | -5.09 | 30.46 |
Balling | 2570 | 2.20 | 0.93 | 1.65 | -0.17 | 4.01 | 4.18 | 0.59 | -1.39 |
Pressure.Vacuum | 2571 | -5.22 | 0.57 | -5.40 | -6.60 | -3.60 | 3.00 | 0.53 | -0.03 |
PH | 2567 | 8.55 | 0.17 | 8.54 | 7.88 | 9.36 | 1.48 | -0.29 | 0.06 |
Oxygen.Filler | 2559 | 0.05 | 0.05 | 0.03 | 0.00 | 0.40 | 0.40 | 2.66 | 11.09 |
Bowl.Setpoint | 2569 | 109.33 | 15.30 | 120.00 | 70.00 | 140.00 | 70.00 | -0.97 | -0.06 |
Pressure.Setpoint | 2559 | 47.62 | 2.04 | 46.00 | 44.00 | 52.00 | 8.00 | 0.20 | -1.60 |
Air.Pressurer | 2571 | 142.83 | 1.21 | 142.60 | 140.80 | 148.20 | 7.40 | 2.25 | 4.73 |
Alch.Rel | 2562 | 6.90 | 0.51 | 6.56 | 5.28 | 8.62 | 3.34 | 0.88 | -0.85 |
Carb.Rel | 2561 | 5.44 | 0.13 | 5.40 | 4.96 | 6.06 | 1.10 | 0.50 | -0.29 |
Balling.Lvl | 2570 | 2.05 | 0.87 | 1.48 | 0.00 | 3.66 | 3.66 | 0.59 | -1.49 |
There are 2038 observations with no missing values in any of the 33 columns, which indicates that missingness is limited. It also means that 533 (2571 minus 2038) observations have missing values in at least one variable.
Looking at the missing-value counts for each numerical variable, the maximum is 212 NAs in MFR, followed by Filler.Speed (57 missing), then a few variables with missing counts in the 30s (PC.Volume, Fill.Ounces, PSC.CO2, Carb.Pressure1, Hyd.Pressure4), then variables with missing counts in the 20s (Carb.Pressure, Carb.Temp, PSC.Fill, Fill.Pressure, Filler.Level). The remaining variables have missing counts in the teens or below.
The kurtosis of each variable also confirms that MFR is highly skewed, with kurtosis = 30.46. MFR has a median of 724 and a mean of 704, but it ranges from 31.4 to 868.6, a whopping range of 837!
Besides MFR, the skewness of the remaining variables is acceptable. The next batch of variables with relatively heavy tails (judged by their kurtosis values) is Temperature (10.16), Oxygen.Filler (11.09), and Air.Pressurer (4.73).
The outcome variable, the PH value of the beverage, is shown in the histogram here. It is a continuous variable with no gaps and no clear pattern of missingness.
Except for a few somewhat outlying observations in the right tail, it closely follows a normal distribution. There are slightly more observations on the right side (higher values) of the histogram, but we decided not to do much about it because the skewness is minimal.
We judged this outcome satisfactory to use as-is, as a numerical variable. Below, we build models that treat the PH outcome as a continuous numerical variable, without imposing cutoff points.
Histograms of the individual predictor variables indicate that, besides the numerical variables, there are several effectively categorical (discrete) variables, such as Pressure.Setpoint and Alch.Rel.
The obviously discrete variables are: Brand.Code (four brands, A through D), Pressure.Setpoint, Bowl.Setpoint, PSC.CO2, and Pressure.Vacuum. Each of these variables takes no more than 10-12 unique values.
Some variables are bimodal (such as Pressure.Setpoint, Density, Hyd.Pressure2, Hyd.Pressure3).
Multimodal variables (more than two modes) include Carb.Flow.
The histograms also show the strong left skew of MFR (skew = -5.09), which has a pronounced spike of counts near its median (around 720).
These variables have a significant number of observations at 0: Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3.
Judging from the histograms, the variables closest to normally distributed are: Carb.Pressure, Carb.Pressure1, Carb.Temp, Carb.Volume, Fill.Ounces, PC.Volume.
We chose bins = 15 and facet-wrapped the histograms; these findings are preserved when the number of bins is changed.
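A minimal sketch of how these faceted histograms can be produced with ggplot2 (assuming the training data is in a data frame named `df`; only the numeric columns are plotted):

```r
df %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 15) +              # bins = 15, as noted above
  facet_wrap(~ variable, scales = "free")  # one panel per predictor
```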
Because some of the variables are skewed, the box plots flag many points in these predictors as outliers. These variables include MFR, Filler.Speed, Oxygen.Filler, and Air.Pressurer. We expect that after transformation later on, some of these so-called "outliers" will not persist.
Besides the four variables above, these variables also show extreme outliers: PSC.Fill, PSC.CO2, Temperature, Pressure.Vacuum, Alch.Rel, Carb.Rel.
Interestingly, the outcome variable PH also has a few outliers.
The plot below shows the relationships between the target and the explanatory variables.
Among the predictor variables, the majority have a clear association with the outcome. Roughly half of the numerical, continuous predictors show a clearly linear relationship with the outcome. The predictors that clearly demonstrate linearity with the outcome include:
Carb.Volume, Fill.Ounces, PC.Volume, Carb.Pressure, Carb.Temp, PSC.Fill, PSC, Carb.Pressure1, Carb.Rel.
Interpreting these variables from a common-sense perspective, they all make sense in beverage production. These variables tend to be well and continuously measured, often reflecting the production environment rather than worker-controlled settings, so it is not surprising that they relate linearly to the beverage's PH (the outcome).
The above is good news from the predictors, favoring a linear model as well as tree-based models. However, many other variables, even though they are numerical predictors, either have outliers or are not measured continuously enough (their measurements are interrupted in patterns), and may therefore produce errors if we fit a linear model to the outcome directly, without tuning these variables or without more sophisticated modeling. Such imperfect numerical predictor variables include:
Mnf.Flow, Fill.Pressure, Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3, Filler.Level, Filler.Speed, Temperature, Carb.Flow, MFR, Density, Balling, Oxygen.Filler, Air.Pressurer.
Finally, several predictor variables are discrete or nominal, with fewer than 10 (or even 3) levels. When we fit these variables into a model, we have to be extremely careful: such crude encodings can oversimplify the nature of these variables and the interpretation of their effects.
We find the following very strong correlations: Carb.Volume with Density, Balling, Alch.Rel, Carb.Rel, and Balling.Lvl.
Carb.Pressure with Carb.Temp; Filler.Level with Bowl.Setpoint; Filler.Speed with MFR.
The correlation plot above indicates that some explanatory variables are correlated with each other.
We identified explanatory variables that are highly correlated with each other using findCorrelation() with a cutoff of 0.6.
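A sketch of that screening step with caret's findCorrelation(); `num_df` is assumed to hold the numeric predictors:

```r
corr_mat <- cor(num_df, use = "pairwise.complete.obs")  # pairwise to tolerate NAs
findCorrelation(corr_mat, cutoff = 0.6, names = TRUE)   # names of variables to drop
```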
## [1] "Mnf.Flow" "Balling" "Hyd.Pressure3" "Alch.Rel"
## [5] "Balling.Lvl" "Carb.Rel" "Density" "Hyd.Pressure2"
## [9] "Fill.Pressure" "Filler.Level" "Carb.Pressure" "Filler.Speed"
Below are the top 10 explanatory variables most positively correlated with PH.
## rowname PH
## 1 PH 1.00000000
## 2 Bowl.Setpoint 0.36158753
## 3 Filler.Level 0.35204396
## 4 Carb.Flow 0.23359370
## 5 Pressure.Vacuum 0.21973550
## 6 Carb.Rel 0.19605148
## 7 Alch.Rel 0.16668223
## 8 Oxygen.Filler 0.16448536
## 9 Balling.Lvl 0.10937117
## 10 PC.Volume 0.09886673
Bowl.Setpoint, Filler.Level, Carb.Flow, and Pressure.Vacuum are the four explanatory variables most positively associated with the PH outcome.
Their correlations with PH range from 0.36 (1st) down to 0.22 (4th).
The next set of variables (Carb.Rel through PC.Volume) have correlations with the outcome ranging from 0.196 down to 0.099.
Below are the top 10 explanatory variables most negatively correlated with PH.
## rowname PH
## 1 Mnf.Flow -0.4592313
## 2 Usage.cont -0.3576120
## 3 Fill.Pressure -0.3165145
## 4 Pressure.Setpoint -0.3116639
## 5 Hyd.Pressure3 -0.2681018
## 6 Hyd.Pressure2 -0.2226600
## 7 Temperature -0.1826596
## 8 Hyd.Pressure4 -0.1714340
## 9 Carb.Pressure1 -0.1187642
## 10 Fill.Ounces -0.1183359
Mnf.Flow stands out as the variable most negatively associated with the PH outcome (correlation = -0.46), noticeably stronger than the second variable, Usage.cont (correlation = -0.36).
Fill.Pressure and Pressure.Setpoint also correlate with PH at around -0.31 to -0.32.
Hyd.Pressure1 should be removed from the dataset since it is a near-zero-variance variable (a large share of its values sit at 0). We handle this in the Model Data Preprocessing part.
We used the VIM library to explore the missing values.
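A sketch of the VIM call that produces the sorted missingness summary and plot below (argument values are illustrative):

```r
library(VIM)
aggr(df, numbers = TRUE, prop = TRUE, sortVars = TRUE,
     labels = names(df), cex.axis = 0.7, gap = 3,
     ylab = c("Proportion missing", "Missingness pattern"))
```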
##
## Variables sorted by number of missings:
## Variable Count
## MFR 0.0824581875
## Brand.Code 0.0466744457
## Filler.Speed 0.0221703617
## PC.Volume 0.0151691949
## PSC.CO2 0.0151691949
## Fill.Ounces 0.0147802412
## PSC 0.0128354726
## Carb.Pressure1 0.0124465189
## Hyd.Pressure4 0.0116686114
## Carb.Pressure 0.0105017503
## Carb.Temp 0.0101127966
## PSC.Fill 0.0089459354
## Fill.Pressure 0.0085569817
## Filler.Level 0.0077790743
## Hyd.Pressure2 0.0058343057
## Hyd.Pressure3 0.0058343057
## Temperature 0.0054453520
## Oxygen.Filler 0.0046674446
## Pressure.Setpoint 0.0046674446
## Hyd.Pressure1 0.0042784909
## Carb.Volume 0.0038895371
## Carb.Rel 0.0038895371
## Alch.Rel 0.0035005834
## Usage.cont 0.0019447686
## PH 0.0015558149
## Mnf.Flow 0.0007779074
## Carb.Flow 0.0007779074
## Bowl.Setpoint 0.0007779074
## Density 0.0003889537
## Balling 0.0003889537
## Balling.Lvl 0.0003889537
## Pressure.Vacuum 0.0000000000
## Air.Pressurer 0.0000000000
MFR stands out as having a significant share of missing values (about 8%).
It is followed by Brand.Code, Filler.Speed, and PSC.CO2.
Together, these variables account for roughly as many missing values as all the other variables combined.
The missingness pattern of MFR indicates that it has more missing values at the high end and also in the middle of its range.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## A: 307 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## B:1314 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933 1st Qu.:65.60
## C: 330 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## D: 620 Mean :5.371 Mean :23.97 Mean :0.27795 Mean :68.22
## 3rd Qu.:5.457 3rd Qu.:24.03 3rd Qu.:0.31267 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08486 Mean :0.1961 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2
## Min. :-100.20 Min. :105.6 Min. :34.6 Min. : 0.00
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.0 1st Qu.: 0.00
## Median : 64.80 Median :123.2 Median :46.4 Median :28.60
## Mean : 24.47 Mean :122.6 Mean :47.9 Mean :20.95
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.0 3rd Qu.:34.60
## Max. : 229.40 Max. :140.2 Max. :60.4 Max. :59.40
## Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## Min. :-1.20 Min. : 52.00 Min. : 55.8 Min. : 998
## 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 97.4 1st Qu.:3819
## Median :27.40 Median : 96.00 Median :118.4 Median :3980
## Mean :20.43 Mean : 96.48 Mean :109.2 Mean :3636
## 3rd Qu.:33.20 3rd Qu.:102.00 3rd Qu.:120.0 3rd Qu.:3997
## Max. :50.00 Max. :142.00 Max. :161.2 Max. :4030
## Temperature Usage.cont Carb.Flow Density MFR
## Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240 Min. : 31.4
## 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1142 1st Qu.:0.900 1st Qu.:695.0
## Median :65.60 Median :21.78 Median :3028 Median :0.980 Median :721.4
## Mean :65.98 Mean :20.99 Mean :2468 Mean :1.174 Mean :672.4
## 3rd Qu.:66.40 3rd Qu.:23.76 3rd Qu.:3187 3rd Qu.:1.620 3rd Qu.:730.4
## Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920 Max. :868.6
## Balling Pressure.Vacuum PH Oxygen.Filler
## Min. :-0.170 Min. :-6.600 Min. :7.880 Min. :0.00240
## 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200
## Median : 1.648 Median :-5.400 Median :8.540 Median :0.03340
## Mean : 2.197 Mean :-5.216 Mean :8.546 Mean :0.04709
## 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.06000
## Max. : 4.012 Max. :-3.600 Max. :9.360 Max. :0.40000
## Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## Min. : 70.0 Min. :44.00 Min. :140.8 Min. :5.280
## 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540
## Median :120.0 Median :46.00 Median :142.6 Median :6.560
## Mean :109.3 Mean :47.61 Mean :142.8 Mean :6.897
## 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.230
## Max. :140.0 Max. :52.00 Max. :148.2 Max. :8.620
## Carb.Rel Balling.Lvl
## Min. :4.960 Min. :0.00
## 1st Qu.:5.340 1st Qu.:1.38
## Median :5.400 Median :1.48
## Mean :5.436 Mean :2.05
## 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :6.060 Max. :3.66
We used the MICE package with the pmm (predictive mean matching) method to impute the missing data.
For variables with near-zero variance, we used the nearZeroVar() function to identify them, so that we would not impute over their structural zeros.
From the earlier data exploration, we know that zero observations occur most often in Hyd.Pressure1.
By choosing not to impute Hyd.Pressure1, we effectively exclude that variable from the modeling data.
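A minimal sketch of this imputation step, assuming the raw training data is in `df` (pmm is mice's default method for numeric columns; the seed is illustrative):

```r
library(mice)
nzv <- nearZeroVar(df, names = TRUE)               # flags e.g. Hyd.Pressure1
imp <- mice(df[, setdiff(names(df), nzv)],
            m = 5, seed = 123, printFlag = FALSE)  # pmm for numeric columns
df_complete <- complete(imp)                       # single completed data set
```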
Let's look at each variable's missing percentage after imputation, to check that nothing was missed.
##
## Variables sorted by number of missings:
## Variable Count
## Brand.Code 0
## Carb.Volume 0
## Fill.Ounces 0
## PC.Volume 0
## Carb.Pressure 0
## Carb.Temp 0
## PSC 0
## PSC.Fill 0
## PSC.CO2 0
## Mnf.Flow 0
## Carb.Pressure1 0
## Fill.Pressure 0
## Hyd.Pressure1 0
## Hyd.Pressure2 0
## Hyd.Pressure3 0
## Hyd.Pressure4 0
## Filler.Level 0
## Filler.Speed 0
## Temperature 0
## Usage.cont 0
## Carb.Flow 0
## Density 0
## MFR 0
## Balling 0
## Pressure.Vacuum 0
## PH 0
## Oxygen.Filler 0
## Bowl.Setpoint 0
## Pressure.Setpoint 0
## Air.Pressurer 0
## Alch.Rel 0
## Carb.Rel 0
Using the aggr() function, we visualized the missingness again.
The left-hand graph shows that there are no missing values in the data anymore; the right-hand figure shows that all the previously incomplete variables are now complete.
The figure (right side) also shows the one variable the imputation excluded, per our instruction to drop the near-zero-variance variable: Hyd.Pressure1. The data is now ready for further analysis.
Splitting the dataset into training and test sets.
We used the 80/20 rule to create the training and testing data sets. The createDataPartition() function was used for this purpose; it draws a random, stratified sample from the completed data ("completed" meaning imputed).
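A sketch of the split (the seed is illustrative):

```r
set.seed(123)
train_idx  <- createDataPartition(df_complete$PH, p = 0.8, list = FALSE)
train_data <- df_complete[train_idx, ]   # 80% for model fitting
test_data  <- df_complete[-train_idx, ]  # 20% held out for evaluation
```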
Now that the data has been imputed and partitioned, it is ready for the various modeling efforts.
First, before any modeling occurred, we created an empty data frame called models_test_evaluation as a placeholder for all the model evaluation metrics. For each model we record the Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE); once these metrics are available from a model run, they are appended to this data frame, one row per model.
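The placeholder looks like this:

```r
# One row is appended per model after it is scored on the held-out test set
models_test_evaluation <- data.frame(Model    = character(),
                                     RMSE     = numeric(),
                                     Rsquared = numeric(),
                                     MAE      = numeric())
```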
We first run a traditional linear regression model, as we have a numerical outcome and mostly numerical predictors.
Next we apply several tree-based and rule-based models, which are more modern and use the predictor variables in an ensemble (bagged) fashion, rather than assuming a linear relationship to the outcome for every predictor individually, which, as we know, is a very strict assumption that our data does not fully support.
Most of the variables are associated with the outcome, but not in a linear fashion.
##
## Call:
## lm(formula = PH ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51974 -0.07740 0.01110 0.08818 0.42252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.160e+01 1.189e+00 9.753 < 2e-16 ***
## Brand.CodeB 5.335e-02 2.126e-02 2.509 0.012175 *
## Brand.CodeC -8.146e-02 2.135e-02 -3.814 0.000141 ***
## Brand.CodeD 7.551e-02 1.778e-02 4.248 2.26e-05 ***
## Carb.Volume -1.123e-01 9.762e-02 -1.150 0.250105
## Fill.Ounces -8.862e-02 3.508e-02 -2.526 0.011600 *
## PC.Volume -1.389e-01 5.724e-02 -2.426 0.015340 *
## Carb.Pressure 3.039e-03 4.604e-03 0.660 0.509249
## Carb.Temp -1.657e-03 3.626e-03 -0.457 0.647812
## PSC -9.067e-02 6.288e-02 -1.442 0.149466
## PSC.Fill -4.208e-02 2.586e-02 -1.627 0.103825
## PSC.CO2 -1.245e-01 6.873e-02 -1.811 0.070243 .
## Mnf.Flow -7.467e-04 5.047e-05 -14.795 < 2e-16 ***
## Carb.Pressure1 6.035e-03 7.624e-04 7.916 4.00e-15 ***
## Fill.Pressure 2.398e-03 1.344e-03 1.785 0.074424 .
## Hyd.Pressure2 -7.184e-04 5.179e-04 -1.387 0.165541
## Hyd.Pressure3 3.187e-03 6.303e-04 5.056 4.66e-07 ***
## Hyd.Pressure4 -3.703e-04 3.401e-04 -1.089 0.276332
## Filler.Level -1.124e-03 6.835e-04 -1.645 0.100216
## Filler.Speed 8.078e-06 1.094e-05 0.739 0.460246
## Temperature -1.284e-02 2.513e-03 -5.109 3.54e-07 ***
## Usage.cont -6.344e-03 1.262e-03 -5.025 5.46e-07 ***
## Carb.Flow 1.161e-05 4.214e-06 2.755 0.005918 **
## Density -9.454e-02 3.052e-02 -3.098 0.001976 **
## MFR -6.350e-05 5.822e-05 -1.091 0.275590
## Balling -1.001e-01 2.625e-02 -3.813 0.000141 ***
## Pressure.Vacuum -2.075e-02 8.341e-03 -2.487 0.012947 *
## Oxygen.Filler -3.115e-01 7.497e-02 -4.155 3.38e-05 ***
## Bowl.Setpoint 3.044e-03 7.148e-04 4.258 2.16e-05 ***
## Pressure.Setpoint -9.112e-03 2.155e-03 -4.228 2.47e-05 ***
## Air.Pressurer -1.022e-03 2.632e-03 -0.388 0.697839
## Alch.Rel 2.068e-02 2.290e-02 0.903 0.366516
## Carb.Rel 7.397e-03 5.141e-02 0.144 0.885609
## Balling.Lvl 1.299e-01 2.393e-02 5.427 6.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1315 on 2024 degrees of freedom
## Multiple R-squared: 0.4176, Adjusted R-squared: 0.4081
## F-statistic: 43.98 on 33 and 2024 DF, p-value: < 2.2e-16
First, we run the linear regression model. This is our basic benchmark model.
Linear regression is a traditional model, our data contains mostly numerical continuous variables, and our outcome, PH, is also continuous, so we chose the linear model as the baseline machine learning technique for predicting the beverage's PH.
We used the lm() function for the linear regression model. All variables were fitted directly into the model as defined in the original data, with missing values imputed.
The overall F statistic is 43.98 on 33 and 2024 degrees of freedom (the training partition contains 2058 observations; subtracting the 34 estimated coefficients leaves 2024 residual degrees of freedom). The overall model has a highly significant p-value, but we have to be careful: overfitting could be the culprit behind it.
Examining the Student's t statistics and associated p-values, the following terms are highly significant: Brand Code C versus A, Brand Code D versus A, Mnf.Flow, Carb.Flow, Carb.Pressure1, Hyd.Pressure3, Temperature, Usage.cont, Balling, Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, and Balling.Lvl.
The next several models do not assume linearity; they are popular machine learning algorithms that are also more faithful to this data. We chose several tree-based models.
Ensemble techniques for nonlinear models have a few advantages. By bagging the variables into trees, the variance of the ensemble's predictions is reduced, which accommodates even unstable predictors under less stringent assumptions than a linear model.
The first tree-based model we selected is the bagged tree. Each model in the bagged-tree ensemble generates a prediction for a new sample, and these M predictions are averaged to give the bagged model's prediction. A two-step algorithm is used: first, a bootstrap sample of the original data is generated; second, a tree model is trained on that sample. This is repeated for each of the M bootstrap samples (m = 1 to M).
Compared with the linear regression model, bagging has another advantage: it provides its own internal estimate of performance from the bootstrap samples, in addition to cross-validation. In our model, we chose five bootstrap samples (nbagg = 5), with 10-fold cross-validation and a tuning length of 25.
We chose not to print each model's evaluation here; instead we produce a summary data set containing all the model evaluations side by side at the end.
Because bagged bootstrapping is a computationally expensive process, the run time is about 5 to 10 times longer than the linear model's.
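A minimal sketch of the bagged-tree fit and scoring with caret (method "treebag"; settings mirror the description above, seed illustrative):

```r
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
set.seed(123)
bag_fit  <- train(PH ~ ., data = train_data, method = "treebag",
                  nbagg = 5, trControl = ctrl)     # 5 bootstrap samples
bag_pred <- predict(bag_fit, newdata = test_data)
postResample(pred = bag_pred, obs = test_data$PH)  # RMSE, R-squared, MAE
```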
The support vector machine (SVM) algorithm has an advantage over linear regression in that it minimizes the effect of outliers.
In linear regression, even one outlier can influence the parameter estimates, but SVM regression uses the squared residuals when the absolute residuals are small and the absolute residuals when they are large. Through this "weighting" effect, the influence of outliers is minimized.
Because our data contains quite a few outliers in several variables, we expect SVM to give us a more robust prediction than the linear model.
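A sketch of the SVM fit (radial basis kernel assumed), reusing `ctrl` from above:

```r
set.seed(123)
svm_fit <- train(PH ~ ., data = train_data, method = "svmRadial",
                 preProcess = c("center", "scale"),  # tame predictor scales
                 trControl = ctrl, tuneLength = 25)
```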
KNN, which stands for K-nearest neighbors, predicts a new sample using the K closest samples (usually via their mean) from the training set. Its predictive power can be hurt when predictors are on different scales, which produces unbalanced distances. Because our variables have this issue, we centered and scaled the predictors to overcome it.
As with the other models, 10-fold cross-validation was chosen and a tuning length of 25 was specified for the KNN model. The model's evaluation metrics were bound into the models_test_evaluation data frame for comparison with the other models.
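A sketch of the KNN fit with centered and scaled predictors:

```r
set.seed(123)
knn_fit <- train(PH ~ ., data = train_data, method = "knn",
                 preProcess = c("center", "scale"),  # balance the distances
                 trControl = ctrl, tuneLength = 25)  # tunes k
```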
Random forest goes one step beyond the bagged-tree model: it differs from simple bagging in that it breaks up the inter-dependency among the bootstrapped trees. It reduces the correlation among the trees by adding randomness to the tree-construction process (each split considers only a random subset of predictors), hence the name random forest.
As with the other models, we specified 10-fold cross-validation and a tuning length of 25, and we export the model's evaluation metrics for later comparison.
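A sketch of the random forest fit:

```r
set.seed(123)
rf_fit <- train(PH ~ ., data = train_data, method = "rf",
                trControl = ctrl, tuneLength = 25,  # searches over mtry
                importance = TRUE)                  # keep variable importance
```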
Cubist is a rule-based machine learning model. A rule-based learner goes one step beyond tree-based modeling: it identifies and uses a set of relational rules that collectively represent the knowledge, in contrast to tree-based models, which effectively apply a single set of rules derived from one tree structure.
A Cubist model resembles a piecewise linear model for predicting numeric values, except that its rules can overlap.
Here, we used 10-fold cross-validation with a tuning length of 25, then stored the RMSE and the other evaluation metrics as a row in the models_test_evaluation data frame for later printing.
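A sketch of the Cubist fit and of appending its test-set metrics to the evaluation table:

```r
set.seed(123)
cubist_fit  <- train(PH ~ ., data = train_data, method = "cubist",
                     trControl = ctrl, tuneLength = 25)
cubist_pred <- predict(cubist_fit, newdata = test_data)
models_test_evaluation <- rbind(
  models_test_evaluation,
  data.frame(Model = "Cubist", t(postResample(cubist_pred, test_data$PH))))
```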
Because of the strict linearity assumption between predictors and outcome, problems arise when many variables are present (in our case, 33 predictors), many of which do not have a perfectly linear relationship with the outcome. One could address this by introducing nonlinearity into the model, supplementing the linear regression with additional complexity: adding a squared or higher-order term for some variables, or adding interaction terms between correlated variables. But the model then becomes overly complex, introducing many unnecessary terms on top of the existing 33 variables and exacerbating the overfitting problem even further.
The multivariate adaptive regression splines (MARS) model is a solution to this dilemma. With a single predictor, MARS can fit separate linear regression lines over different ranges of that predictor. The slopes and intercepts are estimated for this model, along with the number and size of the separate regions for the linear pieces.
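A sketch of the MARS fit (caret method "earth"):

```r
set.seed(123)
mars_fit <- train(PH ~ ., data = train_data, method = "earth",
                  trControl = ctrl, tuneLength = 25)  # tunes nprune/degree
```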
## Model RMSE Rsquared MAE
## 1 MARS 0.1516822 0.2798340 0.11504878
## 2 Cubist 0.1016618 0.6765888 0.06932229
## 3 Random Forest 0.1021156 0.6940524 0.07083906
## 4 KNN 0.1272681 0.4975171 0.09238821
## 5 SVM 0.1218256 0.5365683 0.08454838
## 6 Bagged-Tree 0.1201316 0.5491420 0.08516670
The table above shows our models' performance. We evaluated the models on the held-out test set using three criteria: RMSE, R-squared, and MAE.
Overall, except for the MARS model (R-squared = 0.28), R-squared falls between 0.50 and 0.69 for all the tree-based and rule-based models.
Recall that the R-squared of the linear model is 0.4176 (multiple R-squared) and 0.4081 (adjusted R-squared); the lower R-squared of MARS indicates that it is inferior even to the linear model.
The remaining five models improve on the linear model's R-squared. The improvement is most robust for the random forest model (R-squared = 0.69, roughly a 65% relative improvement over the linear model) and the Cubist model (R-squared = 0.68, a similar improvement). The KNN, SVM, and bagged-tree models have R-squared values of about 0.50 to 0.55, a more modest improvement over the linear model.
RMSE is interpreted as how far, on average, the residuals are from zero.
The RMSE is lowest for the Cubist model (0.102) and the random forest model (0.102). MARS performs worst on RMSE (0.152). The remaining three models (KNN, SVM, bagged tree) have similar RMSEs, around 0.12 to 0.13.
The MAE values follow exactly the same pattern as RMSE: the best performers are Cubist and random forest, the worst is MARS, and the remaining three models perform similarly.
Based on the table above, the Cubist model gives the best performance among the candidate models. We therefore select Cubist as the champion model, predict PH values for the evaluation dataset, and export them to an Excel file.
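A sketch of that export step with writexl (the evaluation-set object and file name are illustrative):

```r
# eval_data_imputed: the imputed 267-row evaluation set whose PH we must predict
eval_pred <- predict(cubist_fit, newdata = eval_data_imputed)
write_xlsx(data.frame(eval_data_imputed, PH_Predicted = eval_pred),
           "PH_predictions.xlsx")
```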
Taking RMSE, R-squared, and MAE all into consideration, Cubist is our best model, with the random forest model following very closely.
The linear model and MARS (multivariate adaptive regression splines) clearly offer little advantage in predicting PH from these 33 variables.
We propose the following research directions:
Cubist is the best-performing model on this data set for predicting the PH outcome from the variables describing the manufacturing process. Clearly, the rule-based model performs better than most of the tree-based models, while the tree-based models perform better than linear regression and also better than multivariate adaptive regression splines, which retains a linearity assumption to some extent.
Cubist uses a set of rules (a knowledge base) that collectively make up the prediction model, via a series of if-then expressions within each member of its ensemble. In the current Cubist analysis, we used only the built-in, automatically generated hyperparameter search.
However, the performance gap between Cubist and the random forest model (a tree-based model) is very small, and the two could trade places as the preferred model under different conditions. To probe what separates them, we propose further hyperparameter tuning of Cubist, including the number of "committees" and the "neighbors" adjustment; see the sketch after this list. According to Kuhn and Johnson, the differences between Cubist and earlier model trees lie in the following details:
• The specific techniques used for linear model smoothing, creating rules, and pruning are different.
• Cubist adds an optional boosting-like procedure called committees.
• The predictions generated by the model rules can be adjusted using nearby points from the training set data.
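An illustrative grid for that follow-up tuning (the values are assumptions, not recommendations):

```r
cubist_grid <- expand.grid(committees = c(1, 10, 50, 100),
                           neighbors  = c(0, 1, 5, 9))
set.seed(123)
cubist_tuned <- train(PH ~ ., data = train_data, method = "cubist",
                      tuneGrid = cubist_grid, trControl = ctrl)
```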
There is one important categorical variable, and it is the only one: Brand.Code. There are four brands in total (A, B, C, D), plus some blanks. Intuitively, it makes sense that different brands target different groups of customers with different tastes, which may be reflected in the PH those customers prefer; the manufacturing process suited to each brand's customer base could likewise differ.
In the current analysis, we performed only an overall (pooled) analysis that did not account for brand differences. For future analysis, we would like to repeat the same analysis for each brand separately. The sample size may be sufficient for some or most of the modeling, though the allocation across brands is uneven (Brand B alone accounts for roughly half the observations).
Bear in mind, however, that with 33 variables to start with (32 without the brand), splitting the sample into subgroups makes overfitting more prominent than in the pooled analysis, so the subgroup results deserve more careful scrutiny.
We will also attempt more detailed preprocessing of the data, in terms of outliers and missing values.
In this data set, a few hundred observations have missing values in some variables. In the current analysis we only excluded the Hyd.Pressure1 variable, which has a large proportion of zeros; we did not put special effort into addressing the outliers in the other variables.
The same is true for missing values. Since they affect fewer than 10% of observations per variable and show no particular pattern, we simply let the imputation take care of them. The only variable we excluded from imputation was Hyd.Pressure1, which has a significant number of observations at 0.
However, some variables deserve further investigation. For example, Mnf.Flow appears among the most important predictors in many of the models, yet it has spikes at a few specific measurement points that do not follow a linear pattern. In the future, we may try mechanisms to tame these spikes, where a larger-than-normal number of observations pile up at particular values; some sort of weighting scheme might be introduced to alleviate their influence.