ABC Beverage is facing new regulations that require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH.
Below is the technical report detailing our examination of the data and the creation of our PH model.
The data consists of 2571 beverage records, including PH (the target) and 32 possible predictors of PH. 31 of the predictors are numeric, and one (Brand Code) is character. Many of the predictors have missing values.
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand Code : chr "B" "A" "B" "A" ...
## $ Carb Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num 24 24 24.1 24 24.3 ...
## $ PC Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num 119 122 120 115 118 ...
## $ Fill Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num 121 119 120 118 119 ...
## $ Filler Speed : num 4002 3986 4020 4012 4010 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num 2932 3144 2914 3062 3054 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num 143 143 142 146 146 ...
## $ Alch Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
A number of predictors have missing values; however, the only one with a substantial share missing is MFR (212 of 2571 records, about 8%). The missing values do not appear to be correlated with each other, and missingness in MFR does not appear to be correlated with PH (r = 0.013, p = 0.50). We therefore conclude that missingness is essentially random in this dataset and apply a simple median imputation to all missing numeric values.
There are also 4 missing target values. Because there are so few, we impute them rather than drop the records from the dataset.
For the only character predictor, Brand Code, we will create an indicator for missing values (Brand.Code_NA) when we dummify it.
## [1] "Correlation between MFR missingness and PH"
##
## Pearson's product-moment correlation
##
## data: dfPH_Num$PH and mfr_flag
## t = 0.67288, df = 2565, p-value = 0.5011
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02541582 0.05194583
## sample estimates:
## cor
## 0.01328488
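The missingness check and median imputation described above can be sketched as follows. This is a minimal illustration on simulated data, not our actual pipeline: the toy data frame, the `impute_median` helper and the column values are assumptions made for the example, and only the name `mfr_flag` mirrors what appears in the test output above.

```r
# Hypothetical toy data standing in for the beverage records
set.seed(1)
n <- 200
df <- data.frame(
  PH          = rnorm(n, mean = 8.5, sd = 0.17),
  MFR         = rnorm(n, mean = 704, sd = 70),
  Temperature = rnorm(n, mean = 66, sd = 1.4)
)
df$MFR[sample(n, 16)] <- NA         # 8% missing, similar to MFR
df$Temperature[sample(n, 2)] <- NA  # a handful missing elsewhere

# Flag records with missing MFR BEFORE imputing, then test whether
# the flag is correlated with the target
mfr_flag  <- as.numeric(is.na(df$MFR))
miss_test <- cor.test(df$PH, mfr_flag)

# Replace missing values in every numeric column with the column median
impute_median <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df_imputed <- as.data.frame(lapply(df, impute_median))

print(miss_test$p.value)
print(colSums(is.na(df_imputed)))
```

Building the flag before imputation matters: once the medians are filled in, the information about which records were missing is gone.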
There is some degree of multicollinearity in the dataset. In particular, Density, Balling, Balling Lvl and Alch Rel are all highly correlated with each other (every pairwise correlation is above 0.9). While multicollinearity does not necessarily interfere with predictive power, it does impede inference. Since we need not only to predict PH but also to understand our model, we will remove Balling, since Balling Lvl and Density carry most of its information.
## col1 col2 correlation
## 1 Hyd Pressure2 Hyd Pressure3 0.9249726
## 3 Filler Level Bowl Setpoint 0.9349807
## 4 Density Balling 0.9551594
## 5 Density Alch Rel 0.9007296
## 6 Density Balling Lvl 0.9473110
## 8 Balling Alch Rel 0.9233016
## 9 Balling Balling Lvl 0.9776232
## 13 Alch Rel Balling Lvl 0.9213186
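A table of highly correlated pairs like the one above can be produced with base R alone. The sketch below uses simulated data; the `high_cor_pairs` helper and the 0.9 threshold are assumptions for illustration and may differ from the function that generated the table.

```r
# Hypothetical toy data: Balling is constructed to track Density closely
set.seed(2)
n <- 300
density <- rnorm(n)
toy <- data.frame(
  Density     = density,
  Balling     = density + rnorm(n, sd = 0.2),
  Temperature = rnorm(n)
)

# List every pair of columns whose absolute correlation meets the threshold
high_cor_pairs <- function(data, threshold = 0.9) {
  cm  <- cor(data, use = "pairwise.complete.obs")
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    col1        = rownames(cm)[idx[, "row"]],
    col2        = colnames(cm)[idx[, "col"]],
    correlation = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```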
PH is close enough to normally distributed that we will not transform it at this point. Some of the predictors have skewed distributions, and we may need to correct for that depending on our choice of algorithm.
A number of factors contain a suspiciously large number of zeroes (PSC, Hyd Pressure1, etc.). Sometimes zeroes are a stand-in for NA. In a real-world case we would use domain knowledge to assess the validity of these zeroes; since we do not have that knowledge, we will assume they are legitimate values. There are some high and low values, but they come in groups, so we will not remove outliers. Instead, we will run a regression to check for influential points.
A fair number of factors are correlated with PH, which bodes well for a predictive model.
We run an ordinary least squares regression to gain insight into our data. We use the stepwise AIC algorithm, which eliminates predictors based on improvements to AIC.
A large number of factors remain in the model. R-squared is about 43% (adjusted 42%); we hope to improve on this with more powerful predictive algorithms. The Breusch-Pagan test indicates heteroskedasticity (BP = 175.11, p < 2.2e-16), so we will perform a Box-Cox transformation on PH. The QQ plot suggests a roughly normal distribution, although the Shapiro-Wilk test on the residuals rejects exact normality.
There are 3 influential records. We will remove them.
##
## Call:
## lm(formula = PH ~ Carb.Volume + Fill.Ounces + PC.Volume + PSC.Fill +
## PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure2 +
## Hyd.Pressure3 + Filler.Level + Filler.Speed + Temperature +
## Usage.cont + Carb.Flow + Density + Pressure.Vacuum + Oxygen.Filler +
## Bowl.Setpoint + Pressure.Setpoint + Alch.Rel + Carb.Rel +
## Balling.Lvl + Brand.Code_A + Brand.Code_C + Brand.Code_NA,
## data = train_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49520 -0.07542 0.00888 0.08463 0.77874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.059e+01 9.273e-01 11.418 < 2e-16 ***
## Carb.Volume -1.133e-01 5.210e-02 -2.175 0.029746 *
## Fill.Ounces -6.574e-02 3.577e-02 -1.838 0.066222 .
## PC.Volume -1.441e-01 5.611e-02 -2.568 0.010307 *
## PSC.Fill -7.439e-02 2.560e-02 -2.905 0.003708 **
## PSC.CO2 -1.186e-01 7.217e-02 -1.643 0.100588
## Mnf.Flow -7.223e-04 5.006e-05 -14.426 < 2e-16 ***
## Carb.Pressure1 6.084e-03 7.670e-04 7.932 3.52e-15 ***
## Fill.Pressure 2.251e-03 1.338e-03 1.682 0.092716 .
## Hyd.Pressure2 -7.374e-04 5.118e-04 -1.441 0.149824
## Hyd.Pressure3 2.814e-03 6.208e-04 4.534 6.14e-06 ***
## Filler.Level -1.018e-03 5.738e-04 -1.774 0.076177 .
## Filler.Speed -8.790e-06 4.785e-06 -1.837 0.066349 .
## Temperature -1.536e-02 2.460e-03 -6.243 5.21e-10 ***
## Usage.cont -7.048e-03 1.265e-03 -5.572 2.86e-08 ***
## Carb.Flow 1.333e-05 3.809e-06 3.499 0.000478 ***
## Density -1.306e-01 2.872e-02 -4.548 5.74e-06 ***
## Pressure.Vacuum -1.426e-02 6.939e-03 -2.054 0.040068 *
## Oxygen.Filler -3.011e-01 7.780e-02 -3.871 0.000112 ***
## Bowl.Setpoint 3.200e-03 5.895e-04 5.428 6.39e-08 ***
## Pressure.Setpoint -8.865e-03 2.178e-03 -4.070 4.89e-05 ***
## Alch.Rel 4.325e-02 2.100e-02 2.060 0.039571 *
## Carb.Rel 7.725e-02 4.953e-02 1.560 0.119003
## Balling.Lvl 3.469e-02 1.669e-02 2.079 0.037779 *
## Brand.Code_A -6.332e-02 1.409e-02 -4.494 7.38e-06 ***
## Brand.Code_C -1.438e-01 9.894e-03 -14.534 < 2e-16 ***
## Brand.Code_NA -8.037e-02 1.480e-02 -5.431 6.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1322 on 2031 degrees of freedom
## Multiple R-squared: 0.4269, Adjusted R-squared: 0.4196
## F-statistic: 58.2 on 26 and 2031 DF, p-value: < 2.2e-16
##
## [1] "VIF Analysis"
## Carb.Volume Fill.Ounces PC.Volume PSC.Fill
## 3.610124 1.130350 1.349181 1.070175
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1.067560 4.227891 1.541221 2.078689
## Hyd.Pressure2 Hyd.Pressure3 Filler.Level Filler.Speed
## 8.258166 11.477926 9.428191 1.593705
## Temperature Usage.cont Carb.Flow Density
## 1.368047 1.638725 1.970053 13.765302
## Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1.829806 1.457571 9.657993 2.310694
## Alch.Rel Carb.Rel Balling.Lvl Brand.Code_A
## 13.271702 4.776149 24.839875 2.311940
## Brand.Code_C Brand.Code_NA
## 1.217488 1.158246
##
## studentized Breusch-Pagan test
##
## data: step3
## BP = 175.11, df = 26, p-value < 2.2e-16
##
##
## Shapiro-Wilk normality test
##
## data: step3$residuals
## W = 0.98311, p-value = 6.867e-15
##
## [1] "AIC: -2459.46659486999"
## [1] "RMSE on evaluation set: 0.135864489235962"
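The stepwise selection and the check for influential records can be sketched with base R. This is an illustration on simulated data under stated assumptions: it uses `step()` from the stats package as a stand-in (the run above may have used a different implementation such as `MASS::stepAIC`), and the 4/n cutoff on Cook's distance is a common rule of thumb that we assume here, not necessarily the criterion behind the 3 records mentioned above.

```r
# Hypothetical toy data: two real predictors and two pure-noise columns
set.seed(3)
n <- 200
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),
  noise1 = rnorm(n), noise2 = rnorm(n)
)
toy$y <- 8.5 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.2)

# Start from the full model and drop terms while AIC keeps improving
full    <- lm(y ~ ., data = toy)
reduced <- step(full, direction = "backward", trace = 0)

# Flag influential records with Cook's distance (assumed cutoff: 4 / n)
cd          <- cooks.distance(reduced)
influential <- which(cd > 4 / n)

# Logical indexing stays correct even when nothing is flagged
toy_clean <- toy[!(seq_len(n) %in% influential), ]

print(formula(reduced))
print(length(influential))
```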
We now evaluate 11 machine learning algorithms on our data in order to determine the best predictive model.
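Each model below is summarized by the same three test metrics: RMSE, Rsquared and MAE. As a rough sketch of how such metrics can be computed on a held-out set, the helper below assumes that Rsquared is the squared correlation between observed and predicted values. The `test_metrics` function, the 80/20 split and the toy data are assumptions for illustration.

```r
# Compute RMSE, R-squared (squared correlation) and MAE
test_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE      = mean(abs(obs - pred))
  )
}

# Hypothetical toy data with an 80/20 train/test split
set.seed(4)
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)
train_idx <- sample(n, size = floor(0.8 * n))

fit  <- lm(y ~ x, data = toy[train_idx, ])
pred <- predict(fit, newdata = toy[-train_idx, ])
m    <- test_metrics(toy$y[-train_idx], pred)
print(round(m, 3))

# Small hand-checkable case: errors are -0.5, 0, 0.5, 0
small <- test_metrics(c(1, 2, 3, 4), c(1.5, 2, 2.5, 4))
```

Using one helper for every model keeps the comparison consistent, since all algorithms are scored on the same held-out records with the same definitions.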
1) Lasso
##
## Call: glmnet(x = x, y = yTrain, alpha = 1, lambda = lambda1, preProc = c("center", "scale"))
##
## Df %Dev Lambda
## 1 31 41.37 0.003883
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1335359 0.4324615 0.8980178
2) Ridge
##
## Call: glmnet(x = x, y = yTrain, alpha = 0, lambda = lambda1, preProc = c("center", "scale"))
##
## Df %Dev Lambda
## 1 34 40.82 0.06478
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1385401 0.4301397 0.9068312
3) PLS
## Partial Least Squares
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1850, 1850, 1848, 1851, 1850, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.263061 0.2448886 1.0074657
## 2 1.187339 0.3334997 0.9378354
## 3 1.168906 0.3538846 0.9273594
## 4 1.156480 0.3678340 0.9134054
## 5 1.142611 0.3823360 0.8993623
## 6 1.137697 0.3878219 0.8918344
## 7 1.134364 0.3916038 0.8891298
## 8 1.133212 0.3929897 0.8857015
## 9 1.131913 0.3947154 0.8849038
## 10 1.130753 0.3961795 0.8830160
## 11 1.131235 0.3959340 0.8833858
## 12 1.130503 0.3968130 0.8830807
## 13 1.130620 0.3966851 0.8842539
## 14 1.130716 0.3967071 0.8833765
## 15 1.130563 0.3968566 0.8831855
## 16 1.130471 0.3969327 0.8838876
## 17 1.130257 0.3971837 0.8838742
## 18 1.130270 0.3972030 0.8836797
## 19 1.130315 0.3971755 0.8835155
## 20 1.130318 0.3971639 0.8835026
## 21 1.130346 0.3971251 0.8834728
## 22 1.130306 0.3971646 0.8834484
## 23 1.130272 0.3972038 0.8834363
## 24 1.130247 0.3972255 0.8834253
## 25 1.130246 0.3972241 0.8834229
## 26 1.130226 0.3972462 0.8834154
## 27 1.130234 0.3972385 0.8834186
## 28 1.130229 0.3972431 0.8834130
## 29 1.130234 0.3972384 0.8834207
## 30 1.130229 0.3972431 0.8834173
## 31 1.130230 0.3972427 0.8834175
## 32 1.130231 0.3972411 0.8834181
## 33 1.130231 0.3972413 0.8834182
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 26.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1324323 0.4325440 0.8935133
4) KNN
## k-Nearest Neighbors
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.122539 0.4359751 0.8261009
## 7 1.103944 0.4444383 0.8206457
## 9 1.096882 0.4469509 0.8205372
## 11 1.091612 0.4497004 0.8196408
## 13 1.094392 0.4461560 0.8246401
## 15 1.099234 0.4410076 0.8312303
## 17 1.101534 0.4382708 0.8348568
## 19 1.105219 0.4341171 0.8387305
## 21 1.107823 0.4315698 0.8418797
## 23 1.110153 0.4292667 0.8453432
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.0588980 0.5041245 0.8189062
5) Neural Net
Decay = 0.04, size = 3.
## a 34-3-1 network with 109 weights
## options were - linear output units decay=0.04
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.0539986 0.5063971 0.8131051
6) MARS
## Selected 18 of 22 terms, and 11 of 34 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: Mnf.Flow, Brand.Code_C, Alch.Rel, Pressure.Vacuum, Usage.cont, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 1.236777 RSS 2455.74 GRSq 0.4136019 RSq 0.4328546
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1131853 0.4503775 0.8774195
7) and 8) SVM (radial and linear)
## Support Vector Machines with Radial Basis Function Kernel
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1851, 1850, 1851, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.069443 0.4650716 0.7998372
## 0.50 1.043366 0.4879701 0.7730102
## 1.00 1.017891 0.5111853 0.7492804
##
## Tuning parameter 'sigma' was held constant at a value of 0.01999207
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01999207 and C = 1.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.9953225 0.5661287 0.7511437
## [1] ""
## [1] "___________________________________________"
## [1] ""
## Support Vector Machines with Linear Kernel
##
## 2055 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1848, 1849, 1849, 1850, 1850, 1850, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.149602 0.3815252 0.8805875
##
## Tuning parameter 'C' was held constant at a value of 1
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1420360 0.4220422 0.8760670
9) Random Forest
##
## Call:
## randomForest(formula = PH ~ ., data = dfTrain, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 0.6698616
## % Var explained: 68.21
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.8391764 0.7183977 0.6373290
10) GBM
Depth = 4, ntrees = 10000, shrinkage = 0.007
## gbm(formula = PH ~ ., distribution = "gaussian", data = dfTrain,
## n.trees = 10000, interaction.depth = 4, shrinkage = 0.007,
## cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 9981.
## There were 34 predictors of which 34 had non-zero influence.
## [1] "Number of trees: 10000"
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.8988662 0.6420692 0.6861177
11) Cubist
##
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
##
## Number of samples: 2055
## Number of predictors: 34
##
## Number of committees: 1
## Number of rules: 34
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 1.1306746 0.4803376 0.7775883
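The table below shows test-set results for the 11 algorithms plus OLS, ranked by R-squared. Random Forest is clearly the best performer. Note that these RMSE values appear to be on the scale of the transformed PH used for modeling, so they are not directly comparable to the 0.136 RMSE reported for the first regression.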
## Rank Model          RMSE RSquared
## A    Random Forest  0.84 0.72
## B    GBM            0.90 0.64
## C    SVM_Radial     0.99 0.57
## D    NeuralNet      1.05 0.51
## E    KNN            1.06 0.50
## F    Cubist         1.13 0.48
## G    MARS           1.11 0.45
## H    Lasso          1.13 0.43
## I    PLS            1.13 0.43
## J    Ridge          1.14 0.43
## K    OLS Regression 1.15 0.43
## L    SVM_Linear     1.15 0.38
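To understand what drives the predictions, we first re-run our regression with scaled factors, so that the coefficients are comparable with each other and give us a baseline.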
##
## Call:
## lm(formula = PH ~ Carb.Volume + Fill.Ounces + PC.Volume + PSC.Fill +
## PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 +
## Filler.Level + Filler.Speed + Temperature + Usage.cont +
## Carb.Flow + Density + Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint +
## Pressure.Setpoint + Alch.Rel + Carb.Rel + Balling.Lvl + Brand.Code_A +
## Brand.Code_C + Brand.Code_NA, data = train_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1256 -0.6429 0.0822 0.7450 3.3638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.02190 0.02462 1463.039 < 2e-16 ***
## Carb.Volume -0.07202 0.04722 -1.525 0.12739
## Fill.Ounces -0.04707 0.02640 -1.783 0.07474 .
## PC.Volume -0.07440 0.02889 -2.575 0.01010 *
## PSC.Fill -0.07086 0.02546 -2.784 0.00543 **
## PSC.CO2 -0.04142 0.02547 -1.626 0.10414
## Mnf.Flow -0.73720 0.05046 -14.608 < 2e-16 ***
## Carb.Pressure1 0.24648 0.03030 8.134 7.18e-16 ***
## Fill.Pressure 0.05464 0.03573 1.529 0.12636
## Hyd.Pressure3 0.29502 0.04541 6.497 1.03e-10 ***
## Filler.Level -0.10592 0.07481 -1.416 0.15698
## Filler.Speed -0.06233 0.03102 -2.009 0.04466 *
## Temperature -0.18261 0.02874 -6.353 2.60e-10 ***
## Usage.cont -0.19541 0.03167 -6.169 8.25e-10 ***
## Carb.Flow 0.14851 0.03442 4.314 1.68e-05 ***
## Density -0.50805 0.08856 -5.737 1.11e-08 ***
## Pressure.Vacuum -0.06751 0.03352 -2.014 0.04413 *
## Oxygen.Filler -0.14543 0.03088 -4.709 2.65e-06 ***
## Bowl.Setpoint 0.37860 0.07555 5.011 5.88e-07 ***
## Pressure.Setpoint -0.14347 0.03702 -3.876 0.00011 ***
## Alch.Rel 0.14028 0.09708 1.445 0.14859
## Carb.Rel 0.08534 0.05548 1.538 0.12412
## Balling.Lvl 0.37985 0.12676 2.996 0.00276 **
## Brand.Code_A -0.19192 0.03882 -4.943 8.31e-07 ***
## Brand.Code_C -0.38780 0.02687 -14.434 < 2e-16 ***
## Brand.Code_NA -0.14932 0.02695 -5.541 3.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.115 on 2029 degrees of freedom
## Multiple R-squared: 0.4329, Adjusted R-squared: 0.426
## F-statistic: 61.96 on 25 and 2029 DF, p-value: < 2.2e-16
## [1] "AIC: 6308.88440476437"
## [1] "RMSE on evaluation set: 1.14741016219081"
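The reason for scaling before comparing coefficients can be illustrated with a small simulated example. The data and variable names below are hypothetical. The point is only that a raw coefficient depends on the units of its predictor, while a coefficient on a standardized predictor equals the raw coefficient multiplied by that predictor's standard deviation.

```r
# Hypothetical toy data: 'flow' has a wide range, 'temp' a narrow one
set.seed(5)
n <- 300
toy <- data.frame(
  flow = rnorm(n, mean = 25, sd = 120),
  temp = rnorm(n, mean = 66, sd = 1.4)
)
toy$y <- 8.5 - 0.001 * toy$flow - 0.02 * toy$temp + rnorm(n, sd = 0.1)

raw_fit <- lm(y ~ flow + temp, data = toy)

# Center and scale each predictor to mean 0 and standard deviation 1
preds  <- c("flow", "temp")
scaled <- toy
scaled[preds] <- lapply(toy[preds], function(v) as.numeric(scale(v)))
scaled_fit <- lm(y ~ flow + temp, data = scaled)

# Raw coefficients suggest temp matters more; scaled ones reverse that
print(coef(raw_fit)[preds])
print(coef(scaled_fit)[preds])
```

On the raw scale the coefficient for `temp` is larger in magnitude, but after scaling `flow` has the larger effect per standard deviation, which is the comparison we want when ranking factors.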
We compare the top ten factors in our regression with those of Random Forest, SVM_Radial and GBM and look for patterns.
## [1] "Top 10 regression coefficients"
## Overall
## (Intercept) 36.0219000
## Mnf.Flow -0.7372041
## Density -0.5080496
## Brand.Code_C -0.3878025
## Balling.Lvl 0.3798479
## Bowl.Setpoint 0.3785963
## Hyd.Pressure3 0.2950237
## Carb.Pressure1 0.2464791
## Usage.cont -0.1954141
## Brand.Code_A -0.1919208
## [1] "Top 10 Random Forest factors"
## Overall
## Mnf.Flow 1.0167775
## Bowl.Setpoint 0.4024298
## Usage.cont 0.3435747
## Brand.Code_C 0.3103242
## Alch.Rel 0.3100911
## Filler.Level 0.2705480
## Carb.Rel 0.2540801
## Oxygen.Filler 0.2280317
## Balling.Lvl 0.2167567
## Pressure.Vacuum 0.1779475
## [1] "Top 10 GBM factors"
## var rel.inf
## Mnf.Flow Mnf.Flow 15.82497366
## Usage.cont Usage.cont 6.97882702
## Brand.Code_C Brand.Code_C 6.01164824
## Oxygen.Filler Oxygen.Filler 5.15379594
## Alch.Rel Alch.Rel 4.93587597
## Temperature Temperature 3.99368339
## Pressure.Vacuum Pressure.Vacuum 3.73609307
## Filler.Speed Filler.Speed 3.72275486
## Density Density 3.48646005
## Carb.Pressure1 Carb.Pressure1 3.41335395
## Carb.Flow Carb.Flow 3.27816148
## Air.Pressurer Air.Pressurer 3.09218593
## Balling.Lvl Balling.Lvl 2.97257559
## Carb.Rel Carb.Rel 2.96981095
## PC.Volume PC.Volume 2.65663599
## MFR MFR 2.44782129
## Bowl.Setpoint Bowl.Setpoint 2.27001115
## Fill.Ounces Fill.Ounces 2.22973957
## Filler.Level Filler.Level 1.92122785
## Fill.Pressure Fill.Pressure 1.82917602
## Hyd.Pressure1 Hyd.Pressure1 1.82580948
## Carb.Volume Carb.Volume 1.76754512
## Hyd.Pressure2 Hyd.Pressure2 1.64196348
## PSC PSC 1.64040504
## PSC.Fill PSC.Fill 1.61106914
## Carb.Temp Carb.Temp 1.53281964
## Hyd.Pressure4 Hyd.Pressure4 1.50187474
## Carb.Pressure Carb.Pressure 1.39626941
## Hyd.Pressure3 Hyd.Pressure3 1.31030040
## PSC.CO2 PSC.CO2 0.98193654
## Brand.Code_A Brand.Code_A 0.85224870
## Brand.Code_NA Brand.Code_NA 0.47856352
## Pressure.Setpoint Pressure.Setpoint 0.45064391
## Brand.Code_D Brand.Code_D 0.08373893
## [1] "Top 10 SVM factors"
## Overall
## Mnf.Flow 0.22046070
## Usage.cont 0.15144673
## Bowl.Setpoint 0.12827585
## Filler.Level 0.11208175
## Pressure.Setpoint 0.09496808
## Carb.Flow 0.08926366
## Brand.Code_C 0.07448118
## Hyd.Pressure3 0.05787995
## Pressure.Vacuum 0.04843824
## Fill.Pressure 0.04696594
When we examine the most important predictors, not only in our Random Forest model but in the other high-performing models as well, we find that Mnf.Flow is the most important predictor, followed by Usage.cont and Bowl.Setpoint. Alch.Rel, Filler.Level and Brand (especially Brand C) are also factors.
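One simple way to look for patterns across the lists above is to count how many of the top-ten lists each factor appears in. The sketch below copies the factor names from the output above (the intercept is dropped from the regression list, and only the first ten GBM rows are used). The counting approach itself is an illustration we add here, not part of the original analysis.

```r
# Top factors per model, taken from the importance output above
top_factors <- list(
  regression = c("Mnf.Flow", "Density", "Brand.Code_C", "Balling.Lvl",
                 "Bowl.Setpoint", "Hyd.Pressure3", "Carb.Pressure1",
                 "Usage.cont", "Brand.Code_A"),
  random_forest = c("Mnf.Flow", "Bowl.Setpoint", "Usage.cont", "Brand.Code_C",
                    "Alch.Rel", "Filler.Level", "Carb.Rel", "Oxygen.Filler",
                    "Balling.Lvl", "Pressure.Vacuum"),
  gbm = c("Mnf.Flow", "Usage.cont", "Brand.Code_C", "Oxygen.Filler",
          "Alch.Rel", "Temperature", "Pressure.Vacuum", "Filler.Speed",
          "Density", "Carb.Pressure1"),
  svm_radial = c("Mnf.Flow", "Usage.cont", "Bowl.Setpoint", "Filler.Level",
                 "Pressure.Setpoint", "Carb.Flow", "Brand.Code_C",
                 "Hyd.Pressure3", "Pressure.Vacuum", "Fill.Pressure")
)

# Count the number of lists in which each factor appears
counts <- sort(table(unlist(top_factors)), decreasing = TRUE)

# Factors that appear in every list
in_all <- names(counts)[as.vector(counts) == length(top_factors)]

print(counts)
print(in_all)
```

This count ignores rank within each list, so it complements the importance scores rather than replacing them.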
Now we can use some tools that might give us insight into how these variables operate.
When we run an individual tree (see below) we can see that, at least for this tree, Mnf.Flow is the first node. While our Random Forest model aggregates many different trees, we can presume that many of them begin with Mnf.Flow.
We can also examine partial dependence (PDP) plots. In this case, we will use our boosted model (our second-best model), because the tree already gives us insight with respect to the Random Forest model.
The PDP plots from the boosted model provide some confirmation of the tree model. Mnf.Flow breaks at -.50 and at -.100, Usage.cont at 23, and Alch.Rel at 7.6.
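For readers unfamiliar with partial dependence, the idea can be sketched by hand: fix one predictor at a grid value for every record, predict, and average the predictions. The example below is a simplified illustration on simulated data with a linear model. The `partial_dependence` helper, the toy data and the break at zero are all assumptions made for the example and do not reproduce the break points reported above.

```r
# Average prediction when 'feature' is forced to each grid value in turn
partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value
    mean(predict(model, newdata = modified))
  })
}

# Hypothetical toy data with a step in the response where flow crosses zero
set.seed(6)
n <- 400
toy <- data.frame(
  flow  = runif(n, min = -100, max = 200),
  usage = runif(n, min = 12, max = 26)
)
toy$y <- 8.6 - 0.15 * (toy$flow > 0) - 0.01 * toy$usage + rnorm(n, sd = 0.05)

# A numeric indicator lets a linear model capture the step
fit <- lm(y ~ I(as.numeric(flow > 0)) + usage, data = toy)

grid <- seq(-100, 200, by = 50)
pd   <- partial_dependence(fit, toy, "flow", grid)
print(data.frame(flow = grid, avg_prediction = round(pd, 3)))
```

The same helper works with any model that supports `predict()` with a `newdata` argument, and a jump in the averaged predictions marks a break of the kind described above.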
PDP plots don’t necessarily capture the effects of interactions. When we re-examine the most important predictors from our Random Forest model, we can see that they all have significant correlations with PH. In addition, while the first three factors are also correlated with each other, the correlations are only moderate (between 0.5 and 0.6 in absolute value). These are reassuring signs that we have a robust model. However, we would want to understand these interactions better before making further statements about variable importance.
## col1 col2 correlation
## 1 Mnf.Flow Bowl.Setpoint -0.5792410
## 2 Mnf.Flow Usage.cont 0.5192248
We tested 11 machine learning algorithms on beverage data in order to predict PH. Our final model is tree-based (Random Forest). The model's R-squared is 72%, which suggests that our predictions will tend to be in the ballpark, but a meaningful share of the variation in PH remains unexplained, so individual predictions can still miss in either direction.