Introduction

ABC Beverage is facing new regulations that require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to the regulators.

Below is the technical report detailing our examination of the data and creation of our PH model.

Data Summary and Preparation

The data consists of 2571 beverage records, including PH (the target) and 32 possible predictors of PH. 31 of the predictors are numeric, and one (Brand) is character. Many of the predictors have missing values.

##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

Missing Values

A number of predictors have missing values; however, the only one with a substantial share missing is MFR (212 of 2571 records, about 8%). The missing values do not appear to be correlated with each other, and missingness in MFR is not correlated with PH (r = 0.013, p = 0.50). We therefore conclude that missingness is a relatively random phenomenon in this dataset and apply a simple median imputation to all missing values.

There are 4 missing target values as well - however, there are so few that we decide to impute them rather than lose them from the database.

For the only character predictor, Brand, we will create a variable Brand_NA when we dummify it.
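
The two steps described above (correlating a missingness flag with the target, then imputing the median) can be sketched as follows. The report's own analysis was run in R; this is an illustrative Python sketch on made-up values, and the helper names `impute_median` and `pearson` are assumptions, not functions from the original code.

```python
from math import sqrt
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only (not the beverage data).
mfr = [725.0, None, 735.2, 730.6, None, 722.8]
ph = [8.36, 8.26, 8.94, 8.24, 8.26, 8.32]

# 1 where MFR is missing, 0 otherwise; correlate the flag with the target.
mfr_flag = [1 if v is None else 0 for v in mfr]
flag_vs_ph = pearson(mfr_flag, ph)

# Fill the gaps with the median of the observed MFR values.
mfr_imputed = impute_median(mfr)
```

A correlation near zero between the flag and PH, as in the test output below, is what supports treating the missingness as roughly random.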

## [1] "Correlation between MFR missingness and PH"
## 
##  Pearson's product-moment correlation
## 
## data:  dfPH_Num$PH and mfr_flag
## t = 0.67288, df = 2565, p-value = 0.5011
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02541582  0.05194583
## sample estimates:
##        cor 
## 0.01328488

Multicollinearity

There is some degree of multicollinearity in the dataset. In particular, Density, Balling, Balling Lvl and Alch Rel are all highly correlated with each other (pairwise correlations above 0.9). While multicollinearity does not necessarily interfere with predictive power, it does impede inference. Since we need not only to predict PH but also to understand our model, we will remove Balling, since Balling Lvl and Density carry most of its information.

##             col1          col2 correlation
## 1  Hyd Pressure2 Hyd Pressure3   0.9249726
## 3   Filler Level Bowl Setpoint   0.9349807
## 4        Density       Balling   0.9551594
## 5        Density      Alch Rel   0.9007296
## 6        Density   Balling Lvl   0.9473110
## 8        Balling      Alch Rel   0.9233016
## 9        Balling   Balling Lvl   0.9776232
## 13      Alch Rel   Balling Lvl   0.9213186
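
A table like the one above can be produced by scanning every pair of numeric columns and keeping those whose absolute correlation exceeds a cutoff. The report's analysis was run in R; the following is a minimal Python sketch in which the 0.9 threshold, the function name `high_correlation_pairs` and the toy columns are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlation_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for each column pair with |r| >= threshold."""
    pairs = []
    for (a, xa), (b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Toy columns for illustration only.
columns = {
    "Density": [0.88, 0.92, 1.58, 1.54, 0.84],
    "Balling": [1.40, 1.50, 3.14, 3.04, 1.38],
    "Temperature": [66.0, 67.6, 67.0, 65.6, 65.8],
}

flagged = high_correlation_pairs(columns)
```

In this toy input only the Density/Balling pair is flagged; Temperature is nearly uncorrelated with both.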

Outliers, Distributions and Correlations

PH is close enough to normally distributed that we will not transform it at this point. Some of the predictors have skewed distributions, and we may need to correct for that depending on our algorithm choice.

A number of factors have a surprisingly large number of zeroes (PSC, Hyd Pressure1, etc.). Sometimes zeroes are a stand-in for NA. In a real-world case we would use domain knowledge to assess the validity of these zeroes; since we do not have that knowledge here, we will assume they are legitimate values. While there are high and low values, they come in groups, so we will not remove outliers, but we will run a regression to check for influential points.

A fair number of factors are correlated with PH - this bodes well for a predictive model.

Exploratory Regression Model

We run an ordinary least squares regression to gain insight into our data. We use stepwise selection by AIC, which eliminates predictors when their removal improves AIC.

A large number of factors remain in the model. R-squared is about 43% (adjusted 42%); we hope to improve this with more powerful predictive algorithms. The Breusch-Pagan test indicates heteroskedasticity, so we will apply a Box-Cox transformation to PH. The QQ plot suggests a relatively normally distributed target.

There are 3 influential records. We will remove them.
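
For reference, the Box-Cox transformation mentioned above maps a positive value y to (y^lambda - 1) / lambda, or to log(y) when lambda is 0. The report does not state which lambda was chosen, so the value below is a placeholder, and the function names are assumptions. The analysis itself was run in R; this is only a Python sketch of the formula and its inverse.

```python
from math import exp, log


def boxcox(y, lam):
    """Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return log(y)
    return (y ** lam - 1.0) / lam


def inv_boxcox(z, lam):
    """Inverse of boxcox: recover y from the transformed value z."""
    if lam == 0:
        return exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


lam = 2.0  # placeholder only; the report does not give the fitted lambda
ph_values = [8.36, 8.26, 8.94]  # first PH values shown in the data summary

transformed = [boxcox(y, lam) for y in ph_values]
recovered = [inv_boxcox(z, lam) for z in transformed]
```

Because the transform changes the scale of the target, error metrics computed after it are not comparable with metrics on untransformed PH unless predictions are mapped back with the inverse.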

## 
## Call:
## lm(formula = PH ~ Carb.Volume + Fill.Ounces + PC.Volume + PSC.Fill + 
##     PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure2 + 
##     Hyd.Pressure3 + Filler.Level + Filler.Speed + Temperature + 
##     Usage.cont + Carb.Flow + Density + Pressure.Vacuum + Oxygen.Filler + 
##     Bowl.Setpoint + Pressure.Setpoint + Alch.Rel + Carb.Rel + 
##     Balling.Lvl + Brand.Code_A + Brand.Code_C + Brand.Code_NA, 
##     data = train_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49520 -0.07542  0.00888  0.08463  0.77874 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.059e+01  9.273e-01  11.418  < 2e-16 ***
## Carb.Volume       -1.133e-01  5.210e-02  -2.175 0.029746 *  
## Fill.Ounces       -6.574e-02  3.577e-02  -1.838 0.066222 .  
## PC.Volume         -1.441e-01  5.611e-02  -2.568 0.010307 *  
## PSC.Fill          -7.439e-02  2.560e-02  -2.905 0.003708 ** 
## PSC.CO2           -1.186e-01  7.217e-02  -1.643 0.100588    
## Mnf.Flow          -7.223e-04  5.006e-05 -14.426  < 2e-16 ***
## Carb.Pressure1     6.084e-03  7.670e-04   7.932 3.52e-15 ***
## Fill.Pressure      2.251e-03  1.338e-03   1.682 0.092716 .  
## Hyd.Pressure2     -7.374e-04  5.118e-04  -1.441 0.149824    
## Hyd.Pressure3      2.814e-03  6.208e-04   4.534 6.14e-06 ***
## Filler.Level      -1.018e-03  5.738e-04  -1.774 0.076177 .  
## Filler.Speed      -8.790e-06  4.785e-06  -1.837 0.066349 .  
## Temperature       -1.536e-02  2.460e-03  -6.243 5.21e-10 ***
## Usage.cont        -7.048e-03  1.265e-03  -5.572 2.86e-08 ***
## Carb.Flow          1.333e-05  3.809e-06   3.499 0.000478 ***
## Density           -1.306e-01  2.872e-02  -4.548 5.74e-06 ***
## Pressure.Vacuum   -1.426e-02  6.939e-03  -2.054 0.040068 *  
## Oxygen.Filler     -3.011e-01  7.780e-02  -3.871 0.000112 ***
## Bowl.Setpoint      3.200e-03  5.895e-04   5.428 6.39e-08 ***
## Pressure.Setpoint -8.865e-03  2.178e-03  -4.070 4.89e-05 ***
## Alch.Rel           4.325e-02  2.100e-02   2.060 0.039571 *  
## Carb.Rel           7.725e-02  4.953e-02   1.560 0.119003    
## Balling.Lvl        3.469e-02  1.669e-02   2.079 0.037779 *  
## Brand.Code_A      -6.332e-02  1.409e-02  -4.494 7.38e-06 ***
## Brand.Code_C      -1.438e-01  9.894e-03 -14.534  < 2e-16 ***
## Brand.Code_NA     -8.037e-02  1.480e-02  -5.431 6.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1322 on 2031 degrees of freedom
## Multiple R-squared:  0.4269, Adjusted R-squared:  0.4196 
## F-statistic:  58.2 on 26 and 2031 DF,  p-value: < 2.2e-16
## 
## [1] "VIF Analysis"
##       Carb.Volume       Fill.Ounces         PC.Volume          PSC.Fill 
##          3.610124          1.130350          1.349181          1.070175 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##          1.067560          4.227891          1.541221          2.078689 
##     Hyd.Pressure2     Hyd.Pressure3      Filler.Level      Filler.Speed 
##          8.258166         11.477926          9.428191          1.593705 
##       Temperature        Usage.cont         Carb.Flow           Density 
##          1.368047          1.638725          1.970053         13.765302 
##   Pressure.Vacuum     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint 
##          1.829806          1.457571          9.657993          2.310694 
##          Alch.Rel          Carb.Rel       Balling.Lvl      Brand.Code_A 
##         13.271702          4.776149         24.839875          2.311940 
##      Brand.Code_C     Brand.Code_NA 
##          1.217488          1.158246

## 
##  studentized Breusch-Pagan test
## 
## data:  step3
## BP = 175.11, df = 26, p-value < 2.2e-16
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  step3$residuals
## W = 0.98311, p-value = 6.867e-15
## 
## [1] "AIC:  -2459.46659486999"
## [1] "RMSE on evaluation set:  0.135864489235962"

Machine Learning Models

We now test our data on 11 machine learning algorithms in order to determine the best predictive model.
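
Each model below is followed by a "Test Metrics" block reporting RMSE, Rsquared and MAE. The layout matches what caret's `postResample` prints, in which Rsquared is the squared correlation between observed and predicted values; the report does not say which function was used, so treat that as an assumption. A minimal Python sketch of the three metrics (the analysis itself was run in R), on made-up numbers:

```python
from math import sqrt


def rmse(obs, pred):
    """Root mean squared error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    return sop ** 2 / (soo * spp)


# Made-up observations and predictions for illustration only.
observed = [8.36, 8.26, 8.94, 8.24]
predicted = [8.40, 8.30, 8.80, 8.30]

metrics = {
    "RMSE": rmse(observed, predicted),
    "Rsquared": r_squared(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

Note that a squared-correlation R-squared rewards predictions that track the target even if they are biased, which is one reason to read it alongside RMSE and MAE.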

1.) Lasso

## 
## Call:  glmnet(x = x, y = yTrain, alpha = 1, lambda = lambda1, preProc = c("center",      "scale")) 
## 
##   Df  %Dev   Lambda
## 1 31 41.37 0.003883
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1335359 0.4324615 0.8980178

2.) Ridge

## 
## Call:  glmnet(x = x, y = yTrain, alpha = 0, lambda = lambda1, preProc = c("center",      "scale")) 
## 
##   Df  %Dev  Lambda
## 1 34 40.82 0.06478
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1385401 0.4301397 0.9068312

3.) PLS

## Partial Least Squares 
## 
## 2055 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1850, 1850, 1848, 1851, 1850, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.263061  0.2448886  1.0074657
##    2     1.187339  0.3334997  0.9378354
##    3     1.168906  0.3538846  0.9273594
##    4     1.156480  0.3678340  0.9134054
##    5     1.142611  0.3823360  0.8993623
##    6     1.137697  0.3878219  0.8918344
##    7     1.134364  0.3916038  0.8891298
##    8     1.133212  0.3929897  0.8857015
##    9     1.131913  0.3947154  0.8849038
##   10     1.130753  0.3961795  0.8830160
##   11     1.131235  0.3959340  0.8833858
##   12     1.130503  0.3968130  0.8830807
##   13     1.130620  0.3966851  0.8842539
##   14     1.130716  0.3967071  0.8833765
##   15     1.130563  0.3968566  0.8831855
##   16     1.130471  0.3969327  0.8838876
##   17     1.130257  0.3971837  0.8838742
##   18     1.130270  0.3972030  0.8836797
##   19     1.130315  0.3971755  0.8835155
##   20     1.130318  0.3971639  0.8835026
##   21     1.130346  0.3971251  0.8834728
##   22     1.130306  0.3971646  0.8834484
##   23     1.130272  0.3972038  0.8834363
##   24     1.130247  0.3972255  0.8834253
##   25     1.130246  0.3972241  0.8834229
##   26     1.130226  0.3972462  0.8834154
##   27     1.130234  0.3972385  0.8834186
##   28     1.130229  0.3972431  0.8834130
##   29     1.130234  0.3972384  0.8834207
##   30     1.130229  0.3972431  0.8834173
##   31     1.130230  0.3972427  0.8834175
##   32     1.130231  0.3972411  0.8834181
##   33     1.130231  0.3972413  0.8834182
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 26.
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1324323 0.4325440 0.8935133

4.) KNN

## k-Nearest Neighbors 
## 
## 2055 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    5  1.122539  0.4359751  0.8261009
##    7  1.103944  0.4444383  0.8206457
##    9  1.096882  0.4469509  0.8205372
##   11  1.091612  0.4497004  0.8196408
##   13  1.094392  0.4461560  0.8246401
##   15  1.099234  0.4410076  0.8312303
##   17  1.101534  0.4382708  0.8348568
##   19  1.105219  0.4341171  0.8387305
##   21  1.107823  0.4315698  0.8418797
##   23  1.110153  0.4292667  0.8453432
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.0588980 0.5041245 0.8189062

5.) Neural Net

Decay = .04, size = 3.

## a 34-3-1 network with 109 weights
## options were - linear output units  decay=0.04
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.0539986 0.5063971 0.8131051

6.) MARS

## Selected 18 of 22 terms, and 11 of 34 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: Mnf.Flow, Brand.Code_C, Alch.Rel, Pressure.Vacuum, Usage.cont, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 1.236777    RSS 2455.74    GRSq 0.4136019    RSq 0.4328546
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1131853 0.4503775 0.8774195

7.) and 8.) SVM (linear and radial)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2055 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1851, 1850, 1851, 1849, 1849, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   MAE      
##   0.25  1.069443  0.4650716  0.7998372
##   0.50  1.043366  0.4879701  0.7730102
##   1.00  1.017891  0.5111853  0.7492804
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01999207
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01999207 and C = 1.
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.9953225 0.5661287 0.7511437
## [1] ""
## [1] "___________________________________________"
## [1] ""
## Support Vector Machines with Linear Kernel 
## 
## 2055 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1848, 1849, 1849, 1850, 1850, 1850, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.149602  0.3815252  0.8805875
## 
## Tuning parameter 'C' was held constant at a value of 1
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1420360 0.4220422 0.8760670

9.) Random Forest

## 
## Call:
##  randomForest(formula = PH ~ ., data = dfTrain, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 0.6698616
##                     % Var explained: 68.21
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.8391764 0.7183977 0.6373290

10.) GBM

Depth = 4, ntrees = 10000, shrinkage = .007

## gbm(formula = PH ~ ., distribution = "gaussian", data = dfTrain, 
##     n.trees = 10000, interaction.depth = 4, shrinkage = 0.007, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 9981.
## There were 34 predictors of which 34 had non-zero influence.
## [1] "Number of trees: 10000"
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.8988662 0.6420692 0.6861177

11.) Cubist

## 
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
## 
## Number of samples: 2055 
## Number of predictors: 34 
## 
## Number of committees: 1 
## Number of rules: 34
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 1.1306746 0.4803376 0.7775883

Choosing the best model

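The table below shows results for the 11 algorithms plus OLS, ordered by R-squared. Random Forest is clearly the best performer on both RMSE and R-squared. RMSE here is on the scale of the transformed PH, so it is not comparable with the RMSE of 0.136 reported for the exploratory regression.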

## Rank Model          RMSE RSquared
## A    Random Forest  0.84 0.72
## B    GBM            0.90 0.64
## C    SVM_Radial     0.99 0.57
## D    NeuralNet      1.05 0.51
## E    KNN            1.06 0.50
## F    Cubist         1.13 0.48
## G    MARS           1.11 0.45
## H    Lasso          1.13 0.43
## I    PLS            1.13 0.43
## J    Ridge          1.14 0.43
## K    OLS Regression 1.15 0.43
## L    SVM-Linear     1.15 0.38
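
The ranking in the table follows R-squared, and ranking by RMSE instead changes the order slightly. The Python sketch below (the report's analysis was run in R) re-sorts the rounded values from the table both ways; the variable names are illustrative.

```python
# (model, RMSE, R-squared) as reported in the table above.
results = [
    ("Random Forest", 0.84, 0.72),
    ("GBM", 0.90, 0.64),
    ("SVM_Radial", 0.99, 0.57),
    ("NeuralNet", 1.05, 0.51),
    ("KNN", 1.06, 0.50),
    ("Cubist", 1.13, 0.48),
    ("MARS", 1.11, 0.45),
    ("Lasso", 1.13, 0.43),
    ("PLS", 1.13, 0.43),
    ("Ridge", 1.14, 0.43),
    ("OLS Regression", 1.15, 0.43),
    ("SVM-Linear", 1.15, 0.38),
]

# Higher R-squared is better; lower RMSE is better.
by_r2 = sorted(results, key=lambda r: r[2], reverse=True)
by_rmse = sorted(results, key=lambda r: r[1])

names_by_r2 = [name for name, _, _ in by_r2]
names_by_rmse = [name for name, _, _ in by_rmse]
```

The top five models are the same under either criterion; the difference is that MARS moves ahead of Cubist when sorting by RMSE.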

Assessing variable importance

First we re-run our regression with scaled factors to get a baseline.

## 
## Call:
## lm(formula = PH ~ Carb.Volume + Fill.Ounces + PC.Volume + PSC.Fill + 
##     PSC.CO2 + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + 
##     Filler.Level + Filler.Speed + Temperature + Usage.cont + 
##     Carb.Flow + Density + Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint + 
##     Pressure.Setpoint + Alch.Rel + Carb.Rel + Balling.Lvl + Brand.Code_A + 
##     Brand.Code_C + Brand.Code_NA, data = train_reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1256 -0.6429  0.0822  0.7450  3.3638 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       36.02190    0.02462 1463.039  < 2e-16 ***
## Carb.Volume       -0.07202    0.04722   -1.525  0.12739    
## Fill.Ounces       -0.04707    0.02640   -1.783  0.07474 .  
## PC.Volume         -0.07440    0.02889   -2.575  0.01010 *  
## PSC.Fill          -0.07086    0.02546   -2.784  0.00543 ** 
## PSC.CO2           -0.04142    0.02547   -1.626  0.10414    
## Mnf.Flow          -0.73720    0.05046  -14.608  < 2e-16 ***
## Carb.Pressure1     0.24648    0.03030    8.134 7.18e-16 ***
## Fill.Pressure      0.05464    0.03573    1.529  0.12636    
## Hyd.Pressure3      0.29502    0.04541    6.497 1.03e-10 ***
## Filler.Level      -0.10592    0.07481   -1.416  0.15698    
## Filler.Speed      -0.06233    0.03102   -2.009  0.04466 *  
## Temperature       -0.18261    0.02874   -6.353 2.60e-10 ***
## Usage.cont        -0.19541    0.03167   -6.169 8.25e-10 ***
## Carb.Flow          0.14851    0.03442    4.314 1.68e-05 ***
## Density           -0.50805    0.08856   -5.737 1.11e-08 ***
## Pressure.Vacuum   -0.06751    0.03352   -2.014  0.04413 *  
## Oxygen.Filler     -0.14543    0.03088   -4.709 2.65e-06 ***
## Bowl.Setpoint      0.37860    0.07555    5.011 5.88e-07 ***
## Pressure.Setpoint -0.14347    0.03702   -3.876  0.00011 ***
## Alch.Rel           0.14028    0.09708    1.445  0.14859    
## Carb.Rel           0.08534    0.05548    1.538  0.12412    
## Balling.Lvl        0.37985    0.12676    2.996  0.00276 ** 
## Brand.Code_A      -0.19192    0.03882   -4.943 8.31e-07 ***
## Brand.Code_C      -0.38780    0.02687  -14.434  < 2e-16 ***
## Brand.Code_NA     -0.14932    0.02695   -5.541 3.39e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.115 on 2029 degrees of freedom
## Multiple R-squared:  0.4329, Adjusted R-squared:  0.426 
## F-statistic: 61.96 on 25 and 2029 DF,  p-value: < 2.2e-16

## [1] "AIC:  6308.88440476437"
## [1] "RMSE on evaluation set:  1.14741016219081"

We compare the top ten factors in our regression with those of Random Forest, SVM_Radial and GBM and look for patterns.

## [1] "Top 10 regression coefficients"
##                   Overall
## (Intercept)    36.0219000
## Mnf.Flow       -0.7372041
## Density        -0.5080496
## Brand.Code_C   -0.3878025
## Balling.Lvl     0.3798479
## Bowl.Setpoint   0.3785963
## Hyd.Pressure3   0.2950237
## Carb.Pressure1  0.2464791
## Usage.cont     -0.1954141
## Brand.Code_A   -0.1919208
## [1] "Top 10 Random Forest factors"
##                   Overall
## Mnf.Flow        1.0167775
## Bowl.Setpoint   0.4024298
## Usage.cont      0.3435747
## Brand.Code_C    0.3103242
## Alch.Rel        0.3100911
## Filler.Level    0.2705480
## Carb.Rel        0.2540801
## Oxygen.Filler   0.2280317
## Balling.Lvl     0.2167567
## Pressure.Vacuum 0.1779475
## [1] "Top 10 GBM factors"

##                                 var     rel.inf
## Mnf.Flow                   Mnf.Flow 15.82497366
## Usage.cont               Usage.cont  6.97882702
## Brand.Code_C           Brand.Code_C  6.01164824
## Oxygen.Filler         Oxygen.Filler  5.15379594
## Alch.Rel                   Alch.Rel  4.93587597
## Temperature             Temperature  3.99368339
## Pressure.Vacuum     Pressure.Vacuum  3.73609307
## Filler.Speed           Filler.Speed  3.72275486
## Density                     Density  3.48646005
## Carb.Pressure1       Carb.Pressure1  3.41335395
## Carb.Flow                 Carb.Flow  3.27816148
## Air.Pressurer         Air.Pressurer  3.09218593
## Balling.Lvl             Balling.Lvl  2.97257559
## Carb.Rel                   Carb.Rel  2.96981095
## PC.Volume                 PC.Volume  2.65663599
## MFR                             MFR  2.44782129
## Bowl.Setpoint         Bowl.Setpoint  2.27001115
## Fill.Ounces             Fill.Ounces  2.22973957
## Filler.Level           Filler.Level  1.92122785
## Fill.Pressure         Fill.Pressure  1.82917602
## Hyd.Pressure1         Hyd.Pressure1  1.82580948
## Carb.Volume             Carb.Volume  1.76754512
## Hyd.Pressure2         Hyd.Pressure2  1.64196348
## PSC                             PSC  1.64040504
## PSC.Fill                   PSC.Fill  1.61106914
## Carb.Temp                 Carb.Temp  1.53281964
## Hyd.Pressure4         Hyd.Pressure4  1.50187474
## Carb.Pressure         Carb.Pressure  1.39626941
## Hyd.Pressure3         Hyd.Pressure3  1.31030040
## PSC.CO2                     PSC.CO2  0.98193654
## Brand.Code_A           Brand.Code_A  0.85224870
## Brand.Code_NA         Brand.Code_NA  0.47856352
## Pressure.Setpoint Pressure.Setpoint  0.45064391
## Brand.Code_D           Brand.Code_D  0.08373893
## [1] "Top 10 SVM factors"
##                      Overall
## Mnf.Flow          0.22046070
## Usage.cont        0.15144673
## Bowl.Setpoint     0.12827585
## Filler.Level      0.11208175
## Pressure.Setpoint 0.09496808
## Carb.Flow         0.08926366
## Brand.Code_C      0.07448118
## Hyd.Pressure3     0.05787995
## Pressure.Vacuum   0.04843824
## Fill.Pressure     0.04696594

When we examine the most important predictors, not only in our Random Forest model but in the other higher-performing models as well, we find that Mnf.Flow is the most important predictor, followed by Usage.cont and Bowl.Setpoint. Alch.Rel, Filler.Level and Brand (especially Brand C) are also factors.
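
One simple way to look for the pattern described above is to count how many of the top-ten lists each variable appears in. The Python sketch below (the report's analysis was run in R) uses the variable names from the four listings; the regression list has nine entries because the intercept occupies one of its ten slots, and the GBM list is the first ten rows of its output.

```python
from collections import Counter

# Top-ranked variables per model, copied from the listings above.
top10 = {
    "OLS": [
        "Mnf.Flow", "Density", "Brand.Code_C", "Balling.Lvl", "Bowl.Setpoint",
        "Hyd.Pressure3", "Carb.Pressure1", "Usage.cont", "Brand.Code_A",
    ],
    "Random Forest": [
        "Mnf.Flow", "Bowl.Setpoint", "Usage.cont", "Brand.Code_C", "Alch.Rel",
        "Filler.Level", "Carb.Rel", "Oxygen.Filler", "Balling.Lvl",
        "Pressure.Vacuum",
    ],
    "GBM": [
        "Mnf.Flow", "Usage.cont", "Brand.Code_C", "Oxygen.Filler", "Alch.Rel",
        "Temperature", "Pressure.Vacuum", "Filler.Speed", "Density",
        "Carb.Pressure1",
    ],
    "SVM_Radial": [
        "Mnf.Flow", "Usage.cont", "Bowl.Setpoint", "Filler.Level",
        "Pressure.Setpoint", "Carb.Flow", "Brand.Code_C", "Hyd.Pressure3",
        "Pressure.Vacuum", "Fill.Pressure",
    ],
}

# Number of lists in which each variable appears.
counts = Counter(name for names in top10.values() for name in names)

# Variables that appear in every list.
in_all = sorted(name for name, c in counts.items() if c == len(top10))
```

Counting appearances ignores rank within each list, so it complements rather than replaces the rank-based reading in the text.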

Now we can use some tools that might give us insight into how these variables operate.

When we run an individual tree (see below) we can see that, at least for this tree, Mnf.Flow is the first node. While our Random Forest model aggregates many different trees, we can presume that many of them begin with Mnf.Flow.

We can also examine partial dependence plots. In this case, we will use our boosted model (our second-best model), because the tree already gives us insight with respect to the Random Forest model.

The partial dependence plots from the boosted model provide some confirmation of the tree model. Mnf.Flow breaks at -.50 and at -.100, Usage.cont at 23, and Alch.Rel at 7.6.
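
A partial dependence curve for one feature is obtained by forcing that feature to each value on a grid for every record, predicting, and averaging the predictions. The Python sketch below illustrates the computation (the report's analysis was run in R); `toy_model` is a hypothetical stand-in for a fitted model, and the rows and grid are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows are untouched
            modified[feature] = value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Hypothetical model: a step in Mnf.Flow plus a linear Usage.cont term."""
    step = 0.2 if row["Mnf.Flow"] < 0 else 0.0
    return 8.4 + step - 0.01 * row["Usage.cont"]


# Made-up records for illustration only.
rows = [
    {"Mnf.Flow": -100.0, "Usage.cont": 16.0},
    {"Mnf.Flow": 140.0, "Usage.cont": 24.0},
    {"Mnf.Flow": 65.0, "Usage.cont": 20.0},
]

curve = partial_dependence(toy_model, rows, "Mnf.Flow", [-100.0, 0.0, 100.0])
```

A break in the plotted curve, such as the step between the first and second grid points here, is what the text refers to when it says a variable "breaks" at a given value.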

Partial dependence plots do not necessarily capture the effects of interactions. When we re-examine the most important predictors from our Random Forest model, we can see that they all have significant correlations with PH. In addition, while the first three factors are also correlated with each other, the correlations are moderate (about 0.5 to 0.6 in absolute value). These are reassuring signs that we have a robust model. However, we would want to understand these interactions better before making further statements about variable importance.

##       col1          col2 correlation
## 1 Mnf.Flow Bowl.Setpoint  -0.5792410
## 2 Mnf.Flow    Usage.cont   0.5192248

Conclusion

We tested 11 machine learning algorithms on beverage data in order to predict PH. Our final model is tree-based (Random Forest). The model's R-squared is 72%, which suggests that our predictions will tend to be in the ballpark, but there is still considerable room for error in either direction.