Libraries
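The library chunk itself is not echoed in this report; the sketch below reconstructs it from the packages referenced later (psych, VIM, mice, caret, corrplot, tictoc), so your session may need a slightly different list.

```r
# Packages used in this report (reconstructed from the functions
# referenced below; install any that are missing).
library(psych)     # describe() summary statistics
library(VIM)       # aggr() missing-value summary
library(mice)      # multivariate imputation by chained equations
library(caret)     # pre-processing, resampling, and model training
library(corrplot)  # correlation plot
library(tictoc)    # timing the model training runs
```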

Team Members:

Project 2: Prediction of PH in Beverages

Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for the technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.

Exploratory Analysis

First we load our data. The data was provided in an Excel document, but for reproducibility we have uploaded it to GitHub so anyone can use the link. The column "Brand.Code" is categorical, so we change its data type to a factor.
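A minimal loading sketch; the GitHub URL is a placeholder for the raw link to the uploaded file, and `df` is the name we use for the training data throughout.

```r
# Load the historical data (placeholder URL -- substitute the raw
# GitHub link to the uploaded file).
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/StudentData.csv"
df  <- read.csv(url)

# Brand.Code is categorical, so convert it to a factor
df$Brand.Code <- as.factor(df$Brand.Code)
head(df)
```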

Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02

In the next sections we perform some exploratory data analysis. We list the column names and explore the completeness of the data. One pattern we look for is whether some rows or columns are more incomplete than others; these incomplete fields or records could be excluded or flagged with a dummy variable later.
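A quick completeness check along these lines (a sketch; output omitted):

```r
# Column names, plus missing values per column and per row, to see
# whether some columns or records are disproportionately incomplete
names(df)
colSums(is.na(df))
table(rowSums(is.na(df)))
```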

Descriptive Summary Statistics of the Predictors:

vars n mean sd median trimmed mad min max range skew kurtosis se
Brand.Code* 1 2571 3.3893427 1.1066245 3.0000000 3.4200292 1.4826000 1.0000000 5.000 4.0000000 0.0410894 -0.6106962 0.0218247
Carb.Volume 2 2561 5.3701978 0.1063852 5.3466667 5.3654840 0.1087240 5.0400000 5.700 0.6600000 0.3922121 -0.4669916 0.0021022
Fill.Ounces 3 2533 23.9747546 0.0875299 23.9733333 23.9751390 0.0790720 23.6333333 24.320 0.6866667 -0.0215452 0.8624714 0.0017392
PC.Volume 4 2532 0.2771187 0.0606953 0.2713333 0.2745818 0.0523852 0.0793333 0.478 0.3986667 0.3396269 0.6699690 0.0012062
Carb.Pressure 5 2544 68.1895755 3.5382039 68.2000000 68.1212574 3.5582400 57.0000000 79.400 22.4000000 0.1822162 -0.0138046 0.0701495
Carb.Temp 6 2545 141.0949234 4.0373861 140.8000000 140.9912617 3.8547600 128.6000000 154.000 25.4000000 0.2468280 0.2375822 0.0800307
PSC 7 2538 0.0845737 0.0492690 0.0760000 0.0802746 0.0474432 0.0020000 0.270 0.2680000 0.8491445 0.6480498 0.0009780
PSC.Fill 8 2548 0.1953689 0.1177817 0.1800000 0.1837059 0.1186080 0.0000000 0.620 0.6200000 0.9334450 0.7691466 0.0023333
PSC.CO2 9 2532 0.0564139 0.0430387 0.0400000 0.0494965 0.0296520 0.0000000 0.240 0.2400000 1.7288937 3.7250025 0.0008553
Mnf.Flow 10 2569 24.5689373 119.4811263 65.2000000 21.0679631 169.0164000 -100.2000000 229.400 329.6000000 0.0041430 -1.8697072 2.3573130
Carb.Pressure1 11 2539 122.5863726 4.7428819 123.2000000 122.5379242 4.4478000 105.6000000 140.200 34.6000000 0.0543587 0.1418265 0.0941263
Fill.Pressure 12 2549 47.9221656 3.1775457 46.4000000 47.7071044 2.3721600 34.6000000 60.400 25.8000000 0.5471107 1.4067532 0.0629371
Hyd.Pressure1 13 2560 12.4375781 12.4332538 11.4000000 10.8374023 16.9016400 -0.8000000 58.000 58.8000000 0.7798043 -0.1426463 0.2457338
Hyd.Pressure2 14 2556 20.9610329 16.3863066 28.6000000 21.0519062 13.3434000 0.0000000 59.400 59.4000000 -0.3019570 -1.5592984 0.3241161
Hyd.Pressure3 15 2556 20.4584507 15.9757236 27.6000000 20.5052786 13.9364400 -1.2000000 50.000 51.2000000 -0.3189061 -1.5745834 0.3159949
Hyd.Pressure4 16 2541 96.2888627 13.1225594 96.0000000 95.4530251 11.8608000 52.0000000 142.000 90.0000000 0.5459786 0.6340041 0.2603252
Filler.Level 17 2551 109.2523716 15.6984241 118.4000000 111.0417442 9.1921200 55.8000000 161.200 105.4000000 -0.8482847 0.0460488 0.3108142
Filler.Speed 18 2514 3687.1988862 770.8200208 3982.0000000 3919.9870775 47.4432000 998.0000000 4030.000 3032.0000000 -2.8700359 6.7059692 15.3734149
Temperature 19 2557 65.9675401 1.3827783 65.6000000 65.7986321 0.8895600 63.6000000 76.200 12.6000000 2.3869732 10.1612904 0.0273456
Usage.cont 20 2566 20.9929618 2.9779364 21.7900000 21.2517819 3.1875900 12.0800000 25.900 13.8200000 -0.5353253 -1.0170230 0.0587878
Carb.Flow 21 2569 2468.3542234 1073.6964743 3028.0000000 2601.1356344 326.1720000 26.0000000 5104.000 5078.0000000 -0.9877287 -0.5826893 21.1835857
Density 22 2570 1.1736498 0.3775269 0.9800000 1.1533463 0.1482600 0.2400000 1.920 1.6800000 0.5260149 -1.1992070 0.0074470
MFR 23 2359 704.0492582 73.8983094 724.0000000 718.1566967 15.4190400 31.4000000 868.600 837.2000000 -5.0917729 30.4558939 1.5214950
Balling 24 2570 2.1977696 0.9310914 1.6480000 2.1287189 0.3706500 -0.1700000 4.012 4.1820000 0.5939224 -1.3855651 0.0183665
Pressure.Vacuum 25 2571 -5.2161027 0.5699933 -5.4000000 -5.2521147 0.5930400 -6.6000000 -3.600 3.0000000 0.5256608 -0.0313126 0.0112414
Oxygen.Filler 26 2559 0.0468426 0.0466436 0.0334000 0.0388837 0.0249077 0.0024000 0.400 0.3976000 2.6603955 11.0882098 0.0009221
Bowl.Setpoint 27 2569 109.3265862 15.3031541 120.0000000 111.3466213 0.0000000 70.0000000 140.000 70.0000000 -0.9743842 -0.0564212 0.3019249
Pressure.Setpoint 28 2559 47.6153966 2.0390474 46.0000000 47.6026354 0.0000000 44.0000000 52.000 8.0000000 0.2031970 -1.6012622 0.0403081
Air.Pressurer 29 2571 142.8339946 1.2119170 142.6000000 142.5812348 0.5930400 140.8000000 148.200 7.4000000 2.2521053 4.7336291 0.0239013
Alch.Rel 30 2562 6.8974161 0.5052753 6.5600000 6.8384390 0.0593040 5.2800000 8.620 3.3400000 0.8836378 -0.8506221 0.0099825
Carb.Rel 31 2561 5.4367825 0.1287183 5.4000000 5.4301318 0.1186080 4.9600000 6.060 1.1000000 0.5032472 -0.2949480 0.0025435
Balling.Lvl 32 2570 2.0500078 0.8703089 1.4800000 1.9827237 0.2075640 0.0000000 3.660 3.6600000 0.5858456 -1.4858636 0.0171675

We used the describe() function from the psych package to review the baseline statistical metrics for the predictors. From the table above, note that predictors like Oxygen.Filler, PSC, PSC.Fill, PSC.CO2, and PC.Volume have near-zero means, and many predictors show notable skewness.

Descriptive Statistical Plots

PH Level distribution

Above we can see that the PH distribution for all brands combined is approximately normal. Below we can see that most observations belong to brand “B”; there are four labeled brands and one unlabeled.

## Dimensions of Training DF:
##  2571 33

Box Plot:

Density Plots

Histogram of variables in the data set

Now we look at the distribution of each column. For modeling purposes, each column would ideally have a normal distribution, so we look for columns that might be candidates for transformations.

We notice many columns are normally distributed, some have bimodal distributions, some are skewed with a long tail of outliers, and some only take values in intervals of 2s or 10s. In our model pre-processing, we may choose different transformations to normalize and standardize the data. One pattern that stands out is that many columns contain a high number of zeros. We assume these zeros carry some significance and that their effect may be linear in nature. For these columns we will create binary dummy variables based on a specified cutoff value, as sketched below.
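A sketch of the indicator construction; the predictors flagged here match the *_bin columns that appear later in the report, but the cutoff values and directions are illustrative placeholders since the report does not list them.

```r
# Add a binary indicator for each zero-inflated predictor. The cutoff
# below is an illustrative placeholder; in practice the threshold and
# direction would be chosen per predictor.
zero_inflated <- c("Mnf.Flow", "Hyd.Pressure1", "Hyd.Pressure2",
                   "Hyd.Pressure3", "Density", "Balling",
                   "Balling.Lvl", "Air.Pressurer")
for (col in zero_inflated) {
  cutoff <- median(df[[col]], na.rm = TRUE)  # placeholder cutoff
  df[[paste0(col, "_bin")]] <- as.integer(df[[col]] > cutoff)
}
dim(df)  # 33 original columns + 8 indicators = 41
```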

## [1] 2571   41

Correlation Plot

Here we observe the correlation between variables. Highly correlated variables offer limited additional insight for our model, and non-linear models typically handle them better than linear models do. We will also keep these correlations in mind when identifying the variables most influential on our outcome.
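A sketch of how the plot can be produced with corrplot, clustering the columns so that blocks of highly correlated predictors stand out:

```r
# Pairwise correlations among the numeric columns; pairwise-complete
# observations are used because the data still contains NAs here
num_cols <- sapply(df, is.numeric)
corr_mat <- cor(df[, num_cols], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", order = "hclust", tl.cex = 0.6)
```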

Missing Value Analysis

Below is an analysis of predictors with NA values.

The data set used in this study consists of 2571 observations, 32 predictors, and one response variable (PH).

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                10                38                39 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                27                26                33                23 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                39                 2                32                22 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                11                15                15                30 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                20                57                14                 5 
##         Carb.Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 4                12                 2 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                12                 0                 9                10 
##       Balling.Lvl 
##                 1

Above we can see the number of missing values (NAs) in each column of the data set. Below, the variables are sorted by missingness; the Count column in this output shows each variable's missing values as a proportion of all observations.

## 
##  Variables sorted by number of missings: 
##           Variable        Count
##                MFR 0.0824581875
##       Filler.Speed 0.0221703617
##          PC.Volume 0.0151691949
##            PSC.CO2 0.0151691949
##        Fill.Ounces 0.0147802412
##                PSC 0.0128354726
##     Carb.Pressure1 0.0124465189
##      Hyd.Pressure4 0.0116686114
##      Carb.Pressure 0.0105017503
##          Carb.Temp 0.0101127966
##           PSC.Fill 0.0089459354
##      Fill.Pressure 0.0085569817
##       Filler.Level 0.0077790743
##      Hyd.Pressure2 0.0058343057
##      Hyd.Pressure3 0.0058343057
##        Temperature 0.0054453520
##      Oxygen.Filler 0.0046674446
##  Pressure.Setpoint 0.0046674446
##      Hyd.Pressure1 0.0042784909
##        Carb.Volume 0.0038895371
##           Carb.Rel 0.0038895371
##           Alch.Rel 0.0035005834
##         Usage.cont 0.0019447686
##                 PH 0.0015558149
##           Mnf.Flow 0.0007779074
##          Carb.Flow 0.0007779074
##      Bowl.Setpoint 0.0007779074
##            Density 0.0003889537
##            Balling 0.0003889537
##        Balling.Lvl 0.0003889537
##         Brand.Code 0.0000000000
##    Pressure.Vacuum 0.0000000000
##      Air.Pressurer 0.0000000000

To optimize the prediction model, we need to re-evaluate which predictors belong in the model and handle the missing values with appropriate imputation techniques.

Imputation Strategy:

As noted earlier, there are many missing values in our data. Since many of the models we plan to use require complete cases, we need to impute the missing data points. Occasionally, manually entered missing data can be assumed to be zero, but we do not believe that is the case here. Simple approaches impute the mean, median, or mode; instead we will use multivariate imputation by chained equations, or MICE. MICE is a strong imputation method because it preserves the relations within the data and the uncertainty about those relations. We use the argument m = 5 to request five imputations, each on the same dataset but with different imputed values, which can then be analyzed and pooled so that relations and uncertainty are maintained. The method pmm is short for predictive mean matching; its imputations are restricted to observed values, so it works for our categorical variable as well, preserves non-linear relationships, and is computationally faster than other MICE methods.
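A sketch of the mice call (the seed is arbitrary; variables with no missing values are skipped automatically, which is why some methods print as "" below):

```r
# Five imputations via predictive mean matching
imp <- mice(df, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
print(imp)

# Use one completed data set for the modeling that follows
df_imputed <- complete(imp, 1)
colSums(is.na(df_imputed))  # all zeros after imputation
```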

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                ""             "pmm"             "pmm"             "pmm" 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##         Carb.Flow           Density               MFR           Balling 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                ""             "pmm"             "pmm"             "pmm" 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##             "pmm"                ""             "pmm"             "pmm" 
##       Balling.Lvl Air.Pressurer_bin   Balling.Lvl_bin       Balling_bin 
##             "pmm"                ""             "pmm"             "pmm" 
##       Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##      Mnf.Flow_bin 
##             "pmm" 
## PredictorMatrix:
##               Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## Brand.Code             0           1           1         1             1
## Carb.Volume            1           0           1         1             1
## Fill.Ounces            1           1           0         1             1
## PC.Volume              1           1           1         0             1
## Carb.Pressure          1           1           1         1             0
## Carb.Temp              1           1           1         1             1
##               Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1
## Brand.Code            1   1        1       1        1              1
## Carb.Volume           1   1        1       1        1              1
## Fill.Ounces           1   1        1       1        1              1
## PC.Volume             1   1        1       1        1              1
## Carb.Pressure         1   1        1       1        1              1
## Carb.Temp             0   1        1       1        1              1
##               Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3
## Brand.Code                1             1             1             1
## Carb.Volume               1             1             1             1
## Fill.Ounces               1             1             1             1
## PC.Volume                 1             1             1             1
## Carb.Pressure             1             1             1             1
## Carb.Temp                 1             1             1             1
##               Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## Brand.Code                1            1            1           1          1
## Carb.Volume               1            1            1           1          1
## Fill.Ounces               1            1            1           1          1
## PC.Volume                 1            1            1           1          1
## Carb.Pressure             1            1            1           1          1
## Carb.Temp                 1            1            1           1          1
##               Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler
## Brand.Code            1       1   1       1               1  1             1
## Carb.Volume           1       1   1       1               1  1             1
## Fill.Ounces           1       1   1       1               1  1             1
## PC.Volume             1       1   1       1               1  1             1
## Carb.Pressure         1       1   1       1               1  1             1
## Carb.Temp             1       1   1       1               1  1             1
##               Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## Brand.Code                1                 1             1        1        1
## Carb.Volume               1                 1             1        1        1
## Fill.Ounces               1                 1             1        1        1
## PC.Volume                 1                 1             1        1        1
## Carb.Pressure             1                 1             1        1        1
## Carb.Temp                 1                 1             1        1        1
##               Balling.Lvl Air.Pressurer_bin Balling.Lvl_bin Balling_bin
## Brand.Code              1                 1               1           1
## Carb.Volume             1                 1               1           1
## Fill.Ounces             1                 1               1           1
## PC.Volume               1                 1               1           1
## Carb.Pressure           1                 1               1           1
## Carb.Temp               1                 1               1           1
##               Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin
## Brand.Code              1                 1                 1                 1
## Carb.Volume             1                 1                 1                 1
## Fill.Ounces             1                 1                 1                 1
## PC.Volume               1                 1                 1                 1
## Carb.Pressure           1                 1                 1                 1
## Carb.Temp               1                 1                 1                 1
##               Mnf.Flow_bin
## Brand.Code               1
## Carb.Volume              1
## Fill.Ounces              1
## PC.Volume                1
## Carb.Pressure            1
## Carb.Temp                1
## Number of logged events:  75 
##   it im             dep meth             out
## 1  1  1             MFR  pmm Balling.Lvl_bin
## 2  1  1 Balling.Lvl_bin  pmm     Balling_bin
## 3  1  1     Balling_bin  pmm Balling.Lvl_bin
## 4  1  2             MFR  pmm Balling.Lvl_bin
## 5  1  2 Balling.Lvl_bin  pmm     Balling_bin
## 6  1  2     Balling_bin  pmm Balling.Lvl_bin
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 0                 0                 0 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 0                 0                 0                 0 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 0                 0                 0 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 0                 0                 0                 0 
##       Balling.Lvl Air.Pressurer_bin   Balling.Lvl_bin       Balling_bin 
##                 0                 0                 0                 0 
##       Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin 
##                 0                 0                 0                 0 
##      Mnf.Flow_bin 
##                 0
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6 0.044
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1             0             0           118        121.2         4002
## 2             0             0           106        118.6         3986
## 3             0             0            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl Air.Pressurer_bin Balling.Lvl_bin Balling_bin Density_bin
## 1        1.48                 0               0           0           0
## 2        1.56                 0               0           0           0
## 3        3.28                 0               1           1           1
## 4        3.04                 1               1           1           1
## 5        3.04                 1               1           1           1
## 6        3.02                 1               1           1           1
##   Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin Mnf.Flow_bin
## 1                 0                 0                 0            1
## 2                 0                 0                 0            1
## 3                 0                 0                 0            1
## 4                 0                 0                 0            1
## 5                 0                 0                 0            1
## 6                 0                 0                 0            1

Now we’re going to remove the highly correlated predictor variables. First we compute the correlation matrix of the numeric columns, then use the findCorrelation() function to build the list of columns to remove. The function considers the absolute values of the pairwise correlations: if two variables are highly correlated, it compares the mean absolute correlation of each variable and removes the one with the larger mean absolute correlation. We use a cutoff of 0.75, which reduces the data from 41 columns to 26.
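A sketch using caret's findCorrelation():

```r
# Flag predictors with pairwise |correlation| above 0.75 and drop them
num_data   <- df_imputed[, sapply(df_imputed, is.numeric)]
high_corr  <- findCorrelation(cor(num_data), cutoff = 0.75, names = TRUE)
df_reduced <- df_imputed[, !(names(df_imputed) %in% high_corr)]
dim(df_reduced)  # 41 columns reduced to 26
```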

Next we pre-process the data by centering and scaling. Centering subtracts the mean of each predictor from its values, which makes the intercept easier to interpret. Scaling divides each predictor by its standard deviation, which standardizes the units of the regression coefficients. This pre-processing does not affect our estimates, and the p-values remain the same.
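Sketch, leaving the response PH untransformed:

```r
# Center and scale the predictors; the response PH is left unchanged
pred_cols <- setdiff(names(df_reduced), "PH")
pp    <- preProcess(df_reduced[, pred_cols], method = c("center", "scale"))
df_pp <- df_reduced
df_pp[, pred_cols] <- predict(pp, df_reduced[, pred_cols])
```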

Model Training

Next we split our pre-processed dataset into a training set and a test set to evaluate model effectiveness. The training set is used to estimate the values needed to mathematically define the relationships between the predictors and the outcome. We will build various models and then test their effectiveness on the test data, which is used only once a few strong candidate models have been finalized. Observations are assigned to the two sets at random, with 80% of the data in the training sample and 20% in the test sample.
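A sketch using caret's createDataPartition(), which stratifies the split on the response (the seed is arbitrary):

```r
# 80/20 train-test split, stratified on PH
set.seed(123)
in_train <- createDataPartition(df_pp$PH, p = 0.8, list = FALSE)
train_df <- df_pp[in_train, ]
test_df  <- df_pp[-in_train, ]
```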

Now we start to build our models. First we define our resampling method: a subset of samples is used to fit a model, and the remaining samples are used to estimate the model's efficacy. Resampling can produce reasonable predictions of how well the model will perform on future samples. Here we use 10-fold cross-validation because its bias and variance properties are good and it is relatively quick to compute.
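Sketch of the resampling specification, reused by every train() call below:

```r
# 10-fold cross-validation as the common resampling scheme
ctrl <- trainControl(method = "cv", number = 10)
```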

Regression Models

The first set of models we build are linear regression, partial least squares, ridge regression, and robust linear regression. Each seeks parameter estimates that minimize the sum of squared errors (SSE) or a function of it. The interpretability of the coefficients makes these models attractive, but if the data has curvature or non-linear structure, linear regression will not be able to identify these characteristics.

The ordinary linear regression equation can be written as

$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_j x_{ij} + e_i$$

where $y_i$ is the numeric response for the $i$th sample, $b_0$ is the estimated intercept, $b_j$ is the estimated coefficient for the $j$th predictor, $x_{ij}$ is the value of the $j$th predictor for the $i$th sample, and $e_i$ is random error that cannot be explained by the model.

Partial least squares is another regression technique that handles correlated values well. Like principal component analysis, partial least squares finds linear combinations of the predictors with the goal of maximally summarizing the covariance with the response variable. This strikes a compromise between the objectives of predictor space dimension reduction and a predictive relationship with the response.

Ridge regression adds a penalty on the sum of the squared regression parameters. The effect of this penalty is that parameter estimates are only allowed to become large if there is a proportional reduction in SSE. This controls for collinearity by shrinking the coefficients of features that do not improve the model; there is no explicit feature selection, but some features can become negligible if they are highly correlated with other influential features.

With Robust Linear Regression, we seek to minimize the effect of outliers on the regression equations. One drawback of minimizing SSE is that the parameter estimates can be influenced by just one observation that falls far from the overall trend in the data. When data may contain influential observations, an alternative minimization metric that is less sensitive, such as not squaring residuals when they are large, can be used to find the best parameter estimates.

Each of these models can be constructed using the train() function.
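A sketch of the four calls (tuning lengths are illustrative; the robust model is fit on principal components, consistent with the rlmPCA predictions shown later):

```r
set.seed(123)
lm_fit    <- train(PH ~ ., data = train_df, method = "lm",
                   trControl = ctrl)
pls_fit   <- train(PH ~ ., data = train_df, method = "pls",
                   tuneLength = 10, trControl = ctrl)
ridge_fit <- train(PH ~ ., data = train_df, method = "ridge",
                   tuneLength = 10, trControl = ctrl)
rlm_fit   <- train(PH ~ ., data = train_df, method = "rlm",
                   preProcess = "pca", trControl = ctrl)
```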

We then assess the performance of these models using two measures of accuracy: R^2 and Root Mean Squared Error.

RMSE is a function of the model residuals, the observed values minus the model predictions. It is calculated by squaring the residuals, averaging them, and taking the square root. The value is usually interpreted as how far, on average, the residuals are from zero, or as the average distance between the observed values and the model predictions.

The R^2 or coefficient of determination can be interpreted as the proportion of the information in the data that is explained by the model. This is calculated by finding the correlation coefficient between the observed and predicted values (usually denoted by R) and squaring it. This is a measure of correlation, not accuracy. We will still need to validate our predictions on test data to avoid over-fitting.
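In symbols, for observed values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad\qquad
R^2 = \left[\operatorname{cor}\left(y, \hat{y}\right)\right]^2$$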

## All four linear regression models took
## 24.499 sec elapsed
## to train.
##                   Model      RMSE   Rsquare
## 1     Linear Regression 0.1391485 0.3595576
## 2   Robust Linear Model 0.1429901 0.3218594
## 3 Partial Least Squares 0.1398985 0.3543206
## 4      Ridge-regression 0.1396745 0.3542312
## [1] 2571   33
## [1] 2571   41

We can now evaluate the performance of each model using different versions of the training data: we test our imputed predictors with and without the binary predictors derived from the zero-inflated predictors. For these linear models, the imputed predictors without the binary predictors give the highest R^2 and lowest RMSE. We will continue to test different combinations with our non-linear models.

Here we plot observed vs. predicted values on the training data for each model. The intercept and slope of the fitted line shift slightly to compensate for differences in the plot area, but what matters is the distance of the points from the line. Outliers are handled differently by each model, yet the overall shape of the distribution is consistent, with no apparent clusters or patterns of changing accuracy.

Linear Regression Models: Variable Importance:

Next we look at variable importance. For linear models, importance is based on the absolute value of the t-statistic (the coefficient estimate divided by its standard error) for each model parameter. Notice that different features are more or less influential depending on the model used.

Non-Linear Models

Next we build the non-linear models. We’ve decided to try k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), and neural networks. These models are not based on simple linear combinations of the predictors.

In neural networks, as in partial least squares, the outcome is modeled by an intermediary set of unobserved variables. These hidden units are linear combinations of the original predictors but, unlike PLS components, they are not estimated in a hierarchical fashion; there are no constraints that help define these linear combinations. Each unit is then related to the outcome through another linear combination connecting the hidden units. Treating this as a nonlinear regression model, the parameters are usually optimized with the back-propagation algorithm to minimize the sum of squared residuals.

MARS uses surrogate features instead of the original predictors. Whereas PLS and neural networks are based on linear combinations of the predictors, MARS creates two contrasted versions of each predictor to enter the model: a “hinge” function splits the predictor into two groups at the cut point that achieves the smallest error, and a linear relationship between the predictor and the outcome is modeled within each group. These new features are added to a basic linear regression model that estimates the slopes and intercepts.

Support Vector Machines follow the framework of robust regression, where we seek to minimize the effect of outliers on the regression equations: parameter estimates minimize SSE without squaring residuals when they are very large. In addition, samples the model fits well have no effect on the regression equation. A threshold is set using resampling, and a kernel function specifies the relationship between predictors and outcome, so that only the poorly predicted points, called support vectors, are used to fit the line. The radial kernel we use has an additional parameter that controls the smoothness of the upper and lower boundaries.

K-Nearest Neighbors predicts a new sample using the K closest samples from the training set; the predicted response is the mean of the K neighbors’ responses. Distance between samples is typically Euclidean or Minkowski, while Tanimoto, Hamming, and cosine distances can be used in specific contexts. Because predictors with the largest scales contribute most to the distances between samples, centering and scaling the data during pre-processing is important.
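A sketch of the four non-linear train() calls (tuning grids are illustrative):

```r
set.seed(123)
knn_fit  <- train(PH ~ ., data = train_df, method = "knn",
                  tuneLength = 10, trControl = ctrl)
svm_fit  <- train(PH ~ ., data = train_df, method = "svmRadial",
                  tuneLength = 10, trControl = ctrl)
mars_fit <- train(PH ~ ., data = train_df, method = "earth",
                  tuneGrid = expand.grid(degree = 1:2, nprune = 2:20),
                  trControl = ctrl)
nnet_fit <- train(PH ~ ., data = train_df, method = "avNNet",
                  tuneLength = 5, linout = TRUE, trace = FALSE,
                  maxit = 500, trControl = ctrl)
```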

## All 3 non-linear regression models took
## 199.142 sec elapsed
## to train.
##                    Model      RMSE   Rsquare
## 1    k-Nearest Neighbors 0.1292144 0.4502330
## 2 Support Vector Machine 0.1189232 0.5380865
## 3             MARS Tuned 0.1304607 0.4481918
## [1] 41
## [1] 216
##                   Model      RMSE   Rsquare
## 1 Neural Network avNNet 0.1328926 0.4127167

Again we plot the observed vs. predicted values. Outliers are handled differently by each model, and we notice the data cluster closer to the line overall.


Non-Linear Models: Variable Importance:

The top predictors in our best performing non-linear model (Support Vector Machine (SVM)) are:

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                   Overall
## Oxygen.Filler      100.00
## Filler.Level        75.73
## Mnf.Flow            66.59
## Balling             62.83
## Filler.Speed        62.14
## Mnf.Flow_bin        55.18
## Hyd.Pressure1       54.73
## Bowl.Setpoint       54.53
## Hyd.Pressure3       54.32
## Hyd.Pressure2       50.74
## Fill.Pressure       47.98
## Usage.cont          47.65
## Pressure.Setpoint   42.66
## Density             35.88
## Carb.Pressure1      33.04
## Balling.Lvl         32.92
## Brand.Code          32.86
## Carb.Rel            29.04
## Carb.Flow           27.42
## Alch.Rel            26.36

Tree-Based Models

## [1] 41
## [1] 6.403124

Finally, we create a number of tree-based models: Single Tree, Gradient Boosted Machine, Bagged Tree, and Random Forest.

Single Tree models consist of one or more nested if-then statements on the predictors that partition the data; within each partition, a model predicts the outcome. The predictor space is cut into terminal nodes, and the outcome within each node is predicted by a single number. Tree-based models are highly interpretable and handle many different data types effectively. For regression, the model begins with the entire data set and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups such that the overall sum of squared errors is minimized. A complexity parameter is added to avoid over-fitting by penalizing the error rate with the size of the tree.

Other tree models combine multiple trees into an ensemble that averages the training data in the terminal nodes. Bagging is a general approach that uses bootstrapping in conjunction with any regression model to construct an ensemble, and it effectively reduces the variance of a prediction through this aggregation. The bootstrap sampling also provides an inherent test set, the out-of-bag samples, that can be used to assess the predictive performance of each model, since those samples were not used to build it.

Random Forests improve on bagging by removing the inherent correlation between trees. Because all of the original predictors are considered at every split of every tree, bagged trees lack independence, which leads to correlation and decreased performance. Random forests de-correlate the trees by adding randomness to the tree-building process: with random split selection, trees are built using a random subset of the top k predictors at each split, which can greatly improve performance. Each model in the ensemble then generates a prediction for a new sample, and these m predictions are averaged to give the forest’s prediction.

Gradient Boosted Machines combine a number of weak learners into a new learner with a lower error rate. Using SSE, the model adds trees until the residuals are minimized, continuing for a desired number of iterations at a specified tree depth. Because each tree depends on the trees before it, this method requires substantial compute resources. A sketch of all four tree-based models follows.
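A sketch of the tree-based train() calls (the grids are illustrative; the two values printed at the start of this section, 41 and 6.403124, are consistent with the number of predictors and its square root, the default mtry for a random forest):

```r
set.seed(123)
cart_fit <- train(PH ~ ., data = train_df, method = "rpart",
                  tuneLength = 10, trControl = ctrl)
bag_fit  <- train(PH ~ ., data = train_df, method = "treebag",
                  trControl = ctrl)
rf_fit   <- train(PH ~ ., data = train_df, method = "rf",
                  tuneGrid = expand.grid(mtry = c(2, 6, 13, 20, 27)),
                  importance = TRUE, trControl = ctrl)
gbm_fit  <- train(PH ~ ., data = train_df, method = "gbm",
                  tuneLength = 5, verbose = FALSE, trControl = ctrl)
```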

##                      Model      RMSE   Rsquare
## 1              Single Tree 0.1361073 0.3970723
## 4 Gradient Boosted Machine 0.1136855 0.5790925
## 3              Bagged Tree 0.1053778 0.6298429
## 2       Random Forest Grid 0.1052708 0.6328797

Tree-Based Models: Variable Importance:

The top predictors in our best performing Tree-Based model (Random Forest) are:

##     overall           names
## 13 7.336494        Mnf.Flow
## 23 4.196069      Usage.cont
## 3  3.194608     Brand.CodeC
## 43 3.119403    Mnf.Flow_bin
## 29 2.652327   Oxygen.Filler
## 22 2.381487     Temperature
## 33 2.313193        Alch.Rel
## 20 2.218293    Filler.Level
## 28 2.142022 Pressure.Vacuum
## 30 2.100721   Bowl.Setpoint

Model Comparison

##    lm_predictions pls_predictions ridge_predictions rlmPCA_model_predictions
## 2        8.550958        8.547756          8.543726                 8.553193
## 10       8.671324        8.677543          8.685910                 8.697293
## 11       8.641584        8.643147          8.645991                 8.665189
## 13       8.619212        8.619005          8.625213                 8.646452
## 22       8.526197        8.527720          8.521271                 8.475645
##    knnModel_predictions predictions_svm predictions_mars_tuned
## 2              8.716364        8.448527               8.553114
## 10             8.530909        8.611234               8.538793
## 11             8.563636        8.588461               8.605153
## 13             8.558182        8.598095               8.581092
## 22             8.623636        8.628847               8.456375
##    predictions_NNModel_1 single_t_predictions rf_predictions bgTree_predictions
## 2               8.569021             8.651509       8.573427           8.535549
## 10              8.620143             8.527234       8.590510           8.598029
## 11              8.617352             8.527234       8.516007           8.531157
## 13              8.598503             8.527234       8.562499           8.577970
## 22              8.490855             8.527234       8.510701           8.496079
##    gbm_predictions   PH
## 2         8.460649 8.26
## 10        8.616586 8.50
## 11        8.594001 8.34
## 13        8.590004 8.34
## 22        8.644030 8.48

Our final results from the models are displayed below. The Random Forest model trained on the dataset that retained the highly correlated predictors and included the binary predictors derived from the zero-inflated predictors performed best, with an RMSE of 0.0937 and an R^2 of 0.729.

##                       Model      RMSE   Rsquare
## 10       Random Forest Grid 0.1052708 0.6328797
## 11              Bagged Tree 0.1053778 0.6298429
## 12 Gradient Boosted Machine 0.1136855 0.5790925
## 7    Support Vector Machine 0.1189232 0.5380865
## 6       k-Nearest Neighbors 0.1292144 0.4502330
## 8                MARS Tuned 0.1304607 0.4481918
## 5     Neural Network avNNet 0.1328926 0.4127167
## 9               Single Tree 0.1361073 0.3970723
## 1         Linear Regression 0.1391485 0.3595576
## 3     Partial Least Squares 0.1398985 0.3543206
## 4          Ridge-regression 0.1396745 0.3542312
## 2       Robust Linear Model 0.1429901 0.3218594

With Only the Given Predictors

For comparison, the results when the models are trained only on the original predictors, without the derived binary variables, are:

##                       Model      RMSE   Rsquare
## 10            Random Forest 0.1044049 0.6597720
## 11              Bagged Tree 0.1071411 0.6239057
## 12 Gradient Boosted Machine 0.1119545 0.5820163
## 7    Support Vector Machine 0.1177673 0.5381996
## 6       k-Nearest Neighbors 0.1258894 0.4788588
## 8                MARS Tuned 0.1290927 0.4454786
## 9               Single Tree 0.1310870 0.4338619
## 5     Neural Network avNNet 0.1383973 0.3706766
## 3     Partial Least Squares 0.1397042 0.3519163
## 1         Linear Regression 0.1398716 0.3504730
## 4          Ridge-regression 0.1399046 0.3496677
## 2       Robust Linear Model 0.1436261 0.3131386

Model Prediction Results:
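The heading above corresponds to the Excel deliverable. A minimal export sketch, assuming the final random forest model and a scoring data set eval_df (a placeholder name) prepared with the same pre-processing as the training data; writexl is one option for writing .xlsx files:

```r
# Predict PH with the final model and save in an Excel-readable format
ph_pred <- predict(rf_fit, newdata = eval_df)  # eval_df: placeholder
writexl::write_xlsx(data.frame(Predicted_PH = ph_pred),
                    "PH_predictions.xlsx")
```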

Final Thoughts

  • The most important variable in the linear models (Carb.Pressure1) was not a top predictor in either the non-linear or tree-based models. However, predictors such as Mnf.Flow appear as important variables in all three categories of models, and there is some overlap between the linear models and the random forest (e.g., Temperature, Usage.cont, Brand.CodeB).
  • The linear and non-linear models we tried initially on the available training data did not achieve a high R^2 value, but the tree-based models improved R^2 significantly.
  • However, a model with an R^2 of 0.73 may not yield very high-quality predictions in practice, so further model tuning and validation may be necessary to improve the accuracy of the model's predictions.
  • As a next step, the current model should be deployed in the production environment, and its prediction performance and accuracy should be tested against new data generated from ABC Beverage's manufacturing process.