Libraries
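The library chunk itself is not echoed in this report; the sketch below reconstructs it from the packages referenced later (psych, VIM, mice, caret, corrplot, tictoc), so your session may need a slightly different list.

```r
# Packages used in this report (reconstructed from the functions
# referenced below; install any that are missing).
library(psych)     # describe() summary statistics
library(VIM)       # aggr() missing-value summary
library(mice)      # multivariate imputation by chained equations
library(caret)     # pre-processing, resampling, and model training
library(corrplot)  # correlation plot
library(tictoc)    # timing the model training runs
```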

Team Members:

Project 2: Prediction of PH in Beverages

Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for the technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.

Exploratory Analysis

First we load our data. The data was provided in an Excel document, but for reproducibility we have uploaded it to GitHub so anyone can use the link. The column "Brand.Code" is categorical, so we change its data type to a factor.
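A minimal loading sketch; the GitHub URL is a placeholder for the raw link to the uploaded file, and `df` is the name we use for the training data throughout.

```r
# Load the historical data (placeholder URL -- substitute the raw
# GitHub link to the uploaded file).
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/StudentData.csv"
df  <- read.csv(url)

# Brand.Code is categorical, so convert it to a factor
df$Brand.Code <- as.factor(df$Brand.Code)
head(df)
```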

Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02

In the next sections we perform some exploratory data analysis. We list the column names and explore the completeness of the data. One pattern we look for is whether some rows or columns are more incomplete than others; these incomplete fields or records could be excluded or flagged with a dummy variable later.
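A quick completeness check along these lines (a sketch; output omitted):

```r
# Column names, plus missing values per column and per row, to see
# whether some columns or records are disproportionately incomplete
names(df)
colSums(is.na(df))
table(rowSums(is.na(df)))
```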

Descriptive Summary Statistics of the Predictors:

vars n mean sd median trimmed mad min max range skew kurtosis se
Brand.Code* 1 2571 3.3893427 1.1066245 3.0000000 3.4200292 1.4826000 1.0000000 5.000 4.0000000 0.0410894 -0.6106962 0.0218247
Carb.Volume 2 2561 5.3701978 0.1063852 5.3466667 5.3654840 0.1087240 5.0400000 5.700 0.6600000 0.3922121 -0.4669916 0.0021022
Fill.Ounces 3 2533 23.9747546 0.0875299 23.9733333 23.9751390 0.0790720 23.6333333 24.320 0.6866667 -0.0215452 0.8624714 0.0017392
PC.Volume 4 2532 0.2771187 0.0606953 0.2713333 0.2745818 0.0523852 0.0793333 0.478 0.3986667 0.3396269 0.6699690 0.0012062
Carb.Pressure 5 2544 68.1895755 3.5382039 68.2000000 68.1212574 3.5582400 57.0000000 79.400 22.4000000 0.1822162 -0.0138046 0.0701495
Carb.Temp 6 2545 141.0949234 4.0373861 140.8000000 140.9912617 3.8547600 128.6000000 154.000 25.4000000 0.2468280 0.2375822 0.0800307
PSC 7 2538 0.0845737 0.0492690 0.0760000 0.0802746 0.0474432 0.0020000 0.270 0.2680000 0.8491445 0.6480498 0.0009780
PSC.Fill 8 2548 0.1953689 0.1177817 0.1800000 0.1837059 0.1186080 0.0000000 0.620 0.6200000 0.9334450 0.7691466 0.0023333
PSC.CO2 9 2532 0.0564139 0.0430387 0.0400000 0.0494965 0.0296520 0.0000000 0.240 0.2400000 1.7288937 3.7250025 0.0008553
Mnf.Flow 10 2569 24.5689373 119.4811263 65.2000000 21.0679631 169.0164000 -100.2000000 229.400 329.6000000 0.0041430 -1.8697072 2.3573130
Carb.Pressure1 11 2539 122.5863726 4.7428819 123.2000000 122.5379242 4.4478000 105.6000000 140.200 34.6000000 0.0543587 0.1418265 0.0941263
Fill.Pressure 12 2549 47.9221656 3.1775457 46.4000000 47.7071044 2.3721600 34.6000000 60.400 25.8000000 0.5471107 1.4067532 0.0629371
Hyd.Pressure1 13 2560 12.4375781 12.4332538 11.4000000 10.8374023 16.9016400 -0.8000000 58.000 58.8000000 0.7798043 -0.1426463 0.2457338
Hyd.Pressure2 14 2556 20.9610329 16.3863066 28.6000000 21.0519062 13.3434000 0.0000000 59.400 59.4000000 -0.3019570 -1.5592984 0.3241161
Hyd.Pressure3 15 2556 20.4584507 15.9757236 27.6000000 20.5052786 13.9364400 -1.2000000 50.000 51.2000000 -0.3189061 -1.5745834 0.3159949
Hyd.Pressure4 16 2541 96.2888627 13.1225594 96.0000000 95.4530251 11.8608000 52.0000000 142.000 90.0000000 0.5459786 0.6340041 0.2603252
Filler.Level 17 2551 109.2523716 15.6984241 118.4000000 111.0417442 9.1921200 55.8000000 161.200 105.4000000 -0.8482847 0.0460488 0.3108142
Filler.Speed 18 2514 3687.1988862 770.8200208 3982.0000000 3919.9870775 47.4432000 998.0000000 4030.000 3032.0000000 -2.8700359 6.7059692 15.3734149
Temperature 19 2557 65.9675401 1.3827783 65.6000000 65.7986321 0.8895600 63.6000000 76.200 12.6000000 2.3869732 10.1612904 0.0273456
Usage.cont 20 2566 20.9929618 2.9779364 21.7900000 21.2517819 3.1875900 12.0800000 25.900 13.8200000 -0.5353253 -1.0170230 0.0587878
Carb.Flow 21 2569 2468.3542234 1073.6964743 3028.0000000 2601.1356344 326.1720000 26.0000000 5104.000 5078.0000000 -0.9877287 -0.5826893 21.1835857
Density 22 2570 1.1736498 0.3775269 0.9800000 1.1533463 0.1482600 0.2400000 1.920 1.6800000 0.5260149 -1.1992070 0.0074470
MFR 23 2359 704.0492582 73.8983094 724.0000000 718.1566967 15.4190400 31.4000000 868.600 837.2000000 -5.0917729 30.4558939 1.5214950
Balling 24 2570 2.1977696 0.9310914 1.6480000 2.1287189 0.3706500 -0.1700000 4.012 4.1820000 0.5939224 -1.3855651 0.0183665
Pressure.Vacuum 25 2571 -5.2161027 0.5699933 -5.4000000 -5.2521147 0.5930400 -6.6000000 -3.600 3.0000000 0.5256608 -0.0313126 0.0112414
Oxygen.Filler 26 2559 0.0468426 0.0466436 0.0334000 0.0388837 0.0249077 0.0024000 0.400 0.3976000 2.6603955 11.0882098 0.0009221
Bowl.Setpoint 27 2569 109.3265862 15.3031541 120.0000000 111.3466213 0.0000000 70.0000000 140.000 70.0000000 -0.9743842 -0.0564212 0.3019249
Pressure.Setpoint 28 2559 47.6153966 2.0390474 46.0000000 47.6026354 0.0000000 44.0000000 52.000 8.0000000 0.2031970 -1.6012622 0.0403081
Air.Pressurer 29 2571 142.8339946 1.2119170 142.6000000 142.5812348 0.5930400 140.8000000 148.200 7.4000000 2.2521053 4.7336291 0.0239013
Alch.Rel 30 2562 6.8974161 0.5052753 6.5600000 6.8384390 0.0593040 5.2800000 8.620 3.3400000 0.8836378 -0.8506221 0.0099825
Carb.Rel 31 2561 5.4367825 0.1287183 5.4000000 5.4301318 0.1186080 4.9600000 6.060 1.1000000 0.5032472 -0.2949480 0.0025435
Balling.Lvl 32 2570 2.0500078 0.8703089 1.4800000 1.9827237 0.2075640 0.0000000 3.660 3.6600000 0.5858456 -1.4858636 0.0171675

We used the describe() function from the psych package to review the baseline statistical metrics for the predictors. From the table above, note that predictors like Oxygen.Filler, PSC, PSC.Fill, PSC.CO2, and PC.Volume have near-zero means, and many predictors show notable skewness.

Descriptive Statistical Plots

PH Level distribution

Above we can see that the PH distribution for all brands combined is approximately normal. Below we can see that most observations belong to brand “B”; there are four labeled brands and one unlabeled.

## Dimensions of Training DF:
##  2571 33

Box Plot:

Density Plots

Histogram of variables in the data set

Now we look at the distribution of each column. For modeling purposes, each column would ideally have a normal distribution, so we look for columns that might be candidates for transformations.

We notice many columns are normally distributed, some have bimodal distributions, some are skewed with a long tail of outliers, and some only take values in intervals of 2s or 10s. In our model pre-processing, we may choose different transformations to normalize and standardize the data. One pattern that stands out is that many columns contain a high number of zeros. We assume these zeros carry some significance and that their effect may be linear in nature. For these columns we will create binary dummy variables based on a specified cutoff value, as sketched below.
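A sketch of the indicator construction; the predictors flagged here match the *_bin columns that appear later in the report, but the cutoff values and directions are illustrative placeholders since the report does not list them.

```r
# Add a binary indicator for each zero-inflated predictor. The cutoff
# below is an illustrative placeholder; in practice the threshold and
# direction would be chosen per predictor.
zero_inflated <- c("Mnf.Flow", "Hyd.Pressure1", "Hyd.Pressure2",
                   "Hyd.Pressure3", "Density", "Balling",
                   "Balling.Lvl", "Air.Pressurer")
for (col in zero_inflated) {
  cutoff <- median(df[[col]], na.rm = TRUE)  # placeholder cutoff
  df[[paste0(col, "_bin")]] <- as.integer(df[[col]] > cutoff)
}
dim(df)  # 33 original columns + 8 indicators = 41
```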

## [1] 2571   41

Correlation Plot

Here we observe the correlation between variables. Highly correlated variables offer limited additional insight for our model, and non-linear models typically handle them better than linear models do. We will also keep these correlations in mind when identifying the variables most influential on our outcome.
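A sketch of how the plot can be produced with corrplot, clustering the columns so that blocks of highly correlated predictors stand out:

```r
# Pairwise correlations among the numeric columns; pairwise-complete
# observations are used because the data still contains NAs here
num_cols <- sapply(df, is.numeric)
corr_mat <- cor(df[, num_cols], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", order = "hclust", tl.cex = 0.6)
```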

Missing Value Analysis

Below is an analysis of predictors with NA values.

The data set used in this study consists of 2571 observations, 32 predictors, and one response variable (PH).

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                10                38                39 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                27                26                33                23 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                39                 2                32                22 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                11                15                15                30 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                20                57                14                 5 
##         Carb.Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 4                12                 2 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                12                 0                 9                10 
##       Balling.Lvl 
##                 1

Above we can see the number of missing values (NAs) in each column of the data set. Below, the variables are sorted by missingness; the Count column in this output shows each variable's missing values as a proportion of all observations.

## 
##  Variables sorted by number of missings: 
##           Variable        Count
##                MFR 0.0824581875
##       Filler.Speed 0.0221703617
##          PC.Volume 0.0151691949
##            PSC.CO2 0.0151691949
##        Fill.Ounces 0.0147802412
##                PSC 0.0128354726
##     Carb.Pressure1 0.0124465189
##      Hyd.Pressure4 0.0116686114
##      Carb.Pressure 0.0105017503
##          Carb.Temp 0.0101127966
##           PSC.Fill 0.0089459354
##      Fill.Pressure 0.0085569817
##       Filler.Level 0.0077790743
##      Hyd.Pressure2 0.0058343057
##      Hyd.Pressure3 0.0058343057
##        Temperature 0.0054453520
##      Oxygen.Filler 0.0046674446
##  Pressure.Setpoint 0.0046674446
##      Hyd.Pressure1 0.0042784909
##        Carb.Volume 0.0038895371
##           Carb.Rel 0.0038895371
##           Alch.Rel 0.0035005834
##         Usage.cont 0.0019447686
##                 PH 0.0015558149
##           Mnf.Flow 0.0007779074
##          Carb.Flow 0.0007779074
##      Bowl.Setpoint 0.0007779074
##            Density 0.0003889537
##            Balling 0.0003889537
##        Balling.Lvl 0.0003889537
##         Brand.Code 0.0000000000
##    Pressure.Vacuum 0.0000000000
##      Air.Pressurer 0.0000000000

To optimize the prediction model, we need to re-evaluate which predictors belong in the model and handle the missing values with appropriate imputation techniques.

Imputation Strategy:

As noted earlier, there are many missing values in our data. Since many of the models we plan to use require complete cases, we need to impute the missing data points. Occasionally, manually entered missing data can be assumed to be zero, but we do not believe that is the case here. Simple approaches impute the mean, median, or mode; instead we will use multivariate imputation by chained equations, or MICE. MICE is a strong imputation method because it preserves the relations within the data and the uncertainty about those relations. We use the argument m = 5 to request five imputations, each on the same dataset but with different imputed values, which can then be analyzed and pooled so that relations and uncertainty are maintained. The method pmm is short for predictive mean matching; its imputations are restricted to observed values, so it works for our categorical variable as well, preserves non-linear relationships, and is computationally faster than other MICE methods.
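A sketch of the mice call (the seed is arbitrary; variables with no missing values are skipped automatically, which is why some methods print as "" below):

```r
# Five imputations via predictive mean matching
imp <- mice(df, m = 5, method = "pmm", seed = 123, printFlag = FALSE)
print(imp)

# Use one completed data set for the modeling that follows
df_imputed <- complete(imp, 1)
colSums(is.na(df_imputed))  # all zeros after imputation
```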

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                ""             "pmm"             "pmm"             "pmm" 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##         Carb.Flow           Density               MFR           Balling 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                ""             "pmm"             "pmm"             "pmm" 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##             "pmm"                ""             "pmm"             "pmm" 
##       Balling.Lvl Air.Pressurer_bin   Balling.Lvl_bin       Balling_bin 
##             "pmm"                ""             "pmm"             "pmm" 
##       Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin 
##             "pmm"             "pmm"             "pmm"             "pmm" 
##      Mnf.Flow_bin 
##             "pmm" 
## PredictorMatrix:
##               Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## Brand.Code             0           1           1         1             1
## Carb.Volume            1           0           1         1             1
## Fill.Ounces            1           1           0         1             1
## PC.Volume              1           1           1         0             1
## Carb.Pressure          1           1           1         1             0
## Carb.Temp              1           1           1         1             1
##               Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1
## Brand.Code            1   1        1       1        1              1
## Carb.Volume           1   1        1       1        1              1
## Fill.Ounces           1   1        1       1        1              1
## PC.Volume             1   1        1       1        1              1
## Carb.Pressure         1   1        1       1        1              1
## Carb.Temp             0   1        1       1        1              1
##               Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3
## Brand.Code                1             1             1             1
## Carb.Volume               1             1             1             1
## Fill.Ounces               1             1             1             1
## PC.Volume                 1             1             1             1
## Carb.Pressure             1             1             1             1
## Carb.Temp                 1             1             1             1
##               Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## Brand.Code                1            1            1           1          1
## Carb.Volume               1            1            1           1          1
## Fill.Ounces               1            1            1           1          1
## PC.Volume                 1            1            1           1          1
## Carb.Pressure             1            1            1           1          1
## Carb.Temp                 1            1            1           1          1
##               Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler
## Brand.Code            1       1   1       1               1  1             1
## Carb.Volume           1       1   1       1               1  1             1
## Fill.Ounces           1       1   1       1               1  1             1
## PC.Volume             1       1   1       1               1  1             1
## Carb.Pressure         1       1   1       1               1  1             1
## Carb.Temp             1       1   1       1               1  1             1
##               Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## Brand.Code                1                 1             1        1        1
## Carb.Volume               1                 1             1        1        1
## Fill.Ounces               1                 1             1        1        1
## PC.Volume                 1                 1             1        1        1
## Carb.Pressure             1                 1             1        1        1
## Carb.Temp                 1                 1             1        1        1
##               Balling.Lvl Air.Pressurer_bin Balling.Lvl_bin Balling_bin
## Brand.Code              1                 1               1           1
## Carb.Volume             1                 1               1           1
## Fill.Ounces             1                 1               1           1
## PC.Volume               1                 1               1           1
## Carb.Pressure           1                 1               1           1
## Carb.Temp               1                 1               1           1
##               Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin
## Brand.Code              1                 1                 1                 1
## Carb.Volume             1                 1                 1                 1
## Fill.Ounces             1                 1                 1                 1
## PC.Volume               1                 1                 1                 1
## Carb.Pressure           1                 1                 1                 1
## Carb.Temp               1                 1                 1                 1
##               Mnf.Flow_bin
## Brand.Code               1
## Carb.Volume              1
## Fill.Ounces              1
## PC.Volume                1
## Carb.Pressure            1
## Carb.Temp                1
## Number of logged events:  75 
##   it im             dep meth             out
## 1  1  1             MFR  pmm Balling.Lvl_bin
## 2  1  1 Balling.Lvl_bin  pmm     Balling_bin
## 3  1  1     Balling_bin  pmm Balling.Lvl_bin
## 4  1  2             MFR  pmm Balling.Lvl_bin
## 5  1  2 Balling.Lvl_bin  pmm     Balling_bin
## 6  1  2     Balling_bin  pmm Balling.Lvl_bin
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 0                 0                 0 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 0                 0                 0                 0 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 0                 0                 0 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 0                 0                 0 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 0                 0                 0                 0 
##       Balling.Lvl Air.Pressurer_bin   Balling.Lvl_bin       Balling_bin 
##                 0                 0                 0                 0 
##       Density_bin Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin 
##                 0                 0                 0                 0 
##      Mnf.Flow_bin 
##                 0
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6 0.044
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1             0             0           118        121.2         4002
## 2             0             0           106        118.6         3986
## 3             0             0            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl Air.Pressurer_bin Balling.Lvl_bin Balling_bin Density_bin
## 1        1.48                 0               0           0           0
## 2        1.56                 0               0           0           0
## 3        3.28                 0               1           1           1
## 4        3.04                 1               1           1           1
## 5        3.04                 1               1           1           1
## 6        3.02                 1               1           1           1
##   Hyd.Pressure1_bin Hyd.Pressure2_bin Hyd.Pressure3_bin Mnf.Flow_bin
## 1                 0                 0                 0            1
## 2                 0                 0                 0            1
## 3                 0                 0                 0            1
## 4                 0                 0                 0            1
## 5                 0                 0                 0            1
## 6                 0                 0                 0            1

Now we’re going to remove the highly correlated predictor variables. First we compute the correlation matrix of the numeric columns, then use the findCorrelation() function to build the list of columns to remove. The function considers the absolute values of the pairwise correlations: if two variables are highly correlated, it compares the mean absolute correlation of each variable and removes the one with the larger mean absolute correlation. We use a cutoff of 0.75, which reduces the data from 41 columns to 26.
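A sketch using caret's findCorrelation():

```r
# Flag predictors with pairwise |correlation| above 0.75 and drop them
num_data   <- df_imputed[, sapply(df_imputed, is.numeric)]
high_corr  <- findCorrelation(cor(num_data), cutoff = 0.75, names = TRUE)
df_reduced <- df_imputed[, !(names(df_imputed) %in% high_corr)]
dim(df_reduced)  # 41 columns reduced to 26
```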

Next we pre-process the data by centering and scaling. Centering subtracts the mean of each predictor from its values, which makes the intercept easier to interpret. Scaling divides each predictor by its standard deviation, which standardizes the units of the regression coefficients. This pre-processing does not affect our estimates, and the p-values remain the same.
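Sketch, leaving the response PH untransformed:

```r
# Center and scale the predictors; the response PH is left unchanged
pred_cols <- setdiff(names(df_reduced), "PH")
pp    <- preProcess(df_reduced[, pred_cols], method = c("center", "scale"))
df_pp <- df_reduced
df_pp[, pred_cols] <- predict(pp, df_reduced[, pred_cols])
```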

Model Training

Next we split our pre-processed dataset into a training set and a test set to evaluate model effectiveness. The training set is used to estimate the values needed to mathematically define the relationships between the predictors and the outcome. We will build various models and then test their effectiveness on the test data, which is used only once a few strong candidate models have been finalized. Observations are assigned to the two sets at random, with 80% of the data in the training sample and 20% in the test sample.
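A sketch using caret's createDataPartition(), which stratifies the split on the response (the seed is arbitrary):

```r
# 80/20 train-test split, stratified on PH
set.seed(123)
in_train <- createDataPartition(df_pp$PH, p = 0.8, list = FALSE)
train_df <- df_pp[in_train, ]
test_df  <- df_pp[-in_train, ]
```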

Now we start to build our models. First we define our resampling method: a subset of samples is used to fit a model, and the remaining samples are used to estimate the model's efficacy. Resampling can produce reasonable predictions of how well the model will perform on future samples. Here we use 10-fold cross-validation because its bias and variance properties are good and it is relatively quick to compute.
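Sketch of the resampling specification, reused by every train() call below:

```r
# 10-fold cross-validation as the common resampling scheme
ctrl <- trainControl(method = "cv", number = 10)
```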

Regression Models

The first set of models we build are linear regression, partial least squares, ridge regression, and robust linear regression. Each seeks parameter estimates that minimize the sum of squared errors (SSE) or a function of it. The interpretability of the coefficients makes these models attractive, but if the data has curvature or non-linear structure, linear regression will not be able to identify these characteristics.

The ordinary linear regression equation can be written as

$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_j x_{ij} + e_i$$

where $y_i$ is the numeric response for the $i$th sample, $b_0$ is the estimated intercept, $b_j$ is the estimated coefficient for the $j$th predictor, $x_{ij}$ is the value of the $j$th predictor for the $i$th sample, and $e_i$ is random error that cannot be explained by the model.

Partial least squares is another regression technique that handles correlated values well. Like principal component analysis, partial least squares finds linear combinations of the predictors with the goal of maximally summarizing the covariance with the response variable. This strikes a compromise between the objectives of predictor space dimension reduction and a predictive relationship with the response.

Ridge regression adds a penalty on the sum of the squared regression parameters. The effect of this penalty is that parameter estimates are only allowed to become large if there is a proportional reduction in SSE. This controls for collinearity by shrinking the coefficients of features that do not improve the model; there is no explicit feature selection, but some features can become negligible if they are highly correlated with other influential features.

With Robust Linear Regression, we seek to minimize the effect of outliers on the regression equations. One drawback of minimizing SSE is that the parameter estimates can be influenced by just one observation that falls far from the overall trend in the data. When data may contain influential observations, an alternative minimization metric that is less sensitive, such as not squaring residuals when they are large, can be used to find the best parameter estimates.

Each of these models can be constructed using the train() function.
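A sketch of the four calls (tuning lengths are illustrative; the robust model is fit on principal components, consistent with the rlmPCA predictions shown later):

```r
set.seed(123)
lm_fit    <- train(PH ~ ., data = train_df, method = "lm",
                   trControl = ctrl)
pls_fit   <- train(PH ~ ., data = train_df, method = "pls",
                   tuneLength = 10, trControl = ctrl)
ridge_fit <- train(PH ~ ., data = train_df, method = "ridge",
                   tuneLength = 10, trControl = ctrl)
rlm_fit   <- train(PH ~ ., data = train_df, method = "rlm",
                   preProcess = "pca", trControl = ctrl)
```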

We then assess the performance of these models using two measures of accuracy: R^2 and Root Mean Squared Error.

RMSE is a function of the model residuals, the observed values minus the model predictions. It is calculated by squaring the residuals, averaging them, and taking the square root. The value is usually interpreted as how far, on average, the residuals are from zero, or as the average distance between the observed values and the model predictions.

The R^2 or coefficient of determination can be interpreted as the proportion of the information in the data that is explained by the model. This is calculated by finding the correlation coefficient between the observed and predicted values (usually denoted by R) and squaring it. This is a measure of correlation, not accuracy. We will still need to validate our predictions on test data to avoid over-fitting.
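In symbols, for observed values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad\qquad
R^2 = \left[\operatorname{cor}\left(y, \hat{y}\right)\right]^2$$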

## All four linear regression models took
## 24.499 sec elapsed
## to train.
##                   Model      RMSE   Rsquare
## 1     Linear Regression 0.1391485 0.3595576
## 2   Robust Linear Model 0.1429901 0.3218594
## 3 Partial Least Squares 0.1398985 0.3543206
## 4      Ridge-regression 0.1396745 0.3542312
## [1] 2571   33
## [1] 2571   41

We can now evaluate the performance of each model using different versions of the training data: we test our imputed predictors with and without the binary predictors derived from the zero-inflated predictors. For these linear models, the imputed predictors without the binary predictors give the highest R^2 and lowest RMSE. We will continue to test different combinations with our non-linear models.

Here we plot observed vs. predicted values on the training data for each model. The intercept and slope of the fitted line shift slightly to compensate for differences in the plot area, but what matters is the distance of the points from the line. Outliers are handled differently by each model, yet the overall shape of the distribution is consistent, with no apparent clusters or patterns of changing accuracy.

Linear Regression Models: Variable Importance:

Next we look at variable importance. For linear models, importance is based on the absolute value of the t-statistic (the coefficient estimate divided by its standard error) for each model parameter. Notice that different features are more or less influential depending on the model used.

Non-Linear Models

Next we build the non-linear models. We’ve decided to try k-nearest neighbors (KNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), and neural networks. These models are not based on simple linear combinations of the predictors.

In neural networks, as in partial least squares, the outcome is modeled by an intermediary set of unobserved variables. These hidden units are linear combinations of the original predictors but, unlike PLS components, they are not estimated in a hierarchical fashion; there are no constraints that help define these linear combinations. Each unit is then related to the outcome through another linear combination connecting the hidden units. Treating this as a nonlinear regression model, the parameters are usually optimized with the back-propagation algorithm to minimize the sum of squared residuals.

MARS uses surrogate features instead of the original predictors. Whereas PLS and neural networks are based on linear combinations of the predictors, MARS creates two contrasted versions of each predictor to enter the model: a “hinge” function splits the predictor into two groups at the cut point that achieves the smallest error, and a linear relationship between the predictor and the outcome is modeled within each group. These new features are added to a basic linear regression model that estimates the slopes and intercepts.

Support Vector Machines follow the framework of robust regression, where we seek to minimize the effect of outliers on the regression equations: parameter estimates minimize SSE without squaring residuals when they are very large. In addition, samples the model fits well have no effect on the regression equation. A threshold is set using resampling, and a kernel function specifies the relationship between predictors and outcome, so that only the poorly predicted points, called support vectors, are used to fit the line. The radial kernel we use has an additional parameter that controls the smoothness of the upper and lower boundaries.

K-Nearest Neighbors predicts a new sample using the K closest samples from the training set; the predicted response is the mean of the K neighbors’ responses. Distance between samples is typically Euclidean or Minkowski, while Tanimoto, Hamming, and cosine distances can be used in specific contexts. Because predictors with the largest scales contribute most to the distances between samples, centering and scaling the data during pre-processing is important.
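A sketch of the four non-linear train() calls (tuning grids are illustrative):

```r
set.seed(123)
knn_fit  <- train(PH ~ ., data = train_df, method = "knn",
                  tuneLength = 10, trControl = ctrl)
svm_fit  <- train(PH ~ ., data = train_df, method = "svmRadial",
                  tuneLength = 10, trControl = ctrl)
mars_fit <- train(PH ~ ., data = train_df, method = "earth",
                  tuneGrid = expand.grid(degree = 1:2, nprune = 2:20),
                  trControl = ctrl)
nnet_fit <- train(PH ~ ., data = train_df, method = "avNNet",
                  tuneLength = 5, linout = TRUE, trace = FALSE,
                  maxit = 500, trControl = ctrl)
```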

## All 3 non-linear regression models took
## 199.142 sec elapsed
## to train.
##                    Model      RMSE   Rsquare
## 1    k-Nearest Neighbors 0.1292144 0.4502330
## 2 Support Vector Machine 0.1189232 0.5380865
## 3             MARS Tuned 0.1304607 0.4481918
## [1] 41
## [1] 216
##                   Model      RMSE   Rsquare
## 1 Neural Network avNNet 0.1328926 0.4127167

Again we plot the observed vs. predicted values. Outliers are handled differently by each model, and we notice the data cluster closer to the line overall.


Non-Linear Models: Variable Importance:

The top predictors in our best performing non-linear model (Support Vector Machine (SVM)) are:

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                   Overall
## Oxygen.Filler      100.00
## Filler.Level        75.73
## Mnf.Flow            66.59
## Balling             62.83
## Filler.Speed        62.14
## Mnf.Flow_bin        55.18
## Hyd.Pressure1       54.73
## Bowl.Setpoint       54.53
## Hyd.Pressure3       54.32
## Hyd.Pressure2       50.74
## Fill.Pressure       47.98
## Usage.cont          47.65
## Pressure.Setpoint   42.66
## Density             35.88
## Carb.Pressure1      33.04
## Balling.Lvl         32.92
## Brand.Code          32.86
## Carb.Rel            29.04
## Carb.Flow           27.42
## Alch.Rel            26.36

Tree-Based Models

## [1] 41
## [1] 6.403124

Finally, we create a number of tree-based models: Single Tree, Gradient Boosted Machine, Bagged Tree, and Random Forest.

Single Tree models consist of one or more nested if-then statements on the predictors that partition the data; within each partition, a model predicts the outcome. The predictor space is cut into terminal nodes, and the outcome within each node is predicted by a single number. Tree-based models are highly interpretable and handle many different data types effectively. For regression, the model begins with the entire data set and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups such that the overall sum of squared errors is minimized. A complexity parameter is added to avoid over-fitting by penalizing the error rate with the size of the tree.

Other tree models combine multiple trees into an ensemble that averages the training data in the terminal nodes. Bagging is a general approach that uses bootstrapping in conjunction with any regression model to construct an ensemble, and it effectively reduces the variance of a prediction through this aggregation. The bootstrap sampling also provides an inherent test set, the out-of-bag samples, that can be used to assess the predictive performance of each model, since those samples were not used to build it.

Random Forests improve on bagging by removing the inherent correlation between trees. Because all of the original predictors are considered at every split of every tree, bagged trees lack independence, which leads to correlation and decreased performance. Random forests de-correlate the trees by adding randomness to the tree-building process: with random split selection, trees are built using a random subset of the top k predictors at each split, which can greatly improve performance. Each model in the ensemble then generates a prediction for a new sample, and these m predictions are averaged to give the forest’s prediction.

Gradient Boosted Machines combine a number of weak learners into a new learner with a lower error rate. Using SSE, the model adds trees until the residuals are minimized, continuing for a desired number of iterations at a specified tree depth. Because each tree depends on the trees before it, this method requires substantial compute resources. A sketch of all four tree-based models follows.
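A sketch of the tree-based train() calls (the grids are illustrative; the two values printed at the start of this section, 41 and 6.403124, are consistent with the number of predictors and its square root, the default mtry for a random forest):

```r
set.seed(123)
cart_fit <- train(PH ~ ., data = train_df, method = "rpart",
                  tuneLength = 10, trControl = ctrl)
bag_fit  <- train(PH ~ ., data = train_df, method = "treebag",
                  trControl = ctrl)
rf_fit   <- train(PH ~ ., data = train_df, method = "rf",
                  tuneGrid = expand.grid(mtry = c(2, 6, 13, 20, 27)),
                  importance = TRUE, trControl = ctrl)
gbm_fit  <- train(PH ~ ., data = train_df, method = "gbm",
                  tuneLength = 5, verbose = FALSE, trControl = ctrl)
```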

##                      Model      RMSE   Rsquare
## 1              Single Tree 0.1361073 0.3970723
## 4 Gradient Boosted Machine 0.1136855 0.5790925
## 3              Bagged Tree 0.1053778 0.6298429
## 2       Random Forest Grid 0.1052708 0.6328797

Tree-Based Models: Variable Importance:

The top predictors in our best performing Tree-Based model (Random Forest) are:

##     overall           names
## 13 7.336494        Mnf.Flow
## 23 4.196069      Usage.cont
## 3  3.194608     Brand.CodeC
## 43 3.119403    Mnf.Flow_bin
## 29 2.652327   Oxygen.Filler
## 22 2.381487     Temperature
## 33 2.313193        Alch.Rel
## 20 2.218293    Filler.Level
## 28 2.142022 Pressure.Vacuum
## 30 2.100721   Bowl.Setpoint

Model Comparison

##    lm_predictions pls_predictions ridge_predictions rlmPCA_model_predictions
## 2        8.550958        8.547756          8.543726                 8.553193
## 10       8.671324        8.677543          8.685910                 8.697293
## 11       8.641584        8.643147          8.645991                 8.665189
## 13       8.619212        8.619005          8.625213                 8.646452
## 22       8.526197        8.527720          8.521271                 8.475645
##    knnModel_predictions predictions_svm predictions_mars_tuned
## 2              8.716364        8.448527               8.553114
## 10             8.530909        8.611234               8.538793
## 11             8.563636        8.588461               8.605153
## 13             8.558182        8.598095               8.581092
## 22             8.623636        8.628847               8.456375
##    predictions_NNModel_1 single_t_predictions rf_predictions bgTree_predictions
## 2               8.569021             8.651509       8.573427           8.535549
## 10              8.620143             8.527234       8.590510           8.598029
## 11              8.617352             8.527234       8.516007           8.531157
## 13              8.598503             8.527234       8.562499           8.577970
## 22              8.490855             8.527234       8.510701           8.496079
##    gbm_predictions   PH
## 2         8.460649 8.26
## 10        8.616586 8.50
## 11        8.594001 8.34
## 13        8.590004 8.34
## 22        8.644030 8.48

Our final results from the models are displayed below. The Random Forest model trained on the dataset that retained the highly correlated predictors and included the binary predictors derived from the zero-inflated predictors performed best, with an RMSE of 0.0937 and an R^2 of 0.729.

##                       Model      RMSE   Rsquare
## 10       Random Forest Grid 0.1052708 0.6328797
## 11              Bagged Tree 0.1053778 0.6298429
## 12 Gradient Boosted Machine 0.1136855 0.5790925
## 7    Support Vector Machine 0.1189232 0.5380865
## 6       k-Nearest Neighbors 0.1292144 0.4502330
## 8                MARS Tuned 0.1304607 0.4481918
## 5     Neural Network avNNet 0.1328926 0.4127167
## 9               Single Tree 0.1361073 0.3970723
## 1         Linear Regression 0.1391485 0.3595576
## 3     Partial Least Squares 0.1398985 0.3543206
## 4          Ridge-regression 0.1396745 0.3542312
## 2       Robust Linear Model 0.1429901 0.3218594

With Only the Given Predictors

For comparison, the results when the models are trained only on the original predictors, without the derived binary variables, are:

##                       Model      RMSE   Rsquare
## 10            Random Forest 0.1044049 0.6597720
## 11              Bagged Tree 0.1071411 0.6239057
## 12 Gradient Boosted Machine 0.1119545 0.5820163
## 7    Support Vector Machine 0.1177673 0.5381996
## 6       k-Nearest Neighbors 0.1258894 0.4788588
## 8                MARS Tuned 0.1290927 0.4454786
## 9               Single Tree 0.1310870 0.4338619
## 5     Neural Network avNNet 0.1383973 0.3706766
## 3     Partial Least Squares 0.1397042 0.3519163
## 1         Linear Regression 0.1398716 0.3504730
## 4          Ridge-regression 0.1399046 0.3496677
## 2       Robust Linear Model 0.1436261 0.3131386

Model Prediction Results:
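The heading above corresponds to the Excel deliverable. A minimal export sketch, assuming the final random forest model and a scoring data set eval_df (a placeholder name) prepared with the same pre-processing as the training data; writexl is one option for writing .xlsx files:

```r
# Predict PH with the final model and save in an Excel-readable format
ph_pred <- predict(rf_fit, newdata = eval_df)  # eval_df: placeholder
writexl::write_xlsx(data.frame(Predicted_PH = ph_pred),
                    "PH_predictions.xlsx")
```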

Final Thoughts

  • The most important variable in the linear models (Carb.Pressure1) was not a top predictor in either the non-linear or tree-based models. However, predictors such as Mnf.Flow appear as important variables in all three categories of models, and there is some overlap between the linear models and the random forest (e.g., Temperature, Usage.cont, Brand.CodeB).
  • The linear and non-linear models we tried initially on the available training data did not achieve a high R^2 value, but the tree-based models improved R^2 significantly.
  • However, a model with an R^2 of 0.73 may not yield very high-quality predictions in practice, so further model tuning and validation may be necessary to improve the accuracy of the model's predictions.
  • As a next step, the current model should be deployed in the production environment, and its prediction performance and accuracy should be tested against new data generated from ABC Beverage's manufacturing process.