Data624 Final Project

Assignment Overview / Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Data Import

There are two files provided:

StudentData.xlsx: This is the training dataset. The PH column is the target we are trying to predict.
StudentEvaluation.xlsx: This is the evaluation dataset. The PH column is empty in this dataset.

Note that the import code below reads these files in CSV form (StudentData.csv and StudentEvaluation.csv) from the ./data directory.

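# Load the packages used by the code in this section.
library(dplyr)       # %>% and select()
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and scroll_box()

# Load the ABC Beverage training dataset, treating empty strings and 'NA' as missing.
beverage.train <- read.csv('./data/StudentData.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)

# Load the ABC Beverage evaluation dataset with the same missing-value rules.
beverage.eval <- read.csv('./data/StudentEvaluation.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)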
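
To illustrate what the na.strings argument does in the calls above, here is a minimal sketch on a small inline CSV. The sample values are made up for illustration and are not taken from the project files. With this setting, both empty cells and the literal text NA should be read as missing values.

```r
# Hypothetical three-row CSV, supplied inline through the `text` argument.
csv_text <- "Brand.Code,Carb.Volume,PH\nA,5.34,8.36\n,5.43,NA\nB,,8.94"

# Same options as the import calls above.
sample.df <- read.csv(text = csv_text,
                      na.strings = c('', 'NA'),
                      stringsAsFactors = FALSE)

# Count the missing values in each column.
na.counts <- colSums(is.na(sample.df))
print(na.counts)
```

Without '' in na.strings, an empty Brand.Code would be kept as an empty string rather than NA, and would not show up in missing-value counts.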

Remove Empty PH Value From The Evaluation Set

The evaluation dataset contains an empty PH column, so we will remove it for now and add it back when it is needed later in the project.

# Remove the empty PH column from the evaluation data.
beverage.eval <- beverage.eval %>% dplyr::select(-PH)
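
Before dropping a column it can be worth confirming that it really is empty. The following is a minimal base R sketch of that check, assuming a small stand-in data frame (eval.demo and its values are hypothetical, not the project data).

```r
# Hypothetical stand-in for the evaluation data; values are illustrative only.
eval.demo <- data.frame(
  Brand.Code  = c("D", "A", "B"),
  Carb.Volume = c(5.48, 5.39, 5.29),
  PH          = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# Confirm that every PH value is missing before discarding the column.
ph.is.empty <- all(is.na(eval.demo$PH))

# Drop PH only when it carries no information
# (a base R equivalent of dplyr::select(-PH)).
if (ph.is.empty) {
  eval.demo <- eval.demo[, setdiff(names(eval.demo), "PH"), drop = FALSE]
}

print(names(eval.demo))
```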

Data Exploration

Evaluation Dataset

The first step in our data exploration is to take a brief look at the evaluation dataset. To get an idea of its structure, we will print out the first 40 rows of the data.

# Preview the first 40 rows of the evaluation dataset.
head(beverage.eval, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')

Brand.Code	Carb.Volume	Fill.Ounces	PC.Volume	Carb.Pressure	Carb.Temp	PSC	PSC.Fill	PSC.CO2	Mnf.Flow	Carb.Pressure1	Fill.Pressure	Hyd.Pressure2	Hyd.Pressure3	Hyd.Pressure4	Filler.Level	Filler.Speed	Temperature	Usage.cont	Carb.Flow	Density	MFR	Balling	Pressure.Vacuum	Oxygen.Filler	Bowl.Setpoint	Pressure.Setpoint	Air.Pressurer	Alch.Rel	Carb.Rel	Balling.Lvl
D	5.480000	24.03333	0.2700000	65.4	134.6	0.236	0.40	0.04	-100	116.6	46.0	NA	NA	96	129.4	3986	66.0	21.66	2950	0.88	727.6	1.398	-3.8	0.022	130	45.2	142.6	6.56	5.34	1.48
A	5.393333	23.95333	0.2266667	63.2	135.0	0.042	0.22	0.08	-100	118.8	46.2	0	0	112	120.0	4012	65.6	17.60	2916	1.50	735.8	2.942	-4.4	0.030	120	46.0	147.2	7.14	5.58	3.04
B	5.293333	23.92000	0.3033333	66.4	140.4	0.068	0.10	0.02	-100	120.2	45.8	0	0	98	119.4	4010	65.6	24.18	3056	0.90	734.8	1.448	-4.2	0.046	120	46.0	146.6	6.52	5.34	1.46
B	5.266667	23.94000	0.1860000	64.8	139.0	0.004	0.20	0.02	-100	124.8	40.0	0	0	132	120.2	NA	74.4	18.12	28	0.74	NA	1.056	-4.0	NA	120	46.0	146.4	6.48	5.50	1.48
B	5.406667	24.20000	0.1600000	69.4	142.2	0.040	0.30	0.06	-100	115.0	51.4	0	0	94	116.0	4018	66.4	21.32	3214	0.88	752.0	1.398	-4.0	0.082	120	50.0	145.8	6.50	5.38	1.46
B	5.286667	24.10667	0.2120000	73.4	147.2	0.078	0.22	NA	-100	118.6	46.4	0	0	94	120.4	4010	66.6	18.00	3064	0.84	732.0	1.298	-3.8	0.064	120	46.0	146.0	6.50	5.42	1.44
A	5.480000	23.93333	0.2433333	65.2	134.6	0.088	0.14	0.00	-100	117.6	46.2	0	0	108	119.6	4010	66.8	17.68	3042	1.48	729.8	2.894	-4.2	0.042	120	46.0	145.0	7.18	5.46	3.02
B	5.420000	24.06667	0.1226667	67.4	139.0	0.076	0.10	0.04	-100	121.4	40.0	0	0	108	131.4	NA	NA	12.90	1972	1.60	NA	3.320	-4.4	0.096	120	46.0	146.0	7.16	5.42	3.00
A	5.406667	23.92000	0.3326667	66.8	138.0	0.246	0.48	0.04	-100	136.0	43.8	0	0	110	121.0	4010	65.8	17.70	2502	1.52	741.2	2.992	-4.4	0.046	120	46.0	146.2	7.14	5.44	3.10
D	5.473333	24.02667	0.2560000	72.6	144.0	0.146	0.10	0.02	-100	126.6	40.8	0	0	106	120.8	1006	66.0	22.80	28	1.48	NA	2.892	-4.2	0.096	120	46.0	146.0	7.78	5.52	3.12
B	5.180000	NA	0.3433333	64.0	140.8	NA	0.34	0.04	-100	121.2	46.6	0	0	98	120.2	4010	65.4	20.04	3172	0.86	732.8	1.348	-4.2	0.066	120	46.0	147.0	6.52	5.36	1.38
B	5.260000	24.08000	0.2200000	63.2	139.6	0.184	0.26	0.20	-100	117.2	46.2	0	0	96	118.4	4010	65.8	17.16	3100	0.86	735.8	1.348	-4.2	0.048	120	46.0	147.0	6.50	5.38	1.42
B	5.300000	24.06000	0.2820000	65.0	138.8	0.152	0.12	0.00	-100	117.0	45.8	0	0	100	119.6	4010	65.4	20.52	2926	0.92	735.6	1.498	-4.8	0.066	120	46.0	147.0	6.54	5.28	1.46
B	5.306667	23.94000	0.2886667	63.8	137.2	0.100	0.18	0.02	-100	122.0	46.4	0	0	100	119.8	4016	65.6	21.44	2954	0.94	736.4	1.548	-4.8	0.050	120	46.0	142.4	6.54	5.22	1.44
C	5.273333	23.97333	0.3206667	64.6	140.0	0.080	0.28	0.10	-100	116.4	46.2	0	0	92	120.2	4012	67.6	21.08	3074	0.98	738.0	1.648	-4.2	0.046	120	46.0	142.4	6.62	5.26	1.60
NA	5.253333	23.88667	0.3193333	65.0	140.0	0.048	0.26	0.02	-100	125.6	43.4	0	0	110	130.4	NA	69.0	18.16	32	0.80	NA	1.198	-4.8	0.160	120	46.0	142.2	6.52	5.28	1.60
B	5.340000	23.98667	0.2533333	70.4	144.8	0.114	0.12	0.00	-100	118.0	45.6	0	0	90	119.2	3998	66.0	18.60	3004	0.88	730.4	1.398	-4.0	0.102	120	46.0	142.0	6.50	5.36	1.40
B	5.266667	23.94000	0.2746667	65.4	140.2	0.122	0.42	0.06	-100	116.4	46.2	0	0	90	120.6	3992	65.8	18.18	3090	0.94	728.6	1.548	-4.4	0.060	120	46.0	142.2	6.52	5.30	1.44
D	5.506667	23.89333	0.2493333	68.4	138.6	0.058	0.12	0.02	-100	118.0	46.2	0	0	76	120.2	3996	64.2	21.68	2936	1.64	729.8	3.290	-4.2	0.154	120	46.0	142.4	7.76	5.62	3.16
B	5.320000	23.96000	0.1906667	66.4	140.2	0.038	0.04	0.00	-100	117.8	45.4	0	0	92	120.0	3996	65.4	22.28	2972	0.92	726.8	1.498	-4.4	0.022	120	46.0	142.6	6.56	5.38	1.46
B	5.273333	23.96667	0.1993333	68.4	141.8	0.008	0.30	0.20	-100	117.2	46.0	0	0	94	120.2	3998	65.6	24.02	3094	0.92	732.0	1.498	-4.4	0.022	120	46.0	142.4	6.58	5.40	1.46
B	5.533333	23.98667	0.2466667	70.4	142.2	0.062	0.08	NA	-100	121.0	47.6	0	0	118	120.0	2834	67.4	13.56	3154	0.70	523.4	0.946	-4.4	0.054	120	46.0	141.8	6.50	5.48	1.36
A	5.426667	23.98667	0.2553333	69.0	140.4	0.122	0.48	0.04	-100	121.2	38.8	0	0	NA	131.4	1386	66.4	19.32	868	1.56	NA	3.092	-4.4	0.022	120	46.0	142.8	7.08	5.54	3.18
B	5.406667	23.94000	0.3293333	66.2	137.8	0.208	0.46	0.02	-100	130.6	44.6	0	0	96	119.0	4002	67.0	17.52	2592	0.88	731.4	1.398	-4.0	0.022	120	46.0	142.8	6.50	5.46	1.40
B	5.453333	24.09333	0.2353333	66.4	136.2	0.072	0.06	0.06	-100	125.2	46.4	0	0	94	120.2	4010	66.4	20.38	2996	0.90	736.8	1.448	-4.6	0.024	120	46.0	142.6	6.50	5.38	1.40
C	5.266667	23.94667	0.2766667	64.8	139.2	0.048	0.18	0.02	-100	116.6	46.2	0	0	90	120.6	4014	66.0	24.12	3060	0.90	738.6	1.448	-4.4	0.022	120	46.0	142.2	6.54	5.32	1.40
C	5.253333	23.99333	0.2933333	70.4	146.4	0.040	0.14	0.02	-100	122.2	46.0	0	0	102	119.8	4012	66.8	17.54	3136	0.94	741.2	1.548	-4.2	0.022	120	46.0	142.2	6.62	5.34	1.52
D	5.500000	24.04667	0.2466667	71.0	141.8	0.040	0.02	0.00	-100	120.4	46.4	0	0	78	119.4	4010	65.6	18.12	3194	1.64	735.2	3.290	-3.8	0.174	120	46.0	142.0	7.74	5.66	3.28
D	5.480000	23.89333	0.2246667	70.4	140.8	NA	0.34	0.06	-100	120.2	50.4	0	0	80	120.0	4010	64.8	16.94	3162	1.66	740.8	3.340	-3.8	0.022	120	50.0	142.0	7.74	5.62	3.24
D	5.486667	23.98000	0.3066667	69.6	140.0	0.234	0.16	0.02	-100	122.6	47.0	0	0	78	120.2	4010	64.6	17.04	2982	1.66	733.8	3.340	-4.0	0.024	120	46.0	141.8	7.74	5.62	3.26
D	5.466667	24.04000	0.2300000	70.2	141.2	0.004	0.10	0.04	-100	120.2	46.6	0	0	78	120.4	4010	65.4	23.44	3182	1.68	732.4	3.390	-4.0	0.066	120	46.0	142.0	7.72	5.56	3.28
D	5.460000	24.04667	0.2780000	70.8	141.8	0.174	0.62	0.10	-100	125.6	40.0	0	0	104	121.8	1008	70.6	19.22	32	1.42	NA	2.750	-4.2	0.024	120	46.0	141.8	7.72	5.58	3.28
B	5.320000	NA	0.2686667	64.4	137.0	0.068	0.10	0.04	-100	120.4	46.8	0	0	104	119.2	4018	66.0	18.04	3190	0.92	733.6	1.496	-4.2	0.022	120	46.0	142.4	6.50	5.38	1.50
B	5.313333	NA	0.3186667	64.2	136.8	0.218	0.44	0.04	-100	118.2	45.8	0	0	100	120.4	4014	65.2	22.02	2862	0.90	729.8	1.448	-4.2	0.022	120	46.0	142.0	6.50	5.36	1.52
B	5.266667	NA	0.2800000	73.2	149.8	0.040	0.24	0.04	-100	124.6	44.8	0	0	102	121.0	3990	65.0	16.72	3122	0.92	724.2	1.498	-4.0	0.036	120	46.0	141.6	6.52	5.34	1.56
A	5.486667	24.10667	0.1986667	75.0	146.4	0.086	0.28	0.04	-100	114.8	46.2	0	0	100	120.2	4016	65.4	17.76	3048	1.50	732.8	2.942	-4.0	0.022	120	46.0	142.0	7.16	5.44	3.00
A	5.460000	23.98000	0.2046667	65.8	136.6	0.010	0.04	0.06	-100	116.8	45.6	0	0	100	119.2	4010	66.2	17.76	3080	1.50	730.0	2.942	-3.6	0.022	120	46.0	142.6	7.14	5.54	3.08
A	5.440000	23.92000	0.2360000	69.4	142.0	0.050	0.16	0.08	-100	116.4	46.0	0	0	98	120.8	4012	66.4	21.46	3102	1.52	731.8	2.992	-3.6	0.022	120	46.0	142.4	7.14	5.56	3.06
D	5.480000	23.90667	0.1786667	70.4	141.0	0.090	0.22	0.04	-100	118.0	46.0	0	0	74	120.8	4012	65.0	19.68	2900	1.70	732.2	3.440	-4.4	0.052	120	46.0	142.6	7.68	5.58	3.32
D	5.473333	23.92000	0.3473333	74.2	145.2	NA	0.50	0.08	-100	125.8	46.8	0	0	78	123.2	4010	65.2	16.74	2880	1.72	745.4	3.490	-4.4	0.050	120	46.0	141.8	7.70	5.60	3.32

We will now check the summary statistics for the evaluation data (the output below is in the format produced by summary()).

##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd.Pressure1    Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure.Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :141.2  
##  1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03370   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04666   Mean   :109.6   Mean   :47.73     Mean   :142.8  
##  3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8  
##  Max.   :0.39800   Max.   :130.0   Max.   :52.00     Max.   :147.2  
##  NA's   :3         NA's   :1       NA's   :2         NA's   :1      
##     Alch.Rel        Carb.Rel     Balling.Lvl   
##  Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :6.580   Median :5.40   Median :1.480  
##  Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :3       NA's   :2

The summary statistics show that most columns in the evaluation dataset contain missing values (MFR has the most, with 31 of 267 rows missing), so we will need to impute these later in the project.
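
The NA counts are scattered across the summary output, so a compact per-column table can be easier to read. The sketch below assumes a hypothetical helper, missing_by_column(), and a toy data frame. Neither is part of the original analysis.

```r
# Hypothetical helper: count and rank missing values per column.
missing_by_column <- function(df) {
  n.missing <- colSums(is.na(df))
  out <- data.frame(
    variable    = names(n.missing),
    n_missing   = as.integer(n.missing),
    pct_missing = round(100 * as.numeric(n.missing) / nrow(df), 1),
    stringsAsFactors = FALSE
  )
  # Keep only columns with at least one missing value, most affected first.
  out <- out[out$n_missing > 0, , drop = FALSE]
  out <- out[order(-out$n_missing), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data standing in for the evaluation set; values are illustrative only.
demo <- data.frame(
  MFR          = c(727.6, NA, NA, 735.8),
  Filler.Speed = c(3986, 4012, NA, 4010),
  Carb.Flow    = c(2950, 2916, 3056, 28)
)

missing.summary <- missing_by_column(demo)
print(missing.summary)
```

Calling the same helper on beverage.eval would list the affected columns in one place, which is a convenient starting point when deciding how to impute.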

Our next step is to examine the training dataset in detail as this is the main dataset that we will be working with throughout the project.

Training Dataset

First, we will look at the first 40 observations in the dataset to get a feel for the data. We will then explore the structure of the data using the str() function, which reports the number of observations and variables and the type of each variable. The leading values it prints also reveal that some entries are missing.

# Preview the first 40 rows of the training dataset.
head(beverage.train, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')

Brand.Code	Carb.Volume	Fill.Ounces	PC.Volume	Carb.Pressure	Carb.Temp	PSC	PSC.Fill	PSC.CO2	Mnf.Flow	Carb.Pressure1	Fill.Pressure	Hyd.Pressure2	Hyd.Pressure3	Hyd.Pressure4	Filler.Level	Filler.Speed	Temperature	Usage.cont	Carb.Flow	Density	MFR	Balling	Pressure.Vacuum	PH	Oxygen.Filler	Bowl.Setpoint	Pressure.Setpoint	Air.Pressurer	Alch.Rel	Carb.Rel	Balling.Lvl
B	5.340000	23.96667	0.2633333	68.2	141.2	0.104	0.26	0.04	-100	118.8	46.0	NA	NA	118	121.2	4002	66.0	16.18	2932	0.88	725.0	1.398	-4.0	8.36	0.022	120	46.4	142.6	6.58	5.32	1.48
A	5.426667	24.00667	0.2386667	68.4	139.6	0.124	0.22	0.04	-100	121.6	46.0	NA	NA	106	118.6	3986	67.6	19.90	3144	0.92	726.8	1.498	-4.0	8.26	0.026	120	46.8	143.0	6.56	5.30	1.56
B	5.286667	24.06000	0.2633333	70.8	144.8	0.090	0.34	0.16	-100	120.2	46.0	NA	NA	82	120.0	4020	67.0	17.76	2914	1.58	735.0	3.142	-3.8	8.94	0.024	120	46.6	142.0	7.66	5.84	3.28
A	5.440000	24.00667	0.2933333	63.0	132.6	NA	0.42	0.04	-100	115.2	46.4	0	0	92	117.8	4012	65.6	17.42	3062	1.54	730.6	3.042	-4.4	8.24	0.030	120	46.0	146.2	7.14	5.42	3.04
A	5.486667	24.31333	0.1113333	67.2	136.8	0.026	0.16	0.12	-100	118.4	45.8	0	0	92	118.6	4010	65.6	17.68	3054	1.54	722.8	3.042	-4.4	8.26	0.030	120	46.0	146.2	7.14	5.44	3.04
A	5.380000	23.92667	0.2693333	66.6	138.4	0.090	0.24	0.04	-100	119.6	45.6	0	0	116	120.2	4014	66.2	23.82	2948	1.52	738.8	2.992	-4.4	8.32	0.024	120	46.0	146.6	7.16	5.44	3.02
A	5.313333	23.88667	0.2680000	64.2	136.8	0.128	0.40	0.04	-100	122.2	51.8	0	0	124	123.4	NA	65.8	20.74	30	0.84	NA	1.298	-4.4	8.40	0.066	120	46.0	146.2	6.54	5.38	1.44
B	5.320000	24.17333	0.2206667	67.6	141.4	0.154	0.34	0.04	-100	124.2	46.8	0	0	132	118.6	1004	65.2	18.96	684	0.84	NA	1.298	-4.4	8.38	0.046	120	46.0	146.4	6.52	5.34	1.44
B	5.246667	23.98000	0.2626667	64.2	140.2	0.132	0.12	0.14	-100	120.8	46.0	0	0	90	120.2	4014	65.4	18.40	2902	0.90	740.4	1.446	-4.4	8.38	0.064	120	46.0	147.2	6.52	5.34	1.44
B	5.266667	24.00667	0.2313333	72.0	147.4	0.014	0.24	0.06	-100	119.8	45.2	0	0	108	120.8	4028	66.6	13.50	3038	0.90	692.4	1.448	-4.4	8.50	0.022	120	46.0	146.2	6.54	5.34	1.38
B	5.320000	23.92000	0.2586667	66.2	139.4	0.078	0.18	0.04	-100	119.6	46.6	0	0	94	119.6	4020	65.0	19.04	3056	0.90	727.0	1.448	-4.4	8.34	0.030	120	46.0	146.2	6.52	5.34	1.44
B	5.353333	24.06667	0.2513333	61.6	132.8	0.110	0.18	0.02	-100	119.2	46.6	0	0	86	119.6	4012	65.4	18.44	3110	0.92	735.0	1.498	-4.4	8.34	0.058	120	46.0	146.8	6.52	5.34	1.44
B	5.220000	23.89333	0.2673333	63.4	141.0	0.114	0.38	NA	-100	117.4	45.4	0	0	98	121.0	4012	65.0	17.12	2870	0.92	729.6	1.498	-4.4	8.34	0.048	120	46.0	146.0	6.52	5.34	1.46
B	5.266667	23.89333	0.2286667	71.6	147.8	0.096	0.22	0.04	-100	113.6	46.0	0	0	94	120.0	4012	65.0	23.44	3040	0.92	731.0	1.498	-4.4	8.38	0.046	120	46.0	146.8	6.52	5.34	1.44
B	5.266667	23.87333	0.3340000	72.6	148.0	0.160	0.36	0.08	-100	120.2	46.6	0	0	92	120.0	4010	65.0	21.16	3056	0.90	732.4	1.448	-4.4	8.40	0.066	120	46.0	146.6	6.52	5.34	1.44
B	5.286667	23.86667	0.2566667	68.0	143.2	0.034	0.16	0.02	-100	129.0	47.4	0	0	96	119.8	4010	65.2	19.88	3290	0.90	731.0	1.448	-4.4	8.42	0.046	120	46.0	146.0	6.52	5.34	1.42
C	5.226667	23.69333	0.3166667	63.8	138.2	0.124	0.20	0.06	-100	123.4	48.8	0	0	92	120.2	1624	68.8	17.02	3200	0.46	295.8	0.346	-4.2	8.58	0.164	120	46.0	146.6	6.52	5.34	1.46
B	5.353333	23.99333	0.2793333	64.8	137.0	0.146	0.06	0.02	-100	115.6	46.4	0	0	94	120.4	4012	65.2	21.82	3082	0.88	726.4	1.398	-4.0	8.50	0.046	120	46.0	146.8	6.52	5.36	1.46
B	5.366667	24.09333	0.2613333	70.6	143.8	0.220	0.48	0.08	-100	121.4	47.0	0	0	98	116.4	3060	65.4	20.32	3324	0.84	535.8	1.298	-4.0	8.44	0.064	120	46.0	146.8	6.52	5.28	1.44
C	5.213333	23.98667	0.2353333	62.6	140.8	0.246	0.10	0.20	-100	119.6	45.4	0	0	102	120.2	4012	67.8	16.44	2970	0.86	731.8	1.348	-4.0	8.30	0.046	120	46.0	146.2	6.62	5.34	1.38
C	5.220000	24.26000	0.1120000	66.8	143.4	0.042	0.08	0.06	-100	116.6	46.4	0	0	94	121.0	4010	65.4	16.56	3090	0.94	726.4	1.548	-4.2	8.42	0.022	120	46.0	146.2	6.52	5.34	1.52
B	5.333333	24.09333	0.3046667	66.0	139.4	0.060	0.06	0.08	-100	130.2	44.2	0	0	130	100.2	1008	69.8	21.98	30	0.74	NA	1.048	-4.0	8.48	NA	100	50.0	147.0	6.50	5.40	1.48
B	5.340000	23.98667	0.2120000	68.2	142.2	0.038	0.16	0.04	-100	113.6	51.4	0	0	100	96.8	2936	66.6	19.36	3418	0.82	519.0	1.248	-4.2	8.52	0.254	100	50.0	146.8	6.50	5.38	1.48
B	5.413333	23.98667	0.2926667	70.0	142.8	0.124	0.02	0.04	-100	118.2	50.2	0	0	96	100.4	4016	66.0	24.00	3206	0.92	732.6	1.496	-4.2	8.44	0.084	100	50.0	146.4	6.52	5.38	1.48
B	5.373333	24.02000	0.2813333	68.0	141.0	0.102	0.26	0.02	-100	119.2	50.0	0	0	90	100.2	4010	66.2	21.58	3220	0.90	734.4	1.448	-4.2	8.44	0.064	100	50.0	147.2	6.50	5.28	1.50
B	5.313333	23.98667	0.2940000	68.2	142.2	0.052	0.18	0.10	-100	118.0	50.0	0	0	94	99.8	4016	66.0	20.72	3206	0.92	732.8	1.496	-4.2	8.40	0.206	100	50.0	146.6	6.50	NA	1.50
B	5.360000	24.02667	0.2780000	67.0	139.8	0.080	0.34	0.04	-100	115.2	50.2	0	0	102	100.2	4014	68.2	21.60	3168	0.86	740.4	1.348	-4.2	8.42	0.096	100	50.0	146.8	6.48	5.38	1.48
B	5.446667	24.02000	0.0900000	70.8	142.6	0.012	0.34	0.02	-100	124.4	50.0	0	0	96	100.8	4012	65.6	23.58	3138	0.88	729.4	1.398	-4.2	8.42	0.090	100	50.0	147.0	6.50	5.38	1.48
B	5.380000	24.07333	0.2180000	66.6	138.8	0.040	0.18	0.04	-100	116.4	50.0	0	0	92	100.0	4010	65.8	21.40	3212	0.90	731.0	1.448	-4.2	8.40	0.064	100	50.0	147.4	6.52	5.40	1.46
B	5.393333	24.08667	0.2120000	65.8	137.4	0.102	0.10	0.02	-100	118.6	50.0	0	0	102	100.6	4016	66.4	18.32	3164	0.90	734.2	1.448	-4.2	8.44	0.084	100	50.0	147.0	6.50	5.40	1.46
B	5.406667	24.11333	0.2220000	67.4	138.6	0.128	0.22	0.06	-100	116.8	49.8	0	0	100	100.4	4014	65.8	21.16	3194	0.90	731.6	1.448	-4.2	8.36	0.096	100	50.0	146.6	6.50	5.38	1.46
B	5.366667	24.09333	0.2106667	70.8	144.2	0.068	0.04	0.08	-100	115.2	50.2	0	0	96	100.2	4010	65.8	22.90	3182	0.90	731.0	1.398	-4.2	8.36	0.084	100	50.0	146.2	6.50	5.38	1.46
B	5.300000	24.07333	0.1860000	69.6	143.8	0.052	0.42	0.24	-100	121.2	50.4	0	0	94	110.0	4014	65.8	19.18	3214	0.90	730.6	1.448	-4.2	8.40	0.082	110	50.0	146.4	6.50	5.38	1.46
B	5.360000	24.08000	0.1546667	68.6	141.6	0.088	0.04	0.06	-100	119.6	50.0	0	0	100	109.4	4010	65.8	15.88	3198	0.92	740.0	1.496	-4.2	8.38	0.062	110	50.0	147.0	6.52	5.38	1.50
B	5.366667	24.04667	0.1326667	68.2	141.0	0.112	0.34	0.16	-100	118.0	50.2	0	0	94	110.0	4010	65.8	18.54	3220	0.92	733.8	1.496	-4.2	8.44	0.064	110	50.0	145.8	6.52	5.38	1.50
B	5.373333	24.00667	0.3160000	69.4	142.8	NA	0.28	0.02	-100	120.4	49.8	0	0	96	110.2	4012	67.4	19.98	3208	0.90	728.4	1.398	-4.0	8.36	0.064	110	50.0	146.0	6.50	5.36	1.50
B	5.346667	23.98667	0.2280000	68.8	142.4	0.164	0.26	0.06	-100	121.4	43.0	0	0	120	120.0	1006	67.0	13.66	1464	0.86	NA	1.346	-4.0	8.40	0.080	120	50.0	148.2	6.50	5.40	1.50
B	5.373333	24.01333	0.2000000	65.2	137.4	0.112	0.34	0.02	-100	116.0	50.2	0	0	96	120.8	4014	66.4	21.98	3222	0.88	702.4	1.398	-4.0	8.38	0.060	120	50.0	147.0	6.50	5.40	1.48
B	5.326667	24.06000	0.2393333	75.8	151.4	0.080	0.08	0.02	-100	117.0	45.8	0	0	94	119.6	4012	66.6	18.12	2986	0.86	732.6	1.346	-3.8	8.36	0.082	120	46.0	145.8	6.50	5.42	1.44
C	5.273333	23.86000	0.1126667	65.6	140.2	0.050	0.10	0.04	-100	119.4	45.8	0	0	92	119.2	4014	67.2	13.78	2976	0.92	722.0	1.496	-3.8	8.28	0.062	120	46.0	146.2	6.64	5.38	1.60

# Examine the structure of the training data.
str(beverage.train)

## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand.Code       : chr  "B" "A" "B" "A" ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC.Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb.Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb.Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  119 122 120 115 118 ...
##  $ Fill.Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure.Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

The output of str() shows that the training dataset consists of 33 columns and 2571 observations. All of the variables are numeric except Brand.Code, which is categorical. Another important finding is that some of the variables contain missing values.

Excluding the target column, PH, the training dataset contains 32 predictor variables: 1 categorical variable and 31 numeric (continuous and discrete) variables. There are 2571 records in the training data and 267 records in the evaluation dataset.
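
The column-type counts quoted above can also be derived programmatically rather than read off the str() output. This is a minimal sketch on a toy data frame (train.demo is a hypothetical stand-in with only four columns).

```r
# Hypothetical stand-in for the training data; values are illustrative only.
train.demo <- data.frame(
  Brand.Code   = c("B", "A", "B"),
  Carb.Volume  = c(5.34, 5.43, 5.29),
  Filler.Speed = c(4002L, 3986L, 4020L),
  PH           = c(8.36, 8.26, 8.94),
  stringsAsFactors = FALSE
)

# Classify each column as numeric (integer or double) or not.
is.num <- vapply(train.demo, is.numeric, logical(1))

n.numeric     <- sum(is.num)
n.categorical <- sum(!is.num)

# Predictors are all columns except the target, PH.
n.predictors <- length(setdiff(names(train.demo), "PH"))

cat("numeric:", n.numeric,
    "categorical:", n.categorical,
    "predictors:", n.predictors, "\n")
```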

The data has the following variables:

Brand Code: Categorical, values: A, B, C, D
Carb Volume: Numeric
Fill Ounces: Numeric
PC Volume: Numeric
Carb Pressure: Numeric
Carb Temp: Numeric
PSC: Numeric
PSC Fill: Numeric
PSC CO2: Numeric
Mnf Flow: Numeric
Carb Pressure1: Numeric
Fill Pressure: Numeric
Hyd Pressure1: Numeric
Hyd Pressure2: Numeric
Hyd Pressure3: Numeric
Hyd Pressure4: Numeric
Filler Level: Numeric
Filler Speed: Numeric
Temperature: Numeric
Usage cont: Numeric
Carb Flow: Numeric
Density: Numeric
MFR: Numeric
Balling: Numeric
Pressure Vacuum: Numeric
PH: Numeric. This is the TARGET variable that has to be predicted.
Oxygen Filler: Numeric
Bowl Setpoint: Numeric
Pressure Setpoint: Numeric
Air Pressurer: Numeric
Alch Rel: Numeric
Carb Rel: Numeric
Balling Lvl: Numeric
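
Because Brand Code is the only categorical variable, one common option is to encode it as a factor with the four levels listed above. The sketch below is illustrative only: the sample vector is hypothetical, and the report does not state at this point whether or how Brand.Code is recoded.

```r
# Hypothetical sample of brand codes, including one missing value.
brand.code <- c("B", "A", "B", NA, "C", "D", "B")

# Encode as a factor with the four brand levels.
brand.factor <- factor(brand.code, levels = c("A", "B", "C", "D"))

# Tabulate the levels, keeping missing values visible.
brand.counts <- table(brand.factor, useNA = "ifany")
print(brand.counts)
```

Keeping the NA count visible in the table matters here, since missing brand codes will need the same attention as missing numeric values.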

Now let’s check the summary statistics for the training data.

##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air.Pressurer      Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 2571      33                1                 32                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                  844          2038              84843       645352
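
The figures in the overview above fit together by simple arithmetic: 2571 rows times 33 columns gives 84843 cells, of which 844 (just under 1%) are missing, and 2038 of the 2571 rows (about 79%) have no missing values at all. The sketch below reproduces that arithmetic and shows how the same two counts could be computed in base R on a toy data frame (demo is hypothetical).

```r
# Figures reported in the overview above.
rows          <- 2571
columns       <- 33
total.missing <- 844
complete.rows <- 2038

total.cells      <- rows * columns                  # total_observations
pct.missing      <- 100 * total.missing / total.cells
pct.complete.row <- 100 * complete.rows / rows

cat(sprintf("cells: %.0f, missing: %.2f%%, complete rows: %.1f%%\n",
            total.cells, pct.missing, pct.complete.row))

# The same two counts computed from a data frame (toy data, illustrative only).
demo <- data.frame(
  Brand.Code = c("B", NA, "A", "C"),
  MFR        = c(725.0, 726.8, NA, 735.0),
  PH         = c(8.36, 8.26, 8.94, 8.24),
  stringsAsFactors = FALSE
)
demo.total.missing <- sum(is.na(demo))            # missing cells
demo.complete.rows <- sum(complete.cases(demo))   # rows with no NA
```

Although only about 1% of cells are missing, roughly one row in five has at least one gap, which is why imputation is preferable to dropping incomplete rows.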

Data summary

Name	beverage.train
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand.Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill.Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill.Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd.Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd.Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd.Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler.Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb.Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure.Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl.Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure.Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch.Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb.Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling.Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

From the above, we see that all but two of the predictors (Pressure.Vacuum and Air.Pressurer) contain missing data and will therefore need to be imputed. The target variable (PH) is missing in 4 rows. These rows will need to be dropped since they cannot be used for training.

Let’s look at the distribution of the target variable next.

The above histogram reveals that the target variable is not strongly skewed, even though there are some outliers. The minimum value for PH is 7.88 and the maximum value is 9.36, so every recorded value is above the neutral point of 7. This indicates that ABC manufactures relatively alkaline beverages, possibly green tea or fruit and vegetable juices, although the data do not identify the product types.

Missing Values
variable	n_miss	pct_miss
MFR	212	8.2458187
Brand.Code	120	4.6674446
Filler.Speed	57	2.2170362
PC.Volume	39	1.5169195
PSC.CO2	39	1.5169195
Fill.Ounces	38	1.4780241
PSC	33	1.2835473
Carb.Pressure1	32	1.2446519
Hyd.Pressure4	30	1.1668611
Carb.Pressure	27	1.0501750
Carb.Temp	26	1.0112797
PSC.Fill	23	0.8945935
Fill.Pressure	22	0.8556982
Filler.Level	20	0.7779074
Hyd.Pressure2	15	0.5834306
Hyd.Pressure3	15	0.5834306
Temperature	14	0.5445352
Oxygen.Filler	12	0.4667445
Pressure.Setpoint	12	0.4667445
Hyd.Pressure1	11	0.4278491
Carb.Volume	10	0.3889537
Carb.Rel	10	0.3889537
Alch.Rel	9	0.3500583
Usage.cont	5	0.1944769
PH	4	0.1555815
Mnf.Flow	2	0.0777907
Carb.Flow	2	0.0777907
Bowl.Setpoint	2	0.0777907
Density	1	0.0388954
Balling	1	0.0388954
Balling.Lvl	1	0.0388954
Pressure.Vacuum	0	0.0000000
Air.Pressurer	0	0.0000000

The above statistics tell us that about 8.25% of the records are missing a value for MFR. We may need to drop this feature, since the more values we have to impute, the more the imputed values could distort the variable.

The second most incomplete variable is the categorical variable “Brand Code”, which is missing about 4.67% of its values. These records could belong to a fifth brand besides the existing A, B, C and D, or to one of the existing four brands. In either case, we will create a new category, ‘Unknown’, for these records. The rest of the predictors are missing smaller percentages of values, and we can use imputation for these records.
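The missing-value summary above can be approximated in base R. The following is a minimal sketch on a made-up toy data frame (`toy` and its values are illustrative assumptions, not the real data), and it is not necessarily how the table above was produced.

```r
# Toy data frame standing in for beverage.train (values are made up).
toy <- data.frame(
  Brand.Code = c("A", NA, "B", "C", NA),
  MFR        = c(720, NA, 731, NA, 706),
  PH         = c(8.4, 8.6, NA, 8.5, 8.7),
  stringsAsFactors = FALSE
)

# Count missing values per column and convert to a percentage of rows.
n_miss   <- colSums(is.na(toy))
pct_miss <- 100 * n_miss / nrow(toy)

missing_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = as.integer(n_miss),
  pct_miss = as.numeric(pct_miss)
)

# Sort so the most incomplete variables come first.
missing_summary <- missing_summary[order(-missing_summary$n_miss), ]
print(missing_summary)
```

Running the same three steps on the real training data should give a table with the same columns as the one shown above.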

From the above plots, we can see that a lot of the predictors are significantly skewed, suggesting that we might need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. We also see a number of bimodal variables such as Carb Flow, Balling, and Balling Level.

Boxplots

We now use boxplots to check the spread of each predictor.

The boxplots reveal outliers, but we don’t have a strong reason to impute or drop them from the dataset.

Predictor-Target Correlations

We will now derive the correlations for the numeric predictors. This will enable us to focus on those predictors that show stronger positive or negative correlations with PH. Predictors with correlations closer to zero will most likely not provide any meaningful information for the target variable.

##          values               ind
## 1   0.361587534     Bowl.Setpoint
## 2   0.352043962      Filler.Level
## 3   0.233593699         Carb.Flow
## 4   0.219735497   Pressure.Vacuum
## 5   0.196051481          Carb.Rel
## 6   0.166682228          Alch.Rel
## 7   0.164485364     Oxygen.Filler
## 8   0.109371168       Balling.Lvl
## 9   0.098866734         PC.Volume
## 10  0.095546936           Density
## 11  0.076700227           Balling
## 12  0.076213407     Carb.Pressure
## 13  0.072132509       Carb.Volume
## 14  0.032279368         Carb.Temp
## 15 -0.007997231     Air.Pressurer
## 16 -0.023809796          PSC.Fill
## 17 -0.040882953      Filler.Speed
## 18 -0.045196477               MFR
## 19 -0.047066423     Hyd.Pressure1
## 20 -0.069873041               PSC
## 21 -0.085259857           PSC.CO2
## 22 -0.118335903       Fill.Ounces
## 23 -0.118764185    Carb.Pressure1
## 24 -0.171434026     Hyd.Pressure4
## 25 -0.182659650       Temperature
## 26 -0.222660048     Hyd.Pressure2
## 27 -0.268101792     Hyd.Pressure3
## 28 -0.311663908 Pressure.Setpoint
## 29 -0.316514463     Fill.Pressure
## 30 -0.357611993        Usage.cont
## 31 -0.459231253          Mnf.Flow

From the above, we can see that the variables Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH. The remaining features have correlations closer to zero, which suggests they carry less linear predictive information on their own.
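As a rough illustration of how such a ranking can be derived, here is a minimal base R sketch on simulated data. The data frame `toy`, its coefficients, and the `Noise` column are assumptions made for the example. Only the column names `Mnf.Flow`, `Bowl.Setpoint`, and `PH` are borrowed from the dataset.

```r
set.seed(1)
n <- 200

# Simulated predictors; PH is built to depend negatively on Mnf.Flow
# and positively on Bowl.Setpoint (made-up coefficients).
toy <- data.frame(
  Mnf.Flow      = rnorm(n),
  Bowl.Setpoint = rnorm(n),
  Noise         = rnorm(n)
)
toy$PH <- 8.5 - 0.10 * toy$Mnf.Flow + 0.08 * toy$Bowl.Setpoint + rnorm(n, sd = 0.1)

# Introduce a couple of missing values to mimic the real data.
toy$Mnf.Flow[c(3, 17)] <- NA

# Correlate each predictor with PH, using only rows where both are observed.
predictors <- setdiff(names(toy), "PH")
cors <- sapply(predictors, function(p) {
  cor(toy[[p]], toy$PH, use = "pairwise.complete.obs")
})

# Rank from most positive to most negative.
ranked <- sort(cors, decreasing = TRUE)
print(ranked)
```

Note that these are pairwise linear correlations, so a predictor with a weak correlation can still be useful to a non-linear model.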

Multicollinearity

One problem that can occur with multiple regression and other models is a correlation between predictors or multicollinearity. A quick check is to run correlations between all predictors.

We can see that some pairs of predictors are highly correlated with one another, such as Balling Level and Carb Volume, Carb Rel and Alch Rel, and Density and Balling, with correlations between 0.75 and 1. When we start examining predictors for our models, we’ll have to consider the correlations between them and avoid including pairs with strong correlations.

In general, it looks like many of the predictors go hand-in-hand with other features and multicollinearity could be a problem.

## 6 variables from the 31 input variables have collinearity problem: 
##  
## Balling Bowl.Setpoint Balling.Lvl MFR Hyd.Pressure3 Alch.Rel 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( Pressure.Setpoint ~ PC.Volume ):  0.0001991472 
## max correlation ( Carb.Rel ~ Density ):  0.852689 
## 
## ---------- VIFs of the remained variables -------- 
##            Variables       VIF
## 1        Carb.Volume 17.159340
## 2        Fill.Ounces  1.153764
## 3          PC.Volume  1.685901
## 4      Carb.Pressure 43.298925
## 5          Carb.Temp 35.460787
## 6                PSC  1.155980
## 7           PSC.Fill  1.109449
## 8            PSC.CO2  1.064636
## 9           Mnf.Flow  4.262754
## 10    Carb.Pressure1  1.434406
## 11     Fill.Pressure  3.490621
## 12     Hyd.Pressure1  2.935470
## 13     Hyd.Pressure2  4.900597
## 14     Hyd.Pressure4  1.752686
## 15      Filler.Level  2.618103
## 16      Filler.Speed  1.273500
## 17       Temperature  1.151185
## 18        Usage.cont  1.718776
## 19         Carb.Flow  1.987496
## 20           Density  4.499376
## 21   Pressure.Vacuum  2.054866
## 22     Oxygen.Filler  1.561878
## 23 Pressure.Setpoint  3.300894
## 24     Air.Pressurer  1.167671
## 25          Carb.Rel  6.339500

The vifcor function from the usdm package allows us to do an early analysis of multicollinearity. As can be seen from the above, this function flags 6 of the 31 numeric predictors as having a collinearity problem.
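To make the VIF column easier to interpret, the sketch below computes variance inflation factors by hand in base R, using the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The variables `x1`, `x2`, and `x3` are simulated assumptions, and this is an illustration of the definition rather than the usdm implementation.

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a VIF.
vif_manual <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

A common rule of thumb treats VIFs above roughly 10 as a warning sign. By that yardstick, Carb.Volume, Carb.Pressure, and Carb.Temp in the output above would still deserve attention even after the six flagged variables are excluded.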

Near-Zero Variance Predictors

Lastly, we want to check for any features that show near zero-variance. Predictors that are the same across most of the instances will add little predictive information.

##               freqRatio percentUnique zeroVar  nzv
## Hyd.Pressure1  31.11111      9.529366   FALSE TRUE

Since “Hyd Pressure1” displays near-zero variance, we will drop this feature prior to modeling.
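The two metrics in the output above can be reproduced by hand. The sketch below is a base R approximation: `nzv_metrics` is a hypothetical helper, the toy vector is made up, and the cutoffs (frequency ratio above 95/19 and at most 10% unique values) are the defaults we believe caret's nearZeroVar() uses.

```r
# Hypothetical helper: frequency ratio and percent of unique values.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of all observations.
  pct_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

# Toy column: 95 zeros followed by five distinct non-zero readings.
toy <- c(rep(0, 95), 11.4, 12.0, 20.2, 30.5, 58.0)
m <- nzv_metrics(toy)

# Flag as near-zero variance when both conditions hold (assumed defaults).
flag <- m[["freqRatio"]] > 95 / 19 && m[["percentUnique"]] <= 10

print(m)
print(flag)
```

Under these cutoffs, the values reported for Hyd.Pressure1 (a frequency ratio of about 31.1 and about 9.5% unique values) meet both conditions.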

Data Preparation

To summarize our data preparation and exploration, we distinguish our findings into a few categories below.

Removed Fields

MFR has more than 8% missing values, so we can remove this predictor.
Hyd Pressure1 shows little variance, so we can remove this predictor.
We had 4 rows with missing PH that need to be removed.
We replace missing values for Brand Code with “Unknown”.
Impute remaining missing values using Predictive mean matching via the mice package.
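The first four steps above can be sketched in base R as follows. This is an illustration on a made-up toy data frame (`toy` and `toy.clean` are assumed names), not the code that produced beverage.train.clean.

```r
# Toy stand-in for beverage.train (values are made up).
toy <- data.frame(
  Brand.Code    = c("A", NA, "B", "D", "C"),
  MFR           = c(720, NA, 731, 706, NA),
  Hyd.Pressure1 = c(0, 0, 0, 11.4, 0),
  Carb.Volume   = c(5.3, 5.4, NA, 5.5, 5.2),
  PH            = c(8.4, 8.6, 8.5, NA, 8.7),
  stringsAsFactors = FALSE
)

# 1. Drop the two predictors flagged during exploration.
toy.clean <- toy[, !(names(toy) %in% c("MFR", "Hyd.Pressure1"))]

# 2. Drop rows where the target is missing.
toy.clean <- toy.clean[!is.na(toy.clean$PH), ]

# 3. Recode missing brand codes as their own category.
toy.clean$Brand.Code[is.na(toy.clean$Brand.Code)] <- "Unknown"
toy.clean$Brand.Code <- factor(toy.clean$Brand.Code)

# Remaining NAs (here in Carb.Volume) are left for the imputation step.
str(toy.clean)
```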

Imputing Missing Values

31 of the 33 variables (30 of the 32 predictors, plus PH) contain missing values, in quantities ranging from 1 to 212. This is enough to justify imputation. Rather than removing entire observations with missing values and jeopardizing the accuracy of the data, we will use the mice package’s mice() function to impute them.

The mice package offers an array of imputation methods (predictive mean matching, mean, and norm, to name a few). Because the dataset contains both numeric and categorical variables, we have decided to use predictive mean matching, which mice supports for both variable types.

set.seed(200)

# Impute missing values in training data using the Predictive mean matching imputation method.
beverage.train.clean <- mice(beverage.train.clean, m = 1, method = 'pmm', print = FALSE) %>% complete()

# After imputation, check if any missing values remain.
colSums(is.na(beverage.train.clean))

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4      Filler.Level 
##                 0                 0                 0                 0 
##      Filler.Speed       Temperature        Usage.cont         Carb.Flow 
##                 0                 0                 0                 0 
##           Density           Balling   Pressure.Vacuum                PH 
##                 0                 0                 0                 0 
##     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint     Air.Pressurer 
##                 0                 0                 0                 0 
##          Alch.Rel          Carb.Rel       Balling.Lvl 
##                 0                 0                 0

# Impute missing values in the evaluation data using the Predictive mean matching imputation method.
beverage.eval.clean <- mice(beverage.eval.clean, m = 1, method = 'pmm', print = FALSE) %>% complete()

As per the above results, we can confirm that the missing values have been eliminated after imputation.
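For intuition about what predictive mean matching does, here is a deliberately simplified base R sketch for a single variable. It omits parts of the mice algorithm (for example, mice also draws the regression parameters at random), and all data and names (`x`, `y`, `k`) are simulated assumptions.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)
y[sample(n, 10)] <- NA  # knock out 10 values to impute

obs <- !is.na(y)

# Fit a regression on the observed cases and predict for every case.
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

k <- 5  # number of candidate donors per missing value
y_imputed <- y
for (i in which(!obs)) {
  # Donors: observed cases whose predicted value is closest to case i's.
  donors <- which(obs)[order(abs(pred[obs] - pred[i]))[1:k]]
  # Borrow the observed value of one randomly chosen donor.
  y_imputed[i] <- y[donors[sample.int(k, 1)]]
}

summary(y_imputed)
```

Because every imputed value is copied from a real observation, the imputed values always stay within the observed range of the variable.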

Convert Categorical Variable to Dummy variables

“Brand.Code” is a categorical variable with values A, B, C, D and Unknown, so we will convert it to a set of dummy variables for modeling.
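One way to build such dummy variables in base R is model.matrix(). The sketch below is an assumption about a workable approach on a toy factor, not necessarily the function used in this report. Dropping the intercept keeps one indicator column per level.

```r
# Toy brand codes covering all five categories (made-up order).
brand <- factor(c("A", "B", "C", "D", "Unknown", "B"))

# "- 1" removes the intercept so every level gets its own 0/1 column.
dummies <- model.matrix(~ Brand.Code - 1,
                        data = data.frame(Brand.Code = brand))

colnames(dummies)
head(dummies)
```

Keeping all five indicator columns is convenient for tree and rule-based models. For an ordinary linear model, one column is usually dropped to avoid perfect collinearity with the intercept.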

Transform Predictors With Skewed Distributions

As discussed earlier, some of the predictors are highly skewed. To address this, we scale, center, and apply the Box-Cox transformation to them using the “preProcess” function from the “caret” package. These transformations should result in more normal distributions.

## Created from 267 samples and 34 variables
## 
## Pre-processing:
##   - Box-Cox transformation (22)
##   - centered (34)
##   - ignored (0)
##   - scaled (34)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.000  -2.000   0.100  -0.300   0.675   2.000
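To show what the Box-Cox step does to a skewed variable, here is a minimal base R sketch. The helpers `box_cox` and `skew` are hypothetical, and the data are simulated. The Box-Cox transformation is only defined for strictly positive values, which may be why the output above reports the transformation for only 22 of the 34 variables.

```r
# Box-Cox transform for a fixed lambda (lambda = 0 is the log transform).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple sample skewness, used here only to compare before and after.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(4)
x <- rlnorm(500)  # strongly right-skewed, strictly positive

# Transform, then center and scale, mirroring the three preprocessing steps.
x_bc <- box_cox(x, lambda = 0)
z    <- as.numeric(scale(x_bc))

c(skew_before = skew(x), skew_after = skew(z))
```

In practice the value of lambda is estimated from the data for each variable. The sketch fixes it at 0 only because the simulated data are log-normal.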

Here are some plots to demonstrate the changes in distributions after the transformations:

# Prepare data for ggplot.
gather_df <- bev.train.transformed %>% dplyr::select(-c(PH)) %>% gather(key = 'variable', value = 'value')

# Histogram plots of each variable.
ggplot(gather_df) + geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(. ~variable, scales = 'free', ncol = 4)

As expected, the plots of the dummy variables are binary. For the others, we can still see bimodal predictors since we did not apply any feature engineering to them. Some predictors such as ‘PSC Fill’ and ‘Temperature’ still show some skewness, but we can move on to building the models.

Pre-Modeling Data Splitting

Here, we perform a train-test split with an 80:20 ratio.

# Split the training data into train and test sets using an 80% data split.
trainingData <- createDataPartition(bev.train.transformed$PH, p = 0.8, list = FALSE)

# Training data splits.
trainingDataSet <- bev.train.transformed[trainingData, ]
xTrainData <- subset(trainingDataSet, select = -PH)
yTrainData <- subset(trainingDataSet, select = PH)

# Test data splits.
testDataSet <- bev.train.transformed[-trainingData, ]
xTestData <- subset(testDataSet, select = -PH)
yTestData <- subset(testDataSet, select = PH)

Model Building/Fitting

In this section, we will build and run 3 categories of models: tree, linear, and non-linear. We will then compare the results of the models in each category, select the best category performer, and then select the overall best performer.

Non-Linear Models

In the non-linear category, we will build and run 2 models - a Support Vector Machine (SVM) model, and a K-Nearest Neighbors (KNN) model. We will use the caret package’s train() function to build the models, and use the same training and test datasets for both models.

Support Vector Machine (SVM) Model

set.seed(200)

# Define the SVM model.
svmModel = train(x = xTrainData, 
                 y = yTrainData$PH,
                 preProcess = c('center', 'scale'),
                 method = 'svmRadial', 
                 tuneLength = 10,
                 trControl = trainControl(method = 'repeatedcv'))

# Run predict() and postResample() on the model and display the results.
svmPrediction <- predict(svmModel, newdata = xTestData)
svmPerformance <- postResample(pred = svmPrediction, obs = yTestData$PH)
svmPerformance

##       RMSE   Rsquared        MAE 
## 0.10884499 0.59016597 0.08127628

# Start the shared results table and record the SVM test-set performance.
results <- data.frame()
results <- data.frame(t(postResample(pred = svmPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "SVM") %>%
  rbind(results)

K Nearest Neighbors (KNN) Model

set.seed(200)

# Define the KNN model.
knnModel <- train(x = xTrainData,
                  y = yTrainData$PH,
                  preProcess = c('center', 'scale'),
                  method = 'knn',
                  tuneLength = 10)

# Run predict() and postResample() on the model and display the results.
knnPrediction <- predict(knnModel, newdata = xTestData)
knnPerformance <- postResample(pred = knnPrediction, obs = yTestData$PH)
knnPerformance

##      RMSE  Rsquared       MAE 
## 0.1191253 0.5096819 0.0913447

# Append the KNN test-set performance to the results table.
results <- data.frame(t(postResample(pred = knnPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "k-Nearest Neighbors(kNN)") %>%
  rbind(results)

Linear Models

In the linear model category, we will build and run a generalized linear model (GLM), and a partial least squares (PLS) model.

Generalized Linear Model (GLM)

set.seed(200)

# Define the GLM model.
glmModel = train(PH ~ .,
                 data = trainingDataSet, 
                 metric = 'RMSE',
                 preProcess = c('center', 'scale'),
                 method = 'glm', 
                 trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))

# Run predict() and postResample() on the model and display the results.
glmModelPrediction <- predict(glmModel, xTestData)
glmPerformance <- postResample(pred = glmModelPrediction, obs = yTestData$PH)
glmPerformance

##      RMSE  Rsquared       MAE 
## 0.1295556 0.4133405 0.1024145

# Append the GLM test-set performance to the results table.
results <- data.frame(t(postResample(pred = glmModelPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "Generalized Linear Model(GLM)") %>%
  rbind(results)

Partial Least Squares Model (PLS)

set.seed(200)

# Define the PLS model.
plsModel = train(PH ~ .,
                 data = trainingDataSet, 
                 metric = 'RMSE',
                 preProcess = c('center', 'scale'),
                 method = 'pls', 
                 trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))

# Run predict() and postResample() on the model and display the results.
plsModelPrediction <- predict(plsModel, xTestData)
plsPerformance <- postResample(pred = plsModelPrediction, obs = yTestData$PH)
plsPerformance

##      RMSE  Rsquared       MAE 
## 0.1299029 0.4088798 0.1040693

# Append the PLS test-set performance to the results table.
results <- data.frame(t(postResample(pred = plsModelPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "Partial Least Squares(PLS)") %>%
  rbind(results)

Tree Models

In this category, we will build and run a cubist model, and a single tree model.

Cubist Model

set.seed(200)

# Define the Cubist model.
cubistModel <- cubist(xTrainData, 
                      yTrainData$PH, 
                      committees = 6)

# Run predict() and postResample() on the model and display the results.
cubistModelPrediction <- predict(cubistModel, newdata = xTestData)
cubistPerformance <- postResample(pred = cubistModelPrediction, obs = yTestData$PH)
cubistPerformance

##       RMSE   Rsquared        MAE 
## 0.10518426 0.61494856 0.07688106

# Append the Cubist test-set performance to the results table.
results <- data.frame(t(postResample(pred = cubistModelPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "Tree Model(Cubist)") %>%
  rbind(results)

Single Tree Model

set.seed(100)

# Define the Single Tree model.
singleTreeModel <- train(xTrainData,
                         yTrainData$PH,
                         method = 'rpart2',
                         tuneLength = 10,
                         trControl = trainControl(method = 'cv'))

# Run predict() and postResample() on the model and display the results.
singleTreeModelPrediction <- predict(singleTreeModel, newdata = xTestData)
singleTreePerformance <- postResample(pred = singleTreeModelPrediction , obs = yTestData$PH) 
singleTreePerformance

##       RMSE   Rsquared        MAE 
## 0.12522668 0.45164526 0.09759854

# Append the single tree test-set performance to the results table.
results <- data.frame(t(postResample(pred = singleTreeModelPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "Single Tree Model") %>%
  rbind(results)

Model Comparisons

After running our models, we will now compare the results of the models and select the best performing model within each category. This will allow us to select the overall best performing model.

Non-Linear Model Comparisons

nonLinearComparisons <- rbind(
  'Support Vector Machine' = svmPerformance,
  'K Nearest Neighbors' = knnPerformance)

nonLinearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))

	RMSE	Rsquared	MAE
Support Vector Machine	0.1088450	0.5901660	0.0812763
K Nearest Neighbors	0.1191253	0.5096819	0.0913447

Using RMSE and Rsquared as the selection criteria for the best performing model, the support vector machine model yielded the best performance in this category. The Rsquared value of the model is 0.59, which tells us that the model explains about 59% of the variability in PH on the test set. This is higher than the Rsquared value of the KNN model (51%), and the SVM model also has the lower RMSE (0.109 versus 0.119).

Linear Model Comparisons

linearComparisons <- rbind(
  'Generalized Linear Model' = glmPerformance,
  'Partial Least Squares' = plsPerformance)

linearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))

	RMSE	Rsquared	MAE
Generalized Linear Model	0.1295556	0.4133405	0.1024145
Partial Least Squares	0.1299029	0.4088798	0.1040693

Again using RMSE and Rsquared to select the best model, the GLM and PLS models perform almost identically, with the generalized linear model slightly ahead. The GLM explains about 41% of the variance in PH, fractionally more than the PLS model.

Tree Model Comparisons

treeComparisons <- rbind(
  'Cubist' = cubistPerformance,
  'Single Tree' = singleTreePerformance)

treeComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))

	RMSE	Rsquared	MAE
Cubist	0.1051843	0.6149486	0.0768811
Single Tree	0.1252267	0.4516453	0.0975985

Finally, in the tree model category, the cubist model has a lower RMSE than the single tree model and explains 61% of the variance in PH (as opposed to the single tree model’s 45%), so it is the best performing model in this category.

Model Summary

We now consolidate the results from all the models using the following criteria: root mean squared error (RMSE), R-squared, and Mean Absolute Error (MAE). The table below lists these criteria for each model.

results %>% dplyr::select(Model, RMSE, Rsquared, MAE)

##                           Model      RMSE  Rsquared        MAE
## 1             Single Tree Model 0.1252267 0.4516453 0.09759854
## 2            Tree Model(Cubist) 0.1051843 0.6149486 0.07688106
## 3    Partial Least Squares(PLS) 0.1299029 0.4088798 0.10406927
## 4 Generalized Linear Model(GLM) 0.1295556 0.4133405 0.10241455
## 5      k-Nearest Neighbors(kNN) 0.1191253 0.5096819 0.09134470
## 6                           SVM 0.1088450 0.5901660 0.08127628
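For reference, the three criteria can be computed by hand. The sketch below uses made-up observed and predicted values, and `metrics` is a hypothetical helper. It takes Rsquared as the squared correlation between predictions and observations, which is our understanding of what postResample() reports by default.

```r
# Hypothetical helper returning the three comparison criteria.
metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),  # penalizes large errors more
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(obs - pred)))       # average absolute error
}

# Made-up observed and predicted PH values.
obs  <- c(8.40, 8.55, 8.60, 8.70, 8.35)
pred <- c(8.45, 8.50, 8.65, 8.60, 8.40)

m <- metrics(pred, obs)
print(round(m, 4))
```

RMSE and MAE are in PH units, so lower is better, while Rsquared is unitless and higher is better.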

Model Selection And Top Predictor Analysis

Based on the RMSE and Rsquared values of all the models we ran, the Cubist model is the overall best performer, with the lowest RMSE (0.105) and the highest Rsquared (0.61). This is plausible given that this model is more tolerant of multicollinearity and works well with non-linear features. The Rsquared for the Cubist model tells us that it explains about 61% of the variance in PH on the test set. Based on this, we will proceed with the Cubist model as the best predictive model for this project.
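The selection can also be made programmatically. The sketch below re-enters the RMSE and Rsquared values from the summary table (with model labels shortened for the example) and orders the models by RMSE. It is an illustration only, not part of the original pipeline.

```r
# Test-set metrics copied from the model summary table.
results <- data.frame(
  Model    = c("Single Tree", "Cubist", "PLS", "GLM", "kNN", "SVM"),
  RMSE     = c(0.1252267, 0.1051843, 0.1299029, 0.1295556, 0.1191253, 0.1088450),
  Rsquared = c(0.4516453, 0.6149486, 0.4088798, 0.4133405, 0.5096819, 0.5901660),
  stringsAsFactors = FALSE
)

# Order from lowest to highest RMSE; the first row is the best model.
ranked <- results[order(results$RMSE), ]
best   <- ranked$Model[1]

print(ranked)
print(best)
```

Both criteria agree here: the model with the lowest RMSE also has the highest Rsquared.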

Let’s inspect the predictors that this model found important.

var.imp.cubist<-varImp(cubistModel, scale = FALSE)
var.imp.cubist

##                   Overall
## Mnf.Flow             73.0
## Alch.Rel             57.5
## Balling.Lvl          54.5
## Pressure.Vacuum      54.0
## Brand.CodeC          27.5
## Bowl.Setpoint        40.0
## Carb.Flow            36.5
## Filler.Speed         24.0
## Oxygen.Filler        44.5
## Balling              53.5
## Carb.Rel             35.5
## Usage.cont           27.0
## Density              41.5
## Air.Pressurer        32.5
## Hyd.Pressure3        36.5
## Hyd.Pressure2        25.5
## Carb.Pressure1       29.5
## Temperature          29.5
## Filler.Level         16.0
## PC.Volume            10.5
## Pressure.Setpoint    10.5
## Carb.Volume          16.5
## Carb.Pressure        22.5
## Carb.Temp            19.0
## Brand.CodeB          13.0
## PSC.Fill              4.5
## Brand.CodeD           4.0
## Fill.Pressure         3.0
## Hyd.Pressure4         3.0
## Fill.Ounces           2.0
## PSC                   2.0
## PSC.CO2               2.0
## Brand.CodeA           1.0
## Brand.CodeUnknown     0.0

Interestingly, we can see that the list of important predictors contains some that had comparatively strong correlations (positive or negative) with the target variable, for example Alch Rel, Bowl Setpoint, Carb Flow, Pressure Vacuum, Oxygen Filler and Mnf Flow. At the same time, there are other predictors that showed strong correlations with PH but did not make it into the top 10 important predictors, for example Filler Level, Carb Rel, Usage cont, Fill Pressure, Temperature, Pressure Setpoint, Hyd Pressure2 and Hyd Pressure3.

Instead, the top 10 important predictors include variables such as Balling Lvl, Balling and Density, which did not show strong correlations with PH.

This begins to make more sense when we compare to the predictor-predictor correlations calculated previously as well as the results of the vifcor function used previously. We can see that Carb Rel and Alch Rel are strongly correlated, as are Alch Rel and Hyd Pressure3. This indicates that the model is taking into account multi-collinearity and avoiding predictors that are strongly correlated with others that have already been selected and thereby do not provide incremental predictive power.
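To compare the two rankings side by side, the sketch below sorts the importance scores and attaches the corresponding correlations with PH. The importance values are those of 24 or more copied from the varImp output, and the correlations are rounded from the earlier correlation table. The object names are illustrative.

```r
# Importance scores of 24 or more, copied from the varImp output above.
importance <- c(
  Mnf.Flow = 73.0, Alch.Rel = 57.5, Balling.Lvl = 54.5, Pressure.Vacuum = 54.0,
  Balling = 53.5, Oxygen.Filler = 44.5, Density = 41.5, Bowl.Setpoint = 40.0,
  Carb.Flow = 36.5, Hyd.Pressure3 = 36.5, Carb.Rel = 35.5, Air.Pressurer = 32.5,
  Carb.Pressure1 = 29.5, Temperature = 29.5, Brand.CodeC = 27.5,
  Usage.cont = 27.0, Hyd.Pressure2 = 25.5, Filler.Speed = 24.0
)

# Correlations with PH for the same variables (rounded to 3 decimals).
ph_cor <- c(
  Mnf.Flow = -0.459, Alch.Rel = 0.167, Balling.Lvl = 0.109,
  Pressure.Vacuum = 0.220, Balling = 0.077, Oxygen.Filler = 0.164,
  Density = 0.096, Bowl.Setpoint = 0.362, Carb.Flow = 0.234,
  Hyd.Pressure3 = -0.268
)

# Take the ten highest importance scores and look up their correlations.
top10 <- head(sort(importance, decreasing = TRUE), 10)
comparison <- data.frame(importance = top10,
                         cor_with_PH = ph_cor[names(top10)])
print(comparison)
```

Note that Carb Flow and Hyd Pressure3 tie at 36.5, so their relative order within the top 10 is arbitrary.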

Predictions

Now that we have identified the Cubist model as the best predictive model, we will apply the model to the evaluation dataset by replacing the empty PH values in the evaluation dataset with the Cubist model’s predictions.

# Define the "evaluationDataClean" variable.
evaluationDataClean <- bev.eval.transformed

# Run predict() on the Cubist model.
cubistPredictions <- predict(cubistModel, newdata = evaluationDataClean)

# Replace the empty PH values in the evaluation set with the Cubist predictions.
evaluationDataClean$PH <- round(cubistPredictions,2)

# Take a look at the evaluation data after PH value replacement.
head(evaluationDataClean, 20) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')

Brand.CodeA	Brand.CodeB	Brand.CodeC	Brand.CodeD	Brand.CodeUnknown	Carb.Volume	Fill.Ounces	PC.Volume	Carb.Pressure	Carb.Temp	PSC	PSC.Fill	PSC.CO2	Mnf.Flow	Carb.Pressure1	Fill.Pressure	Hyd.Pressure2	Hyd.Pressure3	Hyd.Pressure4	Filler.Level	Filler.Speed	Temperature	Usage.cont	Carb.Flow	Density	Balling	Pressure.Vacuum	Oxygen.Filler	Bowl.Setpoint	Pressure.Setpoint	Air.Pressurer	Alch.Rel	Carb.Rel	Balling.Lvl	PH
-0.3876816	-0.9650293	-0.3617512	1.7776376	-0.1754205	1.0173477	0.8453014	-0.1261451	-0.7151727	-1.6286322	2.2451102	1.6099559	-0.2942791	-1.027837	-1.4492875	-0.6003036	-1.151696	-1.17903	-0.0759471	1.3488585	0.5086529	-0.1059513	0.1956905	0.4661381	-0.7278954	-0.8968551	2.3563890	-0.3646397	1.5281699	-1.2774435	-0.1787716	-0.6935289	-0.7893006	-0.6467639	8.63
2.5697753	-0.9650293	-0.3617512	-0.5604375	-0.1754205	0.2569971	-0.2061344	-0.8152380	-1.4064544	-1.5210765	-0.7797427	0.4326833	0.7395699	-1.027837	-0.9470071	-0.5412123	-1.163307	-1.17903	1.0115862	0.6212109	0.5500784	-0.3700120	-1.1183338	0.4368621	0.8571200	0.9606936	1.3300336	-0.0738317	0.7065401	-0.8299445	3.5102683	0.5986430	1.1043790	1.1194968	8.44
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.6671561	-0.6473493	0.4039264	-0.4234268	-0.1578536	-0.1784639	-0.7460745	-0.8112036	-1.027837	-0.6293102	-0.6595235	-1.163307	-1.17903	0.0695220	0.5766276	0.5468822	-0.3700120	1.1479870	0.5574101	-0.6723612	-0.7892285	1.6721521	0.3576903	0.7065401	-0.8299445	3.0487050	-0.7955619	-0.7893006	-0.6694082	8.60
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.9225304	-0.3823992	-1.4619252	-0.8967401	-0.4960754	-2.3057805	0.2696713	-0.8112036	-1.027837	0.4042797	-2.4369491	-1.163307	-1.17903	2.1707445	0.6361217	-0.2105342	4.4933952	-0.9652103	-2.0498698	-1.1279666	-1.8425756	2.0142706	0.7859125	0.7065401	-0.8299445	2.8935880	-0.8994902	0.5006257	-0.6467639	8.59
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	0.3763584	3.0023969	-1.8753809	0.3774154	0.2624104	-0.8329359	1.0082946	0.2226454	-1.027837	-1.8169667	0.9525837	-1.163307	-1.17903	-0.2244790	0.3282101	0.5596763	0.1533516	0.0752223	0.6934571	-0.7278954	-0.8968551	2.0142706	1.0031796	0.7065401	1.0948096	2.4244009	-0.8472862	-0.4560038	-0.6694082	8.58
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.7306375	1.7999363	-1.0484694	1.2962940	1.3499991	0.0209609	0.4326833	1.2564944	-1.027837	-0.9925143	-0.4822488	-1.163307	-1.17903	-0.2244790	0.6510572	0.5468822	0.2812546	-1.0009422	0.5642985	-0.8401151	-1.1317616	2.3563890	0.7172809	0.7065401	-0.8299445	2.5814395	-0.8472862	-0.1300589	-0.6920526	8.56
2.5697753	-0.9650293	-0.3617512	-0.5604375	-0.1754205	1.0173477	-0.4706421	-0.5502023	-0.7751385	-1.6286322	0.2079618	-0.2880740	-1.3281282	-1.027837	-1.2205124	-0.5412123	-1.163307	-1.17903	0.7550124	0.5914639	0.5468822	0.4080104	-1.0950665	0.5453553	0.8095406	0.9279513	1.6721521	0.2627355	0.7065401	-0.8299445	1.7897350	0.6763990	0.1887486	1.0968525	8.49
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	0.4948399	1.2803082	-2.4690609	-0.1445701	-0.4960754	-0.0178523	-0.7460745	-0.2942791	-1.027837	-0.3581749	-2.4369491	-1.163307	-1.17903	0.7550124	1.5107538	0.5023036	0.0242877	-2.3000814	-0.3759754	1.0922312	1.1916247	1.3300336	1.1925299	0.7065401	-0.8299445	2.5814395	0.6376839	-0.1300589	1.0742081	8.60
2.5697753	-0.9650293	-0.3617512	-0.5604375	-0.1754205	0.3763584	-0.6473493	0.8703893	-0.3103806	-0.7439904	2.3548087	2.0287088	-0.2942791	-1.027837	2.8590587	-1.2589854	-1.163307	-1.17903	0.8844656	0.6960130	0.5468822	-0.2373797	-1.0892332	0.0803846	0.9045094	0.9939035	1.3300336	0.3576903	0.7065401	-0.8299445	2.7378340	0.5986430	0.0302239	1.1874299	8.54
-0.3876816	-0.9650293	-0.3617512	1.7776376	-0.1754205	0.9601386	0.7580828	-0.3487751	1.1245987	0.6670133	1.1282743	-0.7460745	-0.8112036	-1.027837	0.8045664	-2.1844568	-1.163307	-1.17903	0.6231393	0.6810029	-2.4548396	-0.1059513	0.6135222	-2.0498698	0.8095406	0.9265682	1.6721521	1.1925299	0.7065401	-0.8299445	2.5814395	1.7026429	0.6540279	1.2100743	8.61
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-1.7799105	-2.0717096	1.0400121	-1.1468189	-0.0630670	0.6173947	1.2615772	-0.2942791	-1.027837	-0.4032894	-0.4234122	-1.163307	-1.17903	0.0695220	0.6361217	0.5468822	-0.5038630	-0.3612107	0.6572927	-0.7838095	-1.0108226	1.6721521	0.7520127	0.7065401	-0.8299445	3.3570417	-0.7955619	-0.6217195	-0.7599857	8.74
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.9869816	1.4538053	-0.9212523	-1.4064544	-0.3498769	1.6328781	0.7338316	3.8411170	-1.027837	-1.3119287	-0.5412123	-1.163307	-1.17903	-0.0759471	0.5028187	0.5468822	-0.2373797	-1.2444176	0.5952966	-0.7838095	-1.0108226	1.6721521	0.4027100	0.7065401	-0.8299445	3.3570417	-0.8472862	-0.4560038	-0.7146970	8.55
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.6039141	1.1934515	0.0646806	-0.8356587	-0.5452300	1.2119820	-0.5055641	-1.3281282	-1.027837	-1.3576836	-0.6595235	-1.163307	-1.17903	0.2120521	0.5914639	0.5468822	-0.5038630	-0.2007139	0.4454727	-0.6171961	-0.6873889	0.6457967	0.7520127	0.7065401	-0.8299445	3.3570417	-0.7443114	-1.3035206	-0.6694082	8.63
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.5409102	-0.3823992	0.1706949	-1.2108131	-0.9462379	0.4188365	0.0965566	-0.8112036	-1.027837	-0.2230094	-0.4822488	-1.163307	-1.17903	0.2120521	0.6063250	0.5564754	-0.3700120	0.1175229	0.4695823	-0.5623897	-0.5908470	0.6457967	0.4462532	0.7065401	-0.8299445	-0.3473212	-0.7443114	-1.8355740	-0.6920526	8.57
-0.3876816	-0.9650293	2.7539772	-0.5604375	-0.1754205	-0.8583235	0.0577117	0.6795635	-0.9583897	-0.2534538	0.0592796	0.8740066	1.2564944	-1.027837	-1.4951365	-0.5412123	-1.163307	-1.17903	-0.3762053	0.6361217	0.5500784	0.9038331	-0.0086665	0.5729091	-0.4538136	-0.4119692	1.6721521	0.3576903	0.7065401	-0.8299445	-0.3473212	-0.5439333	-1.4788500	-0.5108976	8.43
-0.3876816	-0.9650293	-0.3617512	-0.5604375	5.6792381	-1.0516784	-1.0904126	0.6583607	-0.8356587	-0.2534538	-0.6273472	0.7338316	-0.8112036	-1.027837	0.5824681	-1.3805093	-1.163307	-1.17903	0.8844656	1.4294958	-2.4540372	1.7303860	-0.9532469	-2.0464255	-0.9539501	-1.3975431	0.6457967	1.8488327	0.7065401	-0.8299445	-0.5165825	-0.7955619	-1.3035206	-0.5108976	8.60
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.2294224	0.2332426	-0.3911808	0.6218844	0.8420124	0.6494338	-0.5055641	-1.3281282	-1.027837	-1.1292205	-0.7188728	-1.163307	-1.17903	-0.5312665	0.5618162	0.5277389	-0.1059513	-0.8199085	0.5126351	-0.7278954	-0.8968551	2.0142706	1.2669589	0.7065401	-0.8299445	-0.6865594	-0.8472862	-0.6217195	-0.7373413	8.52
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.9225304	-0.3823992	-0.0519351	-0.7151727	-0.2055514	0.7748625	1.7190407	0.2226454	-1.027837	-1.4951365	-0.5412123	-1.163307	-1.17903	-0.5312665	0.6660177	0.5181887	-0.2373797	-0.9472553	0.5866860	-0.5623897	-0.5908470	1.3300336	0.6451271	0.7065401	-0.8299445	-0.5165825	-0.7955619	-1.1301723	-0.6920526	8.54
-0.3876816	-0.9650293	-0.3617512	1.7776376	-0.1754205	1.2441099	-1.0016517	-0.4547894	0.1221458	-0.5945976	-0.3931767	-0.5055641	-0.8112036	-1.027837	-1.1292205	-0.5412123	-1.163307	-1.17903	-1.7241003	0.6361217	0.5245539	-1.3334161	0.2028362	0.4540833	1.1850323	1.1748584	1.6721521	1.7973723	0.7065401	-0.8299445	-0.3473212	1.6722219	1.3966330	1.2553630	8.62
-0.3876816	1.0323569	-0.3617512	-0.5604375	-0.1754205	-0.4156124	-0.1181122	-1.3877152	-0.4234268	-0.2055514	-0.8874764	-1.7215960	-1.3281282	-1.027837	-1.1748510	-0.7783524	-1.163307	-1.17903	-0.3762053	0.6212109	0.5245539	-0.5038630	0.4202745	0.4850813	-0.6171961	-0.6873889	1.3300336	-0.3646397	0.7065401	-0.8299445	-0.1787716	-0.6935289	-0.4560038	-0.6694082	8.55

Looking at the PH column (the last column) in the final data sample above, we can see that the empty PH values have now been replaced with our Cubist model predictions.

# Inspect range of predicted values for evaluation dataset.
summary(evaluationDataClean$PH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.130   8.430   8.510   8.504   8.590   8.870

histogram(evaluationDataClean$PH)

From the summary and histogram above, we can see that the predicted PH values for the evaluation dataset range from 8.13 to 8.87, slightly narrower than the corresponding range in the training data. This gives us some confidence that the model behaves sensibly on unseen data, although it is a sanity check rather than a measure of accuracy. Measuring accuracy would require comparing the predictions against the actual PH values for the evaluation data, which we did not have access to. While not perfectly normal, the distribution of the predicted PH values is reasonably close to normal.
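The range comparison described above can be sketched as follows. This is a minimal illustration on simulated values: `train_ph` and `pred_ph` are hypothetical stand-ins for the training PH column and `evaluationDataClean$PH`, and the helper `range_within` is not part of the original analysis.

```r
# Simulated stand-ins (assumptions): replace with the real training PH
# values and evaluationDataClean$PH.
set.seed(624)
train_ph <- round(rnorm(2500, mean = 8.55, sd = 0.17), 2)
pred_ph  <- round(rnorm(260,  mean = 8.50, sd = 0.10), 2)

# Hypothetical helper: TRUE when every prediction lies inside the
# range of the reference values.
range_within <- function(pred, reference) {
  ref_range <- range(reference, na.rm = TRUE)
  all(pred >= ref_range[1] & pred <= ref_range[2], na.rm = TRUE)
}

# Side-by-side summaries of the two vectors.
rbind(train = summary(train_ph), predicted = summary(pred_ph))

# Do all predictions fall inside the training range?
range_within(pred_ph, train_ph)
```

A check like this only confirms that predictions are plausible; it says nothing about how close each prediction is to the true value.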

Export Final Results As A CSV File

Finally, we will write the final results to a CSV file.

# Save the final results to a CSV file.
write.csv(evaluationDataClean, './data/FinalPHPredictions.csv', row.names = FALSE)
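Before handing the file over, it can be worth reading it back to confirm that the export is intact. The sketch below uses a toy data frame and a temporary file; `toy_results`, `out_path` and `check` are hypothetical names, not objects from the analysis.

```r
# Toy stand-in (assumption) for evaluationDataClean.
toy_results <- data.frame(Mnf.Flow = c(-0.39, 1.03, -0.97),
                          PH = c(8.61, 8.74, 8.55))

# Write to a temporary file so the sketch does not touch ./data.
out_path <- tempfile(fileext = ".csv")
write.csv(toy_results, out_path, row.names = FALSE)

# Read the file back and confirm the shape and the predictions survived.
check <- read.csv(out_path)
stopifnot(identical(dim(check), dim(toy_results)),
          !anyNA(check$PH),
          isTRUE(all.equal(check$PH, toy_results$PH)))
```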

Conclusion

Based on the exploratory data analysis of the training dataset, we prepared the data as follows: dropping some columns and rows, creating a separate “Unknown” group for missing brand codes, creating dummy variables for the single categorical variable (Brand Code), imputing missing values, and applying a Box-Cox transformation to the predictors to reduce their skewness.

After this, we fit models from three categories: linear, non-linear and tree-based. We trained two models from each category, for a total of six. We used RMSE and R-squared together as the performance metrics for choosing the final model. We selected the Cubist model because its metrics were clearly better than those of the other models: it has both the lowest RMSE and the highest R-squared. This is not surprising, given that tree-based models such as Cubist handle non-linear relationships and multicollinearity better than linear models do. This comes across in the list of top predictors selected by this model, as described in a previous section.
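As a small worked example of the two metrics, the sketch below computes RMSE and R-squared for two made-up sets of predictions and picks the one with the lower RMSE. It assumes R-squared is taken as the squared correlation between observed and predicted values, which is one common definition; the helper `regression_metrics` and all values are hypothetical.

```r
# Hypothetical helper: RMSE and R-squared (squared correlation
# between observed and predicted values).
regression_metrics <- function(obs, pred) {
  c(RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2)
}

# Toy observed PH values and predictions from two made-up models.
obs     <- c(8.36, 8.26, 8.94, 8.24, 8.26)
model_a <- c(8.40, 8.30, 8.80, 8.30, 8.30)
model_b <- c(8.50, 8.45, 8.60, 8.50, 8.40)

# One row of metrics per model.
results <- rbind(model_a = regression_metrics(obs, model_a),
                 model_b = regression_metrics(obs, model_b))
results

# Name of the model with the lowest RMSE.
rownames(results)[which.min(results[, "RMSE"])]
```

Lower RMSE and higher R-squared usually point to the same model, as they did for Cubist here, but they can disagree, so it helps to report both.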

For the final model selected, the top 5 predictors in terms of importance are Mnf.Flow, Alch.Rel, Balling.Lvl, Pressure.Vacuum and Brand Code C. Finding Brand Code among the top predictors is interesting because, at the end of the day, brand is not a physical or chemical property that can be directly linked to PH levels. We think it acts as a proxy that collectively captures other chemical features which in turn help explain the PH levels.

We see that the range of the predicted values in the evaluation data is in line with the PH range in the training data, which gives us some confidence that the selected model generalizes. In addition, the distribution of the predicted values is approximately normal.

As with any real-world data science process, the logical next step would be to measure accuracy directly by comparing the predicted values to the actual PH values for the evaluation data. Our recommendation to the manager of ABC Beverage is to go with the Cubist model and to put in place an ongoing process that monitors the model and retunes it if its metrics show any deterioration.
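One possible shape for such a monitoring process is sketched below: compute RMSE per production batch once measured PH values become available, and flag batches whose error exceeds a baseline by more than a tolerance. The helper `flag_drift`, the baseline RMSE of 0.10 and the 10% tolerance are all illustrative assumptions, not values from this analysis.

```r
# Hypothetical helper: flag batches whose RMSE exceeds the baseline
# RMSE by more than a chosen tolerance (both values are assumptions).
flag_drift <- function(obs, pred, batch, baseline_rmse, tolerance = 0.10) {
  batch_rmse <- tapply((obs - pred)^2, batch, function(se) sqrt(mean(se)))
  data.frame(batch = names(batch_rmse),
             rmse = as.numeric(batch_rmse),
             drift = as.numeric(batch_rmse) > baseline_rmse * (1 + tolerance))
}

# Toy monitoring data: two batches of measured vs predicted PH.
obs   <- c(8.50, 8.60, 8.40, 8.55, 8.90, 8.20)
pred  <- c(8.52, 8.57, 8.43, 8.50, 8.60, 8.50)
batch <- c("week1", "week1", "week1", "week2", "week2", "week2")

monitoring <- flag_drift(obs, pred, batch, baseline_rmse = 0.10)
monitoring
```

In practice the baseline would come from the cross-validated RMSE of the chosen model, and a flagged batch would trigger a closer look before any retraining.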

References

Linear Models with R: Julian Faraway.
Applied Predictive Modeling: Kuhn & Johnson.
https://newalbanysmiles.com/ph-values-of-common-beverages/