Assignment Overview / Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Data Import

There are two files provided; the import code below reads CSV versions with the same base names:

  1. StudentData.xlsx: This is the training dataset. Note the PH column will be our target we are trying to predict.

  2. StudentEvaluation.xlsx: This is the evaluation dataset. Note the PH column is empty in this dataset.

# Packages used below: dplyr (%>%, select), knitr (kable), kableExtra (kable_styling, scroll_box).
library(dplyr)
library(knitr)
library(kableExtra)

# Load the ABC Beverage training dataset.
beverage.train <- read.csv('./data/StudentData.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)

# Load the ABC Beverage evaluation dataset.
beverage.eval <- read.csv('./data/StudentEvaluation.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)
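
To illustrate what the na.strings argument does in the import calls above, here is a minimal sketch on a small inline CSV. The three rows are made up for illustration and are not taken from the ABC Beverage files; the text argument of read.csv is used only so that the example runs without any file on disk.

```r
# Hypothetical three-row CSV; the values are illustrative only.
csv_text <- "Brand.Code,Carb.Volume,PH
B,5.34,8.36
,5.43,8.26
A,NA,
"

# Same options as the import calls above: treat empty fields and the
# literal string "NA" as missing, and keep text columns as character.
demo <- read.csv(text = csv_text,
                 na.strings = c('', 'NA'),
                 stringsAsFactors = FALSE)

# Count the missing values in each column.
na_counts <- colSums(is.na(demo))
print(demo)
print(na_counts)
```

Without '' in na.strings, the blank Brand.Code field would be read as an empty string rather than NA, so it would not show up in any missing-value count.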

 

Remove Empty PH Value From The Evaluation Set

The evaluation dataset contains an empty PH column, so we will remove it for now until it is needed later in the project.

# Remove the empty PH column from the evaluation data.
beverage.eval <- beverage.eval %>% dplyr::select(-PH)
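
Before dropping a column it can be worth confirming that it really is empty. The sketch below does this on a made-up evaluation frame and then removes PH with a base R equivalent of the dplyr::select(-PH) call above. The helper name drop_if_empty and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Drop `column` from `data` only if every value in it is missing.
# `drop_if_empty` is a hypothetical helper written for this sketch.
drop_if_empty <- function(data, column) {
  if (!column %in% names(data)) {
    stop("Column not found: ", column)
  }
  if (!all(is.na(data[[column]]))) {
    stop("Column still holds values and was not dropped: ", column)
  }
  data[, setdiff(names(data), column), drop = FALSE]
}

# Toy stand-in for the evaluation data (values are illustrative only).
toy.eval <- data.frame(
  Brand.Code  = c("D", "A", "B"),
  Carb.Volume = c(5.48, 5.39, 5.29),
  PH          = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

toy.eval <- drop_if_empty(toy.eval, "PH")
print(names(toy.eval))
```

The guard stops with an error instead of silently discarding data if the column turns out to hold values.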

Data Exploration

Evaluation Dataset

The first step in our data exploration is to take a brief look at the evaluation dataset. To get an idea of its structure, we will print the first 40 rows of the data.

# Preview the first 40 rows of the evaluation dataset.
head(beverage.eval, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')
Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
D 5.480000 24.03333 0.2700000 65.4 134.6 0.236 0.40 0.04 -100 116.6 46.0 0 NA NA 96 129.4 3986 66.0 21.66 2950 0.88 727.6 1.398 -3.8 0.022 130 45.2 142.6 6.56 5.34 1.48
A 5.393333 23.95333 0.2266667 63.2 135.0 0.042 0.22 0.08 -100 118.8 46.2 0 0 0 112 120.0 4012 65.6 17.60 2916 1.50 735.8 2.942 -4.4 0.030 120 46.0 147.2 7.14 5.58 3.04
B 5.293333 23.92000 0.3033333 66.4 140.4 0.068 0.10 0.02 -100 120.2 45.8 0 0 0 98 119.4 4010 65.6 24.18 3056 0.90 734.8 1.448 -4.2 0.046 120 46.0 146.6 6.52 5.34 1.46
B 5.266667 23.94000 0.1860000 64.8 139.0 0.004 0.20 0.02 -100 124.8 40.0 0 0 0 132 120.2 NA 74.4 18.12 28 0.74 NA 1.056 -4.0 NA 120 46.0 146.4 6.48 5.50 1.48
B 5.406667 24.20000 0.1600000 69.4 142.2 0.040 0.30 0.06 -100 115.0 51.4 0 0 0 94 116.0 4018 66.4 21.32 3214 0.88 752.0 1.398 -4.0 0.082 120 50.0 145.8 6.50 5.38 1.46
B 5.286667 24.10667 0.2120000 73.4 147.2 0.078 0.22 NA -100 118.6 46.4 0 0 0 94 120.4 4010 66.6 18.00 3064 0.84 732.0 1.298 -3.8 0.064 120 46.0 146.0 6.50 5.42 1.44
A 5.480000 23.93333 0.2433333 65.2 134.6 0.088 0.14 0.00 -100 117.6 46.2 0 0 0 108 119.6 4010 66.8 17.68 3042 1.48 729.8 2.894 -4.2 0.042 120 46.0 145.0 7.18 5.46 3.02
B 5.420000 24.06667 0.1226667 67.4 139.0 0.076 0.10 0.04 -100 121.4 40.0 0 0 0 108 131.4 NA NA 12.90 1972 1.60 NA 3.320 -4.4 0.096 120 46.0 146.0 7.16 5.42 3.00
A 5.406667 23.92000 0.3326667 66.8 138.0 0.246 0.48 0.04 -100 136.0 43.8 0 0 0 110 121.0 4010 65.8 17.70 2502 1.52 741.2 2.992 -4.4 0.046 120 46.0 146.2 7.14 5.44 3.10
D 5.473333 24.02667 0.2560000 72.6 144.0 0.146 0.10 0.02 -100 126.6 40.8 0 0 0 106 120.8 1006 66.0 22.80 28 1.48 NA 2.892 -4.2 0.096 120 46.0 146.0 7.78 5.52 3.12
B 5.180000 NA 0.3433333 64.0 140.8 NA 0.34 0.04 -100 121.2 46.6 0 0 0 98 120.2 4010 65.4 20.04 3172 0.86 732.8 1.348 -4.2 0.066 120 46.0 147.0 6.52 5.36 1.38
B 5.260000 24.08000 0.2200000 63.2 139.6 0.184 0.26 0.20 -100 117.2 46.2 0 0 0 96 118.4 4010 65.8 17.16 3100 0.86 735.8 1.348 -4.2 0.048 120 46.0 147.0 6.50 5.38 1.42
B 5.300000 24.06000 0.2820000 65.0 138.8 0.152 0.12 0.00 -100 117.0 45.8 0 0 0 100 119.6 4010 65.4 20.52 2926 0.92 735.6 1.498 -4.8 0.066 120 46.0 147.0 6.54 5.28 1.46
B 5.306667 23.94000 0.2886667 63.8 137.2 0.100 0.18 0.02 -100 122.0 46.4 0 0 0 100 119.8 4016 65.6 21.44 2954 0.94 736.4 1.548 -4.8 0.050 120 46.0 142.4 6.54 5.22 1.44
C 5.273333 23.97333 0.3206667 64.6 140.0 0.080 0.28 0.10 -100 116.4 46.2 0 0 0 92 120.2 4012 67.6 21.08 3074 0.98 738.0 1.648 -4.2 0.046 120 46.0 142.4 6.62 5.26 1.60
NA 5.253333 23.88667 0.3193333 65.0 140.0 0.048 0.26 0.02 -100 125.6 43.4 0 0 0 110 130.4 NA 69.0 18.16 32 0.80 NA 1.198 -4.8 0.160 120 46.0 142.2 6.52 5.28 1.60
B 5.340000 23.98667 0.2533333 70.4 144.8 0.114 0.12 0.00 -100 118.0 45.6 0 0 0 90 119.2 3998 66.0 18.60 3004 0.88 730.4 1.398 -4.0 0.102 120 46.0 142.0 6.50 5.36 1.40
B 5.266667 23.94000 0.2746667 65.4 140.2 0.122 0.42 0.06 -100 116.4 46.2 0 0 0 90 120.6 3992 65.8 18.18 3090 0.94 728.6 1.548 -4.4 0.060 120 46.0 142.2 6.52 5.30 1.44
D 5.506667 23.89333 0.2493333 68.4 138.6 0.058 0.12 0.02 -100 118.0 46.2 0 0 0 76 120.2 3996 64.2 21.68 2936 1.64 729.8 3.290 -4.2 0.154 120 46.0 142.4 7.76 5.62 3.16
B 5.320000 23.96000 0.1906667 66.4 140.2 0.038 0.04 0.00 -100 117.8 45.4 0 0 0 92 120.0 3996 65.4 22.28 2972 0.92 726.8 1.498 -4.4 0.022 120 46.0 142.6 6.56 5.38 1.46
B 5.273333 23.96667 0.1993333 68.4 141.8 0.008 0.30 0.20 -100 117.2 46.0 0 0 0 94 120.2 3998 65.6 24.02 3094 0.92 732.0 1.498 -4.4 0.022 120 46.0 142.4 6.58 5.40 1.46
B 5.533333 23.98667 0.2466667 70.4 142.2 0.062 0.08 NA -100 121.0 47.6 0 0 0 118 120.0 2834 67.4 13.56 3154 0.70 523.4 0.946 -4.4 0.054 120 46.0 141.8 6.50 5.48 1.36
A 5.426667 23.98667 0.2553333 69.0 140.4 0.122 0.48 0.04 -100 121.2 38.8 0 0 0 NA 131.4 1386 66.4 19.32 868 1.56 NA 3.092 -4.4 0.022 120 46.0 142.8 7.08 5.54 3.18
B 5.406667 23.94000 0.3293333 66.2 137.8 0.208 0.46 0.02 -100 130.6 44.6 0 0 0 96 119.0 4002 67.0 17.52 2592 0.88 731.4 1.398 -4.0 0.022 120 46.0 142.8 6.50 5.46 1.40
B 5.453333 24.09333 0.2353333 66.4 136.2 0.072 0.06 0.06 -100 125.2 46.4 0 0 0 94 120.2 4010 66.4 20.38 2996 0.90 736.8 1.448 -4.6 0.024 120 46.0 142.6 6.50 5.38 1.40
C 5.266667 23.94667 0.2766667 64.8 139.2 0.048 0.18 0.02 -100 116.6 46.2 0 0 0 90 120.6 4014 66.0 24.12 3060 0.90 738.6 1.448 -4.4 0.022 120 46.0 142.2 6.54 5.32 1.40
C 5.253333 23.99333 0.2933333 70.4 146.4 0.040 0.14 0.02 -100 122.2 46.0 0 0 0 102 119.8 4012 66.8 17.54 3136 0.94 741.2 1.548 -4.2 0.022 120 46.0 142.2 6.62 5.34 1.52
D 5.500000 24.04667 0.2466667 71.0 141.8 0.040 0.02 0.00 -100 120.4 46.4 0 0 0 78 119.4 4010 65.6 18.12 3194 1.64 735.2 3.290 -3.8 0.174 120 46.0 142.0 7.74 5.66 3.28
D 5.480000 23.89333 0.2246667 70.4 140.8 NA 0.34 0.06 -100 120.2 50.4 0 0 0 80 120.0 4010 64.8 16.94 3162 1.66 740.8 3.340 -3.8 0.022 120 50.0 142.0 7.74 5.62 3.24
D 5.486667 23.98000 0.3066667 69.6 140.0 0.234 0.16 0.02 -100 122.6 47.0 0 0 0 78 120.2 4010 64.6 17.04 2982 1.66 733.8 3.340 -4.0 0.024 120 46.0 141.8 7.74 5.62 3.26
D 5.466667 24.04000 0.2300000 70.2 141.2 0.004 0.10 0.04 -100 120.2 46.6 0 0 0 78 120.4 4010 65.4 23.44 3182 1.68 732.4 3.390 -4.0 0.066 120 46.0 142.0 7.72 5.56 3.28
D 5.460000 24.04667 0.2780000 70.8 141.8 0.174 0.62 0.10 -100 125.6 40.0 0 0 0 104 121.8 1008 70.6 19.22 32 1.42 NA 2.750 -4.2 0.024 120 46.0 141.8 7.72 5.58 3.28
B 5.320000 NA 0.2686667 64.4 137.0 0.068 0.10 0.04 -100 120.4 46.8 0 0 0 104 119.2 4018 66.0 18.04 3190 0.92 733.6 1.496 -4.2 0.022 120 46.0 142.4 6.50 5.38 1.50
B 5.313333 NA 0.3186667 64.2 136.8 0.218 0.44 0.04 -100 118.2 45.8 0 0 0 100 120.4 4014 65.2 22.02 2862 0.90 729.8 1.448 -4.2 0.022 120 46.0 142.0 6.50 5.36 1.52
B 5.266667 NA 0.2800000 73.2 149.8 0.040 0.24 0.04 -100 124.6 44.8 0 0 0 102 121.0 3990 65.0 16.72 3122 0.92 724.2 1.498 -4.0 0.036 120 46.0 141.6 6.52 5.34 1.56
A 5.486667 24.10667 0.1986667 75.0 146.4 0.086 0.28 0.04 -100 114.8 46.2 0 0 0 100 120.2 4016 65.4 17.76 3048 1.50 732.8 2.942 -4.0 0.022 120 46.0 142.0 7.16 5.44 3.00
A 5.460000 23.98000 0.2046667 65.8 136.6 0.010 0.04 0.06 -100 116.8 45.6 0 0 0 100 119.2 4010 66.2 17.76 3080 1.50 730.0 2.942 -3.6 0.022 120 46.0 142.6 7.14 5.54 3.08
A 5.440000 23.92000 0.2360000 69.4 142.0 0.050 0.16 0.08 -100 116.4 46.0 0 0 0 98 120.8 4012 66.4 21.46 3102 1.52 731.8 2.992 -3.6 0.022 120 46.0 142.4 7.14 5.56 3.06
D 5.480000 23.90667 0.1786667 70.4 141.0 0.090 0.22 0.04 -100 118.0 46.0 0 0 0 74 120.8 4012 65.0 19.68 2900 1.70 732.2 3.440 -4.4 0.052 120 46.0 142.6 7.68 5.58 3.32
D 5.473333 23.92000 0.3473333 74.2 145.2 NA 0.50 0.08 -100 125.8 46.8 0 0 0 78 123.2 4010 65.2 16.74 2880 1.72 745.4 3.490 -4.4 0.050 120 46.0 141.8 7.70 5.60 3.32

 

We will now check the summary statistics for the data.

##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:267         Min.   :5.147   Min.   :23.75   Min.   :0.09867  
##  Class :character   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333  
##  Mode  :character   Median :5.340   Median :23.97   Median :0.27533  
##                     Mean   :5.369   Mean   :23.97   Mean   :0.27769  
##                     3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200  
##                     Max.   :5.667   Max.   :24.20   Max.   :0.46400  
##                     NA's   :1       NA's   :6       NA's   :4        
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :60.20   Min.   :130.0   Min.   :0.00400   Min.   :0.0200  
##  1st Qu.:65.30   1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000  
##  Median :68.00   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.25   Mean   :141.2   Mean   :0.08545   Mean   :0.1903  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :77.60   Max.   :154.0   Max.   :0.24600   Max.   :0.6200  
##                  NA's   :1       NA's   :5         NA's   :3       
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :113.0   Min.   :37.80  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00  
##  Median :0.04000   Median :   0.20   Median :123.4   Median :47.80  
##  Mean   :0.05107   Mean   :  21.03   Mean   :123.0   Mean   :48.14  
##  3rd Qu.:0.06000   3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20  
##  Max.   :0.24000   Max.   : 220.40   Max.   :136.0   Max.   :60.20  
##  NA's   :5                           NA's   :4       NA's   :2      
##  Hyd.Pressure1    Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4   
##  Min.   :-50.00   Min.   :-50.00   Min.   :-50.00   Min.   : 68.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00  
##  Median : 10.40   Median : 26.80   Median : 27.70   Median : 98.00  
##  Mean   : 12.01   Mean   : 20.11   Mean   : 19.61   Mean   : 97.84  
##  3rd Qu.: 20.40   3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00  
##  Max.   : 50.00   Max.   : 61.40   Max.   : 49.20   Max.   :140.00  
##                   NA's   :1        NA's   :1        NA's   :4       
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 69.2   Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0  
##  1st Qu.:100.6   1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083  
##  Median :118.6   Median :3978   Median :65.80   Median :21.44   Median :3038  
##  Mean   :110.3   Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409  
##  3rd Qu.:120.2   3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215  
##  Max.   :153.2   Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858  
##  NA's   :2       NA's   :10     NA's   :2       NA's   :2                     
##     Density           MFR           Balling      Pressure.Vacuum 
##  Min.   :0.060   Min.   : 15.6   Min.   :0.902   Min.   :-6.400  
##  1st Qu.:0.920   1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600  
##  Median :0.980   Median :724.6   Median :1.648   Median :-5.200  
##  Mean   :1.177   Mean   :697.8   Mean   :2.203   Mean   :-5.174  
##  3rd Qu.:1.600   3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800  
##  Max.   :1.840   Max.   :784.8   Max.   :3.788   Max.   :-3.600  
##  NA's   :1       NA's   :31      NA's   :1       NA's   :1       
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :141.2  
##  1st Qu.:0.01960   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03370   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04666   Mean   :109.6   Mean   :47.73     Mean   :142.8  
##  3rd Qu.:0.05440   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8  
##  Max.   :0.39800   Max.   :130.0   Max.   :52.00     Max.   :147.2  
##  NA's   :3         NA's   :1       NA's   :2         NA's   :1      
##     Alch.Rel        Carb.Rel     Balling.Lvl   
##  Min.   :6.400   Min.   :5.18   Min.   :0.000  
##  1st Qu.:6.540   1st Qu.:5.34   1st Qu.:1.380  
##  Median :6.580   Median :5.40   Median :1.480  
##  Mean   :6.907   Mean   :5.44   Mean   :2.051  
##  3rd Qu.:7.180   3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :7.820   Max.   :5.74   Max.   :3.420  
##  NA's   :3       NA's   :2

The summary statistics show that most columns in the evaluation dataset contain missing values (among the numeric columns, MFR has the most with 31), so we will need to impute them later in the project.
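
One way to see which columns will need imputation is to tally the missing values per column. The sketch below does this with base R on a small made-up frame; the column names mirror the dataset, but the values and the name toy.eval are assumptions for illustration.

```r
# Toy stand-in for the evaluation data (values are illustrative only).
toy.eval <- data.frame(
  Fill.Ounces  = c(24.03, 23.95, NA, 23.94),
  Filler.Speed = c(3986, 4012, 4010, NA),
  MFR          = c(727.6, NA, 734.8, NA)
)

# Number and percentage of missing values per column, largest first.
n_missing   <- colSums(is.na(toy.eval))
pct_missing <- round(100 * n_missing / nrow(toy.eval), 1)
missing_tbl <- data.frame(n_missing, pct_missing)
missing_tbl <- missing_tbl[order(-missing_tbl$n_missing), ]
print(missing_tbl)
```

Sorting by count puts the columns that need the most attention at the top of the table.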

 

Our next step is to examine the training dataset in detail as this is the main dataset that we will be working with throughout the project.

Training Dataset

First, we will look at the first 40 observations in the dataset to get a feel for the data. We will then explore its structure using the str() function, which reports how many observations and variables it contains, the type of each variable, and its first few values (where some missing values are already visible).

# Preview the first 40 rows of the training dataset.
head(beverage.train, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')
Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
B 5.340000 23.96667 0.2633333 68.2 141.2 0.104 0.26 0.04 -100 118.8 46.0 0 NA NA 118 121.2 4002 66.0 16.18 2932 0.88 725.0 1.398 -4.0 8.36 0.022 120 46.4 142.6 6.58 5.32 1.48
A 5.426667 24.00667 0.2386667 68.4 139.6 0.124 0.22 0.04 -100 121.6 46.0 0 NA NA 106 118.6 3986 67.6 19.90 3144 0.92 726.8 1.498 -4.0 8.26 0.026 120 46.8 143.0 6.56 5.30 1.56
B 5.286667 24.06000 0.2633333 70.8 144.8 0.090 0.34 0.16 -100 120.2 46.0 0 NA NA 82 120.0 4020 67.0 17.76 2914 1.58 735.0 3.142 -3.8 8.94 0.024 120 46.6 142.0 7.66 5.84 3.28
A 5.440000 24.00667 0.2933333 63.0 132.6 NA 0.42 0.04 -100 115.2 46.4 0 0 0 92 117.8 4012 65.6 17.42 3062 1.54 730.6 3.042 -4.4 8.24 0.030 120 46.0 146.2 7.14 5.42 3.04
A 5.486667 24.31333 0.1113333 67.2 136.8 0.026 0.16 0.12 -100 118.4 45.8 0 0 0 92 118.6 4010 65.6 17.68 3054 1.54 722.8 3.042 -4.4 8.26 0.030 120 46.0 146.2 7.14 5.44 3.04
A 5.380000 23.92667 0.2693333 66.6 138.4 0.090 0.24 0.04 -100 119.6 45.6 0 0 0 116 120.2 4014 66.2 23.82 2948 1.52 738.8 2.992 -4.4 8.32 0.024 120 46.0 146.6 7.16 5.44 3.02
A 5.313333 23.88667 0.2680000 64.2 136.8 0.128 0.40 0.04 -100 122.2 51.8 0 0 0 124 123.4 NA 65.8 20.74 30 0.84 NA 1.298 -4.4 8.40 0.066 120 46.0 146.2 6.54 5.38 1.44
B 5.320000 24.17333 0.2206667 67.6 141.4 0.154 0.34 0.04 -100 124.2 46.8 0 0 0 132 118.6 1004 65.2 18.96 684 0.84 NA 1.298 -4.4 8.38 0.046 120 46.0 146.4 6.52 5.34 1.44
B 5.246667 23.98000 0.2626667 64.2 140.2 0.132 0.12 0.14 -100 120.8 46.0 0 0 0 90 120.2 4014 65.4 18.40 2902 0.90 740.4 1.446 -4.4 8.38 0.064 120 46.0 147.2 6.52 5.34 1.44
B 5.266667 24.00667 0.2313333 72.0 147.4 0.014 0.24 0.06 -100 119.8 45.2 0 0 0 108 120.8 4028 66.6 13.50 3038 0.90 692.4 1.448 -4.4 8.50 0.022 120 46.0 146.2 6.54 5.34 1.38
B 5.320000 23.92000 0.2586667 66.2 139.4 0.078 0.18 0.04 -100 119.6 46.6 0 0 0 94 119.6 4020 65.0 19.04 3056 0.90 727.0 1.448 -4.4 8.34 0.030 120 46.0 146.2 6.52 5.34 1.44
B 5.353333 24.06667 0.2513333 61.6 132.8 0.110 0.18 0.02 -100 119.2 46.6 0 0 0 86 119.6 4012 65.4 18.44 3110 0.92 735.0 1.498 -4.4 8.34 0.058 120 46.0 146.8 6.52 5.34 1.44
B 5.220000 23.89333 0.2673333 63.4 141.0 0.114 0.38 NA -100 117.4 45.4 0 0 0 98 121.0 4012 65.0 17.12 2870 0.92 729.6 1.498 -4.4 8.34 0.048 120 46.0 146.0 6.52 5.34 1.46
B 5.266667 23.89333 0.2286667 71.6 147.8 0.096 0.22 0.04 -100 113.6 46.0 0 0 0 94 120.0 4012 65.0 23.44 3040 0.92 731.0 1.498 -4.4 8.38 0.046 120 46.0 146.8 6.52 5.34 1.44
B 5.266667 23.87333 0.3340000 72.6 148.0 0.160 0.36 0.08 -100 120.2 46.6 0 0 0 92 120.0 4010 65.0 21.16 3056 0.90 732.4 1.448 -4.4 8.40 0.066 120 46.0 146.6 6.52 5.34 1.44
B 5.286667 23.86667 0.2566667 68.0 143.2 0.034 0.16 0.02 -100 129.0 47.4 0 0 0 96 119.8 4010 65.2 19.88 3290 0.90 731.0 1.448 -4.4 8.42 0.046 120 46.0 146.0 6.52 5.34 1.42
C 5.226667 23.69333 0.3166667 63.8 138.2 0.124 0.20 0.06 -100 123.4 48.8 0 0 0 92 120.2 1624 68.8 17.02 3200 0.46 295.8 0.346 -4.2 8.58 0.164 120 46.0 146.6 6.52 5.34 1.46
B 5.353333 23.99333 0.2793333 64.8 137.0 0.146 0.06 0.02 -100 115.6 46.4 0 0 0 94 120.4 4012 65.2 21.82 3082 0.88 726.4 1.398 -4.0 8.50 0.046 120 46.0 146.8 6.52 5.36 1.46
B 5.366667 24.09333 0.2613333 70.6 143.8 0.220 0.48 0.08 -100 121.4 47.0 0 0 0 98 116.4 3060 65.4 20.32 3324 0.84 535.8 1.298 -4.0 8.44 0.064 120 46.0 146.8 6.52 5.28 1.44
C 5.213333 23.98667 0.2353333 62.6 140.8 0.246 0.10 0.20 -100 119.6 45.4 0 0 0 102 120.2 4012 67.8 16.44 2970 0.86 731.8 1.348 -4.0 8.30 0.046 120 46.0 146.2 6.62 5.34 1.38
C 5.220000 24.26000 0.1120000 66.8 143.4 0.042 0.08 0.06 -100 116.6 46.4 0 0 0 94 121.0 4010 65.4 16.56 3090 0.94 726.4 1.548 -4.2 8.42 0.022 120 46.0 146.2 6.52 5.34 1.52
B 5.333333 24.09333 0.3046667 66.0 139.4 0.060 0.06 0.08 -100 130.2 44.2 0 0 0 130 100.2 1008 69.8 21.98 30 0.74 NA 1.048 -4.0 8.48 NA 100 50.0 147.0 6.50 5.40 1.48
B 5.340000 23.98667 0.2120000 68.2 142.2 0.038 0.16 0.04 -100 113.6 51.4 0 0 0 100 96.8 2936 66.6 19.36 3418 0.82 519.0 1.248 -4.2 8.52 0.254 100 50.0 146.8 6.50 5.38 1.48
B 5.413333 23.98667 0.2926667 70.0 142.8 0.124 0.02 0.04 -100 118.2 50.2 0 0 0 96 100.4 4016 66.0 24.00 3206 0.92 732.6 1.496 -4.2 8.44 0.084 100 50.0 146.4 6.52 5.38 1.48
B 5.373333 24.02000 0.2813333 68.0 141.0 0.102 0.26 0.02 -100 119.2 50.0 0 0 0 90 100.2 4010 66.2 21.58 3220 0.90 734.4 1.448 -4.2 8.44 0.064 100 50.0 147.2 6.50 5.28 1.50
B 5.313333 23.98667 0.2940000 68.2 142.2 0.052 0.18 0.10 -100 118.0 50.0 0 0 0 94 99.8 4016 66.0 20.72 3206 0.92 732.8 1.496 -4.2 8.40 0.206 100 50.0 146.6 6.50 NA 1.50
B 5.360000 24.02667 0.2780000 67.0 139.8 0.080 0.34 0.04 -100 115.2 50.2 0 0 0 102 100.2 4014 68.2 21.60 3168 0.86 740.4 1.348 -4.2 8.42 0.096 100 50.0 146.8 6.48 5.38 1.48
B 5.446667 24.02000 0.0900000 70.8 142.6 0.012 0.34 0.02 -100 124.4 50.0 0 0 0 96 100.8 4012 65.6 23.58 3138 0.88 729.4 1.398 -4.2 8.42 0.090 100 50.0 147.0 6.50 5.38 1.48
B 5.380000 24.07333 0.2180000 66.6 138.8 0.040 0.18 0.04 -100 116.4 50.0 0 0 0 92 100.0 4010 65.8 21.40 3212 0.90 731.0 1.448 -4.2 8.40 0.064 100 50.0 147.4 6.52 5.40 1.46
B 5.393333 24.08667 0.2120000 65.8 137.4 0.102 0.10 0.02 -100 118.6 50.0 0 0 0 102 100.6 4016 66.4 18.32 3164 0.90 734.2 1.448 -4.2 8.44 0.084 100 50.0 147.0 6.50 5.40 1.46
B 5.406667 24.11333 0.2220000 67.4 138.6 0.128 0.22 0.06 -100 116.8 49.8 0 0 0 100 100.4 4014 65.8 21.16 3194 0.90 731.6 1.448 -4.2 8.36 0.096 100 50.0 146.6 6.50 5.38 1.46
B 5.366667 24.09333 0.2106667 70.8 144.2 0.068 0.04 0.08 -100 115.2 50.2 0 0 0 96 100.2 4010 65.8 22.90 3182 0.90 731.0 1.398 -4.2 8.36 0.084 100 50.0 146.2 6.50 5.38 1.46
B 5.300000 24.07333 0.1860000 69.6 143.8 0.052 0.42 0.24 -100 121.2 50.4 0 0 0 94 110.0 4014 65.8 19.18 3214 0.90 730.6 1.448 -4.2 8.40 0.082 110 50.0 146.4 6.50 5.38 1.46
B 5.360000 24.08000 0.1546667 68.6 141.6 0.088 0.04 0.06 -100 119.6 50.0 0 0 0 100 109.4 4010 65.8 15.88 3198 0.92 740.0 1.496 -4.2 8.38 0.062 110 50.0 147.0 6.52 5.38 1.50
B 5.366667 24.04667 0.1326667 68.2 141.0 0.112 0.34 0.16 -100 118.0 50.2 0 0 0 94 110.0 4010 65.8 18.54 3220 0.92 733.8 1.496 -4.2 8.44 0.064 110 50.0 145.8 6.52 5.38 1.50
B 5.373333 24.00667 0.3160000 69.4 142.8 NA 0.28 0.02 -100 120.4 49.8 0 0 0 96 110.2 4012 67.4 19.98 3208 0.90 728.4 1.398 -4.0 8.36 0.064 110 50.0 146.0 6.50 5.36 1.50
B 5.346667 23.98667 0.2280000 68.8 142.4 0.164 0.26 0.06 -100 121.4 43.0 0 0 0 120 120.0 1006 67.0 13.66 1464 0.86 NA 1.346 -4.0 8.40 0.080 120 50.0 148.2 6.50 5.40 1.50
B 5.373333 24.01333 0.2000000 65.2 137.4 0.112 0.34 0.02 -100 116.0 50.2 0 0 0 96 120.8 4014 66.4 21.98 3222 0.88 702.4 1.398 -4.0 8.38 0.060 120 50.0 147.0 6.50 5.40 1.48
B 5.326667 24.06000 0.2393333 75.8 151.4 0.080 0.08 0.02 -100 117.0 45.8 0 0 0 94 119.6 4012 66.6 18.12 2986 0.86 732.6 1.346 -3.8 8.36 0.082 120 46.0 145.8 6.50 5.42 1.44
C 5.273333 23.86000 0.1126667 65.6 140.2 0.050 0.10 0.04 -100 119.4 45.8 0 0 0 92 119.2 4014 67.2 13.78 2976 0.92 722.0 1.496 -3.8 8.28 0.062 120 46.0 146.2 6.64 5.38 1.60
# Examine the structure of the training data.
str(beverage.train)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand.Code       : chr  "B" "A" "B" "A" ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC.Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb.Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb.Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  119 122 120 115 118 ...
##  $ Fill.Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure.Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

The str() output shows that the training dataset consists of 33 columns and 2571 observations. All of the variables are numeric except Brand.Code, which is categorical (stored as character). The output also shows NA entries, so some of the variables contain missing values.

The training dataset contains 32 predictor variables: 1 categorical variable (Brand.Code) and 31 numeric variables (continuous and discrete). There are 2571 records in the training dataset and 267 records in the evaluation dataset. The target is the PH column.
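
The counts above can be checked directly from the data frame. The following sketch separates the target from the predictors and tabulates predictor types; it runs on a made-up four-column frame (toy.train), so the printed counts are for the toy data, not the real dataset.

```r
# Toy stand-in for the training data (values are illustrative only).
toy.train <- data.frame(
  Brand.Code    = c("B", "A", "B"),
  Carb.Volume   = c(5.34, 5.43, 5.29),
  Hyd.Pressure4 = c(118L, 106L, 82L),
  PH            = c(8.36, 8.26, 8.94),
  stringsAsFactors = FALSE
)

target     <- "PH"
predictors <- setdiff(names(toy.train), target)

# Treat both double and integer columns as numeric predictors.
is_num <- vapply(toy.train[predictors], is.numeric, logical(1))
type_counts <- c(categorical = sum(!is_num), numeric = sum(is_num))
print(type_counts)
```

Applied to beverage.train, the same calls would be expected to agree with the str() output, which lists both num and int columns among the numeric predictors.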

The data has the following variables:

  • Brand Code: categorical, values: A, B, C, D

  • Carb Volume: Numeric

  • Fill Ounces: Numeric

  • PC Volume: Numeric

  • Carb Pressure: Numeric

  • Carb Temp: Numeric

  • PSC: Numeric

  • PSC Fill: Numeric

  • PSC CO2: Numeric

  • Mnf Flow: Numeric

  • Carb Pressure1: Numeric

  • Fill Pressure: Numeric

  • Hyd Pressure1: Numeric

  • Hyd Pressure2: Numeric

  • Hyd Pressure3: Numeric

  • Hyd Pressure4: Numeric

  • Filler Level: Numeric

  • Filler Speed: Numeric

  • Temperature: Numeric

  • Usage cont: Numeric

  • Carb Flow: Numeric

  • Density: Numeric

  • MFR: Numeric

  • Balling: Numeric

  • Pressure Vacuum: Numeric

  • PH: This is the numeric TARGET variable that has to be predicted.

  • Oxygen Filler: Numeric

  • Bowl Setpoint: Numeric

  • Pressure Setpoint: Numeric

  • Air Pressurer: Numeric

  • Alch Rel: Numeric

  • Carb Rel: Numeric

  • Balling Lvl: Numeric

Now let’s check the summary statistics for the data.

##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air.Pressurer      Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 2571      33                1                 32                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                  844          2038              84843       645352
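
The one-row overview above (rows, columns, total missing values, complete rows, total observations) can be reproduced with a few base R calls. The source does not show which function generated it, so the sketch below is only an equivalent computation on a made-up frame. Note that total_observations is rows times columns, which for the training data is 2571 × 33 = 84843.

```r
# Toy stand-in for the training data (values are illustrative only).
toy.train <- data.frame(
  Brand.Code  = c("B", NA, "A", "C"),
  Carb.Volume = c(5.34, 5.43, NA, 5.22),
  PH          = c(8.36, 8.26, 8.94, 8.30),
  stringsAsFactors = FALSE
)

# Dataset-level overview: size, missing cells, and fully observed rows.
overview <- data.frame(
  rows                 = nrow(toy.train),
  columns              = ncol(toy.train),
  total_missing_values = sum(is.na(toy.train)),
  complete_rows        = sum(complete.cases(toy.train)),
  total_observations   = nrow(toy.train) * ncol(toy.train)
)
print(overview)
```

The complete_rows figure matters for modelling: in the training data only 2038 of 2571 rows have no missing values, so dropping incomplete rows would discard about a fifth of the data.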
Data summary
Name beverage.train
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
character 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand.Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb.Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill.Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC.Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb.Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb.Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC.Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC.CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf.Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb.Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill.Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd.Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd.Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd.Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd.Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler.Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler.Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage.cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb.Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure.Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen.Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl.Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure.Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air.Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch.Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb.Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling.Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆

 

From the above, we see that most of the predictors (except for 2) contain missing data and will therefore need to be imputed. For the target variable (PH), we see that 4 rows are missing “PH” values. These rows will need to be dropped since they cannot be used for training.

Let’s look at the distribution of the target variable next.

The above histogram reveals that the target variable is not very skewed, even though there are some outliers. The minimum value for PH is 7.88 and the maximum value is 9.36 indicating that ABC manufactures relatively alkaline beverages - likely to be green tea or fruit and vegetable juices.

Missing Values
variable n_miss pct_miss
MFR 212 8.2458187
Brand.Code 120 4.6674446
Filler.Speed 57 2.2170362
PC.Volume 39 1.5169195
PSC.CO2 39 1.5169195
Fill.Ounces 38 1.4780241
PSC 33 1.2835473
Carb.Pressure1 32 1.2446519
Hyd.Pressure4 30 1.1668611
Carb.Pressure 27 1.0501750
Carb.Temp 26 1.0112797
PSC.Fill 23 0.8945935
Fill.Pressure 22 0.8556982
Filler.Level 20 0.7779074
Hyd.Pressure2 15 0.5834306
Hyd.Pressure3 15 0.5834306
Temperature 14 0.5445352
Oxygen.Filler 12 0.4667445
Pressure.Setpoint 12 0.4667445
Hyd.Pressure1 11 0.4278491
Carb.Volume 10 0.3889537
Carb.Rel 10 0.3889537
Alch.Rel 9 0.3500583
Usage.cont 5 0.1944769
PH 4 0.1555815
Mnf.Flow 2 0.0777907
Carb.Flow 2 0.0777907
Bowl.Setpoint 2 0.0777907
Density 1 0.0388954
Balling 1 0.0388954
Balling.Lvl 1 0.0388954
Pressure.Vacuum 0 0.0000000
Air.Pressurer 0 0.0000000

The above statistics tell us that about 8.25% of the records are missing a value for MFR, the highest share of any variable. We may need to drop this feature, because the larger the share of imputed values, the greater the risk that imputation degrades the model.

The second most frequently missing variable is the categorical variable “Brand Code”, which is missing about 4.67% of its values. These records could belong to a fifth brand besides the existing A, B, C, and D, or to one of the four existing brands. In either case, we will create a new category, ‘Unknown’, for these records. The rest of the predictors are missing smaller percentages of values, and we can use imputation for these records.
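The missing-value table above can be reproduced with a few lines of base R. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration; only the column names come from the dataset), so the percentages it prints are not those of the real data.

```r
# Hypothetical toy data; only the column names mirror the real dataset.
toy <- data.frame(
  Brand.Code = c("A", NA, "B", "C", NA),
  MFR        = c(706.3, NA, 724.0, 731.0, 720.5),
  PH         = c(8.44, 8.54, 8.68, 8.36, 8.50),
  stringsAsFactors = FALSE
)

# Count the missing values per column and express them as a percentage of rows.
n_miss   <- colSums(is.na(toy))
miss_tbl <- data.frame(
  variable = names(n_miss),
  n_miss   = as.vector(n_miss),
  pct_miss = 100 * as.vector(n_miss) / nrow(toy),
  stringsAsFactors = FALSE
)

# Sort so that the variables with the most missing values come first.
miss_tbl <- miss_tbl[order(-miss_tbl$n_miss), ]
print(miss_tbl)
```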

 

From the above plots, we can see that a lot of the predictors are significantly skewed, suggesting that we might need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. We also see a number of bimodal variables such as Carb Flow, Balling, and Balling Level.

Boxplots

We now use boxplots to check the spread of each predictor.

The boxplots reveal outliers, but we don’t have a strong reason to impute or drop them from the dataset.

Predictor-Target Correlations

We will now derive the correlations for the numeric predictors. This will enable us to focus on those predictors that show stronger positive or negative correlations with PH. Predictors with correlations closer to zero will most likely not provide any meaningful information for the target variable.

##          values               ind
## 1   0.361587534     Bowl.Setpoint
## 2   0.352043962      Filler.Level
## 3   0.233593699         Carb.Flow
## 4   0.219735497   Pressure.Vacuum
## 5   0.196051481          Carb.Rel
## 6   0.166682228          Alch.Rel
## 7   0.164485364     Oxygen.Filler
## 8   0.109371168       Balling.Lvl
## 9   0.098866734         PC.Volume
## 10  0.095546936           Density
## 11  0.076700227           Balling
## 12  0.076213407     Carb.Pressure
## 13  0.072132509       Carb.Volume
## 14  0.032279368         Carb.Temp
## 15 -0.007997231     Air.Pressurer
## 16 -0.023809796          PSC.Fill
## 17 -0.040882953      Filler.Speed
## 18 -0.045196477               MFR
## 19 -0.047066423     Hyd.Pressure1
## 20 -0.069873041               PSC
## 21 -0.085259857           PSC.CO2
## 22 -0.118335903       Fill.Ounces
## 23 -0.118764185    Carb.Pressure1
## 24 -0.171434026     Hyd.Pressure4
## 25 -0.182659650       Temperature
## 26 -0.222660048     Hyd.Pressure2
## 27 -0.268101792     Hyd.Pressure3
## 28 -0.311663908 Pressure.Setpoint
## 29 -0.316514463     Fill.Pressure
## 30 -0.357611993        Usage.cont
## 31 -0.459231253          Mnf.Flow

From the above, we can see that the variables Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH. The remaining features have correlations closer to zero, which suggests they carry less linear predictive information about PH.
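A ranking like the one above can be obtained by correlating each numeric predictor with PH and sorting the result. The following is a minimal sketch on made-up values (the `toy` data frame is an assumption, not the real data); `use = "pairwise.complete.obs"` is one way to keep rows with missing predictor values from producing `NA` correlations.

```r
# Hypothetical toy data; only the column names mirror the real dataset.
toy <- data.frame(
  PH            = c(8.2, 8.3, 8.4, 8.5, 8.6, 8.7),
  Bowl.Setpoint = c(90, 100, 100, 110, 120, 120),
  Mnf.Flow      = c(140, 120, 100, NA, -100, -100)
)

# Correlate every predictor with the target, skipping missing pairs.
predictors <- toy[setdiff(names(toy), "PH")]
cors <- sapply(predictors, function(x) {
  cor(x, toy$PH, use = "pairwise.complete.obs")
})

# Strongest positive correlations first, strongest negative last.
cors_sorted <- sort(cors, decreasing = TRUE)
print(cors_sorted)
```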

Multicollinearity

One problem that can occur with multiple regression and other models is a correlation between predictors or multicollinearity. A quick check is to run correlations between all predictors.

We can see that several predictors are highly correlated with one another (correlations between 0.75 and 1), including Balling Level, Carb Volume, Carb Rel, Alch Rel, Density, and Balling. When we start examining predictors for our models, we’ll have to consider the correlations between them and avoid including pairs with strong correlations.

In general, it looks like many of the predictors go hand-in-hand with other features and multicollinearity could be a problem.

## 6 variables from the 31 input variables have collinearity problem: 
##  
## Balling Bowl.Setpoint Balling.Lvl MFR Hyd.Pressure3 Alch.Rel 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( Pressure.Setpoint ~ PC.Volume ):  0.0001991472 
## max correlation ( Carb.Rel ~ Density ):  0.852689 
## 
## ---------- VIFs of the remained variables -------- 
##            Variables       VIF
## 1        Carb.Volume 17.159340
## 2        Fill.Ounces  1.153764
## 3          PC.Volume  1.685901
## 4      Carb.Pressure 43.298925
## 5          Carb.Temp 35.460787
## 6                PSC  1.155980
## 7           PSC.Fill  1.109449
## 8            PSC.CO2  1.064636
## 9           Mnf.Flow  4.262754
## 10    Carb.Pressure1  1.434406
## 11     Fill.Pressure  3.490621
## 12     Hyd.Pressure1  2.935470
## 13     Hyd.Pressure2  4.900597
## 14     Hyd.Pressure4  1.752686
## 15      Filler.Level  2.618103
## 16      Filler.Speed  1.273500
## 17       Temperature  1.151185
## 18        Usage.cont  1.718776
## 19         Carb.Flow  1.987496
## 20           Density  4.499376
## 21   Pressure.Vacuum  2.054866
## 22     Oxygen.Filler  1.561878
## 23 Pressure.Setpoint  3.300894
## 24     Air.Pressurer  1.167671
## 25          Carb.Rel  6.339500

The vifcor function from the usdm package allows us to do an early analysis into multi-collinearity. As can be seen from the above, this function tells us that 6 of the 31 numeric predictors are highly correlated.
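For readers unfamiliar with the VIF column, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data (the variables `x1`, `x2`, `x3` are made up for illustration); it shows only the VIF calculation, not the stepwise exclusion that vifcor performs. A common rule of thumb treats values above 10 as a sign of problematic collinearity.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF for each column: regress it on the remaining columns and use 1 / (1 - R^2).
vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

In this toy case `x1` and `x2` receive very large VIFs while `x3` stays close to 1.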

Near-Zero Variance Predictors

Lastly, we want to check for any features that show near-zero variance. Predictors that take the same value across most of the instances will add little predictive information.

##               freqRatio percentUnique zeroVar  nzv
## Hyd.Pressure1  31.11111      9.529366   FALSE TRUE

Since “Hyd Pressure1” displays near-zero variance, we will drop this feature prior to modeling.
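The two statistics in the output above can be computed by hand. The sketch below is a hypothetical helper (`nzv_stats` is not part of caret) that follows our understanding of caret's defaults: a predictor is flagged when the ratio of its most common value to its second most common value exceeds 95/5 = 19 and the percentage of unique values is at most 10. This is consistent with Hyd Pressure1 being flagged (freqRatio 31.1, percentUnique 9.5).

```r
# Hypothetical helper; the cutoffs are assumed to match caret's defaults.
nzv_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the second most frequent value.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = pct_unique,
    nzv           = freq_ratio > freq_cut && pct_unique <= unique_cut
  )
}

# Toy vector dominated by a single value: 96 zeros, 3 ones and a single two.
x   <- c(rep(0, 96), rep(1, 3), 2)
res <- nzv_stats(x)
print(res)
```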

 

Data Preparation

To summarize our data preparation and exploration, we distinguish our findings into a few categories below.

Removed Fields

  • MFR has more than 8% missing values, so we remove this predictor.
  • Hyd Pressure1 shows near-zero variance, so we remove this predictor.
  • We remove the 4 rows with missing PH.
  • We replace missing values for Brand Code with “Unknown”.
  • We impute the remaining missing values using predictive mean matching via the mice package.
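The first four steps above can be sketched in base R as follows. The data frame `toy` and its values are made up for illustration, and the code is only one possible way to carry out these steps, not necessarily the code used to build beverage.train.clean.

```r
# Hypothetical toy data; only the column names mirror the real dataset.
toy <- data.frame(
  Brand.Code    = c("A", "B", NA, "D"),
  MFR           = c(706.3, NA, 724.0, 731.0),
  Hyd.Pressure1 = c(0, 0, 0, 11.4),
  Carb.Volume   = c(5.29, 5.35, NA, 5.45),
  PH            = c(8.44, NA, 8.54, 8.68),
  stringsAsFactors = FALSE
)

# 1. Drop the two predictors flagged during exploration.
toy.clean <- toy[, !(names(toy) %in% c("MFR", "Hyd.Pressure1"))]

# 2. Drop rows that have no target value.
toy.clean <- toy.clean[!is.na(toy.clean$PH), ]

# 3. Recode missing brand codes as their own category.
toy.clean$Brand.Code[is.na(toy.clean$Brand.Code)] <- "Unknown"

# Remaining NAs (here one Carb.Volume value) are left for the imputation step.
print(colSums(is.na(toy.clean)))
```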

 

Imputing Missing Values

31 out of 33 variables contain missing values, in quantities ranging from 1 to 212. This is enough to justify imputation. Rather than removing entire observations with missing values and jeopardizing the accuracy of the data, we will use the mice package’s mice() function to impute them.

The mice package offers an array of imputation methods (predictive mean matching, mean, and norm, to name a few). Because the dataset contains both numeric and categorical variables, we have decided to use predictive mean matching, as this covers both variable types.

set.seed(200)

# Impute missing values in training data using the Predictive mean matching imputation method.
beverage.train.clean <- mice(beverage.train.clean, m = 1, method = 'pmm', printFlag = FALSE) %>% complete()

# After imputation, check if any missing values remain.
colSums(is.na(beverage.train.clean))
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 0                 0                 0                 0 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 0                 0                 0 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 0                 0                 0                 0 
##     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4      Filler.Level 
##                 0                 0                 0                 0 
##      Filler.Speed       Temperature        Usage.cont         Carb.Flow 
##                 0                 0                 0                 0 
##           Density           Balling   Pressure.Vacuum                PH 
##                 0                 0                 0                 0 
##     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint     Air.Pressurer 
##                 0                 0                 0                 0 
##          Alch.Rel          Carb.Rel       Balling.Lvl 
##                 0                 0                 0
# Impute missing values in evaluation data using the Predictive mean matching imputation method.
beverage.eval.clean <- mice(beverage.eval.clean, m = 1, method = 'pmm', printFlag = FALSE) %>% complete()

As per the above results, we can confirm that the missing values have been eliminated after imputation.

Convert Categorical Variable to Dummy variables

“Brand.Code” is a categorical variable with values A, B, C, D and Unknown. So we will convert it to a set of dummy variables for modeling.
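One way to build such dummy variables in base R is `model.matrix()`; the sketch below is an illustration on a made-up vector and is not necessarily the function used in this report. Removing the intercept (`- 1`) yields one indicator column per level, named in the same Brand.CodeA ... Brand.CodeUnknown pattern that appears in the outputs further down.

```r
# Hypothetical toy data with the five brand categories.
toy <- data.frame(
  Brand.Code = factor(c("A", "B", "C", "D", "Unknown", "B"))
)

# One 0/1 indicator column per level; "- 1" drops the intercept column.
dummies <- model.matrix(~ Brand.Code - 1, data = toy)
print(colnames(dummies))
print(dummies)
```

Keeping all five indicators makes them sum to 1 in every row, so for linear models with an intercept one level is usually dropped to avoid perfect collinearity.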

Transform Predictors With Skewed Distributions

As discussed earlier, some of the predictors are highly skewed. To address this, we scale, center, and apply the Box-Cox transformation to them using the “preProcess” function from the “caret” package. These transformations should result in more normal distributions.

## Created from 267 samples and 34 variables
## 
## Pre-processing:
##   - Box-Cox transformation (22)
##   - centered (34)
##   - ignored (0)
##   - scaled (34)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.000  -2.000   0.100  -0.300   0.675   2.000
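To make the transformation concrete, the sketch below applies the Box-Cox formula, (x^lambda - 1) / lambda with log(x) as the lambda = 0 case, followed by centering and scaling. The helper `box_cox`, the simulated data and the fixed lambda are assumptions for illustration; preProcess estimates a separate lambda per predictor. Box-Cox is defined only for strictly positive values, which is likely why only 22 of the 34 variables were transformed.

```r
# Hypothetical helper: Box-Cox transform for a fixed lambda.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox requires strictly positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness measure for comparison.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 1)  # simulated right-skewed predictor

# Transform, then center and scale to mean 0 and standard deviation 1.
x_bc  <- box_cox(x, lambda = 0)
x_std <- as.vector(scale(x_bc))

print(c(raw = skew(x), transformed = skew(x_std)))
```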

Here are some plots to demonstrate the changes in distributions after the transformations:

# Prepare data for ggplot.
gather_df <- bev.train.transformed %>% dplyr::select(-c(PH)) %>% gather(key = 'variable', value = 'value')

# Histogram plots of each variable.
ggplot(gather_df) + geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(. ~ variable, scales = 'free', ncol = 4)

As expected, the plots of the dummy variables are binary. For the others, we can still see bimodal predictors since we did not apply any feature engineering to them. Some predictors such as ‘PSC Fill’ and ‘Temperature’ still show some skewness, but we can move on to building the models.

Pre-Modeling Data Splitting

Here, we perform a train-test split with an 80:20 ratio.

# Split the training data into train and test sets using an 80% data split.
trainingData <- createDataPartition(bev.train.transformed$PH, p = 0.8, list = FALSE)

# Training data splits.
trainingDataSet <- bev.train.transformed[trainingData, ]
xTrainData <- subset(trainingDataSet, select = -PH)
yTrainData <- subset(trainingDataSet, select = PH)

# Test data splits.
testDataSet <- bev.train.transformed[-trainingData, ]
xTestData <- subset(testDataSet, select = -PH)
yTestData <- subset(testDataSet, select = PH)

Model Building/Fitting

In this section, we will build and run 3 categories of models: tree, linear, and non-linear. We will then compare the results of the models in each category, select the best category performer, and then select the overall best performer.

Non-Linear Models

In the non-linear category, we will build and run 2 models - a Support Vector Machine (SVM) model, and a K-Nearest Neighbors (KNN) model. We will use the caret package’s train() function to build the models, and use the same training and test datasets for both models.

Support Vector Machine (SVM) Model

set.seed(200)

# Define the SVM model.
svmModel = train(x = xTrainData, 
                 y = yTrainData$PH,
                 preProcess = c('center', 'scale'),
                 method = 'svmRadial', 
                 tuneLength = 10,
                 trControl = trainControl(method = 'repeatedcv'))

# Run predict() and postResample() on the model and display the results.
svmPrediction <- predict(svmModel, newdata = xTestData)
svmPerformance <- postResample(pred = svmPrediction, obs = yTestData$PH)
svmPerformance
##       RMSE   Rsquared        MAE 
## 0.10884499 0.59016597 0.08127628
# Start the results table and store the SVM test-set performance in it.
results <- data.frame()
results <- data.frame(t(postResample(pred = svmPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "SVM") %>%
  rbind(results)

 

K Nearest Neighbors (KNN) Model

set.seed(200)

# Define the KNN model.
knnModel <- train(x = xTrainData,
                  y = yTrainData$PH,
                  preProcess = c('center', 'scale'),
                  method = 'knn',
                  tuneLength = 10)

# Run predict() and postResample() on the model and display the results.
knnPrediction <- predict(knnModel, newdata = xTestData)
knnPerformance <- postResample(pred = knnPrediction, obs = yTestData$PH)
knnPerformance
##      RMSE  Rsquared       MAE 
## 0.1191253 0.5096819 0.0913447
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = knnPrediction, obs = yTestData$PH))) %>% mutate(Model = "k-Nearest Neighbors(kNN)") %>% rbind(results)

 

Linear Models

In the linear model category, we will build and run a generalized linear model (GLM), and a partial least squares (PLS) model.

Generalized Linear Model (GLM)

set.seed(200)

# Define the GLM model.
glmModel = train(PH ~ .,
                 data = trainingDataSet, 
                 metric = 'RMSE',
                 preProcess = c('center', 'scale'),
                 method = 'glm', 
                 trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))

# Run predict() and postResample() on the model and display the results.
glmModelPrediction <- predict(glmModel, xTestData)
glmPerformance <- postResample(pred = glmModelPrediction, obs = yTestData$PH)
glmPerformance
##      RMSE  Rsquared       MAE 
## 0.1295556 0.4133405 0.1024145
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = glmModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Generalized Linear Model(GLM)") %>% rbind(results)

 

Partial Least Squares Model (PLS)

set.seed(200)

# Define the PLS model.
plsModel = train(PH ~ .,
                 data = trainingDataSet, 
                 metric = 'RMSE',
                 preProcess = c('center', 'scale'),
                 method = 'pls', 
                 trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))

# Run predict() and postResample() on the model and display the results.
plsModelPrediction <- predict(plsModel, xTestData)
plsPerformance <- postResample(pred = plsModelPrediction, obs = yTestData$PH)
plsPerformance
##      RMSE  Rsquared       MAE 
## 0.1299029 0.4088798 0.1040693
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = plsModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Partial Least Squares(PLS)") %>% rbind(results)

 

Tree Models

In this category, we will build and run a cubist model, and a single tree model.

Cubist Model

set.seed(200)

# Define the Cubist model.
cubistModel <- cubist(xTrainData, 
                      yTrainData$PH, 
                      committees = 6)

# Run predict() and postResample() on the model and display the results.
cubistModelPrediction <- predict(cubistModel, newdata = xTestData)
cubistPerformance <- postResample(pred = cubistModelPrediction, obs = yTestData$PH)
cubistPerformance
##       RMSE   Rsquared        MAE 
## 0.10518426 0.61494856 0.07688106
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = cubistModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Tree Model(Cubist)") %>% rbind(results)

 

Single Tree Model

set.seed(100)

# Define the Single Tree model.
singleTreeModel <- train(xTrainData,
                         yTrainData$PH,
                         method = 'rpart2',
                         tuneLength = 10,
                         trControl = trainControl(method = 'cv'))

# Run predict() and postResample() on the model and display the results.
singleTreeModelPrediction <- predict(singleTreeModel, newdata = xTestData)
singleTreePerformance <- postResample(pred = singleTreeModelPrediction , obs = yTestData$PH) 
singleTreePerformance
##       RMSE   Rsquared        MAE 
## 0.12522668 0.45164526 0.09759854
# Predict on test data and calculate performance.
results <- data.frame(t(postResample(pred = singleTreeModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Single Tree Model") %>% rbind(results)

 

Model Comparisons

After running our models, we will now compare the results of the models and select the best performing model within each category. This will allow us to select the overall best performing model.

Non-Linear Model Comparisons

nonLinearComparisons <- rbind(
  'Support Vector Machine' = svmPerformance,
  'K Nearest Neighbors' = knnPerformance)

nonLinearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
RMSE Rsquared MAE
Support Vector Machine 0.1088450 0.5901660 0.0812763
K Nearest Neighbors 0.1191253 0.5096819 0.0913447

Using RMSE and Rsquared as the selection criteria for the best performing model, the support vector machine model yielded the best performance. Its Rsquared value is 0.59, which tells us that the model explains 59% of the variability in PH on the test set. This beats the Rsquared value of the KNN model (51%), though not by a wide margin.

 

Linear Model Comparisons

linearComparisons <- rbind(
  'Generalized Linear Model' = glmPerformance,
  'Partial Least Squares' = plsPerformance)

linearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
RMSE Rsquared MAE
Generalized Linear Model 0.1295556 0.4133405 0.1024145
Partial Least Squares 0.1299029 0.4088798 0.1040693

Again, using RMSE and Rsquared to select the best model, the GLM and PLS models are almost the same in terms of their performance. However, the generalized linear model performs slightly better than the partial least squares model. The GLM model explains 41% of the data variance which is higher than the Rsquared value of the PLS model by a fraction.

 

Tree Model Comparisons

treeComparisons <- rbind(
  'Cubist' = cubistPerformance,
  'Single Tree' = singleTreePerformance)

treeComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
RMSE Rsquared MAE
Cubist 0.1051843 0.6149486 0.0768811
Single Tree 0.1252267 0.4516453 0.0975985

Finally, in the tree model category, the cubist model has a lower RMSE than the single tree model and explains 61% of the data variance (as opposed to the single tree model’s 45%), so the cubist model is the best performing model in this category.

Model Summary

We now consolidate the results from all the models using the following criteria: root mean squared error (RMSE), R-squared, and Mean Absolute Error (MAE). The table below lists these criteria for each model.

results %>% dplyr::select(Model, RMSE, Rsquared, MAE)
##                           Model      RMSE  Rsquared        MAE
## 1             Single Tree Model 0.1252267 0.4516453 0.09759854
## 2            Tree Model(Cubist) 0.1051843 0.6149486 0.07688106
## 3    Partial Least Squares(PLS) 0.1299029 0.4088798 0.10406927
## 4 Generalized Linear Model(GLM) 0.1295556 0.4133405 0.10241455
## 5      k-Nearest Neighbors(kNN) 0.1191253 0.5096819 0.09134470
## 6                           SVM 0.1088450 0.5901660 0.08127628
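For reference, the three criteria can be computed directly from predictions and observed values. The sketch below uses made-up numbers (`obs` and `pred` are assumptions for illustration); it follows our understanding that postResample() reports Rsquared as the squared correlation between predicted and observed values.

```r
# Hypothetical observed and predicted PH values.
obs  <- c(8.40, 8.52, 8.61, 8.70, 8.35)
pred <- c(8.45, 8.50, 8.55, 8.72, 8.41)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
rsq  <- cor(pred, obs)^2            # squared correlation

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.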

 

Model Selection And Top Predictor Analysis

Based on the RMSE and RSquared values of all the models we ran, the Cubist model is the overall best performer. This is expected given that this model is more tolerant of multi-collinearity and works well with non-linear features. The Rsquared for the Cubist model tells us that it explains 61% of the data variance which falls within an acceptable RSquared value range. Based on this, we will proceed with the Cubist model as the best predictive model for this project.
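The selection can also be made programmatically by sorting the models on test-set RMSE. The sketch below re-enters the values reported in the Model Summary table into a data frame named `reported` (a name chosen here for illustration) and ranks them.

```r
# Values copied from the Model Summary table above.
reported <- data.frame(
  Model = c("Single Tree Model", "Tree Model(Cubist)",
            "Partial Least Squares(PLS)", "Generalized Linear Model(GLM)",
            "k-Nearest Neighbors(kNN)", "SVM"),
  RMSE     = c(0.1252267, 0.1051843, 0.1299029, 0.1295556, 0.1191253, 0.1088450),
  Rsquared = c(0.4516453, 0.6149486, 0.4088798, 0.4133405, 0.5096819, 0.5901660),
  stringsAsFactors = FALSE
)

# Lowest RMSE first; the first row is the selected model.
ranked <- reported[order(reported$RMSE), ]
print(ranked)
best_model <- ranked$Model[1]
```

Ranking by Rsquared in descending order gives the same ordering for these six models.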

Let’s inspect the predictors that this model found important.

var.imp.cubist<-varImp(cubistModel, scale = FALSE)
var.imp.cubist
##                   Overall
## Mnf.Flow             73.0
## Alch.Rel             57.5
## Balling.Lvl          54.5
## Pressure.Vacuum      54.0
## Brand.CodeC          27.5
## Bowl.Setpoint        40.0
## Carb.Flow            36.5
## Filler.Speed         24.0
## Oxygen.Filler        44.5
## Balling              53.5
## Carb.Rel             35.5
## Usage.cont           27.0
## Density              41.5
## Air.Pressurer        32.5
## Hyd.Pressure3        36.5
## Hyd.Pressure2        25.5
## Carb.Pressure1       29.5
## Temperature          29.5
## Filler.Level         16.0
## PC.Volume            10.5
## Pressure.Setpoint    10.5
## Carb.Volume          16.5
## Carb.Pressure        22.5
## Carb.Temp            19.0
## Brand.CodeB          13.0
## PSC.Fill              4.5
## Brand.CodeD           4.0
## Fill.Pressure         3.0
## Hyd.Pressure4         3.0
## Fill.Ounces           2.0
## PSC                   2.0
## PSC.CO2               2.0
## Brand.CodeA           1.0
## Brand.CodeUnknown     0.0

Interestingly, the list of important predictors contains some that had strong correlations (positive or negative) with the target variable, for example Alch Rel, Bowl Setpoint, Carb Flow, Pressure Vacuum, Oxygen Filler, and Mnf Flow. At the same time, other predictors that showed strong correlation with PH did not make it into the top 10 important predictors, for example Filler Level, Carb Rel, Usage cont, Fill Pressure, Temperature, Pressure Setpoint, Hyd Pressure2, and Hyd Pressure3.

Instead, the most important predictors include variables such as Balling Lvl, Balling, and Density, which did not show the strongest correlations with PH.

This begins to make more sense when we compare to the predictor-predictor correlations calculated previously as well as the results of the vifcor function used previously. We can see that Carb Rel and Alch Rel are strongly correlated, as are Alch Rel and Hyd Pressure3. This indicates that the model is taking into account multi-collinearity and avoiding predictors that are strongly correlated with others that have already been selected and thereby do not provide incremental predictive power.
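Because the importance output above is not printed in descending order, sorting it makes the top predictors easier to read. The sketch below re-enters the twelve largest Overall scores from that output into a named vector (`imp` is a name chosen here for illustration) and sorts them.

```r
# The twelve largest Overall scores, copied from the varImp output above.
imp <- c(
  Mnf.Flow = 73.0, Alch.Rel = 57.5, Balling.Lvl = 54.5, Pressure.Vacuum = 54.0,
  Bowl.Setpoint = 40.0, Carb.Flow = 36.5, Oxygen.Filler = 44.5, Balling = 53.5,
  Carb.Rel = 35.5, Density = 41.5, Air.Pressurer = 32.5, Hyd.Pressure3 = 36.5
)

# Sort from most to least important and keep the top ten.
imp_sorted <- sort(imp, decreasing = TRUE)
top10 <- head(imp_sorted, 10)
print(top10)
```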

 

Predictions

Now that we have identified the Cubist model as the best predictive model, we will apply the model to the evaluation dataset by replacing the empty PH values in the evaluation dataset with the Cubist model’s predictions.

# Define the "evaluationDataClean" variable.
evaluationDataClean <- bev.eval.transformed

# Run predict() on the Cubist model.
cubistPredictions <- predict(cubistModel, newdata = evaluationDataClean)

# Replace the empty PH values in the evaluation set with the Cubist predictions.
evaluationDataClean$PH <- round(cubistPredictions, 2)

# Take a look at the evaluation data after PH value replacement.
head(evaluationDataClean, 20) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')
Brand.CodeA Brand.CodeB Brand.CodeC Brand.CodeD Brand.CodeUnknown Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow Density Balling Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl PH
-0.3876816 -0.9650293 -0.3617512 1.7776376 -0.1754205 1.0173477 0.8453014 -0.1261451 -0.7151727 -1.6286322 2.2451102 1.6099559 -0.2942791 -1.027837 -1.4492875 -0.6003036 -1.151696 -1.17903 -0.0759471 1.3488585 0.5086529 -0.1059513 0.1956905 0.4661381 -0.7278954 -0.8968551 2.3563890 -0.3646397 1.5281699 -1.2774435 -0.1787716 -0.6935289 -0.7893006 -0.6467639 8.63
2.5697753 -0.9650293 -0.3617512 -0.5604375 -0.1754205 0.2569971 -0.2061344 -0.8152380 -1.4064544 -1.5210765 -0.7797427 0.4326833 0.7395699 -1.027837 -0.9470071 -0.5412123 -1.163307 -1.17903 1.0115862 0.6212109 0.5500784 -0.3700120 -1.1183338 0.4368621 0.8571200 0.9606936 1.3300336 -0.0738317 0.7065401 -0.8299445 3.5102683 0.5986430 1.1043790 1.1194968 8.44
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.6671561 -0.6473493 0.4039264 -0.4234268 -0.1578536 -0.1784639 -0.7460745 -0.8112036 -1.027837 -0.6293102 -0.6595235 -1.163307 -1.17903 0.0695220 0.5766276 0.5468822 -0.3700120 1.1479870 0.5574101 -0.6723612 -0.7892285 1.6721521 0.3576903 0.7065401 -0.8299445 3.0487050 -0.7955619 -0.7893006 -0.6694082 8.60
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.9225304 -0.3823992 -1.4619252 -0.8967401 -0.4960754 -2.3057805 0.2696713 -0.8112036 -1.027837 0.4042797 -2.4369491 -1.163307 -1.17903 2.1707445 0.6361217 -0.2105342 4.4933952 -0.9652103 -2.0498698 -1.1279666 -1.8425756 2.0142706 0.7859125 0.7065401 -0.8299445 2.8935880 -0.8994902 0.5006257 -0.6467639 8.59
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 0.3763584 3.0023969 -1.8753809 0.3774154 0.2624104 -0.8329359 1.0082946 0.2226454 -1.027837 -1.8169667 0.9525837 -1.163307 -1.17903 -0.2244790 0.3282101 0.5596763 0.1533516 0.0752223 0.6934571 -0.7278954 -0.8968551 2.0142706 1.0031796 0.7065401 1.0948096 2.4244009 -0.8472862 -0.4560038 -0.6694082 8.58
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.7306375 1.7999363 -1.0484694 1.2962940 1.3499991 0.0209609 0.4326833 1.2564944 -1.027837 -0.9925143 -0.4822488 -1.163307 -1.17903 -0.2244790 0.6510572 0.5468822 0.2812546 -1.0009422 0.5642985 -0.8401151 -1.1317616 2.3563890 0.7172809 0.7065401 -0.8299445 2.5814395 -0.8472862 -0.1300589 -0.6920526 8.56
2.5697753 -0.9650293 -0.3617512 -0.5604375 -0.1754205 1.0173477 -0.4706421 -0.5502023 -0.7751385 -1.6286322 0.2079618 -0.2880740 -1.3281282 -1.027837 -1.2205124 -0.5412123 -1.163307 -1.17903 0.7550124 0.5914639 0.5468822 0.4080104 -1.0950665 0.5453553 0.8095406 0.9279513 1.6721521 0.2627355 0.7065401 -0.8299445 1.7897350 0.6763990 0.1887486 1.0968525 8.49
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 0.4948399 1.2803082 -2.4690609 -0.1445701 -0.4960754 -0.0178523 -0.7460745 -0.2942791 -1.027837 -0.3581749 -2.4369491 -1.163307 -1.17903 0.7550124 1.5107538 0.5023036 0.0242877 -2.3000814 -0.3759754 1.0922312 1.1916247 1.3300336 1.1925299 0.7065401 -0.8299445 2.5814395 0.6376839 -0.1300589 1.0742081 8.60
2.5697753 -0.9650293 -0.3617512 -0.5604375 -0.1754205 0.3763584 -0.6473493 0.8703893 -0.3103806 -0.7439904 2.3548087 2.0287088 -0.2942791 -1.027837 2.8590587 -1.2589854 -1.163307 -1.17903 0.8844656 0.6960130 0.5468822 -0.2373797 -1.0892332 0.0803846 0.9045094 0.9939035 1.3300336 0.3576903 0.7065401 -0.8299445 2.7378340 0.5986430 0.0302239 1.1874299 8.54
-0.3876816 -0.9650293 -0.3617512 1.7776376 -0.1754205 0.9601386 0.7580828 -0.3487751 1.1245987 0.6670133 1.1282743 -0.7460745 -0.8112036 -1.027837 0.8045664 -2.1844568 -1.163307 -1.17903 0.6231393 0.6810029 -2.4548396 -0.1059513 0.6135222 -2.0498698 0.8095406 0.9265682 1.6721521 1.1925299 0.7065401 -0.8299445 2.5814395 1.7026429 0.6540279 1.2100743 8.61
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -1.7799105 -2.0717096 1.0400121 -1.1468189 -0.0630670 0.6173947 1.2615772 -0.2942791 -1.027837 -0.4032894 -0.4234122 -1.163307 -1.17903 0.0695220 0.6361217 0.5468822 -0.5038630 -0.3612107 0.6572927 -0.7838095 -1.0108226 1.6721521 0.7520127 0.7065401 -0.8299445 3.3570417 -0.7955619 -0.6217195 -0.7599857 8.74
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.9869816 1.4538053 -0.9212523 -1.4064544 -0.3498769 1.6328781 0.7338316 3.8411170 -1.027837 -1.3119287 -0.5412123 -1.163307 -1.17903 -0.0759471 0.5028187 0.5468822 -0.2373797 -1.2444176 0.5952966 -0.7838095 -1.0108226 1.6721521 0.4027100 0.7065401 -0.8299445 3.3570417 -0.8472862 -0.4560038 -0.7146970 8.55
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.6039141 1.1934515 0.0646806 -0.8356587 -0.5452300 1.2119820 -0.5055641 -1.3281282 -1.027837 -1.3576836 -0.6595235 -1.163307 -1.17903 0.2120521 0.5914639 0.5468822 -0.5038630 -0.2007139 0.4454727 -0.6171961 -0.6873889 0.6457967 0.7520127 0.7065401 -0.8299445 3.3570417 -0.7443114 -1.3035206 -0.6694082 8.63
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.5409102 -0.3823992 0.1706949 -1.2108131 -0.9462379 0.4188365 0.0965566 -0.8112036 -1.027837 -0.2230094 -0.4822488 -1.163307 -1.17903 0.2120521 0.6063250 0.5564754 -0.3700120 0.1175229 0.4695823 -0.5623897 -0.5908470 0.6457967 0.4462532 0.7065401 -0.8299445 -0.3473212 -0.7443114 -1.8355740 -0.6920526 8.57
-0.3876816 -0.9650293 2.7539772 -0.5604375 -0.1754205 -0.8583235 0.0577117 0.6795635 -0.9583897 -0.2534538 0.0592796 0.8740066 1.2564944 -1.027837 -1.4951365 -0.5412123 -1.163307 -1.17903 -0.3762053 0.6361217 0.5500784 0.9038331 -0.0086665 0.5729091 -0.4538136 -0.4119692 1.6721521 0.3576903 0.7065401 -0.8299445 -0.3473212 -0.5439333 -1.4788500 -0.5108976 8.43
-0.3876816 -0.9650293 -0.3617512 -0.5604375 5.6792381 -1.0516784 -1.0904126 0.6583607 -0.8356587 -0.2534538 -0.6273472 0.7338316 -0.8112036 -1.027837 0.5824681 -1.3805093 -1.163307 -1.17903 0.8844656 1.4294958 -2.4540372 1.7303860 -0.9532469 -2.0464255 -0.9539501 -1.3975431 0.6457967 1.8488327 0.7065401 -0.8299445 -0.5165825 -0.7955619 -1.3035206 -0.5108976 8.60
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.2294224 0.2332426 -0.3911808 0.6218844 0.8420124 0.6494338 -0.5055641 -1.3281282 -1.027837 -1.1292205 -0.7188728 -1.163307 -1.17903 -0.5312665 0.5618162 0.5277389 -0.1059513 -0.8199085 0.5126351 -0.7278954 -0.8968551 2.0142706 1.2669589 0.7065401 -0.8299445 -0.6865594 -0.8472862 -0.6217195 -0.7373413 8.52
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.9225304 -0.3823992 -0.0519351 -0.7151727 -0.2055514 0.7748625 1.7190407 0.2226454 -1.027837 -1.4951365 -0.5412123 -1.163307 -1.17903 -0.5312665 0.6660177 0.5181887 -0.2373797 -0.9472553 0.5866860 -0.5623897 -0.5908470 1.3300336 0.6451271 0.7065401 -0.8299445 -0.5165825 -0.7955619 -1.1301723 -0.6920526 8.54
-0.3876816 -0.9650293 -0.3617512 1.7776376 -0.1754205 1.2441099 -1.0016517 -0.4547894 0.1221458 -0.5945976 -0.3931767 -0.5055641 -0.8112036 -1.027837 -1.1292205 -0.5412123 -1.163307 -1.17903 -1.7241003 0.6361217 0.5245539 -1.3334161 0.2028362 0.4540833 1.1850323 1.1748584 1.6721521 1.7973723 0.7065401 -0.8299445 -0.3473212 1.6722219 1.3966330 1.2553630 8.62
-0.3876816 1.0323569 -0.3617512 -0.5604375 -0.1754205 -0.4156124 -0.1181122 -1.3877152 -0.4234268 -0.2055514 -0.8874764 -1.7215960 -1.3281282 -1.027837 -1.1748510 -0.7783524 -1.163307 -1.17903 -0.3762053 0.6212109 0.5245539 -0.5038630 0.4202745 0.4850813 -0.6171961 -0.6873889 1.3300336 -0.3646397 0.7065401 -0.8299445 -0.1787716 -0.6935289 -0.4560038 -0.6694082 8.55

 

Looking at the PH column (the last column) in the final data sample above, we can see that the empty PH values have now been replaced with our Cubist model predictions.

# Inspect range of predicted values for evaluation dataset.
summary(evaluationDataClean$PH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.130   8.430   8.510   8.504   8.590   8.870
histogram(evaluationDataClean$PH)

From the above, we can see that the predicted PH values for the evaluation dataset range from 8.13 to 8.87, which is slightly narrower than the range of the target variable in the training data. This gives us some confidence that the model behaves sensibly on unseen data, although it is a sanity check rather than a measure of accuracy. The real test would be to compare the predictions against the actual PH values for the evaluation data, which we did not have access to. While not perfectly normal, the distribution of the predicted PH values is reasonably close to normal.
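The range comparison described above can be sketched in a few lines of base R. This is an illustration only: `train_ph` and `pred_ph` are simulated stand-ins (assumptions, not the project data) for the observed training PH values and the predicted evaluation PH values.

```r
# Simulated stand-ins (hypothetical): in the project these would be the
# observed training PH values and the predicted evaluation PH values.
set.seed(42)
train_ph <- rnorm(2500, mean = 8.55, sd = 0.17)
pred_ph  <- rnorm(260,  mean = 8.50, sd = 0.12)

# Side-by-side quantiles (min, quartiles, max) of the two vectors.
probs <- c(0, 0.25, 0.5, 0.75, 1)
range_check <- rbind(
  training   = quantile(train_ph, probs = probs),
  evaluation = quantile(pred_ph,  probs = probs)
)
print(round(range_check, 3))

# Share of predictions that fall inside the observed training range.
inside <- pred_ph >= min(train_ph) & pred_ph <= max(train_ph)
cat("Share of predictions inside the training range:", mean(inside), "\n")
```

A share close to 1 means the model is not extrapolating beyond PH values seen in training. It says nothing about how close each prediction is to the true value.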

Export Final Results As A CSV File

Finally, we will write the final results to a CSV file.

# Save the final results to a CSV file.
write.csv(evaluationDataClean, './data/FinalPHPredictions.csv', row.names = FALSE)

Conclusion

Based on the exploratory data analysis of the training dataset, we prepared the data as follows: we dropped some columns and rows, created a separate “Unknown” group for missing brand codes, created dummy variables for the single categorical variable (Brand Code), imputed missing values and applied a Box-Cox transformation to the predictors to reduce their skewness.

After this, we used 3 categories of models: linear, non-linear and tree-based. We trained 2 models from each category for a total of 6 models. We used a combination of RMSE and R-squared as the performance metrics to decide the final model. We chose the Cubist model because its metrics were clearly better than those of the other models: it has both the lowest RMSE and the highest R-squared. This is not surprising, given that tree-based models such as Cubist tend to handle non-linear relationships and multicollinearity better than linear models. This comes across in the list of top predictors selected by this model, as described in a previous section.
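As a minimal sketch of the comparison described above, the two metrics can be computed in base R and tabulated per model. The observed values, the model names `model_a` and `model_b`, and their predictions are simulated placeholders (assumptions for illustration), and R-squared is taken here as the squared correlation between observed and predicted values, which is one common definition.

```r
# RMSE: square root of the mean squared prediction error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R-squared, computed as the squared correlation of observed and predicted.
r_squared <- function(obs, pred) cor(obs, pred)^2

# Simulated hold-out data and predictions from two hypothetical models.
set.seed(1)
obs <- rnorm(200, mean = 8.5, sd = 0.15)
preds <- list(
  model_a = obs + rnorm(200, mean = 0, sd = 0.10),
  model_b = obs + rnorm(200, mean = 0, sd = 0.13)
)

# One row per model, sorted so the lowest RMSE comes first.
results <- data.frame(
  model    = names(preds),
  RMSE     = sapply(preds, rmse, obs = obs),
  Rsquared = sapply(preds, r_squared, obs = obs)
)
results <- results[order(results$RMSE), ]
print(results)
```

Sorting by RMSE and reading R-squared alongside it mirrors the selection rule used here: a model that wins on both metrics is an unambiguous choice.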

For the final model selected, the top 5 predictors in terms of importance are: Mnf.Flow, Alch.Rel, Balling.Lvl, Pressure.Vacuum and Brand Code C. Finding Brand Code among the top predictors is interesting because brand is not a physical or chemical property that can be linked directly to PH levels. We think it acts as a proxy that collectively captures other chemical features, which in turn help explain the PH levels.

We see that the range of the predicted values in the evaluation data is in line with the range of the PH values in the training data, which gives us some confidence that the selected model is generalizable. In addition, the distribution of the predicted values is approximately normal.

As with any real-world data science process, the logical next step would be to calculate better accuracy metrics by comparing the predicted values to the actual PH values for the evaluation data. Our recommendation to the manager of ABC Beverages is to go with the Cubist model and to put in place an ongoing process that monitors the model and fine-tunes it if the model metrics show any deterioration.
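One possible shape for such a monitoring check is sketched below, assuming actual PH measurements eventually become available for each new batch. The function name `check_deterioration`, the baseline RMSE of 0.10, the 10% tolerance and the batch values are all hypothetical choices for illustration, not results from this analysis.

```r
# RMSE: square root of the mean squared prediction error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Flag a batch whose RMSE exceeds the baseline by more than `tolerance`
# (expressed as a fraction of the baseline). All thresholds are hypothetical.
check_deterioration <- function(obs, pred, baseline_rmse, tolerance = 0.10) {
  current <- rmse(obs, pred)
  list(
    rmse         = current,
    deteriorated = current > baseline_rmse * (1 + tolerance)
  )
}

# Hypothetical new batch: measured PH values and the model's predictions.
batch_obs  <- c(8.40, 8.52, 8.61, 8.47, 8.55)
batch_pred <- c(8.45, 8.50, 8.58, 8.52, 8.49)

status <- check_deterioration(batch_obs, batch_pred, baseline_rmse = 0.10)
print(status)
```

In practice the baseline would come from the cross-validated RMSE of the selected model, and a flagged batch would trigger a review or a retraining run.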

References

Linear Models with R: Julian Faraway.
Applied Predictive Modeling: Kuhn & Johnson.
https://newalbanysmiles.com/ph-values-of-common-beverages/