This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to them.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for the technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.
There are two files provided:
StudentData.xlsx: This is the training dataset. Note the PH column will be our target we are trying to predict.
StudentEvaluation.xlsx: This is the evaluation dataset. Note the PH column is empty in this dataset.
# Load the ABC Beverages' train dataset.
beverage.train <- read.csv('./data/StudentData.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)
# Load the ABC Beverages' evaluation dataset.
beverage.eval <- read.csv('./data/StudentEvaluation.csv', na.strings = c('', 'NA'), stringsAsFactors = FALSE)
The evaluation dataset contains an empty PH column, so we will remove it for now until it is needed later in the project.
# Remove the empty PH column from the evaluation data.
beverage.eval <- beverage.eval %>% dplyr::select(-PH)
The first step in our data exploration is to take a brief look at the evaluation dataset. To get an idea of its structure, we will print out the first 40 rows of the data.
# Examine the structure of the evaluation dataset.
head(beverage.eval, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')
| Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D | 5.480000 | 24.03333 | 0.2700000 | 65.4 | 134.6 | 0.236 | 0.40 | 0.04 | -100 | 116.6 | 46.0 | 0 | NA | NA | 96 | 129.4 | 3986 | 66.0 | 21.66 | 2950 | 0.88 | 727.6 | 1.398 | -3.8 | 0.022 | 130 | 45.2 | 142.6 | 6.56 | 5.34 | 1.48 |
| A | 5.393333 | 23.95333 | 0.2266667 | 63.2 | 135.0 | 0.042 | 0.22 | 0.08 | -100 | 118.8 | 46.2 | 0 | 0 | 0 | 112 | 120.0 | 4012 | 65.6 | 17.60 | 2916 | 1.50 | 735.8 | 2.942 | -4.4 | 0.030 | 120 | 46.0 | 147.2 | 7.14 | 5.58 | 3.04 |
| B | 5.293333 | 23.92000 | 0.3033333 | 66.4 | 140.4 | 0.068 | 0.10 | 0.02 | -100 | 120.2 | 45.8 | 0 | 0 | 0 | 98 | 119.4 | 4010 | 65.6 | 24.18 | 3056 | 0.90 | 734.8 | 1.448 | -4.2 | 0.046 | 120 | 46.0 | 146.6 | 6.52 | 5.34 | 1.46 |
| B | 5.266667 | 23.94000 | 0.1860000 | 64.8 | 139.0 | 0.004 | 0.20 | 0.02 | -100 | 124.8 | 40.0 | 0 | 0 | 0 | 132 | 120.2 | NA | 74.4 | 18.12 | 28 | 0.74 | NA | 1.056 | -4.0 | NA | 120 | 46.0 | 146.4 | 6.48 | 5.50 | 1.48 |
| B | 5.406667 | 24.20000 | 0.1600000 | 69.4 | 142.2 | 0.040 | 0.30 | 0.06 | -100 | 115.0 | 51.4 | 0 | 0 | 0 | 94 | 116.0 | 4018 | 66.4 | 21.32 | 3214 | 0.88 | 752.0 | 1.398 | -4.0 | 0.082 | 120 | 50.0 | 145.8 | 6.50 | 5.38 | 1.46 |
| B | 5.286667 | 24.10667 | 0.2120000 | 73.4 | 147.2 | 0.078 | 0.22 | NA | -100 | 118.6 | 46.4 | 0 | 0 | 0 | 94 | 120.4 | 4010 | 66.6 | 18.00 | 3064 | 0.84 | 732.0 | 1.298 | -3.8 | 0.064 | 120 | 46.0 | 146.0 | 6.50 | 5.42 | 1.44 |
| A | 5.480000 | 23.93333 | 0.2433333 | 65.2 | 134.6 | 0.088 | 0.14 | 0.00 | -100 | 117.6 | 46.2 | 0 | 0 | 0 | 108 | 119.6 | 4010 | 66.8 | 17.68 | 3042 | 1.48 | 729.8 | 2.894 | -4.2 | 0.042 | 120 | 46.0 | 145.0 | 7.18 | 5.46 | 3.02 |
| B | 5.420000 | 24.06667 | 0.1226667 | 67.4 | 139.0 | 0.076 | 0.10 | 0.04 | -100 | 121.4 | 40.0 | 0 | 0 | 0 | 108 | 131.4 | NA | NA | 12.90 | 1972 | 1.60 | NA | 3.320 | -4.4 | 0.096 | 120 | 46.0 | 146.0 | 7.16 | 5.42 | 3.00 |
| A | 5.406667 | 23.92000 | 0.3326667 | 66.8 | 138.0 | 0.246 | 0.48 | 0.04 | -100 | 136.0 | 43.8 | 0 | 0 | 0 | 110 | 121.0 | 4010 | 65.8 | 17.70 | 2502 | 1.52 | 741.2 | 2.992 | -4.4 | 0.046 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.10 |
| D | 5.473333 | 24.02667 | 0.2560000 | 72.6 | 144.0 | 0.146 | 0.10 | 0.02 | -100 | 126.6 | 40.8 | 0 | 0 | 0 | 106 | 120.8 | 1006 | 66.0 | 22.80 | 28 | 1.48 | NA | 2.892 | -4.2 | 0.096 | 120 | 46.0 | 146.0 | 7.78 | 5.52 | 3.12 |
| B | 5.180000 | NA | 0.3433333 | 64.0 | 140.8 | NA | 0.34 | 0.04 | -100 | 121.2 | 46.6 | 0 | 0 | 0 | 98 | 120.2 | 4010 | 65.4 | 20.04 | 3172 | 0.86 | 732.8 | 1.348 | -4.2 | 0.066 | 120 | 46.0 | 147.0 | 6.52 | 5.36 | 1.38 |
| B | 5.260000 | 24.08000 | 0.2200000 | 63.2 | 139.6 | 0.184 | 0.26 | 0.20 | -100 | 117.2 | 46.2 | 0 | 0 | 0 | 96 | 118.4 | 4010 | 65.8 | 17.16 | 3100 | 0.86 | 735.8 | 1.348 | -4.2 | 0.048 | 120 | 46.0 | 147.0 | 6.50 | 5.38 | 1.42 |
| B | 5.300000 | 24.06000 | 0.2820000 | 65.0 | 138.8 | 0.152 | 0.12 | 0.00 | -100 | 117.0 | 45.8 | 0 | 0 | 0 | 100 | 119.6 | 4010 | 65.4 | 20.52 | 2926 | 0.92 | 735.6 | 1.498 | -4.8 | 0.066 | 120 | 46.0 | 147.0 | 6.54 | 5.28 | 1.46 |
| B | 5.306667 | 23.94000 | 0.2886667 | 63.8 | 137.2 | 0.100 | 0.18 | 0.02 | -100 | 122.0 | 46.4 | 0 | 0 | 0 | 100 | 119.8 | 4016 | 65.6 | 21.44 | 2954 | 0.94 | 736.4 | 1.548 | -4.8 | 0.050 | 120 | 46.0 | 142.4 | 6.54 | 5.22 | 1.44 |
| C | 5.273333 | 23.97333 | 0.3206667 | 64.6 | 140.0 | 0.080 | 0.28 | 0.10 | -100 | 116.4 | 46.2 | 0 | 0 | 0 | 92 | 120.2 | 4012 | 67.6 | 21.08 | 3074 | 0.98 | 738.0 | 1.648 | -4.2 | 0.046 | 120 | 46.0 | 142.4 | 6.62 | 5.26 | 1.60 |
| NA | 5.253333 | 23.88667 | 0.3193333 | 65.0 | 140.0 | 0.048 | 0.26 | 0.02 | -100 | 125.6 | 43.4 | 0 | 0 | 0 | 110 | 130.4 | NA | 69.0 | 18.16 | 32 | 0.80 | NA | 1.198 | -4.8 | 0.160 | 120 | 46.0 | 142.2 | 6.52 | 5.28 | 1.60 |
| B | 5.340000 | 23.98667 | 0.2533333 | 70.4 | 144.8 | 0.114 | 0.12 | 0.00 | -100 | 118.0 | 45.6 | 0 | 0 | 0 | 90 | 119.2 | 3998 | 66.0 | 18.60 | 3004 | 0.88 | 730.4 | 1.398 | -4.0 | 0.102 | 120 | 46.0 | 142.0 | 6.50 | 5.36 | 1.40 |
| B | 5.266667 | 23.94000 | 0.2746667 | 65.4 | 140.2 | 0.122 | 0.42 | 0.06 | -100 | 116.4 | 46.2 | 0 | 0 | 0 | 90 | 120.6 | 3992 | 65.8 | 18.18 | 3090 | 0.94 | 728.6 | 1.548 | -4.4 | 0.060 | 120 | 46.0 | 142.2 | 6.52 | 5.30 | 1.44 |
| D | 5.506667 | 23.89333 | 0.2493333 | 68.4 | 138.6 | 0.058 | 0.12 | 0.02 | -100 | 118.0 | 46.2 | 0 | 0 | 0 | 76 | 120.2 | 3996 | 64.2 | 21.68 | 2936 | 1.64 | 729.8 | 3.290 | -4.2 | 0.154 | 120 | 46.0 | 142.4 | 7.76 | 5.62 | 3.16 |
| B | 5.320000 | 23.96000 | 0.1906667 | 66.4 | 140.2 | 0.038 | 0.04 | 0.00 | -100 | 117.8 | 45.4 | 0 | 0 | 0 | 92 | 120.0 | 3996 | 65.4 | 22.28 | 2972 | 0.92 | 726.8 | 1.498 | -4.4 | 0.022 | 120 | 46.0 | 142.6 | 6.56 | 5.38 | 1.46 |
| B | 5.273333 | 23.96667 | 0.1993333 | 68.4 | 141.8 | 0.008 | 0.30 | 0.20 | -100 | 117.2 | 46.0 | 0 | 0 | 0 | 94 | 120.2 | 3998 | 65.6 | 24.02 | 3094 | 0.92 | 732.0 | 1.498 | -4.4 | 0.022 | 120 | 46.0 | 142.4 | 6.58 | 5.40 | 1.46 |
| B | 5.533333 | 23.98667 | 0.2466667 | 70.4 | 142.2 | 0.062 | 0.08 | NA | -100 | 121.0 | 47.6 | 0 | 0 | 0 | 118 | 120.0 | 2834 | 67.4 | 13.56 | 3154 | 0.70 | 523.4 | 0.946 | -4.4 | 0.054 | 120 | 46.0 | 141.8 | 6.50 | 5.48 | 1.36 |
| A | 5.426667 | 23.98667 | 0.2553333 | 69.0 | 140.4 | 0.122 | 0.48 | 0.04 | -100 | 121.2 | 38.8 | 0 | 0 | 0 | NA | 131.4 | 1386 | 66.4 | 19.32 | 868 | 1.56 | NA | 3.092 | -4.4 | 0.022 | 120 | 46.0 | 142.8 | 7.08 | 5.54 | 3.18 |
| B | 5.406667 | 23.94000 | 0.3293333 | 66.2 | 137.8 | 0.208 | 0.46 | 0.02 | -100 | 130.6 | 44.6 | 0 | 0 | 0 | 96 | 119.0 | 4002 | 67.0 | 17.52 | 2592 | 0.88 | 731.4 | 1.398 | -4.0 | 0.022 | 120 | 46.0 | 142.8 | 6.50 | 5.46 | 1.40 |
| B | 5.453333 | 24.09333 | 0.2353333 | 66.4 | 136.2 | 0.072 | 0.06 | 0.06 | -100 | 125.2 | 46.4 | 0 | 0 | 0 | 94 | 120.2 | 4010 | 66.4 | 20.38 | 2996 | 0.90 | 736.8 | 1.448 | -4.6 | 0.024 | 120 | 46.0 | 142.6 | 6.50 | 5.38 | 1.40 |
| C | 5.266667 | 23.94667 | 0.2766667 | 64.8 | 139.2 | 0.048 | 0.18 | 0.02 | -100 | 116.6 | 46.2 | 0 | 0 | 0 | 90 | 120.6 | 4014 | 66.0 | 24.12 | 3060 | 0.90 | 738.6 | 1.448 | -4.4 | 0.022 | 120 | 46.0 | 142.2 | 6.54 | 5.32 | 1.40 |
| C | 5.253333 | 23.99333 | 0.2933333 | 70.4 | 146.4 | 0.040 | 0.14 | 0.02 | -100 | 122.2 | 46.0 | 0 | 0 | 0 | 102 | 119.8 | 4012 | 66.8 | 17.54 | 3136 | 0.94 | 741.2 | 1.548 | -4.2 | 0.022 | 120 | 46.0 | 142.2 | 6.62 | 5.34 | 1.52 |
| D | 5.500000 | 24.04667 | 0.2466667 | 71.0 | 141.8 | 0.040 | 0.02 | 0.00 | -100 | 120.4 | 46.4 | 0 | 0 | 0 | 78 | 119.4 | 4010 | 65.6 | 18.12 | 3194 | 1.64 | 735.2 | 3.290 | -3.8 | 0.174 | 120 | 46.0 | 142.0 | 7.74 | 5.66 | 3.28 |
| D | 5.480000 | 23.89333 | 0.2246667 | 70.4 | 140.8 | NA | 0.34 | 0.06 | -100 | 120.2 | 50.4 | 0 | 0 | 0 | 80 | 120.0 | 4010 | 64.8 | 16.94 | 3162 | 1.66 | 740.8 | 3.340 | -3.8 | 0.022 | 120 | 50.0 | 142.0 | 7.74 | 5.62 | 3.24 |
| D | 5.486667 | 23.98000 | 0.3066667 | 69.6 | 140.0 | 0.234 | 0.16 | 0.02 | -100 | 122.6 | 47.0 | 0 | 0 | 0 | 78 | 120.2 | 4010 | 64.6 | 17.04 | 2982 | 1.66 | 733.8 | 3.340 | -4.0 | 0.024 | 120 | 46.0 | 141.8 | 7.74 | 5.62 | 3.26 |
| D | 5.466667 | 24.04000 | 0.2300000 | 70.2 | 141.2 | 0.004 | 0.10 | 0.04 | -100 | 120.2 | 46.6 | 0 | 0 | 0 | 78 | 120.4 | 4010 | 65.4 | 23.44 | 3182 | 1.68 | 732.4 | 3.390 | -4.0 | 0.066 | 120 | 46.0 | 142.0 | 7.72 | 5.56 | 3.28 |
| D | 5.460000 | 24.04667 | 0.2780000 | 70.8 | 141.8 | 0.174 | 0.62 | 0.10 | -100 | 125.6 | 40.0 | 0 | 0 | 0 | 104 | 121.8 | 1008 | 70.6 | 19.22 | 32 | 1.42 | NA | 2.750 | -4.2 | 0.024 | 120 | 46.0 | 141.8 | 7.72 | 5.58 | 3.28 |
| B | 5.320000 | NA | 0.2686667 | 64.4 | 137.0 | 0.068 | 0.10 | 0.04 | -100 | 120.4 | 46.8 | 0 | 0 | 0 | 104 | 119.2 | 4018 | 66.0 | 18.04 | 3190 | 0.92 | 733.6 | 1.496 | -4.2 | 0.022 | 120 | 46.0 | 142.4 | 6.50 | 5.38 | 1.50 |
| B | 5.313333 | NA | 0.3186667 | 64.2 | 136.8 | 0.218 | 0.44 | 0.04 | -100 | 118.2 | 45.8 | 0 | 0 | 0 | 100 | 120.4 | 4014 | 65.2 | 22.02 | 2862 | 0.90 | 729.8 | 1.448 | -4.2 | 0.022 | 120 | 46.0 | 142.0 | 6.50 | 5.36 | 1.52 |
| B | 5.266667 | NA | 0.2800000 | 73.2 | 149.8 | 0.040 | 0.24 | 0.04 | -100 | 124.6 | 44.8 | 0 | 0 | 0 | 102 | 121.0 | 3990 | 65.0 | 16.72 | 3122 | 0.92 | 724.2 | 1.498 | -4.0 | 0.036 | 120 | 46.0 | 141.6 | 6.52 | 5.34 | 1.56 |
| A | 5.486667 | 24.10667 | 0.1986667 | 75.0 | 146.4 | 0.086 | 0.28 | 0.04 | -100 | 114.8 | 46.2 | 0 | 0 | 0 | 100 | 120.2 | 4016 | 65.4 | 17.76 | 3048 | 1.50 | 732.8 | 2.942 | -4.0 | 0.022 | 120 | 46.0 | 142.0 | 7.16 | 5.44 | 3.00 |
| A | 5.460000 | 23.98000 | 0.2046667 | 65.8 | 136.6 | 0.010 | 0.04 | 0.06 | -100 | 116.8 | 45.6 | 0 | 0 | 0 | 100 | 119.2 | 4010 | 66.2 | 17.76 | 3080 | 1.50 | 730.0 | 2.942 | -3.6 | 0.022 | 120 | 46.0 | 142.6 | 7.14 | 5.54 | 3.08 |
| A | 5.440000 | 23.92000 | 0.2360000 | 69.4 | 142.0 | 0.050 | 0.16 | 0.08 | -100 | 116.4 | 46.0 | 0 | 0 | 0 | 98 | 120.8 | 4012 | 66.4 | 21.46 | 3102 | 1.52 | 731.8 | 2.992 | -3.6 | 0.022 | 120 | 46.0 | 142.4 | 7.14 | 5.56 | 3.06 |
| D | 5.480000 | 23.90667 | 0.1786667 | 70.4 | 141.0 | 0.090 | 0.22 | 0.04 | -100 | 118.0 | 46.0 | 0 | 0 | 0 | 74 | 120.8 | 4012 | 65.0 | 19.68 | 2900 | 1.70 | 732.2 | 3.440 | -4.4 | 0.052 | 120 | 46.0 | 142.6 | 7.68 | 5.58 | 3.32 |
| D | 5.473333 | 23.92000 | 0.3473333 | 74.2 | 145.2 | NA | 0.50 | 0.08 | -100 | 125.8 | 46.8 | 0 | 0 | 0 | 78 | 123.2 | 4010 | 65.2 | 16.74 | 2880 | 1.72 | 745.4 | 3.490 | -4.4 | 0.050 | 120 | 46.0 | 141.8 | 7.70 | 5.60 | 3.32 |
We will now check the summary statistics for the data.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:267 Min. :5.147 Min. :23.75 Min. :0.09867
## Class :character 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## Mode :character Median :5.340 Median :23.97 Median :0.27533
## Mean :5.369 Mean :23.97 Mean :0.27769
## 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90 Min. : 0
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12 1st Qu.:1083
## Median :118.6 Median :3978 Median :65.80 Median :21.44 Median :3038
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90 Mean :2409
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74 3rd Qu.:3215
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60 Max. :3858
## NA's :2 NA's :10 NA's :2 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.060 Min. : 15.6 Min. :0.902 Min. :-6.400
## 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498 1st Qu.:-5.600
## Median :0.980 Median :724.6 Median :1.648 Median :-5.200
## Mean :1.177 Mean :697.8 Mean :2.203 Mean :-5.174
## 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242 3rd Qu.:-4.800
## Max. :1.840 Max. :784.8 Max. :3.788 Max. :-3.600
## NA's :1 NA's :31 NA's :1 NA's :1
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :141.2
## 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03370 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04666 Mean :109.6 Mean :47.73 Mean :142.8
## 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:142.8
## Max. :0.39800 Max. :130.0 Max. :52.00 Max. :147.2
## NA's :3 NA's :1 NA's :2 NA's :1
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :6.580 Median :5.40 Median :1.480
## Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :7.820 Max. :5.74 Max. :3.420
## NA's :3 NA's :2
The summary statistics for the evaluation dataset tell us that it contains missing values so we will need to impute these later on in the project.
Our next step is to examine the training dataset in detail as this is the main dataset that we will be working with throughout the project.
Firstly, we will take a look at the first few observations in the dataset so we can get a feel for the data. We will then explore the structure of the data using the str() function which will tell us how many observations and variables it contains, and whether or not it contains missing values.
# Take a look at the structure of the training dataset.
head(beverage.train, 40) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')
| Brand.Code | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure1 | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | MFR | Balling | Pressure.Vacuum | PH | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B | 5.340000 | 23.96667 | 0.2633333 | 68.2 | 141.2 | 0.104 | 0.26 | 0.04 | -100 | 118.8 | 46.0 | 0 | NA | NA | 118 | 121.2 | 4002 | 66.0 | 16.18 | 2932 | 0.88 | 725.0 | 1.398 | -4.0 | 8.36 | 0.022 | 120 | 46.4 | 142.6 | 6.58 | 5.32 | 1.48 |
| A | 5.426667 | 24.00667 | 0.2386667 | 68.4 | 139.6 | 0.124 | 0.22 | 0.04 | -100 | 121.6 | 46.0 | 0 | NA | NA | 106 | 118.6 | 3986 | 67.6 | 19.90 | 3144 | 0.92 | 726.8 | 1.498 | -4.0 | 8.26 | 0.026 | 120 | 46.8 | 143.0 | 6.56 | 5.30 | 1.56 |
| B | 5.286667 | 24.06000 | 0.2633333 | 70.8 | 144.8 | 0.090 | 0.34 | 0.16 | -100 | 120.2 | 46.0 | 0 | NA | NA | 82 | 120.0 | 4020 | 67.0 | 17.76 | 2914 | 1.58 | 735.0 | 3.142 | -3.8 | 8.94 | 0.024 | 120 | 46.6 | 142.0 | 7.66 | 5.84 | 3.28 |
| A | 5.440000 | 24.00667 | 0.2933333 | 63.0 | 132.6 | NA | 0.42 | 0.04 | -100 | 115.2 | 46.4 | 0 | 0 | 0 | 92 | 117.8 | 4012 | 65.6 | 17.42 | 3062 | 1.54 | 730.6 | 3.042 | -4.4 | 8.24 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.42 | 3.04 |
| A | 5.486667 | 24.31333 | 0.1113333 | 67.2 | 136.8 | 0.026 | 0.16 | 0.12 | -100 | 118.4 | 45.8 | 0 | 0 | 0 | 92 | 118.6 | 4010 | 65.6 | 17.68 | 3054 | 1.54 | 722.8 | 3.042 | -4.4 | 8.26 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.04 |
| A | 5.380000 | 23.92667 | 0.2693333 | 66.6 | 138.4 | 0.090 | 0.24 | 0.04 | -100 | 119.6 | 45.6 | 0 | 0 | 0 | 116 | 120.2 | 4014 | 66.2 | 23.82 | 2948 | 1.52 | 738.8 | 2.992 | -4.4 | 8.32 | 0.024 | 120 | 46.0 | 146.6 | 7.16 | 5.44 | 3.02 |
| A | 5.313333 | 23.88667 | 0.2680000 | 64.2 | 136.8 | 0.128 | 0.40 | 0.04 | -100 | 122.2 | 51.8 | 0 | 0 | 0 | 124 | 123.4 | NA | 65.8 | 20.74 | 30 | 0.84 | NA | 1.298 | -4.4 | 8.40 | 0.066 | 120 | 46.0 | 146.2 | 6.54 | 5.38 | 1.44 |
| B | 5.320000 | 24.17333 | 0.2206667 | 67.6 | 141.4 | 0.154 | 0.34 | 0.04 | -100 | 124.2 | 46.8 | 0 | 0 | 0 | 132 | 118.6 | 1004 | 65.2 | 18.96 | 684 | 0.84 | NA | 1.298 | -4.4 | 8.38 | 0.046 | 120 | 46.0 | 146.4 | 6.52 | 5.34 | 1.44 |
| B | 5.246667 | 23.98000 | 0.2626667 | 64.2 | 140.2 | 0.132 | 0.12 | 0.14 | -100 | 120.8 | 46.0 | 0 | 0 | 0 | 90 | 120.2 | 4014 | 65.4 | 18.40 | 2902 | 0.90 | 740.4 | 1.446 | -4.4 | 8.38 | 0.064 | 120 | 46.0 | 147.2 | 6.52 | 5.34 | 1.44 |
| B | 5.266667 | 24.00667 | 0.2313333 | 72.0 | 147.4 | 0.014 | 0.24 | 0.06 | -100 | 119.8 | 45.2 | 0 | 0 | 0 | 108 | 120.8 | 4028 | 66.6 | 13.50 | 3038 | 0.90 | 692.4 | 1.448 | -4.4 | 8.50 | 0.022 | 120 | 46.0 | 146.2 | 6.54 | 5.34 | 1.38 |
| B | 5.320000 | 23.92000 | 0.2586667 | 66.2 | 139.4 | 0.078 | 0.18 | 0.04 | -100 | 119.6 | 46.6 | 0 | 0 | 0 | 94 | 119.6 | 4020 | 65.0 | 19.04 | 3056 | 0.90 | 727.0 | 1.448 | -4.4 | 8.34 | 0.030 | 120 | 46.0 | 146.2 | 6.52 | 5.34 | 1.44 |
| B | 5.353333 | 24.06667 | 0.2513333 | 61.6 | 132.8 | 0.110 | 0.18 | 0.02 | -100 | 119.2 | 46.6 | 0 | 0 | 0 | 86 | 119.6 | 4012 | 65.4 | 18.44 | 3110 | 0.92 | 735.0 | 1.498 | -4.4 | 8.34 | 0.058 | 120 | 46.0 | 146.8 | 6.52 | 5.34 | 1.44 |
| B | 5.220000 | 23.89333 | 0.2673333 | 63.4 | 141.0 | 0.114 | 0.38 | NA | -100 | 117.4 | 45.4 | 0 | 0 | 0 | 98 | 121.0 | 4012 | 65.0 | 17.12 | 2870 | 0.92 | 729.6 | 1.498 | -4.4 | 8.34 | 0.048 | 120 | 46.0 | 146.0 | 6.52 | 5.34 | 1.46 |
| B | 5.266667 | 23.89333 | 0.2286667 | 71.6 | 147.8 | 0.096 | 0.22 | 0.04 | -100 | 113.6 | 46.0 | 0 | 0 | 0 | 94 | 120.0 | 4012 | 65.0 | 23.44 | 3040 | 0.92 | 731.0 | 1.498 | -4.4 | 8.38 | 0.046 | 120 | 46.0 | 146.8 | 6.52 | 5.34 | 1.44 |
| B | 5.266667 | 23.87333 | 0.3340000 | 72.6 | 148.0 | 0.160 | 0.36 | 0.08 | -100 | 120.2 | 46.6 | 0 | 0 | 0 | 92 | 120.0 | 4010 | 65.0 | 21.16 | 3056 | 0.90 | 732.4 | 1.448 | -4.4 | 8.40 | 0.066 | 120 | 46.0 | 146.6 | 6.52 | 5.34 | 1.44 |
| B | 5.286667 | 23.86667 | 0.2566667 | 68.0 | 143.2 | 0.034 | 0.16 | 0.02 | -100 | 129.0 | 47.4 | 0 | 0 | 0 | 96 | 119.8 | 4010 | 65.2 | 19.88 | 3290 | 0.90 | 731.0 | 1.448 | -4.4 | 8.42 | 0.046 | 120 | 46.0 | 146.0 | 6.52 | 5.34 | 1.42 |
| C | 5.226667 | 23.69333 | 0.3166667 | 63.8 | 138.2 | 0.124 | 0.20 | 0.06 | -100 | 123.4 | 48.8 | 0 | 0 | 0 | 92 | 120.2 | 1624 | 68.8 | 17.02 | 3200 | 0.46 | 295.8 | 0.346 | -4.2 | 8.58 | 0.164 | 120 | 46.0 | 146.6 | 6.52 | 5.34 | 1.46 |
| B | 5.353333 | 23.99333 | 0.2793333 | 64.8 | 137.0 | 0.146 | 0.06 | 0.02 | -100 | 115.6 | 46.4 | 0 | 0 | 0 | 94 | 120.4 | 4012 | 65.2 | 21.82 | 3082 | 0.88 | 726.4 | 1.398 | -4.0 | 8.50 | 0.046 | 120 | 46.0 | 146.8 | 6.52 | 5.36 | 1.46 |
| B | 5.366667 | 24.09333 | 0.2613333 | 70.6 | 143.8 | 0.220 | 0.48 | 0.08 | -100 | 121.4 | 47.0 | 0 | 0 | 0 | 98 | 116.4 | 3060 | 65.4 | 20.32 | 3324 | 0.84 | 535.8 | 1.298 | -4.0 | 8.44 | 0.064 | 120 | 46.0 | 146.8 | 6.52 | 5.28 | 1.44 |
| C | 5.213333 | 23.98667 | 0.2353333 | 62.6 | 140.8 | 0.246 | 0.10 | 0.20 | -100 | 119.6 | 45.4 | 0 | 0 | 0 | 102 | 120.2 | 4012 | 67.8 | 16.44 | 2970 | 0.86 | 731.8 | 1.348 | -4.0 | 8.30 | 0.046 | 120 | 46.0 | 146.2 | 6.62 | 5.34 | 1.38 |
| C | 5.220000 | 24.26000 | 0.1120000 | 66.8 | 143.4 | 0.042 | 0.08 | 0.06 | -100 | 116.6 | 46.4 | 0 | 0 | 0 | 94 | 121.0 | 4010 | 65.4 | 16.56 | 3090 | 0.94 | 726.4 | 1.548 | -4.2 | 8.42 | 0.022 | 120 | 46.0 | 146.2 | 6.52 | 5.34 | 1.52 |
| B | 5.333333 | 24.09333 | 0.3046667 | 66.0 | 139.4 | 0.060 | 0.06 | 0.08 | -100 | 130.2 | 44.2 | 0 | 0 | 0 | 130 | 100.2 | 1008 | 69.8 | 21.98 | 30 | 0.74 | NA | 1.048 | -4.0 | 8.48 | NA | 100 | 50.0 | 147.0 | 6.50 | 5.40 | 1.48 |
| B | 5.340000 | 23.98667 | 0.2120000 | 68.2 | 142.2 | 0.038 | 0.16 | 0.04 | -100 | 113.6 | 51.4 | 0 | 0 | 0 | 100 | 96.8 | 2936 | 66.6 | 19.36 | 3418 | 0.82 | 519.0 | 1.248 | -4.2 | 8.52 | 0.254 | 100 | 50.0 | 146.8 | 6.50 | 5.38 | 1.48 |
| B | 5.413333 | 23.98667 | 0.2926667 | 70.0 | 142.8 | 0.124 | 0.02 | 0.04 | -100 | 118.2 | 50.2 | 0 | 0 | 0 | 96 | 100.4 | 4016 | 66.0 | 24.00 | 3206 | 0.92 | 732.6 | 1.496 | -4.2 | 8.44 | 0.084 | 100 | 50.0 | 146.4 | 6.52 | 5.38 | 1.48 |
| B | 5.373333 | 24.02000 | 0.2813333 | 68.0 | 141.0 | 0.102 | 0.26 | 0.02 | -100 | 119.2 | 50.0 | 0 | 0 | 0 | 90 | 100.2 | 4010 | 66.2 | 21.58 | 3220 | 0.90 | 734.4 | 1.448 | -4.2 | 8.44 | 0.064 | 100 | 50.0 | 147.2 | 6.50 | 5.28 | 1.50 |
| B | 5.313333 | 23.98667 | 0.2940000 | 68.2 | 142.2 | 0.052 | 0.18 | 0.10 | -100 | 118.0 | 50.0 | 0 | 0 | 0 | 94 | 99.8 | 4016 | 66.0 | 20.72 | 3206 | 0.92 | 732.8 | 1.496 | -4.2 | 8.40 | 0.206 | 100 | 50.0 | 146.6 | 6.50 | NA | 1.50 |
| B | 5.360000 | 24.02667 | 0.2780000 | 67.0 | 139.8 | 0.080 | 0.34 | 0.04 | -100 | 115.2 | 50.2 | 0 | 0 | 0 | 102 | 100.2 | 4014 | 68.2 | 21.60 | 3168 | 0.86 | 740.4 | 1.348 | -4.2 | 8.42 | 0.096 | 100 | 50.0 | 146.8 | 6.48 | 5.38 | 1.48 |
| B | 5.446667 | 24.02000 | 0.0900000 | 70.8 | 142.6 | 0.012 | 0.34 | 0.02 | -100 | 124.4 | 50.0 | 0 | 0 | 0 | 96 | 100.8 | 4012 | 65.6 | 23.58 | 3138 | 0.88 | 729.4 | 1.398 | -4.2 | 8.42 | 0.090 | 100 | 50.0 | 147.0 | 6.50 | 5.38 | 1.48 |
| B | 5.380000 | 24.07333 | 0.2180000 | 66.6 | 138.8 | 0.040 | 0.18 | 0.04 | -100 | 116.4 | 50.0 | 0 | 0 | 0 | 92 | 100.0 | 4010 | 65.8 | 21.40 | 3212 | 0.90 | 731.0 | 1.448 | -4.2 | 8.40 | 0.064 | 100 | 50.0 | 147.4 | 6.52 | 5.40 | 1.46 |
| B | 5.393333 | 24.08667 | 0.2120000 | 65.8 | 137.4 | 0.102 | 0.10 | 0.02 | -100 | 118.6 | 50.0 | 0 | 0 | 0 | 102 | 100.6 | 4016 | 66.4 | 18.32 | 3164 | 0.90 | 734.2 | 1.448 | -4.2 | 8.44 | 0.084 | 100 | 50.0 | 147.0 | 6.50 | 5.40 | 1.46 |
| B | 5.406667 | 24.11333 | 0.2220000 | 67.4 | 138.6 | 0.128 | 0.22 | 0.06 | -100 | 116.8 | 49.8 | 0 | 0 | 0 | 100 | 100.4 | 4014 | 65.8 | 21.16 | 3194 | 0.90 | 731.6 | 1.448 | -4.2 | 8.36 | 0.096 | 100 | 50.0 | 146.6 | 6.50 | 5.38 | 1.46 |
| B | 5.366667 | 24.09333 | 0.2106667 | 70.8 | 144.2 | 0.068 | 0.04 | 0.08 | -100 | 115.2 | 50.2 | 0 | 0 | 0 | 96 | 100.2 | 4010 | 65.8 | 22.90 | 3182 | 0.90 | 731.0 | 1.398 | -4.2 | 8.36 | 0.084 | 100 | 50.0 | 146.2 | 6.50 | 5.38 | 1.46 |
| B | 5.300000 | 24.07333 | 0.1860000 | 69.6 | 143.8 | 0.052 | 0.42 | 0.24 | -100 | 121.2 | 50.4 | 0 | 0 | 0 | 94 | 110.0 | 4014 | 65.8 | 19.18 | 3214 | 0.90 | 730.6 | 1.448 | -4.2 | 8.40 | 0.082 | 110 | 50.0 | 146.4 | 6.50 | 5.38 | 1.46 |
| B | 5.360000 | 24.08000 | 0.1546667 | 68.6 | 141.6 | 0.088 | 0.04 | 0.06 | -100 | 119.6 | 50.0 | 0 | 0 | 0 | 100 | 109.4 | 4010 | 65.8 | 15.88 | 3198 | 0.92 | 740.0 | 1.496 | -4.2 | 8.38 | 0.062 | 110 | 50.0 | 147.0 | 6.52 | 5.38 | 1.50 |
| B | 5.366667 | 24.04667 | 0.1326667 | 68.2 | 141.0 | 0.112 | 0.34 | 0.16 | -100 | 118.0 | 50.2 | 0 | 0 | 0 | 94 | 110.0 | 4010 | 65.8 | 18.54 | 3220 | 0.92 | 733.8 | 1.496 | -4.2 | 8.44 | 0.064 | 110 | 50.0 | 145.8 | 6.52 | 5.38 | 1.50 |
| B | 5.373333 | 24.00667 | 0.3160000 | 69.4 | 142.8 | NA | 0.28 | 0.02 | -100 | 120.4 | 49.8 | 0 | 0 | 0 | 96 | 110.2 | 4012 | 67.4 | 19.98 | 3208 | 0.90 | 728.4 | 1.398 | -4.0 | 8.36 | 0.064 | 110 | 50.0 | 146.0 | 6.50 | 5.36 | 1.50 |
| B | 5.346667 | 23.98667 | 0.2280000 | 68.8 | 142.4 | 0.164 | 0.26 | 0.06 | -100 | 121.4 | 43.0 | 0 | 0 | 0 | 120 | 120.0 | 1006 | 67.0 | 13.66 | 1464 | 0.86 | NA | 1.346 | -4.0 | 8.40 | 0.080 | 120 | 50.0 | 148.2 | 6.50 | 5.40 | 1.50 |
| B | 5.373333 | 24.01333 | 0.2000000 | 65.2 | 137.4 | 0.112 | 0.34 | 0.02 | -100 | 116.0 | 50.2 | 0 | 0 | 0 | 96 | 120.8 | 4014 | 66.4 | 21.98 | 3222 | 0.88 | 702.4 | 1.398 | -4.0 | 8.38 | 0.060 | 120 | 50.0 | 147.0 | 6.50 | 5.40 | 1.48 |
| B | 5.326667 | 24.06000 | 0.2393333 | 75.8 | 151.4 | 0.080 | 0.08 | 0.02 | -100 | 117.0 | 45.8 | 0 | 0 | 0 | 94 | 119.6 | 4012 | 66.6 | 18.12 | 2986 | 0.86 | 732.6 | 1.346 | -3.8 | 8.36 | 0.082 | 120 | 46.0 | 145.8 | 6.50 | 5.42 | 1.44 |
| C | 5.273333 | 23.86000 | 0.1126667 | 65.6 | 140.2 | 0.050 | 0.10 | 0.04 | -100 | 119.4 | 45.8 | 0 | 0 | 0 | 92 | 119.2 | 4014 | 67.2 | 13.78 | 2976 | 0.92 | 722.0 | 1.496 | -3.8 | 8.28 | 0.062 | 120 | 46.0 | 146.2 | 6.64 | 5.38 | 1.60 |
# Examine the structure of the training data.
str(beverage.train)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : chr "B" "A" "B" "A" ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
The results of running the training data through the str() function reveal that the dataset consists of 33 columns and 2571 observations. Almost all of the variables are numerical, with the exception of the Brand.Code variable, which is categorical. Another important observation is that some of the variables contain missing values.
The training dataset contains 32 predictor variables: 1 categorical variable (Brand Code) and 31 numeric (continuous and discrete) variables. There are 2571 records in the training data and 267 records in the evaluation dataset. The target column is the PH column.
The data has the following variables:
Brand Code: categorical, values: A, B, C, D
Carb Volume: Numeric
Fill Ounces: Numeric
PC Volume: Numeric
Carb Pressure: Numeric
Carb Temp: Numeric
PSC: Numeric
PSC Fill: Numeric
PSC CO2: Numeric
Mnf Flow: Numeric
Carb Pressure1: Numeric
Fill Pressure: Numeric
Hyd Pressure1: Numeric
Hyd Pressure2: Numeric
Hyd Pressure3: Numeric
Hyd Pressure4: Numeric
Filler Level: Numeric
Filler Speed: Numeric
Temperature: Numeric
Usage cont: Numeric
Carb Flow: Numeric
Density: Numeric
MFR: Numeric
Balling: Numeric
Pressure Vacuum: Numeric
PH: This is the numeric TARGET variable that has to be predicted.
Oxygen Filler: Numeric
Bowl Setpoint: Numeric
Pressure Setpoint: Numeric
Air Pressurer: Numeric
Alch Rel: Numeric
Carb Rel: Numeric
Balling Lvl: Numeric
Now let’s check the summary statistics for the data.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure.Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 2571 33 1 32 0
## total_missing_values complete_rows total_observations memory_usage
## 1 844 2038 84843 645352
| Name | beverage.train |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb.Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC.CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf.Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill.Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd.Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd.Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler.Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler.Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb.Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch.Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb.Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling.Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
From the above, we see that most of the predictors (except for 2) contain missing data and will therefore need to be imputed. For the target variable (PH), we see that 4 rows are missing “PH” values. These rows will need to be dropped since they cannot be used for training.
Let’s look at the distribution of the target variable next.
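The histogram itself is not reproduced in this extract; below is a minimal sketch of how such a plot could be drawn with ggplot2 (the exact plotting code used in the original report is not shown).
# Plot the distribution of the target variable (PH).
ggplot(beverage.train, aes(x = PH)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  labs(x = 'PH', y = 'Count')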
The above histogram reveals that the target variable is not very skewed, even though there are some outliers. The minimum value for PH is 7.88 and the maximum value is 9.36 indicating that ABC manufactures relatively alkaline beverages - likely to be green tea or fruit and vegetable juices.
| variable | n_miss | pct_miss |
|---|---|---|
| MFR | 212 | 8.2458187 |
| Brand.Code | 120 | 4.6674446 |
| Filler.Speed | 57 | 2.2170362 |
| PC.Volume | 39 | 1.5169195 |
| PSC.CO2 | 39 | 1.5169195 |
| Fill.Ounces | 38 | 1.4780241 |
| PSC | 33 | 1.2835473 |
| Carb.Pressure1 | 32 | 1.2446519 |
| Hyd.Pressure4 | 30 | 1.1668611 |
| Carb.Pressure | 27 | 1.0501750 |
| Carb.Temp | 26 | 1.0112797 |
| PSC.Fill | 23 | 0.8945935 |
| Fill.Pressure | 22 | 0.8556982 |
| Filler.Level | 20 | 0.7779074 |
| Hyd.Pressure2 | 15 | 0.5834306 |
| Hyd.Pressure3 | 15 | 0.5834306 |
| Temperature | 14 | 0.5445352 |
| Oxygen.Filler | 12 | 0.4667445 |
| Pressure.Setpoint | 12 | 0.4667445 |
| Hyd.Pressure1 | 11 | 0.4278491 |
| Carb.Volume | 10 | 0.3889537 |
| Carb.Rel | 10 | 0.3889537 |
| Alch.Rel | 9 | 0.3500583 |
| Usage.cont | 5 | 0.1944769 |
| PH | 4 | 0.1555815 |
| Mnf.Flow | 2 | 0.0777907 |
| Carb.Flow | 2 | 0.0777907 |
| Bowl.Setpoint | 2 | 0.0777907 |
| Density | 1 | 0.0388954 |
| Balling | 1 | 0.0388954 |
| Balling.Lvl | 1 | 0.0388954 |
| Pressure.Vacuum | 0 | 0.0000000 |
| Air.Pressurer | 0 | 0.0000000 |
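The column names n_miss and pct_miss above suggest a per-variable missing-value summary. A minimal sketch of how such a table could be produced, assuming the naniar package (the original call is not shown in this extract):
# Count and rank missing values per variable.
library(naniar)
miss_var_summary(beverage.train) %>% kable() %>% kable_styling()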
The above statistics tell us that about 8.25% of the records are missing a value for MFR. We may need to drop this feature, since the larger the share of missing values, the more imputed values we would introduce, with potentially negative consequences for the model.
The second most incomplete variable is the categorical variable Brand Code, which is missing about 4.67% of its values. These records could belong to a fifth brand besides the existing A, B, C, and D, or to one of the existing four brands. In either case, we will create a new feature category, 'Unknown', for these records. The rest of the predictors are missing smaller percentages of values, and we can use imputation for them.
From the above plots, we can see that a lot of the predictors are significantly skewed, suggesting that we might need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. We also see a number of bimodal variables such as Carb Flow, Balling, and Balling Level.
We now use boxplots to check the spread of each predictor.
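The plotting code is not shown in this extract; a minimal sketch of faceted boxplots for the numeric predictors could look like this:
# Boxplots of each numeric predictor to inspect spread and outliers.
beverage.train %>%
  dplyr::select(where(is.numeric), -PH) %>%
  gather(key = 'variable', value = 'value') %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(na.rm = TRUE) +
  facet_wrap(~ variable, scales = 'free', ncol = 4)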
The boxplots reveal outliers, but we do not have a strong reason to impute or drop them from the dataset.
We will now derive the correlations for the numeric predictors. This will enable us to focus on those predictors that show stronger positive or negative correlations with PH. Predictors with correlations closer to zero will most likely not provide any meaningful information for the target variable.
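The exact code is not shown in this extract; the sketch below is one way such a ranking can be produced, assuming pairwise-complete correlations over the numeric columns of the training data.
# Correlation of each numeric predictor with PH, ranked from strongest positive to strongest negative.
numeric.data <- beverage.train %>% dplyr::select(where(is.numeric))
ph.correlations <- cor(numeric.data, use = 'pairwise.complete.obs')[, 'PH']
stack(sort(ph.correlations[names(ph.correlations) != 'PH'], decreasing = TRUE))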
## values ind
## 1 0.361587534 Bowl.Setpoint
## 2 0.352043962 Filler.Level
## 3 0.233593699 Carb.Flow
## 4 0.219735497 Pressure.Vacuum
## 5 0.196051481 Carb.Rel
## 6 0.166682228 Alch.Rel
## 7 0.164485364 Oxygen.Filler
## 8 0.109371168 Balling.Lvl
## 9 0.098866734 PC.Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb.Pressure
## 13 0.072132509 Carb.Volume
## 14 0.032279368 Carb.Temp
## 15 -0.007997231 Air.Pressurer
## 16 -0.023809796 PSC.Fill
## 17 -0.040882953 Filler.Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd.Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC.CO2
## 22 -0.118335903 Fill.Ounces
## 23 -0.118764185 Carb.Pressure1
## 24 -0.171434026 Hyd.Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd.Pressure2
## 27 -0.268101792 Hyd.Pressure3
## 28 -0.311663908 Pressure.Setpoint
## 29 -0.316514463 Fill.Pressure
## 30 -0.357611993 Usage.cont
## 31 -0.459231253 Mnf.Flow
From the above, we can see that the variables Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH. The remaining features have correlations close to zero, which implies they have less predictive power.
One problem that can occur with multiple regression and other models is correlation between predictors, known as multicollinearity. A quick check is to run correlations between all predictors.
We can see that some predictors are highly correlated with one another, such as Balling Level and Carb Volume, Carb Rel and Alch Rel, and Density and Balling, with correlations between 0.75 and 1. When we start examining predictors for our models, we'll have to consider the correlations between them and avoid including pairs with strong correlations.
In general, it looks like many of the predictors go hand-in-hand with other features and multicollinearity could be a problem.
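A more formal screen can be run with the vifcor function from the usdm package. The sketch below shows the general call; the 0.9 correlation threshold and the use of the numeric predictors as-is are assumptions, since the original code is not shown in this extract.
# Screen the numeric predictors for collinearity using variance inflation factors.
library(usdm)
numeric.predictors <- beverage.train %>% dplyr::select(where(is.numeric), -PH)
vifcor(numeric.predictors, th = 0.9)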
## 6 variables from the 31 input variables have collinearity problem:
##
## Balling Bowl.Setpoint Balling.Lvl MFR Hyd.Pressure3 Alch.Rel
##
## After excluding the collinear variables, the linear correlation coefficients ranges between:
## min correlation ( Pressure.Setpoint ~ PC.Volume ): 0.0001991472
## max correlation ( Carb.Rel ~ Density ): 0.852689
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 Carb.Volume 17.159340
## 2 Fill.Ounces 1.153764
## 3 PC.Volume 1.685901
## 4 Carb.Pressure 43.298925
## 5 Carb.Temp 35.460787
## 6 PSC 1.155980
## 7 PSC.Fill 1.109449
## 8 PSC.CO2 1.064636
## 9 Mnf.Flow 4.262754
## 10 Carb.Pressure1 1.434406
## 11 Fill.Pressure 3.490621
## 12 Hyd.Pressure1 2.935470
## 13 Hyd.Pressure2 4.900597
## 14 Hyd.Pressure4 1.752686
## 15 Filler.Level 2.618103
## 16 Filler.Speed 1.273500
## 17 Temperature 1.151185
## 18 Usage.cont 1.718776
## 19 Carb.Flow 1.987496
## 20 Density 4.499376
## 21 Pressure.Vacuum 2.054866
## 22 Oxygen.Filler 1.561878
## 23 Pressure.Setpoint 3.300894
## 24 Air.Pressurer 1.167671
## 25 Carb.Rel 6.339500
The vifcor function from the usdm package allows us to do an early analysis of multicollinearity. As can be seen from the above, this function tells us that 6 of the 31 numeric predictors have a collinearity problem.
Lastly, we want to check for any features that show near zero-variance. Predictors that are the same across most of the instances will add little predictive information.
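A sketch of how this check can be performed with caret's nearZeroVar function (only the flagged predictor appears in the output that follows):
# Identify predictors with near-zero variance.
nzv.metrics <- nearZeroVar(beverage.train, saveMetrics = TRUE)
nzv.metrics[nzv.metrics$nzv == TRUE, ]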
## freqRatio percentUnique zeroVar nzv
## Hyd.Pressure1 31.11111 9.529366 FALSE TRUE
Since “Hyd Pressure1” displays near-zero variance, we will drop this feature prior to modeling.
To summarize our data preparation and exploration, we distinguish our findings into a few categories below.
- MFR has more than 8% missing values, so we will remove this predictor.
- Hyd Pressure1 shows near-zero variance, so we will remove this predictor.
- There are 4 rows with missing PH values that need to be removed.
- Missing values in Brand Code will be replaced with the category "Unknown".
- Remaining missing values will be imputed using predictive mean matching via the mice package.
30 out of 33 variables contain missing values, ranging from 1 to 212 missing observations per variable. This is enough to justify imputation. Rather than removing entire observations with missing values and jeopardizing the representativeness of the data, we will use the mice package's mice() function to impute them.
The mice package offers an array of imputation methods (predictive mean matching, mean, norm, to name a few), but because the dataset contains both numeric and categorical variables, we have decided to use the predictive mean matching method, as it covers both variable types.
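A sketch of how these cleaning steps might be applied is shown below. The names beverage.train.clean and beverage.eval.clean match the data frames used in the imputation code that follows, but the exact cleaning code is not shown in this extract.
# Drop MFR and Hyd.Pressure1, remove rows with a missing PH value,
# and recode missing Brand.Code values as 'Unknown'.
beverage.train.clean <- beverage.train %>%
  dplyr::select(-MFR, -Hyd.Pressure1) %>%
  filter(!is.na(PH)) %>%
  mutate(Brand.Code = ifelse(is.na(Brand.Code), 'Unknown', Brand.Code))
beverage.eval.clean <- beverage.eval %>%
  dplyr::select(-MFR, -Hyd.Pressure1) %>%
  mutate(Brand.Code = ifelse(is.na(Brand.Code), 'Unknown', Brand.Code))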
set.seed(200)
# Impute missing values in training data using the Predictive mean matching imputation method.
beverage.train.clean <- mice(beverage.train.clean, m = 1, method = 'pmm', print = FALSE) %>% complete()
# After imputation, check if any missing values remain.
colSums(is.na(beverage.train.clean))
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 0 0 0 0
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 0 0 0 0
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 0 0 0 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 0 0 0 0
## Filler.Speed Temperature Usage.cont Carb.Flow
## 0 0 0 0
## Density Balling Pressure.Vacuum PH
## 0 0 0 0
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## 0 0 0 0
## Alch.Rel Carb.Rel Balling.Lvl
## 0 0 0
# Impute missing values in test data using the Predictive mean matching imputation method.
beverage.eval.clean <- mice(beverage.eval.clean, m = 1, method = 'pmm', print = FALSE) %>% complete()
As per the above results, we can confirm that the missing values have been eliminated after imputation.
"Brand.Code" is a categorical variable with values A, B, C, D, and Unknown, so we will convert it to a set of dummy variables for modeling.
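A sketch of how this conversion could be done with caret's dummyVars function; the data frame names bev.train.dummies and bev.eval.dummies are illustrative and not taken from the original code.
# Convert Brand.Code into indicator (dummy) columns, one per level.
beverage.train.clean$Brand.Code <- as.factor(beverage.train.clean$Brand.Code)
brandDummies <- dummyVars(~ Brand.Code, data = beverage.train.clean)
bev.train.dummies <- cbind(predict(brandDummies, beverage.train.clean),
                           beverage.train.clean %>% dplyr::select(-Brand.Code))
beverage.eval.clean$Brand.Code <- factor(beverage.eval.clean$Brand.Code,
                                         levels = levels(beverage.train.clean$Brand.Code))
bev.eval.dummies <- cbind(predict(brandDummies, beverage.eval.clean),
                          beverage.eval.clean %>% dplyr::select(-Brand.Code))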
As discussed earlier, some of the predictors are highly skewed. To address this, we scale, center, and apply the Box-Cox transformation to them using the “preProcess” function from the “caret” package. These transformations should result in more normal distributions.
## Created from 267 samples and 34 variables
##
## Pre-processing:
## - Box-Cox transformation (22)
## - centered (34)
## - ignored (0)
## - scaled (34)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.000 -2.000 0.100 -0.300 0.675 2.000
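The exact preprocessing call is not shown in this extract (the printed summary above corresponds to the 267-row evaluation set). The sketch below shows the general approach with caret's preProcess, assuming the dummy-coded data frames bev.train.dummies and bev.eval.dummies from the previous step and leaving the target PH on its original scale.
# Build a Box-Cox / center / scale recipe on the predictors and apply it to both datasets.
preProcObj <- preProcess(bev.train.dummies %>% dplyr::select(-PH),
                         method = c('BoxCox', 'center', 'scale'))
bev.train.transformed <- predict(preProcObj, bev.train.dummies %>% dplyr::select(-PH)) %>%
  mutate(PH = bev.train.dummies$PH)
bev.eval.transformed <- predict(preProcObj, bev.eval.dummies)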
Here are some plots to demonstrate the changes in distributions after the transformations:
# Prepare data for ggplot.
gather_df <- bev.train.transformed %>% dplyr::select(-c(PH)) %>% gather(key = 'variable', value = 'value')
# Histogram plots of each variable.
ggplot(gather_df) + geom_histogram(aes(x=value, y = ..density..), bins = 30) +
geom_density(aes(x = value), color = 'red') +
facet_wrap(. ~variable, scales = 'free', ncol = 4)
As expected, the plots of the dummy variables are binary. For the others, we can still see bimodal predictors since we did not apply any feature engineering to them. Some predictors such as 'PSC Fill' and 'Temperature' still show some skewness, but we can move on to building the models.
Here, we perform a train-test split with an 80:20 ratio.
# Split the training data into train and test sets using an 80% data split.
trainingData <- createDataPartition(bev.train.transformed$PH, p = 0.8, list = FALSE)
# Training data splits.
trainingDataSet <- bev.train.transformed[trainingData, ]
xTrainData <- subset(trainingDataSet, select = -PH)
yTrainData <- subset(trainingDataSet, select = PH)
# Test data splits.
testDataSet <- bev.train.transformed[-trainingData, ]
xTestData <- subset(testDataSet, select = -PH)
yTestData <- subset(testDataSet, select = PH)
In this section, we will build and run 3 categories of models: tree, linear, and non-linear. We will then compare the results of the models in each category, select the best category performer, and then select the overall best performer.
In the non-linear category, we will build and run 2 models - a Support Vector Machine (SVM) model, and a K-Nearest Neighbors (KNN) model. We will use the caret package’s train() function to build the models, and use the same training and test datasets for both models.
set.seed(200)
# Define the SVM model.
svmModel = train(x = xTrainData,
y = yTrainData$PH,
preProcess = c('center', 'scale'),
method = 'svmRadial',
tuneLength = 10,
trControl = trainControl(method = 'repeatedcv'))
# Run predict() and postResample() on the model and display the results.
svmPrediction <- predict(svmModel, newdata = xTestData)
svmPerformance <- postResample(pred = svmPrediction, obs = yTestData$PH)
svmPerformance
## RMSE Rsquared MAE
## 0.10884499 0.59016597 0.08127628
# Predict on test data and calculate performance
results <- data.frame()
results <- data.frame(t(postResample(pred = svmPrediction, obs = yTestData$PH))) %>%
  mutate(Model = "SVM") %>%
  rbind(results)
set.seed(200)
# Define the KNN model.
knnModel <- train(x = xTrainData,
y = yTrainData$PH,
preProcess = c('center', 'scale'),
method = 'knn',
tuneLength = 10)
# Run predict() and postResample() on the model and display the results.
knnPrediction <- predict(knnModel, newdata = xTestData)
knnPerformance <- postResample(pred = knnPrediction, obs = yTestData$PH)
knnPerformance
## RMSE Rsquared MAE
## 0.1191253 0.5096819 0.0913447
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = knnPrediction, obs = yTestData$PH))) %>% mutate(Model = "k-Nearest Neighbors(kNN)") %>% rbind(results)
In the linear model category, we will build and run a generalized linear model (GLM), and a partial least squares (PLS) model.
set.seed(200)
# Define the GLM model.
glmModel = train(PH ~ .,
data = trainingDataSet,
metric = 'RMSE',
preProcess = c('center', 'scale'),
method = 'glm',
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))
# Run predict() and postResample() on the model and display the results.
glmModelPrediction <- predict(glmModel, xTestData)
glmPerformance <- postResample(pred = glmModelPrediction, obs = yTestData$PH)
glmPerformance
## RMSE Rsquared MAE
## 0.1295556 0.4133405 0.1024145
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = glmModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Generalized Linear Model(GLM)") %>% rbind(results)
set.seed(200)
# Define the PLS model.
plsModel = train(PH ~ .,
data = trainingDataSet,
metric = 'RMSE',
preProcess = c('center', 'scale'),
method = 'pls',
trControl = trainControl(method = 'cv', number = 5, savePredictions = TRUE))
# Run predict() and postResample() on the model and display the results.
plsModelPrediction <- predict(plsModel, xTestData)
plsPerformance <- postResample(pred = plsModelPrediction, obs = yTestData$PH)
plsPerformance
## RMSE Rsquared MAE
## 0.1299029 0.4088798 0.1040693
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = plsModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Partial Least Squares(PLS)") %>% rbind(results)
In this category, we will build and run a cubist model, and a single tree model.
set.seed(200)
# Define the Cubist model.
cubistModel <- cubist(xTrainData,
yTrainData$PH,
committees = 6)
# Run predict() and postResample() on the model and display the results.
cubistModelPrediction <- predict(cubistModel, newdata = xTestData)
cubistPerformance <- postResample(pred = cubistModelPrediction, obs = yTestData$PH)
cubistPerformance
## RMSE Rsquared MAE
## 0.10518426 0.61494856 0.07688106
# Predict on test data and calculate performance
results <- data.frame(t(postResample(pred = cubistModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Tree Model(Cubist)") %>% rbind(results)
set.seed(100)
# Define the Single Tree model.
singleTreeModel <- train(xTrainData,
yTrainData$PH,
method = 'rpart2',
tuneLength = 10,
trControl = trainControl(method = 'cv'))
# Run predict() and postResample() on the model and display the results.
singleTreeModelPrediction <- predict(singleTreeModel, newdata = xTestData)
singleTreePerformance <- postResample(pred = singleTreeModelPrediction , obs = yTestData$PH)
singleTreePerformance
## RMSE Rsquared MAE
## 0.12522668 0.45164526 0.09759854
# Predict on test data and calculate performance.
results <- data.frame(t(postResample(pred = singleTreeModelPrediction, obs = yTestData$PH))) %>% mutate(Model = "Single Tree Model") %>% rbind(results)
After running our models, we will now compare the results of the models and select the best performing model within each category. This will allow us to select the overall best performing model.
nonLinearComparisons <- rbind(
'Support Vector Machine' = svmPerformance,
'K Nearest Neighbors' = knnPerformance)
nonLinearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| Support Vector Machine | 0.1088450 | 0.5901660 | 0.0812763 |
| K Nearest Neighbors | 0.1191253 | 0.5096819 | 0.0913447 |
Using RMSE and Rsquared as the selection criteria for the best performing model, the support vector machine model yielded the best performance. The Rsquared value of the model is 0.59, which tells us that the model explains 59% of the variability in the data. This beats the Rsquared value of the KNN model (51%), but not by much.
linearComparisons <- rbind(
'Generalized Linear Model' = glmPerformance,
'Partial Least Squares' = plsPerformance)
linearComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| Generalized Linear Model | 0.1295556 | 0.4133405 | 0.1024145 |
| Partial Least Squares | 0.1299029 | 0.4088798 | 0.1040693 |
Again, using RMSE and Rsquared to select the best model, the GLM and PLS models perform almost identically. However, the generalized linear model performs slightly better than the partial least squares model: the GLM explains 41% of the variance in the data, which is only fractionally higher than the Rsquared value of the PLS model.
treeComparisons <- rbind(
'Cubist' = cubistPerformance,
'Single Tree' = singleTreePerformance)
treeComparisons %>% kable() %>% kable_styling(bootstrap_options = c('striped'))
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| Cubist | 0.1051843 | 0.6149486 | 0.0768811 |
| Single Tree | 0.1252267 | 0.4516453 | 0.0975985 |
Finally, in the tree model category, the Cubist model is the best performer: it has a lower RMSE than the single tree model and explains 61% of the variance in the data, as opposed to the single tree model's 45%.
We now consolidate the results from all the models using the following criteria: root mean squared error (RMSE), R-squared, and Mean Absolute Error (MAE). The table below lists these criteria for each model.
results %>% dplyr::select(Model, RMSE, Rsquared, MAE)
## Model RMSE Rsquared MAE
## 1 Single Tree Model 0.1252267 0.4516453 0.09759854
## 2 Tree Model(Cubist) 0.1051843 0.6149486 0.07688106
## 3 Partial Least Squares(PLS) 0.1299029 0.4088798 0.10406927
## 4 Generalized Linear Model(GLM) 0.1295556 0.4133405 0.10241455
## 5 k-Nearest Neighbors(kNN) 0.1191253 0.5096819 0.09134470
## 6 SVM 0.1088450 0.5901660 0.08127628
Based on the RMSE and Rsquared values of all the models we ran, the Cubist model is the overall best performer. This is expected given that this model is more tolerant of multicollinearity and works well with non-linear relationships. Its Rsquared value tells us that it explains 61% of the variance in the data, which falls within an acceptable range. Based on this, we will proceed with the Cubist model as the final predictive model for this project.
Let’s inspect the predictors that this model found important.
var.imp.cubist <- varImp(cubistModel, scale = FALSE)
var.imp.cubist
##                   Overall
## Mnf.Flow 73.0
## Alch.Rel 57.5
## Balling.Lvl 54.5
## Pressure.Vacuum 54.0
## Brand.CodeC 27.5
## Bowl.Setpoint 40.0
## Carb.Flow 36.5
## Filler.Speed 24.0
## Oxygen.Filler 44.5
## Balling 53.5
## Carb.Rel 35.5
## Usage.cont 27.0
## Density 41.5
## Air.Pressurer 32.5
## Hyd.Pressure3 36.5
## Hyd.Pressure2 25.5
## Carb.Pressure1 29.5
## Temperature 29.5
## Filler.Level 16.0
## PC.Volume 10.5
## Pressure.Setpoint 10.5
## Carb.Volume 16.5
## Carb.Pressure 22.5
## Carb.Temp 19.0
## Brand.CodeB 13.0
## PSC.Fill 4.5
## Brand.CodeD 4.0
## Fill.Pressure 3.0
## Hyd.Pressure4 3.0
## Fill.Ounces 2.0
## PSC 2.0
## PSC.CO2 2.0
## Brand.CodeA 1.0
## Brand.CodeUnknown 0.0
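Note that the print-out above is not ordered by importance. For readability, we can sort the scores explicitly; the sketch below assumes var.imp.cubist behaves like the plain data frame printed above (if varImp() returned a caret varImp.train object, its $importance slot should be used instead).
# Sketch: sort the Cubist importance scores and show the top 10 predictors.
# Swap in var.imp.cubist$importance if varImp() returned a varImp.train object.
var.imp.cubist %>%
  tibble::rownames_to_column('Predictor') %>%
  dplyr::arrange(dplyr::desc(Overall)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c('striped'))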
Interestingly, the list of important predictors contains several that had strong correlations (positive or negative) with the target variable, for example Alch Rel, Bowl Setpoint, Carb Flow, Pressure Vacuum, Oxygen Filler and Mnf Flow. At the same time, other predictors that showed strong correlation with PH did not make it into the top 10 most important predictors, for example Filler Level, Carb Rel, Usage cont, Fill Pressure, Temperature, Pressure Setpoint, Hyd Pressure2 and Hyd Pressure3.
Conversely, the list of the most important predictors includes variables such as Balling Lvl, Bowl Setpoint, Filler Speed and Balling that did not demonstrate the strongest correlations with PH.
This begins to make more sense when we compare against the predictor-predictor correlations calculated previously, as well as the results of the vifcor function used earlier. We can see that Carb Rel and Alch Rel are strongly correlated, as are Alch Rel and Hyd Pressure3. This indicates that the model accounts for multicollinearity: it avoids predictors that are strongly correlated with others already selected and therefore provide little incremental predictive power.
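To illustrate, the pairwise correlations mentioned above can be verified directly. This is a sketch only: trainDataClean is a placeholder name standing in for the transformed training dataset used earlier in this report.
# Sketch: confirm the strong predictor-predictor correlations discussed above.
# `trainDataClean` is a placeholder for the transformed training data.
cor(trainDataClean$Carb.Rel, trainDataClean$Alch.Rel, use = 'complete.obs')
cor(trainDataClean$Alch.Rel, trainDataClean$Hyd.Pressure3, use = 'complete.obs')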
Now that we have identified the Cubist model as the best predictive model, we will apply the model to the evaluation dataset by replacing the empty PH values in the evaluation dataset with the Cubist model’s predictions.
# Define the "evaluationDataClean" variable.
evaluationDataClean <- bev.eval.transformed
# Run predict() on the Cubist model.
cubistPredictions <- predict(cubistModel, newdata = evaluationDataClean)
# Replace the empty PH values in the evaluation set with the Cubist predictions.
evaluationDataClean$PH <- round(cubistPredictions,2)
# Take a look at the evaluation data after PH value replacement.
head(evaluationDataClean, 20) %>% kable() %>% kable_styling() %>% scroll_box(width = '100%', height = '600px')

| Brand.CodeA | Brand.CodeB | Brand.CodeC | Brand.CodeD | Brand.CodeUnknown | Carb.Volume | Fill.Ounces | PC.Volume | Carb.Pressure | Carb.Temp | PSC | PSC.Fill | PSC.CO2 | Mnf.Flow | Carb.Pressure1 | Fill.Pressure | Hyd.Pressure2 | Hyd.Pressure3 | Hyd.Pressure4 | Filler.Level | Filler.Speed | Temperature | Usage.cont | Carb.Flow | Density | Balling | Pressure.Vacuum | Oxygen.Filler | Bowl.Setpoint | Pressure.Setpoint | Air.Pressurer | Alch.Rel | Carb.Rel | Balling.Lvl | PH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.3876816 | -0.9650293 | -0.3617512 | 1.7776376 | -0.1754205 | 1.0173477 | 0.8453014 | -0.1261451 | -0.7151727 | -1.6286322 | 2.2451102 | 1.6099559 | -0.2942791 | -1.027837 | -1.4492875 | -0.6003036 | -1.151696 | -1.17903 | -0.0759471 | 1.3488585 | 0.5086529 | -0.1059513 | 0.1956905 | 0.4661381 | -0.7278954 | -0.8968551 | 2.3563890 | -0.3646397 | 1.5281699 | -1.2774435 | -0.1787716 | -0.6935289 | -0.7893006 | -0.6467639 | 8.63 |
| 2.5697753 | -0.9650293 | -0.3617512 | -0.5604375 | -0.1754205 | 0.2569971 | -0.2061344 | -0.8152380 | -1.4064544 | -1.5210765 | -0.7797427 | 0.4326833 | 0.7395699 | -1.027837 | -0.9470071 | -0.5412123 | -1.163307 | -1.17903 | 1.0115862 | 0.6212109 | 0.5500784 | -0.3700120 | -1.1183338 | 0.4368621 | 0.8571200 | 0.9606936 | 1.3300336 | -0.0738317 | 0.7065401 | -0.8299445 | 3.5102683 | 0.5986430 | 1.1043790 | 1.1194968 | 8.44 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.6671561 | -0.6473493 | 0.4039264 | -0.4234268 | -0.1578536 | -0.1784639 | -0.7460745 | -0.8112036 | -1.027837 | -0.6293102 | -0.6595235 | -1.163307 | -1.17903 | 0.0695220 | 0.5766276 | 0.5468822 | -0.3700120 | 1.1479870 | 0.5574101 | -0.6723612 | -0.7892285 | 1.6721521 | 0.3576903 | 0.7065401 | -0.8299445 | 3.0487050 | -0.7955619 | -0.7893006 | -0.6694082 | 8.60 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.9225304 | -0.3823992 | -1.4619252 | -0.8967401 | -0.4960754 | -2.3057805 | 0.2696713 | -0.8112036 | -1.027837 | 0.4042797 | -2.4369491 | -1.163307 | -1.17903 | 2.1707445 | 0.6361217 | -0.2105342 | 4.4933952 | -0.9652103 | -2.0498698 | -1.1279666 | -1.8425756 | 2.0142706 | 0.7859125 | 0.7065401 | -0.8299445 | 2.8935880 | -0.8994902 | 0.5006257 | -0.6467639 | 8.59 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | 0.3763584 | 3.0023969 | -1.8753809 | 0.3774154 | 0.2624104 | -0.8329359 | 1.0082946 | 0.2226454 | -1.027837 | -1.8169667 | 0.9525837 | -1.163307 | -1.17903 | -0.2244790 | 0.3282101 | 0.5596763 | 0.1533516 | 0.0752223 | 0.6934571 | -0.7278954 | -0.8968551 | 2.0142706 | 1.0031796 | 0.7065401 | 1.0948096 | 2.4244009 | -0.8472862 | -0.4560038 | -0.6694082 | 8.58 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.7306375 | 1.7999363 | -1.0484694 | 1.2962940 | 1.3499991 | 0.0209609 | 0.4326833 | 1.2564944 | -1.027837 | -0.9925143 | -0.4822488 | -1.163307 | -1.17903 | -0.2244790 | 0.6510572 | 0.5468822 | 0.2812546 | -1.0009422 | 0.5642985 | -0.8401151 | -1.1317616 | 2.3563890 | 0.7172809 | 0.7065401 | -0.8299445 | 2.5814395 | -0.8472862 | -0.1300589 | -0.6920526 | 8.56 |
| 2.5697753 | -0.9650293 | -0.3617512 | -0.5604375 | -0.1754205 | 1.0173477 | -0.4706421 | -0.5502023 | -0.7751385 | -1.6286322 | 0.2079618 | -0.2880740 | -1.3281282 | -1.027837 | -1.2205124 | -0.5412123 | -1.163307 | -1.17903 | 0.7550124 | 0.5914639 | 0.5468822 | 0.4080104 | -1.0950665 | 0.5453553 | 0.8095406 | 0.9279513 | 1.6721521 | 0.2627355 | 0.7065401 | -0.8299445 | 1.7897350 | 0.6763990 | 0.1887486 | 1.0968525 | 8.49 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | 0.4948399 | 1.2803082 | -2.4690609 | -0.1445701 | -0.4960754 | -0.0178523 | -0.7460745 | -0.2942791 | -1.027837 | -0.3581749 | -2.4369491 | -1.163307 | -1.17903 | 0.7550124 | 1.5107538 | 0.5023036 | 0.0242877 | -2.3000814 | -0.3759754 | 1.0922312 | 1.1916247 | 1.3300336 | 1.1925299 | 0.7065401 | -0.8299445 | 2.5814395 | 0.6376839 | -0.1300589 | 1.0742081 | 8.60 |
| 2.5697753 | -0.9650293 | -0.3617512 | -0.5604375 | -0.1754205 | 0.3763584 | -0.6473493 | 0.8703893 | -0.3103806 | -0.7439904 | 2.3548087 | 2.0287088 | -0.2942791 | -1.027837 | 2.8590587 | -1.2589854 | -1.163307 | -1.17903 | 0.8844656 | 0.6960130 | 0.5468822 | -0.2373797 | -1.0892332 | 0.0803846 | 0.9045094 | 0.9939035 | 1.3300336 | 0.3576903 | 0.7065401 | -0.8299445 | 2.7378340 | 0.5986430 | 0.0302239 | 1.1874299 | 8.54 |
| -0.3876816 | -0.9650293 | -0.3617512 | 1.7776376 | -0.1754205 | 0.9601386 | 0.7580828 | -0.3487751 | 1.1245987 | 0.6670133 | 1.1282743 | -0.7460745 | -0.8112036 | -1.027837 | 0.8045664 | -2.1844568 | -1.163307 | -1.17903 | 0.6231393 | 0.6810029 | -2.4548396 | -0.1059513 | 0.6135222 | -2.0498698 | 0.8095406 | 0.9265682 | 1.6721521 | 1.1925299 | 0.7065401 | -0.8299445 | 2.5814395 | 1.7026429 | 0.6540279 | 1.2100743 | 8.61 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -1.7799105 | -2.0717096 | 1.0400121 | -1.1468189 | -0.0630670 | 0.6173947 | 1.2615772 | -0.2942791 | -1.027837 | -0.4032894 | -0.4234122 | -1.163307 | -1.17903 | 0.0695220 | 0.6361217 | 0.5468822 | -0.5038630 | -0.3612107 | 0.6572927 | -0.7838095 | -1.0108226 | 1.6721521 | 0.7520127 | 0.7065401 | -0.8299445 | 3.3570417 | -0.7955619 | -0.6217195 | -0.7599857 | 8.74 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.9869816 | 1.4538053 | -0.9212523 | -1.4064544 | -0.3498769 | 1.6328781 | 0.7338316 | 3.8411170 | -1.027837 | -1.3119287 | -0.5412123 | -1.163307 | -1.17903 | -0.0759471 | 0.5028187 | 0.5468822 | -0.2373797 | -1.2444176 | 0.5952966 | -0.7838095 | -1.0108226 | 1.6721521 | 0.4027100 | 0.7065401 | -0.8299445 | 3.3570417 | -0.8472862 | -0.4560038 | -0.7146970 | 8.55 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.6039141 | 1.1934515 | 0.0646806 | -0.8356587 | -0.5452300 | 1.2119820 | -0.5055641 | -1.3281282 | -1.027837 | -1.3576836 | -0.6595235 | -1.163307 | -1.17903 | 0.2120521 | 0.5914639 | 0.5468822 | -0.5038630 | -0.2007139 | 0.4454727 | -0.6171961 | -0.6873889 | 0.6457967 | 0.7520127 | 0.7065401 | -0.8299445 | 3.3570417 | -0.7443114 | -1.3035206 | -0.6694082 | 8.63 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.5409102 | -0.3823992 | 0.1706949 | -1.2108131 | -0.9462379 | 0.4188365 | 0.0965566 | -0.8112036 | -1.027837 | -0.2230094 | -0.4822488 | -1.163307 | -1.17903 | 0.2120521 | 0.6063250 | 0.5564754 | -0.3700120 | 0.1175229 | 0.4695823 | -0.5623897 | -0.5908470 | 0.6457967 | 0.4462532 | 0.7065401 | -0.8299445 | -0.3473212 | -0.7443114 | -1.8355740 | -0.6920526 | 8.57 |
| -0.3876816 | -0.9650293 | 2.7539772 | -0.5604375 | -0.1754205 | -0.8583235 | 0.0577117 | 0.6795635 | -0.9583897 | -0.2534538 | 0.0592796 | 0.8740066 | 1.2564944 | -1.027837 | -1.4951365 | -0.5412123 | -1.163307 | -1.17903 | -0.3762053 | 0.6361217 | 0.5500784 | 0.9038331 | -0.0086665 | 0.5729091 | -0.4538136 | -0.4119692 | 1.6721521 | 0.3576903 | 0.7065401 | -0.8299445 | -0.3473212 | -0.5439333 | -1.4788500 | -0.5108976 | 8.43 |
| -0.3876816 | -0.9650293 | -0.3617512 | -0.5604375 | 5.6792381 | -1.0516784 | -1.0904126 | 0.6583607 | -0.8356587 | -0.2534538 | -0.6273472 | 0.7338316 | -0.8112036 | -1.027837 | 0.5824681 | -1.3805093 | -1.163307 | -1.17903 | 0.8844656 | 1.4294958 | -2.4540372 | 1.7303860 | -0.9532469 | -2.0464255 | -0.9539501 | -1.3975431 | 0.6457967 | 1.8488327 | 0.7065401 | -0.8299445 | -0.5165825 | -0.7955619 | -1.3035206 | -0.5108976 | 8.60 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.2294224 | 0.2332426 | -0.3911808 | 0.6218844 | 0.8420124 | 0.6494338 | -0.5055641 | -1.3281282 | -1.027837 | -1.1292205 | -0.7188728 | -1.163307 | -1.17903 | -0.5312665 | 0.5618162 | 0.5277389 | -0.1059513 | -0.8199085 | 0.5126351 | -0.7278954 | -0.8968551 | 2.0142706 | 1.2669589 | 0.7065401 | -0.8299445 | -0.6865594 | -0.8472862 | -0.6217195 | -0.7373413 | 8.52 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.9225304 | -0.3823992 | -0.0519351 | -0.7151727 | -0.2055514 | 0.7748625 | 1.7190407 | 0.2226454 | -1.027837 | -1.4951365 | -0.5412123 | -1.163307 | -1.17903 | -0.5312665 | 0.6660177 | 0.5181887 | -0.2373797 | -0.9472553 | 0.5866860 | -0.5623897 | -0.5908470 | 1.3300336 | 0.6451271 | 0.7065401 | -0.8299445 | -0.5165825 | -0.7955619 | -1.1301723 | -0.6920526 | 8.54 |
| -0.3876816 | -0.9650293 | -0.3617512 | 1.7776376 | -0.1754205 | 1.2441099 | -1.0016517 | -0.4547894 | 0.1221458 | -0.5945976 | -0.3931767 | -0.5055641 | -0.8112036 | -1.027837 | -1.1292205 | -0.5412123 | -1.163307 | -1.17903 | -1.7241003 | 0.6361217 | 0.5245539 | -1.3334161 | 0.2028362 | 0.4540833 | 1.1850323 | 1.1748584 | 1.6721521 | 1.7973723 | 0.7065401 | -0.8299445 | -0.3473212 | 1.6722219 | 1.3966330 | 1.2553630 | 8.62 |
| -0.3876816 | 1.0323569 | -0.3617512 | -0.5604375 | -0.1754205 | -0.4156124 | -0.1181122 | -1.3877152 | -0.4234268 | -0.2055514 | -0.8874764 | -1.7215960 | -1.3281282 | -1.027837 | -1.1748510 | -0.7783524 | -1.163307 | -1.17903 | -0.3762053 | 0.6212109 | 0.5245539 | -0.5038630 | 0.4202745 | 0.4850813 | -0.6171961 | -0.6873889 | 1.3300336 | -0.3646397 | 0.7065401 | -0.8299445 | -0.1787716 | -0.6935289 | -0.4560038 | -0.6694082 | 8.55 |
Looking at the PH column (the last column) in the final data sample above, we can see that the empty PH values have now been replaced with our Cubist model predictions.
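As a quick check, we can also verify that a prediction was produced for every row of the evaluation set.
# Sketch: confirm that no PH predictions are missing in the evaluation data.
sum(is.na(evaluationDataClean$PH))
nrow(evaluationDataClean)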
# Inspect range of predicted values for evaluation dataset.
summary(evaluationDataClean$PH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 8.130 8.430 8.510 8.504 8.590 8.870
histogram(evaluationDataClean$PH)
From the above, we can see that the range of the predicted PH values in the evaluation dataset is slightly narrower than the corresponding range in the training data. This gives us confidence that the model appears to work well on unseen data, although the true accuracy check would be a comparison against the actual PH values for the evaluation data, to which we do not have access. While not perfectly normal, the shape of the distribution of the predicted PH levels is reasonably close to normal.
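For a more direct comparison, the summary of the observed PH values in the raw training data can be placed next to the summary of the predictions; the sketch below uses the beverage.train data frame loaded at the start of this report.
# Sketch: compare observed PH in the training data with the predicted PH
# for the evaluation data.
summary(beverage.train$PH)
summary(evaluationDataClean$PH)
histogram(beverage.train$PH)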
Finally, we will write the final results to a CSV file.
# Save the final results to a CSV file.
write.csv(evaluationDataClean, './data/FinalPHPredictions.csv', row.names = FALSE)
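Because the predictions are also requested in an Excel-readable format, the same data frame can be written directly to an .xlsx workbook; the sketch below assumes the writexl package is installed.
# Optional sketch: also write the predictions to a native Excel workbook.
# Requires the writexl package (install.packages('writexl') if needed).
writexl::write_xlsx(evaluationDataClean, './data/FinalPHPredictions.xlsx')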
Based on the exploratory data analysis of the training dataset, we prepared the data by dropping some columns and rows, creating a separate "Unknown" group for missing brand codes, creating dummy variables for the single categorical variable (Brand Code), imputing missing values and applying a Box-Cox transformation to the predictors to reduce skewness.
After this, we used three categories of models: linear, non-linear and tree-based. We trained two models from each category, for a total of six models, and used a combination of RMSE and R-squared as the performance metrics to select the final model. We chose the Cubist model because its metrics were clearly better than those of the other models: it has the lowest RMSE and the highest R-squared. This is not surprising, given that Cubist models handle non-linear relationships and multicollinearity well, which also comes across in the list of top predictors selected by the model, as described in the previous section.
For the final model, the five most important predictors by importance score are Mnf.Flow, Alch.Rel, Balling.Lvl, Pressure.Vacuum and Balling, with Brand Code C also appearing among the more influential predictors. Finding Brand Code among the important predictors is interesting because brand is not, at the end of the day, a physical or chemical property that can be linked directly to PH levels; we think it acts as a proxy, collectively encapsulating other chemical characteristics that in turn help explain PH.
We also see that the range of the predicted PH values for the evaluation data is in line with the range of the observed PH values in the training data, which gives us confidence that the selected model generalizes to unseen data. In addition, the general shape of the distribution of the predicted values is approximately normal.
As with any real-world data science process, the logical next step would be to compute proper accuracy metrics by comparing the predictions against the actual PH values for the evaluation data once those measurements become available. Our recommendation to ABC Beverages management is to adopt the Cubist model and put an ongoing process in place to monitor it and re-tune it if its performance metrics deteriorate.
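As a concrete example of that monitoring step, once actual PH measurements for these batches are recorded, the same caret utility used throughout this report can re-score the deployed model; actualPH below is a placeholder for those future measurements.
# Sketch: re-score the deployed Cubist model once actual PH values are known.
# `actualPH` is a placeholder vector of measured PH values, aligned
# row-for-row with evaluationDataClean.
postResample(pred = evaluationDataClean$PH, obs = actualPH)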