This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1 0.104 0.26 0.04 -100 118.8 46.0
## 2 0.124 0.22 0.04 -100 121.6 46.0
## 3 0.090 0.34 0.16 -100 120.2 46.0
## 4 NA 0.42 0.04 -100 115.2 46.4
## 5 0.026 0.16 0.12 -100 118.4 45.8
## 6 0.090 0.24 0.04 -100 119.6 45.6
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 1 0 NA NA 118 121.2
## 2 0 NA NA 106 118.6
## 3 0 NA NA 82 120.0
## 4 0 0 0 92 117.8
## 5 0 0 0 92 118.6
## 6 0 0 0 116 120.2
## Filler.Speed Temperature Usage.cont Carb.Flow Density MFR Balling
## 1 4002 66.0 16.18 2932 0.88 725.0 1.398
## 2 3986 67.6 19.90 3144 0.92 726.8 1.498
## 3 4020 67.0 17.76 2914 1.58 735.0 3.142
## 4 4012 65.6 17.42 3062 1.54 730.6 3.042
## 5 4010 65.6 17.68 3054 1.54 722.8 3.042
## 6 4014 66.2 23.82 2948 1.52 738.8 2.992
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## 1 -4.0 8.36 0.022 120 46.4
## 2 -4.0 8.26 0.026 120 46.8
## 3 -3.8 8.94 0.024 120 46.6
## 4 -4.4 8.24 0.030 120 46.0
## 5 -4.4 8.26 0.030 120 46.0
## 6 -4.4 8.32 0.024 120 46.0
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## 1 142.6 6.58 5.32 1.48
## 2 143.0 6.56 5.30 1.56
## 3 142.0 7.66 5.84 3.28
## 4 146.2 7.14 5.42 3.04
## 5 146.2 7.14 5.44 3.04
## 6 146.6 7.16 5.44 3.02
EXCEL
The first thing we did was open the Excel files containing the training and test data that were provided, just to get a general idea of what we are looking at. We looked at the training data first. We have 1 response variable (PH) and 32 predictor variables (31 numeric plus the categorical Brand Code) across 2571 observations. Using the filter function, we noticed immediately that the predictor “Brand Code” is missing in about 120 rows. We also noticed that 4 values of our response variable, PH, are missing. In addition, we considered whether it may be beneficial to convert PH from numerical to categorical based on its value (i.e., acidic, neutral, basic). Anything below 7 is acidic and anything above 7 is basic, although we realize that our data range only from about 7.9 upward. Below are the summary statistics we obtained from our Excel dive.
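To make the idea of a categorical PH concrete, here is a minimal sketch in base R. The helper name `ph_category` and the toy readings are our own illustration (only the last three values come from the first rows of the data shown above); it is not part of the modelling pipeline.

```r
# Toy pH readings; the first two are made up, the last three appear in head(bev).
ph <- c(6.2, 7.0, 8.36, 8.26, 8.94)

# Hypothetical helper: label each reading relative to neutral pH 7.
# Missing readings stay NA.
ph_category <- function(x) {
  out <- ifelse(x < 7, "acidic", ifelse(x > 7, "basic", "neutral"))
  factor(out, levels = c("acidic", "neutral", "basic"))
}

ph_class <- ph_category(ph)
table(ph_class)
```

Because the observed PH values run from 7.88 to 9.36, every training row would land in the same class under this scheme, so a split at 7 would leave only one populated category to predict.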
Column.Name | MIN | MAX | MEAN | MEDIAN | NA.S |
---|---|---|---|---|---|
Carb Volume | 5.0400 | 5.700 | 5.370 | 5.347 | 10 |
Fill Ounces | 23.6300 | 24.320 | 23.970 | 23.970 | 38 |
PC Volume | 0.0790 | 0.478 | 0.277 | 0.271 | 39 |
Carb Pressure | 57.0000 | 79.400 | 68.190 | 68.200 | 27 |
Carb Temp | 128.6000 | 154.000 | 141.090 | 140.800 | 26 |
PSC | 0.0000 | 0.270 | 0.080 | 0.080 | 33 |
PSC Fill | 0.0000 | 0.620 | 0.200 | 0.180 | 23 |
PSC CO2 | 0.0000 | 0.240 | 0.060 | 0.040 | 39 |
Mnf Flow | -100.2000 | 229.400 | 24.590 | 65.200 | 2 |
Carb Pressure 1 | 105.6000 | 140.200 | 122.590 | 123.200 | 32 |
Fill Pressure | 34.6000 | 60.400 | 47.920 | 46.400 | 22 |
Hyd Pressure 1 | -0.8000 | 58.000 | 12.440 | 11.400 | 11 |
Hyd Pressure 2 | 0.0000 | 59.400 | 20.960 | 28.600 | 15 |
Hyd Pressure 3 | -1.2000 | 50.000 | 20.460 | 27.600 | 15 |
Hyd Pressure 4 | 52.0000 | 142.000 | 96.290 | 96.000 | 30 |
Filler Level | 55.8000 | 161.200 | 109.250 | 118.400 | 20 |
Filler Speed | 998.0000 | 4030.000 | 3687.200 | 3982.000 | 57 |
Temperature | 63.6000 | 76.200 | 65.970 | 65.600 | 14 |
Usage Cont | 12.0800 | 25.900 | 20.990 | 21.790 | 5 |
Carb Flow | 26.0000 | 5104.000 | 2468.350 | 3028.000 | 2 |
Density | 0.2400 | 1.920 | 1.170 | 0.980 | 1 |
MFR | 31.4000 | 868.600 | 704.050 | 724.000 | 212 |
Balling | -0.1700 | 4.012 | 2.200 | 1.650 | 1 |
Pressure Vacuum | -6.6000 | -3.600 | -5.220 | -5.400 | 0 |
PH | 7.8800 | 9.360 | 8.550 | 8.540 | 4 |
Oxygen Filler | 0.0024 | 0.400 | 0.047 | 0.030 | 12 |
Bowl Setpoint | 70.0000 | 140.000 | 109.340 | 120.000 | 2 |
Pressure Setpoint | 44.0000 | 52.000 | 47.620 | 46.000 | 12 |
Air Pressurer | 140.8000 | 148.200 | 142.830 | 142.600 | 0 |
Alch Rel | 5.2800 | 8.620 | 6.900 | 6.560 | 9 |
Carb Rel | 4.9600 | 6.060 | 5.440 | 5.400 | 10 |
Balling Lvl | 0.0000 | 3.660 | 2.050 | 1.480 | 1 |
Aside from the missing response values, quite a few of the predictor variables have missing values. MFR has the most, with 212 missing values, while “Pressure Vacuum” and “Air Pressurer” have none. We will go ahead and impute the missing values for the predictor variables. There are a few variables that we worry may contain outliers because of the wide range between the min and the max. One such variable is “Carb Flow”, with a min of 26 and a max of 5104. Another is MFR, with a min of 31.4 and a max of 868.6.
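One way to turn "the range looks wide" into a concrete check is Tukey's rule, which flags values more than 1.5 times the IQR beyond the quartiles. The sketch below uses base R only; the helper `flag_outliers` and the toy vector are hypothetical and stand in for a real column.

```r
# Toy values standing in for a single process variable (made up, not the real data).
toy_flow <- c(26, 1144, 2914, 2932, 2948, 3028, 3054, 3062, 3144, 3186, 5104, NA)

# Hypothetical helper: TRUE where a value lies outside Tukey's fences
# (k * IQR below the first quartile or above the third). NAs are never flagged.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

flags <- flag_outliers(toy_flow)
toy_flow[flags]
```

Using the quartiles that the describe output reports, the fences for Carb.Flow (quartiles 1144 and 3186) would sit at -1919 and 6249, so neither 26 nor 5104 would be flagged, whereas for MFR (quartiles 706.3 and 731.0) the fences would be about 669 and 768, so both its minimum and maximum would be. A wide range on a bimodal variable is therefore not, by itself, evidence of outliers.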
Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | IQR | Q0.25 | Q0.75 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Brand.Code* | 1 | 2451 | 2.5063239 | 0.9956337 | 2.0000000 | 2.5079041 | 0.0000000 | 1.0000000 | 4.000 | 3.0000000 | 0.3818872 | -1.0613288 | 0.0201107 | 2.0000000 | 2.0000000 | 4.000000 |
Carb.Volume | 2 | 2561 | 5.3701978 | 0.1063852 | 5.3466667 | 5.3654840 | 0.1087240 | 5.0400000 | 5.700 | 0.6600000 | 0.3922121 | -0.4669916 | 0.0021022 | 0.1600000 | 5.2933333 | 5.453333 |
Fill.Ounces | 3 | 2533 | 23.9747546 | 0.0875299 | 23.9733333 | 23.9751390 | 0.0790720 | 23.6333333 | 24.320 | 0.6866667 | -0.0215452 | 0.8624714 | 0.0017392 | 0.1066667 | 23.9200000 | 24.026667 |
PC.Volume | 4 | 2532 | 0.2771187 | 0.0606953 | 0.2713333 | 0.2745818 | 0.0523852 | 0.0793333 | 0.478 | 0.3986667 | 0.3396269 | 0.6699690 | 0.0012062 | 0.0728333 | 0.2391667 | 0.312000 |
Carb.Pressure | 5 | 2544 | 68.1895755 | 3.5382039 | 68.2000000 | 68.1212574 | 3.5582400 | 57.0000000 | 79.400 | 22.4000000 | 0.1822162 | -0.0138046 | 0.0701495 | 5.0000000 | 65.6000000 | 70.600000 |
Carb.Temp | 6 | 2545 | 141.0949234 | 4.0373861 | 140.8000000 | 140.9912617 | 3.8547600 | 128.6000000 | 154.000 | 25.4000000 | 0.2468280 | 0.2375822 | 0.0800307 | 5.4000000 | 138.4000000 | 143.800000 |
PSC | 7 | 2538 | 0.0845737 | 0.0492690 | 0.0760000 | 0.0802746 | 0.0474432 | 0.0020000 | 0.270 | 0.2680000 | 0.8491445 | 0.6480498 | 0.0009780 | 0.0640000 | 0.0480000 | 0.112000 |
PSC.Fill | 8 | 2548 | 0.1953689 | 0.1177817 | 0.1800000 | 0.1837059 | 0.1186080 | 0.0000000 | 0.620 | 0.6200000 | 0.9334450 | 0.7691466 | 0.0023333 | 0.1600000 | 0.1000000 | 0.260000 |
PSC.CO2 | 9 | 2532 | 0.0564139 | 0.0430387 | 0.0400000 | 0.0494965 | 0.0296520 | 0.0000000 | 0.240 | 0.2400000 | 1.7288937 | 3.7250025 | 0.0008553 | 0.0600000 | 0.0200000 | 0.080000 |
Mnf.Flow | 10 | 2569 | 24.5689373 | 119.4811263 | 65.2000000 | 21.0679631 | 169.0164000 | -100.2000000 | 229.400 | 329.6000000 | 0.0041430 | -1.8697072 | 2.3573130 | 240.8000000 | -100.0000000 | 140.800000 |
Carb.Pressure1 | 11 | 2539 | 122.5863726 | 4.7428819 | 123.2000000 | 122.5379242 | 4.4478000 | 105.6000000 | 140.200 | 34.6000000 | 0.0543587 | 0.1418265 | 0.0941263 | 6.4000000 | 119.0000000 | 125.400000 |
Fill.Pressure | 12 | 2549 | 47.9221656 | 3.1775457 | 46.4000000 | 47.7071044 | 2.3721600 | 34.6000000 | 60.400 | 25.8000000 | 0.5471107 | 1.4067532 | 0.0629371 | 4.0000000 | 46.0000000 | 50.000000 |
Hyd.Pressure1 | 13 | 2560 | 12.4375781 | 12.4332538 | 11.4000000 | 10.8374023 | 16.9016400 | -0.8000000 | 58.000 | 58.8000000 | 0.7798043 | -0.1426463 | 0.2457338 | 20.2000000 | 0.0000000 | 20.200000 |
Hyd.Pressure2 | 14 | 2556 | 20.9610329 | 16.3863066 | 28.6000000 | 21.0519062 | 13.3434000 | 0.0000000 | 59.400 | 59.4000000 | -0.3019570 | -1.5592984 | 0.3241161 | 34.6000000 | 0.0000000 | 34.600000 |
Hyd.Pressure3 | 15 | 2556 | 20.4584507 | 15.9757236 | 27.6000000 | 20.5052786 | 13.9364400 | -1.2000000 | 50.000 | 51.2000000 | -0.3189061 | -1.5745834 | 0.3159949 | 33.4000000 | 0.0000000 | 33.400000 |
Hyd.Pressure4 | 16 | 2541 | 96.2888627 | 13.1225594 | 96.0000000 | 95.4530251 | 11.8608000 | 52.0000000 | 142.000 | 90.0000000 | 0.5459786 | 0.6340041 | 0.2603252 | 16.0000000 | 86.0000000 | 102.000000 |
Filler.Level | 17 | 2551 | 109.2523716 | 15.6984241 | 118.4000000 | 111.0417442 | 9.1921200 | 55.8000000 | 161.200 | 105.4000000 | -0.8482847 | 0.0460488 | 0.3108142 | 21.7000000 | 98.3000000 | 120.000000 |
Filler.Speed | 18 | 2514 | 3687.1988862 | 770.8200208 | 3982.0000000 | 3919.9870775 | 47.4432000 | 998.0000000 | 4030.000 | 3032.0000000 | -2.8700359 | 6.7059692 | 15.3734149 | 110.0000000 | 3888.0000000 | 3998.000000 |
Temperature | 19 | 2557 | 65.9675401 | 1.3827783 | 65.6000000 | 65.7986321 | 0.8895600 | 63.6000000 | 76.200 | 12.6000000 | 2.3869732 | 10.1612904 | 0.0273456 | 1.2000000 | 65.2000000 | 66.400000 |
Usage.cont | 20 | 2566 | 20.9929618 | 2.9779364 | 21.7900000 | 21.2517819 | 3.1875900 | 12.0800000 | 25.900 | 13.8200000 | -0.5353253 | -1.0170230 | 0.0587878 | 5.3950000 | 18.3600000 | 23.755000 |
Carb.Flow | 21 | 2569 | 2468.3542234 | 1073.6964743 | 3028.0000000 | 2601.1356344 | 326.1720000 | 26.0000000 | 5104.000 | 5078.0000000 | -0.9877287 | -0.5826893 | 21.1835857 | 2042.0000000 | 1144.0000000 | 3186.000000 |
Density | 22 | 2570 | 1.1736498 | 0.3775269 | 0.9800000 | 1.1533463 | 0.1482600 | 0.2400000 | 1.920 | 1.6800000 | 0.5260149 | -1.1992070 | 0.0074470 | 0.7200000 | 0.9000000 | 1.620000 |
MFR | 23 | 2359 | 704.0492582 | 73.8983094 | 724.0000000 | 718.1566967 | 15.4190400 | 31.4000000 | 868.600 | 837.2000000 | -5.0917729 | 30.4558939 | 1.5214950 | 24.7000000 | 706.3000000 | 731.000000 |
Balling | 24 | 2570 | 2.1977696 | 0.9310914 | 1.6480000 | 2.1287189 | 0.3706500 | -0.1700000 | 4.012 | 4.1820000 | 0.5939224 | -1.3855651 | 0.0183665 | 1.7960000 | 1.4960000 | 3.292000 |
Pressure.Vacuum | 25 | 2571 | -5.2161027 | 0.5699933 | -5.4000000 | -5.2521147 | 0.5930400 | -6.6000000 | -3.600 | 3.0000000 | 0.5256608 | -0.0313126 | 0.0112414 | 0.6000000 | -5.6000000 | -5.000000 |
PH | 26 | 2567 | 8.5456486 | 0.1725162 | 8.5400000 | 8.5516788 | 0.1779120 | 7.8800000 | 9.360 | 1.4800000 | -0.2906437 | 0.0644294 | 0.0034050 | 0.2400000 | 8.4400000 | 8.680000 |
Oxygen.Filler | 27 | 2559 | 0.0468426 | 0.0466436 | 0.0334000 | 0.0388837 | 0.0249077 | 0.0024000 | 0.400 | 0.3976000 | 2.6603955 | 11.0882098 | 0.0009221 | 0.0380000 | 0.0220000 | 0.060000 |
Bowl.Setpoint | 28 | 2569 | 109.3265862 | 15.3031541 | 120.0000000 | 111.3466213 | 0.0000000 | 70.0000000 | 140.000 | 70.0000000 | -0.9743842 | -0.0564212 | 0.3019249 | 20.0000000 | 100.0000000 | 120.000000 |
Pressure.Setpoint | 29 | 2559 | 47.6153966 | 2.0390474 | 46.0000000 | 47.6026354 | 0.0000000 | 44.0000000 | 52.000 | 8.0000000 | 0.2031970 | -1.6012622 | 0.0403081 | 4.0000000 | 46.0000000 | 50.000000 |
Air.Pressurer | 30 | 2571 | 142.8339946 | 1.2119170 | 142.6000000 | 142.5812348 | 0.5930400 | 140.8000000 | 148.200 | 7.4000000 | 2.2521053 | 4.7336291 | 0.0239013 | 0.8000000 | 142.2000000 | 143.000000 |
Alch.Rel | 31 | 2562 | 6.8974161 | 0.5052753 | 6.5600000 | 6.8384390 | 0.0593040 | 5.2800000 | 8.620 | 3.3400000 | 0.8836378 | -0.8506221 | 0.0099825 | 0.7000000 | 6.5400000 | 7.240000 |
Carb.Rel | 32 | 2561 | 5.4367825 | 0.1287183 | 5.4000000 | 5.4301318 | 0.1186080 | 4.9600000 | 6.060 | 1.1000000 | 0.5032472 | -0.2949480 | 0.0025435 | 0.2000000 | 5.3400000 | 5.540000 |
Balling.Lvl | 33 | 2570 | 2.0500078 | 0.8703089 | 1.4800000 | 1.9827237 | 0.2075640 | 0.0000000 | 3.660 | 3.6600000 | 0.5858456 | -1.4858636 | 0.0171675 | 1.7600000 | 1.3800000 | 3.140000 |
The describe function from the psych package gives us a more detailed summary-statistic breakdown, including skewness. We see that some variables are right skewed (PSC CO2, PSC Fill, and Temperature) while some are left skewed (Filler Speed, Carb Flow, and MFR). We will perform some transformations later to address the skewness of the data. First, let’s draw some plots to explore our predictors further.
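As a quick check before plotting, the sketch below shows how a skewness value of this kind can be computed by hand and how a transformation changes it. The helper `skewness` is our own moment-based version; psych may apply a slightly different small-sample correction, which is not reproduced here. The data are simulated, and the log transform is only an assumed example of the kind of transformation we might use later.

```r
# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the 1.5 power of the second).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
right_skewed <- rexp(1000)      # simulated data with a long right tail
left_skewed  <- -right_skewed   # mirror image: long left tail

skewness(right_skewed)          # positive
skewness(left_skewed)           # negative, same magnitude
skewness(log1p(right_skewed))   # closer to zero after a log transform
```

A positive value corresponds to the right-skewed variables listed above and a negative value to the left-skewed ones.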
library(DataExplorer)
#create_report(bev, y = "PH")
DataExplorer::plot_histogram(bev, nrow = 3L, ncol = 4L)
Looking at the plots, a few things jump out at us immediately. Many of the variables do not appear to be normally distributed. A few have spikes that we think might be outliers, and we will explore these further. A few of the distributions appear to be bimodal; we will create dummy variables to flag these. We will definitely need to do some pre-processing before throwing the data into a model. We’d also like to look at a correlation plot to see whether we have highly correlated predictors, and we will remove those that are.
library(dplyr)
# Recode each bimodal predictor as a 0/1 flag for its lower group of values.
# The original values are overwritten, and missing values stay NA.
bev.new <- bev %>%
  mutate(Mnf.Flow = if_else(Mnf.Flow < 0, 1, 0)) %>%
  mutate(Hyd.Pressure1 = if_else(Hyd.Pressure1 <= 0, 1, 0)) %>%
  mutate(Hyd.Pressure2 = if_else(Hyd.Pressure2 <= 0, 1, 0)) %>%
  mutate(Filler.Speed = if_else(Filler.Speed < 2500, 1, 0)) %>%
  mutate(Carb.Flow = if_else(Carb.Flow < 2000, 1, 0)) %>%
  mutate(Balling = if_else(Balling < 2.5, 1, 0))
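The pipeline above replaces each variable with its flag. If the original measurement should remain available to the model, the flag can instead be stored in a new column. The base R sketch below illustrates this on toy data; the column name `Mnf.Flow.neg` is hypothetical and the values are made up apart from the -100 readings visible in head(bev).

```r
# Toy data frame with one bimodal-looking column and a missing value.
toy <- data.frame(Mnf.Flow = c(-100, -100.2, 65.2, 140.8, NA))

# Add a 0/1 flag as a new (hypothetical) column and leave Mnf.Flow untouched.
# A missing measurement yields a missing flag.
toy$Mnf.Flow.neg <- as.integer(toy$Mnf.Flow < 0)

toy
```

Whether to keep both the flag and the original value is a modelling choice; keeping both preserves the variation within each mode at the cost of one extra column.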
Now we’ll take a look at a correlation plot.
library(corrplot)
# Pairwise Pearson correlations among the numeric columns (Brand.Code is categorical, so it is excluded)
cor.plt <- cor(bev.new %>% dplyr::select(-Brand.Code), use = "pairwise.complete.obs", method = "pearson")
corrplot(cor.plt, method = "color", type = "upper", order = "original",
         number.cex = .6, addCoef.col = "black", tl.srt = 90, diag = TRUE)
# Drop the highly correlated predictors identified from the plot
bev.remove <- names(bev.new) %in% c("Density", "Balling", "Carb.Rel", "Alch.Rel")
bev.new <- bev.new[!bev.remove]
head(bev.new)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp
## 1 B 5.340000 23.96667 0.2633333 68.2 141.2
## 2 A 5.426667 24.00667 0.2386667 68.4 139.6
## 3 B 5.286667 24.06000 0.2633333 70.8 144.8
## 4 A 5.440000 24.00667 0.2933333 63.0 132.6
## 5 A 5.486667 24.31333 0.1113333 67.2 136.8
## 6 A 5.380000 23.92667 0.2693333 66.6 138.4
## PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 1 0.104 0.26 0.04 1 118.8 46.0
## 2 0.124 0.22 0.04 1 121.6 46.0
## 3 0.090 0.34 0.16 1 120.2 46.0
## 4 NA 0.42 0.04 1 115.2 46.4
## 5 0.026 0.16 0.12 1 118.4 45.8
## 6 0.090 0.24 0.04 1 119.6 45.6
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 1 1 NA NA 118 121.2
## 2 1 NA NA 106 118.6
## 3 1 NA NA 82 120.0
## 4 1 1 0 92 117.8
## 5 1 1 0 92 118.6
## 6 1 1 0 116 120.2
## Filler.Speed Temperature Usage.cont Carb.Flow MFR Pressure.Vacuum PH
## 1 0 66.0 16.18 0 725.0 -4.0 8.36
## 2 0 67.6 19.90 0 726.8 -4.0 8.26
## 3 0 67.0 17.76 0 735.0 -3.8 8.94
## 4 0 65.6 17.42 0 730.6 -4.4 8.24
## 5 0 65.6 17.68 0 722.8 -4.4 8.26
## 6 0 66.2 23.82 0 738.8 -4.4 8.32
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Balling.Lvl
## 1 0.022 120 46.4 142.6 1.48
## 2 0.026 120 46.8 143.0 1.56
## 3 0.024 120 46.6 142.0 3.28
## 4 0.030 120 46.0 146.2 3.04
## 5 0.030 120 46.0 146.2 3.04
## 6 0.024 120 46.0 146.6 3.02
#library(ggplot2)
#plot_correlation(bev.new, type = c("all", "discrete", "continuous"),
#maxcat = 20L, cor_args = list(), geom_text_args = list(),
#title = NULL, ggtheme = theme_gray(),
#theme_config = list(legend.position = "bottom", axis.text.x =
#element_text(angle = 90)))
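Highly correlated pairs can also be screened numerically rather than read off a plot. The sketch below is a minimal base R illustration on simulated data; the helper `high_cor_pairs` and the 0.9 cutoff are assumptions of ours, not the rule used in this report.

```r
set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # nearly a copy of a
  c = rnorm(n)                 # unrelated to a and b
)

# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds a cutoff, reporting each pair once.
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df, use = "pairwise.complete.obs", method = "pearson")
  cm[lower.tri(cm, diag = TRUE)] <- NA   # keep only the upper triangle
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

hits <- high_cor_pairs(toy)
hits
```

For each pair returned, one member would typically be dropped; which one to keep remains a judgment call.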
From the correlation plot, we noticed that Density, Balling, Carb.Rel, and Alch.Rel are highly correlated with other predictors, so we decided to remove those variables. As we stated earlier, Brand Code was missing about 120 values. We first converted the Brand.Code predictor to a factor so that it would be compatible with random forest imputation.
We then filtered out the 4 records with a missing response (PH) value and imputed the remaining missing predictor values using random forest imputation.
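The order of operations described here (drop rows with a missing response, set the response aside, impute the predictors, re-attach the response) can be sketched on toy data. To keep the example runnable with base R alone, it substitutes a simple median and most-frequent-level imputer for the random forest; that imputer, the helper name `impute_simple`, and the toy values are stand-ins, not the method used in this report.

```r
# Toy data: one factor predictor, one numeric predictor and the response PH (values made up).
toy <- data.frame(
  Brand.Code  = factor(c("A", "B", NA, "B", "B", "B")),
  Carb.Volume = c(5.34, NA, 5.29, 5.44, 5.49, 5.38),
  PH          = c(8.36, 8.26, 8.94, NA, 8.26, 8.32)
)

# 1. Drop rows with a missing response.
toy <- toy[!is.na(toy$PH), ]

# 2. Set the response aside so it cannot inform the imputation.
predictors <- toy[, setdiff(names(toy), "PH"), drop = FALSE]

# 3. Stand-in imputer (NOT random forest): median for numeric columns,
#    most frequent level for factors.
impute_simple <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
  } else {
    col[is.na(col)] <- names(which.max(table(col)))
  }
  col
}
imputed <- as.data.frame(lapply(predictors, impute_simple))

# 4. Re-attach the response; row order is unchanged, so the values line up.
imputed$PH <- toy$PH
imputed
```

Keeping the response out of step 3 means the same imputation can later be applied to the test set, where PH is unknown.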
## Pressure.Vacuum Air.Pressurer Balling.Lvl Mnf.Flow Carb.Flow
## 2044 1 1 1 1 1
## 101 1 1 1 1 1
## 91 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 30 1 1 1 1 1
## 3 1 1 1 1 1
## 18 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 17 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 18 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 15 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 27 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 5 1 1 1 1 1
## 10 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 9 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 10 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 7 1 1 1 1 1
## 1 1 1 1 1 1
## 9 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 7 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 9 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 0
## 1 1 1 1 0 1
## 1 1 1 1 0 1
## 1 1 1 0 1 1
## 0 0 1 2 2
## Bowl.Setpoint PH Usage.cont Carb.Volume Hyd.Pressure1 Oxygen.Filler
## 2044 1 1 1 1 1 1
## 101 1 1 1 1 1 1
## 91 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 30 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 18 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 15 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 27 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1
## 10 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 9 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 10 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 4 1 1 1 1 1 0
## 1 1 1 1 1 1 0
## 2 1 1 1 1 1 0
## 2 1 1 1 1 1 0
## 9 1 1 1 1 0 1
## 1 1 1 1 1 0 1
## 1 1 1 1 1 0 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 2 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 3 1 1 0 1 1 1
## 1 1 1 0 1 1 1
## 1 1 1 0 1 1 1
## 1 1 0 1 1 1 1
## 1 1 0 1 1 1 0
## 2 0 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 0 1 1 1 1
## 1 1 0 1 1 1 1
## 1 1 1 1 1 1 0
## 2 4 5 10 11 12
## Pressure.Setpoint Temperature Hyd.Pressure2 Hyd.Pressure3
## 2044 1 1 1 1
## 101 1 1 1 1
## 91 1 1 1 1
## 3 1 1 1 1
## 2 1 1 1 1
## 30 1 1 1 1
## 3 1 1 1 1
## 18 1 1 1 1
## 4 1 1 1 1
## 1 1 1 1 1
## 17 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 18 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 15 1 1 1 1
## 4 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 27 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 5 1 1 1 1
## 10 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 3 1 1 1 1
## 9 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 10 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 7 1 1 1 1
## 1 1 1 1 1
## 9 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 3 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 4 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 3 1 1 0 0
## 1 1 1 0 0
## 4 1 0 1 1
## 1 1 0 1 1
## 2 1 0 1 1
## 1 1 0 1 1
## 1 1 0 1 1
## 7 0 1 1 1
## 1 0 1 1 1
## 1 0 1 1 1
## 1 0 1 1 1
## 1 0 1 1 1
## 1 0 1 1 1
## 1 1 1 1 1
## 4 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 2 1 0 1 1
## 9 1 1 0 0
## 1 1 1 0 0
## 1 1 1 0 0
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 0 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 1 1 1
## 1 1 0 1 1
## 2 1 1 1 1
## 2 1 1 1 1
## 1 1 1 1 1
## 1 1 0 1 1
## 1 1 1 1 1
## 12 14 15 15
## Filler.Level Fill.Pressure PSC.Fill Carb.Temp Carb.Pressure
## 2044 1 1 1 1 1
## 101 1 1 1 1 1
## 91 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 30 1 1 1 1 1
## 3 1 1 1 1 1
## 18 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 17 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 18 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 15 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 27 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 5 1 1 1 1 1
## 10 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 3 1 1 1 1 1
## 9 1 1 1 1 0
## 2 1 1 1 1 0
## 1 1 1 1 1 0
## 1 1 1 1 1 0
## 10 1 1 1 0 1
## 1 1 1 1 0 1
## 2 1 1 1 0 1
## 1 1 1 1 0 1
## 1 1 1 1 0 1
## 1 1 1 1 0 1
## 7 1 1 1 0 0
## 1 1 1 1 0 0
## 9 1 1 0 1 1
## 1 1 1 0 1 1
## 2 1 1 0 1 1
## 1 1 1 0 1 1
## 1 1 1 0 1 1
## 2 1 1 0 1 1
## 1 1 1 0 1 1
## 1 1 1 0 1 1
## 1 1 1 0 1 1
## 1 1 1 0 1 1
## 1 1 0 1 1 1
## 2 1 0 1 1 1
## 1 1 0 1 1 1
## 1 1 0 1 1 1
## 1 1 0 1 1 1
## 3 0 1 1 1 1
## 2 0 1 1 1 1
## 1 0 1 1 1 1
## 4 0 0 1 1 1
## 2 0 0 1 1 1
## 1 0 0 1 1 1
## 1 0 0 1 1 1
## 1 0 0 0 1 1
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 1 0 0 1 1 1
## 7 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 0 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 9 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 0
## 2 1 1 1 1 0
## 1 1 1 0 1 0
## 1 1 1 0 0 0
## 1 1 0 1 1 1
## 1 1 1 1 0 0
## 3 1 1 1 1 1
## 1 1 1 1 1 1
## 1 1 1 1 1 1
## 1 0 0 1 1 1
## 1 0 0 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 1 0 0 1 1 1
## 1 0 0 1 1 1
## 1 1 1 1 1 1
## 20 22 23 26 27
## Hyd.Pressure4 Carb.Pressure1 PSC Fill.Ounces PC.Volume PSC.CO2
## 2044 1 1 1 1 1 1
## 101 1 1 1 1 1 1
## 91 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 30 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 18 1 1 1 1 1 0
## 4 1 1 1 1 1 0
## 1 1 1 1 1 1 0
## 17 1 1 1 1 0 1
## 2 1 1 1 1 0 1
## 3 1 1 1 1 0 1
## 1 1 1 1 1 0 1
## 1 1 1 1 1 0 0
## 18 1 1 1 0 1 1
## 2 1 1 1 0 1 1
## 3 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 2 1 1 1 0 0 1
## 15 1 1 0 1 1 1
## 4 1 1 0 1 1 1
## 1 1 1 0 1 1 1
## 1 1 1 0 1 1 0
## 1 1 1 0 1 1 0
## 3 1 1 0 1 0 1
## 1 1 1 0 1 0 1
## 27 1 0 1 1 1 1
## 2 1 0 1 1 0 1
## 1 1 0 0 1 0 1
## 5 0 1 1 1 1 1
## 10 0 1 1 1 1 1
## 1 0 1 1 1 1 1
## 1 0 1 1 1 1 1
## 3 0 1 1 1 1 1
## 9 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 1 1 1 1 1 0 1
## 10 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 1 1 1 1 0 1 1
## 7 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 0 1 1
## 1 1 1 1 0 1 1
## 1 1 1 1 0 1 0
## 2 1 1 0 1 1 1
## 1 1 1 0 0 1 0
## 1 1 0 1 1 1 1
## 1 1 0 1 1 1 1
## 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 0 1 0 1
## 1 0 1 1 1 1 1
## 1 0 1 1 1 1 1
## 3 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 0 1 1
## 4 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 0 1 1
## 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 0 1 1
## 1 0 1 1 1 1 1
## 7 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 1 1 1 1 1 0 1
## 1 1 1 0 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1
## 1 1 1 1 0 0 1
## 2 0 1 1 1 1 1
## 2 1 1 1 1 1 1
## 9 1 1 1 1 1 1
## 1 1 1 1 1 0 1
## 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 1 1 1 1 0 1 0
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 0
## 1 1 1 0 0 1 0
## 1 1 1 1 1 1 0
## 1 1 1 1 1 0 1
## 1 1 1 1 1 1 0
## 3 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 1 1 1 1 1 0
## 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 2 1 1 1 1 1 1
## 1 1 1 1 1 1 1
## 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1
## 30 32 33 38 39 39
## Filler.Speed Brand.Code MFR
## 2044 1 1 1 0
## 101 1 1 0 1
## 91 1 0 1 1
## 3 1 0 0 2
## 2 0 1 1 1
## 30 0 1 0 2
## 3 0 0 0 3
## 18 1 1 1 1
## 4 1 1 0 2
## 1 1 0 1 2
## 17 1 1 1 1
## 2 1 1 0 2
## 3 1 0 1 2
## 1 0 1 0 3
## 1 1 1 1 2
## 18 1 1 1 1
## 2 1 1 0 2
## 3 1 0 1 2
## 1 0 1 0 3
## 2 1 1 1 2
## 15 1 1 1 1
## 4 1 1 0 2
## 1 1 0 1 2
## 1 1 1 1 2
## 1 1 0 1 3
## 3 1 1 1 2
## 1 1 0 1 3
## 27 1 1 1 1
## 2 1 1 1 2
## 1 1 1 1 3
## 5 1 1 1 1
## 10 1 1 0 2
## 1 1 0 0 3
## 1 0 1 1 2
## 3 0 1 0 3
## 9 1 1 1 1
## 2 1 1 0 2
## 1 1 1 1 2
## 1 1 1 1 2
## 10 1 1 1 1
## 1 1 1 0 2
## 2 1 0 1 2
## 1 0 1 0 3
## 1 1 1 1 2
## 1 1 1 1 2
## 7 1 1 1 2
## 1 1 0 1 3
## 9 1 1 1 1
## 1 1 1 0 2
## 2 1 1 1 2
## 1 1 0 1 3
## 1 1 1 1 3
## 2 1 1 1 2
## 1 1 1 1 4
## 1 1 1 1 2
## 1 1 0 1 3
## 1 1 0 0 4
## 1 1 1 0 2
## 2 0 1 0 3
## 1 1 1 0 4
## 1 1 1 0 3
## 1 1 0 0 4
## 3 1 1 1 1
## 2 0 1 0 3
## 1 1 1 0 3
## 4 1 1 1 2
## 2 0 1 0 4
## 1 1 1 1 3
## 1 1 0 0 5
## 1 1 0 1 4
## 3 1 1 1 2
## 1 1 1 0 3
## 4 1 1 0 2
## 1 1 0 0 3
## 2 0 1 0 3
## 1 0 1 0 4
## 1 0 1 0 6
## 7 1 1 1 1
## 1 1 0 0 3
## 1 1 1 1 2
## 1 1 1 1 2
## 1 0 1 0 4
## 1 0 1 0 4
## 1 1 1 1 1
## 4 1 1 0 2
## 1 1 1 0 4
## 2 1 1 0 3
## 2 1 1 0 3
## 9 1 1 1 3
## 1 1 1 1 4
## 1 1 1 1 4
## 1 1 1 1 1
## 1 1 1 1 2
## 1 1 1 1 3
## 1 1 1 1 2
## 2 1 1 1 3
## 1 1 1 1 6
## 1 1 1 1 5
## 1 1 1 1 3
## 1 1 1 0 6
## 3 1 1 1 1
## 1 1 1 0 2
## 1 1 1 1 2
## 1 1 1 0 5
## 1 0 1 0 7
## 2 1 1 1 1
## 2 1 1 1 1
## 1 0 1 0 6
## 1 0 1 0 8
## 1 1 0 0 4
## 57 120 212 823
##
## Variables sorted by number of missings:
## Variable Count
## MFR 0.0824581875
## Brand.Code 0.0466744457
## Filler.Speed 0.0221703617
## PC.Volume 0.0151691949
## PSC.CO2 0.0151691949
## Fill.Ounces 0.0147802412
## PSC 0.0128354726
## Carb.Pressure1 0.0124465189
## Hyd.Pressure4 0.0116686114
## Carb.Pressure 0.0105017503
## Carb.Temp 0.0101127966
## PSC.Fill 0.0089459354
## Fill.Pressure 0.0085569817
## Filler.Level 0.0077790743
## Hyd.Pressure2 0.0058343057
## Hyd.Pressure3 0.0058343057
## Temperature 0.0054453520
## Oxygen.Filler 0.0046674446
## Pressure.Setpoint 0.0046674446
## Hyd.Pressure1 0.0042784909
## Carb.Volume 0.0038895371
## Usage.cont 0.0019447686
## PH 0.0015558149
## Mnf.Flow 0.0007779074
## Carb.Flow 0.0007779074
## Bowl.Setpoint 0.0007779074
## Balling.Lvl 0.0003889537
## Pressure.Vacuum 0.0000000000
## Air.Pressurer 0.0000000000
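The per-variable figures in the listing above are proportions of rows rather than raw counts (for MFR, 212 missing out of 2571 rows is about 0.0825). A minimal base R sketch of the same calculation on a made-up toy data frame:

```r
# Toy data frame with a few missing values (made up).
toy <- data.frame(
  MFR = c(725.0, NA, 735.0, NA),
  PSC = c(0.104, 0.124, 0.090, NA),
  PH  = c(8.36, 8.26, 8.94, 8.24)
)

# Share of missing values per column, largest first.
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
miss_share

# Number of rows with at least one missing value.
rows_incomplete <- sum(!complete.cases(toy))
rows_incomplete
```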
# Make Brand.Code a factor
bev.new$Brand.Code <- factor(bev.new$Brand.Code)
# Remove rows with a missing response; they are not suitable for model training
bev.new <- subset(bev.new, !is.na(PH))
# Remove PH from the imputation dataset so that it won't influence the imputation algorithm and bias the model test
myvars <- names(bev.new) %in% c("PH")
bev.imp <- bev.new[!myvars]
summary(bev.imp)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## A : 293 Min. :5.040 Min. :23.63 Min. :0.07933
## B :1235 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## C : 304 Median :5.347 Median :23.97 Median :0.27133
## D : 615 Mean :5.370 Mean :23.97 Mean :0.27724
## NA's: 120 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08464 Mean :0.1953
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :0.0000 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:0.0000 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median :0.0000 Median :123.2 Median :46.40
## Mean :0.05644 Mean :0.4608 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.:1.0000 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. :1.0000 Max. :140.2 Max. :60.40
## NA's :39 NA's :32 NA's :18
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :0.0000 Min. :0.0000 Min. :-1.20 Min. : 62.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 86.00
## Median :0.0000 Median :0.0000 Median :27.60 Median : 96.00
## Mean :0.3427 Mean :0.3041 Mean :20.48 Mean : 96.31
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :1.0000 Max. :1.0000 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :28
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 55.8 Min. :0.00000 Min. :63.60 Min. :12.08
## 1st Qu.: 98.3 1st Qu.:0.00000 1st Qu.:65.20 1st Qu.:18.36
## Median :118.4 Median :0.00000 Median :65.60 Median :21.79
## Mean :109.3 Mean :0.08516 Mean :65.96 Mean :20.99
## 3rd Qu.:120.0 3rd Qu.:0.00000 3rd Qu.:66.40 3rd Qu.:23.75
## Max. :161.2 Max. :1.00000 Max. :76.20 Max. :25.90
## NA's :16 NA's :54 NA's :12 NA's :5
## Carb.Flow MFR Pressure.Vacuum Oxygen.Filler
## Min. :0.0000 Min. : 31.4 Min. :-6.600 Min. :0.00240
## 1st Qu.:0.0000 1st Qu.:706.3 1st Qu.:-5.600 1st Qu.:0.02200
## Median :0.0000 Median :724.0 Median :-5.400 Median :0.03340
## Mean :0.2943 Mean :704.0 Mean :-5.216 Mean :0.04643
## 3rd Qu.:1.0000 3rd Qu.:731.0 3rd Qu.:-5.000 3rd Qu.:0.06000
## Max. :1.0000 Max. :868.6 Max. :-3.600 Max. :0.40000
## NA's :2 NA's :208 NA's :11
## Bowl.Setpoint Pressure.Setpoint Air.Pressurer Balling.Lvl
## Min. : 70.0 Min. :44.00 Min. :140.8 Min. :0.000
## 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:1.380
## Median :120.0 Median :46.00 Median :142.6 Median :1.480
## Mean :109.3 Mean :47.61 Mean :142.8 Mean :2.052
## 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:3.140
## Max. :140.0 Max. :52.00 Max. :148.2 Max. :3.660
## NA's :2 NA's :12 NA's :1
# Use missForest to impute because it does not need the response (PH). We do this to avoid bias when we impute the test set
library(missForest)
bev.imp.missForest <- missForest(bev.imp)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
## missForest iteration 7 in progress...done!
## missForest iteration 8 in progress...done!
## missForest iteration 9 in progress...done!
bev.imp.missForest <- bev.imp.missForest$ximp
#add back the PH variable to the data frame
bev.imp.missForest$PH <- bev.new$PH
summary(bev.imp.missForest)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## A: 294 Min. :5.040 Min. :23.63 Min. :0.07933
## B:1296 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933
## C: 361 Median :5.347 Median :23.97 Median :0.27200
## D: 616 Mean :5.370 Mean :23.97 Mean :0.27781
## 3rd Qu.:5.457 3rd Qu.:24.03 3rd Qu.:0.31267
## Max. :5.700 Max. :24.32 Max. :0.47800
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.05000 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07800 Median :0.1800
## Mean :68.21 Mean :141.1 Mean :0.08483 Mean :0.1958
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :0.0000 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:0.0000 1st Qu.:118.8 1st Qu.:46.00
## Median :0.04000 Median :0.0000 Median :123.2 Median :46.40
## Mean :0.05662 Mean :0.4608 Mean :122.5 Mean :47.91
## 3rd Qu.:0.08000 3rd Qu.:1.0000 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. :1.0000 Max. :140.2 Max. :60.40
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :0.0000 Min. :0.0000 Min. :-1.20 Min. : 62.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 86.00
## Median :0.0000 Median :0.0000 Median :27.60 Median : 96.00
## Mean :0.3413 Mean :0.3035 Mean :20.48 Mean : 96.51
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:33.20 3rd Qu.:102.00
## Max. :1.0000 Max. :1.0000 Max. :50.00 Max. :142.00
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 55.8 Min. :0.0000 Min. :63.60 Min. :12.08
## 1st Qu.: 98.3 1st Qu.:0.0000 1st Qu.:65.20 1st Qu.:18.37
## Median :118.4 Median :0.0000 Median :65.60 Median :21.78
## Mean :109.2 Mean :0.1034 Mean :65.98 Mean :20.99
## 3rd Qu.:120.0 3rd Qu.:0.0000 3rd Qu.:66.40 3rd Qu.:23.74
## Max. :161.2 Max. :1.0000 Max. :76.20 Max. :25.90
## Carb.Flow MFR Pressure.Vacuum Oxygen.Filler
## Min. :0.0000 Min. : 31.4 Min. :-6.600 Min. :0.00240
## 1st Qu.:0.0000 1st Qu.:695.0 1st Qu.:-5.600 1st Qu.:0.02200
## Median :0.0000 Median :721.4 Median :-5.400 Median :0.03340
## Mean :0.2945 Mean :677.5 Mean :-5.216 Mean :0.04668
## 3rd Qu.:1.0000 3rd Qu.:730.4 3rd Qu.:-5.000 3rd Qu.:0.06000
## Max. :1.0000 Max. :868.6 Max. :-3.600 Max. :0.40000
## Bowl.Setpoint Pressure.Setpoint Air.Pressurer Balling.Lvl
## Min. : 70.0 Min. :44.00 Min. :140.8 Min. :0.000
## 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:1.380
## Median :120.0 Median :46.00 Median :142.6 Median :1.480
## Mean :109.3 Mean :47.61 Mean :142.8 Mean :2.051
## 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:3.140
## Max. :140.0 Max. :52.00 Max. :148.2 Max. :3.660
## PH
## Min. :7.880
## 1st Qu.:8.440
## Median :8.540
## Mean :8.546
## 3rd Qu.:8.680
## Max. :9.360
#bev.imp.missForest <- rfImpute(PH ~ ., bev)
#create new numeric labels for brand code
#student_df_missForest$`BrandCode_num` <- as.numeric(factor( student_df_missForest$`Brand Code`))
#bev.imp.missForest$`Brand Code`[bev.imp.missForest$`Brand Code` == ""] <- "U"
#bev.imp <- mice(bev, m =3, maxit =3, print = FALSE, seed = 234)
#densityplot(bev.imp.missForest)
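Using missForest to impute took much longer than rfImpute, but it works better for our purposes. Initially, we wanted to convert our response variable to a categorical one, but we decided against it because it would lead to a loss of information. Next, let’s check whether we have zero-variance or near-zero-variance variables. A zero-variance variable takes a single unique value. A near-zero-variance variable has few unique values relative to the sample size and is dominated by one value: with caret's default cutoffs, it is flagged when the percentage of unique values is no more than 10% and the ratio of the most common value to the second most common value exceeds 95/5.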
## 'data.frame': 29 obs. of 4 variables:
## $ freqRatio : num 2.1 1.04 1.16 1.1 ...
## $ percentUnique: num 0.156 4.324 5.064 19.205 ...
## $ zeroVar : logi FALSE FALSE FALSE FALSE ...
## $ nzv : logi FALSE FALSE FALSE FALSE ...
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
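The two metrics in the output above can be reproduced by hand. The following is a minimal base-R sketch of the frequency-ratio and percent-unique checks, assuming caret's default cutoffs (95/5 and 10); `nzv_metrics` is a hypothetical helper written for illustration, not a caret function.

```r
# Hypothetical helper: frequency ratio and percent unique for one vector
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  # ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

nzv_metrics(c(rep(0, 99), 1))  # one dominant value, few unique values
nzv_metrics(1:100)             # every value unique
```

A column is only flagged when both conditions hold, which is why a binary indicator with a reasonable balance between 0 and 1 is not reported as near-zero variance.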
We notice that no variable is flagged as near-zero variance (nzv), so we move on to transforming and splitting our dataset. We mentioned earlier that a few variables exhibited skewness, so we apply a Box-Cox transformation to them (PSC, PSC.Fill, PSC.CO2 and Oxygen.Filler). Because PSC.Fill and PSC.CO2 contain zeros and Box-Cox requires strictly positive values, we first add a small offset.
#lambda <- BoxCox.lambda(bev.imp.missForest)
#bev.boxcox <- BoxCox(bev.imp.missForest, lambda)
library(forecast)
bev.boxcox <- bev.imp.missForest
offset <- .0000001
bev.boxcox$PSC.Fill <- bev.boxcox$PSC.Fill + offset
bev.boxcox$PSC.CO2 <- bev.boxcox$PSC.CO2 + offset
#psc.boxcox <- boxcox(bev.boxcox$PSC ~ 1, lamda = seq(-6, 6, .1))
#pscfill.boxcox <- boxcox(bev.boxcox$PSC.Fill ~ 1, lambda = seq(-6, 6, 0.1))
#psccos.boxcox <- boxcox(bev.boxcox$PSC.CO2 ~ 1, lambda = seq(-6, 6, 0.1))
#oxygenfiller.boxcox <- boxcox(bev.boxcox$Oxygen.Filler ~ 1, lambda = seq(-6, 6, .1))
#bc1 <- data.frame(psc.boxcox$x, psc.boxcox$y)
#bc2 <- bc1[with(bc1, order(-bc1$psc.boxcox.y)),]
#bc2[1,]
#bc3 <- data.frame(pscfill.boxcox$x, pscfill.boxcox$y)
#bc4 <- bc3[with(bc3, order(-bc3$pscfill.boxcox.y)),]
#bc4[1,]
#bc5 <- data.frame(psccos.boxcox$x, psccos.boxcox$y)
#bc6 <- bc5[with(bc5, order(-bc5$psccos.boxcox.y)),]
#bc6[1,]
#bc7 <- data.frame(oxygenfiller.boxcox$x, oxygenfiller.boxcox$y)
#bc8 <- bc7[with(bc7, order(-bc7$oxygenfiller.boxcox.y)),]
#bc8[1,]
# to find optimal lambda
lambda1 <- BoxCox.lambda(bev.boxcox$PSC.Fill)
lambda2 <- BoxCox.lambda(bev.boxcox$PSC.CO2)
lambda3 <- BoxCox.lambda(bev.boxcox$Oxygen.Filler)
lambda4 <- BoxCox.lambda(bev.boxcox$PSC)
# now to transform vector
trans.vector1 = BoxCox(bev.boxcox$PSC.Fill, lambda1)
bev.boxcox$PSC.Fill <- trans.vector1
trans.vector2 = BoxCox(bev.boxcox$PSC.CO2, lambda2)
bev.boxcox$PSC.CO2 <- trans.vector2
trans.vector3 = BoxCox(bev.boxcox$Oxygen.Filler, lambda3)
bev.boxcox$Oxygen.Filler <- trans.vector3
trans.vector4 = BoxCox(bev.boxcox$PSC, lambda4)
bev.boxcox$PSC <- trans.vector4
DataExplorer::plot_histogram(bev.boxcox, nrow = 3L, ncol = 4L)
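To make the role of the offset concrete, here is a minimal base-R sketch of the Box-Cox transform for strictly positive data. `box_cox` is a hypothetical helper written for illustration; it is not the `forecast::BoxCox` implementation, and the sample values are made up.

```r
# Hypothetical helper: Box-Cox transform for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x_raw <- c(0, 0.02, 0.04, 0.24)  # illustrative values, including a zero
offset <- 1e-7
x_shifted <- x_raw + offset      # shift so every value is strictly positive

box_cox(x_shifted, 0.5)
box_cox(x_shifted, 0)            # lambda = 0 is the log transform
```

Without the shift, the zero would fail the positivity check (and `log(0)` would be `-Inf` for lambda = 0). The size of the offset is a choice: a very small value pushes the former zeros far into the tail for lambda near zero.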
Now that we have completed transforming our dataset, we will split the training data that we were given. We hold out 25% of the rows as a test set and also build separate predictor (X) and response (Y) objects, because some models are fit with a formula and others with x/y inputs.
#set.seed(123)
#myvars <- names(bev.boxcox) %in% c("Brand.Code")
#bev.boxcox2<- bev.boxcox[, !myvars]
## 75% of the sample size
smp_size <- floor(0.75 * nrow(bev.boxcox))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(bev.boxcox)), size = smp_size)
bev.train <- bev.boxcox[train_ind, ]
bev.test <- bev.boxcox[-train_ind, ]
bev.trainX <- bev.train[, !names(bev.train) %in% "PH"]
bev.trainY <- bev.train[, "PH"]
bev.testX <- bev.test[, !names(bev.test) %in% "PH"]
bev.testY <- bev.test[, "PH"]
ctrl <- trainControl(method = "cv", number = 10)
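The control object above requests 10-fold cross-validation. As a rough illustration of what that resampling does, the sketch below runs k-fold cross-validation by hand in base R. It uses the built-in `mtcars` data and a plain linear model as stand-ins (both are assumptions for the example, not part of this analysis) and 5 folds to keep the folds a sensible size.

```r
set.seed(42)
k <- 5
n <- nrow(mtcars)
# assign each row to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- vapply(seq_len(k), function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[fold_id != i, ])   # fit on k - 1 folds
  pred <- predict(fit, newdata = mtcars[fold_id == i, ])     # predict held-out fold
  sqrt(mean((mtcars$mpg[fold_id == i] - pred)^2))
}, numeric(1))

cv_rmse <- mean(fold_rmse)  # average RMSE across folds
cv_rmse
```

Each row is held out exactly once, so the averaged RMSE estimates out-of-sample error without touching the separate 25% test set.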
GLM Model
GLM, or generalized linear models, formulated by John Nelder and Robert Wedderburn, are “a flexible generalization of an ordinary linear regression model” that allows the linear model to be related to the response variable via a link function. It was initially formulated as a way of unifying various models such as linear, logistic, and Poisson regression. It also allows the response to follow a non-normal error distribution.
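For a continuous response such as PH, a GLM with a Gaussian family and identity link reduces to ordinary least squares. The following base-R sketch checks this on simulated data; the data and coefficients are made up for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- 8.5 + 0.1 * x + rnorm(100, sd = 0.15)  # simulated response

fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))
fit_lm  <- lm(y ~ x)

# the two fits agree up to numerical tolerance
all.equal(coef(fit_glm), coef(fit_lm))
```

The link function only changes the fit when a different family or link is chosen, for example a binomial family with a logit link for a binary outcome.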
library(tictoc)
set.seed(456)
tic()
glm.model <- train(PH ~., data = bev.train, metric = "RMSE", method = "glm", preProcess = c("center", "scale", "BoxCox"), trControl = ctrl)
glm.predict <- predict(glm.model, newdata = bev.test)
pre.eval <- data.frame(obs = bev.testY, pred = glm.predict)
glm.results <- data.frame(defaultSummary(pre.eval))
glm.rmse <- glm.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 4.26 sec elapsed
glm.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the GLM model is ", glm.rmse)
## [1] "The RMSE value for the GLM model is 0.132278333770062"
GLMNET MODEL
GLMNET fits elastic net regression. Unlike GLM, there is a penalty term associated with this model. Elastic net is a regularized regression method that combines the L1 and L2 penalties of lasso and ridge.
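As a small illustration of how the two penalties are mixed, the sketch below computes the elastic net penalty in the parameterization used by glmnet, where `alpha = 1` gives the lasso and `alpha = 0` gives ridge. `enet_penalty` is a hypothetical helper and the coefficient vector is made up.

```r
# Hypothetical helper: elastic net penalty for a coefficient vector
enet_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))  # lasso part
  l2 <- sum(beta^2)     # ridge part
  lambda * ((1 - alpha) / 2 * l2 + alpha * l1)
}

beta <- c(1, -2, 0)
enet_penalty(beta, lambda = 0.1, alpha = 1)    # pure lasso
enet_penalty(beta, lambda = 0.1, alpha = 0)    # pure ridge
enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # equal mix
```

When caret tunes `glmnet`, it searches over both `alpha` and `lambda`, so the chosen model may sit anywhere between the two extremes.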
set.seed(789)
tic()
glmnet.model <- train(PH ~., data = bev.train, metric = "RMSE", method = "glmnet", preProcess = c("center", "scale", "BoxCox"), trControl = ctrl)
glmnet.predict <- predict(glmnet.model, newdata = bev.test)
pre.eval2 <- data.frame(obs = bev.testY, pred = glmnet.predict)
glmnet.results <- data.frame(defaultSummary(pre.eval2))
glmnet.rmse <- glmnet.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 9.95 sec elapsed
glmnet.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the GLMNET model is ", glmnet.rmse)
## [1] "The RMSE value for the GLMNET model is 0.132188463834013"
Partial Least Squares(PLS)
We will next try a partial least squares (PLS) regression model. PLS is typically used when we have more predictors than observations, although that is not the case in our current situation. PLS is a dimension reduction technique similar to PCA. Our predictors are mapped to a smaller set of variables, and within that space we perform a regression against our response variable. It aims to choose new mapped variables that maximally explain the outcome variable.
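The difference from PCA can be seen in how the first component is chosen. The base-R sketch below, on simulated data, builds the first PLS direction for a single response (proportional to X'y) and compares it with the first principal component, which ignores the response. All names and data here are illustrative assumptions.

```r
set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 4), nrow = n))  # centered, scaled predictors
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n, sd = 0.1)
y <- y - mean(y)

# first PLS weight vector: proportional to X'y, normalised to unit length
w_pls <- drop(crossprod(X, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))
t_pls <- drop(X %*% w_pls)

# first principal component direction: chosen without looking at y
w_pca <- prcomp(X)$rotation[, 1]
t_pca <- drop(X %*% w_pca)

c(pls = abs(cov(t_pls, y)), pca = abs(cov(t_pca, y)))
```

Among all unit-length directions, the PLS direction gives the score with the largest covariance with the response, so its value is never smaller than the PCA one.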
library(pls)
#model <- plsr(PH ~., data = bev.train, validation = "CV")
#cv <- RMSEP(model)
#best.dims <- which.min(cv$val[estimate = "adjCV", , ]) - 1
#model <- plsr(PH ~., data = bev.train, ncomp = best.dims)
#model
set.seed(654)
tic()
# caret's argument is tuneLength (capital L); a lowercase tunelength is not used for tuning
pls.bev <- train(PH ~., data = bev.train, metric = "RMSE", method = "pls", tuneLength = 15, preProcess = c("center", "scale", "BoxCox"), trControl = ctrl)
pls.pred <- predict(pls.bev, bev.test)
pre.eval3 <- data.frame(obs = bev.testY, pred = pls.pred)
pls.results <- data.frame(defaultSummary(pre.eval3))
pls.rmse <- pls.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 3.52 sec elapsed
pls.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the PLS model is ", pls.rmse)
## [1] "The RMSE value for the PLS model is 0.134909872665916"
Random Forest
We will now try a random forest model. Random forest is flexible and relatively easy to use, and it can be used for both classification and regression. It builds many decision trees on bootstrap samples of the data and averages their predictions to get a more accurate and stable result.
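The main tuning parameter of a random forest is `mtry`, the number of predictors tried at each split. As a quick reference, the sketch below computes the defaults used by the randomForest package for regression and classification, assuming 30 predictor columns after dummy coding of Brand.Code (the count reported in the variable-importance output).

```r
p <- 30  # assumed number of predictor columns after dummy coding

mtry_regression     <- max(floor(p / 3), 1)  # randomForest default for regression
mtry_classification <- floor(sqrt(p))        # randomForest default for classification

c(regression = mtry_regression, classification = mtry_classification)
```

Since PH is continuous, the square-root rule is the classification default rather than the regression one; a tuning search over `mtry` would normally cover values around both.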
ctrl2 <- trainControl(method = "repeatedcv", number = 5, repeats = 2, search = "random", allowParallel = TRUE)
mtry <- sqrt(ncol(bev.train))
set.seed(321)
tic()
# caret's argument is tuneLength (capital L); a lowercase tunelength is not used for tuning
ranfor.bev <- train(PH ~., data = bev.train, metric = "RMSE", method = "rf", tuneLength = 5, trControl = ctrl2, importance = TRUE)
rf.Pred <- predict(ranfor.bev, newdata = bev.test)
rf.results <- data.frame(postResample(pred = rf.Pred, obs = bev.test$PH))
rf.rmse <- rf.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 364.73 sec elapsed
rf.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the Random Forest model is ", rf.rmse)
## [1] "The RMSE value for the Random Forest model is 0.106174329080017"
## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## Mnf.Flow 100.00
## Pressure.Vacuum 91.19
## Oxygen.Filler 88.77
## Air.Pressurer 85.40
## Brand.CodeC 81.29
## Usage.cont 75.63
## Balling.Lvl 71.37
## Temperature 63.04
## Carb.Pressure1 61.65
## Hyd.Pressure3 50.49
## Bowl.Setpoint 49.30
## Carb.Volume 47.08
## Brand.CodeD 45.17
## MFR 42.21
## Filler.Level 42.02
## PC.Volume 33.69
## Hyd.Pressure4 33.36
## Carb.Flow 31.37
## Fill.Pressure 29.68
## Brand.CodeB 27.55
From the random forest model, we see that the top 5 most important variables are:
- Mnf.Flow
- Pressure.Vacuum
- Oxygen.Filler
- Air.Pressurer
- Brand.CodeC
XGBoost Model
We decided to try the Extreme Gradient Boosting (XGBoost) model because of its high accuracy on regression problems and because it allows optimization of an arbitrary differentiable loss function. XGBoost accepts only numerical predictors, so let’s convert Brand.Code to numeric.
bev.trainX_num <- bev.trainX
bev.testX_num <- bev.testX
bev.trainX_num$Brand.Code <- as.numeric(bev.trainX_num$Brand.Code)
bev.testX_num$Brand.Code <- as.numeric(bev.testX_num$Brand.Code)
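Converting a factor with `as.numeric()` replaces each level with its integer code, which imposes an ordering (A < B < C < D) on the brands. A common alternative is one-hot encoding. The base-R sketch below shows both on a made-up factor; it is an illustration of the two encodings, not the encoding used in this report.

```r
brand <- factor(c("A", "B", "C", "D", "B"))  # illustrative brand codes

# integer coding: one column, levels mapped to 1..4
brand_int <- as.numeric(brand)

# one-hot coding: one 0/1 indicator column per level
brand_onehot <- model.matrix(~ brand - 1)

brand_int
brand_onehot
```

Tree-based learners can often cope with integer codes because they split on thresholds, but one-hot columns avoid implying that the brands are ordered.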
tuneGrid <- expand.grid(.nrounds=c(10,20,50), # boosting iterations (trees)
.max_depth=c(6, 10, 20), # max tree depth
.eta=c(0.3, 0.01, 0.1), # learning rate
.gamma=c(0, 5), # minimum loss reduction
.colsample_bytree=c(1, 0.5), # subsample ratio of columns
.min_child_weight=c(1, 5), # minimum sum of instance weight
.subsample=c(0.1, 0.5)) # subsample ratio of rows
set.seed(1)
tic()
bst <- train(x = bev.trainX_num,
y = bev.trainY,
method = 'xgbTree',
tuneGrid = tuneGrid,
trControl = trainControl(method='cv'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 312 50 6 0.3 0 1 5
## subsample
## 312 0.5
## ##### xgb.Booster
## raw: 95.3 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "reg:linear")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", min_child_weight = "5", subsample = "0.5", objective = "reg:linear", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 28
## niter: 50
## nfeatures : 28
## xNames : Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont Carb.Flow MFR Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Balling.Lvl
## problemType : Regression
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 312 50 6 0.3 0 1 5
## subsample
## 312 0.5
## obsLevels : NA
## param :
## list()
xgboostTunePred <- predict(bst, newdata = bev.testX_num)
xgboost.results <- data.frame(postResample(pred =xgboostTunePred, obs = bev.testY))
xgboost.rmse <- xgboost.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 406.02 sec elapsed
xgboost.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the XGBOOST model is ", xgboost.rmse)
## [1] "The RMSE value for the XGBOOST model is 0.132254803026915"
For the XGBoost model, the most important predictors are:
1. Mnf.Flow
2. Usage.cont
3. Carb.Flow
4. Oxygen.Filler
5. Carb.Rel
MARS model
We decided to try a MARS (multivariate adaptive regression splines) model because it can predict the values of a continuous outcome variable from a set of predictor variables. We chose MARS because it is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables, and in this case it was not clear whether the relationship was linear. It works even in situations where the relationship between the predictors and the dependent variable is non-monotone and difficult to approximate with parametric models.
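MARS builds its fit from hinge functions, which are zero on one side of a knot and linear on the other. The base-R sketch below shows one hinge term recovering a relationship that is flat up to a knot and then rises; the knot, data, and `hinge` helper are made up for illustration, and in a real MARS fit the knots are chosen by the algorithm.

```r
# Hypothetical helper: right-hand hinge function max(0, x - knot)
hinge <- function(x, knot) pmax(0, x - knot)

x <- seq(0, 10, by = 0.5)
y <- 2 + 3 * hinge(x, 4)   # flat at 2 until x = 4, then slope 3

# a linear model on the hinge term recovers the intercept and slope
fit <- lm(y ~ hinge(x, 4))
coef(fit)
```

With `degree = 2`, as allowed in the tuning grid, the model can also multiply pairs of hinge terms to capture interactions between predictors.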
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
tic()
MarsModel <- train(x = bev.trainX,
y = bev.train$PH,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method='cv'))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
## Loading required package: TeachingDemos
## nprune degree
## 65 29 2
MarsModelTunePred <- predict(MarsModel, newdata = bev.testX)
mars.results <- data.frame(postResample(pred =MarsModelTunePred, obs = bev.test$PH))
mars.rmse <- mars.results[1, 1]
# toc() pops the timer, so it is called once and its result is kept
exectime <- toc()
## 127.36 sec elapsed
mars.time <- exectime$toc - exectime$tic
paste0("The RMSE value for the MARS model is ", mars.rmse)
## [1] "The RMSE value for the MARS model is 0.123575829776953"
For the MARS model, the most important predictors are:
1. Mnf.Flow
2. Brand.Code
3. Air.Pressurer
4. Alch.Rel
5. Bowl.Setpoint
| glm.rmse | glmnet.rmse | pls.rmse | rf.rmse | xgboost.rmse | mars.rmse |
|---|---|---|---|---|---|
| 0.1322783 | 0.1321885 | 0.1349099 | 0.1061743 | 0.1322548 | 0.1235758 |

We see that the random forest model has the best RMSE at 0.106. The model that performed best after the random forest was the MARS model at 0.124. We also timed each of our models, and the model with the best time was PLS at 3.52 seconds, while the random forest (364.73 seconds) and XGBoost (406.02 seconds) were the slowest.
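The comparison can also be done programmatically. The sketch below collects the RMSE values and elapsed times reported in the sections above into one data frame and ranks the models; the data frame layout and column names are choices made for this example.

```r
# test-set RMSE and elapsed seconds as reported for each model above
results <- data.frame(
  model   = c("GLM", "GLMNET", "PLS", "Random Forest", "XGBoost", "MARS"),
  rmse    = c(0.1322783, 0.1321885, 0.1349099, 0.1061743, 0.1322548, 0.1235758),
  seconds = c(4.26, 9.95, 3.52, 364.73, 406.02, 127.36),
  stringsAsFactors = FALSE
)

results <- results[order(results$rmse), ]  # lowest RMSE first
best_rmse <- results$model[1]
fastest   <- results$model[which.min(results$seconds)]

results
```

Keeping accuracy and runtime side by side makes the trade-off explicit: the most accurate models here are also among the slowest to train.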
Model Testing
Preprocess test set by imputing missing values.
Test_set_bev$`Brand.Code` <- factor(Test_set_bev$`Brand.Code`)
set.seed(123)
myvars <- names(Test_set_bev) %in% c("PH")
Test_set_bev.missForest <- Test_set_bev[, !myvars]
summary(Test_set_bev.missForest)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## A : 35 Min. :5.147 Min. :23.75 Min. :0.09867
## B :129 1st Qu.:5.287 1st Qu.:23.92 1st Qu.:0.23333
## C : 31 Median :5.340 Median :23.97 Median :0.27533
## D : 64 Mean :5.369 Mean :23.97 Mean :0.27769
## NA's: 8 3rd Qu.:5.465 3rd Qu.:24.01 3rd Qu.:0.32200
## Max. :5.667 Max. :24.20 Max. :0.46400
## NA's :1 NA's :6 NA's :4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.00400 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.04450 1st Qu.:0.1000
## Median :68.00 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.08545 Mean :0.1903
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :77.60 Max. :154.0 Max. :0.24600 Max. :0.6200
## NA's :1 NA's :5 NA's :3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05107 Mean : 21.03 Mean :123.0 Mean :48.14
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## NA's :5 NA's :4 NA's :2
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.70 Median : 98.00
## Mean : 12.01 Mean : 20.11 Mean : 19.61 Mean : 97.84
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## NA's :1 NA's :1 NA's :4
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90
## 1st Qu.:100.6 1st Qu.:3812 1st Qu.:65.40 1st Qu.:18.12
## Median :118.6 Median :3978 Median :65.80 Median :21.44
## Mean :110.3 Mean :3581 Mean :66.23 Mean :20.90
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60
## NA's :2 NA's :10 NA's :2 NA's :2
## Carb.Flow Density MFR Balling
## Min. : 0 Min. :0.060 Min. : 15.6 Min. :0.902
## 1st Qu.:1083 1st Qu.:0.920 1st Qu.:707.0 1st Qu.:1.498
## Median :3038 Median :0.980 Median :724.6 Median :1.648
## Mean :2409 Mean :1.177 Mean :697.8 Mean :2.203
## 3rd Qu.:3215 3rd Qu.:1.600 3rd Qu.:731.5 3rd Qu.:3.242
## Max. :3858 Max. :1.840 Max. :784.8 Max. :3.788
## NA's :1 NA's :31 NA's :1
## Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :-6.400 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:-5.600 1st Qu.:0.01960 1st Qu.:100.0 1st Qu.:46.00
## Median :-5.200 Median :0.03370 Median :120.0 Median :46.00
## Mean :-5.174 Mean :0.04666 Mean :109.6 Mean :47.73
## 3rd Qu.:-4.800 3rd Qu.:0.05440 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :-3.600 Max. :0.39800 Max. :130.0 Max. :52.00
## NA's :1 NA's :3 NA's :1 NA's :2
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.907 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## NA's :1 NA's :3 NA's :2
#make Brand code a factor
#Test_set_bev.imp <- mice(Test_set_bev, m =3, maxit =3, print = FALSE, seed = 234)
#Test_set_bev.imp.missForest <- rfImpute(PH ~ ., Test_set_bev)
#summary(Test_set_bev.imp[1]$data)
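Test_set_bev.missForest2 <- missForest(Test_set_bev.missForest)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
# keep the imputed data frame returned by missForest
Test_set_bev.imp <- Test_set_bev.missForest2$ximp
summary(Test_set_bev.imp)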
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## A: 35 Min. :5.147 Min. :23.75 Min. :0.09867
## B:135 1st Qu.:5.286 1st Qu.:23.92 1st Qu.:0.23433
## C: 33 Median :5.340 Median :23.97 Median :0.27600
## D: 64 Mean :5.368 Mean :23.97 Mean :0.27815
## 3rd Qu.:5.463 3rd Qu.:24.01 3rd Qu.:0.32233
## Max. :5.667 Max. :24.20 Max. :0.46400
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.0040 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.0460 1st Qu.:0.1100
## Median :68.00 Median :140.8 Median :0.0780 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.0858 Mean :0.1906
## 3rd Qu.:70.60 3rd Qu.:143.9 3rd Qu.:0.1120 3rd Qu.:0.2500
## Max. :77.60 Max. :154.0 Max. :0.2460 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05126 Mean : 21.03 Mean :123.0 Mean :48.12
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.60 Median : 98.00
## Mean : 12.01 Mean : 20.04 Mean : 19.54 Mean : 98.07
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90
## 1st Qu.:100.6 1st Qu.:3795 1st Qu.:65.40 1st Qu.:18.12
## Median :118.6 Median :3918 Median :65.80 Median :21.40
## Mean :110.4 Mean :3506 Mean :66.24 Mean :20.89
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60
## Carb.Flow Density MFR Balling
## Min. : 0 Min. :0.060 Min. : 15.6 Min. :0.902
## 1st Qu.:1083 1st Qu.:0.910 1st Qu.:687.2 1st Qu.:1.497
## Median :3038 Median :0.980 Median :720.4 Median :1.648
## Mean :2409 Mean :1.175 Mean :670.8 Mean :2.200
## 3rd Qu.:3215 3rd Qu.:1.600 3rd Qu.:730.7 3rd Qu.:3.242
## Max. :3858 Max. :1.840 Max. :784.8 Max. :3.788
## Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :-6.400 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:-5.600 1st Qu.:0.02070 1st Qu.:100.0 1st Qu.:46.00
## Median :-5.200 Median :0.03380 Median :120.0 Median :46.00
## Mean :-5.173 Mean :0.04724 Mean :109.6 Mean :47.73
## 3rd Qu.:-4.800 3rd Qu.:0.05710 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :-3.600 Max. :0.39800 Max. :130.0 Max. :52.00
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.904 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
We use the random forest model to predict PH because it has the lowest RMSE of all the models, although its runtime is about 6 minutes (364.73 seconds).
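Before predicting, it is worth confirming that the new data has the same predictors, on the same scale, as the data the model was trained on. The sketch below is one way to check this in base R. `check_predictors` is a hypothetical helper, and the two small data frames are made-up stand-ins; the pattern they mimic (a 0/1 column in training versus raw values in the new data) is the kind of mismatch the summaries above would reveal for columns such as Mnf.Flow.

```r
# Hypothetical helper: compare predictor names and numeric ranges
check_predictors <- function(train_x, new_x) {
  shared <- intersect(names(train_x), names(new_x))
  is_num <- vapply(shared, function(v) {
    is.numeric(train_x[[v]]) && is.numeric(new_x[[v]])
  }, logical(1))
  numeric_shared <- shared[is_num]
  # flag columns where new values fall outside the training range
  outside <- vapply(numeric_shared, function(v) {
    r <- range(train_x[[v]], na.rm = TRUE)
    any(new_x[[v]] < r[1] | new_x[[v]] > r[2], na.rm = TRUE)
  }, logical(1))
  list(
    missing_in_new = setdiff(names(train_x), names(new_x)),
    extra_in_new   = setdiff(names(new_x), names(train_x)),
    out_of_range   = numeric_shared[outside]
  )
}

train_demo <- data.frame(Mnf.Flow = c(0, 1, 0, 1), Temperature = c(65, 66, 67, 68))
new_demo   <- data.frame(Mnf.Flow = c(-100, 141.3), Temperature = c(65.4, 66.6),
                         Density = c(0.9, 1.6))

res <- check_predictors(train_demo, new_demo)
res
```

Any column listed under `out_of_range` or `extra_in_new` suggests that the transformations applied to the training data should be applied to the new data in the same way before calling `predict()`.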
library(xlsx)
Test_set_bev.imp$PH <- predict(ranfor.bev, newdata = Test_set_bev.imp)
summary(Test_set_bev.imp)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## A: 35 Min. :5.147 Min. :23.75 Min. :0.09867
## B:135 1st Qu.:5.286 1st Qu.:23.92 1st Qu.:0.23433
## C: 33 Median :5.340 Median :23.97 Median :0.27600
## D: 64 Mean :5.368 Mean :23.97 Mean :0.27815
## 3rd Qu.:5.463 3rd Qu.:24.01 3rd Qu.:0.32233
## Max. :5.667 Max. :24.20 Max. :0.46400
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :60.20 Min. :130.0 Min. :0.0040 Min. :0.0200
## 1st Qu.:65.30 1st Qu.:138.4 1st Qu.:0.0460 1st Qu.:0.1100
## Median :68.00 Median :140.8 Median :0.0780 Median :0.1800
## Mean :68.25 Mean :141.2 Mean :0.0858 Mean :0.1906
## 3rd Qu.:70.60 3rd Qu.:143.9 3rd Qu.:0.1120 3rd Qu.:0.2500
## Max. :77.60 Max. :154.0 Max. :0.2460 Max. :0.6200
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :113.0 Min. :37.80
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:120.2 1st Qu.:46.00
## Median :0.04000 Median : 0.20 Median :123.4 Median :47.80
## Mean :0.05126 Mean : 21.03 Mean :123.0 Mean :48.12
## 3rd Qu.:0.06000 3rd Qu.: 141.30 3rd Qu.:125.5 3rd Qu.:50.20
## Max. :0.24000 Max. : 220.40 Max. :136.0 Max. :60.20
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-50.00 Min. :-50.00 Min. :-50.00 Min. : 68.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 90.00
## Median : 10.40 Median : 26.80 Median : 27.60 Median : 98.00
## Mean : 12.01 Mean : 20.04 Mean : 19.54 Mean : 98.07
## 3rd Qu.: 20.40 3rd Qu.: 34.80 3rd Qu.: 33.00 3rd Qu.:104.00
## Max. : 50.00 Max. : 61.40 Max. : 49.20 Max. :140.00
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 69.2 Min. :1006 Min. :63.80 Min. :12.90
## 1st Qu.:100.6 1st Qu.:3795 1st Qu.:65.40 1st Qu.:18.12
## Median :118.6 Median :3918 Median :65.80 Median :21.40
## Mean :110.4 Mean :3506 Mean :66.24 Mean :20.89
## 3rd Qu.:120.2 3rd Qu.:3996 3rd Qu.:66.60 3rd Qu.:23.74
## Max. :153.2 Max. :4020 Max. :75.40 Max. :24.60
## Carb.Flow Density MFR Balling
## Min. : 0 Min. :0.060 Min. : 15.6 Min. :0.902
## 1st Qu.:1083 1st Qu.:0.910 1st Qu.:687.2 1st Qu.:1.497
## Median :3038 Median :0.980 Median :720.4 Median :1.648
## Mean :2409 Mean :1.175 Mean :670.8 Mean :2.200
## 3rd Qu.:3215 3rd Qu.:1.600 3rd Qu.:730.7 3rd Qu.:3.242
## Max. :3858 Max. :1.840 Max. :784.8 Max. :3.788
## Pressure.Vacuum Oxygen.Filler Bowl.Setpoint Pressure.Setpoint
## Min. :-6.400 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:-5.600 1st Qu.:0.02070 1st Qu.:100.0 1st Qu.:46.00
## Median :-5.200 Median :0.03380 Median :120.0 Median :46.00
## Mean :-5.173 Mean :0.04724 Mean :109.6 Mean :47.73
## 3rd Qu.:-4.800 3rd Qu.:0.05710 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :-3.600 Max. :0.39800 Max. :130.0 Max. :52.00
## Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl
## Min. :141.2 Min. :6.400 Min. :5.18 Min. :0.000
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.34 1st Qu.:1.380
## Median :142.6 Median :6.580 Median :5.40 Median :1.480
## Mean :142.8 Mean :6.904 Mean :5.44 Mean :2.051
## 3rd Qu.:142.8 3rd Qu.:7.180 3rd Qu.:5.56 3rd Qu.:3.080
## Max. :147.2 Max. :7.820 Max. :5.74 Max. :3.420
## PH
## Min. :8.360
## 1st Qu.:8.458
## Median :8.497
## Mean :8.498
## 3rd Qu.:8.529
## Max. :8.643