Predictive Analytics

Oluwakemi Omotunde

2021-02-11

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach.

##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04     -100          118.8          46.0             0
## 2     0.22    0.04     -100          121.6          46.0             0
## 3     0.34    0.16     -100          120.2          46.0             0
## 4     0.42    0.04     -100          115.2          46.4             0
## 5     0.16    0.12     -100          118.4          45.8             0
## 6     0.24    0.04     -100          119.6          45.6             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2         4002
## 2            NA            NA           106        118.6         3986
## 3            NA            NA            82        120.0         4020
## 4             0             0            92        117.8         4012
## 5             0             0            92        118.6         4010
## 6             0             0           116        120.2         4014
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      16.18      2932    0.88 725.0   1.398            -4.0 8.36
## 2        67.6      19.90      3144    0.92 726.8   1.498            -4.0 8.26
## 3        67.0      17.76      2914    1.58 735.0   3.142            -3.8 8.94
## 4        65.6      17.42      3062    1.54 730.6   3.042            -4.4 8.24
## 5        65.6      17.68      3054    1.54 722.8   3.042            -4.4 8.26
## 6        66.2      23.82      2948    1.52 738.8   2.992            -4.4 8.32
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           120              46.4         142.6     6.58     5.32
## 2         0.026           120              46.8         143.0     6.56     5.30
## 3         0.024           120              46.6         142.0     7.66     5.84
## 4         0.030           120              46.0         146.2     7.14     5.42
## 5         0.030           120              46.0         146.2     7.14     5.44
## 6         0.024           120              46.0         146.6     7.16     5.44
##   Balling.Lvl
## 1        1.48
## 2        1.56
## 3        3.28
## 4        3.04
## 5        3.04
## 6        3.02

EXCEL

The first thing we did was open the Excel files containing the training and test data, simply to get a general idea of what we were working with. We looked at the training data first. It has 1 response variable (PH) and 32 predictor variables (31 numeric plus the categorical Brand Code) across 2,571 observations. Using the filter function, we noticed immediately that the predictor “Brand Code” is missing in about 120 rows and that the response variable, PH, is missing in 4 rows. We also considered converting PH from a numeric to a categorical variable based on its value (i.e., basic, acidic, or neutral). Anything below 7 is acidic and anything above 7 is basic, but our data only ranges from about 7.88 to 9.36, so every observation falls in the basic range. Below are the summary statistics we obtained from our Excel dive.

Column.Name MIN MAX MEAN MEDIAN NA.S
Carb Volume 5.0400 5.700 5.370 5.347 10
Fill Ounces 23.6300 24.320 23.970 23.970 38
PC Volume 0.0790 0.478 0.277 0.271 39
Carb Pressure 57.0000 79.400 68.190 68.200 27
Carb Temp 128.6000 154.000 141.090 140.800 26
PSC 0.0000 0.270 0.080 0.080 33
PSC Fill 0.0000 0.620 0.200 0.180 23
PSC CO2 0.0000 0.240 0.060 0.040 39
Mnf Flow -100.2000 229.400 24.590 65.200 2
Carb Pressure 1 105.6000 140.200 122.590 123.200 32
Fill Pressure 34.6000 60.400 47.920 46.400 22
Hyd Pressure 1 -0.8000 58.000 12.440 11.400 11
Hyd Pressure 2 0.0000 59.400 20.960 28.600 15
Hyd Pressure 3 -1.2000 50.000 20.460 27.600 15
Hyd Pressure 4 52.0000 142.000 96.290 96.000 30
Filler Level 55.8000 161.200 109.250 118.400 20
Filler Speed 998.0000 4030.000 3687.200 3982.000 57
Temperature 63.6000 76.200 65.970 65.600 14
Usage Cont 12.0800 25.900 20.990 21.790 5
Carb Flow 26.0000 5104.000 2468.350 3028.000 2
Density 0.2400 1.920 1.170 0.980 1
MFR 31.4000 868.600 704.050 724.000 212
Balling -0.1700 4.012 2.200 1.650 1
Pressure Vacuum -6.6000 -3.600 -5.220 -5.400 0
PH 7.8800 9.360 8.550 8.540 4
Oxygen Filler 0.0024 0.400 0.047 0.030 12
Bowl Setpoint 70.0000 140.000 109.340 120.000 2
Pressure Setpoint 44.0000 52.000 47.620 46.000 12
Air Pressurer 140.8000 148.200 142.830 142.600 0
Alch Rel 5.2800 8.620 6.900 6.560 9
Carb Rel 4.9600 6.060 5.440 5.400 10
Balling Lvl 0.0000 3.660 2.050 1.480 1

Aside from the missing response values, quite a few predictor variables have missing values. MFR has the most, with 212, while some, like “Pressure Vacuum” and “Air Pressurer”, have none. We will impute the missing values for the predictor variables. A few variables may also contain outliers, judging by the gap between their minimum and maximum. One such variable is “Carb Flow”, with a minimum of 26 and a maximum of 5104. Another is MFR, which ranges from 31.4 to 868.6.
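The missing-value counts and min-to-max ranges discussed above could be reproduced in R along the following lines. This is only a sketch: the `bev` data frame and its values are made up for illustration, and only the column names come from the report.

```r
# Toy stand-in for the training data; column names mirror the report,
# but the values are invented for illustration.
bev <- data.frame(
  MFR       = c(725.0, NA, 735.0, 730.6, NA),
  Carb.Flow = c(2932, 3144, 26, 3062, 5104),
  PH        = c(8.36, 8.26, 8.94, NA, 8.26)
)

# Number of missing values per column, largest first
na.counts <- sort(colSums(is.na(bev)), decreasing = TRUE)
print(na.counts)

# Min-to-max spread per column, a crude first screen for possible outliers
spread <- sapply(bev, function(x) diff(range(x, na.rm = TRUE)))
print(spread)
```

A wide spread is only a hint. A histogram or boxplot is still needed to decide whether the extreme values are genuine outliers.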

vars n mean sd median trimmed mad min max range skew kurtosis se IQR Q0.25 Q0.75
Brand.Code* 1 2451 2.5063239 0.9956337 2.0000000 2.5079041 0.0000000 1.0000000 4.000 3.0000000 0.3818872 -1.0613288 0.0201107 2.0000000 2.0000000 4.000000
Carb.Volume 2 2561 5.3701978 0.1063852 5.3466667 5.3654840 0.1087240 5.0400000 5.700 0.6600000 0.3922121 -0.4669916 0.0021022 0.1600000 5.2933333 5.453333
Fill.Ounces 3 2533 23.9747546 0.0875299 23.9733333 23.9751390 0.0790720 23.6333333 24.320 0.6866667 -0.0215452 0.8624714 0.0017392 0.1066667 23.9200000 24.026667
PC.Volume 4 2532 0.2771187 0.0606953 0.2713333 0.2745818 0.0523852 0.0793333 0.478 0.3986667 0.3396269 0.6699690 0.0012062 0.0728333 0.2391667 0.312000
Carb.Pressure 5 2544 68.1895755 3.5382039 68.2000000 68.1212574 3.5582400 57.0000000 79.400 22.4000000 0.1822162 -0.0138046 0.0701495 5.0000000 65.6000000 70.600000
Carb.Temp 6 2545 141.0949234 4.0373861 140.8000000 140.9912617 3.8547600 128.6000000 154.000 25.4000000 0.2468280 0.2375822 0.0800307 5.4000000 138.4000000 143.800000
PSC 7 2538 0.0845737 0.0492690 0.0760000 0.0802746 0.0474432 0.0020000 0.270 0.2680000 0.8491445 0.6480498 0.0009780 0.0640000 0.0480000 0.112000
PSC.Fill 8 2548 0.1953689 0.1177817 0.1800000 0.1837059 0.1186080 0.0000000 0.620 0.6200000 0.9334450 0.7691466 0.0023333 0.1600000 0.1000000 0.260000
PSC.CO2 9 2532 0.0564139 0.0430387 0.0400000 0.0494965 0.0296520 0.0000000 0.240 0.2400000 1.7288937 3.7250025 0.0008553 0.0600000 0.0200000 0.080000
Mnf.Flow 10 2569 24.5689373 119.4811263 65.2000000 21.0679631 169.0164000 -100.2000000 229.400 329.6000000 0.0041430 -1.8697072 2.3573130 240.8000000 -100.0000000 140.800000
Carb.Pressure1 11 2539 122.5863726 4.7428819 123.2000000 122.5379242 4.4478000 105.6000000 140.200 34.6000000 0.0543587 0.1418265 0.0941263 6.4000000 119.0000000 125.400000
Fill.Pressure 12 2549 47.9221656 3.1775457 46.4000000 47.7071044 2.3721600 34.6000000 60.400 25.8000000 0.5471107 1.4067532 0.0629371 4.0000000 46.0000000 50.000000
Hyd.Pressure1 13 2560 12.4375781 12.4332538 11.4000000 10.8374023 16.9016400 -0.8000000 58.000 58.8000000 0.7798043 -0.1426463 0.2457338 20.2000000 0.0000000 20.200000
Hyd.Pressure2 14 2556 20.9610329 16.3863066 28.6000000 21.0519062 13.3434000 0.0000000 59.400 59.4000000 -0.3019570 -1.5592984 0.3241161 34.6000000 0.0000000 34.600000
Hyd.Pressure3 15 2556 20.4584507 15.9757236 27.6000000 20.5052786 13.9364400 -1.2000000 50.000 51.2000000 -0.3189061 -1.5745834 0.3159949 33.4000000 0.0000000 33.400000
Hyd.Pressure4 16 2541 96.2888627 13.1225594 96.0000000 95.4530251 11.8608000 52.0000000 142.000 90.0000000 0.5459786 0.6340041 0.2603252 16.0000000 86.0000000 102.000000
Filler.Level 17 2551 109.2523716 15.6984241 118.4000000 111.0417442 9.1921200 55.8000000 161.200 105.4000000 -0.8482847 0.0460488 0.3108142 21.7000000 98.3000000 120.000000
Filler.Speed 18 2514 3687.1988862 770.8200208 3982.0000000 3919.9870775 47.4432000 998.0000000 4030.000 3032.0000000 -2.8700359 6.7059692 15.3734149 110.0000000 3888.0000000 3998.000000
Temperature 19 2557 65.9675401 1.3827783 65.6000000 65.7986321 0.8895600 63.6000000 76.200 12.6000000 2.3869732 10.1612904 0.0273456 1.2000000 65.2000000 66.400000
Usage.cont 20 2566 20.9929618 2.9779364 21.7900000 21.2517819 3.1875900 12.0800000 25.900 13.8200000 -0.5353253 -1.0170230 0.0587878 5.3950000 18.3600000 23.755000
Carb.Flow 21 2569 2468.3542234 1073.6964743 3028.0000000 2601.1356344 326.1720000 26.0000000 5104.000 5078.0000000 -0.9877287 -0.5826893 21.1835857 2042.0000000 1144.0000000 3186.000000
Density 22 2570 1.1736498 0.3775269 0.9800000 1.1533463 0.1482600 0.2400000 1.920 1.6800000 0.5260149 -1.1992070 0.0074470 0.7200000 0.9000000 1.620000
MFR 23 2359 704.0492582 73.8983094 724.0000000 718.1566967 15.4190400 31.4000000 868.600 837.2000000 -5.0917729 30.4558939 1.5214950 24.7000000 706.3000000 731.000000
Balling 24 2570 2.1977696 0.9310914 1.6480000 2.1287189 0.3706500 -0.1700000 4.012 4.1820000 0.5939224 -1.3855651 0.0183665 1.7960000 1.4960000 3.292000
Pressure.Vacuum 25 2571 -5.2161027 0.5699933 -5.4000000 -5.2521147 0.5930400 -6.6000000 -3.600 3.0000000 0.5256608 -0.0313126 0.0112414 0.6000000 -5.6000000 -5.000000
PH 26 2567 8.5456486 0.1725162 8.5400000 8.5516788 0.1779120 7.8800000 9.360 1.4800000 -0.2906437 0.0644294 0.0034050 0.2400000 8.4400000 8.680000
Oxygen.Filler 27 2559 0.0468426 0.0466436 0.0334000 0.0388837 0.0249077 0.0024000 0.400 0.3976000 2.6603955 11.0882098 0.0009221 0.0380000 0.0220000 0.060000
Bowl.Setpoint 28 2569 109.3265862 15.3031541 120.0000000 111.3466213 0.0000000 70.0000000 140.000 70.0000000 -0.9743842 -0.0564212 0.3019249 20.0000000 100.0000000 120.000000
Pressure.Setpoint 29 2559 47.6153966 2.0390474 46.0000000 47.6026354 0.0000000 44.0000000 52.000 8.0000000 0.2031970 -1.6012622 0.0403081 4.0000000 46.0000000 50.000000
Air.Pressurer 30 2571 142.8339946 1.2119170 142.6000000 142.5812348 0.5930400 140.8000000 148.200 7.4000000 2.2521053 4.7336291 0.0239013 0.8000000 142.2000000 143.000000
Alch.Rel 31 2562 6.8974161 0.5052753 6.5600000 6.8384390 0.0593040 5.2800000 8.620 3.3400000 0.8836378 -0.8506221 0.0099825 0.7000000 6.5400000 7.240000
Carb.Rel 32 2561 5.4367825 0.1287183 5.4000000 5.4301318 0.1186080 4.9600000 6.060 1.1000000 0.5032472 -0.2949480 0.0025435 0.2000000 5.3400000 5.540000
Balling.Lvl 33 2570 2.0500078 0.8703089 1.4800000 1.9827237 0.2075640 0.0000000 3.660 3.6600000 0.5858456 -1.4858636 0.0171675 1.7600000 1.3800000 3.140000

The describe function from the psych package gives a more detailed summary, including skewness. Some variables are right-skewed (PSC CO2, PSC Fill, and Temperature), while others are left-skewed (Filler Speed, Carb Flow, and MFR). We will apply transformations later to address the skewness. First, let’s plot the predictors to explore them further.
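To make the sign convention concrete, here is a minimal moment-based skewness function in base R. It is a sketch rather than the exact estimator psych uses, so values may differ slightly from the describe output, and the two vectors are invented examples.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

right.tail <- c(1, 1, 1, 2, 10)   # one large value pulls the tail right
left.tail  <- -right.tail         # mirror image pulls the tail left

skewness(right.tail)   # positive
skewness(left.tail)    # negative
```

A positive value corresponds to the right-skewed variables named above, and a negative value to the left-skewed ones.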

Looking at the plots, a few things jump out immediately. Many of the variables do not appear to be normally distributed. A few have spikes that may be outliers, which we will explore further. A few of the distributions appear to be bimodal; we will create dummy variables to flag these. We will definitely need to do some pre-processing before fitting a model. We would also like to check for highly correlated predictors and remove them.
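One simple way to flag a bimodal variable is to cut it at a value between the two modes. The sketch below assumes a cutoff of 0 and a helper named `flag_lower_mode`; both are hypothetical and are not taken from the report's actual code.

```r
# Hypothetical helper: 1 when the value falls in the lower mode, 0 otherwise.
# Missing values stay missing so they can still be imputed later.
flag_lower_mode <- function(x, cutoff) as.integer(x <= cutoff)

mnf.flow <- c(-100, -100.2, 140.8, 65.2, NA)   # made-up readings
mnf.flag <- flag_lower_mode(mnf.flow, cutoff = 0)
print(mnf.flag)
```

The cutoff should be chosen per variable by inspecting its histogram.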

Now we’ll take a look at a correlation plot.

##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          B    5.340000    23.96667 0.2633333          68.2     141.2 0.104
## 2          A    5.426667    24.00667 0.2386667          68.4     139.6 0.124
## 3          B    5.286667    24.06000 0.2633333          70.8     144.8 0.090
## 4          A    5.440000    24.00667 0.2933333          63.0     132.6    NA
## 5          A    5.486667    24.31333 0.1113333          67.2     136.8 0.026
## 6          A    5.380000    23.92667 0.2693333          66.6     138.4 0.090
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.26    0.04        1          118.8          46.0             1
## 2     0.22    0.04        1          121.6          46.0             1
## 3     0.34    0.16        1          120.2          46.0             1
## 4     0.42    0.04        1          115.2          46.4             1
## 5     0.16    0.12        1          118.4          45.8             1
## 6     0.24    0.04        1          119.6          45.6             1
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA           118        121.2            0
## 2            NA            NA           106        118.6            0
## 3            NA            NA            82        120.0            0
## 4             1             0            92        117.8            0
## 5             1             0            92        118.6            0
## 6             1             0           116        120.2            0
##   Temperature Usage.cont Carb.Flow   MFR Pressure.Vacuum   PH Oxygen.Filler
## 1        66.0      16.18         0 725.0            -4.0 8.36         0.022
## 2        67.6      19.90         0 726.8            -4.0 8.26         0.026
## 3        67.0      17.76         0 735.0            -3.8 8.94         0.024
## 4        65.6      17.42         0 730.6            -4.4 8.24         0.030
## 5        65.6      17.68         0 722.8            -4.4 8.26         0.030
## 6        66.2      23.82         0 738.8            -4.4 8.32         0.024
##   Bowl.Setpoint Pressure.Setpoint Air.Pressurer Balling.Lvl
## 1           120              46.4         142.6        1.48
## 2           120              46.8         143.0        1.56
## 3           120              46.6         142.0        3.28
## 4           120              46.0         146.2        3.04
## 5           120              46.0         146.2        3.04
## 6           120              46.0         146.6        3.02

From the plot, we notice that Density, Balling, Carb.Rel, and Alch.Rel are highly correlated, so we decided to remove those variables. As stated earlier, Brand Code was missing about 120 values. We first converted the Brand.Code predictor to a factor so that it would be compatible with random forest imputation.
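Besides reading the plot, highly correlated pairs can be listed programmatically. The following base R sketch uses simulated data and an assumed cutoff of 0.9; neither comes from the report.

```r
set.seed(42)
n <- 200
balling <- rnorm(n)
toy <- data.frame(
  Balling     = balling,
  Balling.Lvl = balling + rnorm(n, sd = 0.05),  # nearly a copy of Balling
  Temperature = rnorm(n)
)

cutoff <- 0.9  # assumed threshold, not taken from the report
cor.mat <- cor(toy, use = "pairwise.complete.obs")

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff
idx <- which(abs(cor.mat) > cutoff & upper.tri(cor.mat), arr.ind = TRUE)
high.pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)
print(high.pairs)
```

For each listed pair, one member can be dropped. Which one to keep is a modelling judgment.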

We then filtered out the 4 records with a missing response (PH) value and imputed the remaining missing values using random forest imputation.
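The filter-then-impute step might look roughly like the sketch below. It assumes the missForest package is installed and uses a small simulated data frame, so the column values and the positions of the missing entries are illustrative only.

```r
library(missForest)

set.seed(123)
n <- 60
toy <- data.frame(
  Brand.Code  = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)),
  Carb.Volume = rnorm(n, mean = 5.37, sd = 0.1),
  Fill.Ounces = rnorm(n, mean = 23.97, sd = 0.09),
  PH          = rnorm(n, mean = 8.55, sd = 0.17)
)
toy$PH[c(3, 10)] <- NA            # missing response
toy$Brand.Code[c(5, 20)] <- NA    # missing categorical predictor
toy$Carb.Volume[c(7, 30)] <- NA   # missing numeric predictor

# Drop rows with a missing response, then impute what is left
toy.complete <- toy[!is.na(toy$PH), ]
imp <- missForest(toy.complete)
toy.imputed <- imp$ximp           # the imputed data frame
```

Because Brand.Code is a factor, missForest treats it as categorical and imputes one of the existing levels.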

Imputing with missForest took much longer than rfImpute, but it works better for our purposes. Initially, we wanted to convert our response variable to a categorical one, but at this point we decided against it because it would lead to a loss of information. Next, let’s check whether we have any near-zero-variance variables. Under the default thresholds of caret’s nearZeroVar function, a variable is flagged when the percentage of unique values is less than 10% and the ratio of the most common value to the second most common value is greater than 95/5.
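The near-zero-variance check can be illustrated without any package. The functions below are a hand-rolled sketch of the two metrics; the names `nzv_metrics` and `is_nzv` are hypothetical, and the thresholds are the commonly used defaults of 95/5 and 10%.

```r
# Frequency ratio (most common / second most common value) and
# percentage of unique values for one variable.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq.ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct.unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq.ratio, percentUnique = pct.unique)
}

# A variable is flagged only when BOTH conditions hold.
is_nzv <- function(x, freq.cut = 95 / 5, unique.cut = 10) {
  m <- nzv_metrics(x)
  unname(m["freqRatio"] > freq.cut & m["percentUnique"] < unique.cut)
}

mostly.constant <- c(rep(120, 99), 70)   # one value dominates
well.spread     <- seq(1, 100)           # every value distinct
is_nzv(mostly.constant)
is_nzv(well.spread)
```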

No variable is flagged as TRUE for near-zero variance (nzv), so we will move on to transforming and splitting our dataset. We mentioned earlier that several variables exhibited skewness. We will apply a Box-Cox transformation to those variables (PSC, PSC.Fill, PSC.CO2, and Oxygen.Filler). Because PSC.Fill and PSC.CO2 contain 0 values and Box-Cox requires strictly positive input, we add a small offset first.

#lambda <- BoxCox.lambda(bev.imp.missForest)
#bev.boxcox <- BoxCox(bev.imp.missForest, lambda) 
library(forecast)
bev.boxcox <- bev.imp.missForest
offset <- .0000001
bev.boxcox$PSC.Fill <- bev.boxcox$PSC.Fill + offset
bev.boxcox$PSC.CO2 <- bev.boxcox$PSC.CO2 + offset
#psc.boxcox <- boxcox(bev.boxcox$PSC ~ 1, lambda = seq(-6, 6, .1))
#pscfill.boxcox <- boxcox(bev.boxcox$PSC.Fill ~ 1, lambda = seq(-6, 6, 0.1))
#psccos.boxcox <- boxcox(bev.boxcox$PSC.CO2 ~ 1, lambda = seq(-6, 6, 0.1))
#oxygenfiller.boxcox <- boxcox(bev.boxcox$Oxygen.Filler ~ 1, lambda = seq(-6, 6, .1))
#bc1 <- data.frame(psc.boxcox$x, psc.boxcox$y)
#bc2 <- bc1[with(bc1, order(-bc1$psc.boxcox.y)),]
#bc2[1,]
#bc3 <- data.frame(pscfill.boxcox$x, pscfill.boxcox$y)
#bc4 <- bc3[with(bc3, order(-bc3$pscfill.boxcox.y)),]
#bc4[1,]
#bc5 <- data.frame(psccos.boxcox$x, psccos.boxcox$y)
#bc6 <- bc5[with(bc5, order(-bc5$psccos.boxcox.y)),]
#bc6[1,]
#bc7 <- data.frame(oxygenfiller.boxcox$x, oxygenfiller.boxcox$y)
#bc8 <- bc7[with(bc7, order(-bc7$oxygenfiller.boxcox.y)),]
#bc8[1,]
# Estimate the optimal Box-Cox lambda for each skewed variable
lambda1 <- BoxCox.lambda(bev.boxcox$PSC.Fill)
lambda2 <- BoxCox.lambda(bev.boxcox$PSC.CO2)
lambda3 <- BoxCox.lambda(bev.boxcox$Oxygen.Filler)
lambda4 <- BoxCox.lambda(bev.boxcox$PSC)
# Apply the transformation to each variable using its estimated lambda
bev.boxcox$PSC.Fill <- BoxCox(bev.boxcox$PSC.Fill, lambda1)
bev.boxcox$PSC.CO2 <- BoxCox(bev.boxcox$PSC.CO2, lambda2)
bev.boxcox$Oxygen.Filler <- BoxCox(bev.boxcox$Oxygen.Filler, lambda3)
bev.boxcox$PSC <- BoxCox(bev.boxcox$PSC, lambda4)
DataExplorer::plot_histogram(bev.boxcox, nrow = 3L, ncol = 4L)

Now that we have finished transforming our dataset, we will split the training data we were given. We will split it a few ways so that it can be used with several different models.
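A basic random train/test split can be sketched in base R as follows. The 80/20 proportion and the simulated data are assumptions, since the report does not state the proportions it used.

```r
set.seed(100)
toy <- data.frame(x = rnorm(100), PH = rnorm(100, mean = 8.55, sd = 0.17))

train.frac <- 0.8   # assumed proportion; the report does not state one
train.idx <- sample(seq_len(nrow(toy)), size = floor(train.frac * nrow(toy)))

train <- toy[train.idx, ]    # rows used to fit the models
test  <- toy[-train.idx, ]   # held-out rows used to estimate RMSE
```

Setting the seed makes the split reproducible, so every model is compared on the same held-out rows.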

GLM Model

GLM, or generalized linear models, formulated by John Nelder and Robert Wedderburn, are “a flexible generalization of an ordinary linear regression model” that allow the linear model to be related to the response variable via a link function. They were initially formulated as a way of unifying various models such as linear, logistic, and Poisson regression, and they allow for response variables whose error distributions are not normal.
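As an illustration of the idea, the sketch below fits a Gaussian GLM with an identity link in base R and computes a held-out RMSE. The data and the relationship between Mnf.Flow and PH are simulated, and the report's own model may have been fitted differently.

```r
set.seed(7)
n <- 200
toy <- data.frame(Mnf.Flow    = runif(n, -100, 230),
                  Temperature = rnorm(n, 66, 1.4))
toy$PH <- 8.6 - 0.0008 * toy$Mnf.Flow + rnorm(n, sd = 0.1)  # made-up relationship

train <- toy[1:150, ]
test  <- toy[151:200, ]

# Gaussian family with identity link is equivalent to ordinary linear regression
fit <- glm(PH ~ Mnf.Flow + Temperature, data = train,
           family = gaussian(link = "identity"))

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
glm.rmse <- rmse(test$PH, predict(fit, newdata = test))
print(glm.rmse)
```

Swapping the family, for example to `binomial` or `poisson`, changes the link and error distribution while the rest of the call stays the same.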

## 4.84 sec elapsed
## [1] "The RMSE value for the GLM model is 0.132169145517236"

GLMNET MODEL

GLMNET fits elastic net regression. Unlike GLM, this model includes a penalty term. Elastic net is a regularized regression method that combines the L1 and L2 penalties of lasso and ridge, respectively.
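The penalty term can be written out directly. The function below is a sketch of the elastic net penalty in the parameterisation used by the glmnet package, where `alpha` mixes the two penalties and `lambda` sets the overall strength; the function name and the coefficient vector are illustrative.

```r
# Elastic net penalty:
#   lambda * ( (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)) )
elastic_net_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)   # made-up coefficients
lasso.pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # L1 only
ridge.pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # L2 only
mixed.pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # blend
```

This penalty is added to the usual squared-error loss, so larger coefficients cost more and are shrunk toward zero.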

## 11.28 sec elapsed
## [1] "The RMSE value for the GLMNET model is 0.132072825560028"

We will next try a partial least squares (PLS) regression model. PLS is typically used when there are more predictors than observations, although that is not the case here. PLS is a dimension reduction technique similar to PCA. Our predictors are mapped to a smaller set of variables, and within that space we perform a regression against our response variable. It aims to choose new mapped variables that maximally explain the outcome variable.

## 4.38 sec elapsed
## [1] "The RMSE value for the PLS model is 0.134651698744935"

Random Forest

## 527.37 sec elapsed
## [1] "The RMSE value for the Random Forest model is 0.105823809400893"
## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                   Overall
## Mnf.Flow           100.00
## Brand.CodeC         72.57
## Pressure.Vacuum     68.96
## Air.Pressurer       65.93
## Oxygen.Filler       63.94
## Balling.Lvl         50.44
## Usage.cont          45.89
## Carb.Pressure1      45.19
## Hyd.Pressure3       43.84
## Brand.CodeD         43.08
## Bowl.Setpoint       41.93
## Temperature         38.50
## Filler.Level        34.92
## Carb.Volume         33.26
## MFR                 29.28
## Fill.Pressure       27.78
## PC.Volume           25.17
## Carb.Flow           22.76
## Hyd.Pressure4       22.29
## Pressure.Setpoint   20.05

From the random forest model, we see that the top 5 most important variables are:

  1. Mnf.Flow
  2. Brand.CodeC
  3. Pressure.Vacuum
  4. Air.Pressurer
  5. Oxygen.Filler

XGBoost Model

We decided to try the Extreme Gradient Boosting (XGBoost) model because of its high accuracy on regression problems and because it allows optimization of an arbitrary differentiable loss function. XGBoost accepts only numerical predictors, so let’s convert Brand.Code to numerical.
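There are two common ways to turn a factor such as Brand.Code into numbers, sketched below in base R. The sample values are made up, and the report does not say which encoding it used.

```r
brand <- factor(c("B", "A", "B", "C", "D", NA), levels = c("A", "B", "C", "D"))

# Option 1: integer codes (compact, but implies an ordering A < B < C < D)
brand.int <- as.integer(brand)

# Option 2: one indicator column per level (no implied ordering)
brand.onehot <- sapply(levels(brand), function(l) as.integer(brand == l))
colnames(brand.onehot) <- paste0("Brand.Code", levels(brand))

print(brand.int)
print(brand.onehot)
```

Tree-based boosters can usually work with either encoding. Indicator columns avoid suggesting that brand D is "greater" than brand A.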

We clearly see that the most important predictors are:

  1. Mnf.Flow
  2. Usage.cont
  3. Carb.Flow
  4. Oxygen.Filler
  5. Carb.Rel

MARS model

We decided to try a MARS (multivariate adaptive regression splines) model because it can predict a continuous outcome variable from a set of predictor variables. We chose MARS because it is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables, and in this case it was not clear whether the relationship was linear. It works even in situations where the relationship between the predictors and the dependent variable is non-monotone and difficult to approximate with parametric models.
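MARS builds its fit from hinge functions, which is what lets it bend at a knot instead of following one straight line. The sketch below shows a single hinge term with made-up coefficients and a made-up knot; it is not the fitted model from this report.

```r
# A hinge function is zero on one side of the knot and linear on the other.
# direction = 1 gives max(0, x - knot); direction = -1 gives max(0, knot - x).
hinge <- function(x, knot, direction = 1) pmax(0, direction * (x - knot))

x <- seq(-2, 2, by = 0.5)

# Hypothetical fitted form with one knot at 0: flat below it, rising above it
y.hat <- 8.5 + 0.2 * hinge(x, knot = 0)
print(data.frame(x = x, y.hat = y.hat))
```

With a degree of 2, the model may also include products of two hinge functions, which capture interactions between predictors.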

##    nprune degree
## 69     33      2

## 180.98 sec elapsed
## [1] "The RMSE value for the MARS model is 0.1277669928728"

We clearly see that the most important predictors for the MARS model are:

  1. Mnf.Flow
  2. Brand.Code
  3. Air.Pressurer
  4. Alch.Rel
  5. Bowl.Setpoint

glm.rmse glmnet.rmse pls.rmse rf.rmse xgboost.rmse mars.rmse
0.1321691 0.1320728 0.1346517 0.1058238 0.1229514 0.127767
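The values in the table above can be ranked directly. The sketch below copies the reported RMSE values into a named vector and sorts them; the vector names are shortened labels chosen for illustration.

```r
# RMSE values copied from the comparison table in this report
rmse.tbl <- c(glm = 0.1321691, glmnet = 0.1320728, pls = 0.1346517,
              rf = 0.1058238, xgboost = 0.1229514, mars = 0.127767)

ranked <- sort(rmse.tbl)        # lowest (best) RMSE first
print(round(ranked, 3))

best <- names(ranked)[1]        # model with the lowest RMSE
```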

We see that the random forest model has the best RMSE at 0.106. The next best model was XGBoost at 0.123, followed by MARS at 0.128. We also timed each of our models; of the timings reported above, PLS was the fastest at 4.38 seconds and random forest the slowest at 527.37 seconds.

Model Testing

Preprocess the test set by imputing missing values.

##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##  A   : 35   Min.   :5.147   Min.   :23.75   Min.   :0.09867   Min.   :60.20  
##  B   :129   1st Qu.:5.287   1st Qu.:23.92   1st Qu.:0.23333   1st Qu.:65.30  
##  C   : 31   Median :5.340   Median :23.97   Median :0.27533   Median :68.00  
##  D   : 64   Mean   :5.369   Mean   :23.97   Mean   :0.27769   Mean   :68.25  
##  NA's:  8   3rd Qu.:5.465   3rd Qu.:24.01   3rd Qu.:0.32200   3rd Qu.:70.60  
##             Max.   :5.667   Max.   :24.20   Max.   :0.46400   Max.   :77.60  
##             NA's   :1       NA's   :6       NA's   :4                        
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :130.0   Min.   :0.00400   Min.   :0.0200   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04450   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.2   Mean   :0.08545   Mean   :0.1903   Mean   :0.05107  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.06000  
##  Max.   :154.0   Max.   :0.24600   Max.   :0.6200   Max.   :0.24000  
##  NA's   :1       NA's   :5         NA's   :3        NA's   :5        
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1   
##  Min.   :-100.20   Min.   :113.0   Min.   :37.80   Min.   :-50.00  
##  1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00   1st Qu.:  0.00  
##  Median :   0.20   Median :123.4   Median :47.80   Median : 10.40  
##  Mean   :  21.03   Mean   :123.0   Mean   :48.14   Mean   : 12.01  
##  3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20   3rd Qu.: 20.40  
##  Max.   : 220.40   Max.   :136.0   Max.   :60.20   Max.   : 50.00  
##                    NA's   :4       NA's   :2                       
##  Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4     Filler.Level  
##  Min.   :-50.00   Min.   :-50.00   Min.   : 68.00   Min.   : 69.2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00   1st Qu.:100.6  
##  Median : 26.80   Median : 27.70   Median : 98.00   Median :118.6  
##  Mean   : 20.11   Mean   : 19.61   Mean   : 97.84   Mean   :110.3  
##  3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00   3rd Qu.:120.2  
##  Max.   : 61.40   Max.   : 49.20   Max.   :140.00   Max.   :153.2  
##  NA's   :1        NA's   :1        NA's   :4        NA's   :2      
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0   Min.   :0.060  
##  1st Qu.:3812   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083   1st Qu.:0.920  
##  Median :3978   Median :65.80   Median :21.44   Median :3038   Median :0.980  
##  Mean   :3581   Mean   :66.23   Mean   :20.90   Mean   :2409   Mean   :1.177  
##  3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215   3rd Qu.:1.600  
##  Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858   Max.   :1.840  
##  NA's   :10     NA's   :2       NA's   :2                      NA's   :1      
##       MFR           Balling      Pressure.Vacuum  Oxygen.Filler    
##  Min.   : 15.6   Min.   :0.902   Min.   :-6.400   Min.   :0.00240  
##  1st Qu.:707.0   1st Qu.:1.498   1st Qu.:-5.600   1st Qu.:0.01960  
##  Median :724.6   Median :1.648   Median :-5.200   Median :0.03370  
##  Mean   :697.8   Mean   :2.203   Mean   :-5.174   Mean   :0.04666  
##  3rd Qu.:731.5   3rd Qu.:3.242   3rd Qu.:-4.800   3rd Qu.:0.05440  
##  Max.   :784.8   Max.   :3.788   Max.   :-3.600   Max.   :0.39800  
##  NA's   :31      NA's   :1       NA's   :1        NA's   :3        
##  Bowl.Setpoint   Pressure.Setpoint Air.Pressurer      Alch.Rel    
##  Min.   : 70.0   Min.   :44.00     Min.   :141.2   Min.   :6.400  
##  1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540  
##  Median :120.0   Median :46.00     Median :142.6   Median :6.580  
##  Mean   :109.6   Mean   :47.73     Mean   :142.8   Mean   :6.907  
##  3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8   3rd Qu.:7.180  
##  Max.   :130.0   Max.   :52.00     Max.   :147.2   Max.   :7.820  
##  NA's   :1       NA's   :2         NA's   :1       NA's   :3      
##     Carb.Rel     Balling.Lvl   
##  Min.   :5.18   Min.   :0.000  
##  1st Qu.:5.34   1st Qu.:1.380  
##  Median :5.40   Median :1.480  
##  Mean   :5.44   Mean   :2.051  
##  3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :5.74   Max.   :3.420  
##  NA's   :2
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##  A: 35      Min.   :5.147   Min.   :23.75   Min.   :0.09867   Min.   :60.20  
##  B:135      1st Qu.:5.286   1st Qu.:23.92   1st Qu.:0.23433   1st Qu.:65.30  
##  C: 33      Median :5.340   Median :23.97   Median :0.27600   Median :68.00  
##  D: 64      Mean   :5.368   Mean   :23.97   Mean   :0.27815   Mean   :68.25  
##             3rd Qu.:5.463   3rd Qu.:24.01   3rd Qu.:0.32233   3rd Qu.:70.60  
##             Max.   :5.667   Max.   :24.20   Max.   :0.46400   Max.   :77.60  
##    Carb.Temp          PSC            PSC.Fill         PSC.CO2       
##  Min.   :130.0   Min.   :0.0040   Min.   :0.0200   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.0460   1st Qu.:0.1100   1st Qu.:0.02000  
##  Median :140.8   Median :0.0780   Median :0.1800   Median :0.04000  
##  Mean   :141.2   Mean   :0.0858   Mean   :0.1906   Mean   :0.05126  
##  3rd Qu.:143.9   3rd Qu.:0.1120   3rd Qu.:0.2500   3rd Qu.:0.06000  
##  Max.   :154.0   Max.   :0.2460   Max.   :0.6200   Max.   :0.24000  
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1   
##  Min.   :-100.20   Min.   :113.0   Min.   :37.80   Min.   :-50.00  
##  1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00   1st Qu.:  0.00  
##  Median :   0.20   Median :123.4   Median :47.80   Median : 10.40  
##  Mean   :  21.03   Mean   :123.0   Mean   :48.12   Mean   : 12.01  
##  3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20   3rd Qu.: 20.40  
##  Max.   : 220.40   Max.   :136.0   Max.   :60.20   Max.   : 50.00  
##  Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4     Filler.Level  
##  Min.   :-50.00   Min.   :-50.00   Min.   : 68.00   Min.   : 69.2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00   1st Qu.:100.6  
##  Median : 26.80   Median : 27.60   Median : 98.00   Median :118.6  
##  Mean   : 20.04   Mean   : 19.54   Mean   : 98.07   Mean   :110.4  
##  3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00   3rd Qu.:120.2  
##  Max.   : 61.40   Max.   : 49.20   Max.   :140.00   Max.   :153.2  
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0   Min.   :0.060  
##  1st Qu.:3795   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083   1st Qu.:0.910  
##  Median :3918   Median :65.80   Median :21.40   Median :3038   Median :0.980  
##  Mean   :3506   Mean   :66.24   Mean   :20.89   Mean   :2409   Mean   :1.175  
##  3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215   3rd Qu.:1.600  
##  Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858   Max.   :1.840  
##       MFR           Balling      Pressure.Vacuum  Oxygen.Filler    
##  Min.   : 15.6   Min.   :0.902   Min.   :-6.400   Min.   :0.00240  
##  1st Qu.:687.2   1st Qu.:1.497   1st Qu.:-5.600   1st Qu.:0.02070  
##  Median :720.4   Median :1.648   Median :-5.200   Median :0.03380  
##  Mean   :670.8   Mean   :2.200   Mean   :-5.173   Mean   :0.04724  
##  3rd Qu.:730.7   3rd Qu.:3.242   3rd Qu.:-4.800   3rd Qu.:0.05710  
##  Max.   :784.8   Max.   :3.788   Max.   :-3.600   Max.   :0.39800  
##  Bowl.Setpoint   Pressure.Setpoint Air.Pressurer      Alch.Rel    
##  Min.   : 70.0   Min.   :44.00     Min.   :141.2   Min.   :6.400  
##  1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540  
##  Median :120.0   Median :46.00     Median :142.6   Median :6.580  
##  Mean   :109.6   Mean   :47.73     Mean   :142.8   Mean   :6.904  
##  3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8   3rd Qu.:7.180  
##  Max.   :130.0   Max.   :52.00     Max.   :147.2   Max.   :7.820  
##     Carb.Rel     Balling.Lvl   
##  Min.   :5.18   Min.   :0.000  
##  1st Qu.:5.34   1st Qu.:1.380  
##  Median :5.40   Median :1.480  
##  Mean   :5.44   Mean   :2.051  
##  3rd Qu.:5.56   3rd Qu.:3.080  
##  Max.   :5.74   Max.   :3.420

We use the random forest model to predict PH because, of all the models, it has the lowest RMSE.

##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##  A: 35      Min.   :5.147   Min.   :23.75   Min.   :0.09867   Min.   :60.20  
##  B:135      1st Qu.:5.286   1st Qu.:23.92   1st Qu.:0.23433   1st Qu.:65.30  
##  C: 33      Median :5.340   Median :23.97   Median :0.27600   Median :68.00  
##  D: 64      Mean   :5.368   Mean   :23.97   Mean   :0.27815   Mean   :68.25  
##             3rd Qu.:5.463   3rd Qu.:24.01   3rd Qu.:0.32233   3rd Qu.:70.60  
##             Max.   :5.667   Max.   :24.20   Max.   :0.46400   Max.   :77.60  
##    Carb.Temp          PSC            PSC.Fill         PSC.CO2       
##  Min.   :130.0   Min.   :0.0040   Min.   :0.0200   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.0460   1st Qu.:0.1100   1st Qu.:0.02000  
##  Median :140.8   Median :0.0780   Median :0.1800   Median :0.04000  
##  Mean   :141.2   Mean   :0.0858   Mean   :0.1906   Mean   :0.05126  
##  3rd Qu.:143.9   3rd Qu.:0.1120   3rd Qu.:0.2500   3rd Qu.:0.06000  
##  Max.   :154.0   Max.   :0.2460   Max.   :0.6200   Max.   :0.24000  
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1   
##  Min.   :-100.20   Min.   :113.0   Min.   :37.80   Min.   :-50.00  
##  1st Qu.:-100.00   1st Qu.:120.2   1st Qu.:46.00   1st Qu.:  0.00  
##  Median :   0.20   Median :123.4   Median :47.80   Median : 10.40  
##  Mean   :  21.03   Mean   :123.0   Mean   :48.12   Mean   : 12.01  
##  3rd Qu.: 141.30   3rd Qu.:125.5   3rd Qu.:50.20   3rd Qu.: 20.40  
##  Max.   : 220.40   Max.   :136.0   Max.   :60.20   Max.   : 50.00  
##  Hyd.Pressure2    Hyd.Pressure3    Hyd.Pressure4     Filler.Level  
##  Min.   :-50.00   Min.   :-50.00   Min.   : 68.00   Min.   : 69.2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 90.00   1st Qu.:100.6  
##  Median : 26.80   Median : 27.60   Median : 98.00   Median :118.6  
##  Mean   : 20.04   Mean   : 19.54   Mean   : 98.07   Mean   :110.4  
##  3rd Qu.: 34.80   3rd Qu.: 33.00   3rd Qu.:104.00   3rd Qu.:120.2  
##  Max.   : 61.40   Max.   : 49.20   Max.   :140.00   Max.   :153.2  
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   :1006   Min.   :63.80   Min.   :12.90   Min.   :   0   Min.   :0.060  
##  1st Qu.:3795   1st Qu.:65.40   1st Qu.:18.12   1st Qu.:1083   1st Qu.:0.910  
##  Median :3918   Median :65.80   Median :21.40   Median :3038   Median :0.980  
##  Mean   :3506   Mean   :66.24   Mean   :20.89   Mean   :2409   Mean   :1.175  
##  3rd Qu.:3996   3rd Qu.:66.60   3rd Qu.:23.74   3rd Qu.:3215   3rd Qu.:1.600  
##  Max.   :4020   Max.   :75.40   Max.   :24.60   Max.   :3858   Max.   :1.840  
##       MFR           Balling      Pressure.Vacuum  Oxygen.Filler    
##  Min.   : 15.6   Min.   :0.902   Min.   :-6.400   Min.   :0.00240  
##  1st Qu.:687.2   1st Qu.:1.497   1st Qu.:-5.600   1st Qu.:0.02070  
##  Median :720.4   Median :1.648   Median :-5.200   Median :0.03380  
##  Mean   :670.8   Mean   :2.200   Mean   :-5.173   Mean   :0.04724  
##  3rd Qu.:730.7   3rd Qu.:3.242   3rd Qu.:-4.800   3rd Qu.:0.05710  
##  Max.   :784.8   Max.   :3.788   Max.   :-3.600   Max.   :0.39800  
##  Bowl.Setpoint   Pressure.Setpoint Air.Pressurer      Alch.Rel    
##  Min.   : 70.0   Min.   :44.00     Min.   :141.2   Min.   :6.400  
##  1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540  
##  Median :120.0   Median :46.00     Median :142.6   Median :6.580  
##  Mean   :109.6   Mean   :47.73     Mean   :142.8   Mean   :6.904  
##  3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:142.8   3rd Qu.:7.180  
##  Max.   :130.0   Max.   :52.00     Max.   :147.2   Max.   :7.820  
##     Carb.Rel     Balling.Lvl          PH       
##  Min.   :5.18   Min.   :0.000   Min.   :8.336  
##  1st Qu.:5.34   1st Qu.:1.380   1st Qu.:8.462  
##  Median :5.40   Median :1.480   Median :8.527  
##  Mean   :5.44   Mean   :2.051   Mean   :8.513  
##  3rd Qu.:5.56   3rd Qu.:3.080   3rd Qu.:8.558  
##  Max.   :5.74   Max.   :3.420   Max.   :8.660
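Since the predictions need to be delivered in an Excel-readable format, one simple option is a CSV file, which Excel opens directly. The sketch below uses placeholder predictions and an assumed file name; in practice the PH column would come from the random forest model.

```r
# Hypothetical predictions; in the report these come from the random forest
predictions <- data.frame(row.id = 1:3, PH = c(8.51, 8.46, 8.56))

out.file <- file.path(tempdir(), "ph_predictions.csv")  # assumed file name
write.csv(predictions, out.file, row.names = FALSE)

# Read the file back to confirm it round-trips cleanly
check <- read.csv(out.file)
print(check)
```

Setting `row.names = FALSE` avoids an extra unnamed column when the file is opened in Excel.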