Data cleaning

The summary shows missing values (NA) in many columns. For “Brand Code”, however, the summary reports only that the class is character, so any missing values are hidden. Counting them directly shows that Brand Code has 120 missing entries. We therefore label these entries explicitly as “NA” instead of leaving them blank before working with the data.

##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
## [1] 120
## [1] 0
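The relabeling step described above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame (`df` and its values are assumptions, not the report's data), and it assumes blanks were read in as `NA`; the report's own code is not shown.

```r
# Toy stand-in for the real data; the values are invented for illustration.
df <- data.frame(
  `Brand Code`  = c("A", "B", NA, "C", NA, "D"),
  `Carb Volume` = c(5.34, 5.43, 5.29, NA, 5.49, 5.38),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# summary() does not report NA counts for character columns, so count directly.
n_missing_before <- sum(is.na(df$`Brand Code`))

# Replace missing brand codes with the explicit string "NA" (its own category).
df$`Brand Code`[is.na(df$`Brand Code`)] <- "NA"

# Re-check: no true missing values should remain in the column.
n_missing_after <- sum(is.na(df$`Brand Code`))
print(c(before = n_missing_before, after = n_missing_after))
```

Treating missing brand as its own category keeps those rows available for modeling rather than dropping or imputing them.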

Introduction

ABC Beverage is facing new regulations that require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to the regulators.

Below is the technical report detailing our examination of the data and creation of our PH model.

Data Summary and Preparation

The data has 2571 beverage records and 33 variables: PH (the target) and 32 possible predictors of PH. 31 of the predictors are numeric, and one (Brand Code) is character. Many of the predictors have missing values.

str(dfPHa)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

Let's remove Brand Code, check the descriptive statistics of the numeric variables, and visualize them with box plots.

##                   vars    n    mean      sd  median trimmed    mad     min
## Carb Volume          1 2561    5.37    0.11    5.35    5.37   0.11    5.04
## Fill Ounces          2 2533   23.97    0.09   23.97   23.98   0.08   23.63
## PC Volume            3 2532    0.28    0.06    0.27    0.27   0.05    0.08
## Carb Pressure        4 2544   68.19    3.54   68.20   68.12   3.56   57.00
## Carb Temp            5 2545  141.09    4.04  140.80  140.99   3.85  128.60
## PSC                  6 2538    0.08    0.05    0.08    0.08   0.05    0.00
## PSC Fill             7 2548    0.20    0.12    0.18    0.18   0.12    0.00
## PSC CO2              8 2532    0.06    0.04    0.04    0.05   0.03    0.00
## Mnf Flow             9 2569   24.57  119.48   65.20   21.07 169.02 -100.20
## Carb Pressure1      10 2539  122.59    4.74  123.20  122.54   4.45  105.60
## Fill Pressure       11 2549   47.92    3.18   46.40   47.71   2.37   34.60
## Hyd Pressure1       12 2560   12.44   12.43   11.40   10.84  16.90   -0.80
## Hyd Pressure2       13 2556   20.96   16.39   28.60   21.05  13.34    0.00
## Hyd Pressure3       14 2556   20.46   15.98   27.60   20.51  13.94   -1.20
## Hyd Pressure4       15 2541   96.29   13.12   96.00   95.45  11.86   52.00
## Filler Level        16 2551  109.25   15.70  118.40  111.04   9.19   55.80
## Filler Speed        17 2514 3687.20  770.82 3982.00 3919.99  47.44  998.00
## Temperature         18 2557   65.97    1.38   65.60   65.80   0.89   63.60
## Usage cont          19 2566   20.99    2.98   21.79   21.25   3.19   12.08
## Carb Flow           20 2569 2468.35 1073.70 3028.00 2601.14 326.17   26.00
## Density             21 2570    1.17    0.38    0.98    1.15   0.15    0.24
## MFR                 22 2359  704.05   73.90  724.00  718.16  15.42   31.40
## Balling             23 2570    2.20    0.93    1.65    2.13   0.37   -0.17
## Pressure Vacuum     24 2571   -5.22    0.57   -5.40   -5.25   0.59   -6.60
## PH                  25 2567    8.55    0.17    8.54    8.55   0.18    7.88
## Oxygen Filler       26 2559    0.05    0.05    0.03    0.04   0.02    0.00
## Bowl Setpoint       27 2569  109.33   15.30  120.00  111.35   0.00   70.00
## Pressure Setpoint   28 2559   47.62    2.04   46.00   47.60   0.00   44.00
## Air Pressurer       29 2571  142.83    1.21  142.60  142.58   0.59  140.80
## Alch Rel            30 2562    6.90    0.51    6.56    6.84   0.06    5.28
## Carb Rel            31 2561    5.44    0.13    5.40    5.43   0.12    4.96
## Balling Lvl         32 2570    2.05    0.87    1.48    1.98   0.21    0.00
##                       max   range  skew kurtosis    se
## Carb Volume          5.70    0.66  0.39    -0.47  0.00
## Fill Ounces         24.32    0.69 -0.02     0.86  0.00
## PC Volume            0.48    0.40  0.34     0.67  0.00
## Carb Pressure       79.40   22.40  0.18    -0.01  0.07
## Carb Temp          154.00   25.40  0.25     0.24  0.08
## PSC                  0.27    0.27  0.85     0.65  0.00
## PSC Fill             0.62    0.62  0.93     0.77  0.00
## PSC CO2              0.24    0.24  1.73     3.73  0.00
## Mnf Flow           229.40  329.60  0.00    -1.87  2.36
## Carb Pressure1     140.20   34.60  0.05     0.14  0.09
## Fill Pressure       60.40   25.80  0.55     1.41  0.06
## Hyd Pressure1       58.00   58.80  0.78    -0.14  0.25
## Hyd Pressure2       59.40   59.40 -0.30    -1.56  0.32
## Hyd Pressure3       50.00   51.20 -0.32    -1.57  0.32
## Hyd Pressure4      142.00   90.00  0.55     0.63  0.26
## Filler Level       161.20  105.40 -0.85     0.05  0.31
## Filler Speed      4030.00 3032.00 -2.87     6.71 15.37
## Temperature         76.20   12.60  2.39    10.16  0.03
## Usage cont          25.90   13.82 -0.54    -1.02  0.06
## Carb Flow         5104.00 5078.00 -0.99    -0.58 21.18
## Density              1.92    1.68  0.53    -1.20  0.01
## MFR                868.60  837.20 -5.09    30.46  1.52
## Balling              4.01    4.18  0.59    -1.39  0.02
## Pressure Vacuum     -3.60    3.00  0.53    -0.03  0.01
## PH                   9.36    1.48 -0.29     0.06  0.00
## Oxygen Filler        0.40    0.40  2.66    11.09  0.00
## Bowl Setpoint      140.00   70.00 -0.97    -0.06  0.30
## Pressure Setpoint   52.00    8.00  0.20    -1.60  0.04
## Air Pressurer      148.20    7.40  2.25     4.73  0.02
## Alch Rel             8.62    3.34  0.88    -0.85  0.01
## Carb Rel             6.06    1.10  0.50    -0.29  0.00
## Balling Lvl          3.66    3.66  0.59    -1.49  0.02
## Warning: Removed 724 rows containing non-finite values (`stat_boxplot()`).

Density curve of Predictors

Next we plot the density curve of each predictor. The shapes of the distributions differ considerably: the skewness values in the table above, for example, range from -5.09 (MFR) to 2.66 (Oxygen Filler).

## Warning: Removed 724 rows containing non-finite values (`stat_density()`).

Missing Values

Next we count the missing values in each column. For Brand Code, we have already replaced the missing entries with the literal string “NA”. The counts are shown below. One thing stands out: MFR has 212 missing values, while every other predictor has fewer than 100.

Let's visualize the missing values to see this more clearly. MFR has the most missing values, at 8.25% of the records (212 of 2571).

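Can other predictors replace MFR? It does not appear to be highly correlated with PH or with the other predictors. The correlation test below between PH and mfr_flag (presumably an indicator of whether MFR is missing) is also not significant (r = 0.013, p = 0.50). We may therefore consider replacing the missing MFR values with the column median found earlier (724.0).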

##       Carb Volume       Fill Ounces         PC Volume     Carb Pressure 
##                10                38                39                27 
##         Carb Temp               PSC          PSC Fill           PSC CO2 
##                26                33                23                39 
##          Mnf Flow    Carb Pressure1     Fill Pressure     Hyd Pressure1 
##                 2                32                22                11 
##     Hyd Pressure2     Hyd Pressure3     Hyd Pressure4      Filler Level 
##                15                15                30                20 
##      Filler Speed       Temperature        Usage cont         Carb Flow 
##                57                14                 5                 2 
##           Density               MFR           Balling   Pressure Vacuum 
##                 1               212                 1                 0 
##                PH     Oxygen Filler     Bowl Setpoint Pressure Setpoint 
##                 4                12                 2                12 
##     Air Pressurer          Alch Rel          Carb Rel       Balling Lvl 
##                 0                 9                10                 1
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the naniar package.
##   Please report the issue at <https://github.com/njtierney/naniar/issues>.

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
##   Please report the issue at <https://github.com/ropensci/visdat/issues>.

## 
##  Pearson's product-moment correlation
## 
## data:  dfPH_Num$PH and mfr_flag
## t = 0.67288, df = 2565, p-value = 0.5011
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02541582  0.05194583
## sample estimates:
##        cor 
## 0.01328488
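The missing-value checks above can be sketched in base R roughly as follows. This is a hedged illustration on simulated data: `toy` and its values are assumptions, and the way `mfr_flag` is built here (1 where MFR is missing) is a guess at what the report did, since the original code is not shown.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric data; not the report's values.
toy <- data.frame(
  PH  = rnorm(n, mean = 8.5, sd = 0.17),
  MFR = rnorm(n, mean = 700, sd = 70)
)
toy$MFR[sample(n, 20)] <- NA   # remove 10% of MFR at random

# Missing count and percentage per column.
na_count <- colSums(is.na(toy))
na_pct   <- round(100 * na_count / nrow(toy), 2)
print(rbind(na_count, na_pct))

# Assumed construction of the flag: 1 where MFR is missing, 0 otherwise.
mfr_flag <- as.integer(is.na(toy$MFR))

# Test whether missingness in MFR is related to PH.
flag_test <- cor.test(toy$PH, mfr_flag)

# Median imputation: fill the missing MFR values with the column median.
mfr_median <- median(toy$MFR, na.rm = TRUE)
toy$MFR[is.na(toy$MFR)] <- mfr_median
```

A non-significant flag correlation suggests the missingness carries little information about PH, which is what makes a simple median fill a defensible choice.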

Multicollinearity

From the correlation plot, we can see strong correlations among Balling, Balling Lvl, and Density. We therefore chose to remove Balling, since keeping Balling Lvl and Density should be sufficient for the later steps.

## corrplot 0.92 loaded
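As a numeric complement to reading the correlation plot, highly correlated pairs can be listed directly. The sketch below uses simulated columns that only mimic the Balling, Balling Lvl, and Density relationship; the data and the 0.9 cutoff are illustrative assumptions, not values from the report.

```r
set.seed(2)
n <- 300
density <- runif(n, 0.8, 1.9)

# Simulated columns: two are near-linear functions of density, one is unrelated.
toy <- data.frame(
  Density     = density,
  Balling     = 2.5 * density - 1.0 + rnorm(n, sd = 0.05),
  Balling.Lvl = 2.3 * density - 0.9 + rnorm(n, sd = 0.05),
  Temperature = rnorm(n, mean = 66, sd = 1.4)
)

# Pairwise correlations, tolerating missing values if there are any.
cors <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.9   # illustrative threshold
idx <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 3)
)
print(high_pairs)

# Drop one member of the correlated group, as the report does with Balling.
reduced <- toy[, setdiff(names(toy), "Balling")]
```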

Machine Learning Models

We now test our data to determine the best predictive model.
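The "Test Metrics" reported for each model below are RMSE, R-squared, and MAE on held-out data. A minimal base R sketch of that evaluation loop follows. The data are simulated, the 80/20 split is an assumption (the 2058 training samples in the output are about 80% of 2571), `test_metrics` is a hypothetical helper, and `lm` stands in for whichever model is being tested.

```r
set.seed(3)
n <- 500

# Simulated data with a PH-like target; not the report's data.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$PH <- 8.5 + 0.1 * toy$x1 - 0.05 * toy$x2 + rnorm(n, sd = 0.1)

# Hold out 20% of the rows for testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
dfTrain <- toy[train_idx, ]
dfTest  <- toy[-train_idx, ]

# Hypothetical helper returning the three metrics shown in the report.
test_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

# Placeholder model: ordinary least squares on the training rows.
fit  <- lm(PH ~ ., data = dfTrain)
pred <- predict(fit, newdata = dfTest)

metrics <- test_metrics(pred, dfTest$PH)
print(metrics)
```

Scoring every model on the same held-out rows is what makes the metrics in the sections below comparable.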

1) Lasso

## Loaded glmnet 4.1-7
## 
## Call:  glmnet(x = x, y = yTrain, alpha = 1, lambda = lambda1, preProc = c("center",      "scale")) 
## 
##   Df  %Dev    Lambda
## 1 33 41.38 0.0003475
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1397969 0.3980719 0.1097855

2) Ridge

## 
## Call:  glmnet(x = x, y = yTrain, alpha = 0, lambda = lambda1, preProc = c("center",      "scale")) 
## 
##   Df %Dev   Lambda
## 1 34 40.9 0.007664
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1405160 0.3938785 0.1104329

3) PLS

## Partial Least Squares 
## 
## 2058 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1852, 1852, 1852, 1853, 1852, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1482391  0.2455820  0.1177220
##    2     0.1388578  0.3384327  0.1089266
##    3     0.1367599  0.3580082  0.1077355
##    4     0.1354618  0.3698751  0.1065667
##    5     0.1343346  0.3813681  0.1049822
##    6     0.1338006  0.3862363  0.1040038
##    7     0.1333593  0.3903250  0.1037876
##    8     0.1331776  0.3924393  0.1030584
##    9     0.1330803  0.3930953  0.1029091
##   10     0.1330278  0.3934436  0.1027157
##   11     0.1329662  0.3938820  0.1026727
##   12     0.1329566  0.3939496  0.1026549
##   13     0.1330019  0.3935102  0.1027165
##   14     0.1330440  0.3931957  0.1026966
##   15     0.1330988  0.3928171  0.1026961
##   16     0.1331753  0.3921624  0.1027848
##   17     0.1331763  0.3921296  0.1028178
##   18     0.1331805  0.3921171  0.1027878
##   19     0.1331837  0.3920940  0.1027788
##   20     0.1331783  0.3921365  0.1027708
##   21     0.1331724  0.3921956  0.1027637
##   22     0.1331679  0.3922377  0.1027554
##   23     0.1331652  0.3922493  0.1027552
##   24     0.1331655  0.3922448  0.1027556
##   25     0.1331656  0.3922442  0.1027561
##   26     0.1331643  0.3922535  0.1027552
##   27     0.1331634  0.3922607  0.1027556
##   28     0.1331627  0.3922663  0.1027548
##   29     0.1331627  0.3922659  0.1027547
##   30     0.1331630  0.3922640  0.1027549
##   31     0.1331628  0.3922651  0.1027548
##   32     0.1331628  0.3922646  0.1027549
##   33     0.1331628  0.3922647  0.1027550
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1399521 0.3960262 0.1098523

4) KNN

## k-Nearest Neighbors 
## 
## 2058 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2058, 2058, 2058, 2058, 2058, 2058, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1314645  0.4381820  0.09581417
##    7  0.1291734  0.4470027  0.09518601
##    9  0.1281588  0.4505701  0.09521960
##   11  0.1280463  0.4496276  0.09545292
##   13  0.1282553  0.4469070  0.09581402
##   15  0.1287975  0.4418528  0.09650971
##   17  0.1291531  0.4383839  0.09708308
##   19  0.1293592  0.4367300  0.09751506
##   21  0.1296778  0.4341361  0.09788488
##   23  0.1300428  0.4311397  0.09830290
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## [1] "Test Metrics:"
##       RMSE   Rsquared        MAE 
## 0.13033381 0.47520537 0.09905783

5) Neural Net

Decay = .04, size = 3.

## a 34-3-1 network with 109 weights
## options were - linear output units  decay=0.04
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1303556 0.4777951 0.0979910

6) MARS

## Selected 45 of 55 terms, and 21 of 34 predictors
## Termination condition: Reached nk 69
## Importance: Mnf.Flow, Brand.Code_C, Usage.cont, Alch.Rel, Temperature, ...
## Number of terms at each degree of interaction: 1 44 (additive model)
## GCV 0.01526276    RSS 28.75273    GRSq 0.4743331    RSq 0.5183479
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1312462 0.4702532 0.1022397

7) and 8) SVM (radial and linear)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2058 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1853, 1852, 1852, 1852, 1853, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared   MAE       
##   0.25  0.1252827  0.4690902  0.09295133
##   0.50  0.1224433  0.4899658  0.09009948
##   1.00  0.1194835  0.5126329  0.08738536
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01990335
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01990335 and C = 1.
## [1] "Test Metrics:"
##       RMSE   Rsquared        MAE 
## 0.12273749 0.53764160 0.09092678
## [1] ""
## [1] "___________________________________________"
## [1] ""
## Support Vector Machines with Linear Kernel 
## 
## 2058 samples
##   34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1852, 1852, 1852, 1852, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1345371  0.3828227  0.1019708
## 
## Tuning parameter 'C' was held constant at a value of 1
## [1] "Test Metrics:"
##      RMSE  Rsquared       MAE 
## 0.1411874 0.3836443 0.1082657

9) Random Forest

## randomForest 4.7-1.1
## 
## Call:
##  randomForest(formula = PH ~ ., data = dfTrain, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 0.009447109
##                     % Var explained: 67.43
## [1] "Test Metrics:"
##       RMSE   Rsquared        MAE 
## 0.10117603 0.70733568 0.07546305

10) GBM

Depth = 4, ntrees = 10000, shrinkage = .007

## Loaded gbm 2.1.8.1
## gbm(formula = PH ~ ., distribution = "gaussian", data = dfTrain, 
##     n.trees = 10000, interaction.depth = 4, shrinkage = 0.007, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 9999.
## There were 34 predictors of which 34 had non-zero influence.
## [1] "Number of trees: 10000"
## [1] "Test Metrics:"
##       RMSE   Rsquared        MAE 
## 0.10856903 0.63655965 0.08162703

11) Cubist

## 
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
## 
## Number of samples: 2058 
## Number of predictors: 34 
## 
## Number of committees: 1 
## Number of rules: 35
## [1] "Test Metrics:"
##       RMSE   Rsquared        MAE 
## 0.12644448 0.52371190 0.08919304

Choosing the best model

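Most of the models have a low test R-squared. Even among the strongest models, only Random Forest performs well enough to use: it has the highest test R-squared (0.71) and the lowest test RMSE (0.101) of all eleven models. The radial SVM (RMSE 0.1227, R-squared 0.5376) is slightly ahead of Cubist but still far behind Random Forest. The test metrics for Random Forest, GBM, and Cubist are: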

Random Forest RMSE 0.10117603 Rsquared 0.70733568

GBM RMSE 0.10856903 Rsquared 0.63655965

Cubist RMSE 0.12644448 Rsquared 0.52371190
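To compare all eleven models at once, the test metrics reported in the sections above can be collected into one table and ranked. The sketch below copies the numbers from the "Test Metrics" outputs; the `results` data frame itself is an illustrative construct, not part of the original analysis.

```r
# Test metrics copied from the model outputs above.
results <- data.frame(
  model = c("Lasso", "Ridge", "PLS", "KNN", "Neural Net", "MARS",
            "SVM (radial)", "SVM (linear)", "Random Forest", "GBM", "Cubist"),
  RMSE = c(0.1397969, 0.1405160, 0.1399521, 0.13033381, 0.1303556, 0.1312462,
           0.12273749, 0.1411874, 0.10117603, 0.10856903, 0.12644448),
  Rsquared = c(0.3980719, 0.3938785, 0.3960262, 0.47520537, 0.4777951,
               0.4702532, 0.53764160, 0.3836443, 0.70733568, 0.63655965,
               0.52371190),
  MAE = c(0.1097855, 0.1104329, 0.1098523, 0.09905783, 0.0979910, 0.1022397,
          0.09092678, 0.1082657, 0.07546305, 0.08162703, 0.08919304)
)

# Rank from best to worst by test RMSE (lower is better).
ranked <- results[order(results$RMSE), ]
print(ranked, row.names = FALSE)
```

Ranked this way, Random Forest comes first on all three metrics, followed by GBM and then the radial SVM.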

Most important predictors

##                     Overall
## Brand.Code_C      67.092661
## Mnf.Flow          61.079848
## Pressure.Vacuum   50.505536
## Oxygen.Filler     50.229076
## Usage.cont        43.240739
## Balling.Lvl       39.875260
## Temperature       39.405143
## Carb.Rel          37.363884
## Density           36.194307
## Air.Pressurer     35.210296
## Alch.Rel          35.046117
## Filler.Speed      34.832764
## Carb.Flow         29.246673
## Bowl.Setpoint     29.078899
## Filler.Level      26.632490
## Hyd.Pressure1     26.505321
## Carb.Pressure1    25.341874
## Hyd.Pressure3     24.296731
## Carb.Volume       23.146086
## MFR               19.734645
## Brand.Code_D      19.389483
## PC.Volume         19.215872
## Hyd.Pressure4     18.705485
## Fill.Pressure     17.920572
## Brand.Code_A      17.663367
## Hyd.Pressure2     17.180092
## Brand.Code_NA     16.289629
## Pressure.Setpoint 15.380911
## Fill.Ounces        6.077110
## Carb.Pressure      4.193779
## PSC                4.038923
## Carb.Temp          2.714383
## PSC.CO2            2.370301
## PSC.Fill           1.552322

Prediction

##   Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC PSC.Fill
## 1    5.480000    24.03333 0.2700000          65.4     134.6 0.236     0.40
## 2    5.393333    23.95333 0.2266667          63.2     135.0 0.042     0.22
## 3    5.293333    23.92000 0.3033333          66.4     140.4 0.068     0.10
## 4    5.266667    23.94000 0.1860000          64.8     139.0 0.004     0.20
## 5    5.406667    24.20000 0.1600000          69.4     142.2 0.040     0.30
## 6    5.286667    24.10667 0.2120000          73.4     147.2 0.078     0.22
##   PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2
## 1    0.04     -100          116.6          46.0             0          26.8
## 2    0.08     -100          118.8          46.2             0           0.0
## 3    0.02     -100          120.2          45.8             0           0.0
## 4    0.02     -100          124.8          40.0             0           0.0
## 5    0.06     -100          115.0          51.4             0           0.0
## 6    0.04     -100          118.6          46.4             0           0.0
##   Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 1          27.7            96        129.4         3986        66.0      21.66
## 2           0.0           112        120.0         4012        65.6      17.60
## 3           0.0            98        119.4         4010        65.6      24.18
## 4           0.0           132        120.2         3978        74.4      18.12
## 5           0.0            94        116.0         4018        66.4      21.32
## 6           0.0            94        120.4         4010        66.6      18.00
##   Carb.Flow Density   MFR Pressure.Vacuum       PH Oxygen.Filler Bowl.Setpoint
## 1      2950    0.88 727.6            -3.8 8.456119        0.0220           130
## 2      2916    1.50 735.8            -4.4 8.575456        0.0300           120
## 3      3056    0.90 734.8            -4.2 8.530313        0.0460           120
## 4        28    0.74 724.6            -4.0 8.503903        0.0337           120
## 5      3214    0.88 752.0            -4.0 8.581512        0.0820           120
## 6      3064    0.84 732.0            -3.8 8.571334        0.0640           120
##   Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl Brand.Code_A
## 1              45.2         142.6     6.56     5.34        1.48            0
## 2              46.0         147.2     7.14     5.58        3.04            1
## 3              46.0         146.6     6.52     5.34        1.46            0
## 4              46.0         146.4     6.48     5.50        1.48            0
## 5              50.0         145.8     6.50     5.38        1.46            0
## 6              46.0         146.0     6.50     5.42        1.44            0
##   Brand.Code_C Brand.Code_D Brand.Code_NA
## 1            0            1             0
## 2            0            0             0
## 3            0            0             0
## 4            0            0             0
## 5            0            0             0
## 6            0            0             0
write.csv(output, "predictions.csv")

Conclusion

We selected Random Forest as the final model because it has the best test R-squared and RMSE. The predicted PH values shown above fall around 8.5, in line with the training data (median 8.54).