The summary shows NA values in many of the columns. The count for “Brand Code” is harder to read, because the summary reports only the class (character) for that column. Summing the missing entries directly shows that it has 120 of them. We therefore label those entries explicitly as “NA” instead of leaving them blank before working with the data.
## [1] "Brand Code" "Carb Volume" "Fill Ounces"
## [4] "PC Volume" "Carb Pressure" "Carb Temp"
## [7] "PSC" "PSC Fill" "PSC CO2"
## [10] "Mnf Flow" "Carb Pressure1" "Fill Pressure"
## [13] "Hyd Pressure1" "Hyd Pressure2" "Hyd Pressure3"
## [16] "Hyd Pressure4" "Filler Level" "Filler Speed"
## [19] "Temperature" "Usage cont" "Carb Flow"
## [22] "Density" "MFR" "Balling"
## [25] "Pressure Vacuum" "PH" "Oxygen Filler"
## [28] "Bowl Setpoint" "Pressure Setpoint" "Air Pressurer"
## [31] "Alch Rel" "Carb Rel" "Balling Lvl"
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
## [1] 120
## [1] 0
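The check and relabel described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the chunk the report actually ran; the object name `toy`, the handling of empty strings, and the literal label "NA" are assumptions.

```r
# Toy stand-in for the real data (hypothetical values).
toy <- data.frame(
  `Brand Code` = c("A", NA, "B", "", "C"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

bc <- toy$`Brand Code`

# An entry counts as missing if it is NA or an empty string.
is_missing <- is.na(bc) | bc == ""
n_missing_before <- sum(is_missing)

# Replace missing entries with an explicit "NA" category label.
bc[is_missing] <- "NA"
toy$`Brand Code` <- bc

# After relabeling, no true NA values remain in the column.
n_missing_after <- sum(is.na(toy$`Brand Code`))
```

Keeping the missing brand as its own category lets a later dummy-encoding step produce a separate indicator for it rather than dropping those rows.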
ABC Beverage is facing new regulations that require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to the regulators.
Below is the technical report detailing our examination of the data and the creation of our PH model.
As we can see, the data has 2571 beverage records and 33 variables: PH (the target) and 32 possible predictors of PH. 31 of the predictors are numeric, and one (Brand Code) is character. Many of the predictors have missing values.
summary(dfPHa)
str(dfPHa)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand Code : chr "B" "A" "B" "A" ...
## $ Carb Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num 24 24 24.1 24 24.3 ...
## $ PC Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num 119 122 120 115 118 ...
## $ Fill Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num 121 119 120 118 119 ...
## $ Filler Speed : num 4002 3986 4020 4012 4010 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num 2932 3144 2914 3062 3054 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num 143 143 142 146 146 ...
## $ Alch Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
Let's remove Brand Code, check the descriptive statistics of the remaining numeric variables, and visualize them with box plots.
## vars n mean sd median trimmed mad min
## Carb Volume 1 2561 5.37 0.11 5.35 5.37 0.11 5.04
## Fill Ounces 2 2533 23.97 0.09 23.97 23.98 0.08 23.63
## PC Volume 3 2532 0.28 0.06 0.27 0.27 0.05 0.08
## Carb Pressure 4 2544 68.19 3.54 68.20 68.12 3.56 57.00
## Carb Temp 5 2545 141.09 4.04 140.80 140.99 3.85 128.60
## PSC 6 2538 0.08 0.05 0.08 0.08 0.05 0.00
## PSC Fill 7 2548 0.20 0.12 0.18 0.18 0.12 0.00
## PSC CO2 8 2532 0.06 0.04 0.04 0.05 0.03 0.00
## Mnf Flow 9 2569 24.57 119.48 65.20 21.07 169.02 -100.20
## Carb Pressure1 10 2539 122.59 4.74 123.20 122.54 4.45 105.60
## Fill Pressure 11 2549 47.92 3.18 46.40 47.71 2.37 34.60
## Hyd Pressure1 12 2560 12.44 12.43 11.40 10.84 16.90 -0.80
## Hyd Pressure2 13 2556 20.96 16.39 28.60 21.05 13.34 0.00
## Hyd Pressure3 14 2556 20.46 15.98 27.60 20.51 13.94 -1.20
## Hyd Pressure4 15 2541 96.29 13.12 96.00 95.45 11.86 52.00
## Filler Level 16 2551 109.25 15.70 118.40 111.04 9.19 55.80
## Filler Speed 17 2514 3687.20 770.82 3982.00 3919.99 47.44 998.00
## Temperature 18 2557 65.97 1.38 65.60 65.80 0.89 63.60
## Usage cont 19 2566 20.99 2.98 21.79 21.25 3.19 12.08
## Carb Flow 20 2569 2468.35 1073.70 3028.00 2601.14 326.17 26.00
## Density 21 2570 1.17 0.38 0.98 1.15 0.15 0.24
## MFR 22 2359 704.05 73.90 724.00 718.16 15.42 31.40
## Balling 23 2570 2.20 0.93 1.65 2.13 0.37 -0.17
## Pressure Vacuum 24 2571 -5.22 0.57 -5.40 -5.25 0.59 -6.60
## PH 25 2567 8.55 0.17 8.54 8.55 0.18 7.88
## Oxygen Filler 26 2559 0.05 0.05 0.03 0.04 0.02 0.00
## Bowl Setpoint 27 2569 109.33 15.30 120.00 111.35 0.00 70.00
## Pressure Setpoint 28 2559 47.62 2.04 46.00 47.60 0.00 44.00
## Air Pressurer 29 2571 142.83 1.21 142.60 142.58 0.59 140.80
## Alch Rel 30 2562 6.90 0.51 6.56 6.84 0.06 5.28
## Carb Rel 31 2561 5.44 0.13 5.40 5.43 0.12 4.96
## Balling Lvl 32 2570 2.05 0.87 1.48 1.98 0.21 0.00
## max range skew kurtosis se
## Carb Volume 5.70 0.66 0.39 -0.47 0.00
## Fill Ounces 24.32 0.69 -0.02 0.86 0.00
## PC Volume 0.48 0.40 0.34 0.67 0.00
## Carb Pressure 79.40 22.40 0.18 -0.01 0.07
## Carb Temp 154.00 25.40 0.25 0.24 0.08
## PSC 0.27 0.27 0.85 0.65 0.00
## PSC Fill 0.62 0.62 0.93 0.77 0.00
## PSC CO2 0.24 0.24 1.73 3.73 0.00
## Mnf Flow 229.40 329.60 0.00 -1.87 2.36
## Carb Pressure1 140.20 34.60 0.05 0.14 0.09
## Fill Pressure 60.40 25.80 0.55 1.41 0.06
## Hyd Pressure1 58.00 58.80 0.78 -0.14 0.25
## Hyd Pressure2 59.40 59.40 -0.30 -1.56 0.32
## Hyd Pressure3 50.00 51.20 -0.32 -1.57 0.32
## Hyd Pressure4 142.00 90.00 0.55 0.63 0.26
## Filler Level 161.20 105.40 -0.85 0.05 0.31
## Filler Speed 4030.00 3032.00 -2.87 6.71 15.37
## Temperature 76.20 12.60 2.39 10.16 0.03
## Usage cont 25.90 13.82 -0.54 -1.02 0.06
## Carb Flow 5104.00 5078.00 -0.99 -0.58 21.18
## Density 1.92 1.68 0.53 -1.20 0.01
## MFR 868.60 837.20 -5.09 30.46 1.52
## Balling 4.01 4.18 0.59 -1.39 0.02
## Pressure Vacuum -3.60 3.00 0.53 -0.03 0.01
## PH 9.36 1.48 -0.29 0.06 0.00
## Oxygen Filler 0.40 0.40 2.66 11.09 0.00
## Bowl Setpoint 140.00 70.00 -0.97 -0.06 0.30
## Pressure Setpoint 52.00 8.00 0.20 -1.60 0.04
## Air Pressurer 148.20 7.40 2.25 4.73 0.02
## Alch Rel 8.62 3.34 0.88 -0.85 0.01
## Carb Rel 6.06 1.10 0.50 -0.29 0.00
## Balling Lvl 3.66 3.66 0.59 -1.49 0.02
## Warning: Removed 724 rows containing non-finite values (`stat_boxplot()`).
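The box plots and the warning above can be reproduced in miniature as follows. This is a sketch under assumptions: the toy data, the object names, and the choice of `tidyr::pivot_longer()` plus a faceted `geom_boxplot()` are illustrative and may differ from the chunk the report actually ran.

```r
library(tidyr)
library(ggplot2)

# Toy numeric data with a few missing cells (hypothetical values).
toy <- data.frame(
  `Carb Volume` = c(5.34, 5.43, NA, 5.44),
  MFR = c(725, NA, 735, NA),
  check.names = FALSE
)

# Reshape to long format: one row per (predictor, value) cell.
long <- pivot_longer(toy, cols = everything(),
                     names_to = "predictor", values_to = "value")

# ggplot2 drops one row per non-finite cell, so this is the number
# of rows the "Removed ... rows" warning would report.
n_dropped <- sum(!is.finite(long$value))

# One box plot per predictor, each panel on its own scale.
p <- ggplot(long, aes(x = "", y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free")
```

In the report, the 724 rows removed equal the sum of the per-column missing counts across the 32 numeric columns, which is consistent with the plot having been drawn from a long-format table like this one.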
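Next we plot the density curve of each predictor. As we can see, the distributions follow quite a few different overall patterns: some are roughly symmetric, while others, such as MFR (skew -5.09), Filler Speed (skew -2.87), and Oxygen Filler (skew 2.66), are heavily skewed.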
## Warning: Removed 724 rows containing non-finite values (`stat_density()`).
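One way to put a number on the different shapes seen in the density curves is to compute a skewness value per column. The sketch below uses the plain moment-based definition; the `skewness()` helper, the toy data, and the cutoff of 1 are assumptions for illustration, and the `skew` column printed earlier may use a slightly different small-sample correction.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5. NAs are dropped.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy columns (hypothetical): symmetric, right-skewed, left-skewed.
toy <- data.frame(
  sym   = c(1, 2, 3, 4, 5),
  right = c(1, 1, 1, 2, 10),
  left  = c(-10, -2, -1, -1, -1)
)

skews <- sapply(toy, skewness)

# Flag columns whose absolute skewness exceeds 1 (a rule of thumb,
# not a threshold taken from the report).
heavy <- names(skews)[abs(skews) > 1]
```

Columns flagged this way are candidates for a transformation before fitting models that are sensitive to skewed inputs.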
Next we count the missing values in each column. For Brand Code we have already replaced the missing entries with the literal string “NA”, so it does not appear in this count. Please see below for the details. One thing that catches our attention is that MFR has a high missing count (212), while every other predictor has fewer than 100 missing values.
Let's visualize the counts to see this better. As we can see, MFR has the most missing values, 8.25% of its rows.
Can other predictors replace MFR? It does not appear to be highly correlated with PH or with the other predictors. We may therefore consider replacing the missing values in this column with the median value that we found earlier.
## Carb Volume Fill Ounces PC Volume Carb Pressure
## 10 38 39 27
## Carb Temp PSC PSC Fill PSC CO2
## 26 33 23 39
## Mnf Flow Carb Pressure1 Fill Pressure Hyd Pressure1
## 2 32 22 11
## Hyd Pressure2 Hyd Pressure3 Hyd Pressure4 Filler Level
## 15 15 30 20
## Filler Speed Temperature Usage cont Carb Flow
## 57 14 5 2
## Density MFR Balling Pressure Vacuum
## 1 212 1 0
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## 4 12 2 12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## 0 9 10 1
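The per-column counts above, and the 8.25% figure quoted for MFR, can be computed along these lines. The toy data and object names are assumptions; only the 212 and 2571 in the last step come from the report.

```r
# Toy data with missing cells (hypothetical values).
toy <- data.frame(
  MFR = c(725, NA, 735, NA),
  PH = c(8.36, 8.26, NA, 8.24),
  `Air Pressurer` = c(143, 143, 142, 146),
  check.names = FALSE
)

# Missing count and percentage per column.
na_count <- colSums(is.na(toy))
na_pct <- round(100 * na_count / nrow(toy), 2)

# Tidy table, sorted so the worst column comes first.
missing_tbl <- data.frame(
  column = names(na_count),
  n_missing = as.integer(na_count),
  pct_missing = as.numeric(na_pct),
  stringsAsFactors = FALSE
)
missing_tbl <- missing_tbl[order(-missing_tbl$n_missing), ]

# Arithmetic behind the report's figure: 212 missing out of 2571 rows.
mfr_pct <- round(100 * 212 / 2571, 2)
```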
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the naniar package.
## Please report the issue at <https://github.com/njtierney/naniar/issues>.
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the visdat package.
## Please report the issue at <https://github.com/ropensci/visdat/issues>.
##
## Pearson's product-moment correlation
##
## data: dfPH_Num$PH and mfr_flag
## t = 0.67288, df = 2565, p-value = 0.5011
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02541582 0.05194583
## sample estimates:
## cor
## 0.01328488
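The test above correlates PH with `mfr_flag`. A plausible reading, sketched below, is that `mfr_flag` marks the rows where MFR is missing, so the test asks whether missingness in MFR is related to PH. That definition is an assumption, as are the toy data and the `MFR_imputed` column used to illustrate median replacement.

```r
set.seed(1)
n <- 200

# Toy data (hypothetical): PH and MFR drawn independently.
toy <- data.frame(
  PH = rnorm(n, mean = 8.5, sd = 0.17),
  MFR = rnorm(n, mean = 704, sd = 74)
)
toy$MFR[sample(n, 20)] <- NA

# 1 where MFR is missing, 0 otherwise (assumed definition of the flag).
mfr_flag <- as.integer(is.na(toy$MFR))

# Pearson test of association between PH and the missingness flag.
flag_test <- cor.test(toy$PH, mfr_flag)

# Median imputation: fill only the missing cells, keep observed values.
mfr_median <- median(toy$MFR, na.rm = TRUE)
toy$MFR_imputed <- ifelse(is.na(toy$MFR), mfr_median, toy$MFR)
```

In the report's output the estimated correlation is about 0.013 with a p-value of 0.50, which gives no evidence that PH differs between rows with and without MFR. The 2565 degrees of freedom correspond to the 2567 rows with a non-missing PH, minus 2.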
From the corrplot, we can see a strong correlation among Balling, Balling Lvl, and Density. We have therefore chosen to remove Balling, since keeping Balling Lvl and Density should be enough for the later steps.
## corrplot 0.92 loaded
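The reasoning behind dropping Balling can be checked numerically by listing the predictor pairs whose correlation exceeds a cutoff. The sketch below is illustrative only: the toy data are simulated so that three columns move together, and the 0.9 cutoff is an assumption rather than a value taken from the report.

```r
set.seed(2)
n <- 300
density <- runif(n, 0.8, 1.8)

# Toy data (hypothetical): two columns are near-linear in Density.
toy <- data.frame(
  Density = density,
  Balling = 2.5 * density - 0.7 + rnorm(n, sd = 0.02),
  `Balling Lvl` = 2.4 * density - 0.8 + rnorm(n, sd = 0.02),
  Temperature = rnorm(n, mean = 66, sd = 1.4),
  check.names = FALSE
)

# Pairwise correlations, tolerating missing values if present.
cm <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) where |r| exceeds the cutoff.
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  a = rownames(cm)[high[, "row"]],
  b = colnames(cm)[high[, "col"]],
  r = cm[high],
  stringsAsFactors = FALSE
)

# Drop one member of the correlated group, as the report does.
reduced <- toy[, setdiff(names(toy), "Balling"), drop = FALSE]
```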
We now test our data to determine the best predictive model.
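Each model below is scored on a held-out test set with RMSE, R squared, and MAE. A minimal sketch of that evaluation step follows. The helper `test_metrics()`, the base-R random split, and the 80/20 proportion are assumptions; the report's 2058 training rows are roughly 80% of 2571, but the exact split method is not shown. R squared is defined here as the squared correlation between predictions and observations, which is also an assumption about how the report's figures were computed.

```r
# RMSE, squared-correlation R squared, and MAE for a set of predictions.
test_metrics <- function(pred, obs) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE = mean(abs(obs - pred))
  )
}

# Random 80/20 split of row indices (hypothetical sample size).
set.seed(3)
n <- 100
train_idx <- sample(n, size = round(0.8 * n))
test_idx <- setdiff(seq_len(n), train_idx)

# Tiny worked example of the metrics.
m <- test_metrics(pred = c(1, 2, 3), obs = c(1, 2, 4))
```

Scoring every model with the same helper on the same test rows keeps the comparison at the end of the section like-for-like.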
1.) Lasso
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-7
##
## Call: glmnet(x = x, y = yTrain, alpha = 1, lambda = lambda1, preProc = c("center", "scale"))
##
## Df %Dev Lambda
## 1 33 41.38 0.0003475
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1397969 0.3980719 0.1097855
2.) Ridge
##
## Call: glmnet(x = x, y = yTrain, alpha = 0, lambda = lambda1, preProc = c("center", "scale"))
##
## Df %Dev Lambda
## 1 34 40.9 0.007664
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1405160 0.3938785 0.1104329
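The two penalized fits above differ only in `alpha`: 1 gives the lasso penalty and 0 gives the ridge penalty. The sketch below shows that contrast on simulated data, choosing lambda by cross-validation with `cv.glmnet()`. The data, object names, and the use of `lambda.min` are assumptions; the report's own lambda values may have been chosen differently.

```r
library(glmnet)

set.seed(4)
n <- 200
p <- 10

# Simulated predictors; only the first two carry signal.
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 8.5 + 0.1 * x[, 1] - 0.05 * x[, 2] + rnorm(n, sd = 0.1)

# 10-fold cross-validation over the lambda path for each penalty.
# glmnet standardizes the predictors by default (standardize = TRUE).
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)

# Count non-zero slope coefficients (intercept excluded).
n_nonzero <- function(cv) {
  b <- as.numeric(as.matrix(coef(cv, s = "lambda.min")))
  sum(b[-1] != 0)
}
lasso_df <- n_nonzero(cv_lasso)
ridge_df <- n_nonzero(cv_ridge)

# Predictions at the selected lambda.
pred <- as.numeric(predict(cv_lasso, newx = x, s = "lambda.min"))
```

Lasso can set coefficients exactly to zero while ridge only shrinks them, which matches the `Df` column in the output above: 33 for the lasso fit and 34 for the ridge fit.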
3.) PLS
## Partial Least Squares
##
## 2058 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1852, 1852, 1852, 1853, 1852, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1482391 0.2455820 0.1177220
## 2 0.1388578 0.3384327 0.1089266
## 3 0.1367599 0.3580082 0.1077355
## 4 0.1354618 0.3698751 0.1065667
## 5 0.1343346 0.3813681 0.1049822
## 6 0.1338006 0.3862363 0.1040038
## 7 0.1333593 0.3903250 0.1037876
## 8 0.1331776 0.3924393 0.1030584
## 9 0.1330803 0.3930953 0.1029091
## 10 0.1330278 0.3934436 0.1027157
## 11 0.1329662 0.3938820 0.1026727
## 12 0.1329566 0.3939496 0.1026549
## 13 0.1330019 0.3935102 0.1027165
## 14 0.1330440 0.3931957 0.1026966
## 15 0.1330988 0.3928171 0.1026961
## 16 0.1331753 0.3921624 0.1027848
## 17 0.1331763 0.3921296 0.1028178
## 18 0.1331805 0.3921171 0.1027878
## 19 0.1331837 0.3920940 0.1027788
## 20 0.1331783 0.3921365 0.1027708
## 21 0.1331724 0.3921956 0.1027637
## 22 0.1331679 0.3922377 0.1027554
## 23 0.1331652 0.3922493 0.1027552
## 24 0.1331655 0.3922448 0.1027556
## 25 0.1331656 0.3922442 0.1027561
## 26 0.1331643 0.3922535 0.1027552
## 27 0.1331634 0.3922607 0.1027556
## 28 0.1331627 0.3922663 0.1027548
## 29 0.1331627 0.3922659 0.1027547
## 30 0.1331630 0.3922640 0.1027549
## 31 0.1331628 0.3922651 0.1027548
## 32 0.1331628 0.3922646 0.1027549
## 33 0.1331628 0.3922647 0.1027550
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 12.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1399521 0.3960262 0.1098523
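The PLS output above has the shape of a `caret::train()` fit tuned over `ncomp` with 10-fold cross-validation. A minimal sketch of such a call on simulated data is shown below. The data, object names, and `tuneLength` are assumptions, and the `pls` package must be installed for `method = "pls"` to work.

```r
library(caret)

set.seed(5)
n <- 150
p <- 6

# Simulated predictors and response (hypothetical).
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("x", 1:p)
y <- 8.5 + 0.1 * x$x1 - 0.05 * x$x2 + rnorm(n, sd = 0.1)

# 10-fold cross-validation, as in the output above.
ctrl <- trainControl(method = "cv", number = 10)

# Center and scale inside resampling, and tune the number of components.
pls_fit <- train(
  x = x, y = y,
  method = "pls",
  preProcess = c("center", "scale"),
  tuneLength = 5,
  trControl = ctrl,
  metric = "Rsquared"
)

best_ncomp <- pls_fit$bestTune$ncomp
```

In the report, selecting on R squared picks `ncomp = 12`, which is also the row with the smallest cross-validated RMSE, so either metric would have led to the same model.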
4.) KNN
## k-Nearest Neighbors
##
## 2058 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2058, 2058, 2058, 2058, 2058, 2058, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1314645 0.4381820 0.09581417
## 7 0.1291734 0.4470027 0.09518601
## 9 0.1281588 0.4505701 0.09521960
## 11 0.1280463 0.4496276 0.09545292
## 13 0.1282553 0.4469070 0.09581402
## 15 0.1287975 0.4418528 0.09650971
## 17 0.1291531 0.4383839 0.09708308
## 19 0.1293592 0.4367300 0.09751506
## 21 0.1296778 0.4341361 0.09788488
## 23 0.1300428 0.4311397 0.09830290
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.13033381 0.47520537 0.09905783
5.) Neural Net
Decay = .04, size = 3.
## a 34-3-1 network with 109 weights
## options were - linear output units decay=0.04
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1303556 0.4777951 0.0979910
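The "34-3-1 network with 109 weights" line can be checked by hand. For a single-hidden-layer network with bias terms and no skip-layer connections (an assumption about how the model was configured), each hidden unit has one weight per input plus a bias, and each output unit has one weight per hidden unit plus a bias. The helper name `n_weights()` is hypothetical.

```r
# Weight count for a single-hidden-layer network with biases
# and no direct input-to-output (skip-layer) connections.
n_weights <- function(inputs, hidden, outputs = 1) {
  (inputs + 1) * hidden + (hidden + 1) * outputs
}

# 34 inputs, 3 hidden units, 1 output: (34 + 1) * 3 + (3 + 1) * 1
total <- n_weights(34, 3, 1)
```

Here `total` evaluates to 109, matching the printed network summary.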
6.) MARS
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
## The following object is masked from 'package:psych':
##
## rescale
## Loading required package: TeachingDemos
##
## Attaching package: 'TeachingDemos'
## The following objects are masked from 'package:Hmisc':
##
## cnvrt.coords, subplot
## Selected 45 of 55 terms, and 21 of 34 predictors
## Termination condition: Reached nk 69
## Importance: Mnf.Flow, Brand.Code_C, Usage.cont, Alch.Rel, Temperature, ...
## Number of terms at each degree of interaction: 1 44 (additive model)
## GCV 0.01526276 RSS 28.75273 GRSq 0.4743331 RSq 0.5183479
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1312462 0.4702532 0.1022397
7.) and 8.) SVM (radial and linear)
## Support Vector Machines with Radial Basis Function Kernel
##
## 2058 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1853, 1852, 1852, 1852, 1853, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1252827 0.4690902 0.09295133
## 0.50 0.1224433 0.4899658 0.09009948
## 1.00 0.1194835 0.5126329 0.08738536
##
## Tuning parameter 'sigma' was held constant at a value of 0.01990335
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01990335 and C = 1.
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.12273749 0.53764160 0.09092678
## [1] ""
## [1] "___________________________________________"
## [1] ""
## Support Vector Machines with Linear Kernel
##
## 2058 samples
## 34 predictor
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1852, 1852, 1852, 1852, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1345371 0.3828227 0.1019708
##
## Tuning parameter 'C' was held constant at a value of 1
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.1411874 0.3836443 0.1082657
9.) Random Forest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:psych':
##
## outlier
##
## Call:
## randomForest(formula = PH ~ ., data = dfTrain, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 0.009447109
## % Var explained: 67.43
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.10117603 0.70733568 0.07546305
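The random forest call above leaves `mtry` at its default, which for regression in the `randomForest` package is `max(floor(p / 3), 1)`; with 34 predictors that gives the 11 variables per split shown in the output. The sketch below fits a small forest on simulated data and extracts permutation importance. The data, object names, and `ntree = 200` are assumptions for illustration.

```r
library(randomForest)

set.seed(6)
n <- 200

# Simulated data (hypothetical): x1 and x2 drive the response.
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n)
)
toy$PH <- 8.5 + 0.2 * toy$x1 + 0.1 * toy$x2^2 + rnorm(n, sd = 0.05)

# Default mtry for regression forests.
default_mtry <- function(p) max(floor(p / 3), 1)

rf_fit <- randomForest(PH ~ ., data = toy, importance = TRUE, ntree = 200)

# Permutation importance (%IncMSE), sorted from most to least important.
imp <- importance(rf_fit, type = 1)
top <- rownames(imp)[order(-imp[, 1])]
```

Setting `importance = TRUE` at fit time is what makes the permutation-based measure available afterwards.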
10.) GBM
Depth = 4, ntrees = 10000, shrinkage = .007
## Loaded gbm 2.1.8.1
## gbm(formula = PH ~ ., distribution = "gaussian", data = dfTrain,
## n.trees = 10000, interaction.depth = 4, shrinkage = 0.007,
## cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 9999.
## There were 34 predictors of which 34 had non-zero influence.
## [1] "Number of trees: 10000"
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.10856903 0.63655965 0.08162703
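The output above reports the best cross-validation iteration as 9999 out of 10000, which is essentially the cap. That can indicate the error was still falling when training stopped, so more trees or a larger shrinkage might help. The sketch below shows one way to check for this with `gbm.perf()`. The simulated data, the smaller settings, and the 99% rule are assumptions for illustration.

```r
library(gbm)

set.seed(7)
n <- 300

# Simulated data (hypothetical).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$PH <- 8.5 + 0.2 * toy$x1 - 0.1 * toy$x2 + rnorm(n, sd = 0.05)

max_trees <- 500

gbm_fit <- gbm(
  PH ~ ., distribution = "gaussian", data = toy,
  n.trees = max_trees, interaction.depth = 4, shrinkage = 0.05,
  cv.folds = 5, verbose = FALSE, n.cores = 1
)

# Iteration with the lowest cross-validated error.
best_iter <- gbm.perf(gbm_fit, method = "cv", plot.it = FALSE)

# TRUE when the optimum sits at (or very near) the tree limit.
hit_cap <- best_iter >= 0.99 * max_trees

# Predict with the selected number of trees rather than all of them.
pred <- predict(gbm_fit, newdata = toy, n.trees = best_iter)
```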
11.) Cubist
##
## Call:
## cubist.default(x = dfTrain2, y = yTrain)
##
## Number of samples: 2058
## Number of predictors: 34
##
## Number of committees: 1
## Number of rules: 35
## [1] "Test Metrics:"
## RMSE Rsquared MAE
## 0.12644448 0.52371190 0.08919304
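Most of the models have a low R squared on the test set. Among the stronger candidates listed below, only Random Forest is good enough to use: it has the highest R squared (0.707) and the lowest RMSE (0.101). The radial SVM (RMSE 0.1227, R squared 0.5376) scores slightly better than Cubist, but still well short of Random Forest.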
Model          RMSE        Rsquared
Random Forest  0.10117603  0.70733568
GBM            0.10856903  0.63655965
Cubist         0.12644448  0.52371190
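To compare all eleven models at once, the test metrics printed in the sections above can be collected into one table and sorted. The numbers below are copied from those outputs; the object names are assumptions.

```r
# Test-set metrics as reported in each model's section above.
results <- data.frame(
  model = c("Lasso", "Ridge", "PLS", "KNN", "Neural Net", "MARS",
            "SVM (radial)", "SVM (linear)", "Random Forest", "GBM", "Cubist"),
  RMSE = c(0.1397969, 0.1405160, 0.1399521, 0.13033381, 0.1303556,
           0.1312462, 0.12273749, 0.1411874, 0.10117603, 0.10856903,
           0.12644448),
  Rsquared = c(0.3980719, 0.3938785, 0.3960262, 0.47520537, 0.4777951,
               0.4702532, 0.53764160, 0.3836443, 0.70733568, 0.63655965,
               0.52371190),
  stringsAsFactors = FALSE
)

# Rank from lowest to highest test RMSE.
ranked <- results[order(results$RMSE), ]
best_model <- ranked$model[1]
```

Sorted this way, Random Forest comes first and GBM second, with the radial SVM and Cubist close together behind them.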
## Overall
## Brand.Code_C 67.092661
## Mnf.Flow 61.079848
## Pressure.Vacuum 50.505536
## Oxygen.Filler 50.229076
## Usage.cont 43.240739
## Balling.Lvl 39.875260
## Temperature 39.405143
## Carb.Rel 37.363884
## Density 36.194307
## Air.Pressurer 35.210296
## Alch.Rel 35.046117
## Filler.Speed 34.832764
## Carb.Flow 29.246673
## Bowl.Setpoint 29.078899
## Filler.Level 26.632490
## Hyd.Pressure1 26.505321
## Carb.Pressure1 25.341874
## Hyd.Pressure3 24.296731
## Carb.Volume 23.146086
## MFR 19.734645
## Brand.Code_D 19.389483
## PC.Volume 19.215872
## Hyd.Pressure4 18.705485
## Fill.Pressure 17.920572
## Brand.Code_A 17.663367
## Hyd.Pressure2 17.180092
## Brand.Code_NA 16.289629
## Pressure.Setpoint 15.380911
## Fill.Ounces 6.077110
## Carb.Pressure 4.193779
## PSC 4.038923
## Carb.Temp 2.714383
## PSC.CO2 2.370301
## PSC.Fill 1.552322
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC PSC.Fill
## 1 5.480000 24.03333 0.2700000 65.4 134.6 0.236 0.40
## 2 5.393333 23.95333 0.2266667 63.2 135.0 0.042 0.22
## 3 5.293333 23.92000 0.3033333 66.4 140.4 0.068 0.10
## 4 5.266667 23.94000 0.1860000 64.8 139.0 0.004 0.20
## 5 5.406667 24.20000 0.1600000 69.4 142.2 0.040 0.30
## 6 5.286667 24.10667 0.2120000 73.4 147.2 0.078 0.22
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1 Hyd.Pressure2
## 1 0.04 -100 116.6 46.0 0 26.8
## 2 0.08 -100 118.8 46.2 0 0.0
## 3 0.02 -100 120.2 45.8 0 0.0
## 4 0.02 -100 124.8 40.0 0 0.0
## 5 0.06 -100 115.0 51.4 0 0.0
## 6 0.04 -100 118.6 46.4 0 0.0
## Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed Temperature Usage.cont
## 1 27.7 96 129.4 3986 66.0 21.66
## 2 0.0 112 120.0 4012 65.6 17.60
## 3 0.0 98 119.4 4010 65.6 24.18
## 4 0.0 132 120.2 3978 74.4 18.12
## 5 0.0 94 116.0 4018 66.4 21.32
## 6 0.0 94 120.4 4010 66.6 18.00
## Carb.Flow Density MFR Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 1 2950 0.88 727.6 -3.8 8.456119 0.0220 130
## 2 2916 1.50 735.8 -4.4 8.575456 0.0300 120
## 3 3056 0.90 734.8 -4.2 8.530313 0.0460 120
## 4 28 0.74 724.6 -4.0 8.503903 0.0337 120
## 5 3214 0.88 752.0 -4.0 8.581512 0.0820 120
## 6 3064 0.84 732.0 -3.8 8.571334 0.0640 120
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel Balling.Lvl Brand.Code_A
## 1 45.2 142.6 6.56 5.34 1.48 0
## 2 46.0 147.2 7.14 5.58 3.04 1
## 3 46.0 146.6 6.52 5.34 1.46 0
## 4 46.0 146.4 6.48 5.50 1.48 0
## 5 50.0 145.8 6.50 5.38 1.46 0
## 6 46.0 146.0 6.50 5.42 1.44 0
## Brand.Code_C Brand.Code_D Brand.Code_NA
## 1 0 1 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
write.csv(output, "predictions.csv")
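A small round-trip check can confirm that the exported file contains what is expected. The sketch below writes a toy `output` data frame to a temporary file and reads it back. The toy values and the temporary path are assumptions, and `row.names = FALSE` is a design choice that differs from the call above, which would also write the row numbers as an unnamed first column.

```r
# Toy predictions table (hypothetical values).
output <- data.frame(
  Carb.Volume = c(5.48, 5.39, 5.29),
  PH = c(8.456119, 8.575456, 8.530313)
)

# Write to a temporary file so the example leaves no stray output.
out_file <- file.path(tempdir(), "predictions.csv")
write.csv(output, out_file, row.names = FALSE)

# Read the file back and compare its shape with the original.
check <- read.csv(out_file)
same_shape <- identical(dim(check), dim(output))
```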
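Most of the predicted PH values fall between 8 and 9, with the rows shown above clustering near 8.5. The predictions come from our best model, the random forest, which has the highest R squared and the lowest RMSE on the test set.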