• 1 Project Problem Statement
  • 2 Executive Summary
  • 3 Research Statement
  • 4 Data Collection
  • 5 Data Ingestion
  • 6 Data Preparation and EDA (Exploratory Data Analysis)
    • 6.1 Zero variance
    • 6.2 Box plot
    • 6.3 Normality
  • 7 Data Preprocessing
  • 8 Model Building
    • 8.1 Test Train Split
    • 8.2 Modeling Technique and Approach
    • 8.3 MARS
    • 8.4 RandomForest
    • 8.5 Cubist
    • 8.6 XGB Trees
    • 8.7 XGB Dart
  • 9 Model Evaluation And Model Selection
    • 9.1 Variable Importances
    • 9.2 Test Model Stack
    • 9.3 Model Selection
    • 9.4 Predict on Eval Set
  • 10 Conclusion
  • 11 References

1 Project Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

2 Executive Summary

New Regulations by ABC beverage company leadership requires the company’s production unit to better understand the manufacturing process, the predictive factors and their relationship to the PH of the beverages.

3 Research Statement

The research is an effort to find the predictive variables related to the ph of the beverages and build the predictive model for ph of beverages

4 Data Collection

The data set is a historic data containing predictors associated to the PH and is provided in an excel file. We will utilize this historic dataset to analyze and predict the PH of beverages. Two excel files are provided:

  • The training data (StudentData.xlsx)
  • The test data (StudentEvaluation.xlsx).
student_evaluation_gh_file <- "https://github.com/hovig/Team5-Data624-Project2/raw/master/StudentEvaluation.xlsx"
student_data_gh_file <- "https://github.com/hovig/Team5-Data624-Project2/raw/master/StudentData.xlsx"
se_temp_file <- tempfile(fileext = ".xlsx")
sd_temp_file <- tempfile(fileext = ".xlsx")
se_data <- GET(student_evaluation_gh_file, authenticate(Sys.getenv("GITHUB_PAT"), ""), write_disk(path = se_temp_file))
df_data <- GET(student_data_gh_file, authenticate(Sys.getenv("GITHUB_PAT"), ""), write_disk(path = sd_temp_file))
se_data <- readxl::read_excel(se_temp_file)
sd_data <- readxl::read_excel(sd_temp_file)
dim(sd_data)
## [1] 2571   33
colnames(sd_data)
##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"
str(sd_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2571 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : num  4002 3986 4020 4012 4010 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num  2932 3144 2914 3062 3054 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
dim(se_data)
## [1] 267  33
colnames(se_data)
##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"
str(se_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    267 obs. of  33 variables:
##  $ Brand Code       : chr  "D" "A" "B" "B" ...
##  $ Carb Volume      : num  5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC Volume        : num  0.27 0.227 0.303 0.186 0.16 ...
##  $ Carb Pressure    : num  65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
##  $ Carb Temp        : num  135 135 140 139 142 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC Fill         : num  0.4 0.22 0.1 0.2 0.3 ...
##  $ PSC CO2          : num  0.04 0.08 0.02 0.02 0.06 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  117 119 120 125 115 ...
##  $ Fill Pressure    : num  46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num  96 112 98 132 94 94 108 108 110 106 ...
##  $ Filler Level     : num  129 120 119 120 116 ...
##  $ Filler Speed     : num  3986 4012 4010 NA 4018 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
##  $ Usage cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb Flow        : num  2950 2916 3056 28 3214 ...
##  $ Density          : num  0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
##  $ MFR              : num  728 736 735 NA 752 ...
##  $ Balling          : num  1.4 2.94 1.45 1.06 1.4 ...
##  $ Pressure Vacuum  : num  -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ Oxygen Filler    : num  0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl Setpoint    : num  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  45.2 46 46 46 50 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 147 147 146 146 ...
##  $ Alch Rel         : num  6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling Lvl      : num  1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...

5 Data Ingestion

bev_raw <- read_csv('https://raw.githubusercontent.com/hovig/Team5-Data624-Project2/master/StudentData.csv') 
head(bev_raw)
## # A tibble: 6 x 33
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # … with 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC
## #   Fill` <dbl>, `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb
## #   Pressure1` <dbl>, `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd
## #   Pressure2` <dbl>, `Hyd Pressure3` <dbl>, `Hyd Pressure4` <int>,
## #   `Filler Level` <dbl>, `Filler Speed` <int>, Temperature <dbl>, `Usage
## #   cont` <dbl>, `Carb Flow` <int>, Density <dbl>, MFR <dbl>,
## #   Balling <dbl>, `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen
## #   Filler` <dbl>, `Bowl Setpoint` <int>, `Pressure Setpoint` <dbl>, `Air
## #   Pressurer` <dbl>, `Alch Rel` <dbl>, `Carb Rel` <dbl>, `Balling
## #   Lvl` <dbl>

6 Data Preparation and EDA (Exploratory Data Analysis)

bev_raw %>%
  ggplot(aes(PH, fill=PH > 9)) + 
  geom_histogram(bins=30) +
  theme_bw() +
  theme(legend.position='center') +
  labs(y='Count', title='PH Levels in Dataset')

bev_raw <- bev_raw %>% 
  filter(!is.na(bev_raw$PH), bev_raw$PH < 9) 
dim(bev_raw)
## [1] 2566   33
str(bev_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2566 obs. of  33 variables:
##  $ Brand Code       : chr  "B" "A" "B" "A" ...
##  $ Carb Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num  119 122 120 115 118 ...
##  $ Fill Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num  121 119 120 118 119 ...
##  $ Filler Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num  143 143 142 146 146 ...
##  $ Alch Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 33
##   .. ..$ Brand Code       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Carb Volume      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Fill Ounces      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PC Volume        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Carb Pressure    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Carb Temp        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PSC              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PSC Fill         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PSC CO2          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Mnf Flow         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Carb Pressure1   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Fill Pressure    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Hyd Pressure1    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Hyd Pressure2    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Hyd Pressure3    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Hyd Pressure4    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Filler Level     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Filler Speed     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Temperature      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Usage cont       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Carb Flow        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Density          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ MFR              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Balling          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Pressure Vacuum  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PH               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Oxygen Filler    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Bowl Setpoint    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Pressure Setpoint: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Air Pressurer    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Alch Rel         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Carb Rel         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Balling Lvl      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
hist.data.frame(bev_raw)

table(bev_raw$`Brand Code`)
## 
##    A    B    C    D 
##  293 1235  303  615
summary(bev_raw)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2566        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23933  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27719  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08468   Mean   :0.1953  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  70.70   Median :123.2   Median :46.40  
##  Mean   :0.05645   Mean   :  24.68   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39                          NA's   :32      NA's   :17     
##  Hyd Pressure1   Hyd Pressure2  Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.0   Min.   :-1.20   Min.   : 62.00  
##  1st Qu.: 0.00   1st Qu.: 0.0   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.6   Median :27.60   Median : 96.00  
##  Mean   :12.46   Mean   :21.0   Mean   :20.49   Mean   : 96.31  
##  3rd Qu.:20.20   3rd Qu.:34.6   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.4   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15     NA's   :15      NA's   :27      
##   Filler Level    Filler Speed   Temperature      Usage cont   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.38  
##  Median :118.4   Median :3982   Median :65.60   Median :21.80  
##  Mean   :109.3   Mean   :3688   Mean   :65.96   Mean   :21.00  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.76  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90  
##  NA's   :15      NA's   :53     NA's   :11      NA's   :5      
##    Carb Flow       Density           MFR           Balling     
##  Min.   :  26   Min.   :0.300   Min.   : 31.4   Min.   :0.346  
##  1st Qu.:1170   1st Qu.:0.900   1st Qu.:706.3   1st Qu.:1.496  
##  Median :3030   Median :0.980   Median :724.0   Median :1.648  
##  Mean   :2473   Mean   :1.175   Mean   :704.0   Mean   :2.201  
##  3rd Qu.:3188   3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.:3.292  
##  Max.   :5104   Max.   :1.920   Max.   :868.6   Max.   :4.012  
##  NA's   :2                      NA's   :207                    
##  Pressure Vacuum        PH        Oxygen Filler     Bowl Setpoint  
##  Min.   :-6.600   Min.   :7.880   Min.   :0.00240   Min.   : 70.0  
##  1st Qu.:-5.600   1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0  
##  Median :-5.400   Median :8.540   Median :0.03340   Median :120.0  
##  Mean   :-5.216   Mean   :8.545   Mean   :0.04635   Mean   :109.3  
##  3rd Qu.:-5.000   3rd Qu.:8.680   3rd Qu.:0.05960   3rd Qu.:120.0  
##  Max.   :-3.600   Max.   :8.940   Max.   :0.40000   Max.   :140.0  
##                                   NA's   :11        NA's   :2      
##  Pressure Setpoint Air Pressurer      Alch Rel        Carb Rel    
##  Min.   :44.00     Min.   :140.8   Min.   :5.280   Min.   :4.960  
##  1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340  
##  Median :46.00     Median :142.6   Median :6.560   Median :5.400  
##  Mean   :47.61     Mean   :142.8   Mean   :6.898   Mean   :5.437  
##  3rd Qu.:50.00     3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.550  
##  Max.   :52.00     Max.   :148.2   Max.   :8.620   Max.   :6.060  
##  NA's   :12                        NA's   :6       NA's   :7      
##   Balling Lvl   
##  Min.   :0.000  
##  1st Qu.:1.380  
##  Median :1.480  
##  Mean   :2.052  
##  3rd Qu.:3.140  
##  Max.   :3.660  
##  NA's   :1
describe(bev_raw %>% select(-`Brand Code`))
##                   vars    n    mean      sd  median trimmed    mad     min
## Carb Volume          1 2556    5.37    0.11    5.35    5.37   0.11    5.04
## Fill Ounces          2 2528   23.97    0.09   23.97   23.98   0.08   23.63
## PC Volume            3 2527    0.28    0.06    0.27    0.27   0.05    0.08
## Carb Pressure        4 2539   68.19    3.54   68.20   68.12   3.56   57.00
## Carb Temp            5 2540  141.09    4.03  140.80  140.99   3.85  128.60
## PSC                  6 2533    0.08    0.05    0.08    0.08   0.05    0.00
## PSC Fill             7 2543    0.20    0.12    0.18    0.18   0.12    0.00
## PSC CO2              8 2527    0.06    0.04    0.04    0.05   0.03    0.00
## Mnf Flow             9 2566   24.68  119.50   70.70   21.20 160.57 -100.20
## Carb Pressure1      10 2534  122.56    4.72  123.20  122.52   4.45  105.60
## Fill Pressure       11 2549   47.92    3.18   46.40   47.71   2.37   34.60
## Hyd Pressure1       12 2555   12.46   12.43   11.40   10.87  16.90   -0.80
## Hyd Pressure2       13 2551   21.00   16.38   28.60   21.10  13.34    0.00
## Hyd Pressure3       14 2551   20.49   15.97   27.60   20.54  13.64   -1.20
## Hyd Pressure4       15 2539   96.31   13.10   96.00   95.46  11.86   62.00
## Filler Level        16 2551  109.25   15.70  118.40  111.04   9.19   55.80
## Filler Speed        17 2513 3688.11  769.63 3982.00 3920.25  47.44  998.00
## Temperature         18 2555   65.96    1.38   65.60   65.80   0.89   63.60
## Usage cont          19 2561   21.00    2.98   21.80   21.25   3.17   12.08
## Carb Flow           20 2564 2473.00 1069.56 3030.00 2604.91 323.21   26.00
## Density             21 2566    1.17    0.38    0.98    1.15   0.15    0.30
## MFR                 22 2359  704.05   73.90  724.00  718.16  15.42   31.40
## Balling             23 2566    2.20    0.93    1.65    2.13   0.37    0.35
## Pressure Vacuum     24 2566   -5.22    0.57   -5.40   -5.25   0.59   -6.60
## PH                  25 2566    8.55    0.17    8.54    8.55   0.18    7.88
## Oxygen Filler       26 2555    0.05    0.04    0.03    0.04   0.02    0.00
## Bowl Setpoint       27 2564  109.34   15.29  120.00  111.36   0.00   70.00
## Pressure Setpoint   28 2554   47.61    2.04   46.00   47.60   0.00   44.00
## Air Pressurer       29 2566  142.83    1.21  142.60  142.58   0.59  140.80
## Alch Rel            30 2560    6.90    0.51    6.56    6.84   0.06    5.28
## Carb Rel            31 2559    5.44    0.13    5.40    5.43   0.12    4.96
## Balling Lvl         32 2565    2.05    0.87    1.48    1.98   0.21    0.00
##                       max   range  skew kurtosis    se
## Carb Volume          5.70    0.66  0.39    -0.47  0.00
## Fill Ounces         24.32    0.69 -0.02     0.87  0.00
## PC Volume            0.48    0.40  0.35     0.67  0.00
## Carb Pressure       79.40   22.40  0.18    -0.01  0.07
## Carb Temp          154.00   25.40  0.24     0.23  0.08
## PSC                  0.27    0.27  0.85     0.65  0.00
## PSC Fill             0.62    0.62  0.93     0.77  0.00
## PSC CO2              0.24    0.24  1.73     3.71  0.00
## Mnf Flow           229.40  329.60  0.00    -1.87  2.36
## Carb Pressure1     140.20   34.60  0.03     0.10  0.09
## Fill Pressure       60.40   25.80  0.55     1.41  0.06
## Hyd Pressure1       58.00   58.80  0.78    -0.15  0.25
## Hyd Pressure2       59.40   59.40 -0.31    -1.56  0.32
## Hyd Pressure3       50.00   51.20 -0.32    -1.57  0.32
## Hyd Pressure4      142.00   80.00  0.56     0.61  0.26
## Filler Level       161.20  105.40 -0.85     0.05  0.31
## Filler Speed      4030.00 3032.00 -2.88     6.75 15.35
## Temperature         76.20   12.60  2.39    10.25  0.03
## Usage cont          25.90   13.82 -0.54    -1.02  0.06
## Carb Flow         5104.00 5078.00 -0.99    -0.57 21.12
## Density              1.92    1.62  0.54    -1.22  0.01
## MFR                868.60  837.20 -5.09    30.46  1.52
## Balling              4.01    3.67  0.60    -1.41  0.02
## Pressure Vacuum     -3.60    3.00  0.53    -0.03  0.01
## PH                   8.94    1.06 -0.33    -0.08  0.00
## Oxygen Filler        0.40    0.40  2.41     8.89  0.00
## Bowl Setpoint      140.00   70.00 -0.97    -0.06  0.30
## Pressure Setpoint   52.00    8.00  0.20    -1.60  0.04
## Air Pressurer      148.20    7.40  2.25     4.72  0.02
## Alch Rel             8.62    3.34  0.88    -0.85  0.01
## Carb Rel             6.06    1.10  0.50    -0.30  0.00
## Balling Lvl          3.66    3.66  0.60    -1.51  0.02

6.1 Zero variance

To filter for near-zero variance predictors, the caret package function nearZeroVar will return the column numbers of any predictors that fulfill the conditions outlined. A zero variance predictor will never be chosen for a split since it offers no possible predictive information.

df1 <- bev_raw %>% select(-`Brand Code`) %>% mutate_each(funs(as.numeric(.)))%>%complete.cases()%>%
       as.data.frame()
names(df1)[nearZeroVar(df1)]
## character(0)
nzv <- nearZeroVar(bev_raw,saveMetrics= TRUE)
nzv[nzv$nzv,]
##               freqRatio percentUnique zeroVar  nzv
## Hyd Pressure1        31      9.547935   FALSE TRUE

6.2 Box plot

Box plots for the variables reveal, that besides having the outliers in the variables, most of the variables are skewed. For example: Variables density, carb flow, filler speed and oxygen filler are skewed providing us an opportunity to further check their distribution.

df.m <- melt(bev_raw %>% select(-MFR, -`Filler Speed`, -`Carb Flow`,-`Bowl Setpoint`,`Carb Pressure1`,
            `Hyd Pressure4`, `Air Pressurer`, `Carb Temp`, `Filler Level`, `Mnf Flow`), id.var = "Brand Code")
p <-ggplot(data = df.m, aes(x=variable, y=value)) + 
  geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4,aes(fill=variable)) +     
  scale_y_continuous(name = "Predictors for PH", breaks = seq(0, 2, 0.5))  + 
  coord_flip()
p

6.3 Normality

Normality is one of the most widely used technique to understand the continuous predictors. In the below plot we can see normal distribution behavior of the given dataset for different features.

bev_raw%>%
 select(-`Brand Code`) %>%   
  select(2:20) %>%         
  gather() %>%                            
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_density()  

bev_raw%>%
 select(-`Brand Code`) %>%   
  select(`Carb Flow`, Density, MFR, Balling, `Pressure Vacuum`, PH, `Oxygen Filler`, `Bowl Setpoint`, `Pressure Setpoint`, `Air Pressurer`, `Alch Rel`,`Carb Rel`,`Balling Lvl`)   %>%               
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_density() 

7 Data Preprocessing

We perform 3 data preparation steps - Remove the near-zero variance variables we previously mentioned - Impute missing values with a Random Forest Regression with the MICE package - Create dummy variables for the categorical variable, brand

Random Forest was chosen as the regression method in imputation because it requires very little tuning, allowing and there is no interface to tune the imputation model in MICE.

cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(42)
bev <- select(bev_raw, -PH) %>%
  union_all(se_data) #since we're not using the labels (y values), combining the sets for data clearning is fine
cols <- str_replace(colnames(bev), '\\s', '')
colnames(bev) <- cols
bev <- mutate(bev, BrandCode = ifelse(is.na(BrandCode), 'Unknown', BrandCode))
y <- bev_raw$PH
zero_vars <- nearZeroVar(bev[1:2566, ])
bev_new <- bev[, -zero_vars]
pred <- mice(data = bev_new, m = 3, method = 'rf', maxit = 3)
bev_imputed <- complete(pred)
form <- dummyVars(~ ., data = bev_imputed)
bev_imputed <- predict(form, bev_imputed) %>% data.frame() %>% as.matrix()

8 Model Building

8.1 Test Train Split

We will split the data into train/test sets before running cross validation. This will allow us to make sure that predictions are consistent at all levels of PH.

bev_eval <- bev_imputed[2567:2833,]
bev_train <- bev_imputed[1:2566,]
samples <- createDataPartition(y, p = .75, list = F)
x_train <- bev_train[samples, ]
x_test<- bev_train[-samples, ]
y_train <- y[samples]
y_test <- y[-samples]

8.2 Modeling Technique and Approach

For state-of-the-art prediction quality, we will use a model stack. This will consist of tuning models separately and then combining the candidate models into a metamodel that will formulate predictions with a linear combination of our tuned models.

From the perspective of understanding the manufacturing process, the model stack will also provide benefits. The stack is like a panel of experts, each looking at the data through slightly different lenses to form their diagnoses. By looking at the predictors each model uses, we can gather assemble a complete picture of the factors that affect our manufacturing process.

Four of the five models will be tree-based ensembles because this model type can handle non-linear relationships, is robust to outliers and skewed distributions, and can model complex interactions. One non-tree modle is included for divsersity in the stack.

xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
rf_grid <- expand.grid(mtry = 25)
tuning_list <-list(
  caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
  caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
  caretModelSpec(method="cubist", tuneGrid = cub_grid),
  caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
  caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
)
my_control <- trainControl(method = 'cv',
                           number = 5,
                           savePredictions = 'final',
                           index = createFolds(y_train, 5),
                           allowParallel = TRUE) 
mod_list <- caretList(
  x = x_train,
  y = y_train,
  preProcess = c('center', 'scale'),
  trControl = my_control,
  tuneList = tuning_list
)

The training grid is commented out because of the time required to run, but can be uncommented for reproducibility of our model optimization

8.3 MARS

#helper function
show_results <- function(model){ 
  rslts <- model$results %>%
    arrange(RMSE)
  head(rslts, 10) %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
}
show_results(mod_list$bagEarth)
degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
2 24 0.1452284 0.3442843 0.0974178 0.0224892 0.0981137 0.0023522
predict(mod_list$bagEarth, x_test) %>%
  postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.12291518 0.53214584 0.09257829

8.4 RandomForest

show_results(mod_list$rf)
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
25 0.1201781 0.508212 0.0916387 0.0022317 0.0173091 0.0018639
predict(mod_list$rf, x_test) %>%
  postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.09838320 0.71234554 0.06991311

8.5 Cubist

show_results(mod_list$cubist)
committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD MAESD
5 7 0.1230572 0.4793229 0.0922648 0.0026294 0.0169731 0.0013288
predict(mod_list$cubist, x_test) %>%
  postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.10363209 0.66533091 0.07230681

8.6 XGB Trees

show_results(mod_list$xgbTree)
eta nrounds max_depth gamma colsample_bytree min_child_weight subsample RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.01 1000 6 0 0.8 0.8 0.8 0.1211406 0.4911182 0.0917093 0.001546 0.0146323 0.001122
predict(mod_list$xgbTree, x_test) %>%
  postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.10330553 0.67674199 0.07522457

8.7 XGB Dart

show_results(mod_list$xgbDART)
eta nrounds gamma skip_drop rate_drop max_depth colsample_bytree min_child_weight subsample RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.01 1000 0.1 0.6 0.4 6 0.6 0.6 0.6 0.1296332 0.4451329 0.1028912 0.0017605 0.0167511 0.0013675
predict(mod_list$xgbDART, x_test) %>%
  postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.11611208 0.61036059 0.09047387

Assessment

While the XGBTee model offers the lowest RMSE, it still only explains about 50% of the variance in PH. There is plenty of room for the other models to offer value. We will use the hyperparameters yielding the best RMSE in our final ensemble that will be built later.

9 Model Evaluation And Model Selection

9.1 Variable Importances

p1 <- varImp(mod_list$cubist) %>%
  plot(top = 10, main = 'Cubist')
p2 <- varImp(mod_list$xgbTree) %>%
  plot(top = 10, main = 'XGBTrees')
p3 <- varImp(mod_list$xgbDART) %>%
  plot(top = 10, main = 'XGBDart')
p4 <- varImp(mod_list$bagEarth) %>%
  plot(top = 10, main = 'MARS')
p5 <- varImp(mod_list$rf) %>%
  plot(top = 10, main = 'RF')
grid.arrange(p1,p2,p3,p4,p5, nrow = 3, ncol = 2)

Observations:

  • The top predictor, MnFlow, is consistent throughout the set of models
  • The XGBoost models are similar in variables, but less similar in their distributions of importance
  • After the top 5, divergence increases

Based on what we see, there are more factors contributing to PH with the ensemble than with a single model. We’ll confirm this by looking at the count by predictor among each top 10:

Variable Counts among the Models

top_10 <- function(model) {
  varImp(model)$importance %>%
  rownames_to_column('var') %>%
  arrange(desc(Overall)) %>%
  head(10) %>%
  select(var) 
}
rbind(top_10(mod_list$xgbDART), top_10(mod_list$xgbTree), top_10(mod_list$cubist), top_10(mod_list$bagEarth), top_10(mod_list$rf)) %>%
  group_by(var) %>%
  summarise(ModelCount = n()) %>%
  arrange(desc(ModelCount)) %>%
  kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
var ModelCount
AirPressurer 5
AlchRel 5
MnfFlow 5
PressureVacuum 5
BrandCodeC 4
OxygenFiller 4
Temperature 4
BowlSetpoint 3
CarbRel 3
Usagecont 3
Balling 2
BallingLvl 2
HydPressure3 2
CarbFlow 1
Density 1
FillerSpeed 1

Best Single Tree

Because 4 of the 5 models in the ensemble are tree-based, looking a single decision tree model will give us an idea of not just what predictors are important, but how they are being used to form predictions.

rp_mod <- rpart::rpart(ph ~ ., data = data.frame(bev_train, ph = y), method = 'anova')
rpart.plot(rp_mod, uniform=TRUE)

9.2 Test Model Stack

Now that our models are optimized and understood, we will form the ensemble and test the results. We first look at their prediction correlations to confirm that the stack, intuitively, be effective. The lower the correlation between the models, the more effective the stack will be.

resamples <- resamples(mod_list)
modelCor(resamples) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
xgbTree xgbDART cubist rf bagEarth
xgbTree 1.0000000 0.0359329 0.5686530 0.9384567 0.2858379
xgbDART 0.0359329 1.0000000 0.4606575 0.1812077 0.2035874
cubist 0.5686530 0.4606575 1.0000000 0.4387601 0.0939014
rf 0.9384567 0.1812077 0.4387601 1.0000000 0.1553078
bagEarth 0.2858379 0.2035874 0.0939014 0.1553078 1.0000000

Our model list has negative correlations, indicating that it will be effective.

ensemble1 <- caretStack(
  mod_list,
  method="glmnet",
  metric="RMSE",
  trControl=trainControl(
    method="cv",
    number=5,
    savePredictions="final"
  )
)
ensemble1$ens_model$results %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.10 0.0002408 0.1163325 0.5291532 0.0869919 0.0024546 0.0235280 0.0018167
0.10 0.0024077 0.1164400 0.5283258 0.0873434 0.0022805 0.0230699 0.0017847
0.10 0.0240768 0.1174627 0.5216820 0.0888550 0.0020787 0.0230548 0.0018040
0.55 0.0002408 0.1163292 0.5291801 0.0870144 0.0024430 0.0234748 0.0018136
0.55 0.0024077 0.1167687 0.5257197 0.0879296 0.0021121 0.0225295 0.0017564
0.55 0.0240768 0.1180667 0.5238896 0.0897646 0.0021940 0.0226635 0.0018279
1.00 0.0002408 0.1163299 0.5291771 0.0870469 0.0024256 0.0233859 0.0018068
1.00 0.0024077 0.1170109 0.5238458 0.0883080 0.0021029 0.0223825 0.0018120
1.00 0.0240768 0.1195952 0.5235662 0.0915518 0.0022784 0.0222044 0.0018043

We see a several percent improvement over the best single model.

preds <- predict(ensemble1, x_test) 
preds %>% postResample(y_test)
##       RMSE   Rsquared        MAE 
## 0.09345040 0.73030404 0.06499146

Predicted Vs Actual

preds_df <- data.frame(predicted = preds, actual = y_test)
ggplot(preds_df) + geom_point(aes(x = predicted, y = actual)) + geom_smooth(aes(x = predicted, y = actual))

9.3 Model Selection

On the test set, the ensemble’s predictions are consistent throughout the range of PH. Consistent quality is, theoretically, a feature of ensemble models because the models are weighted such that predictions are maximized. If one model predicts better at lower PH, it will be upweighted over that range. There are several outliers, but when not all of the variance is explainable, this is unavoidable.

9.4 Predict on Eval Set

With our hyperparameters of the individual models optimized, we want to provide each model the entire training set and build the meta-model based on these optimized single models.

xgb_grid <- expand.grid(eta = c(.01), nrounds = c(1000), max_depth = c(6), gamma = c(0), colsample_bytree = c(.8), min_child_weight = c(.8), subsample = c(.8))
cub_grid <- expand.grid(.committees = c(5), .neighbors = c(7))
mars_grid <- expand.grid(.degree = c(2), .nprune = 24)
dart_grid <- expand.grid(eta = c(.01), nrounds = c(1000), gamma = c(.1), skip_drop = c(.6), rate_drop = c(.4), max_depth = c(6), colsample_bytree = c(.6), min_child_weight = c(.6), subsample = c(.6))
rf_grid <- expand.grid(mtry = 25)
tuning_list <-list(
  caretModelSpec(method="xgbTree", tuneGrid = xgb_grid),
  caretModelSpec(method="xgbDART", tuneGrid = dart_grid),
  caretModelSpec(method="cubist", tuneGrid = cub_grid),
  caretModelSpec(method="rf", tuneGrid = rf_grid, importance = TRUE),
  caretModelSpec(method="bagEarth", tuneGrid = mars_grid)
)
my_control <- trainControl(method = 'cv',
                           number = 5,
                           savePredictions = 'final',
                           index = createFolds(y, 5),
                           allowParallel = TRUE) 
mod_list <- caretList(
  x = bev_train,
  y = y,
  preProcess = c('center', 'scale'),
  trControl = my_control,
  tuneList = tuning_list
)
final_ensemble <- caretStack(
  mod_list,
  method="glmnet",
  metric="RMSE",
  trControl=trainControl(
    method="cv",
    number=5,
    savePredictions="final"
  )
)
final_ensemble$ens_model$results %>% kable () %>% kable_styling(bootstrap_options = "striped", full_width = F)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.10 0.0002475 0.1152983 0.5496251 0.0852560 0.0014370 0.0099183 0.0011410
0.10 0.0024746 0.1154572 0.5484537 0.0856576 0.0013007 0.0096406 0.0011113
0.10 0.0247464 0.1166370 0.5409228 0.0874485 0.0011971 0.0112153 0.0011077
0.55 0.0002475 0.1153058 0.5495683 0.0852897 0.0014150 0.0099203 0.0011290
0.55 0.0024746 0.1159065 0.5450050 0.0863737 0.0012400 0.0099376 0.0011478
0.55 0.0247464 0.1172433 0.5435591 0.0883593 0.0013210 0.0100228 0.0010285
1.00 0.0002475 0.1153181 0.5494758 0.0853263 0.0013953 0.0099031 0.0011202
1.00 0.0024746 0.1161103 0.5435126 0.0867137 0.0012249 0.0099660 0.0010985
1.00 0.0247464 0.1188490 0.5434207 0.0902560 0.0014307 0.0099946 0.0010162

With more data available, the ensemble improve around 6%. Our model is considerably more predictive than a “baseline” single model.

Ensemble Model Importances

plot(varImp(final_ensemble$ens_model))

The MARS model is excluded, but all four tree models contribute the the final prediction

Export Predictions

predictions <- predict(final_ensemble, bev_eval)
predictions %>% tibble::enframe(name = NULL) %>% datatable()
write.csv(predictions, file = "predictions.csv")

10 Conclusion

After working on extracting the data from the given files, we processed data cleansing and handle the missing values along with the NAs. We trained and tested the data 75% to 25% respectively. Our models were able to produce for us predicted values for pH which are also saved in predictions.csv file separately. We notice that all the values predicted are greater than 7 and more specifically greater than 8. This scale translates into saying that the beverage made is alkaline. At the beginning of this study, we were not informed about the nature of the ABC Beverage company, meaning of what type of beverage manufacturer it was. But from our studies we can conclude that this company produces alkaline beverages like water, dairy, tea, fruit drinks, etc.