Business Case

New FDA regulations require the ABC Company to understand its manufacturing process as it relates to the pH of the beverages it produces. To this end, the production manager has commissioned a predictive model to identify the factors that drive pH and to report on how well pH can be predicted.

Business Considerations

The ABC Company has limited resources for processing large amounts of data and analyzing the results and is working under an accelerated timeline to meet the new regulations. Consequently, the model selected should involve limited data cleaning, lend itself well to automation, and should predict pH with a fair degree of accuracy. More importantly, the interpretation of the model should provide actionable insights into the primary levers of pH levels.

Deliverables

  • An executive-level report of the findings
  • A detailed technical report to be reviewed by an outside consultant

Technical Considerations

The team is operating under a very tight deadline: the deliverables must be submitted by 05/22/2018. Since the team is working remotely from various locations, the following tools were adopted to support efficient communication:

  • Slack was used for daily communication during the project and for quick access to code and documentation.
  • GoToMeeting was utilized for regular touch point meetings and on as needed basis.
  • GitHub was used for version control and to ensure each team member had access to the latest version of the project documentation.
  • R was used to perform the analysis; the R code can be found in Appendix A of the technical report.

Team Members
Prashant Bhuyan (Team Leader)
Bruce Hao
Cheryl Bowersox
Chris Estevez
Valerie Briot

# Please update DataExplorer to the version below
pack_URL= "https://cran.r-project.org/src/contrib/Archive/DataExplorer/DataExplorer_0.4.0.tar.gz"
if (!require("DataExplorer")) install.packages(pack_URL, repos=NULL, type='source')

library(psych)        # EDA, describe function  
library(tidyverse)    # data manipulation and plotting
library(knitr)        # kable tables
library(VIM)          # missing data visualization
library(caret)        # correlation, model building
library(corrplot)     # Correlation
library(mice)         # Imputation
library(MASS)         # BoxCox Transformation
library("usdm")
library(forecast)     # alternate transform
library(ranger)       # Random Forest Model
library(randomForest) # Random Forest Model
library(parallel)     # Parallel processing for model building
library(doParallel)   # Parallel processing for model building

rm(pack_URL)

Data Set

The analysis will be performed on historical data. For reproducibility of the results, the data was loaded to and accessed from a GitHub repository.

#set file name - to be changed to the GitHub link#
#beverage_filename <- 

# Load training data set
beverages <-read.csv("StudentData.csv", header=TRUE, sep=",",stringsAsFactors = F)
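Once the repository is in place, the local file name can be replaced with the raw GitHub URL so that anyone can reproduce the load step; a sketch with a hypothetical path:

# Hypothetical raw GitHub URL - substitute the repository's actual path
beverage_url <- "https://raw.githubusercontent.com/<team>/<repo>/master/StudentData.csv"
beverages <- read.csv(beverage_url, header=TRUE, sep=",", stringsAsFactors = F)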

#From DataExplorer
data_list <- list(beverages)

PlotStr(data_list, type="r")

rm(data_list)
dim(beverages)
## [1] 2571   33
summary(beverages)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5      
##    Carb.Flow       Density           MFR           Balling      
##  Min.   :  26   Min.   :0.240   Min.   : 31.4   Min.   :-0.170  
##  1st Qu.:1144   1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496  
##  Median :3028   Median :0.980   Median :724.0   Median : 1.648  
##  Mean   :2468   Mean   :1.174   Mean   :704.0   Mean   : 2.198  
##  3rd Qu.:3186   3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292  
##  Max.   :5104   Max.   :1.920   Max.   :868.6   Max.   : 4.012  
##  NA's   :2      NA's   :1       NA's   :212     NA's   :1       
##  Pressure.Vacuum        PH        Oxygen.Filler     Bowl.Setpoint  
##  Min.   :-6.600   Min.   :7.880   Min.   :0.00240   Min.   : 70.0  
##  1st Qu.:-5.600   1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0  
##  Median :-5.400   Median :8.540   Median :0.03340   Median :120.0  
##  Mean   :-5.216   Mean   :8.546   Mean   :0.04684   Mean   :109.3  
##  3rd Qu.:-5.000   3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0  
##  Max.   :-3.600   Max.   :9.360   Max.   :0.40000   Max.   :140.0  
##                   NA's   :4       NA's   :12        NA's   :2      
##  Pressure.Setpoint Air.Pressurer      Alch.Rel        Carb.Rel    
##  Min.   :44.00     Min.   :140.8   Min.   :5.280   Min.   :4.960  
##  1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340  
##  Median :46.00     Median :142.6   Median :6.560   Median :5.400  
##  Mean   :47.62     Mean   :142.8   Mean   :6.897   Mean   :5.437  
##  3rd Qu.:50.00     3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540  
##  Max.   :52.00     Max.   :148.2   Max.   :8.620   Max.   :6.060  
##  NA's   :12                        NA's   :9       NA's   :10     
##   Balling.Lvl  
##  Min.   :0.00  
##  1st Qu.:1.38  
##  Median :1.48  
##  Mean   :2.05  
##  3rd Qu.:3.14  
##  Max.   :3.66  
##  NA's   :1
object.size(beverages)
## 642096 bytes

The dataset comprises 33 variables and 2,571 observations. At first glance, it is clear that some variables have missing values that will need to be addressed. All variables besides Brand.Code are numeric.
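A quick type check (sketch) confirms this:

# Tally variable classes: all numeric/integer except Brand.Code (character)
table(sapply(beverages, class))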

Data Exploration and Statistical Measures

The purpose of the data exploration and statistical measures phase is to understand the data and determine how to process the dataset for modelling.

Descriptive Statistics

Descriptive statistics were calculated to examine the basic features of the data.

#Calculate missing value counts and ratios per variable
missing_values <- beverages %>% 
  summarize_all(funs(sum(is.na(.))))

missing_values_ratio <- beverages %>% 
  summarize_all(funs(sum(is.na(.)) / length(.)*100))

#Use the describe function (psych package) to calculate descriptive statistics
(beverages_d <- describe(beverages, na.rm=TRUE, interp=FALSE, skew=TRUE, ranges=TRUE, trim=.1, type=3, check=TRUE, fast=FALSE, quant=c(.25,.75), IQR=TRUE))
beverages_d$missing <- t(missing_values)
beverages_d$miss_ratio <- t(round(missing_values_ratio,4))

beverages_d <- beverages_d %>% 
  dplyr::select(n, missing, miss_ratio, mean, sd, min, max, skew, kurtosis, median, IQR, Q0.25, Q0.75)

kable(beverages_d)
| | n | missing | miss_ratio | mean | sd | min | max | skew | kurtosis | median | IQR | Q0.25 | Q0.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brand.Code* | 2571 | 0 | 0.0000 | NaN | NA | Inf | -Inf | NA | NA | NA | NA | NA | NA |
| Carb.Volume | 2561 | 10 | 0.3890 | 5.3701978 | 0.1063852 | 5.0400000 | 5.700 | 0.3922121 | -0.4669916 | 5.3466667 | 0.1600000 | 5.2933333 | 5.453333 |
| Fill.Ounces | 2533 | 38 | 1.4780 | 23.9747546 | 0.0875299 | 23.6333333 | 24.320 | -0.0215452 | 0.8624714 | 23.9733333 | 0.1066667 | 23.9200000 | 24.026667 |
| PC.Volume | 2532 | 39 | 1.5169 | 0.2771187 | 0.0606953 | 0.0793333 | 0.478 | 0.3396269 | 0.6699690 | 0.2713333 | 0.0728333 | 0.2391667 | 0.312000 |
| Carb.Pressure | 2544 | 27 | 1.0502 | 68.1895755 | 3.5382039 | 57.0000000 | 79.400 | 0.1822162 | -0.0138046 | 68.2000000 | 5.0000000 | 65.6000000 | 70.600000 |
| Carb.Temp | 2545 | 26 | 1.0113 | 141.0949234 | 4.0373861 | 128.6000000 | 154.000 | 0.2468280 | 0.2375822 | 140.8000000 | 5.4000000 | 138.4000000 | 143.800000 |
| PSC | 2538 | 33 | 1.2835 | 0.0845737 | 0.0492690 | 0.0020000 | 0.270 | 0.8491445 | 0.6480498 | 0.0760000 | 0.0640000 | 0.0480000 | 0.112000 |
| PSC.Fill | 2548 | 23 | 0.8946 | 0.1953689 | 0.1177817 | 0.0000000 | 0.620 | 0.9334450 | 0.7691466 | 0.1800000 | 0.1600000 | 0.1000000 | 0.260000 |
| PSC.CO2 | 2532 | 39 | 1.5169 | 0.0564139 | 0.0430387 | 0.0000000 | 0.240 | 1.7288937 | 3.7250025 | 0.0400000 | 0.0600000 | 0.0200000 | 0.080000 |
| Mnf.Flow | 2569 | 2 | 0.0778 | 24.5689373 | 119.4811263 | -100.2000000 | 229.400 | 0.0041430 | -1.8697072 | 65.2000000 | 240.8000000 | -100.0000000 | 140.800000 |
| Carb.Pressure1 | 2539 | 32 | 1.2447 | 122.5863726 | 4.7428819 | 105.6000000 | 140.200 | 0.0543587 | 0.1418265 | 123.2000000 | 6.4000000 | 119.0000000 | 125.400000 |
| Fill.Pressure | 2549 | 22 | 0.8557 | 47.9221656 | 3.1775457 | 34.6000000 | 60.400 | 0.5471107 | 1.4067532 | 46.4000000 | 4.0000000 | 46.0000000 | 50.000000 |
| Hyd.Pressure1 | 2560 | 11 | 0.4278 | 12.4375781 | 12.4332538 | -0.8000000 | 58.000 | 0.7798043 | -0.1426463 | 11.4000000 | 20.2000000 | 0.0000000 | 20.200000 |
| Hyd.Pressure2 | 2556 | 15 | 0.5834 | 20.9610329 | 16.3863066 | 0.0000000 | 59.400 | -0.3019570 | -1.5592984 | 28.6000000 | 34.6000000 | 0.0000000 | 34.600000 |
| Hyd.Pressure3 | 2556 | 15 | 0.5834 | 20.4584507 | 15.9757236 | -1.2000000 | 50.000 | -0.3189061 | -1.5745834 | 27.6000000 | 33.4000000 | 0.0000000 | 33.400000 |
| Hyd.Pressure4 | 2541 | 30 | 1.1669 | 96.2888627 | 13.1225594 | 52.0000000 | 142.000 | 0.5459786 | 0.6340041 | 96.0000000 | 16.0000000 | 86.0000000 | 102.000000 |
| Filler.Level | 2551 | 20 | 0.7779 | 109.2523716 | 15.6984241 | 55.8000000 | 161.200 | -0.8482847 | 0.0460488 | 118.4000000 | 21.7000000 | 98.3000000 | 120.000000 |
| Filler.Speed | 2514 | 57 | 2.2170 | 3687.1988862 | 770.8200208 | 998.0000000 | 4030.000 | -2.8700359 | 6.7059692 | 3982.0000000 | 110.0000000 | 3888.0000000 | 3998.000000 |
| Temperature | 2557 | 14 | 0.5445 | 65.9675401 | 1.3827783 | 63.6000000 | 76.200 | 2.3869732 | 10.1612904 | 65.6000000 | 1.2000000 | 65.2000000 | 66.400000 |
| Usage.cont | 2566 | 5 | 0.1945 | 20.9929618 | 2.9779364 | 12.0800000 | 25.900 | -0.5353253 | -1.0170230 | 21.7900000 | 5.3950000 | 18.3600000 | 23.755000 |
| Carb.Flow | 2569 | 2 | 0.0778 | 2468.3542234 | 1073.6964743 | 26.0000000 | 5104.000 | -0.9877287 | -0.5826893 | 3028.0000000 | 2042.0000000 | 1144.0000000 | 3186.000000 |
| Density | 2570 | 1 | 0.0389 | 1.1736498 | 0.3775269 | 0.2400000 | 1.920 | 0.5260149 | -1.1992070 | 0.9800000 | 0.7200000 | 0.9000000 | 1.620000 |
| MFR | 2359 | 212 | 8.2458 | 704.0492582 | 73.8983094 | 31.4000000 | 868.600 | -5.0917729 | 30.4558939 | 724.0000000 | 24.7000000 | 706.3000000 | 731.000000 |
| Balling | 2570 | 1 | 0.0389 | 2.1977696 | 0.9310914 | -0.1700000 | 4.012 | 0.5939224 | -1.3855651 | 1.6480000 | 1.7960000 | 1.4960000 | 3.292000 |
| Pressure.Vacuum | 2571 | 0 | 0.0000 | -5.2161027 | 0.5699933 | -6.6000000 | -3.600 | 0.5256608 | -0.0313126 | -5.4000000 | 0.6000000 | -5.6000000 | -5.000000 |
| PH | 2567 | 4 | 0.1556 | 8.5456486 | 0.1725162 | 7.8800000 | 9.360 | -0.2906437 | 0.0644294 | 8.5400000 | 0.2400000 | 8.4400000 | 8.680000 |
| Oxygen.Filler | 2559 | 12 | 0.4667 | 0.0468426 | 0.0466436 | 0.0024000 | 0.400 | 2.6603955 | 11.0882098 | 0.0334000 | 0.0380000 | 0.0220000 | 0.060000 |
| Bowl.Setpoint | 2569 | 2 | 0.0778 | 109.3265862 | 15.3031541 | 70.0000000 | 140.000 | -0.9743842 | -0.0564212 | 120.0000000 | 20.0000000 | 100.0000000 | 120.000000 |
| Pressure.Setpoint | 2559 | 12 | 0.4667 | 47.6153966 | 2.0390474 | 44.0000000 | 52.000 | 0.2031970 | -1.6012622 | 46.0000000 | 4.0000000 | 46.0000000 | 50.000000 |
| Air.Pressurer | 2571 | 0 | 0.0000 | 142.8339946 | 1.2119170 | 140.8000000 | 148.200 | 2.2521053 | 4.7336291 | 142.6000000 | 0.8000000 | 142.2000000 | 143.000000 |
| Alch.Rel | 2562 | 9 | 0.3501 | 6.8974161 | 0.5052753 | 5.2800000 | 8.620 | 0.8836378 | -0.8506221 | 6.5600000 | 0.7000000 | 6.5400000 | 7.240000 |
| Carb.Rel | 2561 | 10 | 0.3890 | 5.4367825 | 0.1287183 | 4.9600000 | 6.060 | 0.5032472 | -0.2949480 | 5.4000000 | 0.2000000 | 5.3400000 | 5.540000 |
| Balling.Lvl | 2570 | 1 | 0.0389 | 2.0500078 | 0.8703089 | 0.0000000 | 3.660 | 0.5858456 | -1.4858636 | 1.4800000 | 1.7600000 | 1.3800000 | 3.140000 |
rm(beverages_d,missing_values,missing_values_ratio)

From the skewness coefficients, we observe that some variables have right-skewed distributions (PSC.CO2, Temperature, Oxygen.Filler, Air.Pressurer) and others left-skewed distributions (Filler.Speed, MFR). As observed earlier, some variables have missing values, which we will need to take into consideration.
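The skewness coefficients for the flagged variables can be pulled straight from the describe output; a sketch:

# Skewness of the flagged variables (positive = right skew, negative = left skew)
psych::describe(beverages[, c("PSC.CO2", "Temperature", "Oxygen.Filler",
                              "Air.Pressurer", "Filler.Speed", "MFR")])[, "skew"]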

Analysis of predictors

We will now examine each predictor to understand its distribution and determine whether any transformation is required.

# from DataExplorer Package

DataExplorer::HistogramContinuous(beverages)

Variable to Variable Analysis

Correlation Analysis

The correlation matrix shown below highlights correlations among several predictor variables.

# From DataExplorer Package
#plot_correlation(beverages, use = "pairwise.complete.obs")
DataExplorer::CorrelationContinuous(beverages, use = "pairwise.complete.obs")

cor_mx =cor(beverages%>% dplyr::select(-Brand.Code) ,use="pairwise.complete.obs", method = "pearson")

corrplot(cor_mx, method = "color", type = "upper", order = "original",
         number.cex = .7, addCoef.col = "black",  # add correlation coefficients
         tl.srt = 90,                             # text label rotation
         diag = TRUE)                             # show the principal diagonal

rm(cor_mx)

Let us now look at the correlation between the response variable (pH) and the predictors.
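A quick check (sketch) sorts the numeric predictors by their correlation with PH:

# Correlation of each numeric predictor with the response PH
ph_cor <- cor(beverages %>% dplyr::select(-Brand.Code),
              use = "pairwise.complete.obs")[, "PH"]
sort(ph_cor[names(ph_cor) != "PH"], decreasing = TRUE)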

This section tests the predictor variables to determine if there is correlation among them. The variance inflation factor (VIF) is used to detect multicollinearity among the entire set of predictors, rather than just within pairs of variables.

Testing a subset of the numeric predictors for collinearity, vifcor flags Hyd.Pressure3 as having a collinearity problem; after excluding it, the remaining variables show low VIF scores.

# from the usdm package
beverages_predictors <- dplyr::select(beverages, -PH)

numeric_fields <- dplyr::select_if(beverages_predictors, is.numeric)[, 3:15]  # a 13-variable subset of the numeric predictors

usdm::vifcor(numeric_fields) 
## 1 variables from the 13 input variables have collinearity problem: 
##  
## Hyd.Pressure3 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( Fill.Pressure ~ PSC.Fill ):  0.001947733 
## max correlation ( Carb.Temp ~ Carb.Pressure ):  0.8233431 
## 
## ---------- VIFs of the remained variables -------- 
##         Variables      VIF
## 1       PC.Volume 1.340747
## 2   Carb.Pressure 4.043068
## 3       Carb.Temp 3.731096
## 4             PSC 1.120366
## 5        PSC.Fill 1.093121
## 6         PSC.CO2 1.053359
## 7        Mnf.Flow 2.404566
## 8  Carb.Pressure1 1.312199
## 9   Fill.Pressure 1.291768
## 10  Hyd.Pressure1 2.586587
## 11  Hyd.Pressure2 3.752048
## 12  Hyd.Pressure4 1.325928
rm(beverages_predictors,numeric_fields)

Data Transformation

Missing Values

We have observed some predictors with missing values; however, no predictor is missing more than 8% of its data and no rows are missing more than ?? of their data. We are therefore comfortable imputing the missing data.
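A row-wise check (sketch) quantifies the worst case for the observations:

# Largest share of missing fields in any single row, as a percentage
max(rowMeans(is.na(beverages))) * 100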

# From Data Explorer  
PlotMissing(beverages)

Examination of Zero values

In some cases, zero values actually represent missing data; is this the case here?

df <- setNames(data.frame(colSums(beverages==0, na.rm = T)), 'Count')
           
df$Variable <- rownames(df)

rownames(df) <- NULL

df %>% dplyr::filter(!Variable %in% c("Brand.Code")) %>%  
ggplot(aes(x=reorder(Variable, Count), y=Count, fill=Count)) +
    geom_bar(stat="identity") + coord_flip() + guides(fill=FALSE) +
    xlab("Variable") + ylab("Number of 0 Values") + 
    ggtitle("Count of Zero Values by Variable") +
    geom_text(aes(label=Count), vjust=.5, hjust=-.1,position= position_dodge(width=0.5),size=3,  color="black")

rm(df)

We observed a high number of zero values for the variables Hyd.Pressure1, Hyd.Pressure2, and Hyd.Pressure3, and we will add dummy variables to flag such records. Also, based on the correlation coefficients, we will likely drop Hyd.Pressure3.

Brand.Code has a proportion of its data that is unspecified; we will flag these records with a "U", for "unknown".

Data Imputation

Since we have a limited amount of missing values across the predictors, we will impute the data using the mice package.

# Replace blank Brand.Code values with "U"

beverages$Brand.Code[beverages$Brand.Code==""]= "U"

summary(beverages$Brand.Code)
##    Length     Class      Mode 
##      2571 character character
mice_imputes =mice(beverages, m = 2, maxit = 2, print = FALSE,seed = 143)
densityplot(mice_imputes)

The imputed density distribution is indicated in red.

# Apply the imputed values to create version 1 of the dataset
beverages_v1 =complete(mice_imputes)

# Plot missing values
PlotMissing(beverages_v1)

rm(mice_imputes)
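As a sanity check, imputation should leave the overall distribution of a predictor essentially unchanged; a sketch using MFR, the variable with the most missing values:

# Compare MFR before imputation (212 NAs) and after (complete)
summary(beverages$MFR)
summary(beverages_v1$MFR)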

We have addressed the missing values. We will continue to investigate possible problems with the predictors by checking for near-zero variance.

Near-Zero Variance Predictors

By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and when the frequency ratio mentioned above is greater than 19 (95/5).

These default values can be changed by setting the arguments uniqueCut and freqCut.
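For example, a stricter screen could be run with custom cut-offs (the values here are illustrative):

# Flag variables with fewer than 5% unique values and a frequency ratio above 99
nearZeroVar(beverages_v1, freqCut = 99, uniqueCut = 5, saveMetrics = TRUE)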

# From Caret package
#
x = nearZeroVar(beverages_v1, saveMetrics = TRUE)

str(x, vec.len = 2)
## 'data.frame':    33 obs. of  4 variables:
##  $ freqRatio    : num  2.01 1.06 ...
##  $ percentUnique: num  0.194 3.928 ...
##  $ zeroVar      : logi  FALSE FALSE FALSE ...
##  $ nzv          : logi  FALSE FALSE FALSE ...
x[x[,"zeroVar"] > 0, ]
x[x[,"nzv"] > 0, ]
rm(x)

Since only one variable is flagged as near-zero variance and its percentage of unique values is very close to the cut-off, we will not drop this variable.

Features Creation

Invalid data or bimodal distributions

Based on the histograms above, it is clear that some variables have bimodal distributions or a large number of records with what appear to be invalid data, for example, Mnf.Flow and the Hyd.Pressure variables. While some models may be able to handle such data without modification, others may not. As such, we will create dummy variables to flag which distribution a given record belongs to within each variable.

Note that this step takes place after imputing the missing data, so the flags reflect the completed dataset.

beverages_v2 = beverages_v1 %>%
  mutate(Mnf.Flow.lt0        = if_else(Mnf.Flow      <     0, 1, 0)) %>% 
  mutate(Hyd.Pressure1.lte0  = if_else(Hyd.Pressure1 <=    0 ,1, 0)) %>% 
  mutate(Hyd.Pressure2.lte0  = if_else(Hyd.Pressure2 <=    0, 1, 0)) %>% # no flag for Hyd.Pressure3 since that variable will be dropped
  mutate(Filler.Speed.lt2500 = if_else(Filler.Speed  <  2500, 1, 0)) %>% 
  mutate(Carb.Flow.lt2000    = if_else(Carb.Flow     <  2000, 1, 0)) %>% 
  mutate(Balling.lt.2.5      = if_else(Balling       <   2.5, 1, 0))

Dropping predictors

Based on the correlation results, we propose dropping the following predictors: Density, Balling.Lvl, Carb.Rel, Alch.Rel, and Hyd.Pressure3.

# Drop some predictors due to high correlation  
beverages_v2$Density <- NULL
beverages_v2$Balling.Lvl <- NULL
beverages_v2$Carb.Rel <- NULL
beverages_v2$Alch.Rel <- NULL
beverages_v2$Hyd.Pressure3 <- NULL

Box-Cox Transformations

We have observed significant skewness for several variables and propose applying a Box-Cox transformation to PSC and Oxygen.Filler. Note that PSC.Fill and PSC.CO2 contain zero values and cannot be Box-Cox transformed directly, so a small offset (1e-7) is added to make them strictly positive before estimating their lambdas.

# Copy our data set
beverages_v3 <- beverages_v2
offset <- 0.0000001   # small offset so zero-valued variables can be transformed

# Helper: return the Box-Cox lambda with the greatest log-likelihood
best_lambda <- function(x) {
  Box <- boxcox(x ~ 1,                     # from MASS; fit against an intercept only
                lambda = seq(-6, 6, 0.1))  # try lambda values -6 to 6 by 0.1
  Box$x[which.max(Box$y)]                  # lambda maximizing the log-likelihood
}

# PSC (strictly positive, no offset needed)
lambda.PSC <- best_lambda(beverages_v3$PSC)

# PSC.Fill contains zeros; add the offset before estimating lambda
beverages_v3$PSC.Fill <- beverages_v3$PSC.Fill + offset
lambda.PSC_Fill <- best_lambda(beverages_v3$PSC.Fill)

# PSC.CO2 contains zeros; add the offset before estimating lambda
beverages_v3$PSC.CO2 <- beverages_v3$PSC.CO2 + offset
lambda.PSC_CO2 <- best_lambda(beverages_v3$PSC.CO2)

# Oxygen.Filler (strictly positive)
lambda.Oxygen_Filler <- best_lambda(beverages_v3$Oxygen.Filler)

# Display the selected lambdas
c(PSC = lambda.PSC, PSC.Fill = lambda.PSC_Fill,
  PSC.CO2 = lambda.PSC_CO2, Oxygen.Filler = lambda.Oxygen_Filler)

# Alternative using the forecast package:
# a = BoxCox.lambda(beverages_v2$PSC.Fill)
# b = BoxCox(beverages_v2$PSC.Fill, a)
# HistogramContinuous(b)

rm(offset)

The selected lambda for PSC is `r lambda.PSC`, for PSC.Fill `r lambda.PSC_Fill`, for PSC.CO2 `r lambda.PSC_CO2`, and for Oxygen.Filler `r lambda.Oxygen_Filler`.

This completes the transformations on the data set; any additional transformations will be performed in the model-building phase, as they are model dependent.
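For instance, caret can estimate and apply Box-Cox transformations as a pre-processing step at training time; a minimal sketch:

# Let caret estimate Box-Cox lambdas and apply them (strictly positive columns only)
pp <- preProcess(beverages_v3[, c("PSC", "Oxygen.Filler")], method = "BoxCox")
head(predict(pp, beverages_v3[, c("PSC", "Oxygen.Filler")]))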

HistogramContinuous(beverages_v3)


Dataset 4

Data set 4 converts the response into a categorical variable: Neutral (pH <= 8.5) or Alkaline (pH > 8.5). Brand.Code is also converted to a factor.

beverages_v4= beverages_v1

beverages_v4$PH = ifelse(beverages_v4$PH <= 8.5, "Neutral", "Alkaline")
beverages_v4$PH = factor(beverages_v4$PH )
beverages_v4$Brand.Code = factor(beverages_v4$Brand.Code) 

Model Building

We will explore and build various models to identify the most significant variables that influence pH and to predict pH values.

Data Splitting

We have three versions of the data set on which to base our regression models, plus a fourth for the classification model. Additional transformations such as scaling and centering may also be applied at the time of model building:

  • version 1: the imputed data set, with missing Brand.Code values (blanks) imputed as 'U'
  • version 2: based on version 1, with additional features and with highly correlated variables dropped
  • version 3: based on version 2, with Box-Cox transformations for the most skewed variables
# Split each data set version 70/30 into training and test sets, using the same sampled rows
set.seed(143) 
sample = sample.int(n = nrow(beverages), size = floor(.70*nrow(beverages)), replace = F)

beverages_v1_train = beverages_v1[sample, ]
beverages_v1_test  = beverages_v1[-sample,]

beverages_v2_train = beverages_v2[sample, ]
beverages_v2_test  = beverages_v2[-sample,]

beverages_v3_train = beverages_v3[sample, ]
beverages_v3_test  = beverages_v3[-sample,]

beverages_v4_train = beverages_v4[sample, ]
beverages_v4_test  = beverages_v4[-sample,]

rm(beverages_v1,beverages_v2,beverages_v3,beverages_v4,beverages,sample)
Mycluster = makeCluster(detectCores() - 1)  # parallel backend: use all but one core
registerDoParallel(Mycluster)

myControl = trainControl(method = 'cv', number = 5, 
  verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)

GLM

# first, start with a generalized linear model
set.seed(143)
GLM_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glm',preProcess = c('center', 'scale'), trControl = myControl)
GLM_M1_Data1
## Generalized Linear Model 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1321942  0.4198535  0.1016524
set.seed(143)
GLM_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M2_Data2
## Generalized Linear Model 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1327388  0.4153035  0.1039857
set.seed(143)
GLM_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M3_Data3
## Generalized Linear Model 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1328461  0.4143476  0.1042366

glmnet

# next, we'll try a glmnet model which combines lasso and ridge regression 
set.seed(143)
glmnet_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)
glmnet_M1_Data1
## glmnet 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001599694  0.1321682  0.4200101  0.1016777
##   0.10   0.0015996935  0.1322468  0.4191405  0.1020003
##   0.10   0.0159969354  0.1344230  0.4028236  0.1050963
##   0.55   0.0001599694  0.1321767  0.4198747  0.1017159
##   0.55   0.0015996935  0.1326398  0.4157909  0.1026923
##   0.55   0.0159969354  0.1391558  0.3693005  0.1092707
##   1.00   0.0001599694  0.1321877  0.4197298  0.1017613
##   1.00   0.0015996935  0.1331279  0.4118142  0.1033393
##   1.00   0.0159969354  0.1439824  0.3318423  0.1128677
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001599694.
set.seed(143)
glmnet_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M2_Data2
## glmnet 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001672247  0.1326933  0.4155451  0.1038747
##   0.10   0.0016722466  0.1328820  0.4135047  0.1038632
##   0.10   0.0167224659  0.1343540  0.4025188  0.1051299
##   0.55   0.0001672247  0.1326944  0.4154579  0.1038483
##   0.55   0.0016722466  0.1333793  0.4090589  0.1041986
##   0.55   0.0167224659  0.1386539  0.3740453  0.1086213
##   1.00   0.0001672247  0.1326960  0.4153719  0.1038397
##   1.00   0.0016722466  0.1336176  0.4071727  0.1044491
##   1.00   0.0167224659  0.1439849  0.3307979  0.1127147
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001672247.
set.seed(143)
glmnet_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M3_Data3
## glmnet 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001672247  0.1327977  0.4145988  0.1041364
##   0.10   0.0016722466  0.1329433  0.4128907  0.1040781
##   0.10   0.0167224659  0.1343872  0.4021404  0.1052213
##   0.55   0.0001672247  0.1328009  0.4144859  0.1041151
##   0.55   0.0016722466  0.1333806  0.4089906  0.1043980
##   0.55   0.0167224659  0.1386158  0.3743946  0.1085942
##   1.00   0.0001672247  0.1328006  0.4144030  0.1040986
##   1.00   0.0016722466  0.1336274  0.4070431  0.1046215
##   1.00   0.0167224659  0.1439831  0.3308204  0.1127133
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001672247.
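To see which predictors the elastic-net penalty retains, the coefficients of the final fit can be extracted at the selected lambda; a sketch:

# Coefficients of the final glmnet model at the lambda chosen by caret
coef(glmnet_M1_Data1$finalModel, s = glmnet_M1_Data1$bestTune$lambda)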

ranger

# let's also model using random forest just for fun
set.seed(143)
ranger_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)
ranger_M1_Data1
## Random Forest 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1148168  0.6082763  0.08645160
##    2    extratrees  0.1191566  0.5732928  0.09119231
##   18    variance    0.1022603  0.6658250  0.07395862
##   18    extratrees  0.1016529  0.6675300  0.07357492
##   35    variance    0.1025035  0.6570685  0.07348070
##   35    extratrees  0.1000349  0.6754776  0.07205404
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 35, splitrule =
##  extratrees and min.node.size = 5.
set.seed(143)
ranger_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M2_Data2
## Random Forest 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1183735  0.5767453  0.09000200
##    2    extratrees  0.1219389  0.5435630  0.09361688
##   19    variance    0.1042656  0.6494408  0.07637783
##   19    extratrees  0.1036684  0.6506936  0.07604484
##   36    variance    0.1046413  0.6425340  0.07615347
##   36    extratrees  0.1021818  0.6590873  0.07477257
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 36, splitrule =
##  extratrees and min.node.size = 5.
set.seed(143)
ranger_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M3_Data3
## Random Forest 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1191127  0.5698374  0.09056467
##    2    extratrees  0.1222293  0.5412470  0.09392975
##   19    variance    0.1042516  0.6498560  0.07627652
##   19    extratrees  0.1041126  0.6479240  0.07647817
##   37    variance    0.1047951  0.6413251  0.07615708
##   37    extratrees  0.1024866  0.6571123  0.07519137
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 37, splitrule =
##  extratrees and min.node.size = 5.

PLS

Since we observed correlation among the predictor variables, we will also build a Partial Least Squares model.

set.seed(143)
pls_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method ='pls', preProcess = c('center', 'scale'), tuneLength = 15, trControl = myControl)
pls_M1_Data1
## Partial Least Squares 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##   1      0.1494403  0.2581107  0.1177883
##   2      0.1403749  0.3456090  0.1099254
##   3      0.1378683  0.3686338  0.1086210
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method ='pls', tuneLength = 15, trControl = myControl)
pls_M2_Data2
## Partial Least Squares 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##   1      0.1705840  0.03468439  0.1360308
##   2      0.1691324  0.04868677  0.1352213
##   3      0.1536504  0.21647890  0.1197444
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method ='pls', tuneLength = 15, trControl = myControl)
pls_M3_Data3
## Partial Least Squares 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##   1      0.1705840  0.03468441  0.1360308
##   2      0.1691324  0.04868680  0.1352213
##   3      0.1536504  0.21647873  0.1197445
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

SVM

set.seed(143)
svmRadial_M1_Data1 =train(PH~.,beverages_v1_train, metric = 'RMSE', method = "svmRadial",preProc =c("center", "scale"),tuneLength = 14, trControl = myControl)
svmRadial_M1_Data1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1271623  0.4763276  0.09440391
##      0.50  0.1237651  0.4996104  0.09135600
##      1.00  0.1203705  0.5237636  0.08824463
##      2.00  0.1177148  0.5422556  0.08548582
##      4.00  0.1160196  0.5542984  0.08379616
##      8.00  0.1157321  0.5578480  0.08375405
##     16.00  0.1173556  0.5510887  0.08532832
##     32.00  0.1209362  0.5333375  0.08834497
##     64.00  0.1264312  0.5071734  0.09295419
##    128.00  0.1339292  0.4748267  0.09843867
##    256.00  0.1413222  0.4466638  0.10391393
##    512.00  0.1458019  0.4293654  0.10746593
##   1024.00  0.1476611  0.4221392  0.10865742
##   2048.00  0.1476611  0.4221392  0.10865742
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01892239
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01892239 and C = 8.
set.seed(143)
svmRadial_M2_Data2 =train(PH~.,beverages_v2_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M2_Data2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1266025  0.4790584  0.09407358
##      0.50  0.1240029  0.4962984  0.09161310
##      1.00  0.1211552  0.5161218  0.08903764
##      2.00  0.1190476  0.5308862  0.08701478
##      4.00  0.1174062  0.5430372  0.08576343
##      8.00  0.1177581  0.5424166  0.08617114
##     16.00  0.1189869  0.5385996  0.08746447
##     32.00  0.1224791  0.5220192  0.09017677
##     64.00  0.1272165  0.5005933  0.09363698
##    128.00  0.1350334  0.4651577  0.09944323
##    256.00  0.1450232  0.4250946  0.10690726
##    512.00  0.1523229  0.3972286  0.11217570
##   1024.00  0.1563843  0.3830348  0.11491475
##   2048.00  0.1587342  0.3766594  0.11665990
## 
## Tuning parameter 'sigma' was held constant at a value of 0.018282
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.018282 and C = 4.
set.seed(143)
svmRadial_M3_Data3 =train(PH~.,beverages_v3_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M3_Data3
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1270587  0.4752379  0.09442825
##      0.50  0.1243818  0.4933657  0.09170709
##      1.00  0.1215350  0.5134861  0.08906423
##      2.00  0.1195646  0.5269365  0.08746778
##      4.00  0.1180271  0.5382430  0.08653561
##      8.00  0.1180227  0.5401329  0.08624021
##     16.00  0.1190728  0.5382667  0.08744345
##     32.00  0.1231344  0.5185287  0.09067732
##     64.00  0.1277064  0.4976801  0.09401868
##    128.00  0.1360166  0.4596066  0.10049839
##    256.00  0.1463599  0.4181806  0.10841813
##    512.00  0.1528888  0.3935027  0.11342022
##   1024.00  0.1556328  0.3859861  0.11575868
##   2048.00  0.1568338  0.3823311  0.11670405
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01751442
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01751442 and C = 8.
#plot(svmRadial_M1_Data1, scales = list(x=list(log=2)))

rf

control_forest = trainControl(method="repeatedcv", number=5, repeats=2, search="random",allowParallel = T)
mtry = sqrt(ncol(beverages_v1_train))


set.seed(143)
rf_M1_Data1 = train(PH~., data=beverages_v1_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest,importance =T)
rf_M1_Data1
## Random Forest 
## 
## 1799 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1269387  0.5384941  0.09774909
##    3    0.1113403  0.6259804  0.08295917
##   15    0.1027157  0.6647482  0.07437338
##   26    0.1019036  0.6644823  0.07329716
##   34    0.1024474  0.6573263  0.07334071
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
mtry = sqrt(ncol(beverages_v2_train))

set.seed(143)
rf_M2_Data2 = train(PH~., data=beverages_v2_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M2_Data2
## Random Forest 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1334847  0.4781047  0.10352434
##    3    0.1139627  0.6015363  0.08562282
##   15    0.1048334  0.6476058  0.07680505
##   27    0.1046327  0.6438005  0.07605319
##   35    0.1050295  0.6391563  0.07608868
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 27.
mtry = sqrt(ncol(beverages_v3_train))
set.seed(143)
rf_M3_Data3 = train(PH~., data=beverages_v3_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M3_Data3
## Random Forest 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1340604  0.4691741  0.10386163
##    3    0.1145931  0.5974142  0.08631543
##   16    0.1051138  0.6452811  0.07702960
##   28    0.1047163  0.6431253  0.07614646
##   36    0.1051171  0.6382049  0.07616957
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 28.
rm(control_forest,mtry)

knn

set.seed(143)
knn_M1_Data1 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v1_train,preProc =c("center", "scale"))
knn_M1_Data1
## k-Nearest Neighbors 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    1  0.1552863  0.3519753  0.10524865
##    2  0.1354010  0.4441269  0.09628802
##    3  0.1288917  0.4719160  0.09184895
##    4  0.1256230  0.4900061  0.09103044
##    5  0.1239321  0.4978780  0.09017747
##    6  0.1244458  0.4922766  0.09122504
##    7  0.1240195  0.4953110  0.09169746
##    8  0.1244070  0.4920447  0.09218550
##    9  0.1244541  0.4919942  0.09257184
##   10  0.1244123  0.4917993  0.09265990
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
set.seed(143)
knn_M2_Data2 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v2_train)
knn_M2_Data2
## k-Nearest Neighbors 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.1683200  0.2859914  0.1144311
##    2  0.1500658  0.3327492  0.1058868
##    3  0.1418583  0.3653536  0.1021784
##    4  0.1401030  0.3694645  0.1028647
##    5  0.1401516  0.3633672  0.1033130
##    6  0.1393636  0.3643894  0.1034378
##    7  0.1400595  0.3562528  0.1038978
##    8  0.1398343  0.3550924  0.1045687
##    9  0.1405531  0.3466922  0.1052976
##   10  0.1408672  0.3423547  0.1057476
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.
set.seed(143)
knn_M3_Data3 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v3_train)
knn_M3_Data3
## k-Nearest Neighbors 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.1678809  0.2883229  0.1142200
##    2  0.1501421  0.3320212  0.1059035
##    3  0.1419641  0.3647402  0.1022868
##    4  0.1400958  0.3695812  0.1028442
##    5  0.1401539  0.3633406  0.1033219
##    6  0.1393665  0.3643136  0.1034634
##    7  0.1400607  0.3562888  0.1039153
##    8  0.1398256  0.3552039  0.1045193
##    9  0.1405340  0.3468925  0.1052685
##   10  0.1408292  0.3426671  0.1056854
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.

Random Forest Classification (Prashant)

Using the 'randomForest' and 'caret' packages, we applied k-fold cross-validation on the training set, fitting the random forest model to five folds of the training set and averaging the results.

set.seed(143)
rf_M4_Data4 = train(PH ~ ., data = beverages_v4_train,method = "rf", prox = FALSE, trControl = myControl)
rf_M4_Data4
## Random Forest 
## 
## 1799 samples
##   32 predictor
##    2 classes: 'Alkaline', 'Neutral' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8054565  0.5981767
##   18    0.8199087  0.6314982
##   35    0.8126834  0.6174367
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
stopCluster(Mycluster)
registerDoSEQ()

From the results of the cross-validated model above, we can see that model accuracy was above 80 percent. From the confusion matrix below, the overall accuracy of the model on the test set was 83.16 percent, with a 95 percent confidence interval between 80.33 percent and 85.73 percent, a sensitivity of 87.77 percent, and a specificity of 76.14 percent.

test <- predict( rf_M4_Data4, newdata = beverages_v4_test)
cf <- confusionMatrix(data = test,  beverages_v4_test$PH)
print(cf, digits = 4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alkaline Neutral
##   Alkaline      409      73
##   Neutral        57     233
##                                           
##                Accuracy : 0.8316          
##                  95% CI : (0.8033, 0.8573)
##     No Information Rate : 0.6036          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6449          
##  Mcnemar's Test P-Value : 0.1883          
##                                           
##             Sensitivity : 0.8777          
##             Specificity : 0.7614          
##          Pos Pred Value : 0.8485          
##          Neg Pred Value : 0.8034          
##              Prevalence : 0.6036          
##          Detection Rate : 0.5298          
##    Detection Prevalence : 0.6244          
##       Balanced Accuracy : 0.8196          
##                                           
##        'Positive' Class : Alkaline        
## 
# Variable importance for the model
varImp(rf_M4_Data4, top = 10)
## rf variable importance
## 
##   only 20 most important variables shown (out of 35)
## 
##                 Overall
## Mnf.Flow         100.00
## Usage.cont        47.94
## Balling.Lvl       26.69
## Filler.Level      26.66
## Temperature       26.58
## Oxygen.Filler     26.41
## Air.Pressurer     25.42
## Balling           24.18
## Alch.Rel          23.65
## Hyd.Pressure3     23.39
## Carb.Pressure1    22.90
## Filler.Speed      20.86
## PC.Volume         20.60
## Carb.Flow         19.89
## Bowl.Setpoint     19.85
## Brand.CodeC       19.15
## Carb.Rel          18.16
## MFR               17.54
## Pressure.Vacuum   17.43
## Density           16.71
dotPlot(varImp(rf_M4_Data4), top=10)

Model Selection, Interpretation, & Evaluation

Selecting Best Model

We have built various models, both linear and nonlinear. We will evaluate each of the models on the test datasets and select the best model based on the results. The following criteria will be considered for the selection:

  • Accuracy (how accurately the model performed, evaluated using RMSE)
  • Rsquared
  • MAE
  • Scalability (as measured by prediction run time)
  • Interpretability

# compare models
models_list = list("glm_D1" = GLM_M1_Data1, "glm_D2" = GLM_M2_Data2, "glm_D3" = GLM_M3_Data3, 
                   "glmnet_D1" = glmnet_M1_Data1, "glmnet_D2" = glmnet_M2_Data2, "glmnet_D3" = glmnet_M3_Data3, 
                   "ranger_D1"= ranger_M1_Data1, "ranger_D2" = ranger_M2_Data2, "ranger_D3" = ranger_M3_Data3, 
                   "pls_D1"= pls_M1_Data1, "pls_D2" = pls_M2_Data2, "pls_D3" = pls_M3_Data3, 
                   "SVM_D1" = svmRadial_M1_Data1, "SVM_D2" = svmRadial_M2_Data2, "SVM_D3" = svmRadial_M3_Data3, 
                  # "rf_D1" = rf_M1_Data1, "rf_D2" = rf_M2_Data2, "rf_D3" = rf_M3_Data3,
                   "knn_D1" = knn_M1_Data1, "knn_D2" = knn_M2_Data2, "knn_D3" = knn_M3_Data3)

resamps = resamples(models_list) 

dotplot(resamps, metric = 'RMSE')

dotplot(resamps, metric = 'Rsquared')

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: glm_D1, glm_D2, glm_D3, glmnet_D1, glmnet_D2, glmnet_D3, ranger_D1, ranger_D2, ranger_D3, pls_D1, pls_D2, pls_D3, SVM_D1, SVM_D2, SVM_D3, knn_D1, knn_D2, knn_D3 
## Number of resamples: 5 
## 
## MAE 
##              Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
## glm_D1    0.09667 0.09943 0.10200 0.10170 0.10300 0.10720    0
## glm_D2    0.10100 0.10220 0.10450 0.10400 0.10560 0.10660    0
## glm_D3    0.10150 0.10210 0.10430 0.10420 0.10630 0.10700    0
## glmnet_D1 0.09663 0.09955 0.10190 0.10170 0.10300 0.10720    0
## glmnet_D2 0.10070 0.10220 0.10440 0.10390 0.10520 0.10690    0
## glmnet_D3 0.10110 0.10210 0.10430 0.10410 0.10590 0.10730    0
## ranger_D1 0.06786 0.07040 0.07101 0.07205 0.07494 0.07605    0
## ranger_D2 0.07129 0.07329 0.07353 0.07477 0.07661 0.07914    0
## ranger_D3 0.07164 0.07376 0.07403 0.07519 0.07734 0.07919    0
## pls_D1    0.10440 0.10690 0.10830 0.10860 0.10850 0.11490    0
## pls_D2    0.11770 0.11780 0.11820 0.11970 0.12130 0.12370    0
## pls_D3    0.11770 0.11780 0.11820 0.11970 0.12130 0.12370    0
## SVM_D1    0.08213 0.08245 0.08308 0.08375 0.08524 0.08586    0
## SVM_D2    0.08462 0.08530 0.08619 0.08576 0.08628 0.08642    0
## SVM_D3    0.08365 0.08555 0.08699 0.08624 0.08729 0.08772    0
## knn_D1    0.08648 0.08762 0.08944 0.09018 0.09360 0.09375    0
## knn_D2    0.10080 0.10270 0.10370 0.10340 0.10480 0.10520    0
## knn_D3    0.10090 0.10280 0.10370 0.10350 0.10480 0.10510    0
## 
## RMSE 
##              Min. 1st Qu.  Median   Mean 3rd Qu.   Max. NA's
## glm_D1    0.12620 0.12780 0.13330 0.1322  0.1344 0.1393    0
## glm_D2    0.12830 0.13000 0.13450 0.1327  0.1347 0.1362    0
## glm_D3    0.12910 0.12980 0.13420 0.1328  0.1352 0.1360    0
## glmnet_D1 0.12610 0.12790 0.13320 0.1322  0.1344 0.1393    0
## glmnet_D2 0.12810 0.13000 0.13440 0.1327  0.1344 0.1365    0
## glmnet_D3 0.12890 0.12980 0.13420 0.1328  0.1348 0.1363    0
## ranger_D1 0.09439 0.09521 0.09902 0.1000  0.1029 0.1087    0
## ranger_D2 0.09692 0.09898 0.10050 0.1022  0.1063 0.1082    0
## ranger_D3 0.09778 0.09927 0.10070 0.1025  0.1060 0.1087    0
## pls_D1    0.13210 0.13500 0.13620 0.1379  0.1384 0.1477    0
## pls_D2    0.15000 0.15180 0.15290 0.1537  0.1534 0.1601    0
## pls_D3    0.15000 0.15180 0.15290 0.1537  0.1534 0.1601    0
## SVM_D1    0.11410 0.11450 0.11470 0.1157  0.1153 0.1200    0
## SVM_D2    0.11390 0.11590 0.11780 0.1174  0.1192 0.1203    0
## SVM_D3    0.11500 0.11610 0.11810 0.1180  0.1200 0.1211    0
## knn_D1    0.11860 0.12050 0.12440 0.1239  0.1278 0.1283    0
## knn_D2    0.13580 0.13760 0.13780 0.1394  0.1416 0.1440    0
## knn_D3    0.13580 0.13760 0.13780 0.1394  0.1416 0.1440    0
## 
## Rsquared 
##             Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## glm_D1    0.3754  0.4093 0.4222 0.4199  0.4236 0.4688    0
## glm_D2    0.3761  0.3888 0.4116 0.4153  0.4494 0.4506    0
## glm_D3    0.3782  0.3901 0.4083 0.4143  0.4441 0.4510    0
## glmnet_D1 0.3751  0.4083 0.4224 0.4200  0.4240 0.4702    0
## glmnet_D2 0.3760  0.3887 0.4140 0.4155  0.4469 0.4522    0
## glmnet_D3 0.3779  0.3900 0.4110 0.4146  0.4456 0.4485    0
## ranger_D1 0.6386  0.6535 0.6730 0.6755  0.6939 0.7184    0
## ranger_D2 0.6096  0.6447 0.6578 0.6591  0.6837 0.6996    0
## ranger_D3 0.6127  0.6430 0.6540 0.6571  0.6824 0.6935    0
## pls_D1    0.3384  0.3417 0.3466 0.3686  0.3972 0.4194    0
## pls_D2    0.1618  0.1996 0.2342 0.2165  0.2348 0.2520    0
## pls_D3    0.1618  0.1996 0.2342 0.2165  0.2348 0.2520    0
## SVM_D1    0.5112  0.5340 0.5634 0.5578  0.5786 0.6021    0
## SVM_D2    0.5039  0.5065 0.5521 0.5430  0.5736 0.5791    0
## SVM_D3    0.5004  0.5197 0.5377 0.5401  0.5692 0.5738    0
## knn_D1    0.4476  0.5060 0.5085 0.4979  0.5091 0.5182    0
## knn_D2    0.3242  0.3467 0.3687 0.3644  0.3786 0.4037    0
## knn_D3    0.3241  0.3468 0.3684 0.3643  0.3786 0.4037    0

We will now compute a summary of all the models and metrics, as shown in the table below. We evaluate the models on the test data and compare the results.

predictAndMeasure <- function(model, model.label, testData, ytest, score_interpretability, grid = NULL) {
 
  # measure prediction time
  ptm <- proc.time()
  # predict on the test data set
  pred <- predict(model, testData)
  tm <- proc.time() - ptm
  
  post<- postResample(pred = pred, obs = ytest)
  RMSE.test <- c(post[[1]])
  RSquared.test <- c(post[[2]])
  MAE.test <- c(post[[3]])
  
  perf.grid = NULL
  if (is.null(grid)) { 
    perf.grid = data.frame(predictor = c(model.label) ,  RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
  } else {
    .grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
    perf.grid = rbind(grid, .grid)
  }
  
  perf.grid
}


#Prediction for glm 
performance.grid <- predictAndMeasure (GLM_M1_Data1, "glm_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=NULL)
performance.grid <- predictAndMeasure (GLM_M2_Data2, "glm_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (GLM_M3_Data3, "glm_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for glmnet
performance.grid <- predictAndMeasure (glmnet_M1_Data1, "glmnet_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M2_Data2, "glmnet_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M3_Data3, "glmnet_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for ranger
performance.grid <- predictAndMeasure (ranger_M1_Data1, "ranger_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M2_Data2, "ranger_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M3_Data3, "ranger_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for pls
performance.grid <- predictAndMeasure (pls_M1_Data1, "PLS_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M2_Data2, "PLS_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M3_Data3, "PLS_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for SVM
performance.grid <- predictAndMeasure (svmRadial_M1_Data1, "SVM_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M2_Data2, "SVM_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M3_Data3, "SVM_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for rf
performance.grid <- predictAndMeasure (rf_M1_Data1, "rf_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M2_Data2, "rf_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M3_Data3, "rf_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for knn
performance.grid <- predictAndMeasure (knn_M1_Data1, "knn_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M2_Data2, "knn_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M3_Data3, "knn_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

kable(performance.grid[order(performance.grid$RMSE, decreasing=F),])
|    | predictor | RMSE | RSquared | MAE | time | interpretability |
|----|-----------|------|----------|-----|------|------------------|
| 7  | ranger_D1 | 0.0939969 | 0.7032532 | 0.0696673 | 0.06 | 3 |
| 16 | rf_D1     | 0.0967080 | 0.6879884 | 0.0718421 | 0.12 | 3 |
| 9  | ranger_D3 | 0.0977129 | 0.6765406 | 0.0728909 | 0.06 | 1 |
| 8  | ranger_D2 | 0.0982740 | 0.6727468 | 0.0728900 | 0.06 | 2 |
| 17 | rf_D2     | 0.1007574 | 0.6589337 | 0.0749488 | 0.14 | 2 |
| 18 | rf_D3     | 0.1013214 | 0.6545622 | 0.0750357 | 0.36 | 1 |
| 13 | SVM_D1    | 0.1170556 | 0.5343906 | 0.0861389 | 0.34 | 3 |
| 15 | SVM_D3    | 0.1175400 | 0.5305444 | 0.0881589 | 0.11 | 1 |
| 14 | SVM_D2    | 0.1192273 | 0.5150016 | 0.0890344 | 0.10 | 2 |
| 19 | knn_D1    | 0.1192457 | 0.5140818 | 0.0883048 | 0.06 | 3 |
| 4  | glmnet_D1 | 0.1356713 | 0.3705383 | 0.1050806 | 0.03 | 3 |
| 1  | glm_D1    | 0.1356927 | 0.3705787 | 0.1051352 | 0.01 | 3 |
| 5  | glmnet_D2 | 0.1376930 | 0.3537371 | 0.1068528 | 0.02 | 2 |
| 2  | glm_D2    | 0.1379206 | 0.3522181 | 0.1069320 | 0.02 | 2 |
| 6  | glmnet_D3 | 0.1381452 | 0.3500481 | 0.1072481 | 0.02 | 1 |
| 3  | glm_D3    | 0.1383895 | 0.3484315 | 0.1073678 | 0.00 | 1 |
| 20 | knn_D2    | 0.1415026 | 0.3352160 | 0.1059370 | 0.07 | 2 |
| 21 | knn_D3    | 0.1416439 | 0.3343843 | 0.1060579 | 0.06 | 1 |
| 10 | PLS_D1    | 0.1423358 | 0.3062500 | 0.1109452 | 0.01 | 3 |
| 11 | PLS_D2    | 0.1559200 | 0.1733901 | 0.1248626 | 0.00 | 2 |
| 12 | PLS_D3    | 0.1559200 | 0.1733898 | 0.1248626 | 0.02 | 1 |

From the summary table, we observe that the random forest models (ranger_D1 and rf_D1) perform best based on RMSE, MAE, and RSquared values. Both are built on the data set with no transformations, which allows for better interpretability of the model.

The model is meant to be “productionalized”: an application will be built to monitor the manufacturing process, which will require the model to scale and to operate on large amounts of data. Given this requirement, we selected the rf_D1 model.

Interpretation of selected model

The analysis of the variable importance and decision tree of the selected random forest model is given below.

# variable importance of the final random forest model
varImp(rf_M1_Data1$finalModel)
# first decision tree of the final random forest model (getTree defaults to k = 1)
getTree(rf_M1_Data1$finalModel, labelVar = TRUE)

The model’s decision tree is consistent with the variable importance output. The first node of the tree is Mnf.Flow with a split point of about -50, which seems to be separating valid vs. invalid (-100) Mnf.Flow values. The records with invalid Mnf.Flow values seem to have higher pH values.
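As a quick check of this reading, we can compare the mean pH on either side of the split point (a minimal sketch, assuming beverages_v1_train is the untransformed Data1 training set used to fit the model):

# Hedged sketch: mean pH for valid vs. invalid (~ -100) Mnf.Flow values,
# split at the tree's first node (-50); beverages_v1_train is an assumed name
tapply(beverages_v1_train$PH, beverages_v1_train$Mnf.Flow < -50, mean)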

The first node on the left branch is Brand.Code4, which equates to Brand.Code “C”. Here, it seems that records with Brand.Code “C” have lower pH values. The first node on the right branch is Brand.Code5, which equates to Brand.Code “D” and is associated with higher pH values.

These positive/negative associations between pH and the important variables are consistent with the coefficients from an earlier glmnet model, which, while not as accurate in terms of RMSE, does produce much more interpretable results.

On a centered and scaled basis, the variables that most increased pH were Balling.Lvl and Carb.Pressure1, and the variables that most decreased pH were Mnf.Flow and Brand.Code “C”.
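For reference, the coefficients behind this comparison can be pulled from the earlier glmnet fit (a sketch, assuming glmnet_M1_Data1 is the caret object trained above, with coefficients taken at the lambda caret selected):

# Hedged sketch: coefficients from the earlier glmnet model at the selected lambda
coef(glmnet_M1_Data1$finalModel, s = glmnet_M1_Data1$bestTune$lambda)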

varImpPlot(rf_M1_Data1$finalModel, type = 1, main = "Mean Decrease Accuracy") # Graph 1: % increase in MSE when a variable is permuted

varImpPlot(rf_M1_Data1$finalModel, type = 2, main = "Node Purity") # Graph 2: total decrease in node impurity (RSS) from splits on a variable

Graph 1 above shows how much the mean squared error (MSE) increases when a variable's values are replaced by a random permutation. In this case, randomly permuting Mnf.Flow increases MSE by about 60% on average.
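To illustrate the idea behind this measure, the permutation can be done by hand (a sketch only: randomForest computes %IncMSE on each tree's out-of-bag samples, whereas this uses the held-out test set, so the numbers will differ; the seed is arbitrary):

# Hedged sketch of permutation importance on the test set
set.seed(123)
baseline_mse <- mean((predict(rf_M1_Data1, beverages_v1_test) - beverages_v1_test$PH)^2)
permuted <- beverages_v1_test
permuted$Mnf.Flow <- sample(permuted$Mnf.Flow)   # break the association with pH
permuted_mse <- mean((predict(rf_M1_Data1, permuted) - beverages_v1_test$PH)^2)
100 * (permuted_mse - baseline_mse) / baseline_mse   # % increase in MSE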

Graph 2 above shows node purity. For a regression forest, purity is measured by the residual sum of squares (RSS): each split's contribution is the decrease in RSS from splitting on that variable, and IncNodePurity is this decrease summed over all splits on the variable and averaged across all trees. (The Gini index plays the analogous role in classification.)
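These values can also be read directly from the fitted forest, as a quick cross-check of Graph 2 (importance() is from the randomForest package; sorting is only for readability):

# IncNodePurity values underlying Graph 2, largest first
imp <- importance(rf_M1_Data1$finalModel, type = 2)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE])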

Evaluation

Evaluation Data set and required transformations

We will load the evaluation data set and perform the necessary data transformations to run our selected model.

# set file name - evaluation data loaded from the Github repository
beverages_filename_eval<- "https://raw.githubusercontent.com/vbriot28/Data624_Group1_FinalProject/master/StudentEvaluation-%20TO%20PREDICT.csv"

# Load Evaluation Data Set
beverages_eval <-read.csv(beverages_filename_eval, header=TRUE, sep=",",stringsAsFactors = F)

Based on our selected model, we only need to impute missing values and recode any unspecified Brand.Code as “U”. We first check whether imputation is actually required by checking for missing data.

PlotMissing(beverages_eval)

We will impute the missing values for all our predictors.

# Recode unspecified brand codes as "U"
beverages_eval$Brand.Code[beverages_eval$Brand.Code==""] = "U"

# Set aside the target (all missing in the evaluation set) before imputation
beverages_eval_predictor <- beverages_eval
beverages_eval_predictor$PH <- NULL

# Impute missing predictor values with mice
mice_imputes_eval <- mice(beverages_eval_predictor, m = 2, maxit = 2, print = FALSE, seed = 143)

beverages_eval_v1 <- complete(mice_imputes_eval)

# Reattach the (empty) PH column, keeping its original name
beverages_eval_v1$PH <- beverages_eval$PH
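As a quick sanity check, we can confirm that no missing values remain among the imputed predictors (PH is excluded, since it is the value to be predicted):

# Count remaining NAs per predictor; all should be zero after imputation
colSums(is.na(beverages_eval_v1[, names(beverages_eval_v1) != "PH"]))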

We will now predict pH using our selected model; the results will be written to a .csv file.

prediction_rf <- predict(rf_M1_Data1, beverages_eval_v1)

# Write predictions to a .csv file for delivery
write.csv(prediction_rf, file = "prediction_rf1.csv")

Conclusion

The most accurate models were the random forest models. The selected model uses untransformed inputs and has a relatively fast processing time, making it more operationally attractive. Its results indicate that the chief factors driving pH are Mnf.Flow, Brand.Code “C”, Pressure.Vacuum, Oxygen.Filler, Air.Pressurer, Alch.Rel, Balling.Lvl, Carb.Rel, Hyd.Pressure3, and Temperature.

References:

EDA
https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html

Data Transformation
https://www.r-bloggers.com/near-zero-variance-predictors-should-we-remove-them/
http://rcompanion.org/handbook/I_12.html

Model Selection
https://rpubs.com/Isaac/caret_reg