Business Case

New FDA regulations require the ABC Company to understand its manufacturing process as it relates to the pH of the beverages it produces. To this end, the production manager has commissioned a predictive model to identify the factors that drive pH and to report on how well pH can be predicted.

Business Considerations

The ABC Company has limited resources for processing large amounts of data and analyzing the results and is working under an accelerated timeline to meet the new regulations. Consequently, the model selected should involve limited data cleaning, lend itself well to automation, and should predict pH with a fair degree of accuracy. More importantly, the interpretation of the model should provide actionable insights into the primary levers of pH levels.

Deliverables

  • An executive-level report of the findings
  • A detailed technical report to be reviewed by an outside consultant

Technical Considerations

The team is operating under a very tight deadline: the deliverables must be submitted by 05/22/2018. Since the team is working remotely from various locations, the following tools were adopted to support efficient communication:

  • Slack was used for daily communication during the project and for quick access to code and documentation.
  • GoToMeeting was utilized for regular touch point meetings and on as needed basis.
  • GitHub was used for version control and to ensure each team member had access to the latest version of the project documentation.
  • R was used to perform the analysis; the R code can be found in Appendix A of the technical report.

Team Members
Prashant Bhuyan (Team Leader)
Bruce Hao
Cheryl Bowersox
Chris Estevez
Valerie Briot

# Please update DataExplorer to the version below
pack_URL= "https://cran.r-project.org/src/contrib/Archive/DataExplorer/DataExplorer_0.4.0.tar.gz"
if (!require("DataExplorer")) install.packages(pack_URL, repos=NULL, type='source')

library(psych)        # EDA, describe function  
library(tidyverse)    # data manipulation and plotting
library(knitr)        # kable tables
library(VIM)          # missing data visualization
library(caret)        # correlation, model building
library(corrplot)     # Correlation
library(mice)         # Imputation
library(MASS)         # BoxCox Transformation
library("usdm")
library(forecast)     # alternate transform
library(ranger)       # Random Forest Model
library(randomForest) # Random Forest Model
library(parallel)     # Parallel processing for model building
library(doParallel)   # Parallel processing for model building

rm(pack_URL)

Data Set

The analysis will be performed on historical data. For reproducibility of the results, the data was loaded to and accessed from a GitHub repository.

#set file name - to be changed to the GitHub link#
#beverage_filename <- 

# Load training data set
beverages <-read.csv("StudentData.csv", header=TRUE, sep=",",stringsAsFactors = F)
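Once the repository is in place, the local file name can be replaced with the raw GitHub URL so that anyone can reproduce the load step; a sketch with a hypothetical path:

# Hypothetical raw GitHub URL - substitute the repository's actual path
beverage_url <- "https://raw.githubusercontent.com/<team>/<repo>/master/StudentData.csv"
beverages <- read.csv(beverage_url, header=TRUE, sep=",", stringsAsFactors = F)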

#From DataExplorer
data_list <- list(beverages)

PlotStr(data_list, type="r")

rm(data_list)
dim(beverages)
## [1] 2571   33
summary(beverages)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5      
##    Carb.Flow       Density           MFR           Balling      
##  Min.   :  26   Min.   :0.240   Min.   : 31.4   Min.   :-0.170  
##  1st Qu.:1144   1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496  
##  Median :3028   Median :0.980   Median :724.0   Median : 1.648  
##  Mean   :2468   Mean   :1.174   Mean   :704.0   Mean   : 2.198  
##  3rd Qu.:3186   3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292  
##  Max.   :5104   Max.   :1.920   Max.   :868.6   Max.   : 4.012  
##  NA's   :2      NA's   :1       NA's   :212     NA's   :1       
##  Pressure.Vacuum        PH        Oxygen.Filler     Bowl.Setpoint  
##  Min.   :-6.600   Min.   :7.880   Min.   :0.00240   Min.   : 70.0  
##  1st Qu.:-5.600   1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0  
##  Median :-5.400   Median :8.540   Median :0.03340   Median :120.0  
##  Mean   :-5.216   Mean   :8.546   Mean   :0.04684   Mean   :109.3  
##  3rd Qu.:-5.000   3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0  
##  Max.   :-3.600   Max.   :9.360   Max.   :0.40000   Max.   :140.0  
##                   NA's   :4       NA's   :12        NA's   :2      
##  Pressure.Setpoint Air.Pressurer      Alch.Rel        Carb.Rel    
##  Min.   :44.00     Min.   :140.8   Min.   :5.280   Min.   :4.960  
##  1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340  
##  Median :46.00     Median :142.6   Median :6.560   Median :5.400  
##  Mean   :47.62     Mean   :142.8   Mean   :6.897   Mean   :5.437  
##  3rd Qu.:50.00     3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540  
##  Max.   :52.00     Max.   :148.2   Max.   :8.620   Max.   :6.060  
##  NA's   :12                        NA's   :9       NA's   :10     
##   Balling.Lvl  
##  Min.   :0.00  
##  1st Qu.:1.38  
##  Median :1.48  
##  Mean   :2.05  
##  3rd Qu.:3.14  
##  Max.   :3.66  
##  NA's   :1
object.size(beverages)
## 642096 bytes

The dataset comprises 33 variables and 2,571 observations. At first glance, it is clear that some variables have missing values that will need to be addressed. All variables besides Brand.Code are numeric.
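A quick type check (sketch) confirms this:

# Tally variable classes: all numeric/integer except Brand.Code (character)
table(sapply(beverages, class))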

Data Exploration and Statistical Measures

The purpose of the data exploration and statistical measures phase is to understand the data and determine how to process the dataset for modelling.

Descriptive Statistics

Descriptive statistics were calculated to examine the basic features of the data.

#Calculate missing value counts and ratios per variable
missing_values <- beverages %>% 
  summarize_all(funs(sum(is.na(.))))

missing_values_ratio <- beverages %>% 
  summarize_all(funs(sum(is.na(.)) / length(.)*100))

#Use the describe function (psych package) to calculate descriptive statistics
(beverages_d <- describe(beverages, na.rm=TRUE, interp=FALSE, skew=TRUE, ranges=TRUE, trim=.1, type=3, check=TRUE, fast=FALSE, quant=c(.25,.75), IQR=TRUE))
beverages_d$missing <- t(missing_values)
beverages_d$miss_ratio <- t(round(missing_values_ratio,4))

beverages_d <- beverages_d %>% 
  dplyr::select(n, missing, miss_ratio, mean, sd, min, max, skew, kurtosis, median, IQR, Q0.25, Q0.75)

kable(beverages_d)
| | n | missing | miss_ratio | mean | sd | min | max | skew | kurtosis | median | IQR | Q0.25 | Q0.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brand.Code* | 2571 | 0 | 0.0000 | NaN | NA | Inf | -Inf | NA | NA | NA | NA | NA | NA |
| Carb.Volume | 2561 | 10 | 0.3890 | 5.3701978 | 0.1063852 | 5.0400000 | 5.700 | 0.3922121 | -0.4669916 | 5.3466667 | 0.1600000 | 5.2933333 | 5.453333 |
| Fill.Ounces | 2533 | 38 | 1.4780 | 23.9747546 | 0.0875299 | 23.6333333 | 24.320 | -0.0215452 | 0.8624714 | 23.9733333 | 0.1066667 | 23.9200000 | 24.026667 |
| PC.Volume | 2532 | 39 | 1.5169 | 0.2771187 | 0.0606953 | 0.0793333 | 0.478 | 0.3396269 | 0.6699690 | 0.2713333 | 0.0728333 | 0.2391667 | 0.312000 |
| Carb.Pressure | 2544 | 27 | 1.0502 | 68.1895755 | 3.5382039 | 57.0000000 | 79.400 | 0.1822162 | -0.0138046 | 68.2000000 | 5.0000000 | 65.6000000 | 70.600000 |
| Carb.Temp | 2545 | 26 | 1.0113 | 141.0949234 | 4.0373861 | 128.6000000 | 154.000 | 0.2468280 | 0.2375822 | 140.8000000 | 5.4000000 | 138.4000000 | 143.800000 |
| PSC | 2538 | 33 | 1.2835 | 0.0845737 | 0.0492690 | 0.0020000 | 0.270 | 0.8491445 | 0.6480498 | 0.0760000 | 0.0640000 | 0.0480000 | 0.112000 |
| PSC.Fill | 2548 | 23 | 0.8946 | 0.1953689 | 0.1177817 | 0.0000000 | 0.620 | 0.9334450 | 0.7691466 | 0.1800000 | 0.1600000 | 0.1000000 | 0.260000 |
| PSC.CO2 | 2532 | 39 | 1.5169 | 0.0564139 | 0.0430387 | 0.0000000 | 0.240 | 1.7288937 | 3.7250025 | 0.0400000 | 0.0600000 | 0.0200000 | 0.080000 |
| Mnf.Flow | 2569 | 2 | 0.0778 | 24.5689373 | 119.4811263 | -100.2000000 | 229.400 | 0.0041430 | -1.8697072 | 65.2000000 | 240.8000000 | -100.0000000 | 140.800000 |
| Carb.Pressure1 | 2539 | 32 | 1.2447 | 122.5863726 | 4.7428819 | 105.6000000 | 140.200 | 0.0543587 | 0.1418265 | 123.2000000 | 6.4000000 | 119.0000000 | 125.400000 |
| Fill.Pressure | 2549 | 22 | 0.8557 | 47.9221656 | 3.1775457 | 34.6000000 | 60.400 | 0.5471107 | 1.4067532 | 46.4000000 | 4.0000000 | 46.0000000 | 50.000000 |
| Hyd.Pressure1 | 2560 | 11 | 0.4278 | 12.4375781 | 12.4332538 | -0.8000000 | 58.000 | 0.7798043 | -0.1426463 | 11.4000000 | 20.2000000 | 0.0000000 | 20.200000 |
| Hyd.Pressure2 | 2556 | 15 | 0.5834 | 20.9610329 | 16.3863066 | 0.0000000 | 59.400 | -0.3019570 | -1.5592984 | 28.6000000 | 34.6000000 | 0.0000000 | 34.600000 |
| Hyd.Pressure3 | 2556 | 15 | 0.5834 | 20.4584507 | 15.9757236 | -1.2000000 | 50.000 | -0.3189061 | -1.5745834 | 27.6000000 | 33.4000000 | 0.0000000 | 33.400000 |
| Hyd.Pressure4 | 2541 | 30 | 1.1669 | 96.2888627 | 13.1225594 | 52.0000000 | 142.000 | 0.5459786 | 0.6340041 | 96.0000000 | 16.0000000 | 86.0000000 | 102.000000 |
| Filler.Level | 2551 | 20 | 0.7779 | 109.2523716 | 15.6984241 | 55.8000000 | 161.200 | -0.8482847 | 0.0460488 | 118.4000000 | 21.7000000 | 98.3000000 | 120.000000 |
| Filler.Speed | 2514 | 57 | 2.2170 | 3687.1988862 | 770.8200208 | 998.0000000 | 4030.000 | -2.8700359 | 6.7059692 | 3982.0000000 | 110.0000000 | 3888.0000000 | 3998.000000 |
| Temperature | 2557 | 14 | 0.5445 | 65.9675401 | 1.3827783 | 63.6000000 | 76.200 | 2.3869732 | 10.1612904 | 65.6000000 | 1.2000000 | 65.2000000 | 66.400000 |
| Usage.cont | 2566 | 5 | 0.1945 | 20.9929618 | 2.9779364 | 12.0800000 | 25.900 | -0.5353253 | -1.0170230 | 21.7900000 | 5.3950000 | 18.3600000 | 23.755000 |
| Carb.Flow | 2569 | 2 | 0.0778 | 2468.3542234 | 1073.6964743 | 26.0000000 | 5104.000 | -0.9877287 | -0.5826893 | 3028.0000000 | 2042.0000000 | 1144.0000000 | 3186.000000 |
| Density | 2570 | 1 | 0.0389 | 1.1736498 | 0.3775269 | 0.2400000 | 1.920 | 0.5260149 | -1.1992070 | 0.9800000 | 0.7200000 | 0.9000000 | 1.620000 |
| MFR | 2359 | 212 | 8.2458 | 704.0492582 | 73.8983094 | 31.4000000 | 868.600 | -5.0917729 | 30.4558939 | 724.0000000 | 24.7000000 | 706.3000000 | 731.000000 |
| Balling | 2570 | 1 | 0.0389 | 2.1977696 | 0.9310914 | -0.1700000 | 4.012 | 0.5939224 | -1.3855651 | 1.6480000 | 1.7960000 | 1.4960000 | 3.292000 |
| Pressure.Vacuum | 2571 | 0 | 0.0000 | -5.2161027 | 0.5699933 | -6.6000000 | -3.600 | 0.5256608 | -0.0313126 | -5.4000000 | 0.6000000 | -5.6000000 | -5.000000 |
| PH | 2567 | 4 | 0.1556 | 8.5456486 | 0.1725162 | 7.8800000 | 9.360 | -0.2906437 | 0.0644294 | 8.5400000 | 0.2400000 | 8.4400000 | 8.680000 |
| Oxygen.Filler | 2559 | 12 | 0.4667 | 0.0468426 | 0.0466436 | 0.0024000 | 0.400 | 2.6603955 | 11.0882098 | 0.0334000 | 0.0380000 | 0.0220000 | 0.060000 |
| Bowl.Setpoint | 2569 | 2 | 0.0778 | 109.3265862 | 15.3031541 | 70.0000000 | 140.000 | -0.9743842 | -0.0564212 | 120.0000000 | 20.0000000 | 100.0000000 | 120.000000 |
| Pressure.Setpoint | 2559 | 12 | 0.4667 | 47.6153966 | 2.0390474 | 44.0000000 | 52.000 | 0.2031970 | -1.6012622 | 46.0000000 | 4.0000000 | 46.0000000 | 50.000000 |
| Air.Pressurer | 2571 | 0 | 0.0000 | 142.8339946 | 1.2119170 | 140.8000000 | 148.200 | 2.2521053 | 4.7336291 | 142.6000000 | 0.8000000 | 142.2000000 | 143.000000 |
| Alch.Rel | 2562 | 9 | 0.3501 | 6.8974161 | 0.5052753 | 5.2800000 | 8.620 | 0.8836378 | -0.8506221 | 6.5600000 | 0.7000000 | 6.5400000 | 7.240000 |
| Carb.Rel | 2561 | 10 | 0.3890 | 5.4367825 | 0.1287183 | 4.9600000 | 6.060 | 0.5032472 | -0.2949480 | 5.4000000 | 0.2000000 | 5.3400000 | 5.540000 |
| Balling.Lvl | 2570 | 1 | 0.0389 | 2.0500078 | 0.8703089 | 0.0000000 | 3.660 | 0.5858456 | -1.4858636 | 1.4800000 | 1.7600000 | 1.3800000 | 3.140000 |
rm(beverages_d,missing_values,missing_values_ratio)

From the skewness coefficients, we observe that some variables have right-skewed distributions (PSC.CO2, Temperature, Oxygen.Filler, Air.Pressurer) and others left-skewed distributions (Filler.Speed, MFR). As observed earlier, some variables have missing values, which we will need to take into consideration.
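The skewness coefficients for the flagged variables can be pulled straight from the describe output; a sketch:

# Skewness of the flagged variables (positive = right skew, negative = left skew)
psych::describe(beverages[, c("PSC.CO2", "Temperature", "Oxygen.Filler",
                              "Air.Pressurer", "Filler.Speed", "MFR")])[, "skew"]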

Analysis of predictors

We will now examine each predictor to understand its distribution and determine whether any transformation is required.

# from DataExplorer Package

DataExplorer::HistogramContinuous(beverages)

Variable to Variable Analysis

Correlation Analysis

The correlation matrix shown below highlights correlations among several predictor variables.

# From DataExplorer Package
#plot_correlation(beverages, use = "pairwise.complete.obs")
DataExplorer::CorrelationContinuous(beverages, use = "pairwise.complete.obs")

cor_mx =cor(beverages%>% dplyr::select(-Brand.Code) ,use="pairwise.complete.obs", method = "pearson")

corrplot(cor_mx, method = "color", type = "upper", order = "original",
         number.cex = .7, addCoef.col = "black",  # add correlation coefficients
         tl.srt = 90,                             # text label rotation
         diag = TRUE)                             # show the principal diagonal

rm(cor_mx)

Let us now look at the correlation between the response variable (pH) and the predictors.
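A quick check (sketch) sorts the numeric predictors by their correlation with PH:

# Correlation of each numeric predictor with the response PH
ph_cor <- cor(beverages %>% dplyr::select(-Brand.Code),
              use = "pairwise.complete.obs")[, "PH"]
sort(ph_cor[names(ph_cor) != "PH"], decreasing = TRUE)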

This section tests the predictor variables to determine if there is correlation among them. The variance inflation factor (VIF) is used to detect multicollinearity among the entire set of predictors, rather than just within pairs of variables.

Testing a subset of the numeric predictors for collinearity, vifcor flags Hyd.Pressure3 as having a collinearity problem; after excluding it, the remaining variables show low VIF scores.

# from the usdm package
beverages_predictors <- dplyr::select(beverages, -PH)

numeric_fields <- dplyr::select_if(beverages_predictors, is.numeric)[, 3:15]  # a 13-variable subset of the numeric predictors

usdm::vifcor(numeric_fields) 
## 1 variables from the 13 input variables have collinearity problem: 
##  
## Hyd.Pressure3 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( Fill.Pressure ~ PSC.Fill ):  0.001947733 
## max correlation ( Carb.Temp ~ Carb.Pressure ):  0.8233431 
## 
## ---------- VIFs of the remained variables -------- 
##         Variables      VIF
## 1       PC.Volume 1.340747
## 2   Carb.Pressure 4.043068
## 3       Carb.Temp 3.731096
## 4             PSC 1.120366
## 5        PSC.Fill 1.093121
## 6         PSC.CO2 1.053359
## 7        Mnf.Flow 2.404566
## 8  Carb.Pressure1 1.312199
## 9   Fill.Pressure 1.291768
## 10  Hyd.Pressure1 2.586587
## 11  Hyd.Pressure2 3.752048
## 12  Hyd.Pressure4 1.325928
rm(beverages_predictors,numeric_fields)

Data Transformation

Missing Values

We have observed some predictors with missing values; however, no predictor is missing more than 8% of its data and no rows are missing more than ?? of their data. We are therefore comfortable imputing the missing data.
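A row-wise check (sketch) quantifies the worst case for the observations:

# Largest share of missing fields in any single row, as a percentage
max(rowMeans(is.na(beverages))) * 100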

# From Data Explorer  
PlotMissing(beverages)

Examination of Zero values

In some cases, zero values actually represent missing data; is this the case here?

df <- setNames(data.frame(colSums(beverages==0, na.rm = T)), 'Count')
           
df$Variable <- rownames(df)

rownames(df) <- NULL

df %>% dplyr::filter(!Variable %in% c("Brand.Code")) %>%  
ggplot(aes(x=reorder(Variable, Count), y=Count, fill=Count)) +
    geom_bar(stat="identity") + coord_flip() + guides(fill=FALSE) +
    xlab("Variable") + ylab("Number of 0 Values") + 
    ggtitle("Count of Zero Values by Variable") +
    geom_text(aes(label=Count), vjust=.5, hjust=-.1,position= position_dodge(width=0.5),size=3,  color="black")

rm(df)

We observed a high number of zero values for the variables Hyd.Pressure1, Hyd.Pressure2, and Hyd.Pressure3, and we will add dummy variables to flag such records. Also, based on the correlation coefficients, we will likely drop Hyd.Pressure3.

Brand.Code has a proportion of its data that is unspecified; we will flag these records with a "U", for "unknown".

Data Imputation

Since we have a limited amount of missing values across the predictors, we will impute the data using the mice package.

# Replace blank Brand.Code values with "U"

beverages$Brand.Code[beverages$Brand.Code==""]= "U"

summary(beverages$Brand.Code)
##    Length     Class      Mode 
##      2571 character character
mice_imputes =mice(beverages, m = 2, maxit = 2, print = FALSE,seed = 143)
densityplot(mice_imputes)

The imputed density distribution is indicated in red.

# Apply the imputed values to create version 1 of the dataset
beverages_v1 =complete(mice_imputes)

# Plot missing values
PlotMissing(beverages_v1)

rm(mice_imputes)
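As a sanity check, imputation should leave the overall distribution of a predictor essentially unchanged; a sketch using MFR, the variable with the most missing values:

# Compare MFR before imputation (212 NAs) and after (complete)
summary(beverages$MFR)
summary(beverages_v1$MFR)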

We have addressed the missing values. We will continue to investigate possible problems with the predictors by checking for near-zero variance.

Near-Zero Variance Predictors

By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and when the frequency ratio mentioned above is greater than 19 (95/5).

These default values can be changed by setting the arguments uniqueCut and freqCut.
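For example, a stricter screen could be run with custom cut-offs (the values here are illustrative):

# Flag variables with fewer than 5% unique values and a frequency ratio above 99
nearZeroVar(beverages_v1, freqCut = 99, uniqueCut = 5, saveMetrics = TRUE)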

# From Caret package
#
x = nearZeroVar(beverages_v1, saveMetrics = TRUE)

str(x, vec.len = 2)
## 'data.frame':    33 obs. of  4 variables:
##  $ freqRatio    : num  2.01 1.06 ...
##  $ percentUnique: num  0.194 3.928 ...
##  $ zeroVar      : logi  FALSE FALSE FALSE ...
##  $ nzv          : logi  FALSE FALSE FALSE ...
x[x[,"zeroVar"] > 0, ]
x[x[,"nzv"] > 0, ]
rm(x)

Since only one variable is flagged as near-zero variance and its percentage of unique values is very close to the cut-off, we will not drop this variable.

Features Creation

Invalid data or bimodal distributions

Based on the histograms above, it is clear that some variables have bimodal distributions or a large number of records with what appear to be invalid data, for example, Mnf.Flow and the Hyd.Pressure variables. While some models may be able to handle such data without modification, others may not. As such, we will create dummy variables to flag which distribution a given record belongs to within each variable.

Note that this step takes place after imputing the missing data, so the flags reflect the completed dataset.

beverages_v2 = beverages_v1 %>%
  mutate(Mnf.Flow.lt0        = if_else(Mnf.Flow      <     0, 1, 0)) %>% 
  mutate(Hyd.Pressure1.lte0  = if_else(Hyd.Pressure1 <=    0 ,1, 0)) %>% 
  mutate(Hyd.Pressure2.lte0  = if_else(Hyd.Pressure2 <=    0, 1, 0)) %>% # no flag for Hyd.Pressure3 since that variable will be dropped
  mutate(Filler.Speed.lt2500 = if_else(Filler.Speed  <  2500, 1, 0)) %>% 
  mutate(Carb.Flow.lt2000    = if_else(Carb.Flow     <  2000, 1, 0)) %>% 
  mutate(Balling.lt.2.5      = if_else(Balling       <   2.5, 1, 0))

Dropping predictors

Based on the correlation results, we propose dropping the following predictors: Density, Balling.Lvl, Carb.Rel, Alch.Rel, and Hyd.Pressure3.

# Drop some predictors due to high correlation  
beverages_v2$Density <- NULL
beverages_v2$Balling.Lvl <- NULL
beverages_v2$Carb.Rel <- NULL
beverages_v2$Alch.Rel <- NULL
beverages_v2$Hyd.Pressure3 <- NULL

Box-Cox Transformations

We have observed significant skewness for several variables and propose applying a Box-Cox transformation to PSC and Oxygen.Filler. Note that PSC.Fill and PSC.CO2 contain zero values and cannot be Box-Cox transformed directly, so a small offset (1e-7) is added to make them strictly positive before estimating their lambdas.

# Copy our data set
beverages_v3 <- beverages_v2
offset <- 0.0000001   # small offset so zero-valued variables can be transformed

# Helper: return the Box-Cox lambda with the greatest log-likelihood
best_lambda <- function(x) {
  Box <- boxcox(x ~ 1,                     # from MASS; fit against an intercept only
                lambda = seq(-6, 6, 0.1))  # try lambda values -6 to 6 by 0.1
  Box$x[which.max(Box$y)]                  # lambda maximizing the log-likelihood
}

# PSC (strictly positive, no offset needed)
lambda.PSC <- best_lambda(beverages_v3$PSC)

# PSC.Fill contains zeros; add the offset before estimating lambda
beverages_v3$PSC.Fill <- beverages_v3$PSC.Fill + offset
lambda.PSC_Fill <- best_lambda(beverages_v3$PSC.Fill)

# PSC.CO2 contains zeros; add the offset before estimating lambda
beverages_v3$PSC.CO2 <- beverages_v3$PSC.CO2 + offset
lambda.PSC_CO2 <- best_lambda(beverages_v3$PSC.CO2)

# Oxygen.Filler (strictly positive)
lambda.Oxygen_Filler <- best_lambda(beverages_v3$Oxygen.Filler)

# Display the selected lambdas
c(PSC = lambda.PSC, PSC.Fill = lambda.PSC_Fill,
  PSC.CO2 = lambda.PSC_CO2, Oxygen.Filler = lambda.Oxygen_Filler)

# Alternative using the forecast package:
# a = BoxCox.lambda(beverages_v2$PSC.Fill)
# b = BoxCox(beverages_v2$PSC.Fill, a)
# HistogramContinuous(b)

rm(offset)

The selected lambda for PSC is `r lambda.PSC`, for PSC.Fill `r lambda.PSC_Fill`, for PSC.CO2 `r lambda.PSC_CO2`, and for Oxygen.Filler `r lambda.Oxygen_Filler`.

This completes the transformations on the data set; any additional transformations will be performed in the model-building phase, as they are model dependent.
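For instance, caret can estimate and apply Box-Cox transformations as a pre-processing step at training time; a minimal sketch:

# Let caret estimate Box-Cox lambdas and apply them (strictly positive columns only)
pp <- preProcess(beverages_v3[, c("PSC", "Oxygen.Filler")], method = "BoxCox")
head(predict(pp, beverages_v3[, c("PSC", "Oxygen.Filler")]))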

HistogramContinuous(beverages_v3)


Dataset 4

Data set 4 converts the response into a categorical variable: Neutral (pH <= 8.5) or Alkaline (pH > 8.5). Brand.Code is also converted to a factor.

beverages_v4= beverages_v1

beverages_v4$PH = ifelse(beverages_v4$PH <= 8.5, "Neutral", "Alkaline")
beverages_v4$PH = factor(beverages_v4$PH )
beverages_v4$Brand.Code = factor(beverages_v4$Brand.Code) 

Model Building

We will explore and build various models to identify the most significant variables that influence pH and to predict pH values.

Data Splitting

We have three versions of the data set on which to base our regression models, plus a fourth for the classification model. Additional transformations such as scaling and centering may also be applied at the time of model building:

  • version 1: the imputed data set, with missing Brand.Code values (blanks) imputed as 'U'
  • version 2: based on version 1, with additional features and with highly correlated variables dropped
  • version 3: based on version 2, with Box-Cox transformations for the most skewed variables
# Split each data set version 70/30 into training and test sets, using the same sampled rows
set.seed(143) 
sample = sample.int(n = nrow(beverages), size = floor(.70*nrow(beverages)), replace = F)

beverages_v1_train = beverages_v1[sample, ]
beverages_v1_test  = beverages_v1[-sample,]

beverages_v2_train = beverages_v2[sample, ]
beverages_v2_test  = beverages_v2[-sample,]

beverages_v3_train = beverages_v3[sample, ]
beverages_v3_test  = beverages_v3[-sample,]

beverages_v4_train = beverages_v4[sample, ]
beverages_v4_test  = beverages_v4[-sample,]

rm(beverages_v1,beverages_v2,beverages_v3,beverages_v4,beverages,sample)
Mycluster = makeCluster(detectCores() - 1)  # parallel backend: use all but one core
registerDoParallel(Mycluster)

myControl = trainControl(method = 'cv', number = 5, 
  verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)

GLM

# first, start with a generalized linear model
set.seed(143)
GLM_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glm',preProcess = c('center', 'scale'), trControl = myControl)
GLM_M1_Data1
## Generalized Linear Model 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1321942  0.4198535  0.1016524
set.seed(143)
GLM_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M2_Data2
## Generalized Linear Model 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1327388  0.4153035  0.1039857
set.seed(143)
GLM_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M3_Data3
## Generalized Linear Model 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1328461  0.4143476  0.1042366

glmnet

# next, we'll try a glmnet model which combines lasso and ridge regression 
set.seed(143)
glmnet_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)
glmnet_M1_Data1
## glmnet 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001599694  0.1321682  0.4200101  0.1016777
##   0.10   0.0015996935  0.1322468  0.4191405  0.1020003
##   0.10   0.0159969354  0.1344230  0.4028236  0.1050963
##   0.55   0.0001599694  0.1321767  0.4198747  0.1017159
##   0.55   0.0015996935  0.1326398  0.4157909  0.1026923
##   0.55   0.0159969354  0.1391558  0.3693005  0.1092707
##   1.00   0.0001599694  0.1321877  0.4197298  0.1017613
##   1.00   0.0015996935  0.1331279  0.4118142  0.1033393
##   1.00   0.0159969354  0.1439824  0.3318423  0.1128677
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001599694.
set.seed(143)
glmnet_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M2_Data2
## glmnet 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001672247  0.1326933  0.4155451  0.1038747
##   0.10   0.0016722466  0.1328820  0.4135047  0.1038632
##   0.10   0.0167224659  0.1343540  0.4025188  0.1051299
##   0.55   0.0001672247  0.1326944  0.4154579  0.1038483
##   0.55   0.0016722466  0.1333793  0.4090589  0.1041986
##   0.55   0.0167224659  0.1386539  0.3740453  0.1086213
##   1.00   0.0001672247  0.1326960  0.4153719  0.1038397
##   1.00   0.0016722466  0.1336176  0.4071727  0.1044491
##   1.00   0.0167224659  0.1439849  0.3307979  0.1127147
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001672247.
set.seed(143)
glmnet_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M3_Data3
## glmnet 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001672247  0.1327977  0.4145988  0.1041364
##   0.10   0.0016722466  0.1329433  0.4128907  0.1040781
##   0.10   0.0167224659  0.1343872  0.4021404  0.1052213
##   0.55   0.0001672247  0.1328009  0.4144859  0.1041151
##   0.55   0.0016722466  0.1333806  0.4089906  0.1043980
##   0.55   0.0167224659  0.1386158  0.3743946  0.1085942
##   1.00   0.0001672247  0.1328006  0.4144030  0.1040986
##   1.00   0.0016722466  0.1336274  0.4070431  0.1046215
##   1.00   0.0167224659  0.1439831  0.3308204  0.1127133
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0001672247.
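To see which predictors the elastic-net penalty retains, the coefficients of the final fit can be extracted at the selected lambda; a sketch:

# Coefficients of the final glmnet model at the lambda chosen by caret
coef(glmnet_M1_Data1$finalModel, s = glmnet_M1_Data1$bestTune$lambda)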

ranger

# let's also model using random forest just for fun
set.seed(143)
ranger_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)
ranger_M1_Data1
## Random Forest 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1148168  0.6082763  0.08645160
##    2    extratrees  0.1191566  0.5732928  0.09119231
##   18    variance    0.1022603  0.6658250  0.07395862
##   18    extratrees  0.1016529  0.6675300  0.07357492
##   35    variance    0.1025035  0.6570685  0.07348070
##   35    extratrees  0.1000349  0.6754776  0.07205404
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 35, splitrule =
##  extratrees and min.node.size = 5.
set.seed(143)
ranger_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M2_Data2
## Random Forest 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1183735  0.5767453  0.09000200
##    2    extratrees  0.1219389  0.5435630  0.09361688
##   19    variance    0.1042656  0.6494408  0.07637783
##   19    extratrees  0.1036684  0.6506936  0.07604484
##   36    variance    0.1046413  0.6425340  0.07615347
##   36    extratrees  0.1021818  0.6590873  0.07477257
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 36, splitrule =
##  extratrees and min.node.size = 5.
set.seed(143)
ranger_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M3_Data3
## Random Forest 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##    2    variance    0.1191127  0.5698374  0.09056467
##    2    extratrees  0.1222293  0.5412470  0.09392975
##   19    variance    0.1042516  0.6498560  0.07627652
##   19    extratrees  0.1041126  0.6479240  0.07647817
##   37    variance    0.1047951  0.6413251  0.07615708
##   37    extratrees  0.1024866  0.6571123  0.07519137
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 37, splitrule =
##  extratrees and min.node.size = 5.

PLS

Since we observed correlation among the predictor variables, we will also build a Partial Least Squares model.

set.seed(143)
pls_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method ='pls', preProcess = c('center', 'scale'), tuneLength = 15, trControl = myControl)
pls_M1_Data1
## Partial Least Squares 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##   1      0.1494403  0.2581107  0.1177883
##   2      0.1403749  0.3456090  0.1099254
##   3      0.1378683  0.3686338  0.1086210
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method ='pls', tuneLength = 15, trControl = myControl)
pls_M2_Data2
## Partial Least Squares 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##   1      0.1705840  0.03468439  0.1360308
##   2      0.1691324  0.04868677  0.1352213
##   3      0.1536504  0.21647890  0.1197444
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method ='pls', tuneLength = 15, trControl = myControl)
pls_M3_Data3
## Partial Least Squares 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##   1      0.1705840  0.03468441  0.1360308
##   2      0.1691324  0.04868680  0.1352213
##   3      0.1536504  0.21647873  0.1197445
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

SVM

set.seed(143)
svmRadial_M1_Data1 =train(PH~.,beverages_v1_train, metric = 'RMSE', method = "svmRadial",preProc =c("center", "scale"),tuneLength = 14, trControl = myControl)
svmRadial_M1_Data1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1271623  0.4763276  0.09440391
##      0.50  0.1237651  0.4996104  0.09135600
##      1.00  0.1203705  0.5237636  0.08824463
##      2.00  0.1177148  0.5422556  0.08548582
##      4.00  0.1160196  0.5542984  0.08379616
##      8.00  0.1157321  0.5578480  0.08375405
##     16.00  0.1173556  0.5510887  0.08532832
##     32.00  0.1209362  0.5333375  0.08834497
##     64.00  0.1264312  0.5071734  0.09295419
##    128.00  0.1339292  0.4748267  0.09843867
##    256.00  0.1413222  0.4466638  0.10391393
##    512.00  0.1458019  0.4293654  0.10746593
##   1024.00  0.1476611  0.4221392  0.10865742
##   2048.00  0.1476611  0.4221392  0.10865742
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01892239
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01892239 and C = 8.
set.seed(143)
svmRadial_M2_Data2 =train(PH~.,beverages_v2_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M2_Data2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1266025  0.4790584  0.09407358
##      0.50  0.1240029  0.4962984  0.09161310
##      1.00  0.1211552  0.5161218  0.08903764
##      2.00  0.1190476  0.5308862  0.08701478
##      4.00  0.1174062  0.5430372  0.08576343
##      8.00  0.1177581  0.5424166  0.08617114
##     16.00  0.1189869  0.5385996  0.08746447
##     32.00  0.1224791  0.5220192  0.09017677
##     64.00  0.1272165  0.5005933  0.09363698
##    128.00  0.1350334  0.4651577  0.09944323
##    256.00  0.1450232  0.4250946  0.10690726
##    512.00  0.1523229  0.3972286  0.11217570
##   1024.00  0.1563843  0.3830348  0.11491475
##   2048.00  0.1587342  0.3766594  0.11665990
## 
## Tuning parameter 'sigma' was held constant at a value of 0.018282
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.018282 and C = 4.
set.seed(143)
svmRadial_M3_Data3 =train(PH~.,beverages_v3_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M3_Data3
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1270587  0.4752379  0.09442825
##      0.50  0.1243818  0.4933657  0.09170709
##      1.00  0.1215350  0.5134861  0.08906423
##      2.00  0.1195646  0.5269365  0.08746778
##      4.00  0.1180271  0.5382430  0.08653561
##      8.00  0.1180227  0.5401329  0.08624021
##     16.00  0.1190728  0.5382667  0.08744345
##     32.00  0.1231344  0.5185287  0.09067732
##     64.00  0.1277064  0.4976801  0.09401868
##    128.00  0.1360166  0.4596066  0.10049839
##    256.00  0.1463599  0.4181806  0.10841813
##    512.00  0.1528888  0.3935027  0.11342022
##   1024.00  0.1556328  0.3859861  0.11575868
##   2048.00  0.1568338  0.3823311  0.11670405
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01751442
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01751442 and C = 8.
#plot(svmRadial_M1_Data1, scales = list(x=list(log=2)))

rf

control_forest = trainControl(method="repeatedcv", number=5, repeats=2, search="random",allowParallel = T)
mtry = sqrt(ncol(beverages_v1_train))


set.seed(143)
rf_M1_Data1 = train(PH~., data=beverages_v1_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest,importance =T)
rf_M1_Data1
## Random Forest 
## 
## 1799 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1269387  0.5384941  0.09774909
##    3    0.1113403  0.6259804  0.08295917
##   15    0.1027157  0.6647482  0.07437338
##   26    0.1019036  0.6644823  0.07329716
##   34    0.1024474  0.6573263  0.07334071
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
mtry = sqrt(ncol(beverages_v2_train))

set.seed(143)
rf_M2_Data2 = train(PH~., data=beverages_v2_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M2_Data2
## Random Forest 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1334847  0.4781047  0.10352434
##    3    0.1139627  0.6015363  0.08562282
##   15    0.1048334  0.6476058  0.07680505
##   27    0.1046327  0.6438005  0.07605319
##   35    0.1050295  0.6391563  0.07608868
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 27.
mtry = sqrt(ncol(beverages_v3_train))
set.seed(143)
rf_M3_Data3 = train(PH~., data=beverages_v3_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M3_Data3
## Random Forest 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    1    0.1340604  0.4691741  0.10386163
##    3    0.1145931  0.5974142  0.08631543
##   16    0.1051138  0.6452811  0.07702960
##   28    0.1047163  0.6431253  0.07614646
##   36    0.1051171  0.6382049  0.07616957
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 28.
rm(control_forest,mtry)

knn

set.seed(143)
knn_M1_Data1 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v1_train,preProc =c("center", "scale"))
knn_M1_Data1
## k-Nearest Neighbors 
## 
## 1799 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    1  0.1552863  0.3519753  0.10524865
##    2  0.1354010  0.4441269  0.09628802
##    3  0.1288917  0.4719160  0.09184895
##    4  0.1256230  0.4900061  0.09103044
##    5  0.1239321  0.4978780  0.09017747
##    6  0.1244458  0.4922766  0.09122504
##    7  0.1240195  0.4953110  0.09169746
##    8  0.1244070  0.4920447  0.09218550
##    9  0.1244541  0.4919942  0.09257184
##   10  0.1244123  0.4917993  0.09265990
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
set.seed(143)
knn_M2_Data2 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v2_train)
knn_M2_Data2
## k-Nearest Neighbors 
## 
## 1799 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.1683200  0.2859914  0.1144311
##    2  0.1500658  0.3327492  0.1058868
##    3  0.1418583  0.3653536  0.1021784
##    4  0.1401030  0.3694645  0.1028647
##    5  0.1401516  0.3633672  0.1033130
##    6  0.1393636  0.3643894  0.1034378
##    7  0.1400595  0.3562528  0.1038978
##    8  0.1398343  0.3550924  0.1045687
##    9  0.1405531  0.3466922  0.1052976
##   10  0.1408672  0.3423547  0.1057476
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.
set.seed(143)
knn_M3_Data3 <- train(PH ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k = 1:10),
             trControl  = myControl,
             metric     = "RMSE",
             data       = beverages_v3_train)
knn_M3_Data3
## k-Nearest Neighbors 
## 
## 1799 samples
##   34 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.1678809  0.2883229  0.1142200
##    2  0.1501421  0.3320212  0.1059035
##    3  0.1419641  0.3647402  0.1022868
##    4  0.1400958  0.3695812  0.1028442
##    5  0.1401539  0.3633406  0.1033219
##    6  0.1393665  0.3643136  0.1034634
##    7  0.1400607  0.3562888  0.1039153
##    8  0.1398256  0.3552039  0.1045193
##    9  0.1405340  0.3468925  0.1052685
##   10  0.1408292  0.3426671  0.1056854
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.

Random Forest Classification (Prashant)

Using the 'randomForest' and 'caret' packages, we applied k-fold cross-validation on the training set, fitting the random forest model to five folds of the training set and averaging the results.

set.seed(143)
rf_M4_Data4 = train(PH ~ ., data = beverages_v4_train,method = "rf", prox = FALSE, trControl = myControl)
rf_M4_Data4
## Random Forest 
## 
## 1799 samples
##   32 predictor
##    2 classes: 'Alkaline', 'Neutral' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8054565  0.5981767
##   18    0.8199087  0.6314982
##   35    0.8126834  0.6174367
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
stopCluster(Mycluster)
registerDoSEQ()

From the results of the cross-validated model above, we can see that model accuracy was above 80 percent. From the confusion matrix below, the overall accuracy of the model on the test set was 83.16 percent, with a 95 percent confidence interval between 80.33 percent and 85.73 percent, a sensitivity of 87.77 percent, and a specificity of 76.14 percent.

test <- predict( rf_M4_Data4, newdata = beverages_v4_test)
cf <- confusionMatrix(data = test,  beverages_v4_test$PH)
print(cf, digits = 4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alkaline Neutral
##   Alkaline      409      73
##   Neutral        57     233
##                                           
##                Accuracy : 0.8316          
##                  95% CI : (0.8033, 0.8573)
##     No Information Rate : 0.6036          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6449          
##  Mcnemar's Test P-Value : 0.1883          
##                                           
##             Sensitivity : 0.8777          
##             Specificity : 0.7614          
##          Pos Pred Value : 0.8485          
##          Neg Pred Value : 0.8034          
##              Prevalence : 0.6036          
##          Detection Rate : 0.5298          
##    Detection Prevalence : 0.6244          
##       Balanced Accuracy : 0.8196          
##                                           
##        'Positive' Class : Alkaline        
## 
# Variable importance for the model
varImp(rf_M4_Data4, top = 10)
## rf variable importance
## 
##   only 20 most important variables shown (out of 35)
## 
##                 Overall
## Mnf.Flow         100.00
## Usage.cont        47.94
## Balling.Lvl       26.69
## Filler.Level      26.66
## Temperature       26.58
## Oxygen.Filler     26.41
## Air.Pressurer     25.42
## Balling           24.18
## Alch.Rel          23.65
## Hyd.Pressure3     23.39
## Carb.Pressure1    22.90
## Filler.Speed      20.86
## PC.Volume         20.60
## Carb.Flow         19.89
## Bowl.Setpoint     19.85
## Brand.CodeC       19.15
## Carb.Rel          18.16
## MFR               17.54
## Pressure.Vacuum   17.43
## Density           16.71
dotPlot(varImp(rf_M4_Data4), top=10)

Model Selection, Interpretation, & Evaluation

Selecting Best Model

We have built various models, both linear and nonlinear. We will evaluate each of the models on the test datasets and select the best model based on the results. The following criteria will be considered for the selection:

  • Accuracy (how accurately the model performed, evaluated using RMSE)
  • Rsquared
  • MAE
  • Scalability (as measured by prediction run time)
  • Interpretability

# compare models
models_list = list("glm_D1" = GLM_M1_Data1, "glm_D2" = GLM_M2_Data2, "glm_D3" = GLM_M3_Data3, 
                   "glmnet_D1" = glmnet_M1_Data1, "glmnet_D2" = glmnet_M2_Data2, "glmnet_D3" = glmnet_M3_Data3, 
                   "ranger_D1"= ranger_M1_Data1, "ranger_D2" = ranger_M2_Data2, "ranger_D3" = ranger_M3_Data3, 
                   "pls_D1"= pls_M1_Data1, "pls_D2" = pls_M2_Data2, "pls_D3" = pls_M3_Data3, 
                   "SVM_D1" = svmRadial_M1_Data1, "SVM_D2" = svmRadial_M2_Data2, "SVM_D3" = svmRadial_M3_Data3, 
                  # "rf_D1" = rf_M1_Data1, "rf_D2" = rf_M2_Data2, "rf_D3" = rf_M3_Data3,
                   "knn_D1" = knn_M1_Data1, "knn_D2" = knn_M2_Data2, "knn_D3" = knn_M3_Data3)

resamps = resamples(models_list) 

dotplot(resamps, metric = 'RMSE')

dotplot(resamps, metric = 'Rsquared')

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: glm_D1, glm_D2, glm_D3, glmnet_D1, glmnet_D2, glmnet_D3, ranger_D1, ranger_D2, ranger_D3, pls_D1, pls_D2, pls_D3, SVM_D1, SVM_D2, SVM_D3, knn_D1, knn_D2, knn_D3 
## Number of resamples: 5 
## 
## MAE 
##              Min. 1st Qu.  Median    Mean 3rd Qu.    Max. NA's
## glm_D1    0.09667 0.09943 0.10200 0.10170 0.10300 0.10720    0
## glm_D2    0.10100 0.10220 0.10450 0.10400 0.10560 0.10660    0
## glm_D3    0.10150 0.10210 0.10430 0.10420 0.10630 0.10700    0
## glmnet_D1 0.09663 0.09955 0.10190 0.10170 0.10300 0.10720    0
## glmnet_D2 0.10070 0.10220 0.10440 0.10390 0.10520 0.10690    0
## glmnet_D3 0.10110 0.10210 0.10430 0.10410 0.10590 0.10730    0
## ranger_D1 0.06786 0.07040 0.07101 0.07205 0.07494 0.07605    0
## ranger_D2 0.07129 0.07329 0.07353 0.07477 0.07661 0.07914    0
## ranger_D3 0.07164 0.07376 0.07403 0.07519 0.07734 0.07919    0
## pls_D1    0.10440 0.10690 0.10830 0.10860 0.10850 0.11490    0
## pls_D2    0.11770 0.11780 0.11820 0.11970 0.12130 0.12370    0
## pls_D3    0.11770 0.11780 0.11820 0.11970 0.12130 0.12370    0
## SVM_D1    0.08213 0.08245 0.08308 0.08375 0.08524 0.08586    0
## SVM_D2    0.08462 0.08530 0.08619 0.08576 0.08628 0.08642    0
## SVM_D3    0.08365 0.08555 0.08699 0.08624 0.08729 0.08772    0
## knn_D1    0.08648 0.08762 0.08944 0.09018 0.09360 0.09375    0
## knn_D2    0.10080 0.10270 0.10370 0.10340 0.10480 0.10520    0
## knn_D3    0.10090 0.10280 0.10370 0.10350 0.10480 0.10510    0
## 
## RMSE 
##              Min. 1st Qu.  Median   Mean 3rd Qu.   Max. NA's
## glm_D1    0.12620 0.12780 0.13330 0.1322  0.1344 0.1393    0
## glm_D2    0.12830 0.13000 0.13450 0.1327  0.1347 0.1362    0
## glm_D3    0.12910 0.12980 0.13420 0.1328  0.1352 0.1360    0
## glmnet_D1 0.12610 0.12790 0.13320 0.1322  0.1344 0.1393    0
## glmnet_D2 0.12810 0.13000 0.13440 0.1327  0.1344 0.1365    0
## glmnet_D3 0.12890 0.12980 0.13420 0.1328  0.1348 0.1363    0
## ranger_D1 0.09439 0.09521 0.09902 0.1000  0.1029 0.1087    0
## ranger_D2 0.09692 0.09898 0.10050 0.1022  0.1063 0.1082    0
## ranger_D3 0.09778 0.09927 0.10070 0.1025  0.1060 0.1087    0
## pls_D1    0.13210 0.13500 0.13620 0.1379  0.1384 0.1477    0
## pls_D2    0.15000 0.15180 0.15290 0.1537  0.1534 0.1601    0
## pls_D3    0.15000 0.15180 0.15290 0.1537  0.1534 0.1601    0
## SVM_D1    0.11410 0.11450 0.11470 0.1157  0.1153 0.1200    0
## SVM_D2    0.11390 0.11590 0.11780 0.1174  0.1192 0.1203    0
## SVM_D3    0.11500 0.11610 0.11810 0.1180  0.1200 0.1211    0
## knn_D1    0.11860 0.12050 0.12440 0.1239  0.1278 0.1283    0
## knn_D2    0.13580 0.13760 0.13780 0.1394  0.1416 0.1440    0
## knn_D3    0.13580 0.13760 0.13780 0.1394  0.1416 0.1440    0
## 
## Rsquared 
##             Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## glm_D1    0.3754  0.4093 0.4222 0.4199  0.4236 0.4688    0
## glm_D2    0.3761  0.3888 0.4116 0.4153  0.4494 0.4506    0
## glm_D3    0.3782  0.3901 0.4083 0.4143  0.4441 0.4510    0
## glmnet_D1 0.3751  0.4083 0.4224 0.4200  0.4240 0.4702    0
## glmnet_D2 0.3760  0.3887 0.4140 0.4155  0.4469 0.4522    0
## glmnet_D3 0.3779  0.3900 0.4110 0.4146  0.4456 0.4485    0
## ranger_D1 0.6386  0.6535 0.6730 0.6755  0.6939 0.7184    0
## ranger_D2 0.6096  0.6447 0.6578 0.6591  0.6837 0.6996    0
## ranger_D3 0.6127  0.6430 0.6540 0.6571  0.6824 0.6935    0
## pls_D1    0.3384  0.3417 0.3466 0.3686  0.3972 0.4194    0
## pls_D2    0.1618  0.1996 0.2342 0.2165  0.2348 0.2520    0
## pls_D3    0.1618  0.1996 0.2342 0.2165  0.2348 0.2520    0
## SVM_D1    0.5112  0.5340 0.5634 0.5578  0.5786 0.6021    0
## SVM_D2    0.5039  0.5065 0.5521 0.5430  0.5736 0.5791    0
## SVM_D3    0.5004  0.5197 0.5377 0.5401  0.5692 0.5738    0
## knn_D1    0.4476  0.5060 0.5085 0.4979  0.5091 0.5182    0
## knn_D2    0.3242  0.3467 0.3687 0.3644  0.3786 0.4037    0
## knn_D3    0.3241  0.3468 0.3684 0.3643  0.3786 0.4037    0

We will now compute a summary of all the models and metrics, as shown in the table below. We evaluate the models on the test data and compare the results.

predictAndMeasure <- function(model, model.label, testData, ytest, score_interpretability, grid = NULL) {
 
  # measure prediction time
  ptm <- proc.time()
  # predict on the test data set
  pred <- predict(model, testData)
  tm <- proc.time() - ptm
  
  post<- postResample(pred = pred, obs = ytest)
  RMSE.test <- c(post[[1]])
  RSquared.test <- c(post[[2]])
  MAE.test <- c(post[[3]])
  
  perf.grid = NULL
  if (is.null(grid)) { 
    perf.grid = data.frame(predictor = c(model.label) ,  RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
  } else {
    .grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
    perf.grid = rbind(grid, .grid)
  }
  
  perf.grid
}


#Prediction for glm 
performance.grid <- predictAndMeasure (GLM_M1_Data1, "glm_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=NULL)
performance.grid <- predictAndMeasure (GLM_M2_Data2, "glm_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (GLM_M3_Data3, "glm_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for glmnet
performance.grid <- predictAndMeasure (glmnet_M1_Data1, "glmnet_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M2_Data2, "glmnet_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M3_Data3, "glmnet_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for ranger
performance.grid <- predictAndMeasure (ranger_M1_Data1, "ranger_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M2_Data2, "ranger_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M3_Data3, "ranger_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for pls
performance.grid <- predictAndMeasure (pls_M1_Data1, "PLS_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M2_Data2, "PLS_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M3_Data3, "PLS_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for SVM
performance.grid <- predictAndMeasure (svmRadial_M1_Data1, "SVM_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M2_Data2, "SVM_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M3_Data3, "SVM_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for rf
performance.grid <- predictAndMeasure (rf_M1_Data1, "rf_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M2_Data2, "rf_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M3_Data3, "rf_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

#Prediction for knn
performance.grid <- predictAndMeasure (knn_M1_Data1, "knn_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M2_Data2, "knn_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M3_Data3, "knn_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)

kable(performance.grid[order(performance.grid$RMSE, decreasing=F),])
|    | predictor | RMSE | RSquared | MAE | time | interpretability |
|----|-----------|------|----------|-----|------|------------------|
| 7  | ranger_D1 | 0.0939969 | 0.7032532 | 0.0696673 | 0.06 | 3 |
| 16 | rf_D1     | 0.0967080 | 0.6879884 | 0.0718421 | 0.12 | 3 |
| 9  | ranger_D3 | 0.0977129 | 0.6765406 | 0.0728909 | 0.06 | 1 |
| 8  | ranger_D2 | 0.0982740 | 0.6727468 | 0.0728900 | 0.06 | 2 |
| 17 | rf_D2     | 0.1007574 | 0.6589337 | 0.0749488 | 0.14 | 2 |
| 18 | rf_D3     | 0.1013214 | 0.6545622 | 0.0750357 | 0.36 | 1 |
| 13 | SVM_D1    | 0.1170556 | 0.5343906 | 0.0861389 | 0.34 | 3 |
| 15 | SVM_D3    | 0.1175400 | 0.5305444 | 0.0881589 | 0.11 | 1 |
| 14 | SVM_D2    | 0.1192273 | 0.5150016 | 0.0890344 | 0.10 | 2 |
| 19 | knn_D1    | 0.1192457 | 0.5140818 | 0.0883048 | 0.06 | 3 |
| 4  | glmnet_D1 | 0.1356713 | 0.3705383 | 0.1050806 | 0.03 | 3 |
| 1  | glm_D1    | 0.1356927 | 0.3705787 | 0.1051352 | 0.01 | 3 |
| 5  | glmnet_D2 | 0.1376930 | 0.3537371 | 0.1068528 | 0.02 | 2 |
| 2  | glm_D2    | 0.1379206 | 0.3522181 | 0.1069320 | 0.02 | 2 |
| 6  | glmnet_D3 | 0.1381452 | 0.3500481 | 0.1072481 | 0.02 | 1 |
| 3  | glm_D3    | 0.1383895 | 0.3484315 | 0.1073678 | 0.00 | 1 |
| 20 | knn_D2    | 0.1415026 | 0.3352160 | 0.1059370 | 0.07 | 2 |
| 21 | knn_D3    | 0.1416439 | 0.3343843 | 0.1060579 | 0.06 | 1 |
| 10 | PLS_D1    | 0.1423358 | 0.3062500 | 0.1109452 | 0.01 | 3 |
| 11 | PLS_D2    | 0.1559200 | 0.1733901 | 0.1248626 | 0.00 | 2 |
| 12 | PLS_D3    | 0.1559200 | 0.1733898 | 0.1248626 | 0.02 | 1 |

From the summary table, we observe that the random forest models (ranger_D1 and rf_D1) perform best based on RMSE, MAE, and RSquared values. Both are built on the data set with no transformations, which allows for better interpretability of the model.

The model is meant to be “productionalized”: an application will be built to monitor the manufacturing process, which will require the model to scale and to operate on large amounts of data. Given this requirement, we selected the rf_D1 model.

Interpretation of selected model

The analysis of the variable importance and decision tree of the selected random forest model is given below.

# variable importance of the final random forest model
varImp(rf_M1_Data1$finalModel)
# first decision tree of the final random forest model (getTree defaults to k = 1)
getTree(rf_M1_Data1$finalModel, labelVar = TRUE)

The model’s decision tree is consistent with the variable importance output. The first node of the tree is Mnf.Flow with a split point of about -50, which seems to be separating valid vs. invalid (-100) Mnf.Flow values. The records with invalid Mnf.Flow values seem to have higher pH values.
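As a quick check of this reading, we can compare the mean pH on either side of the split point (a minimal sketch, assuming beverages_v1_train is the untransformed Data1 training set used to fit the model):

# Hedged sketch: mean pH for valid vs. invalid (~ -100) Mnf.Flow values,
# split at the tree's first node (-50); beverages_v1_train is an assumed name
tapply(beverages_v1_train$PH, beverages_v1_train$Mnf.Flow < -50, mean)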

The first node on the left branch is Brand.Code4, which equates to Brand.Code “C”. Here, it seems that records with Brand.Code “C” have lower pH values. The first node on the right branch is Brand.Code5, which equates to Brand.Code “D” and is associated with higher pH values.

These positive/negative associations between pH and the important variables are consistent with the coefficients from an earlier glmnet model, which, while not as accurate in terms of RMSE, does produce much more interpretable results.

On a centered and scaled basis, the variables that most increased pH were Balling.Lvl and Carb.Pressure1, and the variables that most decreased pH were Mnf.Flow and Brand.Code “C”.
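For reference, the coefficients behind this comparison can be pulled from the earlier glmnet fit (a sketch, assuming glmnet_M1_Data1 is the caret object trained above, with coefficients taken at the lambda caret selected):

# Hedged sketch: coefficients from the earlier glmnet model at the selected lambda
coef(glmnet_M1_Data1$finalModel, s = glmnet_M1_Data1$bestTune$lambda)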

varImpPlot(rf_M1_Data1$finalModel, type = 1, main = "Mean Decrease Accuracy") # Graph 1: % increase in MSE when a variable is permuted

varImpPlot(rf_M1_Data1$finalModel, type = 2, main = "Node Purity") # Graph 2: total decrease in node impurity (RSS) from splits on a variable

Graph 1 above shows how much the mean squared error (MSE) increases when a variable's values are replaced by a random permutation. In this case, randomly permuting Mnf.Flow increases MSE by about 60% on average.
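To illustrate the idea behind this measure, the permutation can be done by hand (a sketch only: randomForest computes %IncMSE on each tree's out-of-bag samples, whereas this uses the held-out test set, so the numbers will differ; the seed is arbitrary):

# Hedged sketch of permutation importance on the test set
set.seed(123)
baseline_mse <- mean((predict(rf_M1_Data1, beverages_v1_test) - beverages_v1_test$PH)^2)
permuted <- beverages_v1_test
permuted$Mnf.Flow <- sample(permuted$Mnf.Flow)   # break the association with pH
permuted_mse <- mean((predict(rf_M1_Data1, permuted) - beverages_v1_test$PH)^2)
100 * (permuted_mse - baseline_mse) / baseline_mse   # % increase in MSE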

Graph 2 above shows node purity. For a regression forest, purity is measured by the residual sum of squares (RSS): each split's contribution is the decrease in RSS from splitting on that variable, and IncNodePurity is this decrease summed over all splits on the variable and averaged across all trees. (The Gini index plays the analogous role in classification.)
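These values can also be read directly from the fitted forest, as a quick cross-check of Graph 2 (importance() is from the randomForest package; sorting is only for readability):

# IncNodePurity values underlying Graph 2, largest first
imp <- importance(rf_M1_Data1$finalModel, type = 2)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE])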

Evaluation

Evaluation Data set and required transformations

We will load the evaluation data set and perform the necessary data transformations to run our selected model.

# set file name - evaluation data loaded from the Github repository
beverages_filename_eval<- "https://raw.githubusercontent.com/vbriot28/Data624_Group1_FinalProject/master/StudentEvaluation-%20TO%20PREDICT.csv"

# Load Evaluation Data Set
beverages_eval <-read.csv(beverages_filename_eval, header=TRUE, sep=",",stringsAsFactors = F)

Based on our selected model, we only need to impute missing values and recode any unspecified Brand.Code as “U”. We first check whether imputation is actually required by checking for missing data.

PlotMissing(beverages_eval)

We will impute the missing values for all our predictors.

# Recode unspecified brand codes as "U"
beverages_eval$Brand.Code[beverages_eval$Brand.Code==""] = "U"

# Set aside the target (all missing in the evaluation set) before imputation
beverages_eval_predictor <- beverages_eval
beverages_eval_predictor$PH <- NULL

# Impute missing predictor values with mice
mice_imputes_eval <- mice(beverages_eval_predictor, m = 2, maxit = 2, print = FALSE, seed = 143)

beverages_eval_v1 <- complete(mice_imputes_eval)

# Reattach the (empty) PH column, keeping its original name
beverages_eval_v1$PH <- beverages_eval$PH
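As a quick sanity check, we can confirm that no missing values remain among the imputed predictors (PH is excluded, since it is the value to be predicted):

# Count remaining NAs per predictor; all should be zero after imputation
colSums(is.na(beverages_eval_v1[, names(beverages_eval_v1) != "PH"]))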

We will now predict pH using our selected model; the results will be written to a .csv file.

prediction_rf <- predict(rf_M1_Data1, beverages_eval_v1)

# Write predictions to a .csv file for delivery
write.csv(prediction_rf, file = "prediction_rf1.csv")

Conclusion

The most accurate models were the random forest models. The selected model uses untransformed inputs and has a relatively fast processing time, making it more operationally attractive. Its results indicate that the chief factors driving pH are Mnf.Flow, Brand.Code “C”, Pressure.Vacuum, Oxygen.Filler, Air.Pressurer, Alch.Rel, Balling.Lvl, Carb.Rel, Hyd.Pressure3, and Temperature.

References:

EDA
https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html

Data Transformation
https://www.r-bloggers.com/near-zero-variance-predictors-should-we-remove-them/
http://rcompanion.org/handbook/I_12.html

Model Selection
https://rpubs.com/Isaac/caret_reg