A new regulation by the FDA requires the ABC Company to understand its manufacturing process as it relates to the pH of the beverages it produces. To this end, the Production Manager has commissioned a predictive model to better understand the factors that drive pH and to report on the model's predictions.
The ABC Company has limited resources for processing large amounts of data and analyzing the results and is working under an accelerated timeline to meet the new regulations. Consequently, the model selected should involve limited data cleaning, lend itself well to automation, and should predict pH with a fair degree of accuracy. More importantly, the interpretation of the model should provide actionable insights into the primary levers of pH levels.
The team is operating under a very tight deadline; the deliverables are due on 05/22/2018. Since the team is working remotely from various locations, the following tools were adopted to support efficient communication:
Team Members
Prashant Bhuyan (Team Leader)
Bruce Hao
Cheryl Bowersox
Chris Estevez
Valerie Briot
# Please update DataExplorer to the version below
pack_URL= "https://cran.r-project.org/src/contrib/Archive/DataExplorer/DataExplorer_0.4.0.tar.gz"
if (!require("DataExplorer")) install.packages(pack_URL, repos=NULL, type='source')
library(psych) # EDA, describe function
library(tidyverse) # data manipulation and plotting
library(knitr) # kable tables
library(VIM) # missing-value visualization
library(caret) # correlation, model building
library(corrplot) # Correlation
library(mice) # Imputation
library(MASS) # BoxCox Transformation
library("usdm")
library(forecast) # alternate transform
library(ranger) # Random Forest Model
library(randomForest)
library(parallel) # Parallel processing for model building
library(doParallel) # Parallel processing for model building
rm(pack_URL)
The analysis will be performed on historical data. For reproducibility of the results, the data was loaded to and accessed from a Github repository.
#set file name - to be changed to GitHub link#
#beverage_filename <-
# Load Training Data Set
beverages <-read.csv("StudentData.csv", header=TRUE, sep=",",stringsAsFactors = F)
#From DataExplorer
data_list <- list(beverages)
PlotStr(data_list, type="r")
rm(data_list)
dim(beverages)
## [1] 2571 33
summary(beverages)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler.Level Filler.Speed Temperature Usage.cont
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36
## Median :118.4 Median :3982 Median :65.60 Median :21.79
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90
## NA's :20 NA's :57 NA's :14 NA's :5
## Carb.Flow Density MFR Balling
## Min. : 26 Min. :0.240 Min. : 31.4 Min. :-0.170
## 1st Qu.:1144 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496
## Median :3028 Median :0.980 Median :724.0 Median : 1.648
## Mean :2468 Mean :1.174 Mean :704.0 Mean : 2.198
## 3rd Qu.:3186 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292
## Max. :5104 Max. :1.920 Max. :868.6 Max. : 4.012
## NA's :2 NA's :1 NA's :212 NA's :1
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## Min. :-6.600 Min. :7.880 Min. :0.00240 Min. : 70.0
## 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0
## Median :-5.400 Median :8.540 Median :0.03340 Median :120.0
## Mean :-5.216 Mean :8.546 Mean :0.04684 Mean :109.3
## 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0
## Max. :-3.600 Max. :9.360 Max. :0.40000 Max. :140.0
## NA's :4 NA's :12 NA's :2
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## Min. :44.00 Min. :140.8 Min. :5.280 Min. :4.960
## 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340
## Median :46.00 Median :142.6 Median :6.560 Median :5.400
## Mean :47.62 Mean :142.8 Mean :6.897 Mean :5.437
## 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540
## Max. :52.00 Max. :148.2 Max. :8.620 Max. :6.060
## NA's :12 NA's :9 NA's :10
## Balling.Lvl
## Min. :0.00
## 1st Qu.:1.38
## Median :1.48
## Mean :2.05
## 3rd Qu.:3.14
## Max. :3.66
## NA's :1
object.size(beverages)
## 642096 bytes
The dataset comprises 33 variables and 2571 observations. At first glance, it is clear that some variables have missing values that will need to be addressed. All variables besides Brand.Code are numeric.
The purpose of the data exploration and statistical measures phase is to understand the data and determine how to process the dataset for modelling.
Descriptive statistics were calculated to examine the basic features of the data.
#Calculate mean missing values per variable
missing_values <- beverages %>%
summarize_all(funs(sum(is.na(.))))
missing_values_ratio <- beverages %>%
summarize_all(funs(sum(is.na(.)) / length(.)*100))
#Use Describe Package to calculate Descriptive Statistic
(beverages_d <- describe(beverages, na.rm=TRUE, interp=FALSE, skew=TRUE, ranges=TRUE, trim=.1, type=3, check=TRUE, fast=FALSE, quant=c(.25,.75), IQR=TRUE))
beverages_d$missing <- t(missing_values)
beverages_d$miss_ratio <- t(round(missing_values_ratio,4))
beverages_d <- beverages_d %>%
dplyr::select(n, missing, miss_ratio, mean, sd, min, max, skew, kurtosis, median, IQR, Q0.25, Q0.75)
kable(beverages_d)
| | n | missing | miss_ratio | mean | sd | min | max | skew | kurtosis | median | IQR | Q0.25 | Q0.75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Brand.Code* | 2571 | 0 | 0.0000 | NaN | NA | Inf | -Inf | NA | NA | NA | NA | NA | NA |
Carb.Volume | 2561 | 10 | 0.3890 | 5.3701978 | 0.1063852 | 5.0400000 | 5.700 | 0.3922121 | -0.4669916 | 5.3466667 | 0.1600000 | 5.2933333 | 5.453333 |
Fill.Ounces | 2533 | 38 | 1.4780 | 23.9747546 | 0.0875299 | 23.6333333 | 24.320 | -0.0215452 | 0.8624714 | 23.9733333 | 0.1066667 | 23.9200000 | 24.026667 |
PC.Volume | 2532 | 39 | 1.5169 | 0.2771187 | 0.0606953 | 0.0793333 | 0.478 | 0.3396269 | 0.6699690 | 0.2713333 | 0.0728333 | 0.2391667 | 0.312000 |
Carb.Pressure | 2544 | 27 | 1.0502 | 68.1895755 | 3.5382039 | 57.0000000 | 79.400 | 0.1822162 | -0.0138046 | 68.2000000 | 5.0000000 | 65.6000000 | 70.600000 |
Carb.Temp | 2545 | 26 | 1.0113 | 141.0949234 | 4.0373861 | 128.6000000 | 154.000 | 0.2468280 | 0.2375822 | 140.8000000 | 5.4000000 | 138.4000000 | 143.800000 |
PSC | 2538 | 33 | 1.2835 | 0.0845737 | 0.0492690 | 0.0020000 | 0.270 | 0.8491445 | 0.6480498 | 0.0760000 | 0.0640000 | 0.0480000 | 0.112000 |
PSC.Fill | 2548 | 23 | 0.8946 | 0.1953689 | 0.1177817 | 0.0000000 | 0.620 | 0.9334450 | 0.7691466 | 0.1800000 | 0.1600000 | 0.1000000 | 0.260000 |
PSC.CO2 | 2532 | 39 | 1.5169 | 0.0564139 | 0.0430387 | 0.0000000 | 0.240 | 1.7288937 | 3.7250025 | 0.0400000 | 0.0600000 | 0.0200000 | 0.080000 |
Mnf.Flow | 2569 | 2 | 0.0778 | 24.5689373 | 119.4811263 | -100.2000000 | 229.400 | 0.0041430 | -1.8697072 | 65.2000000 | 240.8000000 | -100.0000000 | 140.800000 |
Carb.Pressure1 | 2539 | 32 | 1.2447 | 122.5863726 | 4.7428819 | 105.6000000 | 140.200 | 0.0543587 | 0.1418265 | 123.2000000 | 6.4000000 | 119.0000000 | 125.400000 |
Fill.Pressure | 2549 | 22 | 0.8557 | 47.9221656 | 3.1775457 | 34.6000000 | 60.400 | 0.5471107 | 1.4067532 | 46.4000000 | 4.0000000 | 46.0000000 | 50.000000 |
Hyd.Pressure1 | 2560 | 11 | 0.4278 | 12.4375781 | 12.4332538 | -0.8000000 | 58.000 | 0.7798043 | -0.1426463 | 11.4000000 | 20.2000000 | 0.0000000 | 20.200000 |
Hyd.Pressure2 | 2556 | 15 | 0.5834 | 20.9610329 | 16.3863066 | 0.0000000 | 59.400 | -0.3019570 | -1.5592984 | 28.6000000 | 34.6000000 | 0.0000000 | 34.600000 |
Hyd.Pressure3 | 2556 | 15 | 0.5834 | 20.4584507 | 15.9757236 | -1.2000000 | 50.000 | -0.3189061 | -1.5745834 | 27.6000000 | 33.4000000 | 0.0000000 | 33.400000 |
Hyd.Pressure4 | 2541 | 30 | 1.1669 | 96.2888627 | 13.1225594 | 52.0000000 | 142.000 | 0.5459786 | 0.6340041 | 96.0000000 | 16.0000000 | 86.0000000 | 102.000000 |
Filler.Level | 2551 | 20 | 0.7779 | 109.2523716 | 15.6984241 | 55.8000000 | 161.200 | -0.8482847 | 0.0460488 | 118.4000000 | 21.7000000 | 98.3000000 | 120.000000 |
Filler.Speed | 2514 | 57 | 2.2170 | 3687.1988862 | 770.8200208 | 998.0000000 | 4030.000 | -2.8700359 | 6.7059692 | 3982.0000000 | 110.0000000 | 3888.0000000 | 3998.000000 |
Temperature | 2557 | 14 | 0.5445 | 65.9675401 | 1.3827783 | 63.6000000 | 76.200 | 2.3869732 | 10.1612904 | 65.6000000 | 1.2000000 | 65.2000000 | 66.400000 |
Usage.cont | 2566 | 5 | 0.1945 | 20.9929618 | 2.9779364 | 12.0800000 | 25.900 | -0.5353253 | -1.0170230 | 21.7900000 | 5.3950000 | 18.3600000 | 23.755000 |
Carb.Flow | 2569 | 2 | 0.0778 | 2468.3542234 | 1073.6964743 | 26.0000000 | 5104.000 | -0.9877287 | -0.5826893 | 3028.0000000 | 2042.0000000 | 1144.0000000 | 3186.000000 |
Density | 2570 | 1 | 0.0389 | 1.1736498 | 0.3775269 | 0.2400000 | 1.920 | 0.5260149 | -1.1992070 | 0.9800000 | 0.7200000 | 0.9000000 | 1.620000 |
MFR | 2359 | 212 | 8.2458 | 704.0492582 | 73.8983094 | 31.4000000 | 868.600 | -5.0917729 | 30.4558939 | 724.0000000 | 24.7000000 | 706.3000000 | 731.000000 |
Balling | 2570 | 1 | 0.0389 | 2.1977696 | 0.9310914 | -0.1700000 | 4.012 | 0.5939224 | -1.3855651 | 1.6480000 | 1.7960000 | 1.4960000 | 3.292000 |
Pressure.Vacuum | 2571 | 0 | 0.0000 | -5.2161027 | 0.5699933 | -6.6000000 | -3.600 | 0.5256608 | -0.0313126 | -5.4000000 | 0.6000000 | -5.6000000 | -5.000000 |
PH | 2567 | 4 | 0.1556 | 8.5456486 | 0.1725162 | 7.8800000 | 9.360 | -0.2906437 | 0.0644294 | 8.5400000 | 0.2400000 | 8.4400000 | 8.680000 |
Oxygen.Filler | 2559 | 12 | 0.4667 | 0.0468426 | 0.0466436 | 0.0024000 | 0.400 | 2.6603955 | 11.0882098 | 0.0334000 | 0.0380000 | 0.0220000 | 0.060000 |
Bowl.Setpoint | 2569 | 2 | 0.0778 | 109.3265862 | 15.3031541 | 70.0000000 | 140.000 | -0.9743842 | -0.0564212 | 120.0000000 | 20.0000000 | 100.0000000 | 120.000000 |
Pressure.Setpoint | 2559 | 12 | 0.4667 | 47.6153966 | 2.0390474 | 44.0000000 | 52.000 | 0.2031970 | -1.6012622 | 46.0000000 | 4.0000000 | 46.0000000 | 50.000000 |
Air.Pressurer | 2571 | 0 | 0.0000 | 142.8339946 | 1.2119170 | 140.8000000 | 148.200 | 2.2521053 | 4.7336291 | 142.6000000 | 0.8000000 | 142.2000000 | 143.000000 |
Alch.Rel | 2562 | 9 | 0.3501 | 6.8974161 | 0.5052753 | 5.2800000 | 8.620 | 0.8836378 | -0.8506221 | 6.5600000 | 0.7000000 | 6.5400000 | 7.240000 |
Carb.Rel | 2561 | 10 | 0.3890 | 5.4367825 | 0.1287183 | 4.9600000 | 6.060 | 0.5032472 | -0.2949480 | 5.4000000 | 0.2000000 | 5.3400000 | 5.540000 |
Balling.Lvl | 2570 | 1 | 0.0389 | 2.0500078 | 0.8703089 | 0.0000000 | 3.660 | 0.5858456 | -1.4858636 | 1.4800000 | 1.7600000 | 1.3800000 | 3.140000 |
rm(beverages_d,missing_values,missing_values_ratio)
From the skewness coefficients, we observe that some variables have right-skewed distributions (PSC.CO2, Temperature, Oxygen.Filler, Air.Pressurer) or left-skewed distributions (Filler.Speed, MFR). As noted earlier, some variables have missing values, which we will need to take into consideration.
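As a quick check, the skewness coefficients can be recomputed and filtered. The sketch below uses an illustrative cut-off of |skew| > 1 (an assumption, not a formal rule) to list the most heavily skewed predictors.
# Sketch: predictors with absolute skewness above 1 (illustrative threshold)
num_beverages <- dplyr::select(beverages, -Brand.Code)
skew_vals <- psych::describe(num_beverages)$skew
names(skew_vals) <- names(num_beverages)
sort(skew_vals[abs(skew_vals) > 1], decreasing = TRUE)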
We will now examine each predictor to understand its distribution and determine whether any transformation is required.
# from DataExplorer Package
DataExplorer::HistogramContinuous(beverages)
The correlation matrix shown below highlights correlations among several predictor variables.
# From DataExplorer Package
#plot_correlation(beverages, use = "pairwise.complete.obs")
DataExplorer::CorrelationContinuous(beverages, use = "pairwise.complete.obs")
cor_mx =cor(beverages%>% dplyr::select(-Brand.Code) ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color",type = "upper", order = "original", number.cex = .7,addCoef.col = "black", #Add coefficient of correlation
tl.srt = 90,# Text label color and rotation
         diag = TRUE)# show the correlation coefficients on the principal diagonal
rm(cor_mx)
Let us now look at the correlation between the response (pH) variable and the predictors.
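A minimal sketch of this check (not part of the original output) computes the Pearson correlation of each numeric predictor with PH using pairwise-complete observations:
# Sketch: correlation of each numeric predictor with the response PH
num_preds <- dplyr::select(beverages, -Brand.Code, -PH)
ph_cor <- sapply(num_preds, function(x) cor(x, beverages$PH, use = "pairwise.complete.obs"))
round(sort(ph_cor, decreasing = TRUE), 3)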
This section tests the predictor variables to determine whether there is collinearity among them. The variance inflation factor (VIF) is used to detect multicollinearity across the entire set of predictors rather than just within pairs of variables.
Testing for collinearity on a subset of the numeric predictors, vifcor flags Hyd.Pressure3 as collinear; after excluding it, the remaining variables all have low VIF scores and do not appear to pose a collinearity problem.
# from usdm package
beverages_predictors <- dplyr::select(beverages, -PH)
numeric_fields <- dplyr::select_if(beverages_predictors, is.numeric)[, 3:15] # subset of numeric predictors
usdm::vifcor(numeric_fields)
## 1 variables from the 13 input variables have collinearity problem:
##
## Hyd.Pressure3
##
## After excluding the collinear variables, the linear correlation coefficients ranges between:
## min correlation ( Fill.Pressure ~ PSC.Fill ): 0.001947733
## max correlation ( Carb.Temp ~ Carb.Pressure ): 0.8233431
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 PC.Volume 1.340747
## 2 Carb.Pressure 4.043068
## 3 Carb.Temp 3.731096
## 4 PSC 1.120366
## 5 PSC.Fill 1.093121
## 6 PSC.CO2 1.053359
## 7 Mnf.Flow 2.404566
## 8 Carb.Pressure1 1.312199
## 9 Fill.Pressure 1.291768
## 10 Hyd.Pressure1 2.586587
## 11 Hyd.Pressure2 3.752048
## 12 Hyd.Pressure4 1.325928
rm(beverages_predictors,numeric_fields)
We have observed some predictors with missing values; however, no predictor is missing more than 8% of its data and no row is missing more than ?? of its data. We are therefore comfortable imputing the missing data.
# From Data Explorer
PlotMissing(beverages)
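To complement the per-variable view above, the sketch below is a quick supplementary check of row-wise missingness, i.e. the share of fields missing in each observation.
# Sketch: percentage of fields missing per observation
row_missing_pct <- rowMeans(is.na(beverages)) * 100
summary(row_missing_pct)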
In some cases, zero values are actually representative of missing data; is this the case here?
df <- setNames(data.frame(colSums(beverages==0, na.rm = T)), 'Count')
df$Variable <- rownames(df)
rownames(df) <- NULL
df %>% dplyr::filter(!Variable %in% c("Brand.Code")) %>%
ggplot(aes(x=reorder(Variable, Count), y=Count, fill=Count)) +
geom_bar(stat="identity") + coord_flip() + guides(fill=FALSE) +
xlab("Variable") + ylab("Number of 0 Values") +
ggtitle("Count of Zero Values by Variable") +
geom_text(aes(label=Count), vjust=.5, hjust=-.1,position= position_dodge(width=0.5),size=3, color="black")
rm(df)
We observed a high number of zero values for Hyd.Pressure1, Hyd.Pressure2, and Hyd.Pressure3, and we will add dummy variables to flag such records. Also, based on the correlation coefficients, we will likely drop Hyd.Pressure3.
Brand.Code has a proportion of its data that is unspecified; we will flag these records with a "U" for "unknown".
Since we have a limited amount of missing values across the predictors, we will impute the data using the mice package.
# Replace *BLANK Brand.Code with "U"
beverages$Brand.Code[beverages$Brand.Code==""]= "U"
summary(beverages$Brand.Code)
## Length Class Mode
## 2571 character character
mice_imputes =mice(beverages, m = 2, maxit = 2, print = FALSE,seed = 143)
densityplot(mice_imputes)
The imputed density distribution is indicated in red.
# Apply the imputed values (data set version 1)
beverages_v1 =complete(mice_imputes)
# Plot missing values
PlotMissing(beverages_v1)
rm(mice_imputes)
We have addressed the missing values. We will continue to investigate possible problems with the predictors by checking for near-zero variance predictors.
By default, a predictor is classified as near-zero variance if the percentage of unique values in the sample is less than 10% and the frequency ratio (the frequency of the most common value divided by the frequency of the second most common value) is greater than 19 (95/5).
These default values can be changed by setting the arguments uniqueCut and freqCut, as illustrated in the sketch below.
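For reference, the sketch below runs the same diagnostic with the cut-offs written out explicitly; the values shown are caret's defaults and can be tightened or relaxed as needed.
# Sketch: nearZeroVar with explicit freqCut / uniqueCut arguments (defaults shown)
nzv_custom = nearZeroVar(beverages_v1, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
nzv_custom[nzv_custom$nzv, ]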
# From Caret package
#
x = nearZeroVar(beverages_v1, saveMetrics = TRUE)
str(x, vec.len = 2)
## 'data.frame': 33 obs. of 4 variables:
## $ freqRatio : num 2.01 1.06 ...
## $ percentUnique: num 0.194 3.928 ...
## $ zeroVar : logi FALSE FALSE FALSE ...
## $ nzv : logi FALSE FALSE FALSE ...
x[x[,"zeroVar"] > 0, ]
x[x[,"nzv"] > 0, ]
rm(x)
Since this is the only variable with near-zero variance and its percentage of unique values is very close to the cut-off, we will not drop this variable.
Invalid data or bimodal distributions
Based on the histograms above, it is clear that some variables have bimodal distributions or a large number of records with what appear to be invalid data, for example, Mnf.Flow and the Hyd.Pressure variables. While some models may be able to deal with such data without modification, other models may not. As such, we will create dummy variables to flag which distribution a given record belongs to within each variable.
This flagging step is performed after the missing data have been imputed.
beverages_v2 = beverages_v1 %>%
mutate(Mnf.Flow.lt0 = if_else(Mnf.Flow < 0, 1, 0)) %>%
mutate(Hyd.Pressure1.lte0 = if_else(Hyd.Pressure1 <= 0 ,1, 0)) %>%
mutate(Hyd.Pressure2.lte0 = if_else(Hyd.Pressure2 <= 0, 1, 0)) %>% #remove Hyd.Pressure3 since variable dropped
mutate(Filler.Speed.lt2500 = if_else(Filler.Speed < 2500, 1, 0)) %>%
mutate(Carb.Flow.lt2000 = if_else(Carb.Flow < 2000, 1, 0)) %>%
mutate(Balling.lt.2.5 = if_else(Balling < 2.5, 1, 0))
Based on the correlation results, we propose dropping the following predictors: Density, Balling.Lvl, Carb.Rel, Alch.Rel, and Hyd.Pressure3.
# Drop some predictors due to high correlation
beverages_v2$Density <- NULL
beverages_v2$Balling.Lvl <- NULL
beverages_v2$Carb.Rel <- NULL
beverages_v2$Alch.Rel <- NULL
beverages_v2$Hyd.Pressure3 <- NULL
We have observed significant skewness for the following variables: PSC, PSC.Fill, PSC.CO2, and Oxygen.Filler. We propose applying a Box-Cox transformation to these variables. Note that PSC.Fill and PSC.CO2 contain zero values and cannot be transformed directly with Box-Cox, so a small positive offset is added to make them strictly positive.
# Copy our data set
beverages_v3 <- beverages_v2
offset <- 0.0000001
# PSC
Box = boxcox(beverages_v3$PSC ~ 1, # Transform PSC Column as a single vector
lambda = seq(-6,6,0.1) # Try values -6 to 6 by 0.1
)
Cox = data.frame(Box$x, Box$y) # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
Cox2[1,] # Display the lambda with the greatest
# log likelihood
lambda.PSC = Cox2[1, "Box.x"] # Extract that lambda
#PSC.FILL
beverages_v3$PSC.Fill <- beverages_v3$PSC.Fill + offset
Box = boxcox(beverages_v3$PSC.Fill ~ 1, # Transform PSC.Fill column as a single vector
lambda = seq(-6,6,0.1) # Try values -6 to 6 by 0.1
)
Cox = data.frame(Box$x, Box$y) # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
Cox2[1,] # Display the lambda with the greatest
# log likelihood
lambda.PSC_Fill = Cox2[1, "Box.x"] # Extract that lambda
#PSC.CO2
beverages_v3$PSC.CO2 <- beverages_v3$PSC.CO2 + offset
Box = boxcox(beverages_v3$PSC.CO2 ~ 1, # Transform PSC.CO2 column as a single vector
lambda = seq(-6,6,0.1) # Try values -6 to 6 by 0.1
)
Cox = data.frame(Box$x, Box$y) # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
Cox2[1,] # Display the lambda with the greatest
# log likelihood
lambda.PSC_CO2 = Cox2[1, "Box.x"] # Extract that lambda
#Oxygen.Filler
Box = boxcox(beverages_v3$Oxygen.Filler ~ 1, # Transform Oxygen.Filler column as a single vector
lambda = seq(-6,6,0.1) # Try values -6 to 6 by 0.1
)
Cox = data.frame(Box$x, Box$y) # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
Cox2[1,] # Display the lambda with the greatest
# log likelihood
lambda.Oxygen_Filler = Cox2[1, "Box.x"] # Extract that lambda
# library(forecast)
# a = BoxCox.lambda(beverages_v2$PSC.Fill)
# b = BoxCox(beverages_v2$PSC.Fill,a)
# hist(b,70)
# hist(beverages_v2$PSC.Fill)
#
# HistogramContinuous(b)
rm(offset)
The lambda for predictor PSC is `r lambda.PSC`. The lambda for predictor PSC.Fill is `r lambda.PSC_Fill`. The lambda for predictor PSC.CO2 is `r lambda.PSC_CO2`. The lambda for predictor Oxygen.Filler is `r lambda.Oxygen_Filler`.
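For completeness, the sketch below shows how the estimated lambdas could be applied using the standard Box-Cox form; the helper bc_apply and the copy beverages_v3_bc are illustrative names introduced here, and this step is shown for reference rather than as part of the fitted pipeline.
# Sketch: apply (y^lambda - 1)/lambda, or log(y) when lambda is 0, to a copy of the data
bc_apply <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
beverages_v3_bc <- beverages_v3
beverages_v3_bc$PSC <- bc_apply(beverages_v3_bc$PSC, lambda.PSC)
beverages_v3_bc$PSC.Fill <- bc_apply(beverages_v3_bc$PSC.Fill, lambda.PSC_Fill)
beverages_v3_bc$PSC.CO2 <- bc_apply(beverages_v3_bc$PSC.CO2, lambda.PSC_CO2)
beverages_v3_bc$Oxygen.Filler <- bc_apply(beverages_v3_bc$Oxygen.Filler, lambda.Oxygen_Filler)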
This completes the transformations on the data set; any additional transformations will be performed in the model-building phase, as they are model dependent.
HistogramContinuous(beverages_v3)
#?HistogramContinuous()
Data set 4 splits the response into a categorical variable, Neutral or Alkaline. Brand.Code will also be converted to a factor.
beverages_v4= beverages_v1
beverages_v4$PH = ifelse(beverages_v4$PH <= 8.5, "Neutral", "Alkaline")
beverages_v4$PH = factor(beverages_v4$PH )
beverages_v4$Brand.Code = factor(beverages_v4$Brand.Code)
We will explore and build various models to identify the most significant variables that influence pH and to predict pH values.
We have four versions of the data set on which to base our models (three with a numeric response and one with a categorical response). Additional transformations, such as scaling and centering, may also be applied at the time of model building:
# The same 70/30 split indices are applied to each version of the data set (beverages_v1 through beverages_v4)
set.seed(143)
sample = sample.int(n = nrow(beverages), size = floor(.70*nrow(beverages)), replace = F)
beverages_v1_train = beverages_v1[sample, ]
beverages_v1_test = beverages_v1[-sample,]
beverages_v2_train = beverages_v2[sample, ]
beverages_v2_test = beverages_v2[-sample,]
beverages_v3_train = beverages_v3[sample, ]
beverages_v3_test = beverages_v3[-sample,]
beverages_v4_train = beverages_v4[sample, ]
beverages_v4_test = beverages_v4[-sample,]
rm(beverages_v1,beverages_v2,beverages_v3,beverages_v4,beverages,sample)
Mycluster =makeCluster(detectCores()-1)
registerDoParallel(Mycluster)
myControl = trainControl(method = 'cv', number = 5,
verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)
# First, start with a generalized linear model (Gaussian family, i.e., ordinary linear regression)
set.seed(143)
GLM_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glm',preProcess = c('center', 'scale'), trControl = myControl)
GLM_M1_Data1
## Generalized Linear Model
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1321942 0.4198535 0.1016524
set.seed(143)
GLM_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M2_Data2
## Generalized Linear Model
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1327388 0.4153035 0.1039857
set.seed(143)
GLM_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glm', trControl = myControl)
GLM_M3_Data3
## Generalized Linear Model
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1328461 0.4143476 0.1042366
# next, we'll try a glmnet model which combines lasso and ridge regression
set.seed(143)
glmnet_M1_Data1 = train(PH ~ ., data = beverages_v1_train , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)
glmnet_M1_Data1
## glmnet
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0001599694 0.1321682 0.4200101 0.1016777
## 0.10 0.0015996935 0.1322468 0.4191405 0.1020003
## 0.10 0.0159969354 0.1344230 0.4028236 0.1050963
## 0.55 0.0001599694 0.1321767 0.4198747 0.1017159
## 0.55 0.0015996935 0.1326398 0.4157909 0.1026923
## 0.55 0.0159969354 0.1391558 0.3693005 0.1092707
## 1.00 0.0001599694 0.1321877 0.4197298 0.1017613
## 1.00 0.0015996935 0.1331279 0.4118142 0.1033393
## 1.00 0.0159969354 0.1439824 0.3318423 0.1128677
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.0001599694.
set.seed(143)
glmnet_M2_Data2 = train(PH ~ ., data = beverages_v2_train , metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M2_Data2
## glmnet
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0001672247 0.1326933 0.4155451 0.1038747
## 0.10 0.0016722466 0.1328820 0.4135047 0.1038632
## 0.10 0.0167224659 0.1343540 0.4025188 0.1051299
## 0.55 0.0001672247 0.1326944 0.4154579 0.1038483
## 0.55 0.0016722466 0.1333793 0.4090589 0.1041986
## 0.55 0.0167224659 0.1386539 0.3740453 0.1086213
## 1.00 0.0001672247 0.1326960 0.4153719 0.1038397
## 1.00 0.0016722466 0.1336176 0.4071727 0.1044491
## 1.00 0.0167224659 0.1439849 0.3307979 0.1127147
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.0001672247.
set.seed(143)
glmnet_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'glmnet', trControl = myControl)
glmnet_M3_Data3
## glmnet
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0001672247 0.1327977 0.4145988 0.1041364
## 0.10 0.0016722466 0.1329433 0.4128907 0.1040781
## 0.10 0.0167224659 0.1343872 0.4021404 0.1052213
## 0.55 0.0001672247 0.1328009 0.4144859 0.1041151
## 0.55 0.0016722466 0.1333806 0.4089906 0.1043980
## 0.55 0.0167224659 0.1386158 0.3743946 0.1085942
## 1.00 0.0001672247 0.1328006 0.4144030 0.1040986
## 1.00 0.0016722466 0.1336274 0.4070431 0.1046215
## 1.00 0.0167224659 0.1439831 0.3308204 0.1127133
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.0001672247.
# let's also model using random forest just for fun
set.seed(143)
ranger_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)
ranger_M1_Data1
## Random Forest
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.1148168 0.6082763 0.08645160
## 2 extratrees 0.1191566 0.5732928 0.09119231
## 18 variance 0.1022603 0.6658250 0.07395862
## 18 extratrees 0.1016529 0.6675300 0.07357492
## 35 variance 0.1025035 0.6570685 0.07348070
## 35 extratrees 0.1000349 0.6754776 0.07205404
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 35, splitrule =
## extratrees and min.node.size = 5.
set.seed(143)
ranger_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M2_Data2
## Random Forest
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.1183735 0.5767453 0.09000200
## 2 extratrees 0.1219389 0.5435630 0.09361688
## 19 variance 0.1042656 0.6494408 0.07637783
## 19 extratrees 0.1036684 0.6506936 0.07604484
## 36 variance 0.1046413 0.6425340 0.07615347
## 36 extratrees 0.1021818 0.6590873 0.07477257
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 36, splitrule =
## extratrees and min.node.size = 5.
set.seed(143)
ranger_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method = 'ranger',trControl = myControl)
ranger_M3_Data3
## Random Forest
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.1191127 0.5698374 0.09056467
## 2 extratrees 0.1222293 0.5412470 0.09392975
## 19 variance 0.1042516 0.6498560 0.07627652
## 19 extratrees 0.1041126 0.6479240 0.07647817
## 37 variance 0.1047951 0.6413251 0.07615708
## 37 extratrees 0.1024866 0.6571123 0.07519137
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 37, splitrule =
## extratrees and min.node.size = 5.
Since we observed correlation between the predictor variables, we will also consider building a Partial Least Squares (PLS) model.
set.seed(143)
pls_M1_Data1 = train(PH ~ ., data = beverages_v1_train, metric = 'RMSE', method ='pls', preProcess = c('center', 'scale'), tunelength = 15, trControl = myControl)
pls_M1_Data1
## Partial Least Squares
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1494403 0.2581107 0.1177883
## 2 0.1403749 0.3456090 0.1099254
## 3 0.1378683 0.3686338 0.1086210
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M2_Data2 = train(PH ~ ., data = beverages_v2_train, metric = 'RMSE', method ='pls', tunelength = 15, trControl = myControl)
pls_M2_Data2
## Partial Least Squares
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1705840 0.03468439 0.1360308
## 2 0.1691324 0.04868677 0.1352213
## 3 0.1536504 0.21647890 0.1197444
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
pls_M3_Data3 = train(PH ~ ., data = beverages_v3_train, metric = 'RMSE', method ='pls', tunelength = 15, trControl = myControl)
pls_M3_Data3
## Partial Least Squares
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1705840 0.03468441 0.1360308
## 2 0.1691324 0.04868680 0.1352213
## 3 0.1536504 0.21647873 0.1197445
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
set.seed(143)
svmRadial_M1_Data1 =train(PH~.,beverages_v1_train, metric = 'RMSE', method = "svmRadial",preProc =c("center", "scale"),tuneLength = 14, trControl = myControl)
svmRadial_M1_Data1
## Support Vector Machines with Radial Basis Function Kernel
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1271623 0.4763276 0.09440391
## 0.50 0.1237651 0.4996104 0.09135600
## 1.00 0.1203705 0.5237636 0.08824463
## 2.00 0.1177148 0.5422556 0.08548582
## 4.00 0.1160196 0.5542984 0.08379616
## 8.00 0.1157321 0.5578480 0.08375405
## 16.00 0.1173556 0.5510887 0.08532832
## 32.00 0.1209362 0.5333375 0.08834497
## 64.00 0.1264312 0.5071734 0.09295419
## 128.00 0.1339292 0.4748267 0.09843867
## 256.00 0.1413222 0.4466638 0.10391393
## 512.00 0.1458019 0.4293654 0.10746593
## 1024.00 0.1476611 0.4221392 0.10865742
## 2048.00 0.1476611 0.4221392 0.10865742
##
## Tuning parameter 'sigma' was held constant at a value of 0.01892239
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01892239 and C = 8.
set.seed(143)
svmRadial_M2_Data2 =train(PH~.,beverages_v2_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M2_Data2
## Support Vector Machines with Radial Basis Function Kernel
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1266025 0.4790584 0.09407358
## 0.50 0.1240029 0.4962984 0.09161310
## 1.00 0.1211552 0.5161218 0.08903764
## 2.00 0.1190476 0.5308862 0.08701478
## 4.00 0.1174062 0.5430372 0.08576343
## 8.00 0.1177581 0.5424166 0.08617114
## 16.00 0.1189869 0.5385996 0.08746447
## 32.00 0.1224791 0.5220192 0.09017677
## 64.00 0.1272165 0.5005933 0.09363698
## 128.00 0.1350334 0.4651577 0.09944323
## 256.00 0.1450232 0.4250946 0.10690726
## 512.00 0.1523229 0.3972286 0.11217570
## 1024.00 0.1563843 0.3830348 0.11491475
## 2048.00 0.1587342 0.3766594 0.11665990
##
## Tuning parameter 'sigma' was held constant at a value of 0.018282
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.018282 and C = 4.
set.seed(143)
svmRadial_M3_Data3 =train(PH~.,beverages_v3_train, metric = 'RMSE', method = "svmRadial",tuneLength = 14, trControl = myControl)
svmRadial_M3_Data3
## Support Vector Machines with Radial Basis Function Kernel
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1270587 0.4752379 0.09442825
## 0.50 0.1243818 0.4933657 0.09170709
## 1.00 0.1215350 0.5134861 0.08906423
## 2.00 0.1195646 0.5269365 0.08746778
## 4.00 0.1180271 0.5382430 0.08653561
## 8.00 0.1180227 0.5401329 0.08624021
## 16.00 0.1190728 0.5382667 0.08744345
## 32.00 0.1231344 0.5185287 0.09067732
## 64.00 0.1277064 0.4976801 0.09401868
## 128.00 0.1360166 0.4596066 0.10049839
## 256.00 0.1463599 0.4181806 0.10841813
## 512.00 0.1528888 0.3935027 0.11342022
## 1024.00 0.1556328 0.3859861 0.11575868
## 2048.00 0.1568338 0.3823311 0.11670405
##
## Tuning parameter 'sigma' was held constant at a value of 0.01751442
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01751442 and C = 8.
#plot(svmRadial_M1_Data1, scales = list(x=list(log=2)))
control_forest = trainControl(method="repeatedcv", number=5, repeats=2, search="random",allowParallel = T)
mtry = sqrt(ncol(beverages_v1_train))
set.seed(143)
rf_M1_Data1 = train(PH~., data=beverages_v1_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest,importance =T)
rf_M1_Data1
## Random Forest
##
## 1799 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.1269387 0.5384941 0.09774909
## 3 0.1113403 0.6259804 0.08295917
## 15 0.1027157 0.6647482 0.07437338
## 26 0.1019036 0.6644823 0.07329716
## 34 0.1024474 0.6573263 0.07334071
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
mtry = sqrt(ncol(beverages_v2_train))
set.seed(143)
rf_M2_Data2 = train(PH~., data=beverages_v2_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M2_Data2
## Random Forest
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.1334847 0.4781047 0.10352434
## 3 0.1139627 0.6015363 0.08562282
## 15 0.1048334 0.6476058 0.07680505
## 27 0.1046327 0.6438005 0.07605319
## 35 0.1050295 0.6391563 0.07608868
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 27.
mtry = sqrt(ncol(beverages_v3_train))
set.seed(143)
rf_M3_Data3 = train(PH~., data=beverages_v3_train, metric = 'RMSE' , method="rf", tuneLength=5, trControl=control_forest)
rf_M3_Data3
## Random Forest
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440, 1440, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.1340604 0.4691741 0.10386163
## 3 0.1145931 0.5974142 0.08631543
## 16 0.1051138 0.6452811 0.07702960
## 28 0.1047163 0.6431253 0.07614646
## 36 0.1051171 0.6382049 0.07616957
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 28.
rm(control_forest,mtry)
set.seed(143)
knn_M1_Data1 <- train(PH ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = myControl,
metric = "RMSE",
data = beverages_v1_train,preProc =c("center", "scale"))
knn_M1_Data1
## k-Nearest Neighbors
##
## 1799 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.1552863 0.3519753 0.10524865
## 2 0.1354010 0.4441269 0.09628802
## 3 0.1288917 0.4719160 0.09184895
## 4 0.1256230 0.4900061 0.09103044
## 5 0.1239321 0.4978780 0.09017747
## 6 0.1244458 0.4922766 0.09122504
## 7 0.1240195 0.4953110 0.09169746
## 8 0.1244070 0.4920447 0.09218550
## 9 0.1244541 0.4919942 0.09257184
## 10 0.1244123 0.4917993 0.09265990
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
set.seed(143)
knn_M2_Data2 <- train(PH ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = myControl,
metric = "RMSE",
data = beverages_v2_train)
knn_M2_Data2
## k-Nearest Neighbors
##
## 1799 samples
## 33 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.1683200 0.2859914 0.1144311
## 2 0.1500658 0.3327492 0.1058868
## 3 0.1418583 0.3653536 0.1021784
## 4 0.1401030 0.3694645 0.1028647
## 5 0.1401516 0.3633672 0.1033130
## 6 0.1393636 0.3643894 0.1034378
## 7 0.1400595 0.3562528 0.1038978
## 8 0.1398343 0.3550924 0.1045687
## 9 0.1405531 0.3466922 0.1052976
## 10 0.1408672 0.3423547 0.1057476
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.
set.seed(143)
knn_M3_Data3 <- train(PH ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = myControl,
metric = "RMSE",
data = beverages_v3_train)
knn_M3_Data3
## k-Nearest Neighbors
##
## 1799 samples
## 34 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.1678809 0.2883229 0.1142200
## 2 0.1501421 0.3320212 0.1059035
## 3 0.1419641 0.3647402 0.1022868
## 4 0.1400958 0.3695812 0.1028442
## 5 0.1401539 0.3633406 0.1033219
## 6 0.1393665 0.3643136 0.1034634
## 7 0.1400607 0.3562888 0.1039153
## 8 0.1398256 0.3552039 0.1045193
## 9 0.1405340 0.3468925 0.1052685
## 10 0.1408292 0.3426671 0.1056854
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 6.
Using the 'randomForest' and 'caret' packages, we applied k-fold cross-validation on the training set, fitting a random forest classification model to the categorical pH response and averaging performance across the resampled folds.
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE, verboseIter = FALSE)
set.seed(143)
rf_M4_Data4 = train(PH ~ ., data = beverages_v4_train,method = "rf", prox = FALSE, trControl = myControl)
rf_M4_Data4
## Random Forest
##
## 1799 samples
## 32 predictor
## 2 classes: 'Alkaline', 'Neutral'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1439, 1439, 1439, 1439, 1440
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8054565 0.5981767
## 18 0.8199087 0.6314982
## 35 0.8126834 0.6174367
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
rm(trControl)
stopCluster(Mycluster)
registerDoSEQ()
From the results of the cross-validated model above, we can see that model accuracy was above 80 percent. From the confusion matrix below, we can see that the overall accuracy of the model was 83.16 percent, with a 95 percent confidence interval between 80.33 percent and 85.73 percent, a sensitivity of 87.77 percent, and a specificity of 76.14 percent.
test <- predict( rf_M4_Data4, newdata = beverages_v4_test)
cf <- confusionMatrix(data = test, beverages_v4_test$PH)
print(cf, digits = 4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alkaline Neutral
## Alkaline 409 73
## Neutral 57 233
##
## Accuracy : 0.8316
## 95% CI : (0.8033, 0.8573)
## No Information Rate : 0.6036
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6449
## Mcnemar's Test P-Value : 0.1883
##
## Sensitivity : 0.8777
## Specificity : 0.7614
## Pos Pred Value : 0.8485
## Neg Pred Value : 0.8034
## Prevalence : 0.6036
## Detection Rate : 0.5298
## Detection Prevalence : 0.6244
## Balanced Accuracy : 0.8196
##
## 'Positive' Class : Alkaline
##
# Variable importance for the classification model
varImp(rf_M4_Data4, top = 10)
## rf variable importance
##
## only 20 most important variables shown (out of 35)
##
## Overall
## Mnf.Flow 100.00
## Usage.cont 47.94
## Balling.Lvl 26.69
## Filler.Level 26.66
## Temperature 26.58
## Oxygen.Filler 26.41
## Air.Pressurer 25.42
## Balling 24.18
## Alch.Rel 23.65
## Hyd.Pressure3 23.39
## Carb.Pressure1 22.90
## Filler.Speed 20.86
## PC.Volume 20.60
## Carb.Flow 19.89
## Bowl.Setpoint 19.85
## Brand.CodeC 19.15
## Carb.Rel 18.16
## MFR 17.54
## Pressure.Vacuum 17.43
## Density 16.71
dotPlot(varImp(rf_M4_Data4), top=10)
We have built a variety of linear and nonlinear models. We will evaluate each of the models on the test data sets and, based on the results, select the best model. The following criteria will be considered for the selection:
- Accuracy (how accurately the model performed, evaluated using RMSE)
- R-squared
- MAE
- Scalability (as measured by run time)
- Interpretability
# compare models
models_list = list("glm_D1" = GLM_M1_Data1, "glm_D2" = GLM_M2_Data2, "glm_D3" = GLM_M3_Data3,
"glmnet_D1" = glmnet_M1_Data1, "glmnet_D2" = glmnet_M2_Data2, "glmnet_D3" = glmnet_M3_Data3,
"ranger_D1"= ranger_M1_Data1, "ranger_D2" = ranger_M2_Data2, "ranger_D3" = ranger_M3_Data3,
"pls_D1"= pls_M1_Data1, "pls_D2" = pls_M2_Data2, "pls_D3" = pls_M3_Data3,
"SVM_D1" = svmRadial_M1_Data1, "SVM_D2" = svmRadial_M2_Data2, "SVM_D3" = svmRadial_M3_Data3,
# "rf_D1" = rf_M1_Data1, "rf_D2" = rf_M2_Data2, "rf_D3" = rf_M3_Data3,
"knn_D1" = knn_M1_Data1, "knn_D2" = knn_M2_Data2, "knn_D3" = knn_M3_Data3)
resamps = resamples(models_list)
dotplot(resamps, metric = 'RMSE')
dotplot(resamps, metric = 'Rsquared')
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: glm_D1, glm_D2, glm_D3, glmnet_D1, glmnet_D2, glmnet_D3, ranger_D1, ranger_D2, ranger_D3, pls_D1, pls_D2, pls_D3, SVM_D1, SVM_D2, SVM_D3, knn_D1, knn_D2, knn_D3
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm_D1 0.09667 0.09943 0.10200 0.10170 0.10300 0.10720 0
## glm_D2 0.10100 0.10220 0.10450 0.10400 0.10560 0.10660 0
## glm_D3 0.10150 0.10210 0.10430 0.10420 0.10630 0.10700 0
## glmnet_D1 0.09663 0.09955 0.10190 0.10170 0.10300 0.10720 0
## glmnet_D2 0.10070 0.10220 0.10440 0.10390 0.10520 0.10690 0
## glmnet_D3 0.10110 0.10210 0.10430 0.10410 0.10590 0.10730 0
## ranger_D1 0.06786 0.07040 0.07101 0.07205 0.07494 0.07605 0
## ranger_D2 0.07129 0.07329 0.07353 0.07477 0.07661 0.07914 0
## ranger_D3 0.07164 0.07376 0.07403 0.07519 0.07734 0.07919 0
## pls_D1 0.10440 0.10690 0.10830 0.10860 0.10850 0.11490 0
## pls_D2 0.11770 0.11780 0.11820 0.11970 0.12130 0.12370 0
## pls_D3 0.11770 0.11780 0.11820 0.11970 0.12130 0.12370 0
## SVM_D1 0.08213 0.08245 0.08308 0.08375 0.08524 0.08586 0
## SVM_D2 0.08462 0.08530 0.08619 0.08576 0.08628 0.08642 0
## SVM_D3 0.08365 0.08555 0.08699 0.08624 0.08729 0.08772 0
## knn_D1 0.08648 0.08762 0.08944 0.09018 0.09360 0.09375 0
## knn_D2 0.10080 0.10270 0.10370 0.10340 0.10480 0.10520 0
## knn_D3 0.10090 0.10280 0.10370 0.10350 0.10480 0.10510 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm_D1 0.12620 0.12780 0.13330 0.1322 0.1344 0.1393 0
## glm_D2 0.12830 0.13000 0.13450 0.1327 0.1347 0.1362 0
## glm_D3 0.12910 0.12980 0.13420 0.1328 0.1352 0.1360 0
## glmnet_D1 0.12610 0.12790 0.13320 0.1322 0.1344 0.1393 0
## glmnet_D2 0.12810 0.13000 0.13440 0.1327 0.1344 0.1365 0
## glmnet_D3 0.12890 0.12980 0.13420 0.1328 0.1348 0.1363 0
## ranger_D1 0.09439 0.09521 0.09902 0.1000 0.1029 0.1087 0
## ranger_D2 0.09692 0.09898 0.10050 0.1022 0.1063 0.1082 0
## ranger_D3 0.09778 0.09927 0.10070 0.1025 0.1060 0.1087 0
## pls_D1 0.13210 0.13500 0.13620 0.1379 0.1384 0.1477 0
## pls_D2 0.15000 0.15180 0.15290 0.1537 0.1534 0.1601 0
## pls_D3 0.15000 0.15180 0.15290 0.1537 0.1534 0.1601 0
## SVM_D1 0.11410 0.11450 0.11470 0.1157 0.1153 0.1200 0
## SVM_D2 0.11390 0.11590 0.11780 0.1174 0.1192 0.1203 0
## SVM_D3 0.11500 0.11610 0.11810 0.1180 0.1200 0.1211 0
## knn_D1 0.11860 0.12050 0.12440 0.1239 0.1278 0.1283 0
## knn_D2 0.13580 0.13760 0.13780 0.1394 0.1416 0.1440 0
## knn_D3 0.13580 0.13760 0.13780 0.1394 0.1416 0.1440 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm_D1 0.3754 0.4093 0.4222 0.4199 0.4236 0.4688 0
## glm_D2 0.3761 0.3888 0.4116 0.4153 0.4494 0.4506 0
## glm_D3 0.3782 0.3901 0.4083 0.4143 0.4441 0.4510 0
## glmnet_D1 0.3751 0.4083 0.4224 0.4200 0.4240 0.4702 0
## glmnet_D2 0.3760 0.3887 0.4140 0.4155 0.4469 0.4522 0
## glmnet_D3 0.3779 0.3900 0.4110 0.4146 0.4456 0.4485 0
## ranger_D1 0.6386 0.6535 0.6730 0.6755 0.6939 0.7184 0
## ranger_D2 0.6096 0.6447 0.6578 0.6591 0.6837 0.6996 0
## ranger_D3 0.6127 0.6430 0.6540 0.6571 0.6824 0.6935 0
## pls_D1 0.3384 0.3417 0.3466 0.3686 0.3972 0.4194 0
## pls_D2 0.1618 0.1996 0.2342 0.2165 0.2348 0.2520 0
## pls_D3 0.1618 0.1996 0.2342 0.2165 0.2348 0.2520 0
## SVM_D1 0.5112 0.5340 0.5634 0.5578 0.5786 0.6021 0
## SVM_D2 0.5039 0.5065 0.5521 0.5430 0.5736 0.5791 0
## SVM_D3 0.5004 0.5197 0.5377 0.5401 0.5692 0.5738 0
## knn_D1 0.4476 0.5060 0.5085 0.4979 0.5091 0.5182 0
## knn_D2 0.3242 0.3467 0.3687 0.3644 0.3786 0.4037 0
## knn_D3 0.3241 0.3468 0.3684 0.3643 0.3786 0.4037 0
We will now compute a matrix summarizing all the models and metrics, as shown in the table below. We will evaluate the models on the test data and compare the results.
predictAndMeasure <- function(model, model.label, testData, ytest, score_interpretability, grid = NULL) {
#measure prediction time
ptm <- proc.time()
# Predict model on test data set
pred <- predict(model, testData)
tm <- proc.time() - ptm
post<- postResample(pred = pred, obs = ytest)
RMSE.test <- c(post[[1]])
RSquared.test <- c(post[[2]])
MAE.test <- c(post[[3]])
perf.grid = NULL
if (is.null(grid)) {
perf.grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
} else {
.grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]), interpretability = c(score_interpretability))
perf.grid = rbind(grid, .grid)
}
perf.grid
}
#Prediction for glm
performance.grid <- predictAndMeasure (GLM_M1_Data1, "glm_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=NULL)
performance.grid <- predictAndMeasure (GLM_M2_Data2, "glm_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (GLM_M3_Data3, "glm_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for glmnet
performance.grid <- predictAndMeasure (glmnet_M1_Data1, "glmnet_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M2_Data2, "glmnet_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (glmnet_M3_Data3, "glmnet_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for ranger
performance.grid <- predictAndMeasure (ranger_M1_Data1, "ranger_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M2_Data2, "ranger_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (ranger_M3_Data3, "ranger_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for pls
performance.grid <- predictAndMeasure (pls_M1_Data1, "PLS_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M2_Data2, "PLS_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (pls_M3_Data3, "PLS_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for SVM
performance.grid <- predictAndMeasure (svmRadial_M1_Data1, "SVM_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M2_Data2, "SVM_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (svmRadial_M3_Data3, "SVM_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for rf
performance.grid <- predictAndMeasure (rf_M1_Data1, "rf_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M2_Data2, "rf_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (rf_M3_Data3, "rf_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
#Prediction for knn
performance.grid <- predictAndMeasure (knn_M1_Data1, "knn_D1", beverages_v1_test, beverages_v1_test$PH, 3, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M2_Data2, "knn_D2", beverages_v2_test, beverages_v2_test$PH, 2, grid=performance.grid)
performance.grid <- predictAndMeasure (knn_M3_Data3, "knn_D3", beverages_v3_test, beverages_v3_test$PH, 1, grid=performance.grid)
kable(performance.grid[order(performance.grid$RMSE, decreasing=F),])
| | predictor | RMSE | RSquared | MAE | time | interpretability |
|---|---|---|---|---|---|---|
7 | ranger_D1 | 0.0939969 | 0.7032532 | 0.0696673 | 0.06 | 3 |
16 | rf_D1 | 0.0967080 | 0.6879884 | 0.0718421 | 0.12 | 3 |
9 | ranger_D3 | 0.0977129 | 0.6765406 | 0.0728909 | 0.06 | 1 |
8 | ranger_D2 | 0.0982740 | 0.6727468 | 0.0728900 | 0.06 | 2 |
17 | rf_D2 | 0.1007574 | 0.6589337 | 0.0749488 | 0.14 | 2 |
18 | rf_D3 | 0.1013214 | 0.6545622 | 0.0750357 | 0.36 | 1 |
13 | SVM_D1 | 0.1170556 | 0.5343906 | 0.0861389 | 0.34 | 3 |
15 | SVM_D3 | 0.1175400 | 0.5305444 | 0.0881589 | 0.11 | 1 |
14 | SVM_D2 | 0.1192273 | 0.5150016 | 0.0890344 | 0.10 | 2 |
19 | knn_D1 | 0.1192457 | 0.5140818 | 0.0883048 | 0.06 | 3 |
4 | glmnet_D1 | 0.1356713 | 0.3705383 | 0.1050806 | 0.03 | 3 |
1 | glm_D1 | 0.1356927 | 0.3705787 | 0.1051352 | 0.01 | 3 |
5 | glmnet_D2 | 0.1376930 | 0.3537371 | 0.1068528 | 0.02 | 2 |
2 | glm_D2 | 0.1379206 | 0.3522181 | 0.1069320 | 0.02 | 2 |
6 | glmnet_D3 | 0.1381452 | 0.3500481 | 0.1072481 | 0.02 | 1 |
3 | glm_D3 | 0.1383895 | 0.3484315 | 0.1073678 | 0.00 | 1 |
20 | knn_D2 | 0.1415026 | 0.3352160 | 0.1059370 | 0.07 | 2 |
21 | knn_D3 | 0.1416439 | 0.3343843 | 0.1060579 | 0.06 | 1 |
10 | PLS_D1 | 0.1423358 | 0.3062500 | 0.1109452 | 0.01 | 3 |
11 | PLS_D2 | 0.1559200 | 0.1733901 | 0.1248626 | 0.00 | 2 |
12 | PLS_D3 | 0.1559200 | 0.1733898 | 0.1248626 | 0.02 | 1 |
From the summary table, we observe that the random forest models (ranger_D1 and rf_D1) perform best based on the RMSE, MAE, and RSquared values. Both were built on a data set with no transformations, which allows for better interpretability of the model.
The model is meant to be "productionalized", and an application will be built to monitor the manufacturing process; this requires the model to be scalable and able to operate on large amounts of data. Given this requirement, we select the rf_D1 model.
The analysis of the variable importance and decision tree of the selected random forest model is given below.
varImp(rf_M1_Data1$finalModel)
# decision tree of final random forest model
getTree(rf_M1_Data1$finalModel, labelVar = T)
The model’s decision tree is consistent with the variable importance output. The first node of the tree is Mnf.Flow with a split point of about -50, which seems to be separating valid vs. invalid (-100) Mnf.Flow values. The records with invalid Mnf.Flow values seem to have higher pH values.
The first node on the left branch is Brand.Code4, which equates to Brand.Code “C”. Here, it seems that records with Brand.Code “C” have lower pH values. The first node on the right branch is Brand.Code5, which equates to Brand.Code “D” and is associated with higher pH values.
These positive/negative associations between pH and the important variables are consistent with the coefficients from an earlier glmnet model, which while not as accurate in terms of RMSE does produce much more interpretable results.
On a centered and scaled basis, the variables that most increased pH were Balling.Lvl and Carb.Pressure1, and the variables that most decreased pH were Mnf.Flow and Brand.Code “C”.
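The sketch below shows how those coefficients can be pulled from the fitted glmnet object, assuming the centered and scaled model glmnet_M1_Data1 is the one referenced; coefficients are ordered by absolute magnitude.
# Sketch: glmnet coefficients at the cross-validated lambda, largest magnitude first
glmnet_coefs <- coef(glmnet_M1_Data1$finalModel, s = glmnet_M1_Data1$bestTune$lambda)
glmnet_coefs[order(abs(glmnet_coefs[, 1]), decreasing = TRUE), , drop = FALSE]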
varImpPlot(rf_M1_Data1$finalModel, type = 1, main = "Mean Decrease Accuracy") # graph 1 MSE
varImpPlot(rf_M1_Data1$finalModel, type = 2, main = "Node Purity") # graph 2 RSS
Graph 1 above shows how much the mean squared error (MSE) would increase if a given variable's values were randomly permuted. In this case, randomly permuting Mnf.Flow increases the MSE by roughly 60% on average.
Graph 2 above shows node purity: IncNodePurity measures the total decrease in node impurity (for regression, the reduction in the residual sum of squares, RSS) achieved by splitting on the variable, averaged across all trees.
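The underlying numbers can be inspected directly; the sketch below pulls both measures from the final random forest model (possible because the model was trained with importance = TRUE).
# Sketch: numeric importance measures (%IncMSE and IncNodePurity), top 10 by %IncMSE
imp <- randomForest::importance(rf_M1_Data1$finalModel)
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10)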
We will load the evaluation data set and perform the necessary data transformations to run our selected model.
#set file name (GitHub link)#
beverages_filename_eval<- "https://raw.githubusercontent.com/vbriot28/Data624_Group1_FinalProject/master/StudentEvaluation-%20TO%20PREDICT.csv"
# Load Evaluation Data Set
beverages_eval <-read.csv(beverages_filename_eval, header=TRUE, sep=",",stringsAsFactors = F)
Based on our selected model, we only need to impute the missing data and convert any unspecified Brand.Code values to "U". We will first verify that imputation is required by checking for missing data.
PlotMissing(beverages_eval)
We will impute the missing values for all our predictors.
beverages_eval$Brand.Code[beverages_eval$Brand.Code==""]= "U"
beverages_eval_predictor <-beverages_eval
beverages_eval_predictor$PH <- NULL
mice_imputes_eval <- mice(beverages_eval_predictor, m = 2, maxit = 2, print = FALSE,seed = 143)
beverages_eval_v1 <- complete(mice_imputes_eval)
beverages_eval_v1 <- cbind(beverages_eval_v1, beverages_eval$PH)
We will now predict PH using our selected model; the results can then be written to a .csv file, as sketched below.
prediction_rf <- predict(rf_M1_Data1, beverages_eval_v1)
#write.csv(prediction_rf, file = "prediction_rf1.csv")
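Since the write step is commented out above, the sketch below shows one way to attach the predictions to the evaluation records and write them out; the output file name is illustrative.
# Sketch: combine evaluation data with predicted PH and write to CSV (file name illustrative)
beverages_eval_out <- beverages_eval
beverages_eval_out$PH <- prediction_rf
write.csv(beverages_eval_out, file = "StudentEvaluation_predicted_PH.csv", row.names = FALSE)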
The most accurate models discovered were the random forest models. The selected model used untransformed data inputs and had a faster processing time, making it more operationally attractive. The selected model's results indicate that the chief driving factors influencing pH are Mnf.Flow, Brand.Code C, Pressure.Vacuum, Oxygen.Filler, Air.Pressurer, Alch.Rel, Balling.Lvl, Carb.Rel, Hyd.Pressure3, and Temperature.