DATA 621 Homework #5

1 OVERVIEW
2 DATA LOAD
- 2.1 Raw Data as-is
3 DATA EXPLORATION
4 DATA PREPARATION
5 BUILD MODELS
6 MODEL SELECTION

1 OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).

Below is a short description of the variables of interest in the data set:

Variable Name	Definition	Theoretical Effect
IN / INDEX	Identification Variable (do not use)	None
TARGET	Number of Cases Purchased	None
AcidIndex	Proprietary method of testing total acidity of wine by using a weighted average
Alcohol	Alcohol Content
Chlorides	Chloride content of wine
CitricAcid	Citric Acid Content
Density	Density of Wine
FixedAcidity	Fixed Acidity of Wine
FreeSulfurDioxide	Sulfur Dioxide content of wine
LabelAppeal	Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design.	Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
ResidualSugar	Residual Sugar of wine
STARS	Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor	A high number of stars suggests high sales
Sulphates	Sulfate conten of wine
TotalSulfurDioxide	Total Sulfur Dioxide of Wine
VolatileAcidity	Volatile Acid content of wine
pH	pH of wine

2 DATA LOAD

We have two datasets.

One is the wine training dataset, which includes 14 candidate predictors, 1 response variable and 12795 observations.
Other one is the wine evaluation dataset, which also includes 14 candidate predictors, 1 response variable but 16129 observations.

We are going to study their missing values, data types and data statistics.

2.1 Raw Data as-is

2.1.1 Training Dataset

2.1.2 Evaluation Dataset

2.1.3 Missing Values & Data Type Check

In the wine training dataset, there are 14 candidate predictors and 1 response variable with 12,795 observations. In the wine evaluation dataset, there are 14 candidate predictors with 3,335 observations. Both datasets have no missing values (eg: NA, NULL or ’’).

Among the 12 candidate predictors, 3 are categorical (LabelAppeal,AcidIndex,STARS), the other 11 are continuous numerical. The response variable TARGET is categorical.

## Observations: 12,795
## Variables: 16
## $ TARGET             <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, ...
## $ INDEX              <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17,...
## $ FixedAcidity       <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5...
## $ VolatileAcidity    <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290,...
## $ CitricAcid         <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0...
## $ ResidualSugar      <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1...
## $ Chlorides          <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060,...
## $ FreeSulfurDioxide  <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, ...
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180,...
## $ Density            <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9...
## $ pH                 <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, ...
## $ Sulphates          <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0...
## $ Alcohol            <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0...
## $ LabelAppeal        <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, ...
## $ AcidIndex          <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8...
## $ STARS              <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA,...

## Observations: 3,335
## Variables: 16
## $ IN                 <dbl> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63...
## $ TARGET             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FixedAcidity       <dbl> 5.4, 12.4, 7.2, 6.2, 11.4, 17.6, 15.5, 15.9, 11....
## $ VolatileAcidity    <dbl> -0.860, 0.385, 1.750, 0.100, 0.210, 0.040, 0.530...
## $ CitricAcid         <dbl> 0.27, -0.76, 0.17, 1.80, 0.28, -1.15, -0.53, 1.1...
## $ ResidualSugar      <dbl> -10.70, -19.70, -33.00, 1.00, 1.20, 1.40, 4.60, ...
## $ Chlorides          <dbl> 0.092, 1.169, 0.065, -0.179, 0.038, 0.535, 1.263...
## $ FreeSulfurDioxide  <dbl> 23, -37, 9, 104, 70, -250, 10, 115, 35, 40, NA, ...
## $ TotalSulfurDioxide <dbl> 398, 68, 76, 89, 53, 140, 17, 381, 83, 129, 583,...
## $ Density            <dbl> 0.98527, 0.99048, 1.04641, 0.98877, 1.02899, 0.9...
## $ pH                 <dbl> 5.02, 3.37, 4.61, 3.20, 2.54, 3.06, 3.07, 2.99, ...
## $ Sulphates          <dbl> 0.64, 1.09, 0.68, 2.11, -0.07, -0.02, 0.75, 0.31...
## $ Alcohol            <dbl> 12.30, 16.00, 8.55, 12.30, 4.80, 11.40, 8.50, 11...
## $ LabelAppeal        <dbl> -1, 0, 0, -1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, -1, ...
## $ AcidIndex          <dbl> 6, 6, 8, 8, 10, 8, 12, 7, 12, 7, 8, 10, 9, 8, 9,...
## $ STARS              <dbl> NA, 2, 1, 1, NA, 4, 3, NA, NA, NA, 1, NA, 2, NA,...

Missing Values & Data Type Check

	Non_NAs	NAs	NA_Percent
TARGET	12795	0	0.0000000
INDEX	12795	0	0.0000000
FixedAcidity	12795	0	0.0000000
VolatileAcidity	12795	0	0.0000000
CitricAcid	12795	0	0.0000000
ResidualSugar	12179	616	0.0481438
Chlorides	12157	638	0.0498632
FreeSulfurDioxide	12148	647	0.0505666
TotalSulfurDioxide	12113	682	0.0533021
Density	12795	0	0.0000000
pH	12400	395	0.0308714
Sulphates	11585	1210	0.0945682
Alcohol	12142	653	0.0510356
LabelAppeal	12795	0	0.0000000
AcidIndex	12795	0	0.0000000
STARS	9436	3359	0.2625244

Below is the summary of the datasets and some inference of it.

It seems there are Null values in the predictor variables but none ine response variables.
Each variables are in different scale.

2.1.4 Data Statistics Summary

A binary logistic regression model is built using the training set, therefore the training set is used for the following data exploration.

The data types in the raw dataset are all ‘doubles’, however the counter INDEX and the response variable target are categorical.

training set

The statistics of all variables are list below:

summary(data_t_mod)

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

The statistics of TARGET Variable.

TARGET: Number of Cases Purchased as Actual

options(width=100)
round(with(data_t, c(summary(TARGET), StdD=sd(TARGET), Skew=skewness(TARGET), Kurt=kurtosis(TARGET))),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
##    0.00    2.00    3.00    3.03    4.00    8.00    1.93   -0.33   -0.88

3 DATA EXPLORATION

3.1 Outliers

data_t_mod %>%
  scale() %>%
  as.data.frame() %>%
  stack() %>%
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(fill = 'deeppink4') +
  labs(title = 'Boxplot: Scaled Training Set',
       x = 'Variables',
       y = 'Normalized_Values')+
  theme(panel.background = element_rect(fill = 'grey'),axis.text.x=element_text(size=10, angle=90))

Boxplot: Scaled Training Set

The box plot below shows that outliners exist in variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal andAcidIndex. We use scaled training set to draw the box plot to show the corresponding outliers by ratio.

3.2 Histogram of attributes

FixedAcidity: This variable tells us about the FixedAcidity of wine.

library(qqplotr)
with(data_t_mod, c(summary(FixedAcidity), SD=sd(FixedAcidity), Skew=skewness(FixedAcidity), Kurt=kurtosis(FixedAcidity)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.           SD 
## -18.10000000   5.20000000   6.90000000   7.07571708   9.50000000  34.40000000   6.31764346 
##         Skew         Kurt 
##  -0.02258596   1.67499867

hist <- ggplot(data_t_mod, aes(FixedAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = 4, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of FixedAcidity') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=FixedAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FixedAcidity") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of FixedAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of FixedAcidity by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

VolatileAcidity: This variable tells us about the VolatileAcidity content of Wine.

with(data_t_mod, c(summary(VolatileAcidity), SD=sd(VolatileAcidity), Skew=skewness(VolatileAcidity), Kurt=kurtosis(VolatileAcidity)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.          SD        Skew 
## -2.79000000  0.13000000  0.28000000  0.32410395  0.64000000  3.68000000  0.78401424  0.02037997 
##        Kurt 
##  1.83221064

hist <- ggplot(data_t_mod, aes(VolatileAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of VolatileAcidity') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=VolatileAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of VolatileAcidity") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of VolatileAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of VolatileAcidity by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

CitricAcid: This variable tells us about the Citric Acid Content of wine.

with(data_t_mod, c(summary(CitricAcid), SD=sd(CitricAcid), Skew=skewness(CitricAcid), Kurt=kurtosis(CitricAcid)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.          SD        Skew 
## -3.24000000  0.03000000  0.31000000  0.30841266  0.58000000  3.86000000  0.86207979 -0.05030704 
##        Kurt 
##  1.83794007

hist <- ggplot(data_t_mod, aes(CitricAcid)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of CitricAcid') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=CitricAcid)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of CitricAcid") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of CitricAcid', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of CitricAcid by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

ResidualSugar: This variable tells us about the ResidualSugar of wine.

with(data_t_mod, c(summary(ResidualSugar), SD=sd(ResidualSugar), Skew=skewness(ResidualSugar), Kurt=kurtosis(ResidualSugar)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.        NA's          SD 
## -127.800000   -2.000000    3.900000    5.418733   15.900000  141.150000  616.000000          NA 
##        Skew        Kurt 
##          NA          NA

hist <- ggplot(data_t_mod, aes(ResidualSugar)) + geom_histogram(fill = 'dodgerblue', binwidth = 20, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of ResidualSugar') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=ResidualSugar)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of ResidualSugar") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of ResidualSugar', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of ResidualSugar by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Chlorides: This variable tells us about the Chloride content of wine.

with(data_t_mod, c(summary(Chlorides), SD=sd(Chlorides), Skew=skewness(Chlorides), Kurt=kurtosis(Chlorides)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.         NA's 
##  -1.17100000  -0.03100000   0.04600000   0.05482249   0.15300000   1.35100000 638.00000000 
##           SD         Skew         Kurt 
##           NA           NA           NA

hist <- ggplot(data_t_mod, aes(Chlorides)) + geom_histogram(fill = 'dodgerblue', binwidth = .2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Chlorides') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=Chlorides)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Chlorides") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Chlorides', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Chlorides by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

FreeSulfurDioxide : This variable tells us about the Sulfur Dioxide content of wine.

with(data_t_mod, c(summary(FreeSulfurDioxide), SD=sd(FreeSulfurDioxide), Skew=skewness(FreeSulfurDioxide), Kurt=kurtosis(FreeSulfurDioxide)))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's         SD       Skew 
## -555.00000    0.00000   30.00000   30.84557   70.00000  623.00000  647.00000         NA         NA 
##       Kurt 
##         NA

hist <- ggplot(data_t_mod, aes(FreeSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 50, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of FreeSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=FreeSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FreeSulfurDioxide") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of FreeSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of FreeSulfurDioxide by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

TotalSulfurDioxide : This variable tells us about the Total Sulfur Dioxide of Wine.

with(data_t_mod, c(summary(TotalSulfurDioxide), SD=sd(TotalSulfurDioxide), Skew=skewness(TotalSulfurDioxide), Kurt=kurtosis(TotalSulfurDioxide)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's        SD      Skew      Kurt 
## -823.0000   27.0000  123.0000  120.7142  208.0000 1057.0000  682.0000        NA        NA        NA

hist <- ggplot(data_t_mod, aes(TotalSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 200, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of TotalSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=TotalSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of TotalSulfurDioxide") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of TotalSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of TotalSulfurDioxide by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Density: This variable tells us about the Density of wine.

with(data_t_mod, c(summary(Density), SD=sd(Density), Skew=skewness(Density), Kurt=kurtosis(Density)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.          SD        Skew 
##  0.88809000  0.98772000  0.99449000  0.99420272  1.00051500  1.09924000  0.02653765 -0.01869376 
##        Kurt 
##  1.89995921

hist <- ggplot(data_t_mod, aes(Density)) + geom_histogram(fill = 'dodgerblue', binwidth = .05, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Density') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=Density)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Density") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", Density)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Density', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), Density)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Density by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Sulphates: This variable tells us about the Sulphates content of wine.

with(data_t_mod, c(summary(Sulphates), SD=sd(Sulphates), Skew=skewness(Sulphates), Kurt=kurtosis(Sulphates)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.         NA's 
##   -3.1300000    0.2800000    0.5000000    0.5271118    0.8600000    4.2400000 1210.0000000 
##           SD         Skew         Kurt 
##           NA           NA           NA

hist <- ggplot(data_t_mod, aes(Sulphates)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Sulphates') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=Sulphates)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Sulphates") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Sulphates', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Sulphates by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Alcohol: This variable tells us about the Alcohol content.

with(data_t_mod, c(summary(Alcohol), SD=sd(Alcohol), Skew=skewness(Alcohol), Kurt=kurtosis(Alcohol)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's        SD      Skew      Kurt 
##  -4.70000   9.00000  10.40000  10.48924  12.40000  26.50000 653.00000        NA        NA        NA

hist <- ggplot(data_t_mod, aes(Alcohol)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Alcohol') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=Alcohol)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Alcohol") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Alcohol', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Alcohol by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.

with(data_t_mod, c(summary(LabelAppeal), SD=sd(LabelAppeal), Skew=skewness(LabelAppeal), Kurt=kurtosis(LabelAppeal)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.           SD 
## -2.000000000 -1.000000000  0.000000000 -0.009066041  1.000000000  2.000000000  0.891089247 
##         Skew         Kurt 
##  0.008429457 -0.262291551

hist <- ggplot(data_t_mod, aes(LabelAppeal)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of LabelAppeal') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=LabelAppeal)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of LabelAppeal") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of LabelAppeal', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of LabelAppeal by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average.

with(data_t_mod, c(summary(AcidIndex), SD=sd(AcidIndex), Skew=skewness(AcidIndex), Kurt=kurtosis(AcidIndex)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.        SD      Skew      Kurt 
##  4.000000  7.000000  8.000000  7.772724  8.000000 17.000000  1.323926  1.648496  5.190092

hist <- ggplot(data_t_mod, aes(AcidIndex)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of AcidIndex') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=AcidIndex)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of AcidIndex") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of AcidIndex', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of AcidIndex by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

AcidIndex seems to be skewed slightly. It also has a skew of 1.68. This is not enough to worry about.

STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales.

with(data_t_mod, c(summary(STARS), SD=sd(STARS), Skew=skewness(STARS), Kurt=kurtosis(STARS)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.        NA's          SD 
##    1.000000    1.000000    2.000000    2.041755    3.000000    4.000000 3359.000000          NA 
##        Skew        Kurt 
##          NA          NA

hist <- ggplot(data_t_mod, aes(STARS)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of STARS') + theme(plot.title = element_text(hjust = 0.5)) 

qq_plot <- ggplot(data_t_mod, aes(sample=STARS)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of STARS") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 

box_plot <- ggplot(data_t_mod, aes(x="", STARS)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of STARS', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

box_TARGET <- ggplot(data_t_mod, aes(x=factor(TARGET), STARS)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of STARS by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

3.3 Density Plot

data_t_mod %>%
  select_if(is.numeric) %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(x=value)) +                   # Plot the values
    facet_wrap(~key, scales = "free") +    # In separate panels
    geom_density()

Density Plot: Training Set

The scaled histogram and density plots show that variables AcidIndex is right skewed; AcidIndex, STARS, LabelAppeal and TARGET have multimodal distribution; while most others seem to be normally distrbuted due to the bell curve they display.

3.4 Label Scores

m_scores <- data_t_mod$LabelAppeal %>% table() %>% data.frame() %>%  mutate(per = (Freq/sum(Freq))*100)
names(m_scores)[1]<-"score"
lbls <- paste(m_scores$score, "\n", round(m_scores$per, 2)) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(m_scores$Freq,labels = lbls, col= c("#990000", "#336600", "#CC6600", "#CCCC00", "#4CC099"), main="Marketing Scores Proportioned")

3.5 Univariate Analysis

3.5.1 Response Variable

3.6 Correlation Plot

We implement a correlation matrix to better understand the correlation between variables in the dataset.

drop_na(data_t_mod) %>%
  select_if(is.numeric) %>%
  cor() %>%
  #corrplot(method = "square", type = "upper", order = 'hclust', tl.col = "black", diag = FALSE, bg= 'white', col = colorRampPalette(c('deeppink4','white','steelblue1'))(100))
  corrplot.mixed(upper = 'pie', lower = 'number', order = 'hclust', tl.col = "black")

Correlation Pie Chart: Training Set

pairs.panels(data_t_mod[1:15])

Correlation Pie Chart: Training Set

The correlation matrix below shows that the response variable TARGET has strong positive relationship (>=0.6) with variables FixedAcidity,CitricAcid,ResidualSugar,Density,Alcohol.

PerformanceAnalytics::chart.Correlation(data_t, histogram=TRUE, pch=19)

Correlation Chart: Training Set

Scatter plots against TARGET:

data_t %>%
  gather(-TARGET, key = "key", value = "ResponseVariables") %>%
  ggplot(aes(x = ResponseVariables, y = TARGET)) +
  geom_point(size = .5) +
  geom_smooth(method='lm',formula=y~x, color = 'dark grey')+
  facet_wrap(~ key, scales = "free")+
  ggthemes::theme_tufte()+
  ylab('Cases Bought')

## Warning: Removed 8200 rows containing non-finite values (stat_smooth).

## Warning: Removed 8200 rows containing missing values (geom_point).

There don’t seem to be any crazy patterns here. It mostly looks linear which is a good sign for us. STARS and LableAppleal look like they have the greatest correlation.

data_t %>%
  dplyr::select(-(INDEX)) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column('Variable') %>%
  dplyr::rename(Correlation_vs_Response = TARGET)

3.7 Consolidated Data Dictionary

As a summary of the data exploration process, a data dictionary is created below:

data_stat <- data_t %>% 
  dplyr::select(-TARGET,-INDEX) %>%
  gather() %>%
  group_by(key) %>%
  summarise(Mean = mean(value),
            Median = median(value),
            Max = max(value),
            Min = min(value),
            SD = sd(value))

data_cor <- data_t %>%
  cor() %>%
  as.data.frame() %>% 
  dplyr::select(TARGET) %>% 
  rownames_to_column('Variable') %>%
  dplyr::rename(Correlation_vs_Response = TARGET)

data_t %>% 
  gather() %>%
  dplyr::select(key) %>%
  unique() %>%
  dplyr::rename(Variable = key) %>%
  mutate(
         #Var_Type_1 = case_when(Variable %in% c('target','chas') ~ 'categorical', 
         #                        Variable %in% c('rad','tax') ~ 'discrete numerical',
         #                        TRUE ~ 'continuous numerical'),
         #Var_Type_2 = if_else(Variable == 'target', 'response', 'predictor'),
         Missing_Value = 'No') %>%
  left_join(data_stat, by = c('Variable'='key')) %>%
  left_join(data_cor, by = 'Variable') %>%
  mutate_if(is.numeric,round,2) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)

Variable	Missing_Value	Mean	Median	Max	Min	SD	Correlation_vs_Response
TARGET	No	NA	NA	NA	NA	NA	1.00
INDEX	No	NA	NA	NA	NA	NA	0.00
FixedAcidity	No	7.08	6.90	34.40	-18.10	6.32	-0.05
VolatileAcidity	No	0.32	0.28	3.68	-2.79	0.78	-0.09
CitricAcid	No	0.31	0.31	3.86	-3.24	0.86	0.01
ResidualSugar	No	NA	NA	NA	NA	NaN	NA
Chlorides	No	NA	NA	NA	NA	NaN	NA
FreeSulfurDioxide	No	NA	NA	NA	NA	NaN	NA
TotalSulfurDioxide	No	NA	NA	NA	NA	NaN	NA
Density	No	0.99	0.99	1.10	0.89	0.03	-0.04
pH	No	NA	NA	NA	NA	NaN	NA
Sulphates	No	NA	NA	NA	NA	NaN	NA
Alcohol	No	NA	NA	NA	NA	NaN	NA
LabelAppeal	No	-0.01	0.00	2.00	-2.00	0.89	0.36
AcidIndex	No	7.77	8.00	17.00	4.00	1.32	-0.25
STARS	No	NA	NA	NA	NA	NaN	NA

4 DATA PREPARATION

Lets first split the data into training and test.

MICE (Multivariate Imputation by Chained Equations) package helps in inspecting, imputing, diagonise, analyze, pool the result, and generate simulated incomplete data

require(mice)
set.seed(999) 
sampl = caTools::sample.split(data_t_mod$TARGET, SplitRatio = .80)
wine_train1 <- subset(data_t_mod, sampl == TRUE)
wine_test1 <- subset(data_t_mod, sampl == FALSE)

wine_train2 <-  as.data.frame(tidyr::complete(mice(wine_train1, m=1, maxit = 5, seed = 42)))

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

wine_test2 <- as.data.frame(tidyr::complete(mice(wine_test1, m=1, maxit = 5, seed = 42)))

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Given the low correlation between AcidIndex and TARGET it might not make a huge difference, however, we will log transform it to test.

wine_train2$AcidIndex <- log(wine_train2$AcidIndex)
wine_test2$AcidIndex <- log(wine_test2$AcidIndex)

5 BUILD MODELS

5.1 Model I: Poisson Model

5.1.1 Model 1: Poisson Model without imputations

require(ggplot2)
require(gridExtra)

model1 = glm(TARGET ~  ., data=wine_train1, family=poisson)
summary(model1)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2128  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.90e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56901    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93232    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10377    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48274    
## Density            -4.025e-01  2.749e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.530  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4720.5  on 5143  degrees of freedom
## Residual deviance: 3242.8  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18545
## 
## Number of Fisher Scoring iterations: 5

plot(model1)

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

5.1.2 Model 2: Poisson Model without imputations and only significant variables

model2 = glm(TARGET ~  .-FixedAcidity-CitricAcid-ResidualSugar-Chlorides-FreeSulfurDioxide-TotalSulfurDioxide-Density-pH-Sulphates-Alcohol, data=wine_train1, family=poisson)
summary(model2)

## 
## Call:
## glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density - 
##     pH - Sulphates - Alcohol, family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1898  -0.2777   0.0622   0.3764   1.6086  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.251442   0.054724  22.868  < 2e-16 ***
## VolatileAcidity -0.027581   0.009278  -2.973  0.00295 ** 
## LabelAppeal      0.173177   0.008853  19.562  < 2e-16 ***
## AcidIndex       -0.050616   0.006553  -7.724 1.13e-14 ***
## STARS            0.194208   0.008292  23.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4720.5  on 5143  degrees of freedom
## Residual deviance: 3253.1  on 5139  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18535
## 
## Number of Fisher Scoring iterations: 5

plot(model2)

5.1.3 Model 3: Poisson Model with imputations

model3 = glm(TARGET ~  ., data=wine_train2, family=poisson)
summary(model3)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1630  -0.6739   0.1305   0.6337   2.4320  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.337e+00  2.281e-01  10.242  < 2e-16 ***
## FixedAcidity        2.250e-04  9.190e-04   0.245 0.806610    
## VolatileAcidity    -4.313e-02  7.286e-03  -5.919 3.23e-09 ***
## CitricAcid          8.534e-03  6.573e-03   1.298 0.194168    
## ResidualSugar       1.271e-04  1.675e-04   0.759 0.448033    
## Chlorides          -6.572e-02  1.790e-02  -3.673 0.000240 ***
## FreeSulfurDioxide   1.336e-04  3.804e-05   3.512 0.000444 ***
## TotalSulfurDioxide  9.235e-05  2.460e-05   3.754 0.000174 ***
## Density            -3.404e-01  2.144e-01  -1.588 0.112379    
## pH                 -1.962e-02  8.417e-03  -2.331 0.019744 *  
## Sulphates          -1.569e-02  6.157e-03  -2.549 0.010805 *  
## Alcohol             2.951e-03  1.554e-03   1.898 0.057632 .  
## LabelAppeal         1.409e-01  6.798e-03  20.724  < 2e-16 ***
## AcidIndex          -7.709e-01  3.998e-02 -19.280  < 2e-16 ***
## STARS               3.407e-01  6.270e-03  54.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12829  on 10222  degrees of freedom
## AIC: 38417
## 
## Number of Fisher Scoring iterations: 5

plot(model3)

5.1.4 Model 4: Poisson Model with imputations and only significant variables

model4 = glm(TARGET ~  .-FixedAcidity-CitricAcid-ResidualSugar-Density-Alcohol, data=wine_train2, family=poisson)

summary(model4)

## 
## Call:
## glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Density - Alcohol, family = poisson, data = wine_train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1469  -0.6828   0.1295   0.6379   2.4054  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.038e+00  8.840e-02  23.052  < 2e-16 ***
## VolatileAcidity    -4.348e-02  7.284e-03  -5.969 2.39e-09 ***
## Chlorides          -6.725e-02  1.789e-02  -3.760 0.000170 ***
## FreeSulfurDioxide   1.316e-04  3.801e-05   3.461 0.000537 ***
## TotalSulfurDioxide  9.150e-05  2.458e-05   3.723 0.000197 ***
## pH                 -1.991e-02  8.415e-03  -2.366 0.018003 *  
## Sulphates          -1.563e-02  6.153e-03  -2.540 0.011086 *  
## LabelAppeal         1.409e-01  6.798e-03  20.727  < 2e-16 ***
## AcidIndex          -7.729e-01  3.936e-02 -19.636  < 2e-16 ***
## STARS               3.417e-01  6.255e-03  54.634  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12837  on 10227  degrees of freedom
## AIC: 38415
## 
## Number of Fisher Scoring iterations: 5

plot(model4)

5.2 Model II: Negative Binomial

5.2.1 Model 5 : Negative Binomial without imputations

model5 <- glm.nb(TARGET ~ ., data = wine_train1)

summary(model5)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train1, init.theta = 138898.9107, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2127  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.91e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56900    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93231    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10378    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48275    
## Density            -4.025e-01  2.750e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.529  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138898.9) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3242.7  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18547
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138899 
##           Std. Err.:  259921 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18515.07

plot(model5)

5.2.2 Model 6 : Negative Binomial without imputations and only significant variables

model6 <- glm.nb(TARGET ~ .-FixedAcidity-CitricAcid-ResidualSugar-Chlorides-FreeSulfurDioxide-TotalSulfurDioxide-Density-pH-Sulphates-Alcohol, data = wine_train1)

summary(model6)

## 
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density - 
##     pH - Sulphates - Alcohol, data = wine_train1, init.theta = 138402.5261, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1898  -0.2777   0.0622   0.3764   1.6086  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.251443   0.054725  22.868  < 2e-16 ***
## VolatileAcidity -0.027581   0.009279  -2.973  0.00295 ** 
## LabelAppeal      0.173177   0.008853  19.562  < 2e-16 ***
## AcidIndex       -0.050616   0.006553  -7.724 1.13e-14 ***
## STARS            0.194209   0.008292  23.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138402.5) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3253.0  on 5139  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18537
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138403 
##           Std. Err.:  258834 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18525.37

plot(model6)

5.2.3 Model 7 : Negative Binomial with imputations

model7 <- glm.nb(TARGET ~ ., data = wine_train2)

summary(model7)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train2, init.theta = 49078.50992, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1629  -0.6739   0.1305   0.6337   2.4320  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.337e+00  2.281e-01  10.242  < 2e-16 ***
## FixedAcidity        2.250e-04  9.190e-04   0.245 0.806608    
## VolatileAcidity    -4.313e-02  7.286e-03  -5.919 3.23e-09 ***
## CitricAcid          8.534e-03  6.573e-03   1.298 0.194177    
## ResidualSugar       1.271e-04  1.675e-04   0.759 0.448021    
## Chlorides          -6.573e-02  1.790e-02  -3.673 0.000240 ***
## FreeSulfurDioxide   1.336e-04  3.804e-05   3.512 0.000444 ***
## TotalSulfurDioxide  9.235e-05  2.460e-05   3.754 0.000174 ***
## Density            -3.404e-01  2.144e-01  -1.588 0.112389    
## pH                 -1.962e-02  8.418e-03  -2.331 0.019745 *  
## Sulphates          -1.569e-02  6.157e-03  -2.549 0.010806 *  
## Alcohol             2.951e-03  1.554e-03   1.898 0.057642 .  
## LabelAppeal         1.409e-01  6.798e-03  20.723  < 2e-16 ***
## AcidIndex          -7.709e-01  3.999e-02 -19.279  < 2e-16 ***
## STARS               3.407e-01  6.270e-03  54.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(49078.51) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12828  on 10222  degrees of freedom
## AIC: 38419
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  49079 
##           Std. Err.:  63619 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38387.04

plot(model7)

5.2.4 Model 8 : Negative Binomial with imputations and only significant variables

model8 <- glm.nb(TARGET ~ .-FixedAcidity-CitricAcid-ResidualSugar-Density-Alcohol, data = wine_train2)

summary(model8)

## 
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Density - Alcohol, data = wine_train2, init.theta = 48992.35936, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1469  -0.6828   0.1295   0.6379   2.4053  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.038e+00  8.840e-02  23.052  < 2e-16 ***
## VolatileAcidity    -4.348e-02  7.284e-03  -5.969 2.39e-09 ***
## Chlorides          -6.726e-02  1.789e-02  -3.760 0.000170 ***
## FreeSulfurDioxide   1.316e-04  3.801e-05   3.461 0.000537 ***
## TotalSulfurDioxide  9.150e-05  2.458e-05   3.723 0.000197 ***
## pH                 -1.991e-02  8.415e-03  -2.366 0.018004 *  
## Sulphates          -1.563e-02  6.153e-03  -2.540 0.011087 *  
## LabelAppeal         1.409e-01  6.798e-03  20.726  < 2e-16 ***
## AcidIndex          -7.730e-01  3.936e-02 -19.636  < 2e-16 ***
## STARS               3.417e-01  6.255e-03  54.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48992.36) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12837  on 10227  degrees of freedom
## AIC: 38418
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48992 
##           Std. Err.:  63531 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38395.56

plot(model8)

5.3 Model III: Linear Model

5.3.1 Model 9 : Linear Model with imputations

Using Linear Regression Model on imputed training data.

model9 <- lm(TARGET ~ ., data = wine_train2)

summary(model9)

## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7147 -1.0144  0.1737  1.0276  4.3109 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.968e+00  5.567e-01  10.719  < 2e-16 ***
## FixedAcidity        1.297e-03  2.253e-03   0.576 0.564897    
## VolatileAcidity    -1.269e-01  1.791e-02  -7.085 1.48e-12 ***
## CitricAcid          2.625e-02  1.629e-02   1.611 0.107133    
## ResidualSugar       4.231e-04  4.132e-04   1.024 0.305939    
## Chlorides          -2.023e-01  4.391e-02  -4.606 4.15e-06 ***
## FreeSulfurDioxide   3.635e-04  9.387e-05   3.873 0.000108 ***
## TotalSulfurDioxide  2.432e-04  6.023e-05   4.038 5.42e-05 ***
## Density            -8.659e-01  5.260e-01  -1.646 0.099740 .  
## pH                 -4.730e-02  2.072e-02  -2.283 0.022424 *  
## Sulphates          -4.212e-02  1.512e-02  -2.786 0.005346 ** 
## Alcohol             1.251e-02  3.807e-03   3.285 0.001025 ** 
## LabelAppeal         4.311e-01  1.646e-02  26.191  < 2e-16 ***
## AcidIndex          -2.068e+00  9.237e-02 -22.392  < 2e-16 ***
## STARS               1.167e+00  1.671e-02  69.805  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 10222 degrees of freedom
## Multiple R-squared:  0.4605, Adjusted R-squared:  0.4598 
## F-statistic: 623.3 on 14 and 10222 DF,  p-value: < 2.2e-16

plot(model9)

5.3.2 Model 10 : Linear Model with imputations and only significant variables.

As we know the significant variables are FixedAcidity, CitricAcid and ResidualSugar, so using the same in the Linear Regression Model and applying the same of imputed training data.

model10 <- lm(TARGET ~ .-FixedAcidity-CitricAcid-ResidualSugar, data = wine_train2)

summary(model10)

## 
## Call:
## lm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar, 
##     data = wine_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7242 -1.0131  0.1728  1.0331  4.3050 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.951e+00  5.566e-01  10.691  < 2e-16 ***
## VolatileAcidity    -1.278e-01  1.791e-02  -7.138 1.01e-12 ***
## Chlorides          -2.032e-01  4.391e-02  -4.627 3.75e-06 ***
## FreeSulfurDioxide   3.660e-04  9.386e-05   3.899 9.71e-05 ***
## TotalSulfurDioxide  2.454e-04  6.021e-05   4.075 4.63e-05 ***
## Density            -8.712e-01  5.260e-01  -1.656  0.09768 .  
## pH                 -4.728e-02  2.071e-02  -2.282  0.02249 *  
## Sulphates          -4.236e-02  1.511e-02  -2.803  0.00507 ** 
## Alcohol             1.249e-02  3.807e-03   3.281  0.00104 ** 
## LabelAppeal         4.310e-01  1.646e-02  26.186  < 2e-16 ***
## AcidIndex          -2.048e+00  9.074e-02 -22.572  < 2e-16 ***
## STARS               1.167e+00  1.671e-02  69.833  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 10225 degrees of freedom
## Multiple R-squared:  0.4603, Adjusted R-squared:  0.4598 
## F-statistic: 792.9 on 11 and 10225 DF,  p-value: < 2.2e-16

plot(model10)

5.4 Model 11 : Ordinal Logistic Regression

This regression uses ordered factors. I would expect this to be one of the top performers.

polrDF <- wine_train2
polrDF$TARGET <- as.factor(polrDF$TARGET)
model11 <- polr(TARGET ~ ., data = polrDF, Hess=TRUE)
summary(model11)

## Call:
## polr(formula = TARGET ~ ., data = polrDF, Hess = TRUE)
## 
## Coefficients:
##                         Value Std. Error  t value
## FixedAcidity        0.0021819  0.0029055   0.7510
## VolatileAcidity    -0.1555960  0.0232760  -6.6848
## CitricAcid          0.0289713  0.0211212   1.3717
## ResidualSugar       0.0003196  0.0005320   0.6008
## Chlorides          -0.2627487  0.0566657  -4.6368
## FreeSulfurDioxide   0.0004607  0.0001216   3.7875
## TotalSulfurDioxide  0.0002716  0.0000783   3.4686
## Density            -1.2981402  0.1490930  -8.7069
## pH                 -0.0314095  0.0268078  -1.1717
## Sulphates          -0.0339150  0.0196712  -1.7241
## Alcohol             0.0269097  0.0048969   5.4953
## LabelAppeal         0.8256163  0.0237699  34.7337
## AcidIndex          -2.6646249  0.1250905 -21.3016
## STARS               1.4684471  0.0256683  57.2086
## 
## Intercepts:
##     Value    Std. Error t value 
## 0|1  -5.9211   0.1357   -43.6446
## 1|2  -5.7842   0.1355   -42.6743
## 2|3  -5.1811   0.1351   -38.3486
## 3|4  -3.8133   0.1350   -28.2556
## 4|5  -1.9656   0.1372   -14.3273
## 5|6   0.0034   0.1437     0.0237
## 6|7   2.2069   0.1675    13.1788
## 7|8   4.5480   0.3034    14.9895
## 
## Residual Deviance: 30016.23 
## AIC: 30060.23

5.5 Model 12 : Zero inflation

Zero inflation understands that some Poisson distrobutions are dominated by many zeros. As such it corrects for this. This is one of the most promissing ones because as we saw in our data exploration, there were more zeros, and then normally distributed data after that.

library(pscl)

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

model12 <- zeroinfl(TARGET ~ . | STARS, data = wine_train2, dist = 'negbin')
summary(model12)

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train2, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09180 -0.49650  0.07134  0.48208  2.08565 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.824e+00  2.385e-01   7.648 2.04e-14 ***
## FixedAcidity        4.523e-04  9.473e-04   0.477 0.633017    
## VolatileAcidity    -1.936e-02  7.560e-03  -2.561 0.010429 *  
## CitricAcid          2.116e-03  6.718e-03   0.315 0.752830    
## ResidualSugar      -6.323e-05  1.722e-04  -0.367 0.713427    
## Chlorides          -3.245e-02  1.851e-02  -1.754 0.079503 .  
## FreeSulfurDioxide   4.932e-05  3.854e-05   1.280 0.200613    
## TotalSulfurDioxide  4.760e-06  2.460e-05   0.194 0.846535    
## Density            -2.990e-01  2.224e-01  -1.344 0.178840    
## pH                 -1.910e-03  8.749e-03  -0.218 0.827223    
## Sulphates          -4.586e-03  6.391e-03  -0.718 0.473029    
## Alcohol             5.950e-03  1.591e-03   3.741 0.000184 ***
## LabelAppeal         2.240e-01  7.112e-03  31.491  < 2e-16 ***
## AcidIndex          -2.682e-01  4.492e-02  -5.971 2.36e-09 ***
## STARS               1.227e-01  6.997e-03  17.531  < 2e-16 ***
## Log(theta)          1.860e+01  2.758e+00   6.742 1.56e-11 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.31678    0.11092   20.89   <2e-16 ***
## STARS       -2.66899    0.09689  -27.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 119421399.3444 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -1.725e+04 on 18 Df

scatterPreds <- predict(model12, wine_train2)
qplot(wine_train2$TARGET, scatterPreds, main = 'Predicted vs Actual') + ggthemes::theme_tufte()

residPlot <- scatterPreds - wine_train2$TARGET
qplot(wine_train2$TARGET, residPlot, main = 'Residuals') + ggthemes::theme_tufte()

6 MODEL SELECTION

6.1 Compare Models by MSE/AIC

aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- model8$aic
aic9 <- model9$aic
aic10 <- model10$aic
aic11 <- model11$aic
aic12 <- model12$aic
mse1 <- mean((wine_train2$TARGET - predict(model1))^2)
mse2 <- mean((wine_train2$TARGET - predict(model2))^2)
mse3 <- mean((wine_train2$TARGET - predict(model3))^2)
mse4 <- mean((wine_train2$TARGET - predict(model4))^2)
mse5 <- mean((wine_train2$TARGET - predict(model5))^2)
mse6 <- mean((wine_train2$TARGET - predict(model6))^2)
mse7 <- mean((wine_train2$TARGET - predict(model7))^2)
mse8 <- mean((wine_train2$TARGET - predict(model8))^2)
mse9 <- mean((wine_train2$TARGET - predict(model9))^2)
mse10 <- mean((wine_train2$TARGET - predict(model10))^2)
mse11 <- mean((wine_train2$TARGET - predict(model11))^2)
mse12 <- mean((wine_train2$TARGET - predict(model12))^2)

compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8, mse9, mse10, mse11, mse12,
                            aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8, aic9, aic10, aic11, aic12),nrow=12,ncol=2,byrow=TRUE)

rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8","Model9","Model10","Model11","Model12")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_models <- as.data.frame(compare_models)

kable(compare_aic_mse)  %>% 
  kable_styling(full_width = T)

	MSE	AIC
Model1	6.929787	6.926722
Model2	6.849005	6.849920
Model3	6.929788	6.926723
Model4	6.849001	6.849916
Model5	2.002207	2.002977
Model6	NA	1.988813
Model7	18544.983192	18535.285631
Model8	38416.867732	38415.386508
Model9	18547.067808	18537.370350
Model10	38419.044066	38417.562708
Model11	6.929787	6.926722
Model12	6.849005	6.849920

6.2 Compare Models by Loss

Now lets see the output of the Models using test data

We will use the squared loss to validate the model. We will use the squared difference to select a model (MSE) from predictions on the training sets. (Lower numbers are better.)

modelValidation <- function(mod){
  preds = predict(mod, wine_test2)
  diffMat = as.numeric(preds) - as.numeric(wine_test2$TARGET)
  diffMat = diffMat^2
  loss <- mean(diffMat)
  return(loss)
}

compare_models <- matrix(c(modelValidation(model1),modelValidation(model2),modelValidation(model3),modelValidation(model4),modelValidation(model5),modelValidation(model6),
                           modelValidation(model7),modelValidation(model8),modelValidation(model9),modelValidation(model10),modelValidation(model11),modelValidation(model12)),
                         nrow=12,ncol=1,byrow=TRUE)

rownames(compare_models) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7","Model8","Model9","Model10","Model11","Model12")
colnames(compare_models) <- c("Loss:")
compare_models <- as.data.frame(compare_models)
compare_models

Following deductions can be made as per above values:

A regular Poisson regression does not perform very well.
The same can be said for the Negative Binomial.
The linear model actually performs very well.
Very surprisinly, Ordinal Logistic Regression does not work as well as the linear model.

Because we are not interested in gaining insight into the underlying causes of wine selection, we will use the squared loss. This will tell us how accurate our model is without caring about confidence intervals etc.

Based on this metric, Zero Poission Inflation is the most accurate.

From the above models, Model12 - Zero Poission Inflations has least loss as it uses less variables and is parsimonious. Also the R2 looks fine. The squared loss is also fine.

In terms of MSE and AIC, Model 5 is best followed by Model 4 and tied Model 2 and Model 12.
In terms of Loss, Model 12 is best followed closely by Model 9 and Model 10.

Overall, we choose Zero Poission Inflation due to following : - least loss - good MSE score - good AIC score

6.3 Prediction on Evaluation Data

We will use the same method to impute and use log transformation for AcidIndex.

require(mice)
data_e_mod <- data_e %>% dplyr::select(-(IN)) %>% mice(m=1, maxit = 5, seed = 42)

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

## Warning: Number of logged events: 1

wine_eval <- as.data.frame(complete(data_e_mod))
wine_eval$AcidIndex <- log(wine_eval$AcidIndex)
wine_eval$TARGET <- predict(model12, newdata=wine_eval)
write.csv(wine_eval,"Evaluation_Full_Data.csv", row.names=FALSE)

data_predicted_eval <- read_csv("Evaluation_Full_Data.csv")

## Parsed with column specification:
## cols(
##   TARGET = col_double(),
##   FixedAcidity = col_double(),
##   VolatileAcidity = col_double(),
##   CitricAcid = col_double(),
##   ResidualSugar = col_double(),
##   Chlorides = col_double(),
##   FreeSulfurDioxide = col_double(),
##   TotalSulfurDioxide = col_double(),
##   Density = col_double(),
##   pH = col_double(),
##   Sulphates = col_double(),
##   Alcohol = col_double(),
##   LabelAppeal = col_double(),
##   AcidIndex = col_double(),
##   STARS = col_double()
## )

DT::datatable(data_predicted_eval)

TARGET: Number of Cases Purchased as Predicted

options(width=100)
round(with(data_predicted_eval, c(summary(TARGET), StdD=sd(TARGET), Skew=skewness(TARGET), Kurt=kurtosis(TARGET))),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
##    1.03    1.91    3.30    3.24    4.17    8.27    1.38    0.51   -0.32

```