Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

DATA EXPLORATION

Loading Data

Training Dataset

The training dataset has 14 predictors, 1 response variable and 12795 observations.

Evaluation Dataset

Missing Values & Data Type Check

From the 14 predictor variables, 3 are categorical (LabelAppeal,AcidIndex,STARS), the other 11 are continuous numerical. The response variable TARGET is categorical.

## Rows: 12,795
## Columns: 16
## $ TARGET             <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, ~
## $ INDEX              <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19~
## $ FixedAcidity       <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,~
## $ VolatileAcidity    <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1~
## $ CitricAcid         <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34~
## $ ResidualSugar      <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40~
## $ Chlorides          <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.~
## $ FreeSulfurDioxide  <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551~
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65~
## $ Density            <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994~
## $ pH                 <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9~
## $ Sulphates          <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26~
## $ Alcohol            <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1~
## $ LabelAppeal        <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, ~
## $ AcidIndex          <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9~
## $ STARS              <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA~

## Rows: 3,335
## Columns: 16
## $ IN                 <dbl> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 6~
## $ TARGET             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ FixedAcidity       <dbl> 5.4, 12.4, 7.2, 6.2, 11.4, 17.6, 15.5, 15.9, 11.6, ~
## $ VolatileAcidity    <dbl> -0.860, 0.385, 1.750, 0.100, 0.210, 0.040, 0.530, 1~
## $ CitricAcid         <dbl> 0.27, -0.76, 0.17, 1.80, 0.28, -1.15, -0.53, 1.14, ~
## $ ResidualSugar      <dbl> -10.70, -19.70, -33.00, 1.00, 1.20, 1.40, 4.60, 31.~
## $ Chlorides          <dbl> 0.092, 1.169, 0.065, -0.179, 0.038, 0.535, 1.263, -~
## $ FreeSulfurDioxide  <dbl> 23, -37, 9, 104, 70, -250, 10, 115, 35, 40, NA, -21~
## $ TotalSulfurDioxide <dbl> 398, 68, 76, 89, 53, 140, 17, 381, 83, 129, 583, -5~
## $ Density            <dbl> 0.98527, 0.99048, 1.04641, 0.98877, 1.02899, 0.9502~
## $ pH                 <dbl> 5.02, 3.37, 4.61, 3.20, 2.54, 3.06, 3.07, 2.99, 3.3~
## $ Sulphates          <dbl> 0.64, 1.09, 0.68, 2.11, -0.07, -0.02, 0.75, 0.31, 2~
## $ Alcohol            <dbl> 12.30, 16.00, 8.55, 12.30, 4.80, 11.40, 8.50, 11.40~
## $ LabelAppeal        <dbl> -1, 0, 0, -1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, -1, -1,~
## $ AcidIndex          <dbl> 6, 6, 8, 8, 10, 8, 12, 7, 12, 7, 8, 10, 9, 8, 9, 11~
## $ STARS              <dbl> NA, 2, 1, 1, NA, 4, 3, NA, NA, NA, 1, NA, 2, NA, NA~

Missing Values & Data Type Check

	Non_NAs	NAs	NA_Percent
TARGET	12795	0	0.0000000
INDEX	12795	0	0.0000000
FixedAcidity	12795	0	0.0000000
VolatileAcidity	12795	0	0.0000000
CitricAcid	12795	0	0.0000000
ResidualSugar	12179	616	0.0481438
Chlorides	12157	638	0.0498632
FreeSulfurDioxide	12148	647	0.0505666
TotalSulfurDioxide	12113	682	0.0533021
Density	12795	0	0.0000000
pH	12400	395	0.0308714
Sulphates	11585	1210	0.0945682
Alcohol	12142	653	0.0510356
LabelAppeal	12795	0	0.0000000
AcidIndex	12795	0	0.0000000
STARS	9436	3359	0.2625244

Summary statistics for the training dataset

A binary logistic regression model is built using the training set, hence the training set is used for data exploration.

training set

summary(training_modified)

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

Outliers

training_modified %>%
  scale() %>%
  as.data.frame() %>%
  stack() %>%
  ggplot(aes(x = ind, y = values)) +
  geom_boxplot(fill = 'deeppink4') +
  labs(title = 'Boxplot: Scaled Training Set',
       x = 'Variables',
       y = 'Normalized_Values')+
  theme(panel.background = element_rect(fill = 'grey'),axis.text.x=element_text(size=10, angle=90))

Boxplot: Scaled Training Set

From the above Boxplot outliers occur in: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal andAcidIndex.

Histograms

FixedAcidity:

library(qqplotr)
with(training_modified, c(summary(FixedAcidity), SD=sd(FixedAcidity), Skew=skewness(FixedAcidity), Kurt=kurtosis(FixedAcidity)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## -18.10000000   5.20000000   6.90000000   7.07571708   9.50000000  34.40000000 
##           SD         Skew         Kurt 
##   6.31764346  -0.02258596   1.67499867

hist <- ggplot(training_modified, aes(FixedAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = 4, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of FixedAcidity') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=FixedAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FixedAcidity") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of FixedAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of FixedAcidity by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

VolatileAcidity:

with(training_modified, c(summary(VolatileAcidity), SD=sd(VolatileAcidity), Skew=skewness(VolatileAcidity), Kurt=kurtosis(VolatileAcidity)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## -2.79000000  0.13000000  0.28000000  0.32410395  0.64000000  3.68000000 
##          SD        Skew        Kurt 
##  0.78401424  0.02037997  1.83221064

hist <- ggplot(training_modified, aes(VolatileAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of VolatileAcidity') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=VolatileAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of VolatileAcidity") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of VolatileAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of VolatileAcidity by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

CitricAcid:

with(training_modified, c(summary(CitricAcid), SD=sd(CitricAcid), Skew=skewness(CitricAcid), Kurt=kurtosis(CitricAcid)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## -3.24000000  0.03000000  0.31000000  0.30841266  0.58000000  3.86000000 
##          SD        Skew        Kurt 
##  0.86207979 -0.05030704  1.83794007

hist <- ggplot(training_modified, aes(CitricAcid)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of CitricAcid') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=CitricAcid)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of CitricAcid") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of CitricAcid', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of CitricAcid by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

ResidualSugar:

with(training_modified, c(summary(ResidualSugar), SD=sd(ResidualSugar), Skew=skewness(ResidualSugar), Kurt=kurtosis(ResidualSugar)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## -127.800000   -2.000000    3.900000    5.418733   15.900000  141.150000 
##        NA's          SD        Skew        Kurt 
##  616.000000          NA          NA          NA

hist <- ggplot(training_modified, aes(ResidualSugar)) + geom_histogram(fill = 'dodgerblue', binwidth = 20, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of ResidualSugar') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=ResidualSugar)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of ResidualSugar") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of ResidualSugar', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of ResidualSugar by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Chlorides:

with(training_modified, c(summary(Chlorides), SD=sd(Chlorides), Skew=skewness(Chlorides), Kurt=kurtosis(Chlorides)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
##  -1.17100000  -0.03100000   0.04600000   0.05482249   0.15300000   1.35100000 
##         NA's           SD         Skew         Kurt 
## 638.00000000           NA           NA           NA

hist <- ggplot(training_modified, aes(Chlorides)) + geom_histogram(fill = 'dodgerblue', binwidth = .2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Chlorides') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=Chlorides)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Chlorides") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Chlorides', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Chlorides by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

FreeSulfurDioxide

with(training_modified, c(summary(FreeSulfurDioxide), SD=sd(FreeSulfurDioxide), Skew=skewness(FreeSulfurDioxide), Kurt=kurtosis(FreeSulfurDioxide)))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
## -555.00000    0.00000   30.00000   30.84557   70.00000  623.00000  647.00000 
##         SD       Skew       Kurt 
##         NA         NA         NA

hist <- ggplot(training_modified, aes(FreeSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 50, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of FreeSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=FreeSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FreeSulfurDioxide") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of FreeSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of FreeSulfurDioxide by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

TotalSulfurDioxide

with(training_modified, c(summary(TotalSulfurDioxide), SD=sd(TotalSulfurDioxide), Skew=skewness(TotalSulfurDioxide), Kurt=kurtosis(TotalSulfurDioxide)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's        SD 
## -823.0000   27.0000  123.0000  120.7142  208.0000 1057.0000  682.0000        NA 
##      Skew      Kurt 
##        NA        NA

hist <- ggplot(training_modified, aes(TotalSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 200, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of TotalSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=TotalSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of TotalSulfurDioxide") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of TotalSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of TotalSulfurDioxide by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Density:

with(training_modified, c(summary(Density), SD=sd(Density), Skew=skewness(Density), Kurt=kurtosis(Density)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
##  0.88809000  0.98772000  0.99449000  0.99420272  1.00051500  1.09924000 
##          SD        Skew        Kurt 
##  0.02653765 -0.01869376  1.89995921

hist <- ggplot(training_modified, aes(Density)) + geom_histogram(fill = 'dodgerblue', binwidth = .05, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Density') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=Density)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Density") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", Density)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Density', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Density)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Density by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Sulphates:

with(training_modified, c(summary(Sulphates), SD=sd(Sulphates), Skew=skewness(Sulphates), Kurt=kurtosis(Sulphates)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
##   -3.1300000    0.2800000    0.5000000    0.5271118    0.8600000    4.2400000 
##         NA's           SD         Skew         Kurt 
## 1210.0000000           NA           NA           NA

hist <- ggplot(training_modified, aes(Sulphates)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Sulphates') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=Sulphates)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Sulphates") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Sulphates', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Sulphates by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Alcohol:

with(training_modified, c(summary(Alcohol), SD=sd(Alcohol), Skew=skewness(Alcohol), Kurt=kurtosis(Alcohol)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's        SD 
##  -4.70000   9.00000  10.40000  10.48924  12.40000  26.50000 653.00000        NA 
##      Skew      Kurt 
##        NA        NA

hist <- ggplot(training_modified, aes(Alcohol)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of Alcohol') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=Alcohol)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Alcohol") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of Alcohol', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of Alcohol by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

LabelAppeal:

with(training_modified, c(summary(LabelAppeal), SD=sd(LabelAppeal), Skew=skewness(LabelAppeal), Kurt=kurtosis(LabelAppeal)))

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## -2.000000000 -1.000000000  0.000000000 -0.009066041  1.000000000  2.000000000 
##           SD         Skew         Kurt 
##  0.891089247  0.008429457 -0.262291551

hist <- ggplot(training_modified, aes(LabelAppeal)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of LabelAppeal') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=LabelAppeal)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of LabelAppeal") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of LabelAppeal', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of LabelAppeal by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

AcidIndex:

with(training_modified, c(summary(AcidIndex), SD=sd(AcidIndex), Skew=skewness(AcidIndex), Kurt=kurtosis(AcidIndex)))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.        SD      Skew 
##  4.000000  7.000000  8.000000  7.772724  8.000000 17.000000  1.323926  1.648496 
##      Kurt 
##  5.190092

hist <- ggplot(training_modified, aes(AcidIndex)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of AcidIndex') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=AcidIndex)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of AcidIndex") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of AcidIndex', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of AcidIndex by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

STARS:

with(training_modified, c(summary(STARS), SD=sd(STARS), Skew=skewness(STARS), Kurt=kurtosis(STARS)))

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
##    1.000000    1.000000    2.000000    2.041755    3.000000    4.000000 
##        NA's          SD        Skew        Kurt 
## 3359.000000          NA          NA          NA

hist <- ggplot(training_modified, aes(STARS)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) + 
 theme_classic() + labs(title = 'Histogram of STARS') + theme(plot.title = element_text(hjust = 0.5)) 
qq_plot <- ggplot(training_modified, aes(sample=STARS)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
  labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of STARS") + theme_classic() + 
  theme(plot.title = element_text(hjust = 0.5)) 
box_plot <- ggplot(training_modified, aes(x="", STARS)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
  labs(title = 'Boxplot of STARS', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), STARS)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
  labs(x='TARGET', title = 'Boxplot of STARS by TARGET') + theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) 
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Density Plot

training_modified %>%
  select_if(is.numeric) %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(x=value)) +                   # Plot the values
    facet_wrap(~key, scales = "free") +    # In separate panels
    geom_density()

Density Plot: Training Set

From the scaled histogram and density plots, AcidIndex is skewed right; AcidIndex, STARS, LabelAppeal and TARGET have multimodal distribution; while other variables seem to be normally distrbuted.

Correlation Plot

Correlation between variables in the dataset.

drop_na(training_modified) %>%
  select_if(is.numeric) %>%
  cor() %>%
  #corrplot(method = "square", type = "upper", order = 'hclust', tl.col = "black", diag = FALSE, bg= 'white', col = colorRampPalette(c('deeppink4','white','steelblue1'))(100))
  corrplot.mixed(upper = 'pie', lower = 'number', order = 'hclust', tl.col = "black")

Correlation Pie Chart: Training Set

From the correlation matrix below, the response variable TARGET has strong positive relationship with variables FixedAcidity,CitricAcid,ResidualSugar,Density,Alcohol.

PerformanceAnalytics::chart.Correlation(training_dataset, histogram=TRUE, pch=19)

Correlation Chart: Training Set

Scatter plots

training_dataset %>%
  gather(-TARGET, key = "key", value = "ResponseVariables") %>%
  ggplot(aes(x = ResponseVariables, y = TARGET)) +
  geom_point(size = .5) +
  geom_smooth(method='lm',formula=y~x, color = 'dark grey')+
  facet_wrap(~ key, scales = "free")+
  ggthemes::theme_tufte()+
  ylab('Cases Bought')

## Warning: Removed 8200 rows containing non-finite values (stat_smooth).

## Warning: Removed 8200 rows containing missing values (geom_point).

There are no unusual patterns. STARS and LableAppleal seem to have the greatest correlation.

training_dataset %>%
  dplyr::select(-(INDEX)) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column('Variable') %>%
  dplyr::rename(Correlation_vs_Response = TARGET)

##              Variable Correlation_vs_Response FixedAcidity VolatileAcidity
## 1              TARGET             1.000000000 -0.049010939     -0.08879321
## 2        FixedAcidity            -0.049010939  1.000000000      0.01237500
## 3     VolatileAcidity            -0.088793212  0.012375004      1.00000000
## 4          CitricAcid             0.008684633  0.014240478     -0.01695337
## 5       ResidualSugar                      NA           NA              NA
## 6           Chlorides                      NA           NA              NA
## 7   FreeSulfurDioxide                      NA           NA              NA
## 8  TotalSulfurDioxide                      NA           NA              NA
## 9             Density            -0.035517502  0.006476528      0.01473492
## 10                 pH                      NA           NA              NA
## 11          Sulphates                      NA           NA              NA
## 12            Alcohol                      NA           NA              NA
## 13        LabelAppeal             0.356500469 -0.003366431     -0.01698703
## 14          AcidIndex            -0.246049449  0.178437010      0.04464154
## 15              STARS                      NA           NA              NA
##      CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## 1   0.008684633            NA        NA                NA                 NA
## 2   0.014240478            NA        NA                NA                 NA
## 3  -0.016953371            NA        NA                NA                 NA
## 4   1.000000000            NA        NA                NA                 NA
## 5            NA             1        NA                NA                 NA
## 6            NA            NA         1                NA                 NA
## 7            NA            NA        NA                 1                 NA
## 8            NA            NA        NA                NA                  1
## 9  -0.013952174            NA        NA                NA                 NA
## 10           NA            NA        NA                NA                 NA
## 11           NA            NA        NA                NA                 NA
## 12           NA            NA        NA                NA                 NA
## 13  0.008650177            NA        NA                NA                 NA
## 14  0.065697323            NA        NA                NA                 NA
## 15           NA            NA        NA                NA                 NA
##         Density pH Sulphates Alcohol  LabelAppeal   AcidIndex STARS
## 1  -0.035517502 NA        NA      NA  0.356500469 -0.24604945    NA
## 2   0.006476528 NA        NA      NA -0.003366431  0.17843701    NA
## 3   0.014734923 NA        NA      NA -0.016987027  0.04464154    NA
## 4  -0.013952174 NA        NA      NA  0.008650177  0.06569732    NA
## 5            NA NA        NA      NA           NA          NA    NA
## 6            NA NA        NA      NA           NA          NA    NA
## 7            NA NA        NA      NA           NA          NA    NA
## 8            NA NA        NA      NA           NA          NA    NA
## 9   1.000000000 NA        NA      NA -0.009370077  0.04041295    NA
## 10           NA  1        NA      NA           NA          NA    NA
## 11           NA NA         1      NA           NA          NA    NA
## 12           NA NA        NA       1           NA          NA    NA
## 13 -0.009370077 NA        NA      NA  1.000000000  0.02475468    NA
## 14  0.040412947 NA        NA      NA  0.024754678  1.00000000    NA
## 15           NA NA        NA      NA           NA          NA     1

Data Exploration Summary

The data exploration process can be summarized in the data dictionary below:

data_stat <- training_dataset %>% 
  dplyr::select(-TARGET,-INDEX) %>%
  gather() %>%
  group_by(key) %>%
  summarise(Mean = mean(value),
            Median = median(value),
            Max = max(value),
            Min = min(value),
            SD = sd(value))
data_cor <- training_dataset %>%
  cor() %>%
  as.data.frame() %>% 
  dplyr::select(TARGET) %>% 
  rownames_to_column('Variable') %>%
  dplyr::rename(Correlation_vs_Response = TARGET)
training_dataset %>% 
  gather() %>%
  dplyr::select(key) %>%
  unique() %>%
  dplyr::rename(Variable = key) %>%
  mutate(
         Missing_Value = 'No') %>%
  left_join(data_stat, by = c('Variable'='key')) %>%
  left_join(data_cor, by = 'Variable') %>%
  mutate_if(is.numeric,round,2) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)

Variable	Missing_Value	Mean	Median	Max	Min	SD	Correlation_vs_Response
TARGET	No	NA	NA	NA	NA	NA	1.00
INDEX	No	NA	NA	NA	NA	NA	0.00
FixedAcidity	No	7.08	6.90	34.40	-18.10	6.32	-0.05
VolatileAcidity	No	0.32	0.28	3.68	-2.79	0.78	-0.09
CitricAcid	No	0.31	0.31	3.86	-3.24	0.86	0.01
ResidualSugar	No	NA	NA	NA	NA	NA	NA
Chlorides	No	NA	NA	NA	NA	NA	NA
FreeSulfurDioxide	No	NA	NA	NA	NA	NA	NA
TotalSulfurDioxide	No	NA	NA	NA	NA	NA	NA
Density	No	0.99	0.99	1.10	0.89	0.03	-0.04
pH	No	NA	NA	NA	NA	NA	NA
Sulphates	No	NA	NA	NA	NA	NA	NA
Alcohol	No	NA	NA	NA	NA	NA	NA
LabelAppeal	No	-0.01	0.00	2.00	-2.00	0.89	0.36
AcidIndex	No	7.77	8.00	17.00	4.00	1.32	-0.25
STARS	No	NA	NA	NA	NA	NA	NA

DATA PREPARATION

I split the data into training and test.

The MICE package comes in handy.

set.seed(999) 
sampl = caTools::sample.split(training_modified$TARGET, SplitRatio = .80)
train_dataset_1 <- subset(training_modified, sampl == TRUE)
test_dataset_1 <- subset(training_modified, sampl == FALSE)
train_dataset_2 <-  as.data.frame(tidyr::complete(mice(train_dataset_1, m=1, maxit = 5, seed = 42)))

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

test_dataset_2 <- as.data.frame(tidyr::complete(mice(test_dataset_1, m=1, maxit = 5, seed = 42)))

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Perform log transform to the data.

train_dataset_2$AcidIndex <- log(train_dataset_2$AcidIndex)
test_dataset_2$AcidIndex <- log(test_dataset_2$AcidIndex)

BUILD MODELS

Poisson Models

Model 1: Poisson Model without imputations

require(ggplot2)
require(gridExtra)
model1 = glm(TARGET ~  ., data=train_dataset_1, family=poisson)
summary(model1)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_dataset_1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2128  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.90e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56901    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93232    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10377    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48274    
## Density            -4.025e-01  2.749e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.530  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4720.5  on 5143  degrees of freedom
## Residual deviance: 3242.8  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18545
## 
## Number of Fisher Scoring iterations: 5

plot(model1)

grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)

Model 2: Poisson Model with imputations

model2 = glm(TARGET ~  ., data=train_dataset_2, family=poisson)
summary(model2)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_dataset_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9896  -0.6853   0.1295   0.6351   2.4378  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.350e+00  2.280e-01  10.308  < 2e-16 ***
## FixedAcidity        2.238e-04  9.202e-04   0.243 0.807855    
## VolatileAcidity    -4.578e-02  7.286e-03  -6.282 3.33e-10 ***
## CitricAcid          6.770e-03  6.566e-03   1.031 0.302500    
## ResidualSugar       1.586e-04  1.674e-04   0.947 0.343467    
## Chlorides          -6.268e-02  1.794e-02  -3.494 0.000476 ***
## FreeSulfurDioxide   1.249e-04  3.801e-05   3.285 0.001019 ** 
## TotalSulfurDioxide  8.362e-05  2.457e-05   3.403 0.000666 ***
## Density            -3.887e-01  2.145e-01  -1.813 0.069894 .  
## pH                 -1.990e-02  8.406e-03  -2.367 0.017923 *  
## Sulphates          -1.035e-02  6.173e-03  -1.677 0.093602 .  
## Alcohol             2.527e-03  1.557e-03   1.623 0.104531    
## LabelAppeal         1.429e-01  6.779e-03  21.084  < 2e-16 ***
## AcidIndex          -7.543e-01  4.005e-02 -18.833  < 2e-16 ***
## STARS               3.432e-01  6.241e-03  54.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12767  on 10222  degrees of freedom
## AIC: 38355
## 
## Number of Fisher Scoring iterations: 5

plot(model2)

Negative Binomial Models

Model 3 : Negative Binomial without imputations

model3 <- glm.nb(TARGET ~ ., data = train_dataset_1)
summary(model3)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train_dataset_1, init.theta = 138898.9107, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2127  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.91e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56900    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93231    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10378    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48275    
## Density            -4.025e-01  2.750e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.529  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138898.9) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3242.7  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18547
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138899 
##           Std. Err.:  259921 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18515.07

plot(model3)

Model 4 : Negative Binomial with imputations

model4 <- glm.nb(TARGET ~ ., data = train_dataset_2)
summary(model4)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train_dataset_2, init.theta = 49067.84702, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9896  -0.6852   0.1295   0.6351   2.4377  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.350e+00  2.280e-01  10.307  < 2e-16 ***
## FixedAcidity        2.238e-04  9.202e-04   0.243 0.807848    
## VolatileAcidity    -4.578e-02  7.287e-03  -6.282 3.34e-10 ***
## CitricAcid          6.770e-03  6.566e-03   1.031 0.302510    
## ResidualSugar       1.586e-04  1.674e-04   0.947 0.343463    
## Chlorides          -6.268e-02  1.794e-02  -3.494 0.000476 ***
## FreeSulfurDioxide   1.249e-04  3.801e-05   3.285 0.001020 ** 
## TotalSulfurDioxide  8.362e-05  2.457e-05   3.403 0.000666 ***
## Density            -3.887e-01  2.145e-01  -1.813 0.069902 .  
## pH                 -1.990e-02  8.406e-03  -2.367 0.017923 *  
## Sulphates          -1.035e-02  6.173e-03  -1.677 0.093603 .  
## Alcohol             2.527e-03  1.557e-03   1.623 0.104549    
## LabelAppeal         1.429e-01  6.779e-03  21.084  < 2e-16 ***
## AcidIndex          -7.543e-01  4.005e-02 -18.832  < 2e-16 ***
## STARS               3.433e-01  6.241e-03  54.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(49067.85) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12766  on 10222  degrees of freedom
## AIC: 38357
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  49068 
##           Std. Err.:  63120 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38324.77

plot(model4)

Linear Models

Model 5 : Linear Model with imputations

Using Linear Regression Model on imputed training data.

model5 <- lm(TARGET ~ ., data = train_dataset_2)
summary(model5)

## 
## Call:
## lm(formula = TARGET ~ ., data = train_dataset_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3708 -1.0318  0.1602  1.0263  4.3131 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.001e+00  5.535e-01  10.842  < 2e-16 ***
## FixedAcidity        1.363e-03  2.241e-03   0.608 0.543026    
## VolatileAcidity    -1.326e-01  1.781e-02  -7.445 1.05e-13 ***
## CitricAcid          1.950e-02  1.621e-02   1.203 0.229052    
## ResidualSugar       4.527e-04  4.108e-04   1.102 0.270444    
## Chlorides          -1.916e-01  4.383e-02  -4.371 1.25e-05 ***
## FreeSulfurDioxide   3.229e-04  9.354e-05   3.452 0.000560 ***
## TotalSulfurDioxide  2.237e-04  5.984e-05   3.738 0.000187 ***
## Density            -9.958e-01  5.233e-01  -1.903 0.057047 .  
## pH                 -4.893e-02  2.060e-02  -2.376 0.017542 *  
## Sulphates          -2.701e-02  1.510e-02  -1.789 0.073692 .  
## Alcohol             1.092e-02  3.777e-03   2.892 0.003831 ** 
## LabelAppeal         4.385e-01  1.633e-02  26.862  < 2e-16 ***
## AcidIndex          -2.020e+00  9.201e-02 -21.954  < 2e-16 ***
## STARS               1.175e+00  1.655e-02  71.034  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.409 on 10222 degrees of freedom
## Multiple R-squared:  0.4661, Adjusted R-squared:  0.4654 
## F-statistic: 637.5 on 14 and 10222 DF,  p-value: < 2.2e-16

plot(model5)

Model 6 : Ordinal Logistic Regression

This regression uses ordered factors. I would expect this to be one of the top performers.

polrDF <- train_dataset_2
polrDF$TARGET <- as.factor(polrDF$TARGET)
model6 <- polr(TARGET ~ ., data = polrDF, Hess=TRUE)
summary(model6)

## Call:
## polr(formula = TARGET ~ ., data = polrDF, Hess = TRUE)
## 
## Coefficients:
##                         Value Std. Error  t value
## FixedAcidity        0.0025541  2.909e-03   0.8780
## VolatileAcidity    -0.1635870  2.328e-02  -7.0256
## CitricAcid          0.0195114  2.111e-02   0.9242
## ResidualSugar       0.0003666  5.310e-04   0.6904
## Chlorides          -0.2395616  5.681e-02  -4.2170
## FreeSulfurDioxide   0.0004071  1.216e-04   3.3467
## TotalSulfurDioxide  0.0002398  7.814e-05   3.0682
## Density            -1.5138603  1.493e-01 -10.1393
## pH                 -0.0385963  2.685e-02  -1.4375
## Sulphates          -0.0207772  1.982e-02  -1.0483
## Alcohol             0.0253462  4.883e-03   5.1902
## LabelAppeal         0.8329795  2.381e-02  34.9811
## AcidIndex          -2.5978181  1.251e-01 -20.7696
## STARS               1.4733679  2.563e-02  57.4787
## 
## Intercepts:
##     Value    Std. Error t value 
## 0|1  -6.0275   0.1360   -44.3081
## 1|2  -5.8898   0.1359   -43.3357
## 2|3  -5.2832   0.1355   -38.9891
## 3|4  -3.9123   0.1354   -28.8884
## 4|5  -2.0643   0.1377   -14.9885
## 5|6  -0.0952   0.1442    -0.6605
## 6|7   2.1034   0.1679    12.5310
## 7|8   4.4397   0.3036    14.6243
## 
## Residual Deviance: 29978.30 
## AIC: 30022.30

Model 7 : Zero inflation

Zero inflation understands that some Poisson distrobutions are dominated by many zeros. As such it corrects for this. This is one of the most promissing ones because as we saw in our data exploration, there were more zeros, and then normally distributed data after that.

library(pscl)

## Warning: package 'pscl' was built under R version 4.0.5

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

model7 <- zeroinfl(TARGET ~ . | STARS, data = train_dataset_2, dist = 'negbin')
summary(model7)

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train_dataset_2, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17822 -0.50776  0.06056  0.48757  2.09505 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.793e+00  2.380e-01   7.532 4.99e-14 ***
## FixedAcidity        5.166e-04  9.446e-04   0.547  0.58447    
## VolatileAcidity    -2.095e-02  7.556e-03  -2.772  0.00556 ** 
## CitricAcid          7.361e-04  6.706e-03   0.110  0.91260    
## ResidualSugar      -3.494e-05  1.721e-04  -0.203  0.83913    
## Chlorides          -2.832e-02  1.851e-02  -1.530  0.12608    
## FreeSulfurDioxide   3.380e-05  3.849e-05   0.878  0.37983    
## TotalSulfurDioxide  4.004e-07  2.457e-05   0.016  0.98700    
## Density            -2.918e-01  2.222e-01  -1.313  0.18915    
## pH                  8.443e-04  8.732e-03   0.097  0.92297    
## Sulphates          -2.372e-03  6.392e-03  -0.371  0.71058    
## Alcohol             6.463e-03  1.587e-03   4.073 4.63e-05 ***
## LabelAppeal         2.280e-01  7.088e-03  32.169  < 2e-16 ***
## AcidIndex          -2.614e-01  4.464e-02  -5.856 4.73e-09 ***
## STARS               1.200e-01  6.995e-03  17.156  < 2e-16 ***
## Log(theta)          1.749e+01  4.449e+00   3.932 8.44e-05 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.7093     0.1347   20.11   <2e-16 ***
## STARS        -3.0279     0.1235  -24.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 39544917.0696 
## Number of iterations in BFGS optimization: 23 
## Log-likelihood: -1.715e+04 on 18 Df

scatterPreds <- predict(model7, train_dataset_2)
qplot(train_dataset_2$TARGET, scatterPreds, main = 'Predicted vs Actual') + ggthemes::theme_tufte()

residPlot <- scatterPreds - train_dataset_2$TARGET
qplot(train_dataset_2$TARGET, residPlot, main = 'Residuals') + ggthemes::theme_tufte()

MODEL SELECTION

Compare Models by MSE/AIC

aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
mse1 <- mean((train_dataset_2$TARGET - predict(model1))^2)
mse2 <- mean((train_dataset_2$TARGET - predict(model2))^2)
mse3 <- mean((train_dataset_2$TARGET - predict(model3))^2)
mse4 <- mean((train_dataset_2$TARGET - predict(model4))^2)
mse5 <- mean((train_dataset_2$TARGET - predict(model5))^2)
mse6 <- mean((train_dataset_2$TARGET - predict(model6))^2)
mse7 <- mean((train_dataset_2$TARGET - predict(model7))^2)
compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7,
                            aic1, aic2, aic3, aic4, aic5, aic6,
                            aic7),nrow=7,ncol=2,byrow=TRUE)
rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_models <- as.data.frame(compare_models)
kable(compare_aic_mse)  %>% 
  kable_styling(full_width = T)

	MSE	AIC
Model1	6.929787	6.842661
Model2	6.929788	6.842657
Model3	1.981498	NA
Model4	1.942825	18544.983192
Model5	38354.589216	18547.067808
Model6	38356.768268	6.929787
Model7	6.842661	6.929788

Compare Models by Loss

Now lets see the output of the Models using test data

We will use the squared loss to validate the model. We will use the squared difference to select a model (MSE) from predictions on the training sets. (Lower numbers are better.)

modelValidation <- function(mod){
  preds = predict(mod, test_dataset_2)
  diffMat = as.numeric(preds) - as.numeric(test_dataset_2$TARGET)
  diffMat = diffMat^2
  loss <- mean(diffMat)
  return(loss)
}
compare_models <- matrix(c(modelValidation(model1),modelValidation(model2),
                     modelValidation(model3),modelValidation(model4),
                     modelValidation(model5),modelValidation(model6),
                     modelValidation(model7)),
                         nrow=7,ncol=1,byrow=TRUE)
rownames(compare_models) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7")
colnames(compare_models) <- c("Loss:")
compare_models <- as.data.frame(compare_models)
compare_models

##           Loss:
## Model1 5.479565
## Model2 6.844632
## Model3 5.479559
## Model4 6.844628
## Model5 2.038659
## Model6 3.667318
## Model7 2.003325

The squared loss metric tells how accurate our model is without caring about confidence intervals among others. Based on this parameter, the Zero Poisson Inflation model is the most accurate.

model7 (Zero Poisson Inflations) is the best choice overally due to the following reasons: - least loss - good MSE score - good AIC score

Prediction on Evaluation Data

In the same manner as before, lets impute and use log transformation for AcidIndex.

evaluation_data_modified <- evaluation_dataset %>% dplyr::select(-(IN)) %>% mice(m=1, maxit = 5, seed = 42)

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

## Warning: Number of logged events: 1

imputed_eval_data <- as.data.frame(complete(evaluation_data_modified))
imputed_eval_data$AcidIndex <- log(imputed_eval_data$AcidIndex)
imputed_eval_data$TARGET <- predict(model7, newdata=imputed_eval_data)
write.csv(imputed_eval_data,"predicted_data.csv", row.names=FALSE)
data_predicted_eval <- read_csv("predicted_data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   TARGET = col_double(),
##   FixedAcidity = col_double(),
##   VolatileAcidity = col_double(),
##   CitricAcid = col_double(),
##   ResidualSugar = col_double(),
##   Chlorides = col_double(),
##   FreeSulfurDioxide = col_double(),
##   TotalSulfurDioxide = col_double(),
##   Density = col_double(),
##   pH = col_double(),
##   Sulphates = col_double(),
##   Alcohol = col_double(),
##   LabelAppeal = col_double(),
##   AcidIndex = col_double(),
##   STARS = col_double()
## )

DT::datatable(data_predicted_eval)

TARGET: Predicted Number of Cases Purchased

options(width=100)
round(with(data_predicted_eval, c(summary(TARGET), StdD=sd(TARGET), Skew=skewness(TARGET), Kurt=kurtosis(TARGET))),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
##    1.01    1.91    3.32    3.25    4.20    8.28    1.38    0.47   -0.36

```

Data 621 Homework 5

Trishita Nath

12/12/2021

Overview

DATA EXPLORATION

Loading Data

Training Dataset

Evaluation Dataset

Missing Values & Data Type Check

Summary statistics for the training dataset

Outliers

Histograms

Density Plot

Correlation Plot

Scatter plots

Data Exploration Summary

DATA PREPARATION

BUILD MODELS

Poisson Models

Model 1: Poisson Model without imputations

Model 2: Poisson Model with imputations

Negative Binomial Models

Model 3 : Negative Binomial without imputations

Model 4 : Negative Binomial with imputations

Linear Models

Model 5 : Linear Model with imputations

Model 6 : Ordinal Logistic Regression

Model 7 : Zero inflation

MODEL SELECTION

Compare Models by MSE/AIC

Compare Models by Loss

Prediction on Evaluation Data