Wine Sales

Whining about Wine Sales

Overview

In this report we attempt to build a model for wine sales as would be predicted by a number of factors about the wine and its packaging. In the end a zero inflated poisson model seems to be the best model. Particularly important to sales variables seem to consist of Label Appeal, Acid Index, Stars and to a lesser extent Alcohol.

Data Exploration

Data was provided already split into a training dataset of 12,795 observations, and an evaluation dataset of 16,129 observations. There was one response variable, Number of Cases purchased, and 14 predictors.

Below is a short description of the variables of interest in the data set:

Variable Name Description Import
Index Identification Not used
Target Number of cases purchased Response variable
————— ————– ———
AcidIndex Proprietary total acidity measure
Alcohol Alcohol content
Chlorides Chloride content
Density Density of Wine
FixedAcidity Fixed Acidity
FreeSulfurDioxide Sulfur Dioxide Content
Label Appeal Marketing Score indicating appeal of label Expected positive
ResidualSugar Residual Sugar
STARS Wine rating by experts Positive
Sulphates Sulfate content
TotalSulfurDioxide Total Sulfur Dioxide
VolatileAcidity Volatile Acid content
pH pH of wine

On a first inspection, it was obvious that many variables had at least some missing data.

A sample from the training dataset is provided below:

Training Dataset

Table continues below
TARGET INDEX FixedAcidity VolatileAcidity CitricAcid ResidualSugar
3 1 3.2 1.16 -0.98 54.2
3 2 4.5 0.16 -0.81 26.1
5 4 7.1 2.64 -0.88 14.8
3 5 5.7 0.385 0.04 18.8
4 6 8 0.33 -1.26 9.4
0 7 11.3 0.32 0.59 2.2
Table continues below
Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
-0.567 NA 268 0.9928 3.33
-0.425 15 -327 1.028 3.38
0.037 214 142 0.9952 3.12
-0.425 22 115 0.9964 2.24
NA -167 108 0.9946 3.12
0.556 -37 15 0.9994 3.2
Sulphates Alcohol LabelAppeal AcidIndex STARS
-0.59 9.9 0 8 2
0.7 NA -1 7 3
0.48 22 -1 8 3
1.83 6.2 -1 6 1
1.77 13.7 0 9 2
1.29 15.4 0 11 NA

Check Data types and Missing values

The datasets both have missing values. There are 3 categorical variables (LabelAppeal,AcidIndex,STARS), 11 continuous variables and the target variable is categorical.

Missing Values and Data Type Check

Missing Values and Data Type Check

Missing Values and Data Type Check

Missing Values and Data Type Check

  Non_NAs NAs NA_Percent
TARGET 12795 0 0
INDEX 12795 0 0
FixedAcidity 12795 0 0
VolatileAcidity 12795 0 0
CitricAcid 12795 0 0
ResidualSugar 12179 616 0.04814
Chlorides 12157 638 0.04986
FreeSulfurDioxide 12148 647 0.05057
TotalSulfurDioxide 12113 682 0.0533
Density 12795 0 0
pH 12400 395 0.03087
Sulphates 11585 1210 0.09457
Alcohol 12142 653 0.05104
LabelAppeal 12795 0 0
AcidIndex 12795 0 0
STARS 9436 3359 0.2625

Data Statistics Summary

A binary logistic regression model is built using the training set, therefore the training set is used for the following data exploration.

The data types in the raw dataset are all ‘doubles’, however the counter INDEX and the response variable target are categorical.

The statistics of all variables are listed below:

Table continues below
TARGET FixedAcidity VolatileAcidity CitricAcid
Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
NA NA NA NA
Table continues below
ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
NA’s :616 NA’s :638 NA’s :647 NA’s :682
Table continues below
Density pH Sulphates Alcohol
Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
NA NA’s :395 NA’s :1210 NA’s :653
LabelAppeal AcidIndex STARS
Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median : 0.000000 Median : 8.000 Median :2.000
Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. : 2.000000 Max. :17.000 Max. :4.000
NA NA NA’s :3359

The statistics of TARGET Variable.

TARGET: Number of Cases Purchased as Actual

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
##    0.00    2.00    3.00    3.03    4.00    8.00    1.93   -0.33   -0.88

FURTHER DATA EXPLORATION

Attributes

FixedAcidity: This variable tells us about the FixedAcidity of wine.

VolatileAcidity: This variable tells us about the Volatile Acidity content of Wine.

CitricAcid: This variable tells us about the Citric Acid Content of wine.

ResidualSugar: This variable tells us about the Residual Sugar of wine.

Chlorides: This variable tells us about the Chloride content of wine.

FreeSulfurDioxide : This variable tells us about the Sulfur Dioxide content of wine.

TotalSulfurDioxide : This variable tells us about the Total Sulfur Dioxide of Wine.

Density: This variable tells us about the Density of wine.

Sulphates: This variable tells us about the Sulphates content of wine.

Alcohol: This variable tells us about the Alcohol content.

LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design.

AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average.

STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales.

Outliers

Boxplot: Scaled Training Set

Boxplot: Scaled Training Set

The box plot shows that outliers exist in variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal andAcidIndex.

Univariate Analysis

Response Variable

Upon examining the target variable we immediately see a normal distribution save for the large number of unsold cases.

Correlation Plot

The correlation plot below shows how variables in the dataset are related to each other.

Here we see relatively little in strong correlations with almost all of the chemistry having minimal impact. Interestingly there is a some relation between star rating and label appeal. As one would expect there is a relationship with Acid index and fixed acidity.

Density Plot

Based on the below plots we can observe that AcidIndex is right skewed; AcidIndex, STARS, LabelAppeal and TARGET have multi-modal distribution (as expected because they are categorical). While most others seem to be normally distributed.

Summarized Data Dictionary

As a summary of the data exploration process, a data dictionary is presented below:

Variable Missing_Value Mean Median Max Min SD Correlation_vs_Response
TARGET No NA NA NA NA NA 1.00
INDEX No NA NA NA NA NA 0.00
FixedAcidity No 7.08 6.90 34.40 -18.10 6.32 -0.05
VolatileAcidity No 0.32 0.28 3.68 -2.79 0.78 -0.09
CitricAcid No 0.31 0.31 3.86 -3.24 0.86 0.01
ResidualSugar No NA NA NA NA NA NA
Chlorides No NA NA NA NA NA NA
FreeSulfurDioxide No NA NA NA NA NA NA
TotalSulfurDioxide No NA NA NA NA NA NA
Density No 0.99 0.99 1.10 0.89 0.03 -0.04
pH No NA NA NA NA NA NA
Sulphates No NA NA NA NA NA NA
Alcohol No NA NA NA NA NA NA
LabelAppeal No -0.01 0.00 2.00 -2.00 0.89 0.36
AcidIndex No 7.77 8.00 17.00 4.00 1.32 -0.25
STARS No NA NA NA NA NA NA

DATA PREPARATION

In the data preparation we will split data into training and test dataset.

MICE package (Multivariate Imputation by Chained Equations)implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. helps in inspecting, imputing, diagnose, analyze, pool the result, and generate simulated incomplete data

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

‘AcidIndex’ and ‘TARGET’ have low correlation between them. We will apply a log transformation to it even if it doesn’t seem likely to provide a large model improvement.

BUILD MODELS

Model I: Poisson Model

Model 1: Poisson Model without imputations

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.608 0.2796 5.75 8.902e-09
FixedAcidity 0.0006705 0.001177 0.5695 0.569
VolatileAcidity -0.0275 0.009283 -2.963 0.00305
CitricAcid -0.003835 0.008519 -0.4502 0.6526
ResidualSugar 1.828e-05 0.0002152 0.08493 0.9323
Chlorides -0.03764 0.02314 -1.627 0.1038
FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.2463
TotalSulfurDioxide 2.23e-05 3.177e-05 0.7019 0.4827
Density -0.4025 0.2749 -1.464 0.1433
pH 0.0002307 0.01085 0.02127 0.983
Sulphates -0.005984 0.007973 -0.7505 0.4529
Alcohol 0.003262 0.002004 1.628 0.1036
LabelAppeal 0.173 0.008858 19.53 6.135e-85
AcidIndex -0.04967 0.006666 -7.451 9.281e-14
STARS 0.1929 0.008328 23.16 1.146e-118

(Dispersion parameter for poisson family taken to be 1 )

Null deviance: 4721 on 5143 degrees of freedom
Residual deviance: 3243 on 5129 degrees of freedom

Stars and Label appeal seem to have a strong positive impact on sales, which is to be expected (and will be seen in all models). On the acid front the acidindex value seems a better predictor than the acid components. ### Model 2: Poisson Model without imputations and only significant variables

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.251 0.05472 22.87 9.626e-116
VolatileAcidity -0.02758 0.009278 -2.973 0.002953
LabelAppeal 0.1732 0.008853 19.56 3.247e-85
AcidIndex -0.05062 0.006553 -7.724 1.129e-14
STARS 0.1942 0.008292 23.42 2.601e-121

(Dispersion parameter for poisson family taken to be 1 )

Null deviance: 4721 on 5143 degrees of freedom
Residual deviance: 3253 on 5139 degrees of freedom

Reducing the variables in the model doesn’t have a major impact on the model, and will probably improve its performance in actual fact.

Model 3: Poisson Model with imputations

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.35 0.228 10.31 6.517e-25
FixedAcidity 0.0002238 0.0009202 0.2432 0.8079
VolatileAcidity -0.04578 0.007286 -6.282 3.334e-10
CitricAcid 0.00677 0.006566 1.031 0.3025
ResidualSugar 0.0001586 0.0001674 0.9473 0.3435
Chlorides -0.06268 0.01794 -3.494 0.0004761
FreeSulfurDioxide 0.0001249 3.801e-05 3.285 0.001019
TotalSulfurDioxide 8.362e-05 2.457e-05 3.403 0.0006663
Density -0.3887 0.2145 -1.813 0.06989
pH -0.0199 0.008406 -2.367 0.01792
Sulphates -0.01035 0.006173 -1.677 0.0936
Alcohol 0.002527 0.001557 1.623 0.1045
LabelAppeal 0.1429 0.006779 21.08 1.105e-98
AcidIndex -0.7543 0.04005 -18.83 4.078e-79
STARS 0.3432 0.006241 55 0

(Dispersion parameter for poisson family taken to be 1 )

Null deviance: 18291 on 10236 degrees of freedom
Residual deviance: 12767 on 10222 degrees of freedom

With imputations a number of variables become significant, including acidity, and Sulfur/sulphate counts. This makes some intuitive sense as those would affect the taste of the wine. Interestingly, the coefficient is positive on the sulfur dioxide. One would rarely expect sulfur dioxide (the smell of burnt matches) to be an improvement for wine.

Model 4: Poisson Model with imputations and only significant variables

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.001 0.0885 22.61 3.784e-113
VolatileAcidity -0.04605 0.007285 -6.321 2.591e-10
Chlorides -0.06452 0.01793 -3.599 0.0003195
FreeSulfurDioxide 0.0001233 3.8e-05 3.245 0.001173
TotalSulfurDioxide 8.234e-05 2.454e-05 3.355 0.0007935
pH -0.02022 0.008403 -2.406 0.01613
Sulphates -0.01025 0.006167 -1.661 0.09662
LabelAppeal 0.1429 0.006779 21.08 1.235e-98
AcidIndex -0.7573 0.03941 -19.22 2.61e-82
STARS 0.3442 0.006224 55.3 0

(Dispersion parameter for poisson family taken to be 1 )

Null deviance: 18291 on 10236 degrees of freedom
Residual deviance: 12775 on 10227 degrees of freedom

Using just the significant variables we do not get much of an improvement in AIC, but the model seems to lose little. ## Model II: Negative Binomial Model

Model 5 : Negative Binomial Model without imputations

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train1, init.theta = 138898.9107, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2127  -0.2757   0.0647   0.3766   1.6981  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.608e+00  2.796e-01   5.750 8.91e-09 ***
## FixedAcidity        6.705e-04  1.177e-03   0.570  0.56900    
## VolatileAcidity    -2.750e-02  9.283e-03  -2.963  0.00305 ** 
## CitricAcid         -3.835e-03  8.519e-03  -0.450  0.65259    
## ResidualSugar       1.828e-05  2.152e-04   0.085  0.93231    
## Chlorides          -3.764e-02  2.314e-02  -1.627  0.10378    
## FreeSulfurDioxide   5.671e-05  4.892e-05   1.159  0.24630    
## TotalSulfurDioxide  2.230e-05  3.177e-05   0.702  0.48275    
## Density            -4.025e-01  2.750e-01  -1.464  0.14326    
## pH                  2.307e-04  1.085e-02   0.021  0.98303    
## Sulphates          -5.984e-03  7.973e-03  -0.751  0.45293    
## Alcohol             3.262e-03  2.004e-03   1.628  0.10360    
## LabelAppeal         1.730e-01  8.858e-03  19.529  < 2e-16 ***
## AcidIndex          -4.967e-02  6.666e-03  -7.451 9.28e-14 ***
## STARS               1.929e-01  8.328e-03  23.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138898.9) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3242.7  on 5129  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18547
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138899 
##           Std. Err.:  259921 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18515.07

AIC is similar to Poisson without imputations, as are some of the coefficients. ### Model 6 : Negative Binomial Model without imputations and only significant variables

## 
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
##     Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density - 
##     pH - Sulphates - Alcohol, data = wine_train1, init.theta = 138402.5261, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1898  -0.2777   0.0622   0.3764   1.6086  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.251443   0.054725  22.868  < 2e-16 ***
## VolatileAcidity -0.027581   0.009279  -2.973  0.00295 ** 
## LabelAppeal      0.173177   0.008853  19.562  < 2e-16 ***
## AcidIndex       -0.050616   0.006553  -7.724 1.13e-14 ***
## STARS            0.194209   0.008292  23.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(138402.5) family taken to be 1)
## 
##     Null deviance: 4720.4  on 5143  degrees of freedom
## Residual deviance: 3253.0  on 5139  degrees of freedom
##   (5093 observations deleted due to missingness)
## AIC: 18537
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  138403 
##           Std. Err.:  258834 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18525.37

Removing non significant variables doesn’t seem to affect the model’s accuracy much. ### Model 7 : Negative Binomial Model with imputations

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train2, init.theta = 49067.84702, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9896  -0.6852   0.1295   0.6351   2.4377  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.350e+00  2.280e-01  10.307  < 2e-16 ***
## FixedAcidity        2.238e-04  9.202e-04   0.243 0.807848    
## VolatileAcidity    -4.578e-02  7.287e-03  -6.282 3.34e-10 ***
## CitricAcid          6.770e-03  6.566e-03   1.031 0.302510    
## ResidualSugar       1.586e-04  1.674e-04   0.947 0.343463    
## Chlorides          -6.268e-02  1.794e-02  -3.494 0.000476 ***
## FreeSulfurDioxide   1.249e-04  3.801e-05   3.285 0.001020 ** 
## TotalSulfurDioxide  8.362e-05  2.457e-05   3.403 0.000666 ***
## Density            -3.887e-01  2.145e-01  -1.813 0.069902 .  
## pH                 -1.990e-02  8.406e-03  -2.367 0.017923 *  
## Sulphates          -1.035e-02  6.173e-03  -1.677 0.093603 .  
## Alcohol             2.527e-03  1.557e-03   1.623 0.104549    
## LabelAppeal         1.429e-01  6.779e-03  21.084  < 2e-16 ***
## AcidIndex          -7.543e-01  4.005e-02 -18.832  < 2e-16 ***
## STARS               3.433e-01  6.241e-03  54.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(49067.85) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12766  on 10222  degrees of freedom
## AIC: 38357
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  49068 
##           Std. Err.:  63120 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38324.77

With imputations we find surprisingly similar results to a the Poisson model with imputations. ### Model 8 : Negative Binomial Model with imputations and only significant variables

  Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.001 0.0885 22.61 3.826e-113
VolatileAcidity -0.04605 0.007285 -6.321 2.592e-10
Chlorides -0.06452 0.01793 -3.599 0.0003195
FreeSulfurDioxide 0.0001233 3.8e-05 3.245 0.001173
TotalSulfurDioxide 8.234e-05 2.454e-05 3.355 0.0007937
pH -0.02022 0.008403 -2.406 0.01613
Sulphates -0.01025 0.006167 -1.661 0.09662
LabelAppeal 0.1429 0.006779 21.08 1.257e-98
AcidIndex -0.7573 0.03941 -19.22 2.622e-82
STARS 0.3442 0.006224 55.3 0

Quitting from lines 354-357 (Wine_Sales.Rmd) Error in data.frame(Observations = length(x\(residuals), `Residual Std. Error` = x\)sigma, : arguments imply differing number of rows: 1, 0 Calls: … pandoc.table -> cat -> pandoc.table.return -> data.frame In addition: There were 45 warnings (use warnings() to see them)

Model III: Linear Model

Model 9 : Linear Model with imputations

Use imputed training data on Linear regression model

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.001 0.5535 10.84 3.054e-27
FixedAcidity 0.001363 0.002241 0.6083 0.543
VolatileAcidity -0.1326 0.01781 -7.445 1.049e-13
CitricAcid 0.0195 0.01621 1.203 0.2291
ResidualSugar 0.0004527 0.0004108 1.102 0.2704
Chlorides -0.1916 0.04383 -4.371 1.248e-05
FreeSulfurDioxide 0.0003229 9.354e-05 3.452 0.0005596
TotalSulfurDioxide 0.0002237 5.984e-05 3.738 0.0001868
Density -0.9958 0.5233 -1.903 0.05705
pH -0.04893 0.0206 -2.376 0.01754
Sulphates -0.02701 0.0151 -1.789 0.07369
Alcohol 0.01092 0.003777 2.892 0.003831
LabelAppeal 0.4385 0.01633 26.86 1.213e-153
AcidIndex -2.02 0.09201 -21.95 1.996e-104
STARS 1.175 0.01655 71.03 0
Fitting linear model: TARGET ~ .
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
10237 1.409 0.4661 0.4654

A plain linear regression model is a surprisingly decent performer when coupled with imputations. ### Model 10 : Linear Model with imputations and only significant variables.

We got FixedAcidity, CitricAcid and ResidualSugar as significant variables and use same variables on Linear regression model with imputed training data.

  Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.986 0.5534 10.82 3.965e-27
VolatileAcidity -0.1333 0.01781 -7.485 7.738e-14
Chlorides -0.1928 0.04382 -4.4 1.095e-05
FreeSulfurDioxide 0.0003249 9.353e-05 3.474 0.0005151
TotalSulfurDioxide 0.0002256 5.981e-05 3.771 0.0001632
Density -0.9991 0.5232 -1.909 0.05623
pH -0.04897 0.02059 -2.378 0.01743
Sulphates -0.02735 0.01509 -1.813 0.06993
Alcohol 0.01087 0.003775 2.879 0.003998
LabelAppeal 0.4384 0.01633 26.85 1.518e-153
AcidIndex -2.002 0.09039 -22.15 3.248e-106
STARS 1.176 0.01654 71.08 0
Fitting linear model: TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
10237 1.409 0.466 0.4654

Again, removing less significant variables has little impact on the model, and is recommended to reduce overfitting. ## Model 11 : Ordinal Logistic Regression

Since Ordinal logistic regression uses ordered factors we might find this as one of the top model based on our use cases.

Call: polr(formula = TARGET ~ ., data = polrDF, Hess = TRUE)

Coeficients
  Value Std. Error t value
FixedAcidity 0.002554 0.002909 0.878
VolatileAcidity -0.1636 0.02328 -7.026
CitricAcid 0.01951 0.02111 0.9242
ResidualSugar 0.0003666 0.000531 0.6904
Chlorides -0.2396 0.05681 -4.217
FreeSulfurDioxide 0.0004071 0.0001216 3.347
TotalSulfurDioxide 0.0002398 7.814e-05 3.068
Density -1.514 0.1493 -10.14
pH -0.0386 0.02685 -1.438
Sulphates -0.02078 0.01982 -1.048
Alcohol 0.02535 0.004883 5.19
LabelAppeal 0.833 0.02381 34.98
AcidIndex -2.598 0.1251 -20.77
STARS 1.473 0.02563 57.48
Intercepts
  Value Std. Error t value
0|1 -6.027 0.136 -44.31
1|2 -5.89 0.1359 -43.34
2|3 -5.283 0.1355 -38.99
3|4 -3.912 0.1354 -28.89
4|5 -2.064 0.1377 -14.99
5|6 -0.09522 0.1442 -0.6605
6|7 2.103 0.1679 12.53
7|8 4.44 0.3036 14.62

Residual Deviance: 29978.30

AIC: 30022.30

Model 12 : Zero inflation

Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. In Data exploration we saw many zero values, considering this we might get this as one of our best model.

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train2, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17822 -0.50776  0.06056  0.48757  2.09505 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.793e+00  2.380e-01   7.532 4.99e-14 ***
## FixedAcidity        5.166e-04  9.446e-04   0.547  0.58447    
## VolatileAcidity    -2.095e-02  7.556e-03  -2.772  0.00556 ** 
## CitricAcid          7.361e-04  6.706e-03   0.110  0.91260    
## ResidualSugar      -3.494e-05  1.721e-04  -0.203  0.83913    
## Chlorides          -2.832e-02  1.851e-02  -1.530  0.12608    
## FreeSulfurDioxide   3.380e-05  3.849e-05   0.878  0.37983    
## TotalSulfurDioxide  4.004e-07  2.457e-05   0.016  0.98700    
## Density            -2.918e-01  2.222e-01  -1.313  0.18915    
## pH                  8.443e-04  8.732e-03   0.097  0.92297    
## Sulphates          -2.372e-03  6.392e-03  -0.371  0.71058    
## Alcohol             6.463e-03  1.587e-03   4.073 4.63e-05 ***
## LabelAppeal         2.280e-01  7.088e-03  32.169  < 2e-16 ***
## AcidIndex          -2.614e-01  4.464e-02  -5.856 4.73e-09 ***
## STARS               1.200e-01  6.995e-03  17.156  < 2e-16 ***
## Log(theta)          1.749e+01  4.449e+00   3.932 8.44e-05 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.7093     0.1347   20.11   <2e-16 ***
## STARS        -3.0279     0.1235  -24.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 39544917.0696 
## Number of iterations in BFGS optimization: 23 
## Log-likelihood: -1.715e+04 on 18 Df

Interestingly Sulfur Dioxides and Sulphates are not significant in this model, while Alcohol is. # SELECT MODELS

Compare Models based on MSE/AIC

MSE AIC
Model1 6.929787 18544.98
Model2 6.926722 18535.29
Model3 6.842661 38354.59
Model4 6.843489 38352.55
Model5 6.929788 18547.07
Model6 6.926723 18537.37
Model7 6.842657 38356.77
Model8 6.843485 38354.73
Model9 1.981498 NA
Model10 1.982077 NA
Model11 NA 30060.23
Model12 1.942825 NA

Similar MSE values are observed for Poisson and negative binomial models.

We can compare our zero inflated model using a Vuong test to a normal Poisson model.

vuong(model4,model12)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -35.18951 model2 > model1 < 2.22e-16
## AIC-corrected         -35.06714 model2 > model1 < 2.22e-16
## BIC-corrected         -34.62454 model2 > model1 < 2.22e-16

Model2, or our Zero Inflated model, would seem to be better than our non inflated model.

Compare Models by Loss

Use test data and check the output

In order to validate we will use squared loss and squared difference to select model (MSE) from predicting on selected training datasets. Smaller numbers would indicate a truer fit.

##            Loss:
## Model1  5.479565
## Model2  5.464097
## Model3  6.844632
## Model4  6.845511
## Model5  5.479559
## Model6  5.464091
## Model7  6.844628
## Model8  6.845507
## Model9  2.038659
## Model10 2.037739
## Model11 3.667318
## Model12 2.003325

Based on above results these are our observation

-> Linear model performed well. -> Poisson regression model and Negative binomial model did not performed as expected. -> We expected Ordinal logistic regression to be a better model but it did not performed well.

At this point we are concentrated more on square loss which tells us the accuracy of our model

Zero Poisson Inflation seems to be the most accurate model with least loss score, and had good results from a Vuong test.

If we consider all the factors like least loss, good MSE and AIC score we found ‘Zero Inflated Poisson’ as our best one.

Prediction on Evaluation Data

Here we use MICE just like how we used earlier for imputing and log transformation for AcidIndex.

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Display the Predicted values

– Column specification —————————————————————————- cols( TARGET = col_double(), FixedAcidity = col_double(), VolatileAcidity = col_double(), CitricAcid = col_double(), ResidualSugar = col_double(), Chlorides = col_double(), FreeSulfurDioxide = col_double(), TotalSulfurDioxide = col_double(), Density = col_double(), pH = col_double(), Sulphates = col_double(), Alcohol = col_double(), LabelAppeal = col_double(), AcidIndex = col_double(), STARS = col_double() )

Table continues below
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
1.623 5.4 -0.86 0.27 -10.7
3.702 12.4 0.385 -0.76 -19.7
1.719 7.2 1.75 0.17 -33
1.484 6.2 0.1 1.8 1
1.65 11.4 0.21 0.28 1.2
5.698 17.6 0.04 -1.15 1.4
Table continues below
Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
0.092 23 398 0.9853 5.02
1.169 -37 68 0.9905 3.37
0.065 9 76 1.046 4.61
-0.179 104 89 0.9888 3.2
0.038 70 53 1.029 2.54
0.535 -250 140 0.9503 3.06
Sulphates Alcohol LabelAppeal AcidIndex STARS
0.64 12.3 -1 1.792 1
1.09 16 0 1.792 2
0.68 8.55 0 2.079 1
2.11 12.3 -1 2.079 1
-0.07 4.8 0 2.303 1
-0.02 11.4 1 2.079 4

For TARGET: Number of Cases Purchased as Predicted

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
##    1.01    1.91    3.32    3.25    4.20    8.28    1.38    0.47   -0.36