Wine Sales
Whining about Wine Sales
Overview
In this report we attempt to build a model for wine sales as would be predicted by a number of factors about the wine and its packaging. In the end a zero inflated poisson model seems to be the best model. Particularly important to sales variables seem to consist of Label Appeal, Acid Index, Stars and to a lesser extent Alcohol.
Data Exploration
Data was provided already split into a training dataset of 12,795 observations, and an evaluation dataset of 16,129 observations. There was one response variable, Number of Cases purchased, and 14 predictors.
Below is a short description of the variables of interest in the data set:
| Variable Name | Description | Import |
|---|---|---|
| Index | Identification | Not used |
| Target | Number of cases purchased | Response variable |
| ————— | ————– | ——— |
| AcidIndex | Proprietary total acidity measure | |
| Alcohol | Alcohol content | |
| Chlorides | Chloride content | |
| Density | Density of Wine | |
| FixedAcidity | Fixed Acidity | |
| FreeSulfurDioxide | Sulfur Dioxide Content | |
| Label Appeal | Marketing Score indicating appeal of label | Expected positive |
| ResidualSugar | Residual Sugar | |
| STARS | Wine rating by experts | Positive |
| Sulphates | Sulfate content | |
| TotalSulfurDioxide | Total Sulfur Dioxide | |
| VolatileAcidity | Volatile Acid content | |
| pH | pH of wine |
On a first inspection, it was obvious that many variables had at least some missing data.
A sample from the training dataset is provided below:
Training Dataset
| TARGET | INDEX | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar |
|---|---|---|---|---|---|
| 3 | 1 | 3.2 | 1.16 | -0.98 | 54.2 |
| 3 | 2 | 4.5 | 0.16 | -0.81 | 26.1 |
| 5 | 4 | 7.1 | 2.64 | -0.88 | 14.8 |
| 3 | 5 | 5.7 | 0.385 | 0.04 | 18.8 |
| 4 | 6 | 8 | 0.33 | -1.26 | 9.4 |
| 0 | 7 | 11.3 | 0.32 | 0.59 | 2.2 |
| Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH |
|---|---|---|---|---|
| -0.567 | NA | 268 | 0.9928 | 3.33 |
| -0.425 | 15 | -327 | 1.028 | 3.38 |
| 0.037 | 214 | 142 | 0.9952 | 3.12 |
| -0.425 | 22 | 115 | 0.9964 | 2.24 |
| NA | -167 | 108 | 0.9946 | 3.12 |
| 0.556 | -37 | 15 | 0.9994 | 3.2 |
| Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
|---|---|---|---|---|
| -0.59 | 9.9 | 0 | 8 | 2 |
| 0.7 | NA | -1 | 7 | 3 |
| 0.48 | 22 | -1 | 8 | 3 |
| 1.83 | 6.2 | -1 | 6 | 1 |
| 1.77 | 13.7 | 0 | 9 | 2 |
| 1.29 | 15.4 | 0 | 11 | NA |
Check Data types and Missing values
The datasets both have missing values. There are 3 categorical variables (LabelAppeal,AcidIndex,STARS), 11 continuous variables and the target variable is categorical.
Missing Values and Data Type Check
Missing Values and Data Type Check
| Non_NAs | NAs | NA_Percent | |
|---|---|---|---|
| TARGET | 12795 | 0 | 0 |
| INDEX | 12795 | 0 | 0 |
| FixedAcidity | 12795 | 0 | 0 |
| VolatileAcidity | 12795 | 0 | 0 |
| CitricAcid | 12795 | 0 | 0 |
| ResidualSugar | 12179 | 616 | 0.04814 |
| Chlorides | 12157 | 638 | 0.04986 |
| FreeSulfurDioxide | 12148 | 647 | 0.05057 |
| TotalSulfurDioxide | 12113 | 682 | 0.0533 |
| Density | 12795 | 0 | 0 |
| pH | 12400 | 395 | 0.03087 |
| Sulphates | 11585 | 1210 | 0.09457 |
| Alcohol | 12142 | 653 | 0.05104 |
| LabelAppeal | 12795 | 0 | 0 |
| AcidIndex | 12795 | 0 | 0 |
| STARS | 9436 | 3359 | 0.2625 |
Data Statistics Summary
A binary logistic regression model is built using the training set, therefore the training set is used for the following data exploration.
The data types in the raw dataset are all ‘doubles’, however the counter INDEX and the response variable target are categorical.
The statistics of all variables are listed below:
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid |
|---|---|---|---|
| Min. :0.000 | Min. :-18.100 | Min. :-2.7900 | Min. :-3.2400 |
| 1st Qu.:2.000 | 1st Qu.: 5.200 | 1st Qu.: 0.1300 | 1st Qu.: 0.0300 |
| Median :3.000 | Median : 6.900 | Median : 0.2800 | Median : 0.3100 |
| Mean :3.029 | Mean : 7.076 | Mean : 0.3241 | Mean : 0.3084 |
| 3rd Qu.:4.000 | 3rd Qu.: 9.500 | 3rd Qu.: 0.6400 | 3rd Qu.: 0.5800 |
| Max. :8.000 | Max. : 34.400 | Max. : 3.6800 | Max. : 3.8600 |
| NA | NA | NA | NA |
| ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide |
|---|---|---|---|
| Min. :-127.800 | Min. :-1.1710 | Min. :-555.00 | Min. :-823.0 |
| 1st Qu.: -2.000 | 1st Qu.:-0.0310 | 1st Qu.: 0.00 | 1st Qu.: 27.0 |
| Median : 3.900 | Median : 0.0460 | Median : 30.00 | Median : 123.0 |
| Mean : 5.419 | Mean : 0.0548 | Mean : 30.85 | Mean : 120.7 |
| 3rd Qu.: 15.900 | 3rd Qu.: 0.1530 | 3rd Qu.: 70.00 | 3rd Qu.: 208.0 |
| Max. : 141.150 | Max. : 1.3510 | Max. : 623.00 | Max. :1057.0 |
| NA’s :616 | NA’s :638 | NA’s :647 | NA’s :682 |
| Density | pH | Sulphates | Alcohol |
|---|---|---|---|
| Min. :0.8881 | Min. :0.480 | Min. :-3.1300 | Min. :-4.70 |
| 1st Qu.:0.9877 | 1st Qu.:2.960 | 1st Qu.: 0.2800 | 1st Qu.: 9.00 |
| Median :0.9945 | Median :3.200 | Median : 0.5000 | Median :10.40 |
| Mean :0.9942 | Mean :3.208 | Mean : 0.5271 | Mean :10.49 |
| 3rd Qu.:1.0005 | 3rd Qu.:3.470 | 3rd Qu.: 0.8600 | 3rd Qu.:12.40 |
| Max. :1.0992 | Max. :6.130 | Max. : 4.2400 | Max. :26.50 |
| NA | NA’s :395 | NA’s :1210 | NA’s :653 |
| LabelAppeal | AcidIndex | STARS |
|---|---|---|
| Min. :-2.000000 | Min. : 4.000 | Min. :1.000 |
| 1st Qu.:-1.000000 | 1st Qu.: 7.000 | 1st Qu.:1.000 |
| Median : 0.000000 | Median : 8.000 | Median :2.000 |
| Mean :-0.009066 | Mean : 7.773 | Mean :2.042 |
| 3rd Qu.: 1.000000 | 3rd Qu.: 8.000 | 3rd Qu.:3.000 |
| Max. : 2.000000 | Max. :17.000 | Max. :4.000 |
| NA | NA | NA’s :3359 |
The statistics of TARGET Variable.
TARGET: Number of Cases Purchased as Actual
## Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
## 0.00 2.00 3.00 3.03 4.00 8.00 1.93 -0.33 -0.88
FURTHER DATA EXPLORATION
Attributes
FixedAcidity: This variable tells us about the FixedAcidity of wine.
VolatileAcidity: This variable tells us about the Volatile Acidity content of Wine.
CitricAcid: This variable tells us about the Citric Acid Content of wine.
ResidualSugar: This variable tells us about the Residual Sugar of wine.
Chlorides: This variable tells us about the Chloride content of wine.
FreeSulfurDioxide : This variable tells us about the Sulfur Dioxide content of wine.
TotalSulfurDioxide : This variable tells us about the Total Sulfur Dioxide of Wine.
Density: This variable tells us about the Density of wine.
Sulphates: This variable tells us about the Sulphates content of wine.
Alcohol: This variable tells us about the Alcohol content.
LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design.
AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average.
STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales.
Outliers
Boxplot: Scaled Training Set
The box plot shows that outliers exist in variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal andAcidIndex.
Univariate Analysis
Response Variable
Upon examining the target variable we immediately see a normal distribution save for the large number of unsold cases.
Correlation Plot
The correlation plot below shows how variables in the dataset are related to each other.
Here we see relatively little in strong correlations with almost all of the chemistry having minimal impact. Interestingly there is a some relation between star rating and label appeal. As one would expect there is a relationship with Acid index and fixed acidity.
Density Plot
Based on the below plots we can observe that AcidIndex is right skewed; AcidIndex, STARS, LabelAppeal and TARGET have multi-modal distribution (as expected because they are categorical). While most others seem to be normally distributed.
Summarized Data Dictionary
As a summary of the data exploration process, a data dictionary is presented below:
| Variable | Missing_Value | Mean | Median | Max | Min | SD | Correlation_vs_Response |
|---|---|---|---|---|---|---|---|
| TARGET | No | NA | NA | NA | NA | NA | 1.00 |
| INDEX | No | NA | NA | NA | NA | NA | 0.00 |
| FixedAcidity | No | 7.08 | 6.90 | 34.40 | -18.10 | 6.32 | -0.05 |
| VolatileAcidity | No | 0.32 | 0.28 | 3.68 | -2.79 | 0.78 | -0.09 |
| CitricAcid | No | 0.31 | 0.31 | 3.86 | -3.24 | 0.86 | 0.01 |
| ResidualSugar | No | NA | NA | NA | NA | NA | NA |
| Chlorides | No | NA | NA | NA | NA | NA | NA |
| FreeSulfurDioxide | No | NA | NA | NA | NA | NA | NA |
| TotalSulfurDioxide | No | NA | NA | NA | NA | NA | NA |
| Density | No | 0.99 | 0.99 | 1.10 | 0.89 | 0.03 | -0.04 |
| pH | No | NA | NA | NA | NA | NA | NA |
| Sulphates | No | NA | NA | NA | NA | NA | NA |
| Alcohol | No | NA | NA | NA | NA | NA | NA |
| LabelAppeal | No | -0.01 | 0.00 | 2.00 | -2.00 | 0.89 | 0.36 |
| AcidIndex | No | 7.77 | 8.00 | 17.00 | 4.00 | 1.32 | -0.25 |
| STARS | No | NA | NA | NA | NA | NA | NA |
DATA PREPARATION
In the data preparation we will split data into training and test dataset.
MICE package (Multivariate Imputation by Chained Equations)implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. helps in inspecting, imputing, diagnose, analyze, pool the result, and generate simulated incomplete data
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
‘AcidIndex’ and ‘TARGET’ have low correlation between them. We will apply a log transformation to it even if it doesn’t seem likely to provide a large model improvement.
BUILD MODELS
Model I: Poisson Model
Model 1: Poisson Model without imputations
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 1.608 | 0.2796 | 5.75 | 8.902e-09 |
| FixedAcidity | 0.0006705 | 0.001177 | 0.5695 | 0.569 |
| VolatileAcidity | -0.0275 | 0.009283 | -2.963 | 0.00305 |
| CitricAcid | -0.003835 | 0.008519 | -0.4502 | 0.6526 |
| ResidualSugar | 1.828e-05 | 0.0002152 | 0.08493 | 0.9323 |
| Chlorides | -0.03764 | 0.02314 | -1.627 | 0.1038 |
| FreeSulfurDioxide | 5.671e-05 | 4.892e-05 | 1.159 | 0.2463 |
| TotalSulfurDioxide | 2.23e-05 | 3.177e-05 | 0.7019 | 0.4827 |
| Density | -0.4025 | 0.2749 | -1.464 | 0.1433 |
| pH | 0.0002307 | 0.01085 | 0.02127 | 0.983 |
| Sulphates | -0.005984 | 0.007973 | -0.7505 | 0.4529 |
| Alcohol | 0.003262 | 0.002004 | 1.628 | 0.1036 |
| LabelAppeal | 0.173 | 0.008858 | 19.53 | 6.135e-85 |
| AcidIndex | -0.04967 | 0.006666 | -7.451 | 9.281e-14 |
| STARS | 0.1929 | 0.008328 | 23.16 | 1.146e-118 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 4721 on 5143 degrees of freedom |
| Residual deviance: | 3243 on 5129 degrees of freedom |
Stars and Label appeal seem to have a strong positive impact on sales, which is to be expected (and will be seen in all models). On the acid front the acidindex value seems a better predictor than the acid components. ### Model 2: Poisson Model without imputations and only significant variables
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 1.251 | 0.05472 | 22.87 | 9.626e-116 |
| VolatileAcidity | -0.02758 | 0.009278 | -2.973 | 0.002953 |
| LabelAppeal | 0.1732 | 0.008853 | 19.56 | 3.247e-85 |
| AcidIndex | -0.05062 | 0.006553 | -7.724 | 1.129e-14 |
| STARS | 0.1942 | 0.008292 | 23.42 | 2.601e-121 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 4721 on 5143 degrees of freedom |
| Residual deviance: | 3253 on 5139 degrees of freedom |
Reducing the variables in the model doesn’t have a major impact on the model, and will probably improve its performance in actual fact.
Model 3: Poisson Model with imputations
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 2.35 | 0.228 | 10.31 | 6.517e-25 |
| FixedAcidity | 0.0002238 | 0.0009202 | 0.2432 | 0.8079 |
| VolatileAcidity | -0.04578 | 0.007286 | -6.282 | 3.334e-10 |
| CitricAcid | 0.00677 | 0.006566 | 1.031 | 0.3025 |
| ResidualSugar | 0.0001586 | 0.0001674 | 0.9473 | 0.3435 |
| Chlorides | -0.06268 | 0.01794 | -3.494 | 0.0004761 |
| FreeSulfurDioxide | 0.0001249 | 3.801e-05 | 3.285 | 0.001019 |
| TotalSulfurDioxide | 8.362e-05 | 2.457e-05 | 3.403 | 0.0006663 |
| Density | -0.3887 | 0.2145 | -1.813 | 0.06989 |
| pH | -0.0199 | 0.008406 | -2.367 | 0.01792 |
| Sulphates | -0.01035 | 0.006173 | -1.677 | 0.0936 |
| Alcohol | 0.002527 | 0.001557 | 1.623 | 0.1045 |
| LabelAppeal | 0.1429 | 0.006779 | 21.08 | 1.105e-98 |
| AcidIndex | -0.7543 | 0.04005 | -18.83 | 4.078e-79 |
| STARS | 0.3432 | 0.006241 | 55 | 0 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 18291 on 10236 degrees of freedom |
| Residual deviance: | 12767 on 10222 degrees of freedom |
With imputations a number of variables become significant, including acidity, and Sulfur/sulphate counts. This makes some intuitive sense as those would affect the taste of the wine. Interestingly, the coefficient is positive on the sulfur dioxide. One would rarely expect sulfur dioxide (the smell of burnt matches) to be an improvement for wine.
Model 4: Poisson Model with imputations and only significant variables
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 2.001 | 0.0885 | 22.61 | 3.784e-113 |
| VolatileAcidity | -0.04605 | 0.007285 | -6.321 | 2.591e-10 |
| Chlorides | -0.06452 | 0.01793 | -3.599 | 0.0003195 |
| FreeSulfurDioxide | 0.0001233 | 3.8e-05 | 3.245 | 0.001173 |
| TotalSulfurDioxide | 8.234e-05 | 2.454e-05 | 3.355 | 0.0007935 |
| pH | -0.02022 | 0.008403 | -2.406 | 0.01613 |
| Sulphates | -0.01025 | 0.006167 | -1.661 | 0.09662 |
| LabelAppeal | 0.1429 | 0.006779 | 21.08 | 1.235e-98 |
| AcidIndex | -0.7573 | 0.03941 | -19.22 | 2.61e-82 |
| STARS | 0.3442 | 0.006224 | 55.3 | 0 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 18291 on 10236 degrees of freedom |
| Residual deviance: | 12775 on 10227 degrees of freedom |
Using just the significant variables we do not get much of an improvement in AIC, but the model seems to lose little. ## Model II: Negative Binomial Model
Model 5 : Negative Binomial Model without imputations
##
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train1, init.theta = 138898.9107,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2127 -0.2757 0.0647 0.3766 1.6981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 2.796e-01 5.750 8.91e-09 ***
## FixedAcidity 6.705e-04 1.177e-03 0.570 0.56900
## VolatileAcidity -2.750e-02 9.283e-03 -2.963 0.00305 **
## CitricAcid -3.835e-03 8.519e-03 -0.450 0.65259
## ResidualSugar 1.828e-05 2.152e-04 0.085 0.93231
## Chlorides -3.764e-02 2.314e-02 -1.627 0.10378
## FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.24630
## TotalSulfurDioxide 2.230e-05 3.177e-05 0.702 0.48275
## Density -4.025e-01 2.750e-01 -1.464 0.14326
## pH 2.307e-04 1.085e-02 0.021 0.98303
## Sulphates -5.984e-03 7.973e-03 -0.751 0.45293
## Alcohol 3.262e-03 2.004e-03 1.628 0.10360
## LabelAppeal 1.730e-01 8.858e-03 19.529 < 2e-16 ***
## AcidIndex -4.967e-02 6.666e-03 -7.451 9.28e-14 ***
## STARS 1.929e-01 8.328e-03 23.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(138898.9) family taken to be 1)
##
## Null deviance: 4720.4 on 5143 degrees of freedom
## Residual deviance: 3242.7 on 5129 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18547
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 138899
## Std. Err.: 259921
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -18515.07
AIC is similar to Poisson without imputations, as are some of the coefficients. ### Model 6 : Negative Binomial Model without imputations and only significant variables
##
## Call:
## glm.nb(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
## Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - Density -
## pH - Sulphates - Alcohol, data = wine_train1, init.theta = 138402.5261,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1898 -0.2777 0.0622 0.3764 1.6086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.251443 0.054725 22.868 < 2e-16 ***
## VolatileAcidity -0.027581 0.009279 -2.973 0.00295 **
## LabelAppeal 0.173177 0.008853 19.562 < 2e-16 ***
## AcidIndex -0.050616 0.006553 -7.724 1.13e-14 ***
## STARS 0.194209 0.008292 23.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(138402.5) family taken to be 1)
##
## Null deviance: 4720.4 on 5143 degrees of freedom
## Residual deviance: 3253.0 on 5139 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18537
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 138403
## Std. Err.: 258834
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -18525.37
Removing non significant variables doesn’t seem to affect the model’s accuracy much. ### Model 7 : Negative Binomial Model with imputations
##
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train2, init.theta = 49067.84702,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9896 -0.6852 0.1295 0.6351 2.4377
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.350e+00 2.280e-01 10.307 < 2e-16 ***
## FixedAcidity 2.238e-04 9.202e-04 0.243 0.807848
## VolatileAcidity -4.578e-02 7.287e-03 -6.282 3.34e-10 ***
## CitricAcid 6.770e-03 6.566e-03 1.031 0.302510
## ResidualSugar 1.586e-04 1.674e-04 0.947 0.343463
## Chlorides -6.268e-02 1.794e-02 -3.494 0.000476 ***
## FreeSulfurDioxide 1.249e-04 3.801e-05 3.285 0.001020 **
## TotalSulfurDioxide 8.362e-05 2.457e-05 3.403 0.000666 ***
## Density -3.887e-01 2.145e-01 -1.813 0.069902 .
## pH -1.990e-02 8.406e-03 -2.367 0.017923 *
## Sulphates -1.035e-02 6.173e-03 -1.677 0.093603 .
## Alcohol 2.527e-03 1.557e-03 1.623 0.104549
## LabelAppeal 1.429e-01 6.779e-03 21.084 < 2e-16 ***
## AcidIndex -7.543e-01 4.005e-02 -18.832 < 2e-16 ***
## STARS 3.433e-01 6.241e-03 54.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(49067.85) family taken to be 1)
##
## Null deviance: 18290 on 10236 degrees of freedom
## Residual deviance: 12766 on 10222 degrees of freedom
## AIC: 38357
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 49068
## Std. Err.: 63120
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38324.77
With imputations we find surprisingly similar results to a the Poisson model with imputations. ### Model 8 : Negative Binomial Model with imputations and only significant variables
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 2.001 | 0.0885 | 22.61 | 3.826e-113 |
| VolatileAcidity | -0.04605 | 0.007285 | -6.321 | 2.592e-10 |
| Chlorides | -0.06452 | 0.01793 | -3.599 | 0.0003195 |
| FreeSulfurDioxide | 0.0001233 | 3.8e-05 | 3.245 | 0.001173 |
| TotalSulfurDioxide | 8.234e-05 | 2.454e-05 | 3.355 | 0.0007937 |
| pH | -0.02022 | 0.008403 | -2.406 | 0.01613 |
| Sulphates | -0.01025 | 0.006167 | -1.661 | 0.09662 |
| LabelAppeal | 0.1429 | 0.006779 | 21.08 | 1.257e-98 |
| AcidIndex | -0.7573 | 0.03941 | -19.22 | 2.622e-82 |
| STARS | 0.3442 | 0.006224 | 55.3 | 0 |
Quitting from lines 354-357 (Wine_Sales.Rmd) Error in data.frame(Observations = length(x\(residuals), `Residual Std. Error` = x\)sigma, : arguments imply differing number of rows: 1, 0 Calls:
Model III: Linear Model
Model 9 : Linear Model with imputations
Use imputed training data on Linear regression model
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 6.001 | 0.5535 | 10.84 | 3.054e-27 |
| FixedAcidity | 0.001363 | 0.002241 | 0.6083 | 0.543 |
| VolatileAcidity | -0.1326 | 0.01781 | -7.445 | 1.049e-13 |
| CitricAcid | 0.0195 | 0.01621 | 1.203 | 0.2291 |
| ResidualSugar | 0.0004527 | 0.0004108 | 1.102 | 0.2704 |
| Chlorides | -0.1916 | 0.04383 | -4.371 | 1.248e-05 |
| FreeSulfurDioxide | 0.0003229 | 9.354e-05 | 3.452 | 0.0005596 |
| TotalSulfurDioxide | 0.0002237 | 5.984e-05 | 3.738 | 0.0001868 |
| Density | -0.9958 | 0.5233 | -1.903 | 0.05705 |
| pH | -0.04893 | 0.0206 | -2.376 | 0.01754 |
| Sulphates | -0.02701 | 0.0151 | -1.789 | 0.07369 |
| Alcohol | 0.01092 | 0.003777 | 2.892 | 0.003831 |
| LabelAppeal | 0.4385 | 0.01633 | 26.86 | 1.213e-153 |
| AcidIndex | -2.02 | 0.09201 | -21.95 | 1.996e-104 |
| STARS | 1.175 | 0.01655 | 71.03 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 10237 | 1.409 | 0.4661 | 0.4654 |
A plain linear regression model is a surprisingly decent performer when coupled with imputations. ### Model 10 : Linear Model with imputations and only significant variables.
We got FixedAcidity, CitricAcid and ResidualSugar as significant variables and use same variables on Linear regression model with imputed training data.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 5.986 | 0.5534 | 10.82 | 3.965e-27 |
| VolatileAcidity | -0.1333 | 0.01781 | -7.485 | 7.738e-14 |
| Chlorides | -0.1928 | 0.04382 | -4.4 | 1.095e-05 |
| FreeSulfurDioxide | 0.0003249 | 9.353e-05 | 3.474 | 0.0005151 |
| TotalSulfurDioxide | 0.0002256 | 5.981e-05 | 3.771 | 0.0001632 |
| Density | -0.9991 | 0.5232 | -1.909 | 0.05623 |
| pH | -0.04897 | 0.02059 | -2.378 | 0.01743 |
| Sulphates | -0.02735 | 0.01509 | -1.813 | 0.06993 |
| Alcohol | 0.01087 | 0.003775 | 2.879 | 0.003998 |
| LabelAppeal | 0.4384 | 0.01633 | 26.85 | 1.518e-153 |
| AcidIndex | -2.002 | 0.09039 | -22.15 | 3.248e-106 |
| STARS | 1.176 | 0.01654 | 71.08 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 10237 | 1.409 | 0.466 | 0.4654 |
Again, removing less significant variables has little impact on the model, and is recommended to reduce overfitting. ## Model 11 : Ordinal Logistic Regression
Since Ordinal logistic regression uses ordered factors we might find this as one of the top model based on our use cases.
Call: polr(formula = TARGET ~ ., data = polrDF, Hess = TRUE)
| Value | Std. Error | t value | |
|---|---|---|---|
| FixedAcidity | 0.002554 | 0.002909 | 0.878 |
| VolatileAcidity | -0.1636 | 0.02328 | -7.026 |
| CitricAcid | 0.01951 | 0.02111 | 0.9242 |
| ResidualSugar | 0.0003666 | 0.000531 | 0.6904 |
| Chlorides | -0.2396 | 0.05681 | -4.217 |
| FreeSulfurDioxide | 0.0004071 | 0.0001216 | 3.347 |
| TotalSulfurDioxide | 0.0002398 | 7.814e-05 | 3.068 |
| Density | -1.514 | 0.1493 | -10.14 |
| pH | -0.0386 | 0.02685 | -1.438 |
| Sulphates | -0.02078 | 0.01982 | -1.048 |
| Alcohol | 0.02535 | 0.004883 | 5.19 |
| LabelAppeal | 0.833 | 0.02381 | 34.98 |
| AcidIndex | -2.598 | 0.1251 | -20.77 |
| STARS | 1.473 | 0.02563 | 57.48 |
| Value | Std. Error | t value | |
|---|---|---|---|
| 0|1 | -6.027 | 0.136 | -44.31 |
| 1|2 | -5.89 | 0.1359 | -43.34 |
| 2|3 | -5.283 | 0.1355 | -38.99 |
| 3|4 | -3.912 | 0.1354 | -28.89 |
| 4|5 | -2.064 | 0.1377 | -14.99 |
| 5|6 | -0.09522 | 0.1442 | -0.6605 |
| 6|7 | 2.103 | 0.1679 | 12.53 |
| 7|8 | 4.44 | 0.3036 | 14.62 |
Residual Deviance: 29978.30
AIC: 30022.30
Model 12 : Zero inflation
Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. In Data exploration we saw many zero values, considering this we might get this as one of our best model.
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train2, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.17822 -0.50776 0.06056 0.48757 2.09505
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.793e+00 2.380e-01 7.532 4.99e-14 ***
## FixedAcidity 5.166e-04 9.446e-04 0.547 0.58447
## VolatileAcidity -2.095e-02 7.556e-03 -2.772 0.00556 **
## CitricAcid 7.361e-04 6.706e-03 0.110 0.91260
## ResidualSugar -3.494e-05 1.721e-04 -0.203 0.83913
## Chlorides -2.832e-02 1.851e-02 -1.530 0.12608
## FreeSulfurDioxide 3.380e-05 3.849e-05 0.878 0.37983
## TotalSulfurDioxide 4.004e-07 2.457e-05 0.016 0.98700
## Density -2.918e-01 2.222e-01 -1.313 0.18915
## pH 8.443e-04 8.732e-03 0.097 0.92297
## Sulphates -2.372e-03 6.392e-03 -0.371 0.71058
## Alcohol 6.463e-03 1.587e-03 4.073 4.63e-05 ***
## LabelAppeal 2.280e-01 7.088e-03 32.169 < 2e-16 ***
## AcidIndex -2.614e-01 4.464e-02 -5.856 4.73e-09 ***
## STARS 1.200e-01 6.995e-03 17.156 < 2e-16 ***
## Log(theta) 1.749e+01 4.449e+00 3.932 8.44e-05 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.7093 0.1347 20.11 <2e-16 ***
## STARS -3.0279 0.1235 -24.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 39544917.0696
## Number of iterations in BFGS optimization: 23
## Log-likelihood: -1.715e+04 on 18 Df
Interestingly Sulfur Dioxides and Sulphates are not significant in this model, while Alcohol is. # SELECT MODELS
Compare Models based on MSE/AIC
| MSE | AIC | |
|---|---|---|
| Model1 | 6.929787 | 18544.98 |
| Model2 | 6.926722 | 18535.29 |
| Model3 | 6.842661 | 38354.59 |
| Model4 | 6.843489 | 38352.55 |
| Model5 | 6.929788 | 18547.07 |
| Model6 | 6.926723 | 18537.37 |
| Model7 | 6.842657 | 38356.77 |
| Model8 | 6.843485 | 38354.73 |
| Model9 | 1.981498 | NA |
| Model10 | 1.982077 | NA |
| Model11 | NA | 30060.23 |
| Model12 | 1.942825 | NA |
Similar MSE values are observed for Poisson and negative binomial models.
We can compare our zero inflated model using a Vuong test to a normal Poisson model.
vuong(model4,model12)## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -35.18951 model2 > model1 < 2.22e-16
## AIC-corrected -35.06714 model2 > model1 < 2.22e-16
## BIC-corrected -34.62454 model2 > model1 < 2.22e-16
Model2, or our Zero Inflated model, would seem to be better than our non inflated model.
Compare Models by Loss
Use test data and check the output
In order to validate we will use squared loss and squared difference to select model (MSE) from predicting on selected training datasets. Smaller numbers would indicate a truer fit.
## Loss:
## Model1 5.479565
## Model2 5.464097
## Model3 6.844632
## Model4 6.845511
## Model5 5.479559
## Model6 5.464091
## Model7 6.844628
## Model8 6.845507
## Model9 2.038659
## Model10 2.037739
## Model11 3.667318
## Model12 2.003325
Based on above results these are our observation
-> Linear model performed well. -> Poisson regression model and Negative binomial model did not performed as expected. -> We expected Ordinal logistic regression to be a better model but it did not performed well.
At this point we are concentrated more on square loss which tells us the accuracy of our model
Zero Poisson Inflation seems to be the most accurate model with least loss score, and had good results from a Vuong test.
If we consider all the factors like least loss, good MSE and AIC score we found ‘Zero Inflated Poisson’ as our best one.
Prediction on Evaluation Data
Here we use MICE just like how we used earlier for imputing and log transformation for AcidIndex.
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
Display the Predicted values
– Column specification —————————————————————————- cols( TARGET = col_double(), FixedAcidity = col_double(), VolatileAcidity = col_double(), CitricAcid = col_double(), ResidualSugar = col_double(), Chlorides = col_double(), FreeSulfurDioxide = col_double(), TotalSulfurDioxide = col_double(), Density = col_double(), pH = col_double(), Sulphates = col_double(), Alcohol = col_double(), LabelAppeal = col_double(), AcidIndex = col_double(), STARS = col_double() )
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar |
|---|---|---|---|---|
| 1.623 | 5.4 | -0.86 | 0.27 | -10.7 |
| 3.702 | 12.4 | 0.385 | -0.76 | -19.7 |
| 1.719 | 7.2 | 1.75 | 0.17 | -33 |
| 1.484 | 6.2 | 0.1 | 1.8 | 1 |
| 1.65 | 11.4 | 0.21 | 0.28 | 1.2 |
| 5.698 | 17.6 | 0.04 | -1.15 | 1.4 |
| Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH |
|---|---|---|---|---|
| 0.092 | 23 | 398 | 0.9853 | 5.02 |
| 1.169 | -37 | 68 | 0.9905 | 3.37 |
| 0.065 | 9 | 76 | 1.046 | 4.61 |
| -0.179 | 104 | 89 | 0.9888 | 3.2 |
| 0.038 | 70 | 53 | 1.029 | 2.54 |
| 0.535 | -250 | 140 | 0.9503 | 3.06 |
| Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
|---|---|---|---|---|
| 0.64 | 12.3 | -1 | 1.792 | 1 |
| 1.09 | 16 | 0 | 1.792 | 2 |
| 0.68 | 8.55 | 0 | 2.079 | 1 |
| 2.11 | 12.3 | -1 | 2.079 | 1 |
| -0.07 | 4.8 | 0 | 2.303 | 1 |
| -0.02 | 11.4 | 1 | 2.079 | 4 |
For TARGET: Number of Cases Purchased as Predicted
## Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
## 1.01 1.91 3.32 3.25 4.20 8.28 1.38 0.47 -0.36
Predicted Evaluation data
https://github.com/ShovanBiswas/DATA621/blob/master/Homework05/Evaluation_Full_Data.csv
Appendix
https://github.com/ShovanBiswas/DATA621/blob/master/Homework05/Data621_Hw5.Rmd
Marker: 621-05_p