DATA 621: HW 5

David Quarshie - Group 3

Problem

Our goal is to explore, analyze and model a dataset containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

The objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.

1. Data Exploration

Below we’ll display a few basic EDA techniques to gain insight into our wine dataset.

Basic Statistics

There are 12,795 rows and 15 columns. There are 8,200 missing values of total 191,925 data points.

n mean sd median min max skew kurtosis
TARGET 12795 3.0290739 1.9263682 3.00000 0.00000 8.00000 -0.3263010 -0.8772457
FixedAcidity 12795 7.0757171 6.3176435 6.90000 -18.10000 34.40000 -0.0225860 1.6749987
VolatileAcidity 12795 0.3241039 0.7840142 0.28000 -2.79000 3.68000 0.0203800 1.8322106
CitricAcid 12795 0.3084127 0.8620798 0.31000 -3.24000 3.86000 -0.0503070 1.8379401
ResidualSugar 12179 5.4187331 33.7493790 3.90000 -127.80000 141.15000 -0.0531229 1.8846917
Chlorides 12157 0.0548225 0.3184673 0.04600 -1.17100 1.35100 0.0304272 1.7886044
FreeSulfurDioxide 12148 30.8455713 148.7145577 30.00000 -555.00000 623.00000 0.0063930 1.8364966
TotalSulfurDioxide 12113 120.7142326 231.9132105 123.00000 -823.00000 1057.00000 -0.0071794 1.6746665
Density 12795 0.9942027 0.0265376 0.99449 0.88809 1.09924 -0.0186938 1.8999592
pH 12400 3.2076282 0.6796871 3.20000 0.48000 6.13000 0.0442880 1.6462681
Sulphates 11585 0.5271118 0.9321293 0.50000 -3.13000 4.24000 0.0059119 1.7525655
Alcohol 12142 10.4892363 3.7278190 10.40000 -4.70000 26.50000 -0.0307158 1.5394949
LabelAppeal 12795 -0.0090660 0.8910892 0.00000 -2.00000 2.00000 0.0084295 -0.2622916
AcidIndex 12795 7.7727237 1.3239264 8.00000 4.00000 17.00000 1.6484959 5.1900925
STARS 9436 2.0417550 0.9025400 2.00000 1.00000 4.00000 0.4472353 -0.6925343
## Histogram of Varia bles

Scatterplot of Variables

## NULL

2. DATA PREPARATION

Negative Values

There are some wine quality measures that are negative, we’ll take the absolute value for some of them

But for LabelAppeal, we will add the min value to the the absolute values.

Missing Values

We have missing values in pH, ResidualSugar, Chlorides, Free SulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates, and STARS.

STARS

For STARS we assign the NAs to be 0.

Mice Imputation

Let’s take a look at using MICE for imputation on the other missing values.

With the mice imputation distributions roughly match the existing, each of the remaining variables with missing values seem to be MAR. We can run the mice imputation on both the train and test set.

Correlation Review

Finally, we update STARS to be a factor variable so we it can be used for modeling.

3. BUILD MODELS

Model 1: Poisson

## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2449  -0.6939  -0.0082   0.4599   3.7907  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.660e-01  2.213e-01   3.461 0.000538 ***
## FixedAcidity       -7.309e-04  1.169e-03  -0.625 0.531895    
## VolatileAcidity    -3.906e-02  1.048e-02  -3.726 0.000195 ***
## CitricAcid          3.851e-03  9.374e-03   0.411 0.681183    
## ResidualSugar       1.931e-04  2.266e-04   0.852 0.394323    
## Chlorides          -3.226e-02  2.471e-02  -1.305 0.191752    
## FreeSulfurDioxide   6.889e-05  5.286e-05   1.303 0.192481    
## TotalSulfurDioxide  6.658e-05  3.439e-05   1.936 0.052902 .  
## Density            -2.313e-01  2.161e-01  -1.071 0.284345    
## pH                 -1.144e-02  8.448e-03  -1.354 0.175714    
## Sulphates          -1.824e-02  8.782e-03  -2.077 0.037838 *  
## Alcohol             3.755e-03  1.545e-03   2.430 0.015083 *  
## LabelAppeal         1.576e-01  6.871e-03  22.935  < 2e-16 ***
## AcidIndex          -8.027e-02  5.113e-03 -15.700  < 2e-16 ***
## STARS1              7.677e-01  2.176e-02  35.284  < 2e-16 ***
## STARS2              1.102e+00  2.032e-02  54.252  < 2e-16 ***
## STARS3              1.216e+00  2.144e-02  56.726  < 2e-16 ***
## STARS4              1.339e+00  2.713e-02  49.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18544  on 10235  degrees of freedom
## Residual deviance: 11155  on 10218  degrees of freedom
## AIC: 36593
## 
## Number of Fisher Scoring iterations: 6

Model 2: Poisson Reduced

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2827  -0.6959  -0.0100   0.4599   3.7738  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.106e-01  5.072e-02  10.067  < 2e-16 ***
## VolatileAcidity    -3.946e-02  1.048e-02  -3.766 0.000166 ***
## Chlorides          -3.387e-02  2.470e-02  -1.371 0.170294    
## TotalSulfurDioxide  6.561e-05  3.438e-05   1.909 0.056309 .  
## Sulphates          -1.858e-02  8.779e-03  -2.117 0.034260 *  
## Alcohol             3.786e-03  1.544e-03   2.451 0.014231 *  
## LabelAppeal         1.576e-01  6.867e-03  22.951  < 2e-16 ***
## AcidIndex          -8.061e-02  5.041e-03 -15.991  < 2e-16 ***
## STARS1              7.681e-01  2.176e-02  35.308  < 2e-16 ***
## STARS2              1.104e+00  2.031e-02  54.337  < 2e-16 ***
## STARS3              1.217e+00  2.143e-02  56.809  < 2e-16 ***
## STARS4              1.339e+00  2.712e-02  49.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18544  on 10235  degrees of freedom
## Residual deviance: 11161  on 10224  degrees of freedom
## AIC: 36587
## 
## Number of Fisher Scoring iterations: 6

Model 3: Negative Binomial

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train, init.theta = 39447.33681, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2448  -0.6939  -0.0082   0.4599   3.7905  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.661e-01  2.213e-01   3.461 0.000538 ***
## FixedAcidity       -7.309e-04  1.169e-03  -0.625 0.531902    
## VolatileAcidity    -3.906e-02  1.048e-02  -3.726 0.000195 ***
## CitricAcid          3.851e-03  9.375e-03   0.411 0.681194    
## ResidualSugar       1.930e-04  2.267e-04   0.852 0.394356    
## Chlorides          -3.226e-02  2.471e-02  -1.305 0.191764    
## FreeSulfurDioxide   6.889e-05  5.286e-05   1.303 0.192470    
## TotalSulfurDioxide  6.658e-05  3.440e-05   1.936 0.052899 .  
## Density            -2.313e-01  2.161e-01  -1.071 0.284361    
## pH                 -1.144e-02  8.448e-03  -1.354 0.175700    
## Sulphates          -1.824e-02  8.782e-03  -2.077 0.037839 *  
## Alcohol             3.755e-03  1.545e-03   2.430 0.015089 *  
## LabelAppeal         1.576e-01  6.871e-03  22.934  < 2e-16 ***
## AcidIndex          -8.028e-02  5.113e-03 -15.700  < 2e-16 ***
## STARS1              7.677e-01  2.176e-02  35.283  < 2e-16 ***
## STARS2              1.102e+00  2.032e-02  54.251  < 2e-16 ***
## STARS3              1.216e+00  2.144e-02  56.724  < 2e-16 ***
## STARS4              1.339e+00  2.713e-02  49.342  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39447.34) family taken to be 1)
## 
##     Null deviance: 18543  on 10235  degrees of freedom
## Residual deviance: 11154  on 10218  degrees of freedom
## AIC: 36595
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39447 
##           Std. Err.:  37193 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36557.14

Model 4: Negative Binomial Reduced

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = train, 
##     init.theta = 39434.27867, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2826  -0.6959  -0.0100   0.4599   3.7737  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.107e-01  5.073e-02  10.067  < 2e-16 ***
## VolatileAcidity    -3.946e-02  1.048e-02  -3.766 0.000166 ***
## Chlorides          -3.387e-02  2.470e-02  -1.371 0.170305    
## TotalSulfurDioxide  6.561e-05  3.438e-05   1.909 0.056305 .  
## Sulphates          -1.859e-02  8.779e-03  -2.117 0.034260 *  
## Alcohol             3.786e-03  1.544e-03   2.451 0.014237 *  
## LabelAppeal         1.576e-01  6.867e-03  22.950  < 2e-16 ***
## AcidIndex          -8.061e-02  5.041e-03 -15.991  < 2e-16 ***
## STARS1              7.681e-01  2.176e-02  35.307  < 2e-16 ***
## STARS2              1.104e+00  2.031e-02  54.336  < 2e-16 ***
## STARS3              1.217e+00  2.143e-02  56.808  < 2e-16 ***
## STARS4              1.339e+00  2.713e-02  49.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39434.28) family taken to be 1)
## 
##     Null deviance: 18543  on 10235  degrees of freedom
## Residual deviance: 11160  on 10224  degrees of freedom
## AIC: 36589
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39434 
##           Std. Err.:  37174 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36563.13

Model 5: Zero Dispersion Counts

There is an inflated number of 0s in our target, so let’s try a zero dispersion on the negative binomial.

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42359 -0.56537  0.01324  0.44390  2.74903 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.152e+00  2.289e-01   5.030 4.90e-07 ***
## FixedAcidity       -1.940e-04  1.194e-03  -0.162  0.87096    
## VolatileAcidity    -1.751e-02  1.068e-02  -1.639  0.10114    
## CitricAcid         -1.915e-03  9.610e-03  -0.199  0.84206    
## ResidualSugar       1.345e-04  2.319e-04   0.580  0.56196    
## Chlorides          -2.245e-02  2.530e-02  -0.888  0.37481    
## FreeSulfurDioxide  -1.205e-05  5.325e-05  -0.226  0.82103    
## TotalSulfurDioxide -7.652e-06  3.402e-05  -0.225  0.82202    
## Density            -3.065e-01  2.232e-01  -1.374  0.16956    
## pH                  5.173e-03  8.662e-03   0.597  0.55040    
## Sulphates          -1.645e-03  8.954e-03  -0.184  0.85424    
## Alcohol             6.631e-03  1.573e-03   4.215 2.50e-05 ***
## LabelAppeal         2.229e-01  7.107e-03  31.359  < 2e-16 ***
## AcidIndex          -2.877e-02  5.614e-03  -5.124 2.98e-07 ***
## STARS1              4.489e-02  2.349e-02   1.911  0.05600 .  
## STARS2              1.792e-01  2.186e-02   8.200 2.41e-16 ***
## STARS3              2.729e-01  2.299e-02  11.872  < 2e-16 ***
## STARS4              3.719e-01  2.857e-02  13.017  < 2e-16 ***
## Log(theta)          1.551e+01  5.339e+00   2.905  0.00367 ** 
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.36397    0.04108   8.860   <2e-16 ***
## STARS1       -1.96937    0.07395 -26.631   <2e-16 ***
## STARS2       -6.01014    0.64983  -9.249   <2e-16 ***
## STARS3      -19.02098  269.64693  -0.071    0.944    
## STARS4      -19.01912  510.55652  -0.037    0.970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 5446845.6153 
## Number of iterations in BFGS optimization: 50 
## Log-likelihood: -1.661e+04 on 24 Df

Model 6: Zero Dispersion Counts Reduced

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Chlorides + Density + 
##     Alcohol + LabelAppeal + AcidIndex | STARS, data = train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54247 -0.48412  0.05682  0.44699  2.38613 
## 
## Count model coefficients (negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.309297   0.224856   5.823 5.79e-09 ***
## VolatileAcidity -0.017170   0.010642  -1.613   0.1067    
## Chlorides       -0.019134   0.025232  -0.758   0.4483    
## Density         -0.386039   0.222569  -1.734   0.0828 .  
## Alcohol          0.008718   0.001561   5.584 2.35e-08 ***
## LabelAppeal      0.268393   0.006565  40.882  < 2e-16 ***
## AcidIndex       -0.030786   0.005523  -5.574 2.48e-08 ***
## Log(theta)      17.301778         NA      NA       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.39594    0.04027   9.832   <2e-16 ***
## STARS1       -1.93097    0.06997 -27.596   <2e-16 ***
## STARS2       -5.92767    0.53601 -11.059   <2e-16 ***
## STARS3      -19.02090  265.36008  -0.072    0.943    
## STARS4      -19.01910  502.45447  -0.038    0.970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 32663807.2066 
## Number of iterations in BFGS optimization: 48 
## Log-likelihood: -1.677e+04 on 13 Df

4. SELECT MODELS

Test each of our models agains the holdout validation set to help pick one.

Model MAE RMSE
Poisson 2.2530215 2.609764
Poisson Reduced 2.2532916 2.609983
Negative Binomial 2.2530214 2.609764
Negative Binomial Reduced 2.2532915 2.609983
Zero Dispersion 0.9796714 1.261205
Zero Dispersion Reduced 1.0034724 1.303326

Zero Dispersion looks to be the best model.

Make Predictions

The final predictions have a similar shape to our target variable in training .