DATA 621: HW 5
David Quarshie - Group 3
Problem
Our goal is to explore, analyze and model a dataset containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
The objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.
1. Data Exploration
Below we’ll display a few basic EDA techniques to gain insight into our wine dataset.
Basic Statistics
There are 12,795 rows and 15 columns. There are 8,200 missing values of total 191,925 data points.
| n | mean | sd | median | min | max | skew | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| TARGET | 12795 | 3.0290739 | 1.9263682 | 3.00000 | 0.00000 | 8.00000 | -0.3263010 | -0.8772457 |
| FixedAcidity | 12795 | 7.0757171 | 6.3176435 | 6.90000 | -18.10000 | 34.40000 | -0.0225860 | 1.6749987 |
| VolatileAcidity | 12795 | 0.3241039 | 0.7840142 | 0.28000 | -2.79000 | 3.68000 | 0.0203800 | 1.8322106 |
| CitricAcid | 12795 | 0.3084127 | 0.8620798 | 0.31000 | -3.24000 | 3.86000 | -0.0503070 | 1.8379401 |
| ResidualSugar | 12179 | 5.4187331 | 33.7493790 | 3.90000 | -127.80000 | 141.15000 | -0.0531229 | 1.8846917 |
| Chlorides | 12157 | 0.0548225 | 0.3184673 | 0.04600 | -1.17100 | 1.35100 | 0.0304272 | 1.7886044 |
| FreeSulfurDioxide | 12148 | 30.8455713 | 148.7145577 | 30.00000 | -555.00000 | 623.00000 | 0.0063930 | 1.8364966 |
| TotalSulfurDioxide | 12113 | 120.7142326 | 231.9132105 | 123.00000 | -823.00000 | 1057.00000 | -0.0071794 | 1.6746665 |
| Density | 12795 | 0.9942027 | 0.0265376 | 0.99449 | 0.88809 | 1.09924 | -0.0186938 | 1.8999592 |
| pH | 12400 | 3.2076282 | 0.6796871 | 3.20000 | 0.48000 | 6.13000 | 0.0442880 | 1.6462681 |
| Sulphates | 11585 | 0.5271118 | 0.9321293 | 0.50000 | -3.13000 | 4.24000 | 0.0059119 | 1.7525655 |
| Alcohol | 12142 | 10.4892363 | 3.7278190 | 10.40000 | -4.70000 | 26.50000 | -0.0307158 | 1.5394949 |
| LabelAppeal | 12795 | -0.0090660 | 0.8910892 | 0.00000 | -2.00000 | 2.00000 | 0.0084295 | -0.2622916 |
| AcidIndex | 12795 | 7.7727237 | 1.3239264 | 8.00000 | 4.00000 | 17.00000 | 1.6484959 | 5.1900925 |
| STARS | 9436 | 2.0417550 | 0.9025400 | 2.00000 | 1.00000 | 4.00000 | 0.4472353 | -0.6925343 |
| ## Histogram of Varia | bles |
Scatterplot of Variables
## NULL
2. DATA PREPARATION
Negative Values
There are some wine quality measures that are negative, we’ll take the absolute value for some of them
But for LabelAppeal, we will add the min value to the the absolute values.
Missing Values
We have missing values in pH, ResidualSugar, Chlorides, Free SulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates, and STARS.
STARS
For STARS we assign the NAs to be 0.
Mice Imputation
Let’s take a look at using MICE for imputation on the other missing values.
With the mice imputation distributions roughly match the existing, each of the remaining variables with missing values seem to be MAR. We can run the mice imputation on both the train and test set.
Correlation Review
Finally, we update STARS to be a factor variable so we it can be used for modeling.
3. BUILD MODELS
Model 1: Poisson
##
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2449 -0.6939 -0.0082 0.4599 3.7907
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.660e-01 2.213e-01 3.461 0.000538 ***
## FixedAcidity -7.309e-04 1.169e-03 -0.625 0.531895
## VolatileAcidity -3.906e-02 1.048e-02 -3.726 0.000195 ***
## CitricAcid 3.851e-03 9.374e-03 0.411 0.681183
## ResidualSugar 1.931e-04 2.266e-04 0.852 0.394323
## Chlorides -3.226e-02 2.471e-02 -1.305 0.191752
## FreeSulfurDioxide 6.889e-05 5.286e-05 1.303 0.192481
## TotalSulfurDioxide 6.658e-05 3.439e-05 1.936 0.052902 .
## Density -2.313e-01 2.161e-01 -1.071 0.284345
## pH -1.144e-02 8.448e-03 -1.354 0.175714
## Sulphates -1.824e-02 8.782e-03 -2.077 0.037838 *
## Alcohol 3.755e-03 1.545e-03 2.430 0.015083 *
## LabelAppeal 1.576e-01 6.871e-03 22.935 < 2e-16 ***
## AcidIndex -8.027e-02 5.113e-03 -15.700 < 2e-16 ***
## STARS1 7.677e-01 2.176e-02 35.284 < 2e-16 ***
## STARS2 1.102e+00 2.032e-02 54.252 < 2e-16 ***
## STARS3 1.216e+00 2.144e-02 56.726 < 2e-16 ***
## STARS4 1.339e+00 2.713e-02 49.344 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18544 on 10235 degrees of freedom
## Residual deviance: 11155 on 10218 degrees of freedom
## AIC: 36593
##
## Number of Fisher Scoring iterations: 6
Model 2: Poisson Reduced
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2827 -0.6959 -0.0100 0.4599 3.7738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.106e-01 5.072e-02 10.067 < 2e-16 ***
## VolatileAcidity -3.946e-02 1.048e-02 -3.766 0.000166 ***
## Chlorides -3.387e-02 2.470e-02 -1.371 0.170294
## TotalSulfurDioxide 6.561e-05 3.438e-05 1.909 0.056309 .
## Sulphates -1.858e-02 8.779e-03 -2.117 0.034260 *
## Alcohol 3.786e-03 1.544e-03 2.451 0.014231 *
## LabelAppeal 1.576e-01 6.867e-03 22.951 < 2e-16 ***
## AcidIndex -8.061e-02 5.041e-03 -15.991 < 2e-16 ***
## STARS1 7.681e-01 2.176e-02 35.308 < 2e-16 ***
## STARS2 1.104e+00 2.031e-02 54.337 < 2e-16 ***
## STARS3 1.217e+00 2.143e-02 56.809 < 2e-16 ***
## STARS4 1.339e+00 2.712e-02 49.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18544 on 10235 degrees of freedom
## Residual deviance: 11161 on 10224 degrees of freedom
## AIC: 36587
##
## Number of Fisher Scoring iterations: 6
Model 3: Negative Binomial
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train, init.theta = 39447.33681,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2448 -0.6939 -0.0082 0.4599 3.7905
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.661e-01 2.213e-01 3.461 0.000538 ***
## FixedAcidity -7.309e-04 1.169e-03 -0.625 0.531902
## VolatileAcidity -3.906e-02 1.048e-02 -3.726 0.000195 ***
## CitricAcid 3.851e-03 9.375e-03 0.411 0.681194
## ResidualSugar 1.930e-04 2.267e-04 0.852 0.394356
## Chlorides -3.226e-02 2.471e-02 -1.305 0.191764
## FreeSulfurDioxide 6.889e-05 5.286e-05 1.303 0.192470
## TotalSulfurDioxide 6.658e-05 3.440e-05 1.936 0.052899 .
## Density -2.313e-01 2.161e-01 -1.071 0.284361
## pH -1.144e-02 8.448e-03 -1.354 0.175700
## Sulphates -1.824e-02 8.782e-03 -2.077 0.037839 *
## Alcohol 3.755e-03 1.545e-03 2.430 0.015089 *
## LabelAppeal 1.576e-01 6.871e-03 22.934 < 2e-16 ***
## AcidIndex -8.028e-02 5.113e-03 -15.700 < 2e-16 ***
## STARS1 7.677e-01 2.176e-02 35.283 < 2e-16 ***
## STARS2 1.102e+00 2.032e-02 54.251 < 2e-16 ***
## STARS3 1.216e+00 2.144e-02 56.724 < 2e-16 ***
## STARS4 1.339e+00 2.713e-02 49.342 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39447.34) family taken to be 1)
##
## Null deviance: 18543 on 10235 degrees of freedom
## Residual deviance: 11154 on 10218 degrees of freedom
## AIC: 36595
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39447
## Std. Err.: 37193
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36557.14
Model 4: Negative Binomial Reduced
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = train,
## init.theta = 39434.27867, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2826 -0.6959 -0.0100 0.4599 3.7737
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.107e-01 5.073e-02 10.067 < 2e-16 ***
## VolatileAcidity -3.946e-02 1.048e-02 -3.766 0.000166 ***
## Chlorides -3.387e-02 2.470e-02 -1.371 0.170305
## TotalSulfurDioxide 6.561e-05 3.438e-05 1.909 0.056305 .
## Sulphates -1.859e-02 8.779e-03 -2.117 0.034260 *
## Alcohol 3.786e-03 1.544e-03 2.451 0.014237 *
## LabelAppeal 1.576e-01 6.867e-03 22.950 < 2e-16 ***
## AcidIndex -8.061e-02 5.041e-03 -15.991 < 2e-16 ***
## STARS1 7.681e-01 2.176e-02 35.307 < 2e-16 ***
## STARS2 1.104e+00 2.031e-02 54.336 < 2e-16 ***
## STARS3 1.217e+00 2.143e-02 56.808 < 2e-16 ***
## STARS4 1.339e+00 2.713e-02 49.373 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39434.28) family taken to be 1)
##
## Null deviance: 18543 on 10235 degrees of freedom
## Residual deviance: 11160 on 10224 degrees of freedom
## AIC: 36589
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39434
## Std. Err.: 37174
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36563.13
Model 5: Zero Dispersion Counts
There is an inflated number of 0s in our target, so let’s try a zero dispersion on the negative binomial.
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.42359 -0.56537 0.01324 0.44390 2.74903
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.152e+00 2.289e-01 5.030 4.90e-07 ***
## FixedAcidity -1.940e-04 1.194e-03 -0.162 0.87096
## VolatileAcidity -1.751e-02 1.068e-02 -1.639 0.10114
## CitricAcid -1.915e-03 9.610e-03 -0.199 0.84206
## ResidualSugar 1.345e-04 2.319e-04 0.580 0.56196
## Chlorides -2.245e-02 2.530e-02 -0.888 0.37481
## FreeSulfurDioxide -1.205e-05 5.325e-05 -0.226 0.82103
## TotalSulfurDioxide -7.652e-06 3.402e-05 -0.225 0.82202
## Density -3.065e-01 2.232e-01 -1.374 0.16956
## pH 5.173e-03 8.662e-03 0.597 0.55040
## Sulphates -1.645e-03 8.954e-03 -0.184 0.85424
## Alcohol 6.631e-03 1.573e-03 4.215 2.50e-05 ***
## LabelAppeal 2.229e-01 7.107e-03 31.359 < 2e-16 ***
## AcidIndex -2.877e-02 5.614e-03 -5.124 2.98e-07 ***
## STARS1 4.489e-02 2.349e-02 1.911 0.05600 .
## STARS2 1.792e-01 2.186e-02 8.200 2.41e-16 ***
## STARS3 2.729e-01 2.299e-02 11.872 < 2e-16 ***
## STARS4 3.719e-01 2.857e-02 13.017 < 2e-16 ***
## Log(theta) 1.551e+01 5.339e+00 2.905 0.00367 **
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.36397 0.04108 8.860 <2e-16 ***
## STARS1 -1.96937 0.07395 -26.631 <2e-16 ***
## STARS2 -6.01014 0.64983 -9.249 <2e-16 ***
## STARS3 -19.02098 269.64693 -0.071 0.944
## STARS4 -19.01912 510.55652 -0.037 0.970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 5446845.6153
## Number of iterations in BFGS optimization: 50
## Log-likelihood: -1.661e+04 on 24 Df
Model 6: Zero Dispersion Counts Reduced
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Chlorides + Density +
## Alcohol + LabelAppeal + AcidIndex | STARS, data = train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.54247 -0.48412 0.05682 0.44699 2.38613
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.309297 0.224856 5.823 5.79e-09 ***
## VolatileAcidity -0.017170 0.010642 -1.613 0.1067
## Chlorides -0.019134 0.025232 -0.758 0.4483
## Density -0.386039 0.222569 -1.734 0.0828 .
## Alcohol 0.008718 0.001561 5.584 2.35e-08 ***
## LabelAppeal 0.268393 0.006565 40.882 < 2e-16 ***
## AcidIndex -0.030786 0.005523 -5.574 2.48e-08 ***
## Log(theta) 17.301778 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.39594 0.04027 9.832 <2e-16 ***
## STARS1 -1.93097 0.06997 -27.596 <2e-16 ***
## STARS2 -5.92767 0.53601 -11.059 <2e-16 ***
## STARS3 -19.02090 265.36008 -0.072 0.943
## STARS4 -19.01910 502.45447 -0.038 0.970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 32663807.2066
## Number of iterations in BFGS optimization: 48
## Log-likelihood: -1.677e+04 on 13 Df
4. SELECT MODELS
Test each of our models agains the holdout validation set to help pick one.
| Model | MAE | RMSE |
|---|---|---|
| Poisson | 2.2530215 | 2.609764 |
| Poisson Reduced | 2.2532916 | 2.609983 |
| Negative Binomial | 2.2530214 | 2.609764 |
| Negative Binomial Reduced | 2.2532915 | 2.609983 |
| Zero Dispersion | 0.9796714 | 1.261205 |
| Zero Dispersion Reduced | 1.0034724 | 1.303326 |
Zero Dispersion looks to be the best model.
Make Predictions
The final predictions have a similar shape to our target variable in training .