Required Libraries

Data Exploration

To-Do List: 1. Check for typo’s - No Typo’s, all data is int 2. Check for missing 3. Show distributions 4. Determine Categorical/Continuous

Introduction

A data set containing information on approximately 12,000 commercially available wines and their variables mostly related to chemical properties is analyzed for impact on sales and used to predict on sales to give accurate forecasts for manufacturing. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant.

Data Summary

A table below expands on the variables included in analysis with comments from domain experts on expected effects.

Summary
Variable_Name Description Theoretical_Effect Variable_Type
INDEX Identification Variable N/A
TARGET Number of Cases Purchased Discrete
AcidIndex Proprietary Method of Testing Total Acidity Continuous
Alcohol Alcohol Content Continuous
Chlorides Chloride Content of Wine Continuous
CitricAcid Citric Acid Content Continuous
Density Density of Wine Continuous
FixedAcidity Fixed Acidity of Wine Continuous
FreeSulfurDioxide Sulfur Dioxide Content of Wine Continuous
LabelAppeal Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. Ordinal
ResidualSugar Residual Sugar of Wine Continuous
STARS Wine Rating by a Team of Experts, 4 Stars = Excellent A high number of stars suggests high sales. Ordinal
Sulphates Sulfate Content of Wine Continuous
TotalSulfurDioxide Sulfur Dioxide of Wine Continuous
VolatileAcidity Volatile Acid Content of Wine Continuous
pH pH of wine Continuous
A quick look at the variables 5 number summary reveals that several variables have large ranges which when relating to their mean may suggest significantly different scales between variables, a high amount of skew, bi-modal distributions, or outliers. FixedAcidity, ResidualSugar, FreeSulfurDioxide, and TotalSulfurDioxide have fairly extreme ranges in comparison to their means. Variables with Kurtosis greater than 4 will have observations distributed into heavy or long tails and may suggest numerous outliers, less than 2 suggest distributions centered around their mean with short or thin tails. Many of the variables are just below 2 suggesting many will have sharp peaks around the mean. Only AcidIndex shows as a non-ordinal or discrete distribution with extreme values of kurtosis, suggesting it will contain many outliers.
5 Number Summary
mean sd median trimmed mad min max range skew kurtosis se
TARGET 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8 -0.326301039267588 -0.877245713363431 0.0170302
FixedAcidity 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 52.5 -0.0225859613642668 1.67499867419602 0.0558515
VolatileAcidity 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.47 0.0203799652905512 1.83221063847995 0.0069311
CitricAcid 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.1 -0.0503070404378392 1.837940071767 0.0076213
ResidualSugar 5.4187331 33.7493790 3.90000 5.5800410 15.7155600 -127.80000 141.15000 268.95 -0.0531229052496501 1.88469166771449 0.3058158
Chlorides 0.0548225 0.3184673 0.04600 0.0540159 0.1349166 -1.17100 1.35100 2.522 0.0304271748147184 1.78860442940177 0.0028884
FreeSulfurDioxide 30.8455713 148.7145577 30.00000 30.9334877 56.3388000 -555.00000 623.00000 1178 0.0063930101150823 1.8364966248458 1.3492769
TotalSulfurDioxide 120.7142326 231.9132105 123.00000 120.8895367 134.9166000 -823.00000 1057.00000 1880 -0.00717935086303868 1.67466647637973 2.1071703
Density 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 0.21115 -0.0186937638734045 1.89995920703906 0.0002346
pH 3.2076282 0.6796871 3.20000 3.2055706 0.3854760 0.48000 6.13000 5.65 0.0442880137342456 1.64626805925174 0.0061038
Sulphates 0.5271118 0.9321293 0.50000 0.5271453 0.4447800 -3.13000 4.24000 7.37 0.00591189546131464 1.75256551530022 0.0086602
Alcohol 10.4892363 3.7278190 10.40000 10.5018255 2.3721600 -4.70000 26.50000 31.2 -0.0307158360932212 1.53949494703394 0.0338306
LabelAppeal -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4 0.00842945702449739 -0.262291551256824 0.0078777
AcidIndex 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 13 1.64849594529687 5.19009248111377 0.0117043
STARS 2.0417550 0.9025400 2.00000 1.9711258 1.4826000 1.00000 4.00000 3 0.447235291548031 -0.692534319319664 0.0092912

Histograms

As kurtosis foreshadowed, many of the distributions have sharp peaks at the mean with only the AcidIndex showing a bi-modal distribution. With the sharp centers around the peaks in the histograms, a high number of outliers may present themselves.

Bar Plots

There is a relatively normal distribution to LabelAppeal, but both STARS and TARGET tend to favor their lower values suggesting it’s quite difficult to gain either a critic’s praise or a significant amount of cases sold.

Boxplots reveal a significant number of residuals in all of the variables.

Correlation Matrix

The correlation matrix reveals a moderate relationship between STARS and LabelAppeal with Target. Although both STARS and LabelAppeal seem to be somewhat correlated to each other suggesting potential colinearity. The AcidIndex, being a propietary method that aggregates across Acid metrics, does show some relationship with FixedAcidity but is relatively minor.

Missing Values

While missing values may be indicative of the target, the STARS variable is missing 26% of its values. Determining the relationship it has to cases sold may be useful before removing it from the dataset. To view the relationship of the “missing” STARS ratings, NA’s have been replaced with a category of “Unrated” and the bar plots are shown again. Chlorides, FreeSulfurDioxide, Alcohol, and TotalSulfurDioxide are missing around 5% or about 600 values. Sulphates is missing about 10% of its values and about 1200 values.

Missing Values Count
ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
616 638 647 682 395 1210 653 3359
plot_pattern(train, square = TRUE, rotate = TRUE, npat = 6)

While unrated wines typically aren’t purchased, there are some that sell about 3 cases. This might suggest that non-rated wines are not submitted for critic’s appraisal and should be used as a feature in the modeling. This plot also reveals a heavy preference for 2 star wines.

The majority of 0 value appeals center on 4 cases sold and does tend to show a linear relationship between the two.

Data Preparation

Now that we have explored our data set, we can move on to data preparation to prep out data for modeling and analysis.

Data Wrangling

To do list: 1. Split the data into training and testing sets 2. Impute missing values 3. Normalize the data 4. Deal with outliers 5. One hot encode the categorical variables

The data has already been partially cleaned with the removal of the INDEX variable. The missing values in STARS were replaced with “Unrated” to indicate non-rated wines.

Data Imputation

Before we can impute missing values, we perform the train-test split to avoid data leakage:

Now, we can impute the missing values in the training and testing data sets. We will use the MICE package to impute the missing values. In the imputation process, we will exclude the TARGET variable from the predictors, as the target variable should not be used to predict the missing values of the predictors. All imputation will be done for all three of the training, testing, and evaluation data sets. In order to make the dataframes match we drop the INDEX column from the evaluation data set.

Now that we’ve imputed the missing values, we can compare the summary statistics of the original data and the imputed data. The summary statistics are calculated for the following variables: Chlorides, FreeSulfurDioxide, Alcohol, TotalSulfurDioxide, pH, and Sulphates. The summary statistics are calculated for the full training data set, the training data set after imputation, and the testing data set after imputation. The summary statistics are calculated for the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values of the variables. The summary statistics are then compared across the three data sets to see how the imputation process has affected the data.

Summary Statistics Comparison Across Datasets
Variable_Stat Dataset (Pre-Imputations) Train Imputed Test Imputed
Chlorides_min -1.17000000 -1.17100000 -1.17100000
Chlorides_q1 -0.03900000 -0.03850000 -0.00700000
Chlorides_median 0.04600000 0.04600000 0.04700000
Chlorides_mean 0.05075939 0.05138424 0.06377821
Chlorides_q3 0.14325000 0.14600000 0.17100000
Chlorides_max 1.35100000 1.35100000 1.26000000
FreeSulfurDioxide_min -546.00000000 -546.00000000 -555.00000000
FreeSulfurDioxide_q1 1.00000000 0.00000000 -2.00000000
FreeSulfurDioxide_median 30.00000000 30.00000000 31.00000000
FreeSulfurDioxide_mean 30.88763649 30.69956468 31.09989572
FreeSulfurDioxide_q3 69.00000000 69.00000000 73.00000000
FreeSulfurDioxide_max 623.00000000 623.00000000 617.00000000
Alcohol_min -4.70000000 -4.70000000 -4.40000000
Alcohol_q1 9.00000000 9.00000000 9.00000000
Alcohol_median 10.40000000 10.40000000 10.40000000
Alcohol_mean 10.46157651 10.46814317 10.56149635
Alcohol_q3 12.30000000 12.30000000 12.40000000
Alcohol_max 26.10000000 26.10000000 26.50000000
TotalSulfurDioxide_min -816.00000000 -816.00000000 -823.00000000
TotalSulfurDioxide_q1 26.00000000 26.00000000 29.00000000
TotalSulfurDioxide_median 123.00000000 123.00000000 124.00000000
TotalSulfurDioxide_mean 121.37894489 121.52347360 119.30724713
TotalSulfurDioxide_q3 209.00000000 209.00000000 205.00000000
TotalSulfurDioxide_max 1057.00000000 1057.00000000 1041.00000000
pH_min 0.48000000 0.48000000 0.53000000
pH_q1 2.95000000 2.96000000 2.96000000
pH_median 3.20000000 3.20000000 3.20000000
pH_mean 3.20488704 3.20552093 3.21393691
pH_q3 3.47000000 3.46000000 3.48000000
pH_max 6.02000000 6.05000000 6.21000000
Sulphates_min -3.13000000 -3.13000000 -3.10000000
Sulphates_q1 0.29000000 0.30000000 0.25000000
Sulphates_median 0.50000000 0.50000000 0.50000000
Sulphates_mean 0.53412426 0.53572050 0.51387226
Sulphates_q3 0.86000000 0.87000000 0.85000000
Sulphates_max 4.24000000 4.24000000 4.21000000

The above table is very encouraging. The summary statistics for the variables with missing data did not seem to change much after the imputation. Most of the discrepancies appear in the test data set, this plausibly due to the smaller sample size.

Transformations

Best Transformations
Variable Transformation
FixedAcidity orderNorm
VolatileAcidity orderNorm
CitricAcid orderNorm
ResidualSugar orderNorm
Chlorides orderNorm
FreeSulfurDioxide orderNorm
TotalSulfurDioxide orderNorm
Density orderNorm
pH orderNorm
Sulphates orderNorm
Alcohol orderNorm
AcidIndex orderNorm

Again, we must be quite careful to avoid data leakage, calculating the parameters for these transformations using only the training set, and then applying the transformations to our other sets using the same parameters.

Pre and Post Transformation Skewness Comparison
Variable Pre-Transformation Skew Post-Transformation Skew
FixedAcidity -0.043 0.000
VolatileAcidity -0.009 0.000
CitricAcid -0.043 0.000
ResidualSugar -0.085 0.000
Chlorides 0.026 0.000
FreeSulfurDioxide 0.037 0.000
TotalSulfurDioxide 0.014 0.000
Density 0.005 0.000
pH 0.012 0.000
Sulphates -0.013 0.000
Alcohol -0.034 0.000
AcidIndex 1.660 0.149

The transformations almost completely got rid of any skew in our data. We can visualize this using by recreating the histograms with the transformed data.

While the first table seems to suggest that AcidIndex_transformed had its skewness lowered to a relatively insignificant amount, the histogram reveals that the variable still seemingly is bimodal. This may suggest that the transformation was not the best choice for this variable and grouping the data may be a better choice. However, upon further investigation, the appearance of bimodality may be due to the amount of bins selected for the histograms. More bins reveal a more normal distribution. We can test whether it is bimodal using a dip test.

## 
##  Hartigans' dip test for unimodality / multimodality
## 
## data:  train_data_transformed$AcidIndex
## D = 0.15872, p-value < 2.2e-16
## alternative hypothesis: non-unimodal, i.e., at least bimodal

The extremely low p-value suggests that the AcidIndex variable is bimodal. We will group the data into two categories to deal with this issue.

The resulting groups are shown in the histogram above. The plot reveals that, while not evenly distributed, there really is only one group. The appearance of bimodality is likely due to the much larger amount of the non-median group. We will not group the data and move on to dealing with outliers.

Now that we’ve transformed our data, we can move on to dealing with outliers.

Outliers

We will use the IQR method to detect outliers in the data. The IQR method is a robust method for detecting outliers that is not sensitive to the presence of extreme values. The IQR method defines an outlier as any value that is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The lower and upper limits for each variable are calculated using the IQR method.

Using the IQR limits, there is a significant amount of outliers in the data. The transformation process did not impact the number of outliers in the data. Using a Box-Cox transformation might have been a better way to get rid of the outliers but it was not an option for many of the variables due to them containing negative and zero values.

Ultimately, due to the large amount of outliers, removing them would result in a significant loss of data. We will keep the outliers in the data and move on to one-hot encoding the categorical variables.

One-Hot Encoding

We have two factor columns in LabelAppeal and STARS. LabelAppeal can be converted to numeric as it is ordinal. While STARS is also ordinal, it also has an ‘unrated’ category. We will one-hot encode this column but also keep the original column for now.

After our data has been prepped, we can now move on to modeling.

Modeling

With the data exploration and preparation out of the way, we turn to build different types of regression models to predict the number of cases of wine ordered by distributors. Again, the response variable is the count of cases, and so it is appropriate to consider Poisson regression, negative binomial regression, and multiple linear regression. We build models of each type with some commentary, and then we will consider more generally how the models compare to one another.

Poisson Regression

We first consider Poisson regression models. Now, it’s critical to note, despite the transformations we performed in the previous section, Poisson models do not require normally distributed data, and so leveraging transformed data is actually counter-productive. As such, the relevant dataframes are:

  • train_data_imputed
  • test_data_imputed
  • eval_data_imputed

Poisson Model 1

We start with a rather simple model, with all the variables along with a few more sophisticated variables:

  1. Alcohol:LabelAppeal in case these two variables have an especially strong combined effect
  2. STARS:Alcohol since high quality wines with certain alcohol content might sell especially well
  3. LabelAppeal:STARS in case people might be especially likely to buy visually appealing and highly rated wines
  4. Alcohol^2 as intuitively alcohol content doesn’t have a strictly linear relationship with the target variable
##  Factor w/ 9 levels "0","1","2","3",..: 4 4 6 4 5 1 5 7 1 5 ...
## 
## Call:
## glm(formula = TARGET ~ Alcohol + Alcohol^2 + LabelAppeal + STARS + 
##     AcidIndex + Chlorides + CitricAcid + Density + FixedAcidity + 
##     FreeSulfurDioxide + ResidualSugar + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH + Alcohol:LabelAppeal + STARS:Alcohol + 
##     LabelAppeal:STARS, family = poisson(), data = train_data_imputed)
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.344e+00  1.233e-01  10.893  < 2e-16 ***
## Alcohol                     8.200e-03  5.323e-03   1.541 0.123437    
## LabelAppeal-1               3.952e-01  6.877e-02   5.747 9.10e-09 ***
## LabelAppeal0                6.420e-01  6.722e-02   9.551  < 2e-16 ***
## LabelAppeal1                7.300e-01  6.904e-02  10.573  < 2e-16 ***
## LabelAppeal2                7.548e-01  8.778e-02   8.599  < 2e-16 ***
## STARS2                      2.600e-01  5.629e-02   4.618 3.87e-06 ***
## STARS3                      5.118e-01  8.568e-02   5.973 2.33e-09 ***
## STARS4                      8.040e-01  6.222e-02  12.922  < 2e-16 ***
## STARSUnrated               -5.029e-01  5.674e-02  -8.862  < 2e-16 ***
## AcidIndex                  -7.576e-02  2.453e-03 -30.881  < 2e-16 ***
## Chlorides                  -3.808e-02  8.609e-03  -4.423 9.71e-06 ***
## CitricAcid                  5.293e-03  3.182e-03   1.663 0.096234 .  
## Density                    -3.336e-01  1.029e-01  -3.243 0.001183 ** 
## FixedAcidity                1.147e-04  4.369e-04   0.263 0.792929    
## FreeSulfurDioxide           9.423e-05  1.841e-05   5.118 3.09e-07 ***
## ResidualSugar               1.471e-04  8.122e-05   1.811 0.070104 .  
## Sulphates                  -9.047e-03  2.953e-03  -3.063 0.002189 ** 
## TotalSulfurDioxide          8.507e-05  1.180e-05   7.208 5.66e-13 ***
## VolatileAcidity            -3.508e-02  3.466e-03 -10.122  < 2e-16 ***
## pH                         -1.512e-02  3.989e-03  -3.791 0.000150 ***
## Alcohol:LabelAppeal-1      -4.535e-03  5.507e-03  -0.823 0.410277    
## Alcohol:LabelAppeal0       -8.650e-03  5.367e-03  -1.612 0.107037    
## Alcohol:LabelAppeal1       -6.552e-03  5.481e-03  -1.195 0.232001    
## Alcohol:LabelAppeal2       -2.205e-02  6.358e-03  -3.468 0.000524 ***
## Alcohol:STARS2              7.646e-03  2.031e-03   3.764 0.000167 ***
## Alcohol:STARS3              3.722e-03  2.234e-03   1.666 0.095685 .  
## Alcohol:STARS4              3.878e-03  3.233e-03   1.199 0.230435    
## Alcohol:STARSUnrated        1.849e-05  2.792e-03   0.007 0.994716    
## LabelAppeal-1:STARS2       -9.189e-02  5.377e-02  -1.709 0.087490 .  
## LabelAppeal0:STARS2        -4.243e-02  5.275e-02  -0.804 0.421141    
## LabelAppeal1:STARS2         5.258e-02  5.419e-02   0.970 0.331901    
## LabelAppeal2:STARS2         3.508e-01  7.237e-02   4.847 1.25e-06 ***
## LabelAppeal-1:STARS3       -1.215e-01  8.480e-02  -1.433 0.151866    
## LabelAppeal0:STARS3        -1.384e-01  8.315e-02  -1.665 0.095960 .  
## LabelAppeal1:STARS3        -8.809e-02  8.397e-02  -1.049 0.294134    
## LabelAppeal2:STARS3         2.314e-01  9.602e-02   2.410 0.015964 *  
## LabelAppeal-1:STARS4       -1.855e-01  7.662e-02  -2.421 0.015470 *  
## LabelAppeal0:STARS4        -2.962e-01  5.521e-02  -5.366 8.06e-08 ***
## LabelAppeal1:STARS4        -2.817e-01  5.527e-02  -5.098 3.44e-07 ***
## LabelAppeal2:STARS4                NA         NA      NA       NA    
## LabelAppeal-1:STARSUnrated -9.569e-02  5.166e-02  -1.852 0.064001 .  
## LabelAppeal0:STARSUnrated  -2.357e-01  5.051e-02  -4.667 3.05e-06 ***
## LabelAppeal1:STARSUnrated  -5.390e-01  5.437e-02  -9.912  < 2e-16 ***
## LabelAppeal2:STARSUnrated  -6.257e-02  7.974e-02  -0.785 0.432609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80043  on 44794  degrees of freedom
## Residual deviance: 47535  on 44751  degrees of freedom
## AIC: 159458
## 
## Number of Fisher Scoring iterations: 6

The model seems promising; for example there’s major reduction in deviance from the null model to the full model. Still, there’s much work to be done. We start by noting there are undefined coefficients because of singularities. Let’s take a look at potential multicollinearity.

## Model :
## TARGET ~ Alcohol + Alcohol^2 + LabelAppeal + STARS + AcidIndex + 
##     Chlorides + CitricAcid + Density + FixedAcidity + FreeSulfurDioxide + 
##     ResidualSugar + Sulphates + TotalSulfurDioxide + VolatileAcidity + 
##     pH + Alcohol:LabelAppeal + STARS:Alcohol + LabelAppeal:STARS
## 
## Complete :
##                     (Intercept) Alcohol LabelAppeal-1 LabelAppeal0 LabelAppeal1
## LabelAppeal2:STARS4  0           0       0             0            0          
##                     LabelAppeal2 STARS2 STARS3 STARS4 STARSUnrated AcidIndex
## LabelAppeal2:STARS4  0            0      0      1      0            0       
##                     Chlorides CitricAcid Density FixedAcidity FreeSulfurDioxide
## LabelAppeal2:STARS4  0         0          0       0            0               
##                     ResidualSugar Sulphates TotalSulfurDioxide VolatileAcidity
## LabelAppeal2:STARS4  0             0         0                  0             
##                     pH Alcohol:LabelAppeal-1 Alcohol:LabelAppeal0
## LabelAppeal2:STARS4  0  0                     0                  
##                     Alcohol:LabelAppeal1 Alcohol:LabelAppeal2 Alcohol:STARS2
## LabelAppeal2:STARS4  0                    0                    0            
##                     Alcohol:STARS3 Alcohol:STARS4 Alcohol:STARSUnrated
## LabelAppeal2:STARS4  0              0              0                  
##                     LabelAppeal-1:STARS2 LabelAppeal0:STARS2
## LabelAppeal2:STARS4  0                    0                 
##                     LabelAppeal1:STARS2 LabelAppeal2:STARS2
## LabelAppeal2:STARS4  0                   0                 
##                     LabelAppeal-1:STARS3 LabelAppeal0:STARS3
## LabelAppeal2:STARS4  0                    0                 
##                     LabelAppeal1:STARS3 LabelAppeal2:STARS3
## LabelAppeal2:STARS4  0                   0                 
##                     LabelAppeal-1:STARS4 LabelAppeal0:STARS4
## LabelAppeal2:STARS4 -1                   -1                 
##                     LabelAppeal1:STARS4 LabelAppeal-1:STARSUnrated
## LabelAppeal2:STARS4 -1                   0                        
##                     LabelAppeal0:STARSUnrated LabelAppeal1:STARSUnrated
## LabelAppeal2:STARS4  0                         0                       
##                     LabelAppeal2:STARSUnrated
## LabelAppeal2:STARS4  0
## 
## Call:
## glm(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + CitricAcid + Density + FixedAcidity + 
##     FreeSulfurDioxide + ResidualSugar + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH + Alcohol:LabelAppeal + STARS:Alcohol, 
##     family = poisson(), data = train_data_imputed)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.478e+00  1.211e-01  12.205  < 2e-16 ***
## Alcohol                1.814e-03  5.754e-03   0.315 0.752540    
## I(Alcohol^2)           2.115e-04  1.027e-04   2.059 0.039449 *  
## LabelAppeal-1          3.036e-01  6.323e-02   4.801 1.58e-06 ***
## LabelAppeal0           5.301e-01  6.170e-02   8.592  < 2e-16 ***
## LabelAppeal1           6.403e-01  6.282e-02  10.193  < 2e-16 ***
## LabelAppeal2           9.445e-01  7.224e-02  13.074  < 2e-16 ***
## STARS2                 2.554e-01  2.234e-02  11.432  < 2e-16 ***
## STARS3                 4.111e-01  2.498e-02  16.456  < 2e-16 ***
## STARS4                 5.284e-01  3.691e-02  14.319  < 2e-16 ***
## STARSUnrated          -7.541e-01  3.044e-02 -24.774  < 2e-16 ***
## AcidIndex             -7.803e-02  2.445e-03 -31.913  < 2e-16 ***
## Chlorides             -3.866e-02  8.609e-03  -4.491 7.09e-06 ***
## CitricAcid             6.276e-03  3.184e-03   1.971 0.048704 *  
## Density               -3.394e-01  1.028e-01  -3.302 0.000959 ***
## FixedAcidity           1.370e-04  4.367e-04   0.314 0.753636    
## FreeSulfurDioxide      9.861e-05  1.840e-05   5.359 8.38e-08 ***
## ResidualSugar          1.619e-04  8.111e-05   1.996 0.045970 *  
## Sulphates             -8.979e-03  2.951e-03  -3.043 0.002345 ** 
## TotalSulfurDioxide     8.624e-05  1.179e-05   7.318 2.52e-13 ***
## VolatileAcidity       -3.567e-02  3.462e-03 -10.304  < 2e-16 ***
## pH                    -1.513e-02  3.988e-03  -3.793 0.000149 ***
## Alcohol:LabelAppeal-1 -3.224e-03  5.469e-03  -0.590 0.555497    
## Alcohol:LabelAppeal0  -6.556e-03  5.327e-03  -1.231 0.218415    
## Alcohol:LabelAppeal1  -3.875e-03  5.435e-03  -0.713 0.475845    
## Alcohol:LabelAppeal2  -1.860e-02  6.303e-03  -2.951 0.003163 ** 
## Alcohol:STARS2         6.739e-03  2.015e-03   3.345 0.000823 ***
## Alcohol:STARS3         3.169e-03  2.222e-03   1.426 0.153860    
## Alcohol:STARS4         2.983e-03  3.217e-03   0.927 0.353707    
## Alcohol:STARSUnrated   9.260e-04  2.781e-03   0.333 0.739112    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80043  on 44794  degrees of freedom
## Residual deviance: 48067  on 44765  degrees of freedom
## AIC: 159962
## 
## Number of Fisher Scoring iterations: 6
VIF Values simple model 2
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
Alcohol 62.616827 61.477857 63.777246 7.913080 0.0159701 0.0156796 0.0162660
I(Alcohol^2) 9.975630 9.801962 10.152725 3.158422 0.1002443 0.0984957 0.1020204
LabelAppeal 9648.902267 9471.976675 9829.132974 98.228826 0.0001036 0.0001017 0.0001056
STARS 8808.211331 8646.701743 8972.738064 93.852071 0.0001135 0.0001114 0.0001157
AcidIndex 1.063457 1.053840 1.074791 1.031241 0.9403296 0.9304135 0.9489102
Chlorides 1.005769 1.001139 1.029226 1.002880 0.9942642 0.9716036 0.9988626
CitricAcid 1.008438 1.002766 1.025737 1.004210 0.9916325 0.9749088 0.9972411
Density 1.005458 1.000984 1.030295 1.002725 0.9945712 0.9705957 0.9990175
FixedAcidity 1.024270 1.016272 1.036199 1.012062 0.9763054 0.9650659 0.9839888
FreeSulfurDioxide 1.006183 1.001359 1.028133 1.003087 0.9938548 0.9726369 0.9986429
ResidualSugar 1.003295 1.000195 1.055650 1.001646 0.9967154 0.9472837 0.9998049
Sulphates 1.004457 1.000549 1.036204 1.002226 0.9955626 0.9650609 0.9994516
TotalSulfurDioxide 1.005488 1.000998 1.030182 1.002740 0.9945418 0.9707026 0.9990031
VolatileAcidity 1.007462 1.002119 1.026271 1.003724 0.9925937 0.9744016 0.9978852
pH 1.008664 1.002923 1.025681 1.004323 0.9914106 0.9749619 0.9970857
Alcohol:LabelAppeal 61462.048045 60335.009097 62610.140075 247.915405 0.0000163 0.0000160 0.0000166
Alcohol:STARS 13147.238693 12906.163046 13392.817768 114.661409 0.0000761 0.0000747 0.0000775

There’s clearly a high degree of collinearity, but it’s critical to remove one column at a time and reassess colinearity:

VIF Values simple model 2
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
Alcohol 9.907412 9.734964 10.083264 3.147604 0.1009345 0.0991742 0.1027225
I(Alcohol^2) 9.889156 9.717043 10.064668 3.144703 0.1011209 0.0993575 0.1029120
LabelAppeal 1.137419 1.126113 1.149740 1.066499 0.8791833 0.8697620 0.8880108
STARS 1.170559 1.158588 1.183435 1.081924 0.8542924 0.8449979 0.8631198
AcidIndex 1.061721 1.052151 1.073046 1.030398 0.9418672 0.9319266 0.9504335
Chlorides 1.004810 1.000689 1.033565 1.002402 0.9952134 0.9675249 0.9993113
CitricAcid 1.007451 1.002112 1.026286 1.003719 0.9926038 0.9743877 0.9978922
Density 1.004941 1.000745 1.032762 1.002468 0.9950833 0.9682775 0.9992554
FixedAcidity 1.023935 1.015961 1.035893 1.011897 0.9766241 0.9653503 0.9842895
FreeSulfurDioxide 1.005755 1.001131 1.029278 1.002873 0.9942781 0.9715549 0.9988701
ResidualSugar 1.003118 1.000157 1.061810 1.001558 0.9968917 0.9417877 0.9998427
Sulphates 1.003582 1.000265 1.048346 1.001789 0.9964310 0.9538838 0.9997347
TotalSulfurDioxide 1.004041 1.000402 1.040660 1.002019 0.9959749 0.9609283 0.9995985
VolatileAcidity 1.006550 1.001565 1.027412 1.003270 0.9934923 0.9733190 0.9984372
pH 1.007897 1.002401 1.025972 1.003941 0.9921648 0.9746857 0.9976046

At this point, we’ve removed all high correlation variables. Now, there are still two variables with fairly high VIFs, namely Alcohol and I(Alcohol^2)–this is unsurprising to say the least. It would be odd to remove only the former term, so let’s see the model summary and consider whether we ought to remove I(Alcohol^2):

## 
## Call:
## glm(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + CitricAcid + Density + FixedAcidity + 
##     FreeSulfurDioxide + ResidualSugar + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH, family = poisson(), data = train_data_imputed)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.505e+00  1.065e-01  14.134  < 2e-16 ***
## Alcohol            -4.225e-04  2.289e-03  -0.185 0.853539    
## I(Alcohol^2)        2.183e-04  1.023e-04   2.134 0.032804 *  
## LabelAppeal-1       2.688e-01  2.074e-02  12.962  < 2e-16 ***
## LabelAppeal0        4.599e-01  2.027e-02  22.689  < 2e-16 ***
## LabelAppeal1        5.983e-01  2.060e-02  29.041  < 2e-16 ***
## LabelAppeal2        7.446e-01  2.318e-02  32.127  < 2e-16 ***
## STARS2              3.258e-01  7.656e-03  42.553  < 2e-16 ***
## STARS3              4.437e-01  8.361e-03  53.075  < 2e-16 ***
## STARS4              5.586e-01  1.150e-02  48.565  < 2e-16 ***
## STARSUnrated       -7.445e-01  1.046e-02 -71.179  < 2e-16 ***
## AcidIndex          -7.794e-02  2.442e-03 -31.914  < 2e-16 ***
## Chlorides          -3.930e-02  8.603e-03  -4.568 4.91e-06 ***
## CitricAcid          6.049e-03  3.183e-03   1.900 0.057403 .  
## Density            -3.418e-01  1.027e-01  -3.328 0.000875 ***
## FixedAcidity        1.468e-04  4.365e-04   0.336 0.736670    
## FreeSulfurDioxide   9.913e-05  1.840e-05   5.387 7.16e-08 ***
## ResidualSugar       1.618e-04  8.110e-05   1.995 0.046018 *  
## Sulphates          -8.779e-03  2.950e-03  -2.976 0.002917 ** 
## TotalSulfurDioxide  8.684e-05  1.178e-05   7.374 1.66e-13 ***
## VolatileAcidity    -3.535e-02  3.460e-03 -10.218  < 2e-16 ***
## pH                 -1.541e-02  3.987e-03  -3.864 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80043  on 44794  degrees of freedom
## Residual deviance: 48099  on 44773  degrees of freedom
## AIC: 159978
## 
## Number of Fisher Scoring iterations: 6

Indeed, I(Alcohol^2) is statistically significant, so we won’t remove it. However, there are a couple variables that appear less promising, and we will removes those one at a time (a manual backwards elimination process).

## 
## Call:
## glm(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + Density + FreeSulfurDioxide + 
##     Sulphates + TotalSulfurDioxide + VolatileAcidity + pH, family = poisson(), 
##     data = train_data_imputed)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.507e+00  1.064e-01  14.158  < 2e-16 ***
## Alcohol            -4.546e-04  2.289e-03  -0.199 0.842536    
## I(Alcohol^2)        2.199e-04  1.022e-04   2.151 0.031464 *  
## LabelAppeal-1       2.685e-01  2.073e-02  12.948  < 2e-16 ***
## LabelAppeal0        4.595e-01  2.027e-02  22.668  < 2e-16 ***
## LabelAppeal1        5.980e-01  2.060e-02  29.029  < 2e-16 ***
## LabelAppeal2        7.450e-01  2.317e-02  32.150  < 2e-16 ***
## STARS2              3.262e-01  7.654e-03  42.618  < 2e-16 ***
## STARS3              4.440e-01  8.360e-03  53.113  < 2e-16 ***
## STARS4              5.589e-01  1.150e-02  48.596  < 2e-16 ***
## STARSUnrated       -7.447e-01  1.046e-02 -71.201  < 2e-16 ***
## AcidIndex          -7.751e-02  2.412e-03 -32.128  < 2e-16 ***
## Chlorides          -3.975e-02  8.601e-03  -4.622 3.80e-06 ***
## Density            -3.439e-01  1.027e-01  -3.349 0.000811 ***
## FreeSulfurDioxide   9.995e-05  1.840e-05   5.432 5.56e-08 ***
## Sulphates          -8.900e-03  2.948e-03  -3.019 0.002539 ** 
## TotalSulfurDioxide  8.728e-05  1.177e-05   7.413 1.23e-13 ***
## VolatileAcidity    -3.551e-02  3.459e-03 -10.267  < 2e-16 ***
## pH                 -1.527e-02  3.986e-03  -3.830 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 80043  on 44794  degrees of freedom
## Residual deviance: 48107  on 44776  degrees of freedom
## AIC: 159980
## 
## Number of Fisher Scoring iterations: 6

There are a number of takeaways from this model summary, most of which are totally expected. First, only the quadratic term for alcohol is significant; this suggests that after a certain point, small changes in alcohol content–past a certain point–can have a large impact on the target variable. Second, the higher the label appeal level, the higher the log counts of cases ordered. Third, higher star ratings are highly associated with higher values for the target variable; being unrated significantly decreases the log count of cases ordered–we return to this point momentarily. Finally, a number of the chemical properties have effects on the target variable. For example, AcidIndex, Density, and VolatileAcidity all have negative coefficients. While I’m not a wine connoisseur myself, a highly dense wine seems unappealing at least.

Again, we note the huge reduction in deviance when going from the null model to the full model–this speaks well to our model. Now, a further question is if a Poisson model is appropriate here. A key condition for Poisson is that the mean and variance of the response variable are equal. We check this now:

## [1] 0.8837684

So there certainly isn’t over-dispersion. The under-dispersion is somewhat surprising, but the value is close enough to 1, and certainly close enough for a baseline model. We turn now to construct a new model

Poisson Model 2 (Zero-Inflated)

We observed in our last model that unrated wines perform especially badly. Recall, though, we actually turned those values to unrated; they were missing at first. What if, then, these values should actually be a “0” rating? If so, we might be able to use a model that both improves accuracy and interpretability. The first step, then, is to create a new column changing the unrated values to zeros.

STARS Value Counts
Var1 Freq
0 11585
1 10755
2 12625
3 7635
4 2195

Immediately we see that this change leads to a large number of zeroes. This is a strong indicator for considering a zero-inflated model, especially given that a different process may well have led to a zero rating than the process that led to the other ratings.

We start with creating a zero-inflated model using the same variables as the most recent Poisson model as that provides a strong baseline:

## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + original_stars + 
##     AcidIndex + Chlorides + Density + FreeSulfurDioxide + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity + pH | original_stars, data = train_data_imputed, 
##     dist = "poisson")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18614 -0.51971  0.01761  0.40950  2.87565 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.261e-01  1.102e-01   8.405  < 2e-16 ***
## Alcohol             6.327e-03  2.298e-03   2.754  0.00589 ** 
## I(Alcohol^2)        5.421e-06  1.008e-04   0.054  0.95714    
## LabelAppeal-1       3.886e-01  2.144e-02  18.123  < 2e-16 ***
## LabelAppeal0        6.625e-01  2.100e-02  31.554  < 2e-16 ***
## LabelAppeal1        8.558e-01  2.139e-02  40.000  < 2e-16 ***
## LabelAppeal2        1.019e+00  2.397e-02  42.529  < 2e-16 ***
## original_stars      9.769e-02  2.783e-03  35.106  < 2e-16 ***
## AcidIndex          -2.690e-02  2.669e-03 -10.078  < 2e-16 ***
## Chlorides          -2.682e-02  8.807e-03  -3.045  0.00232 ** 
## Density            -3.123e-01  1.062e-01  -2.941  0.00327 ** 
## FreeSulfurDioxide   3.227e-05  1.862e-05   1.733  0.08302 .  
## Sulphates           1.543e-04  3.024e-03   0.051  0.95930    
## TotalSulfurDioxide  1.201e-05  1.173e-05   1.024  0.30599    
## VolatileAcidity    -1.917e-02  3.549e-03  -5.402 6.58e-08 ***
## pH                  2.642e-03  4.097e-03   0.645  0.51889    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.38027    0.01948   19.52   <2e-16 ***
## original_stars -2.18435    0.02824  -77.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 23 
## Log-likelihood: -7.297e+04 on 18 Df

It’s quite interesting how this one change changed the model fairly significantly. We will finish removing variables, again using a backward elimination process, and then add more commentary.

## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + LabelAppeal + original_stars + 
##     AcidIndex + Chlorides + Density + VolatileAcidity | original_stars, 
##     data = train_data_imputed, dist = "poisson")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18913 -0.51720  0.01805  0.40872  2.87298 
## 
## Count model coefficients (poisson with log link):
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.9344984  0.1087370   8.594  < 2e-16 ***
## Alcohol          0.0064163  0.0007475   8.583  < 2e-16 ***
## LabelAppeal-1    0.3881827  0.0214393  18.106  < 2e-16 ***
## LabelAppeal0     0.6624722  0.0209931  31.557  < 2e-16 ***
## LabelAppeal1     0.8560520  0.0213906  40.020  < 2e-16 ***
## LabelAppeal2     1.0192949  0.0239632  42.536  < 2e-16 ***
## original_stars   0.0974396  0.0027784  35.071  < 2e-16 ***
## AcidIndex       -0.0271256  0.0026596 -10.199  < 2e-16 ***
## Chlorides       -0.0273112  0.0087975  -3.104  0.00191 ** 
## Density         -0.3075116  0.1061497  -2.897  0.00377 ** 
## VolatileAcidity -0.0191426  0.0035470  -5.397 6.78e-08 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.38060    0.01947   19.55   <2e-16 ***
## original_stars -2.18398    0.02821  -77.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 19 
## Log-likelihood: -7.297e+04 on 13 Df

There are many observations to be made. The first is that we used the p-value to eliminate predictors that were not significant, and it is striking that we were able to eliminate five variables once we switched to a zero-inflated model. Second, the quadratic alcohol term was one of those terms that was no longer significant. We were also able to eliminate Sulphates, pH, and the SulfurDioxide variables. As for the variables that persisted, the effects are not all that different: Label Appeal and Stars have a positive effect, chemical properties have negative effects. The key difference is that Alcohol now has a positive effect, but that’s intuitive now that the previously positively impacting quadratic term is now removed.

As for the Stars variable (here called original_stars), again higher star ratings are associated with a higher log count of cases ordered. It’s also the case that wines with no star ratings are more likely to have zero cases ordered.

Let’s now generate predictions for the two Poisson models:

## MAE Poisson Model:  1.016386
## RMSE Poisson Model:  1.275739
## MAE ZIP Model:  0.9956513
## RMSE ZIP Model:  1.284232

Again, we will compare all models once all models are built, although it’s worth noting that both MAE and RMSE values are pretty close to 1, which might be acceptable. But before we get ahead of ourselves, let’s build the next two models.

Negative Binomial

There is reason to believe that switching to a negative binomial model will yield better results. Specifically, the negative binomial is appropriate when we are working with count data that has over-dispersion (the variance is greater than the mean). Now, it is true that earlier we saw under-dispersion relative to what one Poisson model expects. However, the truth is that it is really worthwhile to more get a direct measure of the dispersion in the outcome variable, before even modelling:

observed_variance <- var(train_data_imputed$TARGET)
expected_mean <- mean(train_data_imputed$TARGET)
print(observed_variance / expected_mean)
## [1] 1.225267

We see that this dispersion statistic is greater than 1. This suggests that we ought to try a negative binomial model.

Negative Binomial Model 1

Much like earlier, we will start with a relatively simple model, at first using all the variables as well as the interaction terms attempted earlier, and then engaging in variable selection.

As a reminder, those additional variables are:

  1. Alcohol:LabelAppeal
  2. STARS:Alcohol
  3. Alcohol^2

(We omit LabelAppeal:STARS for the reason discussed earlier)

While it is true that most or all of these additional variables were not significant in the previous two models, it does not follow that they’ll be insignificant in the negative binomial models.

## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + CitricAcid + Density + FixedAcidity + 
##     FreeSulfurDioxide + ResidualSugar + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH + Alcohol:LabelAppeal + STARS:Alcohol, 
##     data = train_data_imputed, init.theta = 41158.99707, link = log)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.478e+00  1.211e-01  12.204  < 2e-16 ***
## Alcohol                1.814e-03  5.755e-03   0.315 0.752556    
## I(Alcohol^2)           2.115e-04  1.027e-04   2.059 0.039449 *  
## LabelAppeal-1          3.036e-01  6.324e-02   4.801 1.58e-06 ***
## LabelAppeal0           5.301e-01  6.170e-02   8.592  < 2e-16 ***
## LabelAppeal1           6.403e-01  6.282e-02  10.192  < 2e-16 ***
## LabelAppeal2           9.446e-01  7.225e-02  13.074  < 2e-16 ***
## STARS2                 2.554e-01  2.234e-02  11.431  < 2e-16 ***
## STARS3                 4.111e-01  2.498e-02  16.455  < 2e-16 ***
## STARS4                 5.284e-01  3.691e-02  14.318  < 2e-16 ***
## STARSUnrated          -7.541e-01  3.044e-02 -24.773  < 2e-16 ***
## AcidIndex             -7.804e-02  2.445e-03 -31.912  < 2e-16 ***
## Chlorides             -3.866e-02  8.609e-03  -4.491 7.09e-06 ***
## CitricAcid             6.276e-03  3.184e-03   1.971 0.048712 *  
## Density               -3.394e-01  1.028e-01  -3.302 0.000959 ***
## FixedAcidity           1.370e-04  4.367e-04   0.314 0.753627    
## FreeSulfurDioxide      9.861e-05  1.840e-05   5.359 8.38e-08 ***
## ResidualSugar          1.619e-04  8.112e-05   1.996 0.045968 *  
## Sulphates             -8.979e-03  2.951e-03  -3.043 0.002344 ** 
## TotalSulfurDioxide     8.624e-05  1.179e-05   7.318 2.52e-13 ***
## VolatileAcidity       -3.567e-02  3.462e-03 -10.304  < 2e-16 ***
## pH                    -1.513e-02  3.988e-03  -3.793 0.000149 ***
## Alcohol:LabelAppeal-1 -3.225e-03  5.470e-03  -0.590 0.555504    
## Alcohol:LabelAppeal0  -6.556e-03  5.327e-03  -1.231 0.218412    
## Alcohol:LabelAppeal1  -3.875e-03  5.435e-03  -0.713 0.475857    
## Alcohol:LabelAppeal2  -1.860e-02  6.303e-03  -2.951 0.003163 ** 
## Alcohol:STARS2         6.739e-03  2.015e-03   3.345 0.000824 ***
## Alcohol:STARS3         3.168e-03  2.222e-03   1.426 0.153880    
## Alcohol:STARS4         2.983e-03  3.217e-03   0.927 0.353757    
## Alcohol:STARSUnrated   9.260e-04  2.781e-03   0.333 0.739116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41159) family taken to be 1)
## 
##     Null deviance: 80039  on 44794  degrees of freedom
## Residual deviance: 48065  on 44765  degrees of freedom
## AIC: 159965
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41159 
##           Std. Err.:  18751 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -159903.1

If the previous model was any indication, there’s likely high collinearity. Let’s check:

VIF Values for NB Model
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
Alcohol 62.614401 61.475476 63.774775 7.912926 0.0159708 0.0156802 0.0162666
I(Alcohol^2) 9.975600 9.801932 10.152694 3.158417 0.1002446 0.0984960 0.1020207
LabelAppeal 9648.864165 9471.939272 9829.094160 98.228632 0.0001036 0.0001017 0.0001056
STARS 8808.209861 8646.700301 8972.736567 93.852064 0.0001135 0.0001114 0.0001157
AcidIndex 1.063458 1.053841 1.074792 1.031241 0.9403290 0.9304130 0.9489096
Chlorides 1.005769 1.001139 1.029227 1.002880 0.9942643 0.9716033 0.9988627
CitricAcid 1.008438 1.002766 1.025737 1.004210 0.9916325 0.9749088 0.9972411
Density 1.005458 1.000983 1.030296 1.002725 0.9945714 0.9705952 0.9990176
FixedAcidity 1.024270 1.016272 1.036199 1.012062 0.9763050 0.9650655 0.9839884
FreeSulfurDioxide 1.006183 1.001359 1.028133 1.003087 0.9938549 0.9726368 0.9986429
ResidualSugar 1.003295 1.000195 1.055652 1.001646 0.9967154 0.9472821 0.9998049
Sulphates 1.004457 1.000549 1.036204 1.002226 0.9955626 0.9650611 0.9994516
TotalSulfurDioxide 1.005488 1.000998 1.030182 1.002740 0.9945418 0.9707025 0.9990031
VolatileAcidity 1.007461 1.002119 1.026271 1.003724 0.9925939 0.9744014 0.9978853
pH 1.008664 1.002923 1.025681 1.004323 0.9914107 0.9749619 0.9970857
Alcohol:LabelAppeal 61459.298561 60332.310032 62607.339231 247.909860 0.0000163 0.0000160 0.0000166
Alcohol:STARS 13147.088803 12906.015905 13392.665078 114.660755 0.0000761 0.0000747 0.0000775

Yet again, there are some extraordinarily high VIF values, likely because of the interaction terms. Again, then, we consult the VIF, remodel, and repeat until there are no variables with high collinearity.

## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + CitricAcid + Density + FixedAcidity + 
##     FreeSulfurDioxide + ResidualSugar + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH, data = train_data_imputed, init.theta = 41131.90506, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.505e+00  1.065e-01  14.134  < 2e-16 ***
## Alcohol            -4.228e-04  2.289e-03  -0.185 0.853454    
## I(Alcohol^2)        2.183e-04  1.023e-04   2.134 0.032803 *  
## LabelAppeal-1       2.688e-01  2.074e-02  12.961  < 2e-16 ***
## LabelAppeal0        4.599e-01  2.027e-02  22.689  < 2e-16 ***
## LabelAppeal1        5.983e-01  2.060e-02  29.040  < 2e-16 ***
## LabelAppeal2        7.446e-01  2.318e-02  32.126  < 2e-16 ***
## STARS2              3.258e-01  7.656e-03  42.552  < 2e-16 ***
## STARS3              4.437e-01  8.361e-03  53.073  < 2e-16 ***
## STARS4              5.586e-01  1.150e-02  48.562  < 2e-16 ***
## STARSUnrated       -7.445e-01  1.046e-02 -71.178  < 2e-16 ***
## AcidIndex          -7.794e-02  2.442e-03 -31.914  < 2e-16 ***
## Chlorides          -3.930e-02  8.604e-03  -4.568 4.92e-06 ***
## CitricAcid          6.049e-03  3.183e-03   1.900 0.057412 .  
## Density            -3.418e-01  1.027e-01  -3.328 0.000875 ***
## FixedAcidity        1.468e-04  4.366e-04   0.336 0.736665    
## FreeSulfurDioxide   9.914e-05  1.840e-05   5.387 7.16e-08 ***
## ResidualSugar       1.618e-04  8.110e-05   1.995 0.046016 *  
## Sulphates          -8.780e-03  2.950e-03  -2.976 0.002916 ** 
## TotalSulfurDioxide  8.684e-05  1.178e-05   7.374 1.66e-13 ***
## VolatileAcidity    -3.535e-02  3.460e-03 -10.218  < 2e-16 ***
## pH                 -1.541e-02  3.987e-03  -3.864 0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41131.91) family taken to be 1)
## 
##     Null deviance: 80039  on 44794  degrees of freedom
## Residual deviance: 48097  on 44773  degrees of freedom
## AIC: 159982
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41132 
##           Std. Err.:  18745 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -159935.6

There are still a few columns that don’t have significant predictors, and so again we consult the p-values to remove them one at a time:

## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + CitricAcid + Density + FreeSulfurDioxide + 
##     ResidualSugar + Sulphates + TotalSulfurDioxide + VolatileAcidity + 
##     pH, data = train_data_imputed, init.theta = 41133.97167, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.504e+00  1.065e-01  14.131  < 2e-16 ***
## Alcohol            -4.237e-04  2.289e-03  -0.185 0.853166    
## I(Alcohol^2)        2.184e-04  1.023e-04   2.135 0.032731 *  
## LabelAppeal-1       2.687e-01  2.074e-02  12.959  < 2e-16 ***
## LabelAppeal0        4.599e-01  2.027e-02  22.687  < 2e-16 ***
## LabelAppeal1        5.982e-01  2.060e-02  29.038  < 2e-16 ***
## LabelAppeal2        7.445e-01  2.318e-02  32.124  < 2e-16 ***
## STARS2              3.258e-01  7.656e-03  42.551  < 2e-16 ***
## STARS3              4.438e-01  8.360e-03  53.080  < 2e-16 ***
## STARS4              5.586e-01  1.150e-02  48.562  < 2e-16 ***
## STARSUnrated       -7.445e-01  1.046e-02 -71.178  < 2e-16 ***
## AcidIndex          -7.783e-02  2.418e-03 -32.182  < 2e-16 ***
## Chlorides          -3.934e-02  8.603e-03  -4.573 4.81e-06 ***
## CitricAcid          6.072e-03  3.183e-03   1.908 0.056420 .  
## Density            -3.413e-01  1.027e-01  -3.323 0.000890 ***
## FreeSulfurDioxide   9.916e-05  1.840e-05   5.388 7.11e-08 ***
## ResidualSugar       1.616e-04  8.110e-05   1.993 0.046291 *  
## Sulphates          -8.753e-03  2.949e-03  -2.969 0.002992 ** 
## TotalSulfurDioxide  8.678e-05  1.178e-05   7.369 1.72e-13 ***
## VolatileAcidity    -3.534e-02  3.460e-03 -10.215  < 2e-16 ***
## pH                 -1.539e-02  3.987e-03  -3.861 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41133.97) family taken to be 1)
## 
##     Null deviance: 80039  on 44794  degrees of freedom
## Residual deviance: 48097  on 44774  degrees of freedom
## AIC: 159980
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41134 
##           Std. Err.:  18747 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -159935.7
## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + Density + FreeSulfurDioxide + 
##     ResidualSugar + Sulphates + TotalSulfurDioxide + VolatileAcidity + 
##     pH, data = train_data_imputed, init.theta = 41133.49204, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.507e+00  1.064e-01  14.154  < 2e-16 ***
## Alcohol            -4.679e-04  2.289e-03  -0.204 0.838026    
## I(Alcohol^2)        2.214e-04  1.023e-04   2.165 0.030391 *  
## LabelAppeal-1       2.683e-01  2.073e-02  12.938  < 2e-16 ***
## LabelAppeal0        4.595e-01  2.027e-02  22.667  < 2e-16 ***
## LabelAppeal1        5.979e-01  2.060e-02  29.022  < 2e-16 ***
## LabelAppeal2        7.445e-01  2.318e-02  32.125  < 2e-16 ***
## STARS2              3.260e-01  7.655e-03  42.579  < 2e-16 ***
## STARS3              4.438e-01  8.360e-03  53.085  < 2e-16 ***
## STARS4              5.589e-01  1.150e-02  48.592  < 2e-16 ***
## STARSUnrated       -7.447e-01  1.046e-02 -71.196  < 2e-16 ***
## AcidIndex          -7.752e-02  2.413e-03 -32.128  < 2e-16 ***
## Chlorides          -3.967e-02  8.601e-03  -4.612 4.00e-06 ***
## Density            -3.436e-01  1.027e-01  -3.345 0.000821 ***
## FreeSulfurDioxide   9.935e-05  1.840e-05   5.398 6.72e-08 ***
## ResidualSugar       1.585e-04  8.107e-05   1.955 0.050578 .  
## Sulphates          -8.817e-03  2.949e-03  -2.990 0.002789 ** 
## TotalSulfurDioxide  8.687e-05  1.177e-05   7.378 1.61e-13 ***
## VolatileAcidity    -3.541e-02  3.459e-03 -10.235  < 2e-16 ***
## pH                 -1.541e-02  3.987e-03  -3.865 0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41133.49) family taken to be 1)
## 
##     Null deviance: 80039  on 44794  degrees of freedom
## Residual deviance: 48101  on 44775  degrees of freedom
## AIC: 159981
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41133 
##           Std. Err.:  18748 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -159939.4
## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + 
##     STARS + AcidIndex + Chlorides + Density + FreeSulfurDioxide + 
##     Sulphates + TotalSulfurDioxide + VolatileAcidity + pH, data = train_data_imputed, 
##     init.theta = 41129.92159, link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.507e+00  1.065e-01  14.158  < 2e-16 ***
## Alcohol            -4.549e-04  2.289e-03  -0.199 0.842451    
## I(Alcohol^2)        2.199e-04  1.022e-04   2.151 0.031464 *  
## LabelAppeal-1       2.685e-01  2.073e-02  12.948  < 2e-16 ***
## LabelAppeal0        4.595e-01  2.027e-02  22.668  < 2e-16 ***
## LabelAppeal1        5.980e-01  2.060e-02  29.028  < 2e-16 ***
## LabelAppeal2        7.450e-01  2.317e-02  32.149  < 2e-16 ***
## STARS2              3.262e-01  7.654e-03  42.617  < 2e-16 ***
## STARS3              4.440e-01  8.360e-03  53.111  < 2e-16 ***
## STARS4              5.589e-01  1.150e-02  48.594  < 2e-16 ***
## STARSUnrated       -7.447e-01  1.046e-02 -71.200  < 2e-16 ***
## AcidIndex          -7.751e-02  2.413e-03 -32.128  < 2e-16 ***
## Chlorides          -3.975e-02  8.601e-03  -4.622 3.80e-06 ***
## Density            -3.440e-01  1.027e-01  -3.349 0.000811 ***
## FreeSulfurDioxide   9.996e-05  1.840e-05   5.432 5.56e-08 ***
## Sulphates          -8.901e-03  2.949e-03  -3.019 0.002539 ** 
## TotalSulfurDioxide  8.728e-05  1.177e-05   7.413 1.23e-13 ***
## VolatileAcidity    -3.551e-02  3.459e-03 -10.266  < 2e-16 ***
## pH                 -1.527e-02  3.986e-03  -3.830 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41129.92) family taken to be 1)
## 
##     Null deviance: 80039  on 44794  degrees of freedom
## Residual deviance: 48105  on 44776  degrees of freedom
## AIC: 159983
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41130 
##           Std. Err.:  18745 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -159943.2

And so we arrive at a negative binomial model with some very interesting results. In particular, the estimates and standard errors are nearly identical as with the simple Poisson model! In fact, even the AIC values are quite similar.

Comparison of Simple Poisson and Negatiive Binomial Models
Term Poisson_Estimate NB_Estimate Poisson_Std_Error NB_Std_Error
(Intercept) 1.5071 1.5071 0.1064 0.1065
Alcohol -0.0005 -0.0005 0.0023 0.0023
I(Alcohol^2) 0.0002 0.0002 0.0001 0.0001
LabelAppeal-1 0.2685 0.2685 0.0207 0.0207
LabelAppeal0 0.4595 0.4595 0.0203 0.0203
LabelAppeal1 0.5980 0.5980 0.0206 0.0206
LabelAppeal2 0.7450 0.7450 0.0232 0.0232
STARS2 0.3262 0.3262 0.0077 0.0077
STARS3 0.4440 0.4440 0.0084 0.0084
STARS4 0.5589 0.5589 0.0115 0.0115
STARSUnrated -0.7447 -0.7447 0.0105 0.0105
AcidIndex -0.0775 -0.0775 0.0024 0.0024
Chlorides -0.0398 -0.0398 0.0086 0.0086
Density -0.3439 -0.3440 0.1027 0.1027
FreeSulfurDioxide 0.0001 0.0001 0.0000 0.0000
Sulphates -0.0089 -0.0089 0.0029 0.0029
TotalSulfurDioxide 0.0001 0.0001 0.0000 0.0000
VolatileAcidity -0.0355 -0.0355 0.0035 0.0035
pH -0.0153 -0.0153 0.0040 0.0040

This is striking at first, and it is possible that it’s due to convergence issues with the negative binomial model. It also seems that the dispersion is quite close to Poisson Assumptions (i.e. mean approximately equal to the variance), which is supported by the very high value of theta. Again, model selection will occur after all the models are built, but it certainly seems that there is no reason to accept this model over the simple Poisson one.

Before we give up entirely on a negative binomial model, though, let’s try a zero-inflated model as we did earlier:

Negative Binomial Model 2 (Zero-Inflated)

We mimic the approach from earlier, starting by creating a zero-inflated model using the same variables as the most recent Negative Binomial model as that provides a strong baseline:

## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + I(Alcohol^2) + LabelAppeal + original_stars + 
##     AcidIndex + Chlorides + Density + FreeSulfurDioxide + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity + pH | original_stars, data = train_data_imputed, 
##     dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.18618 -0.51967  0.01759  0.40947  2.87559 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.255e-01  1.102e-01   8.400  < 2e-16 ***
## Alcohol             6.329e-03  2.298e-03   2.755  0.00588 ** 
## I(Alcohol^2)        5.381e-06  1.008e-04   0.053  0.95745    
## LabelAppeal-1       3.887e-01  2.144e-02  18.126  < 2e-16 ***
## LabelAppeal0        6.626e-01  2.100e-02  31.558  < 2e-16 ***
## LabelAppeal1        8.559e-01  2.139e-02  40.004  < 2e-16 ***
## LabelAppeal2        1.019e+00  2.397e-02  42.532  < 2e-16 ***
## original_stars      9.768e-02  2.783e-03  35.103  < 2e-16 ***
## AcidIndex          -2.690e-02  2.669e-03 -10.080  < 2e-16 ***
## Chlorides          -2.682e-02  8.807e-03  -3.045  0.00233 ** 
## Density            -3.118e-01  1.062e-01  -2.937  0.00332 ** 
## FreeSulfurDioxide   3.227e-05  1.862e-05   1.733  0.08304 .  
## Sulphates           1.622e-04  3.024e-03   0.054  0.95724    
## TotalSulfurDioxide  1.201e-05  1.173e-05   1.024  0.30569    
## VolatileAcidity    -1.917e-02  3.549e-03  -5.402 6.58e-08 ***
## pH                  2.646e-03  4.097e-03   0.646  0.51831    
## Log(theta)          1.795e+01        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.38026    0.01948   19.52   <2e-16 ***
## original_stars -2.18434    0.02824  -77.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 62194291.0775 
## Number of iterations in BFGS optimization: 66 
## Log-likelihood: -7.297e+04 on 19 Df

Notice again, the extremely high theta value suggests that the variance is very close to that of a Poisson distribution. The similarites in estimates and errors are thus predictable. Still, we’ll complete the backward elimination before doing a proper comparison with the Poisson zero-inflated model.

## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + LabelAppeal + original_stars + 
##     AcidIndex + Chlorides + Density + VolatileAcidity | original_stars, 
##     data = train_data_imputed, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1894 -0.5172  0.0179  0.4088  2.8725 
## 
## Count model coefficients (negbin with log link):
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.9341047  0.1087374   8.590  < 2e-16 ***
## Alcohol          0.0064161  0.0007475   8.583  < 2e-16 ***
## LabelAppeal-1    0.3882117  0.0214397  18.107  < 2e-16 ***
## LabelAppeal0     0.6625338  0.0209936  31.559  < 2e-16 ***
## LabelAppeal1     0.8562690  0.0213910  40.029  < 2e-16 ***
## LabelAppeal2     1.0194905  0.0239634  42.544  < 2e-16 ***
## original_stars   0.0974036  0.0027784  35.058  < 2e-16 ***
## AcidIndex       -0.0271432  0.0026596 -10.206  < 2e-16 ***
## Chlorides       -0.0273333  0.0087975  -3.107  0.00189 ** 
## Density         -0.3069899  0.1061500  -2.892  0.00383 ** 
## VolatileAcidity -0.0191950  0.0035470  -5.412 6.25e-08 ***
## Log(theta)      12.1996101  1.9602807   6.223 4.86e-10 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.38079    0.01947   19.56   <2e-16 ***
## original_stars -2.18488    0.02823  -77.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 198711.6575 
## Number of iterations in BFGS optimization: 30 
## Log-likelihood: -7.297e+04 on 14 Df

Note, these are again the same coefficients as in the zero-inflated Poisson–this despite the fact that we only used the p-values to guide our variable selection at this latter phase.

Comparison of Zero-Inflated Poisson and Negatiive Binomial Models
Term ZIP_Estimate ZINB_Estimate ZIP_Std_Error ZINB_Std_Error
(Intercept) 0.9261 0.9255 0.1102 0.1102
Alcohol 0.0063 0.0063 0.0023 0.0023
I(Alcohol^2) 0.0000 0.0000 0.0001 0.0001
LabelAppeal-1 0.3886 0.3887 0.0214 0.0214
LabelAppeal0 0.6625 0.6626 0.0210 0.0210
LabelAppeal1 0.8558 0.8559 0.0214 0.0214
LabelAppeal2 1.0194 1.0195 0.0240 0.0240
original_stars 0.0977 0.0977 0.0028 0.0028
AcidIndex -0.0269 -0.0269 0.0027 0.0027
Chlorides -0.0268 -0.0268 0.0088 0.0088
Density -0.3123 -0.3118 0.1062 0.1062
FreeSulfurDioxide 0.0000 0.0000 0.0000 0.0000
Sulphates 0.0002 0.0002 0.0030 0.0030
TotalSulfurDioxide 0.0000 0.0000 0.0000 0.0000
VolatileAcidity -0.0192 -0.0192 0.0035 0.0035
pH 0.0026 0.0026 0.0041 0.0041
Log(theta) NA 17.9458 NA NaN

So again, we are looking at nearly identical statistics from the zero-inflated Poisson to the zero-inflated negative binomial. Also again, we should prefer the zero-inflated Poisson to the zero-inflated negative binomial, since the distrubtion appars to b eclose neough to a Poisson.

And what of the comparison beetwene the two ngative binomial models? Of course, it is going to look extremeely similar to the comparison btween the two Poisson models. Still, we add it below for sake of completeneess:

## MAE Negative Binomial Model:  1.016387
## RMSE Negative Binomial Model:  1.27574
## MAE Zero-Inflated Negative Binomial Model:  0.9956453
## RMSE Zero-Inflated Negative Binomial Model:  1.28426

And indeed, the comparison is extremely similar as to earlier.

Multiple Linear Regression

We will look at a more direct approach with multiple linear regression using our normalized, transformed variables.

Looking at each predictor variable, we see that each predictor besides FixedAcidity_transformed are statistically significant within a 95% confidence level We also see that our \(Adj. R^2 = 0.5405\), is where our model accounts on average for 54% of the variation of the TARGET variable.

## 
## Call:
## lm(formula = as.numeric(as.character(TARGET)) ~ AcidIndex + FixedAcidity_transformed + 
##     VolatileAcidity_transformed + CitricAcid_transformed + ResidualSugar_transformed + 
##     Chlorides_transformed + FreeSulfurDioxide_transformed + TotalSulfurDioxide_transformed + 
##     Density_transformed + pH_transformed + Sulphates_transformed + 
##     Alcohol_transformed + STARS.1 + STARS.2 + STARS.3 + STARS.4 + 
##     STARS.Unrated + LabelAppeal, data = train_data_prepped)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7226 -0.8468  0.0138  0.8399  6.1074 
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.968728   0.051007  38.597  < 2e-16 ***
## AcidIndex                      -0.190274   0.004985 -38.169  < 2e-16 ***
## FixedAcidity_transformed        0.009494   0.006384   1.487 0.136974    
## VolatileAcidity_transformed    -0.105261   0.006209 -16.954  < 2e-16 ***
## CitricAcid_transformed          0.032272   0.006216   5.192 2.09e-07 ***
## ResidualSugar_transformed       0.020849   0.006188   3.369 0.000754 ***
## Chlorides_transformed          -0.062869   0.006205 -10.132  < 2e-16 ***
## FreeSulfurDioxide_transformed   0.062972   0.006206  10.147  < 2e-16 ***
## TotalSulfurDioxide_transformed  0.066956   0.006205  10.790  < 2e-16 ***
## Density_transformed            -0.051618   0.006209  -8.314  < 2e-16 ***
## pH_transformed                 -0.040865   0.006203  -6.588 4.52e-11 ***
## Sulphates_transformed          -0.028562   0.006191  -4.613 3.97e-06 ***
## Alcohol_transformed             0.061323   0.006214   9.868  < 2e-16 ***
## STARS.1                         1.311245   0.017628  74.384  < 2e-16 ***
## STARS.2                         2.354115   0.017173 137.080  < 2e-16 ***
## STARS.3                         2.906541   0.020015 145.219  < 2e-16 ***
## STARS.4                         3.564148   0.031432 113.393  < 2e-16 ***
## STARS.Unrated                         NA         NA      NA       NA    
## LabelAppeal-1                   0.413529   0.033736  12.258  < 2e-16 ***
## LabelAppeal0                    0.886686   0.032975  26.890  < 2e-16 ***
## LabelAppeal1                    1.375899   0.034432  39.960  < 2e-16 ***
## LabelAppeal2                    1.983104   0.045480  43.604  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 44774 degrees of freedom
## Multiple R-squared:  0.5407, Adjusted R-squared:  0.5405 
## F-statistic:  2635 on 20 and 44774 DF,  p-value: < 2.2e-16

Calculating the RMSE for our training multiple regression model, we obtain an \(RMSE = 1.3057\)

## [1] 1.305704

Stepwise Regression

We will create a stepwise regression model, to perform forward and backward elimination on our multiple linear regression attempt.

The results show that we strictly removed the STARS.Unrated column which makes sense, as it does not provide additional information as the other STARS columns accounts for it. The \(Adj. R^2 = 0.5405\) which stayed the same as before, and the RMSE stayed the same as well.

## Start:  AIC=23939.46
## as.numeric(as.character(TARGET)) ~ AcidIndex + FixedAcidity_transformed + 
##     VolatileAcidity_transformed + CitricAcid_transformed + ResidualSugar_transformed + 
##     Chlorides_transformed + FreeSulfurDioxide_transformed + TotalSulfurDioxide_transformed + 
##     Density_transformed + pH_transformed + Sulphates_transformed + 
##     Alcohol_transformed + STARS.1 + STARS.2 + STARS.3 + STARS.4 + 
##     STARS.Unrated + LabelAppeal
## 
## 
## Step:  AIC=23939.46
## as.numeric(as.character(TARGET)) ~ AcidIndex + FixedAcidity_transformed + 
##     VolatileAcidity_transformed + CitricAcid_transformed + ResidualSugar_transformed + 
##     Chlorides_transformed + FreeSulfurDioxide_transformed + TotalSulfurDioxide_transformed + 
##     Density_transformed + pH_transformed + Sulphates_transformed + 
##     Alcohol_transformed + STARS.1 + STARS.2 + STARS.3 + STARS.4 + 
##     LabelAppeal
## 
##                                  Df Sum of Sq    RSS   AIC
## <none>                                         76369 23939
## - FixedAcidity_transformed        1         4  76373 23940
## - ResidualSugar_transformed       1        19  76389 23949
## - Sulphates_transformed           1        36  76406 23959
## - CitricAcid_transformed          1        46  76415 23964
## - pH_transformed                  1        74  76443 23981
## - Density_transformed             1       118  76487 24007
## - Alcohol_transformed             1       166  76535 24035
## - Chlorides_transformed           1       175  76544 24040
## - FreeSulfurDioxide_transformed   1       176  76545 24040
## - TotalSulfurDioxide_transformed  1       199  76568 24054
## - VolatileAcidity_transformed     1       490  76860 24224
## - AcidIndex                       1      2485  78854 25372
## - LabelAppeal                     4      7629  83999 28197
## - STARS.1                         1      9437  85807 29157
## - STARS.4                         1     21931  98301 35246
## - STARS.2                         1     32051 108420 39635
## - STARS.3                         1     35970 112339 41226
## 
## Call:
## lm(formula = as.numeric(as.character(TARGET)) ~ AcidIndex + FixedAcidity_transformed + 
##     VolatileAcidity_transformed + CitricAcid_transformed + ResidualSugar_transformed + 
##     Chlorides_transformed + FreeSulfurDioxide_transformed + TotalSulfurDioxide_transformed + 
##     Density_transformed + pH_transformed + Sulphates_transformed + 
##     Alcohol_transformed + STARS.1 + STARS.2 + STARS.3 + STARS.4 + 
##     LabelAppeal, data = train_data_prepped)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7226 -0.8468  0.0138  0.8399  6.1074 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.968728   0.051007  38.597  < 2e-16 ***
## AcidIndex                      -0.190274   0.004985 -38.169  < 2e-16 ***
## FixedAcidity_transformed        0.009494   0.006384   1.487 0.136974    
## VolatileAcidity_transformed    -0.105261   0.006209 -16.954  < 2e-16 ***
## CitricAcid_transformed          0.032272   0.006216   5.192 2.09e-07 ***
## ResidualSugar_transformed       0.020849   0.006188   3.369 0.000754 ***
## Chlorides_transformed          -0.062869   0.006205 -10.132  < 2e-16 ***
## FreeSulfurDioxide_transformed   0.062972   0.006206  10.147  < 2e-16 ***
## TotalSulfurDioxide_transformed  0.066956   0.006205  10.790  < 2e-16 ***
## Density_transformed            -0.051618   0.006209  -8.314  < 2e-16 ***
## pH_transformed                 -0.040865   0.006203  -6.588 4.52e-11 ***
## Sulphates_transformed          -0.028562   0.006191  -4.613 3.97e-06 ***
## Alcohol_transformed             0.061323   0.006214   9.868  < 2e-16 ***
## STARS.1                         1.311245   0.017628  74.384  < 2e-16 ***
## STARS.2                         2.354115   0.017173 137.080  < 2e-16 ***
## STARS.3                         2.906541   0.020015 145.219  < 2e-16 ***
## STARS.4                         3.564148   0.031432 113.393  < 2e-16 ***
## LabelAppeal-1                   0.413529   0.033736  12.258  < 2e-16 ***
## LabelAppeal0                    0.886686   0.032975  26.890  < 2e-16 ***
## LabelAppeal1                    1.375899   0.034432  39.960  < 2e-16 ***
## LabelAppeal2                    1.983104   0.045480  43.604  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 44774 degrees of freedom
## Multiple R-squared:  0.5407, Adjusted R-squared:  0.5405 
## F-statistic:  2635 on 20 and 44774 DF,  p-value: < 2.2e-16
## [1] 1.305704

Model Selection

Let’s now look at all the models compared to each other based on their results with using the test data. The best model in comparison to the others is the original theory of using Poisson with the best \(R^2=0.5611089\) and the lowest \(RMSE=1.2757393\). We could go with the basis of the AIC metric, however, our goal is to predict the best results with our evaluation set. Even with all the transformations, interactions and trying to handle zero inflation factors, a Poisson model is our best choice to predict on the evaluation set.

Again AIC would be a better choice if we wanted to have more of an inference between the models.

Model Performance
Poisson ZI_Poisson Neg_Binom ZI_Neg_Binom Stepwise
RMSE 1.2757393 1.2842321 1.2757404 1.2842601 1.2889856
Rsquared 0.5611089 0.5554209 0.5611082 0.5554045 0.5519412
MAE 1.0163856 0.9956513 1.0163873 0.9956453 1.0208232
AIC 159979.7702125 145969.2867016 159983.1967400 145971.8130167 151064.1672811

Predictions

Predictions

As we have now selected our models, we are ready to make predictions on the evaluation set. This is a slightly complicated process because our second model is dependent on our first one. We complete this process below:

Preview: Predictions for Evaluation Dataset
TARGET
3
4
2
2
2
5
3
5
1
3

Conclusion:

In this project, we aimed to develop predictive models for wine sales using statistical techniques and machine learning algorithms. We started by exploring the data, handling missing values, and transforming variables. Initially, we experimented with Poisson regression and zero-inflated Poisson models, as well as negative binomial regression to address over-dispersion. However, the simple Poisson model emerged as the best performer. We refined our models through stepwise regression but found no significant improvement over the simple Poisson model. Overall, our models offer valuable insights for wine producers and distributors, aiding in resource allocation and marketing strategy adjustments to optimize sales and enhance profitability in the wine industry.