Introduction

In this assignment, we analyze a data set containing 12,795 records of commercially available wines. Each record details the chemical properties of the wine, along with factors such as its label appeal and expert ratings. The goal is to predict the number of sample cases purchased by wine distribution companies. The primary objectives are as follows:

To build predictive models for estimating the number of sample cases ordered (TARGET) based on wine characteristics.
To evaluate and refine various regression models, with a focus on count regression techniques, including Poisson and negative binomial regression, to ensure accurate and interpretable predictions.

To accomplish these objectives, we will conduct an in-depth exploration of the data, investigating variable distributions, potential correlations with the target, and missing data patterns. Based on these insights, we will preprocess and transform the data, ensuring it is well-suited for modeling. Finally, we will construct and evaluate multiple regression models, selecting the best one based on performance metrics and interpretability to provide actionable insights for the wine manufacturer’s strategy.

## 'data.frame':    12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

The dataset consists of 12,795 observations and 16 variables that capture various chemical and non-chemical characteristics of wines. The target variable, TARGET, represents the number of wine sample cases purchased, with values ranging from 0 to 8 and a mean of approximately 3 cases. The features include chemical properties such as FixedAcidity, VolatileAcidity, ResidualSugar, and Alcohol, as well as marketing-related factors like LabelAppeal and expert ratings captured by STARS. Many variables exhibit wide ranges and potential skewness, and several include negative values that are implausible for chemical measurements: ResidualSugar spans from -127.8 to 141.15 with a mean of 5.4, and Alcohol ranges from -4.7 to 26.5 with a mean of 10.5. Missing values are present in eight variables, including Chlorides, FreeSulfurDioxide, and STARS. STARS is the most affected, with 3,359 missing records (about 26% of the data). This will necessitate imputation or alternative handling. Overall, the dataset presents diverse features with varying distributions, and initial exploration suggests the need for transformations and careful handling of missing data to ensure robust modeling.

DATA EXPLORATION

Descriptive Statistics

##                             Mean     Median       StdDev
## INDEX               8.069980e+03 8110.00000 4.656905e+03
## TARGET              3.029074e+00    3.00000 1.926368e+00
## FixedAcidity        7.075717e+00    6.90000 6.317643e+00
## VolatileAcidity     3.241039e-01    0.28000 7.840142e-01
## CitricAcid          3.084127e-01    0.31000 8.620798e-01
## ResidualSugar       5.418733e+00    3.90000 3.374938e+01
## Chlorides           5.482249e-02    0.04600 3.184673e-01
## FreeSulfurDioxide   3.084557e+01   30.00000 1.487146e+02
## TotalSulfurDioxide  1.207142e+02  123.00000 2.319132e+02
## Density             9.942027e-01    0.99449 2.653765e-02
## pH                  3.207628e+00    3.20000 6.796871e-01
## Sulphates           5.271118e-01    0.50000 9.321293e-01
## Alcohol             1.048924e+01   10.40000 3.727819e+00
## LabelAppeal        -9.066041e-03    0.00000 8.910892e-01
## AcidIndex           7.772724e+00    8.00000 1.323926e+00
## STARS               2.041755e+00    2.00000 9.025400e-01

##              INDEX             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##                  0                616                638                647 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##                682                  0                395               1210 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                653                  0                  0               3359

The summary statistics highlight key characteristics of the dataset and inform preprocessing steps. The target variable, TARGET, has a mean of about 3 cases purchased and a standard deviation of about 1.9, and as a non-negative integer it is a natural candidate for count regression models. Chemical features like FixedAcidity and VolatileAcidity have means close to their medians, while ResidualSugar is far more dispersed, with a standard deviation of 33.7 around a mean of 5.4. Among the non-chemical variables, LabelAppeal is centered near zero (neutral), while STARS, an ordinal expert rating from 1 to 4, has a mean of 2 but is missing in 3,359 records, suggesting mode imputation to preserve its predictive potential. Continuous variables like Sulphates, Chlorides, and Alcohol, which have missing values and wide ranges, may require mean, median, or KNN imputation. The large standard deviations of FreeSulfurDioxide (148.7) and TotalSulfurDioxide (231.9) suggest potential outliers, emphasizing the need for imputation, transformations, and careful scaling to optimize model performance.
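The two tables above (mean, median, standard deviation, and missing-value counts) can be reproduced in miniature. The following is a minimal Python sketch rather than the R code behind this report; it uses only the first ten TARGET and STARS values shown in the structure listing, with `None` standing in for `NA`, and the helper name `summarize` is an illustrative assumption.

```python
import statistics

# First ten values of two columns, copied from the structure listing above.
# None stands in for R's NA.
columns = {
    "TARGET": [3, 3, 5, 3, 4, 0, 0, 4, 3, 6],
    "STARS": [2, 3, 3, 1, 2, None, None, 3, None, 4],
}


def summarize(values):
    """Return mean, median, sample SD, and missing count, ignoring None."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),
        "n_missing": len(values) - len(present),
    }


summary = {name: summarize(vals) for name, vals in columns.items()}
for name, stats in summary.items():
    print(name, stats)
```

Computing the statistics on the observed values only mirrors the usual `na.rm = TRUE` behaviour, which is presumably how the table above was produced.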

Visualizations

We get a clear sense of the distribution for the target variable here. It looks roughly bell-shaped, with the exception of a large spike at 0. TARGET has no missing values, so this spike reflects wines for which zero cases were ordered.

The boxplot of LabelAppeal vs TARGET shows a clear pattern: an increase in LabelAppeal corresponds with an increase in the TARGET value.

We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that coefficient estimates in the regression models should be stable and easier to interpret. The strongest relationships we see are between STARS, LabelAppeal, and TARGET.
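For reference, the correlations discussed here are Pearson coefficients computed on pairs where both values are observed. A minimal Python sketch (not the report's R code) using the first ten STARS and TARGET values from the structure listing; the function name `pearson` and the pairwise-deletion choice are assumptions made for illustration.

```python
import math

# First ten rows of TARGET and STARS from the structure listing; None is NA.
target = [3, 3, 5, 3, 4, 0, 0, 4, 3, 6]
stars = [2, 3, 3, 1, 2, None, None, 3, None, 4]


def pearson(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson(stars, target)
print(f"correlation between STARS and TARGET on this small sample: {r:.3f}")
```

Ten rows are far too few to estimate the real correlation; the sketch only shows the calculation.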

Most of the predictors have roughly symmetric distributions. AcidIndex and STARS display some right skewness, for which we can consider some transformations.

The bar charts compare the three discrete variables against the TARGET variable. For AcidIndex, a large quantity of wine was sold with index numbers 7 and 8. LabelAppeal indicates that wines with generic labels tend to have a higher number of cases sold per order. Finally, STARS reveals that higher-star-rated wines are associated with a higher number of cases sold. Overall, for each of these predictors, there appears to be a significant relationship between their ordered levels and the number of wine cases sold.

Here we see a weak but positive relationship between Alcohol and TARGET. One plausible explanation is that buyers favor wines with a higher alcohol content, although the data cannot confirm the motive.

To better understand the negative values in our data set, we counted the number of negative values in each variable.

##              INDEX             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0               1621               2827 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##               2966               3136               3197               3036 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##               2504                  0                  0               2361 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                118               3640                  0                  0

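CitricAcid, VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol should be non-negative, so negative values here are likely invalid. The table also shows negative values in FixedAcidity, ResidualSugar, and Chlorides; these are not treated in the cleaning step that follows.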

LabelAppeal is a marketing score, and it can theoretically be negative if the label is poorly received, so negative values might be valid.

In summary, our exploration of the wine dataset has provided valuable insights into its structure, distributions, and potential challenges for predictive modeling. The target variable, TARGET, representing the number of wine cases purchased, is moderately variable and suitable for count regression models. Strong relationships with predictors like LabelAppeal and STARS suggest these variables are critical for predictive performance. However, missing data in several key variables, especially STARS, an ordinal rating that is potentially influential, must be addressed through imputation strategies. Negative values in chemical features like VolatileAcidity, Sulphates, and Alcohol likely indicate errors and need correction.

The dataset shows low multicollinearity between features, simplifying model assumptions, but high variability in chemical features such as ResidualSugar and FreeSulfurDioxide suggests potential outliers. Visualizations confirm meaningful relationships between predictors and the target variable, including the positive impact of LabelAppeal and higher STARS ratings on wine purchases. Transformations may be needed for skewed variables like AcidIndex and STARS, while mean, median, or KNN imputation can handle missing values in continuous variables.

Overall, the dataset presents a robust foundation for regression modeling, with the potential to yield actionable insights for predicting wine sales. Addressing data quality issues, handling missing values, and scaling features will be crucial next steps in preparing the data for reliable and interpretable modeling.

DATA PREPARATION

Missing/Negative Values

Let’s get a better sense of the number of missing values by plotting how many missing values we have for each variable.

Below, we replaced negative values with NA in the variables where they are invalid (CitricAcid, VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol). This approach treats them as missing data to avoid introducing biases or errors. The counts below confirm that no negative values remain in these columns after the replacement.

##         CitricAcid    VolatileAcidity  FreeSulfurDioxide TotalSulfurDioxide 
##                  0                  0                  0                  0 
##          Sulphates            Alcohol 
##                  0                  0
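The replacement step can be sketched as follows. This is a minimal Python illustration, not the R code used for the report; it uses the first ten VolatileAcidity values from the structure listing, `None` stands in for `NA`, and the helper names are hypothetical.

```python
# First ten VolatileAcidity values from the structure listing above.
volatile_acidity = [1.16, 0.16, 2.64, 0.385, 0.33, 0.32, 0.29, -1.22, 0.27, -0.22]


def count_negative(values):
    """Count values below zero, skipping missing entries."""
    return sum(1 for v in values if v is not None and v < 0)


def negatives_to_missing(values):
    """Return a copy in which every negative value is replaced by None."""
    return [None if (v is not None and v < 0) else v for v in values]


before = count_negative(volatile_acidity)
cleaned = negatives_to_missing(volatile_acidity)
after = count_negative(cleaned)
print(f"negative values before: {before}, after: {after}")
```

Note that this converts invalid entries into additional missing values, so the imputation step that follows has more cells to fill than the original missing-value counts suggest.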

Here, we impute the missing values in the STARS variable (categorical) with the mode, as it reflects the most frequent value and preserves the categorical nature of the variable. For the remaining variables, we use mean imputation to maintain the central tendency of the data. The counts below confirm that no missing values remain.

##              INDEX             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##                  0                  0                  0                  0 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##                  0                  0                  0                  0 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                  0                  0                  0                  0
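As an illustration of the two imputation rules, here is a minimal Python sketch (not the report's R code) applied to the first ten STARS and Alcohol values from the structure listing. The helper name `impute` is an assumption introduced for this example.

```python
import statistics

# First ten values from the structure listing; None stands in for NA.
stars = [2, 3, 3, 1, 2, None, None, 3, None, 4]
alcohol = [9.9, None, 22, 6.2, 13.7, 15.4, 10.3, 11.6, 15, 12.6]


def impute(values, fill):
    """Replace every missing entry with the supplied fill value."""
    return [fill if v is None else v for v in values]


# Mode for the categorical rating, mean for the continuous measurement.
stars_mode = statistics.mode([v for v in stars if v is not None])
alcohol_mean = statistics.mean([v for v in alcohol if v is not None])

stars_imputed = impute(stars, stars_mode)
alcohol_imputed = impute(alcohol, alcohol_mean)
print(stars_imputed)
print(alcohol_imputed)
```

Mean imputation leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind for a variable such as STARS where roughly a quarter of the values are filled in.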

Variable Transformation and New Feature Creation

We created some additional features:

Alcohol_Bucket: Categorizes wine into alcohol content ranges (Low, Medium, High, Very High) based on the Alcohol variable.
ResidualSugar_Bucket: Classifies wine into sweetness categories (Dry, Semi-Dry, Sweet, Very Sweet) based on the ResidualSugar level.
Alcohol_to_Sulphates: A ratio of Alcohol to Sulphates, capturing the relationship between alcohol content and sulphate levels in the wine.
Acidity_Index: A combined measure of acidity, calculated by summing FixedAcidity and VolatileAcidity, providing a comprehensive view of the wine’s acidity.
Sulphates_Alcohol_Interaction: The interaction term between Sulphates and Alcohol, examining their combined effect on wine quality.

To handle the skewness in STARS and AcidIndex, we compared log and Box-Cox transformations:

The Box-Cox transformation is the most effective for AcidIndex. For STARS, both the log transformation and the Box-Cox transformation are suitable; however, the discrete nature of the variable limits its potential to achieve full normality.
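For readers unfamiliar with the two transformations, the following minimal Python sketch shows the standard one-parameter Box-Cox formula, (x^lambda - 1) / lambda, which reduces to log(x) at lambda = 0. The report does not state the lambda estimated for AcidIndex, so `HYPOTHETICAL_LAMBDA` below is a placeholder, and the STARS list simply drops the missing entries from the first ten rows.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform; reduces to log(x) when lam is 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# Placeholder only: the lambda actually estimated for AcidIndex is not reported.
HYPOTHETICAL_LAMBDA = -0.5

acid_index = [8, 7, 8, 6, 9, 11, 8, 7, 6, 8]  # first ten AcidIndex values
stars = [2, 3, 3, 1, 2, 3, 4]                 # observed STARS values, same rows

boxcox_acid_index = [box_cox(v, HYPOTHETICAL_LAMBDA) for v in acid_index]
log_stars = [math.log(v) for v in stars]
print(boxcox_acid_index)
print(log_stars)
```

Both transforms are monotone, so they preserve the ordering of the original values while compressing the right tail.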

##          Alcohol_to_Sulphates Acidity_Index Sulphates_Alcohol_Interaction
## Negative 0                    1460          0                            
## na       0                    0             0                            
## nan      0                    0             0                            
## inf      22                   0             0
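The check above can be sketched in Python (not the report's R code). In R, dividing a non-zero number by zero yields `Inf`, which is the likely source of the infinite ratios, whereas Python raises an error, so the hypothetical helper `safe_ratio` mimics R's behaviour explicitly. The first row uses values from the third record in the structure listing; the second row is invented to trigger both issues.

```python
import math


def safe_ratio(num, den):
    """Mimic R's numeric division: x/0 is Inf or -Inf, and 0/0 is NaN."""
    if den != 0:
        return num / den
    if num == 0:
        return math.nan
    return math.copysign(math.inf, num)


rows = [
    # Third record from the structure listing.
    {"FixedAcidity": 7.1, "VolatileAcidity": 2.64, "Alcohol": 22.0, "Sulphates": 0.48},
    # Invented record: zero Sulphates and a negative FixedAcidity.
    {"FixedAcidity": -18.1, "VolatileAcidity": 0.16, "Alcohol": 10.4, "Sulphates": 0.0},
]

features = [
    {
        "Alcohol_to_Sulphates": safe_ratio(r["Alcohol"], r["Sulphates"]),
        "Acidity_Index": r["FixedAcidity"] + r["VolatileAcidity"],
        "Sulphates_Alcohol_Interaction": r["Sulphates"] * r["Alcohol"],
    }
    for r in rows
]


def check(values):
    """Count negative, NaN, and infinite entries in a list of numbers."""
    return {
        "negative": sum(1 for v in values if v < 0),
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
    }


report = {name: check([f[name] for f in features]) for name in features[0]}
print(report)
```

Negative Acidity_Index values arise here because negative FixedAcidity values are summed without being cleaned first.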

In the data preparation process, negative and missing values were addressed. First, negative values in columns that should be non-negative (CitricAcid, VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol) were replaced with NA. Missing values in the STARS variable (categorical) were then imputed with the mode, while missing values in the continuous variables were imputed using the mean. Transformations were applied to handle skewness in the AcidIndex and STARS variables, with the Box-Cox transformation being most effective for AcidIndex. The STARS variable benefitted from both log and Box-Cox transformations, although its discrete nature limited normality. New features were created, including the Alcohol-to-Sulphates ratio, Acidity Index, and Sulphates × Alcohol interaction, and were checked for data issues. The check above flags 22 infinite values in Alcohol_to_Sulphates and 1,460 negative values in Acidity_Index, which would need to be resolved before these features are used; the regression formulas below do not include them.

BUILD MODELS

Poisson Model

## 
## Call:
## glm(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex + 
##     VolatileAcidity + TotalSulfurDioxide + FreeSulfurDioxide + 
##     Chlorides + Alcohol + Sulphates + CitricAcid + pH + Density, 
##     family = poisson(link = "log"), data = wine_data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.2902196  0.3551942  14.894  < 2e-16 ***
## Log_STARS           0.8344675  0.0115972  71.954  < 2e-16 ***
## LabelAppeal         0.1391214  0.0059964  23.201  < 2e-16 ***
## BoxCox_AcidIndex   -6.1055399  0.3968569 -15.385  < 2e-16 ***
## VolatileAcidity    -0.0577085  0.0107004  -5.393 6.92e-08 ***
## TotalSulfurDioxide  0.0001336  0.0000349   3.829 0.000129 ***
## FreeSulfurDioxide   0.0001508  0.0000565   2.669 0.007606 ** 
## Chlorides          -0.0398285  0.0164476  -2.422 0.015455 *  
## Alcohol             0.0032991  0.0014887   2.216 0.026681 *  
## Sulphates          -0.0189675  0.0090807  -2.089 0.036728 *  
## CitricAcid          0.0153793  0.0094866   1.621 0.104983    
## pH                 -0.0125060  0.0076460  -1.636 0.101917    
## Density            -0.2926834  0.1919023  -1.525 0.127217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 13972  on 12782  degrees of freedom
## AIC: 45940
## 
## Number of Fisher Scoring iterations: 5

The Poisson regression model uses predictors such as Log_STARS, LabelAppeal, BoxCox_AcidIndex, and others to predict the target variable. Log_STARS, LabelAppeal, BoxCox_AcidIndex, and VolatileAcidity show the strongest associations with the target (p < 0.001), and the remaining predictors other than CitricAcid, pH, and Density are also significant at the 0.05 level. The model improves on the intercept-only fit, with the summary above reporting a reduction in deviance from 22861 (null) to 13972 (residual). CitricAcid, pH, and Density are not significant at the 0.05 level.
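Because the Poisson model uses a log link, each coefficient is easier to read after exponentiation: exp(coefficient) is the multiplicative change in the expected number of cases for a one-unit increase in the predictor. A minimal Python sketch using three estimates copied from the summary above; the doubling calculation assumes Log_STARS is the natural log of STARS, which the report does not state explicitly.

```python
import math

# Estimates copied from the Poisson summary above.
coefs = {
    "Log_STARS": 0.8344675,
    "LabelAppeal": 0.1391214,
    "VolatileAcidity": -0.0577085,
}

# exp(beta) is the rate ratio for a one-unit increase in the predictor.
rate_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# Assumption: Log_STARS = ln(STARS), so doubling STARS adds ln(2) to the
# predictor and multiplies the expected count by 2 ** beta.
doubling_stars = 2 ** coefs["Log_STARS"]

for name, ratio in rate_ratios.items():
    print(f"{name}: rate ratio {ratio:.3f}")
print(f"effect of doubling STARS: x{doubling_stars:.3f}")
```

Under this reading, a one-point increase in LabelAppeal multiplies the expected case count by about 1.15, holding the other predictors fixed.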

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol + 
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS | 1, data = wine_data)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7759 -0.3925  0.1307  0.5070  3.9430 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         4.633e+00  3.803e-01  12.183  < 2e-16 ***
## VolatileAcidity    -4.519e-02  1.118e-02  -4.041 5.33e-05 ***
## CitricAcid          9.914e-03  9.884e-03   1.003  0.31584    
## Chlorides          -3.336e-02  1.712e-02  -1.949  0.05130 .  
## FreeSulfurDioxide   1.192e-04  5.796e-05   2.057  0.03967 *  
## TotalSulfurDioxide  7.532e-05  3.528e-05   2.135  0.03277 *  
## Density            -2.943e-01  2.005e-01  -1.468  0.14215    
## pH                 -6.338e-03  7.959e-03  -0.796  0.42583    
## Alcohol             4.696e-03  1.545e-03   3.040  0.00236 ** 
## Sulphates          -1.291e-02  9.451e-03  -1.366  0.17205    
## LabelAppeal         1.756e-01  6.759e-03  25.979  < 2e-16 ***
## BoxCox_AcidIndex   -4.909e+00  4.330e-01 -11.336  < 2e-16 ***
## Log_STARS           6.162e-01  1.874e-02  32.873  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.20364    0.06409  -34.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 22 
## Log-likelihood: -2.277e+04 on 14 Df

In the standard Poisson model, LabelAppeal and Log_STARS have the strongest relationships with the target variable, with very low p-values indicating their importance. Other significant predictors there include BoxCox_AcidIndex, VolatileAcidity, TotalSulfurDioxide, FreeSulfurDioxide, Chlorides, Alcohol, and Sulphates, while CitricAcid, Density, and pH do not show significant effects. That model’s residual deviance is well below its null deviance, which indicates that the predictors improve the fit. In the zero-inflated Poisson (ZIP) model shown above, the count model coefficients largely mirror those from the Poisson regression in sign and magnitude, although Chlorides (p = 0.051) and Sulphates (p = 0.172) are no longer significant at the 0.05 level. The zero-inflation component contains only an intercept, which is highly significant; its value of -2.20 on the logit scale corresponds to a structural-zero probability of roughly 10% for every wine.
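To make the zero-inflation intercept concrete, the sketch below converts it from the logit scale to a probability and plugs it into the zero-inflated Poisson probability mass function. This is a minimal Python illustration rather than the fitted R model; the count mean `mu = 3.0` is an illustrative value, not a fitted one.

```python
import math

# Zero-inflation intercept copied from the ZIP summary above (logit scale).
ZERO_INTERCEPT = -2.20364


def inv_logit(z):
    """Map a logit-scale value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def zip_pmf(k, mu, pi):
    """Zero-inflated Poisson: a structural zero with probability pi,
    otherwise an ordinary Poisson(mu) draw."""
    poisson = math.exp(-mu) * mu ** k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


pi_zero = inv_logit(ZERO_INTERCEPT)
mu = 3.0  # illustrative count mean, not a fitted value
p0 = zip_pmf(0, mu, pi_zero)
total = sum(zip_pmf(k, mu, pi_zero) for k in range(60))

print(f"structural-zero probability: {pi_zero:.4f}")
print(f"P(TARGET = 0) at mu = {mu}: {p0:.4f}")
```

Because the zero-inflation part has no predictors, the same structural-zero probability applies to every wine regardless of its STARS or LabelAppeal.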

Below, we visualize fitted vs observed values for the Poisson model.

The Poisson model demonstrates a general trend, but it exhibits signs of overdispersion, where the variance exceeds the mean. This is reflected in the spread of points away from the red reference line, indicating a lack of perfect fit. The model tends to underpredict at higher observed values, suggesting that some important predictors or interaction terms may be missing. While the model performs reasonably well for lower counts, its accuracy diminishes as observed values increase, highlighting areas for potential improvement in modeling higher count observations.
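Two rough numerical checks can accompany the visual impression of overdispersion: the marginal variance-to-mean ratio of TARGET, and the residual deviance divided by its degrees of freedom. The minimal Python sketch below uses figures copied from the descriptive statistics and the Poisson summary; both ratios are crude indicators rather than formal tests, and neither accounts for the excess zeros.

```python
# Figures copied from the descriptive statistics and the Poisson summary.
target_mean = 3.029074
target_sd = 1.926368
residual_deviance = 13972
residual_df = 12782

# A Poisson variable has variance equal to its mean, so a ratio near 1
# is what the distribution predicts marginally.
variance_to_mean = target_sd ** 2 / target_mean

# Residual deviance per degree of freedom; values well above 1 are often
# read as a sign of overdispersion.
deviance_ratio = residual_deviance / residual_df

print(f"variance / mean of TARGET: {variance_to_mean:.3f}")
print(f"residual deviance / df:    {deviance_ratio:.3f}")
```

Both ratios come out only modestly above 1, which suggests that any overdispersion is mild.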

Negative Binomial Models

## 
## Call:
## glm.nb(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex + 
##     VolatileAcidity + TotalSulfurDioxide + FreeSulfurDioxide + 
##     Chlorides + Alcohol + Sulphates + CitricAcid + pH + Density, 
##     data = wine_data, init.theta = 44991.79975, link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.298e+00  3.556e-01  14.900  < 2e-16 ***
## Log_STARS           8.343e-01  1.161e-02  71.861  < 2e-16 ***
## LabelAppeal         1.390e-01  6.002e-03  23.167  < 2e-16 ***
## BoxCox_AcidIndex   -6.118e+00  3.973e-01 -15.397  < 2e-16 ***
## VolatileAcidity    -5.767e-02  1.071e-02  -5.387 7.17e-08 ***
## TotalSulfurDioxide  1.351e-04  3.491e-05   3.870 0.000109 ***
## FreeSulfurDioxide   1.544e-04  5.658e-05   2.728 0.006367 ** 
## Chlorides          -4.005e-02  1.646e-02  -2.434 0.014950 *  
## Alcohol             3.314e-03  1.490e-03   2.225 0.026100 *  
## Sulphates          -1.893e-02  9.098e-03  -2.081 0.037446 *  
## CitricAcid          1.593e-02  9.494e-03   1.678 0.093326 .  
## pH                 -1.227e-02  7.652e-03  -1.604 0.108707    
## Density            -2.932e-01  1.921e-01  -1.526 0.126937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(44991.8) family taken to be 1)
## 
##     Null deviance: 22819  on 12772  degrees of freedom
## Residual deviance: 13945  on 12760  degrees of freedom
## AIC: 45863
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  44992 
##           Std. Err.:  41032 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45834.95

For the Negative Binomial model, the lowest AIC is obtained with the same set of predictors as in the Poisson model, so we keep those variables in this model too.

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol + 
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, data = wine_data, 
##     init.theta = 44992.15985, link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.298e+00  3.556e-01  14.900  < 2e-16 ***
## VolatileAcidity    -5.767e-02  1.071e-02  -5.387 7.17e-08 ***
## CitricAcid          1.593e-02  9.494e-03   1.678 0.093327 .  
## Chlorides          -4.005e-02  1.646e-02  -2.434 0.014950 *  
## FreeSulfurDioxide   1.544e-04  5.658e-05   2.728 0.006367 ** 
## TotalSulfurDioxide  1.351e-04  3.491e-05   3.870 0.000109 ***
## Density            -2.932e-01  1.921e-01  -1.526 0.126937    
## pH                 -1.227e-02  7.652e-03  -1.604 0.108707    
## Alcohol             3.314e-03  1.490e-03   2.225 0.026100 *  
## Sulphates          -1.893e-02  9.098e-03  -2.081 0.037446 *  
## LabelAppeal         1.390e-01  6.002e-03  23.167  < 2e-16 ***
## BoxCox_AcidIndex   -6.118e+00  3.973e-01 -15.397  < 2e-16 ***
## Log_STARS           8.343e-01  1.161e-02  71.861  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(44992.16) family taken to be 1)
## 
##     Null deviance: 22819  on 12772  degrees of freedom
## Residual deviance: 13945  on 12760  degrees of freedom
## AIC: 45863
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  44992 
##           Std. Err.:  41033 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45834.95

The Negative Binomial model produces essentially the same fit as the Poisson model. The coefficient estimates agree to about two decimal places, and the estimated dispersion parameter theta is very large (about 44,992, with an iteration-limit warning), which means the model has effectively collapsed to a Poisson. As with the Poisson model, there is still some underprediction of high values and slight overprediction of low values, suggesting opportunities for further refinement.

The model performs well for mid-range TARGET values, with fitted values closely aligning with the observed distribution in this range. Because theta is so large, it adds almost no variance beyond what the Poisson model already assumes. It also struggles with the excess zeros, indicating that it does not fully capture the characteristics of the data at the lower end.
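The role of theta can be checked directly. In the parameterization used by `glm.nb`, the negative binomial variance is mu + mu^2 / theta, so the extra variance relative to a Poisson shrinks as theta grows. A minimal Python sketch using the theta from the summaries above and the overall TARGET mean as a representative value of mu:

```python
def negbin_variance(mu, theta):
    """Variance of a negative binomial with mean mu and dispersion theta."""
    return mu + mu ** 2 / theta


THETA = 44992.0  # estimate reported in the summaries above
mu = 3.029074    # overall mean of TARGET, used as a representative mean

extra_variance = negbin_variance(mu, THETA) - mu
print(f"variance beyond Poisson at mu = {mu}: {extra_variance:.6f}")
```

With theta this large, the extra variance is negligible, which is consistent with the nearly identical coefficients in the Poisson and Negative Binomial summaries.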

## 
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol + 
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS | 1, data = wine_data, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7757 -0.3921  0.1309  0.5069  3.9448 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         4.640e+00  3.807e-01  12.189  < 2e-16 ***
## VolatileAcidity    -4.515e-02  1.119e-02  -4.035 5.45e-05 ***
## CitricAcid          1.043e-02  9.892e-03   1.055  0.29154    
## Chlorides          -3.357e-02  1.712e-02  -1.960  0.04994 *  
## FreeSulfurDioxide   1.226e-04  5.803e-05   2.113  0.03464 *  
## TotalSulfurDioxide  7.644e-05  3.528e-05   2.166  0.03028 *  
## Density            -2.957e-01  2.007e-01  -1.473  0.14074    
## pH                 -6.095e-03  7.965e-03  -0.765  0.44414    
## Alcohol             4.713e-03  1.545e-03   3.050  0.00229 ** 
## Sulphates          -1.294e-02  9.469e-03  -1.367  0.17166    
## LabelAppeal         1.755e-01  6.765e-03  25.946  < 2e-16 ***
## BoxCox_AcidIndex   -4.919e+00  4.335e-01 -11.345  < 2e-16 ***
## Log_STARS           6.160e-01  1.876e-02  32.846  < 2e-16 ***
## Log(theta)          1.777e+01  1.037e+00  17.135  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.20334    0.06409  -34.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 52405508.723 
## Number of iterations in BFGS optimization: 73 
## Log-likelihood: -2.273e+04 on 15 Df

Compared to the Negative Binomial model, the Zero-Inflated Negative Binomial model appears to handle the excess zeros and low count values better. Points for TARGET = 0 align closer to the fitted values, indicating that the ZINB model better accounts for the zero-inflation in the data. Similar to the Negative Binomial model, the ZINB model still struggles to predict higher observed counts such as TARGET > 5. The fitted values for these points are consistently below the red line, indicating underprediction.

Linear Regression Models

## 
## Call:
## lm(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex + 
##     VolatileAcidity + TotalSulfurDioxide + Alcohol + Chlorides + 
##     FreeSulfurDioxide + Sulphates + CitricAcid + Density + pH, 
##     data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7960 -0.8455  0.0218  0.8310  6.3083 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.555e+01  8.019e-01  19.386  < 2e-16 ***
## Log_STARS           2.204e+00  2.270e-02  97.074  < 2e-16 ***
## LabelAppeal         4.751e-01  1.340e-02  35.456  < 2e-16 ***
## BoxCox_AcidIndex   -1.766e+01  8.889e-01 -19.863  < 2e-16 ***
## VolatileAcidity    -1.682e-01  2.340e-02  -7.191 6.80e-13 ***
## TotalSulfurDioxide  3.859e-04  8.050e-05   4.794 1.65e-06 ***
## Alcohol             1.468e-02  3.393e-03   4.327 1.53e-05 ***
## Chlorides          -1.181e-01  3.743e-02  -3.156  0.00160 ** 
## FreeSulfurDioxide   4.153e-04  1.310e-04   3.170  0.00153 ** 
## Sulphates          -4.672e-02  2.042e-02  -2.288  0.02218 *  
## CitricAcid          4.947e-02  2.189e-02   2.260  0.02383 *  
## Density            -8.313e-01  4.382e-01  -1.897  0.05784 .  
## pH                 -2.880e-02  1.740e-02  -1.655  0.09794 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312 on 12760 degrees of freedom
## Multiple R-squared:  0.5366, Adjusted R-squared:  0.5362 
## F-statistic:  1231 on 12 and 12760 DF,  p-value: < 2.2e-16

Log_STARS, LabelAppeal, BoxCox_AcidIndex, VolatileAcidity, TotalSulfurDioxide, Alcohol, Chlorides, FreeSulfurDioxide, Sulphates, pH, CitricAcid, and Density together give the minimum AIC for the linear regression model, so we kept these variables in this model.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol + 
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7960 -0.8455  0.0218  0.8310  6.3083 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.555e+01  8.019e-01  19.386  < 2e-16 ***
## VolatileAcidity    -1.682e-01  2.340e-02  -7.191 6.80e-13 ***
## CitricAcid          4.947e-02  2.189e-02   2.260  0.02383 *  
## Chlorides          -1.181e-01  3.743e-02  -3.156  0.00160 ** 
## FreeSulfurDioxide   4.153e-04  1.310e-04   3.170  0.00153 ** 
## TotalSulfurDioxide  3.859e-04  8.050e-05   4.794 1.65e-06 ***
## Density            -8.313e-01  4.382e-01  -1.897  0.05784 .  
## pH                 -2.880e-02  1.740e-02  -1.655  0.09794 .  
## Alcohol             1.468e-02  3.393e-03   4.327 1.53e-05 ***
## Sulphates          -4.672e-02  2.042e-02  -2.288  0.02218 *  
## LabelAppeal         4.751e-01  1.340e-02  35.456  < 2e-16 ***
## BoxCox_AcidIndex   -1.766e+01  8.889e-01 -19.863  < 2e-16 ***
## Log_STARS           2.204e+00  2.270e-02  97.074  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312 on 12760 degrees of freedom
## Multiple R-squared:  0.5366, Adjusted R-squared:  0.5362 
## F-statistic:  1231 on 12 and 12760 DF,  p-value: < 2.2e-16

The multiple linear regression model shows that several predictors, such as VolatileAcidity, CitricAcid, Chlorides, LabelAppeal, and Log_STARS, significantly impact the TARGET variable. The model explains 53.66% of the variance in the data, as indicated by the R-squared value. Although the model performs reasonably well, the residual standard error of 1.312 suggests some room for improvement. The observed vs predicted plot reveals some discrepancies, especially for higher values of TARGET.
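The adjusted R-squared in the summary can be recovered from the multiple R-squared, the number of predictors, and the residual degrees of freedom. A minimal Python sketch of that arithmetic, with all inputs copied from the linear model summary above:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


R2 = 0.5366          # multiple R-squared from the summary
RESIDUAL_DF = 12760  # residual degrees of freedom from the summary
P = 12               # number of predictors, excluding the intercept

# Residual df = n - p - 1, so the number of rows used in the fit follows.
n = RESIDUAL_DF + P + 1
adj = adjusted_r2(R2, n, P)

print(f"rows used in the fit: {n}")
print(f"adjusted R-squared:   {adj:.4f}")
```

With more than twelve thousand rows and only twelve predictors, the adjustment is tiny, which is why the two R-squared values in the summary are nearly equal.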

## [1] "Optimal Lambda: 0.0133517693664588"

##      Coefficient           Variable
## 1   1.190563e+01        (Intercept)
## 4  -1.261571e-01    VolatileAcidity
## 5   1.984741e-02         CitricAcid
## 7  -6.416365e-02          Chlorides
## 8   2.277633e-04  FreeSulfurDioxide
## 9   2.673729e-04 TotalSulfurDioxide
## 10 -2.619490e-01            Density
## 11 -6.438560e-03                 pH
## 12 -2.035883e-02          Sulphates
## 13  8.626826e-03            Alcohol
## 14  3.839345e-01        LabelAppeal
## 15 -1.042935e-02          AcidIndex
## 27 -1.381986e+01   BoxCox_AcidIndex
## 28  1.955161e+00          Log_STARS
## 31  1.443776e-01        fitted_zinb
## 32  9.864509e-01 residuals_stepwise

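At the optimal penalty (lambda of about 0.0134), the lasso retains the same predictors as the linear model plus AcidIndex, with coefficients shrunk toward zero (for example, LabelAppeal 0.384 vs. 0.475 and Log_STARS 1.955 vs. 2.204). It also retains fitted_zinb and residuals_stepwise, which by their names appear to be outputs of other models fitted to TARGET. The coefficient on residuals_stepwise is close to 1 (0.986), so the lasso is largely reconstructing TARGET from another model's residuals, and its fit statistics should be read with that in mind.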

Model Selection

Two of the models below produced almost identical metrics, so we added predictors to the zero-inflation component (the right-hand side of the | in the model formula). This lets the model account for the influence of predictors on the excess zero counts, which could differentiate the models further.
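For reference, the zero-inflated Poisson model is usually written in the two-part form below. The symbols (pi_i for the probability of an excess zero, mu_i for the Poisson mean, x_i and z_i for the two predictor sets, beta and gamma for their coefficients) are notation introduced here for illustration and do not come from the report's code.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\mu_i}, \\
P(Y_i = k) &= (1 - \pi_i)\, \frac{\mu_i^{k}\, e^{-\mu_i}}{k!}, \qquad k = 1, 2, \dots \\
\log \mu_i &= x_i^{\top} \beta, \qquad \operatorname{logit}(\pi_i) = z_i^{\top} \gamma
\end{aligned}
```

Predictors on the left of the | enter the count part through x_i, and predictors on the right enter the zero-inflation part through z_i. With only an intercept on the right, pi_i is the same for every wine.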

##                   Model     RMSE      MAE McFadden_R_squared      AIC
## 1               Poisson 1.332053 1.082923          0.1622001 45860.59
## 2     Negative Binomial 1.332054 1.082925          0.1621936 45862.95
## 3 Zero-Inflated Poisson 1.919950 1.565351          0.1076144 48836.88
## 4  Zero-Inflated NegBin 1.919950 1.565351          0.1076143 48838.89
##   Log_Likelihood    Deviance
## 1      -22917.30 13945.98646
## 2      -22917.47 13945.48276
## 3      -24410.44     1.91995
## 4      -24410.44     1.91995

Ordinary R-squared works well for linear models but is not always meaningful for Poisson regression or other GLMs. McFadden's R-squared, which compares the model's log-likelihood with that of an intercept-only model, is better suited to GLMs and count models such as Poisson or Negative Binomial.
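As a minimal sketch of how McFadden's R-squared can be computed in base R, the example below fits a Poisson GLM to simulated data. The data, the variable names x and y, and the helper mcfadden_r2 are all hypothetical and are not taken from the report's code.

```r
# Simulated count data; x, y and the true coefficients are illustrative only.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.4 * x))

# Full model and intercept-only (null) model.
fit  <- glm(y ~ x, family = poisson(link = "log"))
null <- glm(y ~ 1, family = poisson(link = "log"))

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null).
mcfadden_r2 <- function(model, null_model) {
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
}

r2 <- mcfadden_r2(fit, null)
print(r2)
```

Because both log-likelihoods are negative and the full model can never fit worse than the null model, the value falls between 0 and 1.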

##         RMSE        MAE R_squared                      Model
## 1 0.05781604 0.04489417 0.9990991           Lasso Regression
## 2 1.31127180 1.02382501 0.5365918 Multiple Linear Regression

Comparing the metrics and deciding

Model Performance Metrics
Model	RMSE	MAE	McFadden_R_squared	AIC
Poisson	1.332053	1.082923	0.1622001	45860.59
Negative Binomial	1.332054	1.082925	0.1621936	45862.95
Zero-Inflated Poisson	1.919950	1.565351	0.1076144	48836.88
Zero-Inflated NegBin	1.919950	1.565351	0.1076143	48838.89

McFadden's R-squared for the Negative Binomial model (0.1621936) is almost identical to that of the Poisson model (0.1622001) and clearly higher than for the zero-inflated models (about 0.1076). This is a modest level of explained variation for count data models, but it is still informative.

The RMSE for the Negative Binomial model (1.332054) is nearly identical to that of the Poisson model (1.332053), and both are well below the zero-inflated models, which have an RMSE of about 1.92. The MAE values tell the same story: Poisson (1.082923) and Negative Binomial (1.082925) are effectively tied, while the zero-inflated models are higher (about 1.565). On these metrics, the Poisson and Negative Binomial models fit better than the zero-inflated models.
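The two error metrics used in this comparison can be sketched in a few lines of base R. The helper names rmse and mae are hypothetical, the observed values are the first six TARGET values from the data listing, and the predicted values are made up purely for illustration.

```r
# Root mean squared error and mean absolute error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(3, 3, 5, 3, 4, 0)               # first six TARGET values
predicted <- c(2.5, 3.5, 4.0, 3.0, 3.0, 1.0)   # made-up predictions

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

RMSE penalizes large misses more heavily than MAE, so a model with a few badly predicted wines shows a wider gap between the two.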

The Lasso Regression model has a far lower RMSE (0.05781604) and MAE (0.04489417), but this result is not comparable with the others. Its retained predictors include fitted_zinb and residuals_stepwise, which appear to be derived from other models of TARGET, so the near-perfect R-squared (0.9991) most likely reflects information about TARGET leaking into the predictors. Lasso regression as fitted here also treats the outcome as continuous rather than as a count. Multiple Linear Regression (RMSE 1.311272, MAE 1.023825) has error metrics slightly lower than those of the Poisson and Negative Binomial models, but it assumes continuous, normally distributed errors and can predict negative values, which conflicts with the nature of count data.

In conclusion, we select the Negative Binomial model for our count data. On RMSE, MAE, AIC, and log-likelihood it is effectively tied with the Poisson model (the Poisson AIC is marginally lower, 45860.59 vs. 45862.95), and both clearly outperform the zero-inflated models. The Negative Binomial offers more flexibility by allowing the variance to exceed the mean, which makes it a safer choice for real-world count data that may be overdispersed. The zero-inflated models fit worse on every metric above, and the linear models (Lasso, Multiple Linear Regression) are not designed for count outcomes.

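We evaluate the performance of the count regression model on the training data set. The metrics and coefficient summary below come from the Poisson fit (family = poisson), whose results are nearly identical to those of the Negative Binomial model.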

## $RMSE
## [1] 1.332053
## 
## $MAE
## [1] 1.082923
## 
## $McFadden_R2
## [1] 0.1622001
## 
## $AIC
## [1] 45860.59

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol + 
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, family = poisson(link = "log"), 
##     data = wine_data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.298e+00  3.556e-01  14.900  < 2e-16 ***
## VolatileAcidity    -5.767e-02  1.070e-02  -5.387 7.17e-08 ***
## CitricAcid          1.593e-02  9.494e-03   1.678 0.093318 .  
## Chlorides          -4.005e-02  1.646e-02  -2.434 0.014948 *  
## FreeSulfurDioxide   1.544e-04  5.658e-05   2.728 0.006367 ** 
## TotalSulfurDioxide  1.351e-04  3.491e-05   3.870 0.000109 ***
## Density            -2.932e-01  1.921e-01  -1.526 0.126928    
## pH                 -1.227e-02  7.652e-03  -1.604 0.108714    
## Alcohol             3.314e-03  1.489e-03   2.225 0.026090 *  
## Sulphates          -1.893e-02  9.098e-03  -2.081 0.037445 *  
## LabelAppeal         1.390e-01  6.001e-03  23.169  < 2e-16 ***
## BoxCox_AcidIndex   -6.117e+00  3.973e-01 -15.398  < 2e-16 ***
## Log_STARS           8.343e-01  1.161e-02  71.864  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22820  on 12772  degrees of freedom
## Residual deviance: 13946  on 12760  degrees of freedom
## AIC: 45861
## 
## Number of Fisher Scoring iterations: 5
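The residual deviance and degrees of freedom reported above allow a quick, rough check for overdispersion. The sketch below only divides the two printed numbers; the variable names are hypothetical, and the ratio is a rule of thumb rather than a formal test.

```r
# Values copied from the Poisson summary above.
residual_deviance <- 13946
residual_df       <- 12760

# A ratio well above 1 would point to overdispersion.
dispersion_ratio <- residual_deviance / residual_df
print(round(dispersion_ratio, 3))
```

A ratio close to 1 points to little overdispersion, which is consistent with the Poisson and Negative Binomial fits being almost indistinguishable.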

Interpretation of Coefficients

LabelAppeal: This variable has a highly significant and positive coefficient. For every one-unit increase in LabelAppeal, the log of the expected count of cases sold increases by 0.139, which corresponds to an increase of about 15% in the expected count. Improving wine label design and marketing appeal can drive sales.

Log_STARS: This is the strongest predictor of wine sales. A one-unit increase in Log_STARS leads to a 130% increase in expected sales. Highlighting wine ratings and reviews is crucial for increasing consumer demand.

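BoxCox_AcidIndex: A significant negative coefficient indicates that a higher BoxCox_AcidIndex strongly decreases sales. A one-unit increase would reduce expected sales by about 99.8%, although the transformed index spans only a narrow range (roughly 0.80 to 0.85 in the rows shown in this report), so a full one-unit change lies far outside the data. Ensuring optimal acidity levels in wine production is essential for market success.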

FreeSulfurDioxide and TotalSulfurDioxide: Both variables have small but statistically significant positive effects. A one-unit increase in either raises expected sales by roughly 0.015%. These compounds should be optimized for preservation without negatively impacting quality.

Alcohol: A positive and significant coefficient shows that higher alcohol content slightly increases sales. A one-unit increase in Alcohol leads to a 0.33% rise in expected sales. Marketing wines with balanced alcohol content may enhance appeal.

Sulphates: A significant negative coefficient suggests that higher Sulphates reduce sales. A one-unit increase decreases the expected count by about 1.9%. Sulphate levels should be carefully monitored to avoid adversely affecting customer preference.

VolatileAcidity: A significant negative effect, with expected sales falling by approximately 5.6% per unit increase. Maintaining low volatile acidity supports better product perception.

Chlorides: A significant negative effect, reducing expected sales by about 3.9% per unit increase.

CitricAcid, Density, and pH are not statistically significant at the 5% level in this model (p = 0.093, 0.127, and 0.109, respectively), suggesting they may not play a critical role in predicting wine sales under the given conditions.
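The percentage effects quoted in this section follow from the log link: a coefficient b multiplies the expected count by exp(b) per unit increase, which is a change of (exp(b) - 1) * 100 percent. The sketch below applies this to the rounded coefficients printed in the Poisson summary; the object names coefs and pct_change are hypothetical.

```r
# Rounded coefficients copied from the Poisson summary.
coefs <- c(
  LabelAppeal      =  1.390e-01,
  Log_STARS        =  8.343e-01,
  BoxCox_AcidIndex = -6.117e+00,
  Alcohol          =  3.314e-03,
  Sulphates        = -1.893e-02,
  VolatileAcidity  = -5.767e-02,
  Chlorides        = -4.005e-02
)

# Percent change in the expected count per one-unit increase.
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))
```

For small coefficients the percentage change is close to 100 times the coefficient itself, which is why Alcohol (0.003314) maps to about 0.33%. For large coefficients such as Log_STARS the two differ substantially.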

Conclusion

Label appeal, expert ratings (STARS), acidity, sulfur dioxide levels, and alcohol content all play significant roles in driving wine sales. The model provides actionable insights that producers can use to optimize product characteristics and marketing strategies to maximize sales.

## 'data.frame':    3335 obs. of  16 variables:
##  $ IN                : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET            : logi  NA NA NA NA NA NA ...
##  $ FixedAcidity      : num  5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity   : num  -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid        : num  0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar     : num  -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides         : num  0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide : num  23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide: num  398 68 76 89 53 140 17 381 83 129 ...
##  $ Density           : num  0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                : num  5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates         : num  0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol           : num  12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal       : int  -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex         : int  6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS             : int  NA 2 1 1 NA 4 3 NA NA NA ...

##   IN    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1  3 1.0726150          5.4          -0.860       0.27         -10.7     0.092
## 2  9 2.6196534         12.4           0.385      -0.76         -19.7     1.169
## 3 10 1.4081823          7.2           1.750       0.17         -33.0     0.065
## 4 18 1.4479202          6.2           0.100       1.80           1.0    -0.179
## 5 21 0.7891426         11.4           0.210       0.28           1.2     0.038
## 6 30 3.8311621         17.6           0.040      -1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2               -37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54     -0.07    4.80
## 6              -250                140 0.95028 3.06     -0.02   11.40
##   LabelAppeal AcidIndex STARS Log_STARS BoxCox_AcidIndex predicted_nb
## 1          -1         6     0 0.0000000        0.7968244    1.0726050
## 2           0         6     2 1.0986123        0.7968244    2.6196274
## 3           0         8     1 0.6931472        0.8331799    1.4081535
## 4          -1         8     1 0.6931472        0.8331799    1.4478975
## 5           0        10     0 0.0000000        0.8545985    0.7891245
## 6           1         8     4 1.6094379        0.8331799    3.8311098
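As a rough check on the scored rows, the first prediction can be approximately reproduced by hand from the rounded Poisson coefficients. The sketch below assumes that Log_STARS equals log(STARS + 1) with missing STARS set to 0. That assumption is inferred from the printed values (for example, STARS = 2 gives 1.0986123), not from the report's code, and the object names are hypothetical.

```r
# Rounded Poisson coefficients copied from the model summary.
beta <- c(
  Intercept          =  5.298,
  VolatileAcidity    = -5.767e-02,
  CitricAcid         =  1.593e-02,
  Chlorides          = -4.005e-02,
  FreeSulfurDioxide  =  1.544e-04,
  TotalSulfurDioxide =  1.351e-04,
  Density            = -2.932e-01,
  pH                 = -1.227e-02,
  Alcohol            =  3.314e-03,
  Sulphates          = -1.893e-02,
  LabelAppeal        =  1.390e-01,
  BoxCox_AcidIndex   = -6.117,
  Log_STARS          =  8.343e-01
)

# First printed evaluation row (IN = 3); STARS was missing and appears as 0.
stars <- 0
row1 <- c(
  Intercept          = 1,
  VolatileAcidity    = -0.860,
  CitricAcid         = 0.27,
  Chlorides          = 0.092,
  FreeSulfurDioxide  = 23,
  TotalSulfurDioxide = 398,
  Density            = 0.98527,
  pH                 = 5.02,
  Alcohol            = 12.30,
  Sulphates          = 0.64,
  LabelAppeal        = -1,
  BoxCox_AcidIndex   = 0.7968244,
  Log_STARS          = log1p(stars)  # assumed transform: log(STARS + 1)
)

eta <- sum(beta * row1[names(beta)])  # linear predictor on the log scale
predicted_count <- exp(eta)           # inverse of the log link
print(predicted_count)

# The assumed transform matches the printed Log_STARS values for STARS = 2, 1, 4.
print(round(log1p(c(2, 1, 4)), 7))
```

The hand-computed value lands within rounding error of the printed prediction for this row (about 1.07). The small gap comes from using coefficients rounded to four significant figures.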

Homework 5 Data 621

Nikoleta Emanouilidi, Mohammed Rahman, Will Berritt

2024-11-24