In this assignment, we analyze a data set containing 12,795 records of commercially available wines. Each record describes the chemical properties of a wine, along with factors such as its label appeal and expert rating. The goal is to predict the number of sample cases purchased by wine distribution companies. The primary objectives are as follows:
- To build predictive models for estimating the number of sample cases ordered (TARGET) based on wine characteristics.
- To evaluate and refine various regression models, with a focus on count regression techniques, including Poisson and negative binomial regression, to ensure accurate and interpretable predictions.
To accomplish these objectives, we will conduct an in-depth exploration of the data, investigating variable distributions, potential correlations with the target, and missing data patterns. Based on these insights, we will preprocess and transform the data, ensuring it is well-suited for modeling. Finally, we will construct and evaluate multiple regression models, selecting the best one based on performance metrics and interpretability to provide actionable insights for the wine manufacturer’s strategy.
## 'data.frame': 12795 obs. of 16 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
The dataset consists of 12,795 observations and 16 variables that capture chemical and non-chemical characteristics of wines. The target variable, TARGET, is the number of wine sample cases purchased; it ranges from 0 to 8 with a mean of approximately 3 cases. The features include chemical properties such as FixedAcidity, VolatileAcidity, ResidualSugar, and Alcohol, as well as marketing-related factors like LabelAppeal and the expert rating STARS. Many variables span wide ranges that include physically implausible negative values: ResidualSugar runs from -127.8 to 141.15 with a mean of 5.4, and Alcohol from -4.7 to 26.5 with a mean of 10.5. Eight variables contain missing values, including Chlorides (638), FreeSulfurDioxide (647), and especially STARS (3,359, about 26% of observations), so imputation or alternative handling will be necessary. Overall, initial exploration suggests the need for transformations and careful handling of missing data to ensure robust modeling.
## Mean Median StdDev
## INDEX 8.069980e+03 8110.00000 4.656905e+03
## TARGET 3.029074e+00 3.00000 1.926368e+00
## FixedAcidity 7.075717e+00 6.90000 6.317643e+00
## VolatileAcidity 3.241039e-01 0.28000 7.840142e-01
## CitricAcid 3.084127e-01 0.31000 8.620798e-01
## ResidualSugar 5.418733e+00 3.90000 3.374938e+01
## Chlorides 5.482249e-02 0.04600 3.184673e-01
## FreeSulfurDioxide 3.084557e+01 30.00000 1.487146e+02
## TotalSulfurDioxide 1.207142e+02 123.00000 2.319132e+02
## Density 9.942027e-01 0.99449 2.653765e-02
## pH 3.207628e+00 3.20000 6.796871e-01
## Sulphates 5.271118e-01 0.50000 9.321293e-01
## Alcohol 1.048924e+01 10.40000 3.727819e+00
## LabelAppeal -9.066041e-03 0.00000 8.910892e-01
## AcidIndex 7.772724e+00 8.00000 1.323926e+00
## STARS 2.041755e+00 2.00000 9.025400e-01
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 616 638 647
## TotalSulfurDioxide Density pH Sulphates
## 682 0 395 1210
## Alcohol LabelAppeal AcidIndex STARS
## 653 0 0 3359
The summary statistics highlight key characteristics of the dataset and inform preprocessing steps. TARGET has a mean of about 3.03 cases and a standard deviation of 1.93 (a variance of roughly 3.71), so its variance is only modestly larger than its mean, which makes count regression models a reasonable starting point. Chemical features like FixedAcidity and VolatileAcidity have means close to their medians, while ResidualSugar is far more variable (standard deviation of 33.7 against a mean of 5.4). Among the non-chemical variables, LabelAppeal is centered near zero, while STARS, an ordinal expert rating, has a mean of about 2 but is missing many values, which suggests mode imputation to preserve its predictive potential. Continuous variables with missing values, such as Sulphates, Chlorides, and Alcohol, may require mean, median, or KNN imputation. The large standard deviations of FreeSulfurDioxide (148.7) and TotalSulfurDioxide (231.9) point to potential outliers, emphasizing the need for imputation, transformations, and careful scaling to optimize model performance.
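As a minimal sketch of how tables like the two above can be produced, the snippet below computes the mean, median, and standard deviation per column and counts missing values. The data frame `wine_toy` and its values are made up for illustration; the original code is not shown in this report.

```r
# Toy data frame standing in for the wine data (values are invented).
wine_toy <- data.frame(
  TARGET  = c(3, 3, 5, 0, 4),
  Alcohol = c(9.9, NA, 22.0, 6.2, 13.7),
  STARS   = c(2, 3, NA, NA, 1)
)

# Mean, median, and standard deviation per column, ignoring missing values.
stats_table <- data.frame(
  Mean   = sapply(wine_toy, mean,   na.rm = TRUE),
  Median = sapply(wine_toy, median, na.rm = TRUE),
  StdDev = sapply(wine_toy, sd,     na.rm = TRUE)
)

# Number of missing values per column.
na_counts <- colSums(is.na(wine_toy))

print(stats_table)
print(na_counts)
```

Passing `na.rm = TRUE` matters here: without it, any column with a missing value would report `NA` for all three statistics.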
This plot gives a clear sense of the distribution of the target variable. It looks roughly normal, with the exception of a large spike at 0 (TARGET has no missing values).
The boxplot of Label Appeal vs TARGET makes
a lot of sense. As we try to understand the relationship between the
two, it’s not hard to see that an increase in Label Appeal
corresponds with an increase in the TARGET value.
We see that the features have very low correlations with each other,
meaning that there is not much multicollinearity present in the dataset.
This means that the assumptions of linear regression are more likely to
be met. However, we do see the strongest relationships between
STARS, LabelAppeal, and
TARGET.
The distributions generally look well behaved. AcidIndex and
STARS display some right skewness, for which we can consider
transformations.
The bar charts compare the three discrete variables
against the TARGET variable. For AcidIndex, a
large quantity of wine was sold with index numbers 7 and 8.
LabelAppeal indicates that wines with generic labels tend
to have a higher number of cases sold per order. Finally,
STARS reveals that higher-star-rated wines are associated
with a higher number of cases sold. Overall, for each of these predictors, there
appears to be a clear relationship between their ordered levels
and the number of wine cases sold.
Here we see a weak but positive relationship between
Alcohol and TARGET, which makes sense. If
people are purchasing wine, it is likely with the intention of feeling
the effects.
To better understand the negative values in our data set, we did some more digging.
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 1621 2827
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 2966 3136 3197 3036
## TotalSulfurDioxide Density pH Sulphates
## 2504 0 0 2361
## Alcohol LabelAppeal AcidIndex STARS
## 118 3640 0 0
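CitricAcid, VolatileAcidity,
FreeSulfurDioxide, TotalSulfurDioxide,
Sulphates, and Alcohol are physical quantities that should be
non-negative, so negative values here are likely invalid.
FixedAcidity, ResidualSugar, and Chlorides also contain negative
values (1,621, 3,136, and 3,197, respectively) and deserve the same scrutiny.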
LabelAppeal is a marketing score, and it can
theoretically be negative if the label is poorly received, so negative
values might be valid.
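A per-column count of negative values, like the table above, can be obtained in one line. The sketch below uses a small invented data frame (`wine_toy` is a hypothetical name, not the report's object).

```r
# Toy data with a few negative entries (values are invented).
wine_toy <- data.frame(
  VolatileAcidity = c(1.16, 0.16, -1.22, -0.22),
  Alcohol         = c(9.9, NA, 22.0, -4.7),
  LabelAppeal     = c(0L, -1L, -1L, 1L)
)

# Comparing a data frame with 0 gives a logical matrix;
# colSums() then counts the TRUE entries per column, skipping NAs.
neg_counts <- colSums(wine_toy < 0, na.rm = TRUE)
print(neg_counts)
```

Whether a negative count signals an error still depends on the variable: a negative Alcohol is impossible, while a negative LabelAppeal may simply be a low score.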
In summary, our exploration of the wine dataset has provided valuable insights into its structure, distributions, and potential challenges for predictive modeling. The target variable, TARGET, representing the number of wine cases purchased, is moderately variable and suitable for count regression models. Strong relationships with predictors like LabelAppeal and STARS suggest these variables are critical for predictive performance. However, missing data in several key variables must be addressed through imputation strategies, especially in STARS, an ordinal rating that is potentially influential. Negative values in chemical features like VolatileAcidity, Sulphates, and Alcohol likely indicate errors and need correction.
The dataset shows low multicollinearity between features, simplifying model assumptions, but high variability in chemical features such as ResidualSugar and FreeSulfurDioxide suggests potential outliers. Visualizations confirm meaningful relationships between predictors and the target variable, including the positive impact of LabelAppeal and higher STARS ratings on wine purchases. Transformations may be needed for skewed variables like AcidIndex and STARS, while mean, median, or KNN imputation can handle missing values in continuous variables.
Overall, the dataset presents a robust foundation for regression modeling, with the potential to yield actionable insights for predicting wine sales. Addressing data quality issues, handling missing values, and scaling features will be crucial next steps in preparing the data for reliable and interpretable modeling.
Let’s get a better sense of the number of missing values by plotting
how many missing values we have for each variable.
Below, we replace negative values with NA in the variables where they are invalid (CitricAcid, VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol). Treating them as missing data avoids carrying invalid measurements into the models. The counts below confirm that no negative values remain in these columns after the replacement.
## CitricAcid VolatileAcidity FreeSulfurDioxide TotalSulfurDioxide
## 0 0 0 0
## Sulphates Alcohol
## 0 0
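One way to carry out this replacement is sketched below on invented data. The column list `invalid_if_negative` is a hypothetical helper name, and only two of the six columns are included to keep the example short.

```r
# Toy data (values are invented).
wine_toy <- data.frame(
  CitricAcid  = c(-0.98, 0.04, 0.59, -0.40),
  Alcohol     = c(9.9, -4.7, 22.0, 6.2),
  LabelAppeal = c(0L, -1L, -1L, 1L)
)

# Columns in which a negative value is treated as invalid.
invalid_if_negative <- c("CitricAcid", "Alcohol")

# Replace negative entries with NA; which() skips existing NAs safely.
for (col in invalid_if_negative) {
  wine_toy[[col]][which(wine_toy[[col]] < 0)] <- NA
}

# Verify that no negatives remain in the cleaned columns.
remaining_negatives <- colSums(wine_toy[invalid_if_negative] < 0, na.rm = TRUE)
print(remaining_negatives)
```

LabelAppeal is deliberately left out of the list, so its negative scores are preserved.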
Here, we impute the missing values in the STARS variable (an ordinal rating) with the mode, as it reflects the most frequent value and keeps the variable on its original discrete scale. For the remaining variables, we use mean imputation to maintain the central tendency of the data. The counts below confirm that no missing values remain.
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 0 0 0
## TotalSulfurDioxide Density pH Sulphates
## 0 0 0 0
## Alcohol LabelAppeal AcidIndex STARS
## 0 0 0 0
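The imputation described above can be sketched as follows. Base R has no built-in statistical mode, so `stat_mode` is a hypothetical helper written for this example, and the data are invented.

```r
# Toy data (values are invented).
wine_toy <- data.frame(
  STARS   = c(2, 3, NA, 2, NA, 1),
  Alcohol = c(10, NA, 12, 14, NA, 8)
)

# Most frequent non-missing value. table() drops NAs by default and sorts
# its levels, so ties resolve to the smallest value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Mode imputation for the discrete rating.
wine_toy$STARS[is.na(wine_toy$STARS)] <- stat_mode(wine_toy$STARS)

# Mean imputation for the continuous variable.
wine_toy$Alcohol[is.na(wine_toy$Alcohol)] <- mean(wine_toy$Alcohol, na.rm = TRUE)

print(wine_toy)
```

Both strategies shrink the variance of the imputed column, which is worth keeping in mind when a variable has as many gaps as STARS does.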
We created some additional features:
- Alcohol_Bucket: Categorizes wine into alcohol content ranges (Low, Medium, High, Very High) based on the Alcohol variable.
- ResidualSugar_Bucket: Classifies wine into sweetness categories (Dry, Semi-Dry, Sweet, Very Sweet) based on the ResidualSugar level.
- Alcohol_to_Sulphates: A ratio of Alcohol to Sulphates, capturing the relationship between alcohol content and sulphate levels in the wine.
- Acidity_Index: A combined measure of acidity, calculated by summing FixedAcidity and VolatileAcidity.
- Sulphates_Alcohol_Interaction: The interaction term (product) between Sulphates and Alcohol, capturing their combined effect on the target.
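A sketch of how such features might be derived is shown below. The bucket cut points are assumptions chosen for illustration, since the report does not state the thresholds it used, and the data are invented.

```r
# Toy data (values are invented).
wine_toy <- data.frame(
  Alcohol         = c(8.5, 10.4, 12.9, 15.2),
  Sulphates       = c(0.50, 0.00, 0.75, 1.20),
  FixedAcidity    = c(7.1, 5.7, 8.0, 6.5),
  VolatileAcidity = c(0.16, 0.38, 0.33, 0.29)
)

# Hypothetical cut points; intervals are closed on the right by default.
wine_toy$Alcohol_Bucket <- cut(
  wine_toy$Alcohol,
  breaks = c(-Inf, 9, 11, 13, Inf),
  labels = c("Low", "Medium", "High", "Very High")
)

# Ratio, sum, and product features.
wine_toy$Alcohol_to_Sulphates <- wine_toy$Alcohol / wine_toy$Sulphates
wine_toy$Acidity_Index <- wine_toy$FixedAcidity + wine_toy$VolatileAcidity
wine_toy$Sulphates_Alcohol_Interaction <- wine_toy$Sulphates * wine_toy$Alcohol

# A zero denominator yields Inf, so ratio features need a sanity check.
n_infinite <- sum(is.infinite(wine_toy$Alcohol_to_Sulphates))
print(wine_toy)
print(n_infinite)
```

The second row shows why a ratio feature should be checked after creation: a Sulphates value of exactly zero produces an infinite ratio.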
To handle the skewness in STARS and
AcidIndex, we applied log and Box-Cox transformations.
The Box-Cox transformation is the most effective for AcidIndex. For STARS, both the log transformation and the Box-Cox transformation are suitable; however, the discrete nature of the variable limits its potential to achieve full normality.
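For reference, the Box-Cox transformation maps a positive value x to (x^lambda - 1) / lambda, with log(x) as the limiting case when lambda is 0. The sketch below implements that formula directly. The value `lambda_assumed` is a placeholder and not the lambda estimated in this analysis, and taking the natural log of STARS is likewise an assumption about how Log_STARS was built.

```r
# Box-Cox transformation for strictly positive inputs.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Toy values within the observed ranges of AcidIndex and STARS.
acid_index <- c(4, 7, 8, 8, 11, 17)
stars <- c(1, 2, 2, 3, 4)

lambda_assumed <- -0.5  # hypothetical; in practice lambda is estimated from the data
boxcox_acid <- box_cox(acid_index, lambda_assumed)
log_stars <- log(stars)  # assumes a plain natural log

print(boxcox_acid)
print(log_stars)
```

In practice lambda is usually chosen by maximizing a profile likelihood, for example with `MASS::boxcox()`. The transformation preserves the ordering of the values, so it changes the shape of the distribution without changing ranks.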
## Alcohol_to_Sulphates Acidity_Index Sulphates_Alcohol_Interaction
## Negative 0 1460 0
## na 0 0 0
## nan 0 0 0
## inf 22 0 0
In the data preparation process, negative values and missing values were addressed. First, negative values in columns where they are invalid (e.g., CitricAcid, VolatileAcidity, Alcohol) were replaced with NA. Missing values in the STARS variable were then imputed with the mode, while missing values in the continuous variables were imputed using the mean. Transformations were applied to handle skewness in the AcidIndex and STARS variables, with the Box-Cox transformation being most effective for AcidIndex. The STARS variable benefitted from both log and Box-Cox transformations, although its discrete nature limited normality. New features were created, including the Alcohol-to-Sulphates ratio, Acidity Index, and Sulphates × Alcohol interaction, and were checked for data issues: the check above found 22 infinite values in Alcohol_to_Sulphates and 1,460 negative values in Acidity_Index. With these cleaning and transformation steps, the dataset is prepared for modeling.
##
## Call:
## glm(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex +
## VolatileAcidity + TotalSulfurDioxide + FreeSulfurDioxide +
## Chlorides + Alcohol + Sulphates + CitricAcid + pH + Density,
## family = poisson(link = "log"), data = wine_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.2902196 0.3551942 14.894 < 2e-16 ***
## Log_STARS 0.8344675 0.0115972 71.954 < 2e-16 ***
## LabelAppeal 0.1391214 0.0059964 23.201 < 2e-16 ***
## BoxCox_AcidIndex -6.1055399 0.3968569 -15.385 < 2e-16 ***
## VolatileAcidity -0.0577085 0.0107004 -5.393 6.92e-08 ***
## TotalSulfurDioxide 0.0001336 0.0000349 3.829 0.000129 ***
## FreeSulfurDioxide 0.0001508 0.0000565 2.669 0.007606 **
## Chlorides -0.0398285 0.0164476 -2.422 0.015455 *
## Alcohol 0.0032991 0.0014887 2.216 0.026681 *
## Sulphates -0.0189675 0.0090807 -2.089 0.036728 *
## CitricAcid 0.0153793 0.0094866 1.621 0.104983
## pH -0.0125060 0.0076460 -1.636 0.101917
## Density -0.2926834 0.1919023 -1.525 0.127217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 13972 on 12782 degrees of freedom
## AIC: 45940
##
## Number of Fisher Scoring iterations: 5
The Poisson regression model uses predictors such as
Log_STARS, LabelAppeal,
BoxCox_AcidIndex, and others to predict the target
variable. Statistically significant variables (p-value < 0.05) like
Log_STARS, LabelAppeal, and
Alcohol show clear associations with the target. The
model improves on the intercept-only fit, reducing the deviance from 22861 (null) to
13972 (residual). CitricAcid, pH, and Density are not significant
at the 0.05 level.
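To make the null versus residual deviance comparison concrete, here is a minimal Poisson fit on simulated data. The data-generating coefficients and the data frame `toy` are invented for this sketch and are unrelated to the wine estimates.

```r
set.seed(1)
n <- 500

# Simulated predictors on the same scales as STARS and LabelAppeal.
stars <- sample(1:4, n, replace = TRUE)
label_appeal <- sample(-2:2, n, replace = TRUE)

# Counts generated from a log-linear mean (coefficients are made up).
mu <- exp(0.3 + 0.8 * log(stars) + 0.15 * label_appeal)
toy <- data.frame(
  TARGET      = rpois(n, mu),
  Log_STARS   = log(stars),
  LabelAppeal = label_appeal
)

# Poisson regression with the canonical log link.
fit <- glm(TARGET ~ Log_STARS + LabelAppeal,
           family = poisson(link = "log"), data = toy)

# Null deviance: intercept-only model. Residual deviance: fitted model.
deviance_drop <- fit$null.deviance - fit$deviance
print(c(null = fit$null.deviance, residual = fit$deviance, drop = deviance_drop))
```

A large drop relative to the number of added parameters indicates that the predictors explain a meaningful share of the variation in the counts.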
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol +
## Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS | 1, data = wine_data)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.7759 -0.3925 0.1307 0.5070 3.9430
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.633e+00 3.803e-01 12.183 < 2e-16 ***
## VolatileAcidity -4.519e-02 1.118e-02 -4.041 5.33e-05 ***
## CitricAcid 9.914e-03 9.884e-03 1.003 0.31584
## Chlorides -3.336e-02 1.712e-02 -1.949 0.05130 .
## FreeSulfurDioxide 1.192e-04 5.796e-05 2.057 0.03967 *
## TotalSulfurDioxide 7.532e-05 3.528e-05 2.135 0.03277 *
## Density -2.943e-01 2.005e-01 -1.468 0.14215
## pH -6.338e-03 7.959e-03 -0.796 0.42583
## Alcohol 4.696e-03 1.545e-03 3.040 0.00236 **
## Sulphates -1.291e-02 9.451e-03 -1.366 0.17205
## LabelAppeal 1.756e-01 6.759e-03 25.979 < 2e-16 ***
## BoxCox_AcidIndex -4.909e+00 4.330e-01 -11.336 < 2e-16 ***
## Log_STARS 6.162e-01 1.874e-02 32.873 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.20364 0.06409 -34.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 22
## Log-likelihood: -2.277e+04 on 14 Df
The Poisson regression model shows that several predictors, including LabelAppeal and Log_STARS, are strongly associated with the target variable, with very small p-values. Other significant predictors are BoxCox_AcidIndex, VolatileAcidity, TotalSulfurDioxide, FreeSulfurDioxide, Chlorides, Alcohol, and Sulphates, while CitricAcid, pH, and Density are not significant at the 0.05 level. The drop from the null to the residual deviance indicates that the predictors improve substantially on the intercept-only model. In the zero-inflated Poisson (ZIP) model, the count model coefficients largely mirror those from the Poisson regression, although Chlorides and Sulphates are no longer significant at the 0.05 level. The zero-inflation component has a significant intercept of -2.20 on the logit scale, which corresponds to an estimated probability of about 10% that an observation is an excess zero.
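The zero-inflation intercept is reported on the logit scale, so converting it to a probability takes one call to the inverse logit. The sketch below uses the intercept from the ZIP output above; the helper `zip_prob_zero` and the mean of 3 are illustrative choices.

```r
# Intercept of the zero-inflation component, taken from the ZIP summary.
zero_infl_intercept <- -2.20364

# Inverse logit: estimated probability of an excess (structural) zero.
p_structural_zero <- plogis(zero_infl_intercept)

# Under a ZIP model, a zero arises either structurally or from the
# Poisson part: P(Y = 0) = p + (1 - p) * exp(-mu).
zip_prob_zero <- function(p, mu) p + (1 - p) * dpois(0, mu)

print(p_structural_zero)
print(zip_prob_zero(p_structural_zero, mu = 3))
```

Because the zero-inflation formula is intercept-only (`| 1`), this probability is the same for every wine regardless of its characteristics.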
Below, we visualize fitted versus observed values for the Poisson model.
The Poisson model captures the general trend, but the points spread noticeably away from the red reference line, indicating a lack of perfect fit. The residual deviance of 13972 on 12782 degrees of freedom (a ratio of about 1.09) suggests that any overdispersion, where the variance exceeds the mean, is mild. The model tends to underpredict at higher observed values, suggesting that some important predictors or interaction terms may be missing. While the model performs reasonably well for lower counts, its accuracy diminishes as observed values increase, highlighting areas for potential improvement in modeling higher count observations.
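A quick numerical check for overdispersion compares a dispersion statistic with 1. The sketch below first uses the deviance and degrees of freedom reported in the Poisson summary above, then shows a Pearson-based version on simulated data. The helper `pearson_dispersion` and the simulated data are illustrative assumptions.

```r
# Values reported in the Poisson summary above.
residual_deviance <- 13972
residual_df <- 12782
deviance_ratio <- residual_deviance / residual_df

# Pearson dispersion statistic for any fitted glm object.
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

# Simulated Poisson data, for which the statistic should be close to 1.
set.seed(42)
toy <- data.frame(x = rnorm(300))
toy$y <- rpois(300, exp(0.5 + 0.3 * toy$x))
toy_fit <- glm(y ~ x, family = poisson, data = toy)
toy_dispersion <- pearson_dispersion(toy_fit)

print(deviance_ratio)
print(toy_dispersion)
```

Values well above 1 would suggest overdispersion and favor a negative binomial or quasi-Poisson model; values near 1 are consistent with the Poisson assumption.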
##
## Call:
## glm.nb(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex +
## VolatileAcidity + TotalSulfurDioxide + FreeSulfurDioxide +
## Chlorides + Alcohol + Sulphates + CitricAcid + pH + Density,
## data = wine_data, init.theta = 44991.79975, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.298e+00 3.556e-01 14.900 < 2e-16 ***
## Log_STARS 8.343e-01 1.161e-02 71.861 < 2e-16 ***
## LabelAppeal 1.390e-01 6.002e-03 23.167 < 2e-16 ***
## BoxCox_AcidIndex -6.118e+00 3.973e-01 -15.397 < 2e-16 ***
## VolatileAcidity -5.767e-02 1.071e-02 -5.387 7.17e-08 ***
## TotalSulfurDioxide 1.351e-04 3.491e-05 3.870 0.000109 ***
## FreeSulfurDioxide 1.544e-04 5.658e-05 2.728 0.006367 **
## Chlorides -4.005e-02 1.646e-02 -2.434 0.014950 *
## Alcohol 3.314e-03 1.490e-03 2.225 0.026100 *
## Sulphates -1.893e-02 9.098e-03 -2.081 0.037446 *
## CitricAcid 1.593e-02 9.494e-03 1.678 0.093326 .
## pH -1.227e-02 7.652e-03 -1.604 0.108707
## Density -2.932e-01 1.921e-01 -1.526 0.126937
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(44991.8) family taken to be 1)
##
## Null deviance: 22819 on 12772 degrees of freedom
## Residual deviance: 13945 on 12760 degrees of freedom
## AIC: 45863
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 44992
## Std. Err.: 41032
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45834.95
For the Negative Binomial model, the minimum AIC is obtained with the same set of predictors as in the Poisson model, so we keep those variables in this model too.
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol +
## Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, data = wine_data,
## init.theta = 44992.15985, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.298e+00 3.556e-01 14.900 < 2e-16 ***
## VolatileAcidity -5.767e-02 1.071e-02 -5.387 7.17e-08 ***
## CitricAcid 1.593e-02 9.494e-03 1.678 0.093327 .
## Chlorides -4.005e-02 1.646e-02 -2.434 0.014950 *
## FreeSulfurDioxide 1.544e-04 5.658e-05 2.728 0.006367 **
## TotalSulfurDioxide 1.351e-04 3.491e-05 3.870 0.000109 ***
## Density -2.932e-01 1.921e-01 -1.526 0.126937
## pH -1.227e-02 7.652e-03 -1.604 0.108707
## Alcohol 3.314e-03 1.490e-03 2.225 0.026100 *
## Sulphates -1.893e-02 9.098e-03 -2.081 0.037446 *
## LabelAppeal 1.390e-01 6.002e-03 23.167 < 2e-16 ***
## BoxCox_AcidIndex -6.118e+00 3.973e-01 -15.397 < 2e-16 ***
## Log_STARS 8.343e-01 1.161e-02 71.861 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(44992.16) family taken to be 1)
##
## Null deviance: 22819 on 12772 degrees of freedom
## Residual deviance: 13945 on 12760 degrees of freedom
## AIC: 45863
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 44992
## Std. Err.: 41033
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45834.95
The Negative Binomial model produces fitted values that are almost indistinguishable from those of the Poisson model. The estimated theta is very large (about 44,992, with the iteration limit reached while fitting it), so the extra variance term is negligible and the coefficients are nearly identical to the Poisson estimates. As with the Poisson model, there is still some underprediction of high values and slight overprediction of low values, suggesting opportunities for further refinement.
The model performs well for mid-range TARGET values, with fitted values closely aligning with the observed distribution in this range. It allows for overdispersion in principle. However, it struggles with the excess zeros, indicating that it does not fully capture the characteristics of the data at the lower end.
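The role of theta can be checked with a short calculation. Under the parametrization used by `glm.nb`, the variance is mu + mu^2 / theta, so a very large theta makes the model behave like a Poisson model. The sketch below plugs in the sample mean of TARGET and the theta reported above; the function name `nb_variance` is introduced only for this example.

```r
# Negative binomial (NB2) variance as a function of the mean and theta.
nb_variance <- function(mu, theta) mu + mu^2 / theta

mu <- 3.029        # sample mean of TARGET from the summary statistics
theta_hat <- 44992 # theta reported in the glm.nb output

# Ratio of NB variance to Poisson variance (the Poisson variance equals mu).
variance_ratio <- nb_variance(mu, theta_hat) / mu
print(variance_ratio)

# For contrast, a small theta implies strong overdispersion.
print(nb_variance(mu, theta = 1) / mu)
```

With theta this large the ratio is essentially 1, which is consistent with the two models reporting practically the same coefficients and deviance.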
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol +
## Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS | 1, data = wine_data,
## dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.7757 -0.3921 0.1309 0.5069 3.9448
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.640e+00 3.807e-01 12.189 < 2e-16 ***
## VolatileAcidity -4.515e-02 1.119e-02 -4.035 5.45e-05 ***
## CitricAcid 1.043e-02 9.892e-03 1.055 0.29154
## Chlorides -3.357e-02 1.712e-02 -1.960 0.04994 *
## FreeSulfurDioxide 1.226e-04 5.803e-05 2.113 0.03464 *
## TotalSulfurDioxide 7.644e-05 3.528e-05 2.166 0.03028 *
## Density -2.957e-01 2.007e-01 -1.473 0.14074
## pH -6.095e-03 7.965e-03 -0.765 0.44414
## Alcohol 4.713e-03 1.545e-03 3.050 0.00229 **
## Sulphates -1.294e-02 9.469e-03 -1.367 0.17166
## LabelAppeal 1.755e-01 6.765e-03 25.946 < 2e-16 ***
## BoxCox_AcidIndex -4.919e+00 4.335e-01 -11.345 < 2e-16 ***
## Log_STARS 6.160e-01 1.876e-02 32.846 < 2e-16 ***
## Log(theta) 1.777e+01 1.037e+00 17.135 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.20334 0.06409 -34.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 52405508.723
## Number of iterations in BFGS optimization: 73
## Log-likelihood: -2.273e+04 on 15 Df
Compared to the Negative Binomial model, the Zero-Inflated Negative Binomial model appears to handle the excess zeros and low count values better. Points for TARGET = 0 align closer to the fitted values, indicating that the ZINB model better accounts for the zero-inflation in the data. Similar to the Negative Binomial model, the ZINB model still struggles to predict higher observed counts such as TARGET > 5. The fitted values for these points are consistently below the red line, indicating underprediction.
##
## Call:
## lm(formula = TARGET ~ Log_STARS + LabelAppeal + BoxCox_AcidIndex +
## VolatileAcidity + TotalSulfurDioxide + Alcohol + Chlorides +
## FreeSulfurDioxide + Sulphates + CitricAcid + Density + pH,
## data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -0.8455 0.0218 0.8310 6.3083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.555e+01 8.019e-01 19.386 < 2e-16 ***
## Log_STARS 2.204e+00 2.270e-02 97.074 < 2e-16 ***
## LabelAppeal 4.751e-01 1.340e-02 35.456 < 2e-16 ***
## BoxCox_AcidIndex -1.766e+01 8.889e-01 -19.863 < 2e-16 ***
## VolatileAcidity -1.682e-01 2.340e-02 -7.191 6.80e-13 ***
## TotalSulfurDioxide 3.859e-04 8.050e-05 4.794 1.65e-06 ***
## Alcohol 1.468e-02 3.393e-03 4.327 1.53e-05 ***
## Chlorides -1.181e-01 3.743e-02 -3.156 0.00160 **
## FreeSulfurDioxide 4.153e-04 1.310e-04 3.170 0.00153 **
## Sulphates -4.672e-02 2.042e-02 -2.288 0.02218 *
## CitricAcid 4.947e-02 2.189e-02 2.260 0.02383 *
## Density -8.313e-01 4.382e-01 -1.897 0.05784 .
## pH -2.880e-02 1.740e-02 -1.655 0.09794 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312 on 12760 degrees of freedom
## Multiple R-squared: 0.5366, Adjusted R-squared: 0.5362
## F-statistic: 1231 on 12 and 12760 DF, p-value: < 2.2e-16
We can see that Log_STARS, LabelAppeal,
BoxCox_AcidIndex, VolatileAcidity,
TotalSulfurDioxide, Alcohol,
Chlorides, FreeSulfurDioxide,
Sulphates, pH, CitricAcid, and
Density together give the minimum AIC for the linear
regression model, so we kept these variables in this model.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol +
## Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -0.8455 0.0218 0.8310 6.3083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.555e+01 8.019e-01 19.386 < 2e-16 ***
## VolatileAcidity -1.682e-01 2.340e-02 -7.191 6.80e-13 ***
## CitricAcid 4.947e-02 2.189e-02 2.260 0.02383 *
## Chlorides -1.181e-01 3.743e-02 -3.156 0.00160 **
## FreeSulfurDioxide 4.153e-04 1.310e-04 3.170 0.00153 **
## TotalSulfurDioxide 3.859e-04 8.050e-05 4.794 1.65e-06 ***
## Density -8.313e-01 4.382e-01 -1.897 0.05784 .
## pH -2.880e-02 1.740e-02 -1.655 0.09794 .
## Alcohol 1.468e-02 3.393e-03 4.327 1.53e-05 ***
## Sulphates -4.672e-02 2.042e-02 -2.288 0.02218 *
## LabelAppeal 4.751e-01 1.340e-02 35.456 < 2e-16 ***
## BoxCox_AcidIndex -1.766e+01 8.889e-01 -19.863 < 2e-16 ***
## Log_STARS 2.204e+00 2.270e-02 97.074 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312 on 12760 degrees of freedom
## Multiple R-squared: 0.5366, Adjusted R-squared: 0.5362
## F-statistic: 1231 on 12 and 12760 DF, p-value: < 2.2e-16
The multiple linear regression model shows that several predictors, such as VolatileAcidity, CitricAcid, Chlorides, LabelAppeal, and Log_STARS, significantly impact the TARGET variable. The model explains 53.66% of the variance in the data, as indicated by the R-squared value. Although the model performs reasonably well, the residual standard error of 1.312 suggests some room for improvement. The observed vs predicted plot reveals some discrepancies, especially for higher values of TARGET.
## [1] "Optimal Lambda: 0.0133517693664588"
## Coefficient Variable
## 1 1.190563e+01 (Intercept)
## 4 -1.261571e-01 VolatileAcidity
## 5 1.984741e-02 CitricAcid
## 7 -6.416365e-02 Chlorides
## 8 2.277633e-04 FreeSulfurDioxide
## 9 2.673729e-04 TotalSulfurDioxide
## 10 -2.619490e-01 Density
## 11 -6.438560e-03 pH
## 12 -2.035883e-02 Sulphates
## 13 8.626826e-03 Alcohol
## 14 3.839345e-01 LabelAppeal
## 15 -1.042935e-02 AcidIndex
## 27 -1.381986e+01 BoxCox_AcidIndex
## 28 1.955161e+00 Log_STARS
## 31 1.443776e-01 fitted_zinb
## 32 9.864509e-01 residuals_stepwise
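The Lasso model was fit with an optimal lambda of about 0.0134. Besides the original predictors, the retained coefficients include fitted_zinb and residuals_stepwise, which, judging by their names, are fitted values and residuals from earlier models and are therefore derived from TARGET itself. Their presence likely explains the near-perfect fit reported for the Lasso model below, which should not be read as genuine predictive performance.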
The two zero-inflated models gave almost identical metrics, so we decided to add predictors to the zero-inflation portion of the formula (the right-hand side of the |). This allows the model to take into account the influence of predictors on the excess zero counts, which could make the models more differentiated.
## Model RMSE MAE McFadden_R_squared AIC
## 1 Poisson 1.332053 1.082923 0.1622001 45860.59
## 2 Negative Binomial 1.332054 1.082925 0.1621936 45862.95
## 3 Zero-Inflated Poisson 1.919950 1.565351 0.1076144 48836.88
## 4 Zero-Inflated NegBin 1.919950 1.565351 0.1076143 48838.89
## Log_Likelihood Deviance
## 1 -22917.30 13945.98646
## 2 -22917.47 13945.48276
## 3 -24410.44 1.91995
## 4 -24410.44 1.91995
Regular R-squared works well for linear models but isn’t always useful for Poisson or other GLMs. McFadden’s R-squared is better suited for GLMs and count models (like Poisson or Negative Binomial), providing a better fit metric for these types of models.
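McFadden's R-squared is defined as 1 minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. The sketch below computes it for a Poisson fit on simulated data; the helper name `mcfadden_r2` and the data are assumptions made for illustration.

```r
# McFadden's pseudo R-squared from a fitted model and its null model.
mcfadden_r2 <- function(fit, null_fit) {
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
}

# Simulated count data (coefficients are made up).
set.seed(7)
toy <- data.frame(x = rnorm(400))
toy$y <- rpois(400, exp(0.4 + 0.5 * toy$x))

fit      <- glm(y ~ x, family = poisson, data = toy)
null_fit <- glm(y ~ 1, family = poisson, data = toy)

r2 <- mcfadden_r2(fit, null_fit)
print(r2)
```

Values of this statistic are typically much smaller than an ordinary R-squared for a comparably useful model, so the two should not be compared on the same scale.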
## RMSE MAE R_squared Model
## 1 0.05781604 0.04489417 0.9990991 Lasso Regression
## 2 1.31127180 1.02382501 0.5365918 Multiple Linear Regression
| Model | RMSE | MAE | McFadden_R_squared | AIC |
|---|---|---|---|---|
| Poisson | 1.332053 | 1.082923 | 0.1622001 | 45860.59 |
| Negative Binomial | 1.332054 | 1.082925 | 0.1621936 | 45862.95 |
| Zero-Inflated Poisson | 1.919950 | 1.565351 | 0.1076144 | 48836.88 |
| Zero-Inflated NegBin | 1.919950 | 1.565351 | 0.1076143 | 48838.89 |
The McFadden R-squared of the Negative Binomial model (0.1621936) is almost identical to that of the Poisson model (0.1622001) and higher than that of the zero-inflated models. This is a modest value for count data models, but it is still informative. The zero-inflated models show much lower values (around 0.1076), suggesting that they fit the data less well than the Poisson or Negative Binomial models. The RMSE for the Negative Binomial model (1.332054) is nearly identical to that of the Poisson model (1.332053). Both are substantially lower than the RMSE of the zero-inflated models (Zero-Inflated Poisson and Zero-Inflated Negative Binomial), which is around 1.92. The MAE values for the Poisson (1.082923) and Negative Binomial (1.082925) models are also nearly identical, indicating that both models are equally good in terms of the average absolute error between predicted and actual values. Again, the zero-inflated models have higher MAE values (around 1.565), further suggesting that the Poisson and Negative Binomial models are a better fit.
The Lasso regression model has a much lower RMSE (0.05781604) and MAE (0.04489417), but these figures are not comparable with the others: its predictor set includes fitted_zinb and residuals_stepwise, which appear to be derived from TARGET, and Lasso as fit here treats the outcome as continuous rather than as a count. Multiple linear regression (RMSE of 1.311272 and MAE of 1.023825) has slightly lower in-sample errors than the Poisson and Negative Binomial models. However, linear regression is not well suited to count data because it assumes continuous, normally distributed errors and can produce negative or fractional predictions, so we do not select it.
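For clarity, the two error metrics used in this comparison can be written as short functions. The observed and predicted vectors below are invented solely to show the calculation.

```r
# Root mean squared error: penalizes large errors more heavily.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Mean absolute error: average size of the errors.
mae <- function(observed, predicted) mean(abs(observed - predicted))

# Toy example (values are invented).
observed  <- c(3, 0, 5, 4)
predicted <- c(2.5, 1.0, 4.0, 4.5)

print(rmse(observed, predicted))
print(mae(observed, predicted))
```

For a fitted count model, `predicted` would be the fitted means on the response scale, for example from `predict(fit, type = "response")`.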
In conclusion, we consider the Negative Binomial model the most suitable for our count data. Its error metrics (RMSE and MAE) and fit statistics (AIC and log-likelihood) are practically identical to those of the Poisson model, which has a marginally lower AIC (45860.59 versus 45862.95), and both clearly outperform the zero-inflated models. The Negative Binomial model offers more flexibility by allowing the variance to exceed the mean, making it a safer choice for real-world count data with overdispersion, although the very large estimated theta indicates little overdispersion in this dataset. The zero-inflated models as specified here fit worse, and the Lasso and linear regression models are not suited to count data.
Using the training data set, we evaluate the performance of the count regression model. The metrics and summary below are for the Poisson fit.
## $RMSE
## [1] 1.332053
##
## $MAE
## [1] 1.082923
##
## $McFadden_R2
## [1] 0.1622001
##
## $AIC
## [1] 45860.59
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Alcohol +
##     Sulphates + LabelAppeal + BoxCox_AcidIndex + Log_STARS, family = poisson(link = "log"),
##     data = wine_data)
##
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)         5.298e+00  3.556e-01  14.900  < 2e-16 ***
## VolatileAcidity    -5.767e-02  1.070e-02  -5.387 7.17e-08 ***
## CitricAcid          1.593e-02  9.494e-03   1.678 0.093318 .
## Chlorides          -4.005e-02  1.646e-02  -2.434 0.014948 *
## FreeSulfurDioxide   1.544e-04  5.658e-05   2.728 0.006367 **
## TotalSulfurDioxide  1.351e-04  3.491e-05   3.870 0.000109 ***
## Density            -2.932e-01  1.921e-01  -1.526 0.126928
## pH                 -1.227e-02  7.652e-03  -1.604 0.108714
## Alcohol             3.314e-03  1.489e-03   2.225 0.026090 *
## Sulphates          -1.893e-02  9.098e-03  -2.081 0.037445 *
## LabelAppeal         1.390e-01  6.001e-03  23.169  < 2e-16 ***
## BoxCox_AcidIndex   -6.117e+00  3.973e-01 -15.398  < 2e-16 ***
## Log_STARS           8.343e-01  1.161e-02  71.864  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22820 on 12772 degrees of freedom
## Residual deviance: 13946 on 12760 degrees of freedom
## AIC: 45861
##
## Number of Fisher Scoring iterations: 5
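As a rough dispersion check, the residual deviance can be divided by its degrees of freedom, using the two values printed in the summary above. This is a quick heuristic under the usual assumption that a ratio near 1 is consistent with the Poisson variance assumption; a Pearson-based statistic is often preferred for low counts. The variable names are illustrative.

```r
# Values copied from the summary output above
residual_deviance <- 13946
residual_df <- 12760

# Ratio near 1 suggests little overdispersion; well above 1 suggests more
dispersion_ratio <- residual_deviance / residual_df
print(round(dispersion_ratio, 3))
```

For a fitted model object, the same quantity could be obtained with `deviance(model) / df.residual(model)`.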
Interpretation of Coefficients
LabelAppeal: This variable has a highly significant, positive coefficient. For every one-unit increase in LabelAppeal, the log of the expected count of cases sold increases by 0.1390, which corresponds to a 15% increase in the expected count. Improving wine label design and marketing appeal can drive sales.
Log_STARS: This is the strongest predictor of wine sales. A one-unit increase in Log_STARS (coefficient 0.8343) leads to a 130% increase in expected sales. Highlighting wine ratings and reviews is crucial for increasing demand.
BoxCox_AcidIndex: A significant negative coefficient indicates that a higher BoxCox_AcidIndex strongly decreases sales. A one-unit increase reduces expected sales by about 99.8%. Note that the transformed variable spans far less than one unit in practice (for example, about 0.797 to 0.855 for AcidIndex values of 6 to 10 in the prediction table below), so realistic changes have a much smaller effect. Ensuring optimal acidity levels in wine production is essential for market success.
FreeSulfurDioxide and TotalSulfurDioxide: Both variables have small but statistically significant positive effects. A one-unit increase in either FreeSulfurDioxide or TotalSulfurDioxide slightly increases expected sales. These compounds should be optimized for preservation without negatively impacting quality.
Alcohol: A positive and significant coefficient shows that higher alcohol content slightly increases sales. A one-unit increase in Alcohol leads to a 0.33% rise in expected sales. Marketing wines with balanced alcohol content may enhance appeal.
Sulphates: A significant negative coefficient suggests that higher Sulphates reduce sales. A one-unit increase decreases the expected count by about 1.9%. Sulphate levels should be carefully monitored to avoid adversely affecting customer preference.
VolatileAcidity: A significant negative effect, with expected sales falling by approximately 5.6% per unit increase. Maintaining low volatile acidity supports better product perception.
Chlorides: A significant negative effect, reducing expected sales by 3.9% per unit increase.
CitricAcid, Density, and pH do not show statistically significant effects at the 5% level in this model, suggesting they may not play a critical role in predicting wine sales under the given conditions.
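The percentage effects quoted in the interpretation follow from exponentiating the log-scale coefficients: a one-unit increase in a predictor multiplies the expected count by exp(coefficient). The short sketch below reproduces that arithmetic from the estimates in the Poisson summary output; the vector name `coefs` is illustrative.

```r
# Estimates copied from the Poisson summary output (log scale)
coefs <- c(
  VolatileAcidity = -5.767e-02,
  Chlorides = -4.005e-02,
  Alcohol = 3.314e-03,
  Sulphates = -1.893e-02,
  LabelAppeal = 1.390e-01,
  BoxCox_AcidIndex = -6.117e+00,
  Log_STARS = 8.343e-01
)

# Percent change in the expected count per one-unit increase in each predictor
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 2))
```

For small coefficients the percent change is close to 100 times the coefficient itself, which is why Alcohol (0.003314) maps to roughly 0.33%.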
Conclusion
In conclusion, factors such as label appeal, wine quality ratings, acidity, sulfur dioxide levels, and alcohol content play significant roles in driving wine sales. The model provides actionable insights for producers to optimize product characteristics and marketing strategies to maximize sales.
## 'data.frame': 3335 obs. of 16 variables:
## $ IN : int 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET : logi NA NA NA NA NA NA ...
## $ FixedAcidity : num 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
## $ VolatileAcidity : num -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
## $ CitricAcid : num 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
## $ ResidualSugar : num -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
## $ Chlorides : num 0.092 1.169 0.065 -0.179 0.038 ...
## $ FreeSulfurDioxide : num 23 -37 9 104 70 -250 10 115 35 40 ...
## $ TotalSulfurDioxide: num 398 68 76 89 53 140 17 381 83 129 ...
## $ Density : num 0.985 0.99 1.046 0.989 1.029 ...
## $ pH : num 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
## $ Sulphates : num 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
## $ Alcohol : num 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
## $ LabelAppeal : int -1 0 0 -1 0 1 0 1 0 0 ...
## $ AcidIndex : int 6 6 8 8 10 8 12 7 12 7 ...
## $ STARS : int NA 2 1 1 NA 4 3 NA NA NA ...
##   IN    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1  3 1.0726150          5.4          -0.860       0.27         -10.7     0.092
## 2  9 2.6196534         12.4           0.385      -0.76         -19.7     1.169
## 3 10 1.4081823          7.2           1.750       0.17         -33.0     0.065
## 4 18 1.4479202          6.2           0.100       1.80           1.0    -0.179
## 5 21 0.7891426         11.4           0.210       0.28           1.2     0.038
## 6 30 3.8311621         17.6           0.040      -1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2               -37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54     -0.07    4.80
## 6              -250                140 0.95028 3.06     -0.02   11.40
##   LabelAppeal AcidIndex STARS Log_STARS BoxCox_AcidIndex predicted_nb
## 1          -1         6     0 0.0000000        0.7968244    1.0726050
## 2           0         6     2 1.0986123        0.7968244    2.6196274
## 3           0         8     1 0.6931472        0.8331799    1.4081535
## 4          -1         8     1 0.6931472        0.8331799    1.4478975
## 5           0        10     0 0.0000000        0.8545985    0.7891245
## 6           1         8     4 1.6094379        0.8331799    3.8311098
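A scoring step like the one that produced the prediction table above can be sketched as follows. This is an illustration on toy data, not the report's pipeline: the helper `prep()` and the data frames `train` and `eval_data` are hypothetical. The STARS handling is inferred from the table, where missing STARS appear as 0 and Log_STARS equals log(STARS + 1); the Box-Cox step is omitted because its lambda is not stated.

```r
# Hypothetical preprocessing helper: impute missing STARS as 0, then log(STARS + 1)
prep <- function(df) {
  df$STARS <- ifelse(is.na(df$STARS), 0, df$STARS)
  df$Log_STARS <- log1p(df$STARS)
  df
}

# Toy training data standing in for wine_data (assumption: simulated)
set.seed(7)
train <- prep(data.frame(
  LabelAppeal = sample(-2:2, 300, replace = TRUE),
  STARS = sample(c(NA, 1:4), 300, replace = TRUE)
))
train$TARGET <- rpois(300, lambda = exp(0.2 + 0.15 * train$LabelAppeal +
                                          0.8 * train$Log_STARS))

fit <- glm(TARGET ~ LabelAppeal + Log_STARS,
           family = poisson(link = "log"), data = train)

# Apply the same preprocessing to the evaluation records, then predict counts
eval_data <- prep(data.frame(LabelAppeal = c(-1, 0, 1), STARS = c(NA, 2, 4)))
eval_data$predicted <- predict(fit, newdata = eval_data, type = "response")
print(eval_data)
```

Using `type = "response"` returns predictions on the count scale, so they are non-negative but not necessarily whole numbers, as in the table above.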