Wine Count Regression

Author

Darwhin Gomez

Wine Count Regression

Overview In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).

Variable Name Definition Theoretical Effect
INDEX Identification variable (do not use) None
TARGET Number of cases purchased None
AcidIndex Proprietary method of testing total acidity of wine using a weighted average
Alcohol Alcohol content
Chlorides Chloride content of wine
CitricAcid Citric acid content
Density Density of wine
FixedAcidity Fixed acidity of wine
FreeSulfurDioxide Sulfur dioxide content of wine
LabelAppeal Marketing score indicating label appeal. High: customers like it. Negative: customers dislike it. Higher label appeal suggests higher sales.
ResidualSugar Residual sugar of wine
STARS Wine rating by expert panel (1–4 stars) Higher number of stars suggests higher sales.
Sulphates Sulfate content of wine
TotalSulfurDioxide Total sulfur dioxide content of wine
VolatileAcidity Volatile acid content of wine
pH pH of wine

Data Exploration

In this section, describe the structure of the wine training dataset, highlight important summary statistics, identify missing values, examine relationships with the target variable, and show key visualizations.

Data-set Overview

The training dataset contains 12,795 observations and 14 predictor variables, and the target variable (TARGET). The variables include chemical measurements such as acidity, sulfur content, pH levels, alcohol concentration, and marketing-related attributes like LabelAppeal and STARS. The evaluation set holds 3,335 observations, there are all numeric variables with LableAppeal, AcidIndex, Stars being of the int type the other 11 predictors are numeric type.

Summary Statistics

Provide mean, median, standard deviation, and distribution shape for key variables.

Code
wine_data<- wine_data|>
  dplyr::select(-INDEX)
summary(wine_data)
     TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
 Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
 1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
 Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
 Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
 3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
 Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
                                                                      
 ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
 Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
 1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
 Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
 Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
 3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
 Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
 NA's   :616        NA's   :638       NA's   :647       NA's   :682       
    Density             pH          Sulphates          Alcohol     
 Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
 1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
 Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
 Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
 3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
 Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
                  NA's   :395     NA's   :1210      NA's   :653    
  LabelAppeal          AcidIndex          STARS      
 Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
 1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
 Median : 0.000000   Median : 8.000   Median :2.000  
 Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
 3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
 Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
                                      NA's   :3359   
Code
wine_test<- wine_test|>
  dplyr::select(-TARGET)
summary(wine_test)
       IN         FixedAcidity     VolatileAcidity     CitricAcid     
 Min.   :    3   Min.   :-18.200   Min.   :-2.8300   Min.   :-3.1200  
 1st Qu.: 4018   1st Qu.:  5.200   1st Qu.: 0.0800   1st Qu.: 0.0000  
 Median : 7906   Median :  6.900   Median : 0.2800   Median : 0.3100  
 Mean   : 8048   Mean   :  6.864   Mean   : 0.3103   Mean   : 0.3124  
 3rd Qu.:12061   3rd Qu.:  9.000   3rd Qu.: 0.6300   3rd Qu.: 0.6050  
 Max.   :16130   Max.   : 33.500   Max.   : 3.6100   Max.   : 3.7600  
                                                                      
 ResidualSugar        Chlorides        FreeSulfurDioxide TotalSulfurDioxide
 Min.   :-128.300   Min.   :-1.15000   Min.   :-563.00   Min.   :-769.00   
 1st Qu.:  -2.600   1st Qu.: 0.01600   1st Qu.:   3.00   1st Qu.:  27.25   
 Median :   3.600   Median : 0.04700   Median :  30.00   Median : 124.00   
 Mean   :   5.319   Mean   : 0.06143   Mean   :  34.95   Mean   : 123.41   
 3rd Qu.:  17.200   3rd Qu.: 0.17100   3rd Qu.:  79.25   3rd Qu.: 210.00   
 Max.   : 145.400   Max.   : 1.26300   Max.   : 617.00   Max.   :1004.00   
 NA's   :168        NA's   :138        NA's   :152       NA's   :157       
    Density             pH          Sulphates          Alcohol     
 Min.   :0.8898   Min.   :0.600   Min.   :-3.0700   Min.   :-4.20  
 1st Qu.:0.9883   1st Qu.:2.980   1st Qu.: 0.3300   1st Qu.: 9.00  
 Median :0.9946   Median :3.210   Median : 0.5000   Median :10.40  
 Mean   :0.9947   Mean   :3.237   Mean   : 0.5346   Mean   :10.58  
 3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.: 0.8200   3rd Qu.:12.50  
 Max.   :1.0998   Max.   :6.210   Max.   : 4.1800   Max.   :25.60  
                  NA's   :104     NA's   :310       NA's   :185    
  LabelAppeal         AcidIndex          STARS     
 Min.   :-2.00000   Min.   : 5.000   Min.   :1.00  
 1st Qu.:-1.00000   1st Qu.: 7.000   1st Qu.:1.00  
 Median : 0.00000   Median : 8.000   Median :2.00  
 Mean   : 0.01349   Mean   : 7.748   Mean   :2.04  
 3rd Qu.: 1.00000   3rd Qu.: 8.000   3rd Qu.:3.00  
 Max.   : 2.00000   Max.   :17.000   Max.   :4.00  
                                     NA's   :841   
Code
library(corrplot)
corrplot 0.95 loaded
Code
corr_p<-corrplot(cor_mat, method = "color", type = "upper")

Code
par(mfrow=c(2,3))
for (v in names(wine_data)[names(wine_data) != "TARGET"]) {
  hist(wine_data[[v]], main=v, col="lightgrey")
}

Code
hist(wine_data$TARGET)

Most variables in the dataset display fairly balanced, centered distributions, consistent with standardized chemical measures. FixedAcidity, VolatileAcidity, CitricAcid, Density, pH, Alcohol, and similar predictors do not exhibit strong skew and therefore do not require transformation.

AcidIndex is the main exception, showing a noticeably right-skewed distribution. Because of this, we will bucket AcidIndex into categories rather than apply a mathematical transform, as bucketing provides a more stable and interpretable way to handle its nonlinearity.

The STARS rating variable is discrete (1–4) with many missing values. Instead of treating missing STARS as zeros or excluding them, we will recode missing values as “Not Rated”, so wines without an expert rating are not implicitly penalized or interpreted as having low quality.

Other variables, including LabelAppeal, are well-behaved and can be used in their original numeric form. Thus, only AcidIndex and STARS require special treatment in the preparation step.

The predictor variables also show minimal correlation with one another.

The TARGET variable ranges from 0 to 8 and shows a moderately right-skewed distribution. The most common purchase counts fall between 2 and 5 cases, with a noticeable spike at 0 but no severe zero inflation. Very few wines reach the upper end of the range, and the distribution decreases steadily after 4 cases. Overall, the shape of TARGET is consistent with a typical count outcome, supporting the use of Poisson and Negative Binomial regression models.

Data Preparation

Code
acid_bucket_plt

Code
 STARS_plt

Code
# Variables with missing values
vars_with_na <- c(
  "ResidualSugar",
  "Chlorides",
  "FreeSulfurDioxide",
  "TotalSulfurDioxide",
  "Sulphates",
  "Alcohol",
  "pH"
)

# 1. Create missingness flags
for (v in vars_with_na) {
  wine_data[[paste0(v, "_missing")]] <- ifelse(is.na(wine_data[[v]]), 1, 0)
}

# 2. Median imputation
for (v in vars_with_na) {
  med <- median(wine_data[[v]], na.rm = TRUE)
  wine_data[[v]][is.na(wine_data[[v]])] <- med
}
train_medians <- sapply(
  wine_data[, vars_with_na],
  median,
  na.rm = TRUE
)
# 3. Verify no missing values remain
colSums(is.na(wine_data))
                    TARGET               FixedAcidity 
                         0                          0 
           VolatileAcidity                 CitricAcid 
                         0                          0 
             ResidualSugar                  Chlorides 
                         0                          0 
         FreeSulfurDioxide         TotalSulfurDioxide 
                         0                          0 
                   Density                         pH 
                         0                          0 
                 Sulphates                    Alcohol 
                         0                          0 
               LabelAppeal                  AcidIndex 
                         0                          0 
                     STARS           AcidIndex_bucket 
                         0                          0 
             STARS_missing      ResidualSugar_missing 
                         0                          0 
         Chlorides_missing  FreeSulfurDioxide_missing 
                         0                          0 
TotalSulfurDioxide_missing          Sulphates_missing 
                         0                          0 
           Alcohol_missing                 pH_missing 
                         0                          0 
Code
test_acid_plt<-ggplot(wine_test, aes(x = AcidIndex_bucket)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "AcidIndex Buckets in Test Data",
       x = "AcidIndex Bucket",
       y = "Count")
test_stars_plt<-ggplot(wine_test, aes(x = STARS)) +
  geom_bar(fill = "darkred") +
  theme_minimal() +
  labs(title = "STARS Ratings in Test Data",
       x = "STARS Category",
       y = "Count")

test_stars_plt + test_acid_plt

Code
set.seed(123) 

# total number of rows
n <- nrow(wine_data)

# randomly sample 75% of the data for training
train_idx <- sample(seq_len(n), size = 0.75 * n)

# create training and validation sets
wine_train <- wine_data[train_idx, ]
wine_valid <- wine_data[-train_idx, ]


cat("Training rows:", nrow(wine_train), "\n")
Training rows: 9596 
Code
cat("Validation rows:", nrow(wine_valid), "\n")
Validation rows: 3199 

Several chemical variables contained missing values, including ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol. For each of these, a binary missingness indicator was created, as missing measurements may carry predictive value. The original numeric values were then imputed using the median, which maintains the shape of each distribution while preventing the loss of observations. All data preparation was then replicated on the test set. Finally the training data set was split 75-25 for model development.

Building Models

In this section we will explore different models to predict our target we will begin with a baseline linear model, then a multiple linear regression followed by possion and count models.

Code
### ============================================================
### Linear Regression Model 1 (Baseline)
### ============================================================

lm1 <- lm(TARGET ~ STARS + LabelAppeal, data = wine_train)
summary(lm1)

Call:
lm(formula = TARGET ~ STARS + LabelAppeal, data = wine_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6534 -0.8097  0.1606  0.7848  5.8759 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.24784    0.02673   46.68   <2e-16 ***
STARS1       1.45917    0.03863   37.77   <2e-16 ***
STARS2       2.52924    0.03722   67.96   <2e-16 ***
STARS3       3.15337    0.04279   73.69   <2e-16 ***
STARS4       3.85136    0.06803   56.62   <2e-16 ***
LabelAppeal  0.43815    0.01592   27.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.332 on 9590 degrees of freedom
Multiple R-squared:  0.5221,    Adjusted R-squared:  0.5218 
F-statistic:  2095 on 5 and 9590 DF,  p-value: < 2.2e-16
Code
# Evaluate on validation set
pred_lm1 <- predict(lm1, newdata = wine_valid)

mae_lm1  <- mean(abs(wine_valid$TARGET - pred_lm1))
rmse_lm1 <- sqrt(mean((wine_valid$TARGET - pred_lm1)^2))
add_results("LM1_Baseline", "Linear", lm1, wine_valid, pred_lm1)
cat("LM1 MAE:", mae_lm1, "\n")
LM1 MAE: 1.075446 
Code
cat("LM1 RMSE:", rmse_lm1, "\n")
LM1 RMSE: 1.35614 
Code
### ============================================================
### Linear Regression Model 2 (Expanded)
### ============================================================

lm2 <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + Alcohol, 
          data = wine_train)

summary(lm2)

Call:
lm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + 
    Alcohol, data = wine_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8482 -0.8858  0.0855  0.7617  5.6638 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.480921   0.063081  23.476  < 2e-16 ***
STARS1                    1.382811   0.038017  36.374  < 2e-16 ***
STARS2                    2.419329   0.036841  65.669  < 2e-16 ***
STARS3                    3.007103   0.042538  70.693  < 2e-16 ***
STARS4                    3.705497   0.067032  55.279  < 2e-16 ***
LabelAppeal               0.460043   0.015623  29.446  < 2e-16 ***
AcidIndex_bucketModerate -0.138504   0.044698  -3.099  0.00195 ** 
AcidIndex_bucketHigh     -0.574929   0.054135 -10.620  < 2e-16 ***
AcidIndex_bucketVeryHigh -1.292294   0.078810 -16.397  < 2e-16 ***
Alcohol                   0.008279   0.003690   2.244  0.02488 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.304 on 9586 degrees of freedom
Multiple R-squared:  0.542, Adjusted R-squared:  0.5416 
F-statistic:  1260 on 9 and 9586 DF,  p-value: < 2.2e-16
Code
# Evaluate on validation set
pred_lm2 <- predict(lm2, newdata = wine_valid)

mae_lm2  <- mean(abs(wine_valid$TARGET - pred_lm2))
rmse_lm2 <- sqrt(mean((wine_valid$TARGET - pred_lm2)^2))
add_results("LM2_Expanded", "Linear", lm2, wine_valid, pred_lm2)
cat("LM2 MAE:", mae_lm2, "\n")
LM2 MAE: 1.036195 
Code
cat("LM2 RMSE:", rmse_lm2, "\n")
LM2 RMSE: 1.32258 
Code
pois1 <- glm(
  TARGET ~ STARS + LabelAppeal ,
  family = poisson(link = "log"),
  data = wine_train
)
pred_pois1 <- predict(pois1, newdata = wine_valid, type = "response")
add_results("Poisson Base", "Poisson", pois1, wine_valid, pred_pois1)


# predictions on validation

summary(pois1)

Call:
glm(formula = TARGET ~ STARS + LabelAppeal, family = poisson(link = "log"), 
    data = wine_train)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.176676   0.018426   9.588   <2e-16 ***
STARS1      0.806472   0.022597  35.690   <2e-16 ***
STARS2      1.143451   0.020982  54.497   <2e-16 ***
STARS3      1.278676   0.021953  58.246   <2e-16 ***
STARS4      1.398505   0.027603  50.664   <2e-16 ***
LabelAppeal 0.150672   0.007007  21.504   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 17109  on 9595  degrees of freedom
Residual deviance: 10456  on 9590  degrees of freedom
AIC: 34464

Number of Fisher Scoring iterations: 6
Code
pois2 <- glm(
  TARGET ~ .,
  family = poisson(link = "log"),
  data = wine_train
)

pred_pois2 <- predict(pois2, newdata = wine_valid, type = "response")

add_results("Poisson2", "Poisson", pois2, wine_valid, pred_pois2)

summary(pois2)

Call:
glm(formula = TARGET ~ ., family = poisson(link = "log"), data = wine_train)

Coefficients: (1 not defined because of singularities)
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 9.134e-01  2.334e-01   3.913 9.11e-05 ***
FixedAcidity               -1.072e-04  9.485e-04  -0.113  0.90998    
VolatileAcidity            -3.076e-02  7.530e-03  -4.085 4.41e-05 ***
CitricAcid                  8.006e-03  6.821e-03   1.174  0.24049    
ResidualSugar               4.275e-05  1.781e-04   0.240  0.81024    
Chlorides                  -5.168e-02  1.896e-02  -2.726  0.00641 ** 
FreeSulfurDioxide           1.069e-04  4.090e-05   2.613  0.00899 ** 
TotalSulfurDioxide          6.004e-05  2.633e-05   2.280  0.02258 *  
Density                    -3.835e-01  2.212e-01  -1.734  0.08295 .  
pH                         -1.295e-02  8.940e-03  -1.449  0.14738    
Sulphates                  -8.452e-03  6.599e-03  -1.281  0.20024    
Alcohol                     2.331e-03  1.631e-03   1.429  0.15301    
LabelAppeal                 1.568e-01  7.030e-03  22.299  < 2e-16 ***
AcidIndex                  -3.589e-02  1.194e-02  -3.007  0.00264 ** 
STARS1                      7.701e-01  2.267e-02  33.966  < 2e-16 ***
STARS2                      1.090e+00  2.116e-02  51.512  < 2e-16 ***
STARS3                      1.212e+00  2.224e-02  54.490  < 2e-16 ***
STARS4                      1.330e+00  2.789e-02  47.674  < 2e-16 ***
AcidIndex_bucketModerate    7.621e-03  2.601e-02   0.293  0.76952    
AcidIndex_bucketHigh       -8.445e-02  4.624e-02  -1.827  0.06777 .  
AcidIndex_bucketVeryHigh   -4.184e-01  8.700e-02  -4.809 1.52e-06 ***
STARS_missing                      NA         NA      NA       NA    
ResidualSugar_missing       3.788e-02  2.662e-02   1.423  0.15469    
Chlorides_missing          -9.076e-03  2.667e-02  -0.340  0.73362    
FreeSulfurDioxide_missing   1.066e-02  2.617e-02   0.407  0.68389    
TotalSulfurDioxide_missing  1.304e-02  2.600e-02   0.501  0.61616    
Sulphates_missing          -8.413e-03  2.029e-02  -0.415  0.67841    
Alcohol_missing             5.783e-03  2.680e-02   0.216  0.82918    
pH_missing                 -3.480e-02  3.421e-02  -1.017  0.30902    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 17109  on 9595  degrees of freedom
Residual deviance: 10100  on 9568  degrees of freedom
AIC: 34152

Number of Fisher Scoring iterations: 6
Code
nb1 <- glm.nb(
  TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + Alcohol,
  data = wine_train
)
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Code
pred_nb1 <- predict(nb1, newdata = wine_valid, type = "response")

add_results("NegBin1", "NegBin", nb1, wine_valid, pred_nb1)

summary(nb1)

Call:
glm.nb(formula = TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + 
    Alcohol, data = wine_train, init.theta = 41085.18864, link = log)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               0.266190   0.030779   8.649   <2e-16 ***
STARS1                    0.775293   0.022656  34.221   <2e-16 ***
STARS2                    1.099172   0.021113  52.061   <2e-16 ***
STARS3                    1.222140   0.022184  55.090   <2e-16 ***
STARS4                    1.342853   0.027801  48.303   <2e-16 ***
LabelAppeal               0.156683   0.007022  22.312   <2e-16 ***
AcidIndex_bucketModerate -0.044978   0.018646  -2.412   0.0159 *  
AcidIndex_bucketHigh     -0.204917   0.024245  -8.452   <2e-16 ***
AcidIndex_bucketVeryHigh -0.645294   0.047941 -13.460   <2e-16 ***
Alcohol                   0.002416   0.001629   1.483   0.1379    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(41085.19) family taken to be 1)

    Null deviance: 17108  on 9595  degrees of freedom
Residual deviance: 10158  on 9586  degrees of freedom
AIC: 34176

Number of Fisher Scoring iterations: 1

              Theta:  41085 
          Std. Err.:  39724 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -34154.19 
Code
nb2 <- glm.nb(
  TARGET ~  .,
  data = wine_train
)
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Code
pred_nb2 <- predict(nb2, newdata = wine_valid, type = "response")

add_results("NegBin2", "NegBin", nb2, wine_valid, pred_nb2)

summary(nb2)

Call:
glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 41353.29552, 
    link = log)

Coefficients: (1 not defined because of singularities)
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 9.134e-01  2.334e-01   3.913 9.11e-05 ***
FixedAcidity               -1.072e-04  9.486e-04  -0.113  0.90998    
VolatileAcidity            -3.076e-02  7.530e-03  -4.085 4.41e-05 ***
CitricAcid                  8.007e-03  6.821e-03   1.174  0.24050    
ResidualSugar               4.276e-05  1.781e-04   0.240  0.81022    
Chlorides                  -5.168e-02  1.896e-02  -2.726  0.00641 ** 
FreeSulfurDioxide           1.069e-04  4.090e-05   2.613  0.00899 ** 
TotalSulfurDioxide          6.005e-05  2.633e-05   2.280  0.02258 *  
Density                    -3.835e-01  2.212e-01  -1.734  0.08295 .  
pH                         -1.295e-02  8.941e-03  -1.449  0.14737    
Sulphates                  -8.453e-03  6.599e-03  -1.281  0.20024    
Alcohol                     2.331e-03  1.632e-03   1.429  0.15305    
LabelAppeal                 1.568e-01  7.031e-03  22.298  < 2e-16 ***
AcidIndex                  -3.589e-02  1.194e-02  -3.007  0.00264 ** 
STARS1                      7.701e-01  2.267e-02  33.966  < 2e-16 ***
STARS2                      1.090e+00  2.116e-02  51.511  < 2e-16 ***
STARS3                      1.212e+00  2.224e-02  54.489  < 2e-16 ***
STARS4                      1.330e+00  2.789e-02  47.672  < 2e-16 ***
AcidIndex_bucketModerate    7.622e-03  2.601e-02   0.293  0.76953    
AcidIndex_bucketHigh       -8.445e-02  4.624e-02  -1.826  0.06778 .  
AcidIndex_bucketVeryHigh   -4.184e-01  8.700e-02  -4.809 1.52e-06 ***
STARS_missing                      NA         NA      NA       NA    
ResidualSugar_missing       3.788e-02  2.662e-02   1.423  0.15471    
Chlorides_missing          -9.076e-03  2.667e-02  -0.340  0.73362    
FreeSulfurDioxide_missing   1.066e-02  2.617e-02   0.407  0.68391    
TotalSulfurDioxide_missing  1.304e-02  2.600e-02   0.501  0.61617    
Sulphates_missing          -8.414e-03  2.029e-02  -0.415  0.67840    
Alcohol_missing             5.782e-03  2.680e-02   0.216  0.82919    
pH_missing                 -3.480e-02  3.421e-02  -1.017  0.30902    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(41353.3) family taken to be 1)

    Null deviance: 17108  on 9595  degrees of freedom
Residual deviance: 10100  on 9568  degrees of freedom
AIC: 34154

Number of Fisher Scoring iterations: 1

              Theta:  41353 
          Std. Err.:  40018 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -34096.03 
Code
# Example ZINB model
zinb1 <- zeroinfl(
  TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + Alcohol |
           STARS + LabelAppeal,      # zero-inflation part
  data = wine_train,
  dist = "negbin"
)
print(zinb1)

Call:
zeroinfl(formula = TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + 
    Alcohol | STARS + LabelAppeal, data = wine_train, dist = "negbin")

Count model coefficients (negbin with log link):
             (Intercept)                    STARS1                    STARS2  
                1.092310                  0.068357                  0.201405  
                  STARS3                    STARS4               LabelAppeal  
                0.297727                  0.390896                  0.231822  
AcidIndex_bucketModerate      AcidIndex_bucketHigh  AcidIndex_bucketVeryHigh  
               -0.035173                 -0.106168                 -0.205610  
                 Alcohol  
                0.006101  
Theta = 46248909.7777 

Zero-inflation model coefficients (binomial with logit link):
(Intercept)       STARS1       STARS2       STARS3       STARS4  LabelAppeal  
     0.4684      -2.1405      -6.0446     -20.3516     -20.5150       0.7747  
Code
summary(zinb1)

Call:
zeroinfl(formula = TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + 
    Alcohol | STARS + LabelAppeal, data = wine_train, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-2.36557 -0.46279  0.01323  0.39424  4.30254 

Count model coefficients (negbin with log link):
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.092310   0.031759  34.393  < 2e-16 ***
STARS1                    0.068357   0.024647   2.773 0.005546 ** 
STARS2                    0.201405   0.022994   8.759  < 2e-16 ***
STARS3                    0.297727   0.024053  12.378  < 2e-16 ***
STARS4                    0.390896   0.029484  13.258  < 2e-16 ***
LabelAppeal               0.231822   0.007260  31.932  < 2e-16 ***
AcidIndex_bucketModerate -0.035173   0.018870  -1.864 0.062321 .  
AcidIndex_bucketHigh     -0.106168   0.024806  -4.280 1.87e-05 ***
AcidIndex_bucketVeryHigh -0.205610   0.055296  -3.718 0.000201 ***
Alcohol                   0.006101   0.001661   3.673 0.000240 ***
Log(theta)               17.649548   2.513523   7.022 2.19e-12 ***

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.46842    0.04698   9.971   <2e-16 ***
STARS1       -2.14047    0.08680 -24.660   <2e-16 ***
STARS2       -6.04460    0.45269 -13.353   <2e-16 ***
STARS3      -20.35157  399.42431  -0.051    0.959    
STARS4      -20.51499  723.15028  -0.028    0.977    
LabelAppeal   0.77465    0.04779  16.210   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 46248909.7777 
Number of iterations in BFGS optimization: 59 
Log-likelihood: -1.544e+04 on 17 Df
Code
pred_zinb <- predict(zinb1, newdata = wine_valid, type = "response")


# Compute MAE and RMSE for ZINB
mae_zinb  <- mean(abs(wine_valid$TARGET - pred_zinb))
rmse_zinb <- sqrt(mean((wine_valid$TARGET - pred_zinb)^2))

# Extract AIC properly
aic_zinb <- AIC(zinb1)

# Add row to results
new_row <- data.frame(
  Model   = "ZINB1",
  Type    = "ZINB",
  R2      = NA,             # not defined for ZINB
  Adj_R2  = NA,             # not applicable
  MAE     = mae_zinb,
  RMSE    = rmse_zinb,
  AIC     = aic_zinb,
  stringsAsFactors = FALSE
)

model_results <- rbind(model_results, new_row)

Selecting Model

Across all models evaluated, the Zero-Inflated Negative Binomial model (ZINB1) demonstrated the strongest overall performance with the lowest AIC (30,923.52) and the lowest MAE and RMSE among all count-based models. Although the expanded linear regression model achieved competitive error metrics, linear models violate distributional assumptions for discrete count data and cannot serve as the final deployed model. They are retained only as benchmarks for comparison.

Among the traditional count regression models, the Negative Binomial and Poisson variants performed reasonably well, but none matched the predictive accuracy or fit of the ZINB model. Both Poisson models showed higher AIC values and larger prediction errors, reflecting their inability to accommodate overdispersion. The Negative Binomial models improved on this but still failed to capture the substantial probability mass at zero: despite roughly 20% of the training data containing zero purchases, standard Poisson and NB models systematically underpredicted zeros due to the log-link structure, which forces strictly positive expected values.

Introducing a zero-inflation component addressed this deficiency. The ZINB model explicitly separates the data-generating process into (1) a structural zero mechanism and (2) a count process for wines that are actually ordered. This two-part structure yielded superior predictive accuracy and a substantially better AIC than the NB or Poisson models. It also allowed the model to produce realistic zero predictions, something the standard GLM-based count models could not accomplish.

Given its substantially better model fit, stronger predictive accuracy, and more realistic handling of zero outcomes, ZINB1 was selected as the final model for generating predictions on the evaluation dataset.

Code
model_results|>
  arrange((AIC))
         Model    Type        R2    Adj_R2      MAE     RMSE      AIC
1        ZINB1    ZINB        NA        NA 1.010027 1.320891 30923.52
2 LM2_Expanded  Linear 0.5419982 0.5415682 1.036195 1.322580 32343.47
3 LM1_Baseline  Linear 0.5220872 0.5218380 1.075446 1.356140 32743.83
4     Poisson2 Poisson        NA        NA 1.029192 1.313099 34151.71
5      NegBin2  NegBin        NA        NA 1.029193 1.313099 34154.03
6      NegBin1  NegBin        NA        NA 1.024112 1.311981 34176.19
7 Poisson Base Poisson        NA        NA 1.044851 1.334146 34463.51
Code
aic_mod_plt<-ggplot(model_results, aes(x = Model, y = AIC, )) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = "AIC Comparison Across Models",
       x = "Model",
       y = "AIC")
aic_zoom_plt<-ggplot(model_results, aes(x = Model, y = AIC, )) +
  geom_col() +
  coord_flip(xlim = NULL, ylim = c(min(model_results$AIC) - 1050,
                                   max(model_results$AIC) + 1050)) +
  theme_minimal() +
  labs(title = "AIC Comparison (Zoomed In)",
       x = "Model",
       y = "AIC")

aic_mod_plt + aic_zoom_plt

Code
# Extract residuals
res_raw     <- residuals(zinb1, type = "response")
res_pearson <- residuals(zinb1, type = "pearson")

# Extract fitted values
fitted_vals <- predict(zinb1, type = "response")



# --- 1. Residual Distribution ---
hist(res_raw,
     breaks = 30,
     col = "lightgrey",
     border = "white",
     main = "ZINB Residual Distribution",
     xlab = "Raw Residuals")

Code
# --- 2. QQ Plot ---
qqnorm(res_pearson,
       main = "ZINB QQ Plot (Pearson Residuals)",
       pch = 19, col = "#3366AA60")
qqline(res_pearson, col = "red", lwd = 2)

Code
# --- 3. Fitted vs Actual ---
plot(fitted_vals, wine_train$TARGET,
     pch = 19, col = "#3366AA60",
     xlab = "Fitted Values (ZINB)",
     ylab = "Actual TARGET",
     main = "Fitted vs Actual")
abline(0, 1, col = "red", lwd = 2)

Predictions

Code
# 7. Inspect predictions
head(predsdf, 20)
   INDEX PREDICTED_TARGET
1      3                3
2      9                4
3     10                3
4     18                3
5     21                0
6     30                6
7     31                3
8     37                0
9     39                0
10    47                0
11    60                3
12    62                0
13    63                3
14    64                0
15    68                2
16    75                3
17    76                3
18    83                0
19    87                4
20    92                5
Code
# 8. Plot prediction distribution
pred_dis <-ggplot(predsdf, aes(x = PREDICTED_TARGET)) +
  geom_histogram(binwidth = 1, fill = "lightgrey", color = "white") +
  labs(title = "Final TARGET Predictions (ZINB)",
       x = "Predicted Case Sales")

# 9. Save results for submission
write.csv(predsdf, "final_predictions_ZINB.csv", row.names = FALSE)

Conclusion

Code
hist(wine_train$TARGET)

Code
hist(predsdf$PREDICTED_TARGET)

The comparison between the training TARGET distribution and the ZINB model’s predicted distribution shows that the model successfully reproduces the overall shape, center, and spread of the observed case-sales behavior. Both histograms exhibit their highest frequencies in the 2–5 case range, indicating that the model accurately captures the most common purchasing patterns seen in the training data.

The model also correctly preserves the long right tail, with non-zero predictions extending up to 7–8 cases, matching the maximum observed values in the training set. Importantly, unlike standard Poisson or Negative Binomial models both of which failed to generate zeros, the ZINB model produces a realistic number of zero predictions driven by its explicit zero-inflation component. Although the height of the zero bar is slightly lower than in the training distribution, the overall magnitude is directionally consistent and represents a substantial improvement over traditional count GLMs.

Across the remaining count values, the predicted distribution aligns closely with the empirical one, with only minor differences in relative frequencies. The slight smoothing around counts 3–5 reflects the expected behavior of a probabilistic regression model producing mean-based predictions rather than exact replications of sample frequencies.

Overall, the ZINB model provides a well-calibrated approximation of the true TARGET distribution, capturing both the central mass and the zero-inflation present in the data. This confirms that the model is appropriately tuned for deployment and capable of generating realistic predictions for the evaluation dataset.

Important Predictors

Code
print(summary(zinb1))

Call:
zeroinfl(formula = TARGET ~ STARS + LabelAppeal + AcidIndex_bucket + 
    Alcohol | STARS + LabelAppeal, data = wine_train, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-2.36557 -0.46279  0.01323  0.39424  4.30254 

Count model coefficients (negbin with log link):
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.092310   0.031759  34.393  < 2e-16 ***
STARS1                    0.068357   0.024647   2.773 0.005546 ** 
STARS2                    0.201405   0.022994   8.759  < 2e-16 ***
STARS3                    0.297727   0.024053  12.378  < 2e-16 ***
STARS4                    0.390896   0.029484  13.258  < 2e-16 ***
LabelAppeal               0.231822   0.007260  31.932  < 2e-16 ***
AcidIndex_bucketModerate -0.035173   0.018870  -1.864 0.062321 .  
AcidIndex_bucketHigh     -0.106168   0.024806  -4.280 1.87e-05 ***
AcidIndex_bucketVeryHigh -0.205610   0.055296  -3.718 0.000201 ***
Alcohol                   0.006101   0.001661   3.673 0.000240 ***
Log(theta)               17.649548   2.513523   7.022 2.19e-12 ***

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.46842    0.04698   9.971   <2e-16 ***
STARS1       -2.14047    0.08680 -24.660   <2e-16 ***
STARS2       -6.04460    0.45269 -13.353   <2e-16 ***
STARS3      -20.35157  399.42431  -0.051    0.959    
STARS4      -20.51499  723.15028  -0.028    0.977    
LabelAppeal   0.77465    0.04779  16.210   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 46248909.7777 
Number of iterations in BFGS optimization: 59 
Log-likelihood: -1.544e+04 on 17 Df

In the count model, STARS ratings and LabelAppeal are the strongest positive predictors of case sales. Higher STARS levels significantly increase expected purchases, and LabelAppeal has a large positive effect, confirming that expert ratings and marketing appeal drive higher order volumes. Alcohol content also has a small positive effect. In contrast, wines in the High and Very High AcidIndex buckets show significantly lower expected sales, indicating that high acidity reduces demand.

In the zero-inflation model, STARS ratings greatly reduce the likelihood of a structural zero, meaning expert-rated wines are far less likely to receive no orders at all. LabelAppeal also significantly decreases the chance of a zero outcome. Together, these results show that both expert ratings and label strength influence not only how many cases a wine sells, but also whether it is ordered in the first place.