Wine Prediction
Introduction
This analysis builds a count regression model to predict the number of sample wine cases purchased by distributors, based on the chemical and marketing properties of roughly 12,000 commercially available wines. The target variable, cases purchased, is a non-negative integer, making count regression the natural modeling framework. We work through four stages: exploratory data analysis to understand the data, preparation to handle missingness and anomalies, model building across three families (Poisson, Negative Binomial, and Linear), and finally model selection based on fit, parsimony, and predictive performance. A key theme throughout is that two variables, expert star ratings and label appeal, dominate the prediction, while the chemical properties of the wine play a much smaller role than one might expect.
Data Exploration
Load & Split Data
# Read the data, hold out 20% for testing, and drop the INDEX identifier
df <- read.csv("wine-data-1-1.csv")
set.seed(123)
idx <- sample(nrow(df), round(0.8 * nrow(df)))
train <- df[idx, !names(df) %in% "INDEX"]
test <- df[-idx, !names(df) %in% "INDEX"]
Summary Statistics
library(stargazer)
stargazer(train, type = "text",
title = "Summary Statistics",
digits = 2,
summary.stat = c("n", "mean", "sd", "min", "median", "max"))
Summary Statistics
=================================================================
Statistic N Mean St. Dev. Min Median Max
-----------------------------------------------------------------
TARGET 10,236 3.03 1.93 0 3 8
FixedAcidity 10,236 7.09 6.33 -18.00 6.90 34.40
VolatileAcidity 10,236 0.33 0.78 -2.79 0.28 3.68
CitricAcid 10,236 0.31 0.86 -3.24 0.31 3.86
ResidualSugar 9,742 5.28 33.94 -127.80 3.90 141.15
Chlorides 9,711 0.05 0.32 -1.17 0.05 1.35
FreeSulfurDioxide 9,697 30.90 147.31 -546.00 30.00 622.00
TotalSulfurDioxide 9,704 121.82 231.04 -823.00 124.00 1,057.00
Density 10,236 0.99 0.03 0.89 0.99 1.10
pH 9,915 3.21 0.68 0.48 3.20 6.05
Sulphates 9,276 0.53 0.94 -3.13 0.50 4.24
Alcohol 9,717 10.50 3.72 -4.70 10.40 26.10
LabelAppeal 10,236 -0.01 0.90 -2 0 2
AcidIndex 10,236 7.78 1.34 4 8 17
STARS 7,548 2.05 0.90 1 2 4
-----------------------------------------------------------------
The training set has 10,236 observations and 15 variables. A few things jump out right away. STARS is missing for about 26% of observations, by far the biggest gap, and we'll see shortly that it's also the strongest predictor of TARGET. LabelAppeal is clean, fully observed, and looks like another key driver. On the chemistry side, several variables have clearly erroneous negative values (negative alcohol content, negative sulfur dioxide), which we'll deal with in Part 2. And TARGET itself has a mean of ~3 but drops all the way to 0, which hints at possible zero-inflation that will influence our model choice later.
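As a quick numeric check on the zero-inflation hunch, a minimal sketch (TARGET is fully observed, so no NA handling is needed):
# Share of zero-count wines, and the mean/variance relationship Poisson assumes
mean(train$TARGET == 0)
c(mean = mean(train$TARGET), var = var(train$TARGET))
From the summary table above, the raw variance (sd of 1.93 squared, about 3.7) runs above the mean of 3.03, consistent with the zero spike, though what the count models below actually test is the conditional variance once predictors are included.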
Missing Values
miss <- data.frame(
Variable = names(train),
Missing = sapply(train, function(x) sum(is.na(x))),
Pct = round(sapply(train, function(x) mean(is.na(x))) * 100, 1)
)
miss[miss$Missing > 0, ]
                                 Variable Missing  Pct
ResidualSugar ResidualSugar 494 4.8
Chlorides Chlorides 525 5.1
FreeSulfurDioxide FreeSulfurDioxide 539 5.3
TotalSulfurDioxide TotalSulfurDioxide 532 5.2
pH pH 321 3.1
Sulphates Sulphates 960 9.4
Alcohol Alcohol 519 5.1
STARS STARS 2688 26.3
STARS has by far the most missing data at 26.3%, while the remaining variables with missingness are all under 10%, Sulphates being the worst of that group at 9.4%. For the chemical variables we'll impute with the median, which is straightforward. STARS is a different story: wines without a rating are likely not unrated at random, so rather than relying on imputation alone we'll create a binary flag to preserve that signal. More on this in Part 2.
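A quick way to probe the not-missing-at-random hypothesis is to compare average sales for rated versus unrated wines; a minimal sketch:
# Mean TARGET split by whether STARS is missing
tapply(train$TARGET, is.na(train$STARS), mean)
If unrated wines average noticeably fewer cases, the missingness itself carries signal, which is exactly what the binary flag in Part 2 is designed to capture.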
TARGET Distribution
library(ggplot2)
ggplot(train, aes(x = TARGET)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of TARGET", x = "Cases Purchased", y = "Count") +
  theme_minimal()
TARGET ranges from 0 to 8 with a mean of ~3. The most striking feature is the large spike at 0: about 21% of wines had zero cases purchased, followed by a dip at 1 and then a second peak around 4. This bimodal shape suggests two distinct groups: wines that distributors passed on entirely, and wines that sold reasonably well. This is a classic sign of zero-inflation, meaning a standard Poisson model may not be enough and zero-inflated variants are worth considering.
Key Predictors
par(mfrow = c(1, 2))
barplot(table(train$STARS), main = "STARS Distribution",
        xlab = "Stars", ylab = "Count", col = "steelblue")
barplot(table(train$LabelAppeal), main = "LabelAppeal Distribution",
        xlab = "Label Appeal", ylab = "Count", col = "steelblue")
Most wines are rated 1 or 2 stars, with very few reaching 4, so high-rated wines are rare in this dataset. The LabelAppeal distribution is roughly symmetric and centered at 0, meaning most wines have a neutral label and relatively few sit at the extremes. Both variables will be treated as ordinal in modeling; higher values are better for sales in both cases, which aligns with the theoretical expectations laid out in the assignment.
Correlation Plot
library(corrplot)
corrplot(cor(train, use = "pairwise.complete.obs"),
         method = "color", type = "lower",
         tl.cex = 0.7, tl.col = "black")
The correlation plot confirms what we expected. STARS has the strongest positive correlation with TARGET, followed by LabelAppeal, while AcidIndex stands out as the most notable negative correlator. The chemical variables (FixedAcidity, VolatileAcidity, CitricAcid, Chlorides, and the rest) are mostly weakly correlated with TARGET but show some correlation with each other, which is expected since they all measure related properties of the same wine. The big takeaway is that STARS and LabelAppeal are doing most of the work, while the chemistry variables will likely add only marginal predictive value.
Data Preparation
Missing STARS
train$STARS_missing <- ifelse(is.na(train$STARS), 1, 0)
test$STARS_missing  <- ifelse(is.na(test$STARS), 1, 0)
Before imputing, we create a flag for missing STARS. Unrated wines are likely not missing at random; they are probably lower quality or less established, which itself predicts lower sales. Dropping or blindly imputing this would throw away a useful signal. All missing values (STARS included, now that the flag preserves its signal) are then filled with the training median, applied consistently to both train and test to avoid leakage.
Median Imputation
# Impute remaining NAs with the training median (applied to both splits)
for (col in names(train)) {
  med <- median(train[[col]], na.rm = TRUE)
  train[[col]][is.na(train[[col]])] <- med
  test[[col]][is.na(test[[col]])] <- med
}
All variables with missing values are imputed using the training median. The median is preferred over the mean here since several variables have skewed distributions and outliers that would pull the mean in the wrong direction.
Implausible Negatives
neg_vars <- c("Alcohol", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "CitricAcid", "Chlorides", "ResidualSugar")
# Flag physically impossible negative readings rather than dropping them
for (col in neg_vars) {
  train[[paste0(col, "_neg")]] <- ifelse(train[[col]] < 0, 1, 0)
  test[[paste0(col, "_neg")]]  <- ifelse(test[[col]] < 0, 1, 0)
}
Several chemical variables contain negative values that are physically impossible, such as negative alcohol content or negative sulfur dioxide. Rather than dropping these rows or replacing the values, we flag them with a binary indicator. This way the model can learn whether having an erroneous reading is itself predictive of sales, while we handle the actual values separately.
Winsorize Outliers
# Cap extreme values at the 1st/99th percentiles, using caps computed on train only
wins <- function(x, q) pmax(pmin(x, q[2]), q[1])
num_vars <- names(train)[sapply(train, is.numeric)]
num_vars <- num_vars[!num_vars %in% c("TARGET", "STARS_missing")]
num_vars <- num_vars[!grepl("_neg$", num_vars)]  # binary flags don't need capping
for (col in num_vars) {
  q <- quantile(train[[col]], c(0.01, 0.99), na.rm = TRUE)
  train[[col]] <- wins(train[[col]], q)
  test[[col]]  <- wins(test[[col]], q)
}
Extreme values in the chemical variables are winsorized at the 1st and 99th percentiles, with the percentile caps computed from the training data and applied to both splits to avoid leakage. This caps the most egregious outliers without dropping any rows, keeping the dataset intact while preventing a handful of extreme readings from distorting the model coefficients. We exclude TARGET, STARS_missing, and the negative-value flags from this step since binary and count outcomes don't need capping.
Build Models
Poisson Models
library(MASS)
p1 <- glm(TARGET ~ ., data = train, family = poisson)
p2 <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train, family = poisson)
summary(p1)
Call:
glm(formula = TARGET ~ ., family = poisson, data = train)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.811e+00 2.255e-01 8.029 9.85e-16 ***
FixedAcidity -8.632e-05 9.503e-04 -0.091 0.92762
VolatileAcidity -3.089e-02 7.582e-03 -4.074 4.63e-05 ***
CitricAcid 1.977e-02 9.877e-03 2.001 0.04536 *
ResidualSugar 1.689e-04 2.534e-04 0.667 0.50501
Chlorides -7.790e-02 2.731e-02 -2.852 0.00434 **
FreeSulfurDioxide 1.412e-04 5.877e-05 2.402 0.01630 *
TotalSulfurDioxide 1.217e-04 3.776e-05 3.223 0.00127 **
Density -2.961e-01 2.209e-01 -1.340 0.18015
pH -1.541e-02 8.904e-03 -1.731 0.08348 .
Sulphates -1.026e-02 6.635e-03 -1.547 0.12191
Alcohol 2.241e-03 1.638e-03 1.368 0.17129
LabelAppeal 1.584e-01 6.817e-03 23.243 < 2e-16 ***
AcidIndex -8.398e-02 5.239e-03 -16.030 < 2e-16 ***
STARS 1.899e-01 6.801e-03 27.914 < 2e-16 ***
STARS_missing -1.026e+00 1.904e-02 -53.876 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 1.301e-02 1.929e-02 0.674 0.50011
TotalSulfurDioxide_neg 3.507e-02 2.078e-02 1.688 0.09140 .
CitricAcid_neg 2.983e-02 1.962e-02 1.520 0.12839
Chlorides_neg -2.598e-02 1.875e-02 -1.385 0.16597
ResidualSugar_neg 8.331e-03 1.876e-02 0.444 0.65697
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18306 on 10235 degrees of freedom
Residual deviance: 10968 on 10215 degrees of freedom
AIC: 36549
Number of Fisher Scoring iterations: 6
summary(p2)
Call:
glm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
family = poisson, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.517982 0.042889 35.39 <2e-16 ***
STARS 0.192055 0.006769 28.37 <2e-16 ***
LabelAppeal 0.158521 0.006807 23.29 <2e-16 ***
AcidIndex -0.085355 0.005128 -16.64 <2e-16 ***
STARS_missing -1.034342 0.019009 -54.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18306 on 10235 degrees of freedom
Residual deviance: 11027 on 10231 degrees of freedom
AIC: 36577
Number of Fisher Scoring iterations: 6
Both Poisson models tell a consistent story. STARS and LabelAppeal are the dominant positive predictors: each additional star multiplies expected cases by exp(0.192) ≈ 1.21, roughly a 21% increase, and each unit of label appeal multiplies them by exp(0.159) ≈ 1.17, roughly 17%. AcidIndex is consistently negative across both models, meaning higher acidity is associated with fewer cases sold. Most importantly, STARS_missing has a large negative coefficient (~-1.03) and is highly significant, confirming that unrated wines sell far fewer cases and that flagging this missingness was the right call.
The full model (p1) adds the chemical variables and negative flags, but most of them are insignificant and the AIC barely changes (36,549 vs 36,577). The parsimonious model (p2) with just 4 predictors performs nearly as well, which suggests the chemistry variables add little beyond what STARS, LabelAppeal, and AcidIndex already capture.
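Since Poisson coefficients live on the log scale, exponentiating them gives the multiplicative rate ratios quoted above, and residual deviance over its degrees of freedom gives a crude overdispersion check; a minimal sketch:
# exp(beta) is the multiplicative change in expected cases per unit increase
round(exp(coef(p2)), 3)
# Values near 1 suggest the Poisson variance assumption is adequate
deviance(p2) / df.residual(p2)
For p2 the deviance ratio works out to 11,027 / 10,231 ≈ 1.08, close enough to 1 that plain Poisson looks defensible; the Negative Binomial fits below make the same point from another angle.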
Negative Binomial Models
library(MASS)
nb1 <- glm.nb(TARGET ~ ., data = train)
nb2 <- glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
summary(nb1)
Call:
glm.nb(formula = TARGET ~ ., data = train, init.theta = 40898.55359,
link = log)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.811e+00 2.256e-01 8.028 9.87e-16 ***
FixedAcidity -8.632e-05 9.504e-04 -0.091 0.92763
VolatileAcidity -3.089e-02 7.583e-03 -4.074 4.63e-05 ***
CitricAcid 1.977e-02 9.877e-03 2.001 0.04537 *
ResidualSugar 1.689e-04 2.534e-04 0.667 0.50501
Chlorides -7.790e-02 2.732e-02 -2.852 0.00434 **
FreeSulfurDioxide 1.412e-04 5.878e-05 2.402 0.01630 *
TotalSulfurDioxide 1.217e-04 3.776e-05 3.223 0.00127 **
Density -2.961e-01 2.209e-01 -1.340 0.18016
pH -1.541e-02 8.904e-03 -1.731 0.08348 .
Sulphates -1.026e-02 6.635e-03 -1.547 0.12190
Alcohol 2.241e-03 1.638e-03 1.368 0.17133
LabelAppeal 1.584e-01 6.817e-03 23.242 < 2e-16 ***
AcidIndex -8.398e-02 5.239e-03 -16.030 < 2e-16 ***
STARS 1.899e-01 6.802e-03 27.913 < 2e-16 ***
STARS_missing -1.026e+00 1.904e-02 -53.875 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 1.301e-02 1.929e-02 0.674 0.50010
TotalSulfurDioxide_neg 3.507e-02 2.078e-02 1.688 0.09140 .
CitricAcid_neg 2.984e-02 1.962e-02 1.520 0.12840
Chlorides_neg -2.598e-02 1.875e-02 -1.385 0.16596
ResidualSugar_neg 8.331e-03 1.876e-02 0.444 0.65698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(40898.55) family taken to be 1)
Null deviance: 18305 on 10235 degrees of freedom
Residual deviance: 10967 on 10215 degrees of freedom
AIC: 36552
Number of Fisher Scoring iterations: 1
Theta: 40899
Std. Err.: 38984
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -36507.74
summary(nb2)
Call:
glm.nb(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train, init.theta = 40571.86879, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.517997 0.042891 35.39 <2e-16 ***
STARS 0.192057 0.006769 28.37 <2e-16 ***
LabelAppeal 0.158520 0.006807 23.29 <2e-16 ***
AcidIndex -0.085358 0.005128 -16.64 <2e-16 ***
STARS_missing -1.034342 0.019009 -54.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(40571.87) family taken to be 1)
Null deviance: 18305 on 10235 degrees of freedom
Residual deviance: 11027 on 10231 degrees of freedom
AIC: 36579
Number of Fisher Scoring iterations: 1
Theta: 40572
Std. Err.: 38604
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -36567.26
The theta parameter is enormous (40,000+) and hit the iteration limit, which means the Negative Binomial model is essentially collapsing to Poisson. The NB models produce virtually identical coefficients and AICs to their Poisson counterparts, indicating little to no overdispersion in the data: the Poisson assumption of mean equal to variance is roughly satisfied here. This is actually good news, since it means Poisson is the appropriate model family and we don't need the extra complexity of NB. STARS, LabelAppeal, AcidIndex, and STARS_missing remain highly significant with the same signs and magnitudes across all four models so far.
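For a check more formal than eyeballing theta, the dispersiontest() function from the AER package (assuming AER is installed) tests the Poisson equidispersion assumption directly on the fitted Poisson model:
library(AER)
# H0: equidispersion (variance = mean); a large p-value means no evidence of overdispersion
dispersiontest(p2, alternative = "greater")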
Linear Regression Models
lm1 <- lm(TARGET ~ ., data = train)
lm2 <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
summary(lm1)
Call:
lm(formula = TARGET ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-4.6164 -0.8445 0.0240 0.8511 5.6747
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.503e+00 5.106e-01 8.819 < 2e-16 ***
FixedAcidity 5.186e-04 2.148e-03 0.241 0.80919
VolatileAcidity -9.781e-02 1.712e-02 -5.714 1.13e-08 ***
CitricAcid 6.596e-02 2.261e-02 2.917 0.00354 **
ResidualSugar 4.149e-04 5.767e-04 0.719 0.47194
Chlorides -2.289e-01 6.110e-02 -3.747 0.00018 ***
FreeSulfurDioxide 4.152e-04 1.346e-04 3.085 0.00204 **
TotalSulfurDioxide 3.507e-04 8.582e-05 4.086 4.41e-05 ***
Density -8.580e-01 5.008e-01 -1.713 0.08669 .
pH -4.048e-02 2.013e-02 -2.011 0.04437 *
Sulphates -2.740e-02 1.501e-02 -1.825 0.06805 .
Alcohol 8.963e-03 3.710e-03 2.416 0.01573 *
LabelAppeal 4.653e-01 1.516e-02 30.690 < 2e-16 ***
AcidIndex -2.124e-01 1.053e-02 -20.176 < 2e-16 ***
STARS 7.842e-01 1.742e-02 45.011 < 2e-16 ***
STARS_missing -2.242e+00 3.005e-02 -74.611 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 4.342e-02 4.364e-02 0.995 0.31986
TotalSulfurDioxide_neg 1.061e-01 4.694e-02 2.260 0.02387 *
CitricAcid_neg 1.038e-01 4.459e-02 2.327 0.01996 *
Chlorides_neg -6.939e-02 4.237e-02 -1.638 0.10155
ResidualSugar_neg 1.414e-02 4.258e-02 0.332 0.73977
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.305 on 10215 degrees of freedom
Multiple R-squared: 0.5422, Adjusted R-squared: 0.5413
F-statistic: 605 on 20 and 10215 DF, p-value: < 2.2e-16
summary(lm2)
Call:
lm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
Residuals:
Min 1Q Median 3Q Max
-4.6975 -0.8210 0.0141 0.8645 5.7864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.70368 0.09023 41.05 <2e-16 ***
STARS 0.79162 0.01745 45.36 <2e-16 ***
LabelAppeal 0.46420 0.01522 30.51 <2e-16 ***
AcidIndex -0.21684 0.01029 -21.06 <2e-16 ***
STARS_missing -2.26706 0.03007 -75.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.311 on 10231 degrees of freedom
Multiple R-squared: 0.5374, Adjusted R-squared: 0.5372
F-statistic: 2972 on 4 and 10231 DF, p-value: < 2.2e-16
Both linear models explain about 54% of the variance in TARGET (R² ≈ 0.54), which is solid for this kind of data. The story is the same as before: STARS, LabelAppeal, AcidIndex, and STARS_missing dominate. Each additional star adds about 0.79 cases, each unit of label appeal adds about 0.46 cases, and unrated wines sell on average 2.27 fewer cases, all highly significant. The full model (lm1) picks up a few additional significant chemistry variables such as VolatileAcidity, Chlorides, and TotalSulfurDioxide, but the R² gain over the parsimonious model (lm2) is minimal (0.542 vs 0.537). The linear model is easier to interpret than Poisson but technically misspecified for count data since it can predict negative values, something to keep in mind when selecting the final model.
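One way to make the misspecification concrete is to count how many test-set predictions from the linear model fall below zero, something a count model cannot do; a minimal sketch:
# Linear-model predictions are unbounded below; check the test set
lm_pred <- predict(lm2, newdata = test)
c(n_negative = sum(lm_pred < 0), min_prediction = min(lm_pred))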
Model Comparison Table
stargazer(p2, nb2, lm2,
type = "text",
title = "Model Comparison",
column.labels = c("Poisson", "Neg Binomial", "Linear"),
dep.var.labels = "TARGET",
keep = c("STARS", "LabelAppeal", "AcidIndex", "STARS_missing"),
digits = 3)
Model Comparison
====================================================================================
Dependent variable:
----------------------------------------------------------------
TARGET
                         Poisson      negative binomial      OLS
                         Poisson      Neg Binomial           Linear
(1) (2) (3)
------------------------------------------------------------------------------------
STARS 0.192*** 0.192*** 0.792***
(0.007) (0.007) (0.017)
LabelAppeal 0.159*** 0.159*** 0.464***
(0.007) (0.007) (0.015)
AcidIndex -0.085*** -0.085*** -0.217***
(0.005) (0.005) (0.010)
STARS_missing -1.034*** -1.034*** -2.267***
(0.019) (0.019) (0.030)
------------------------------------------------------------------------------------
Observations 10,236 10,236 10,236
R2 0.537
Adjusted R2 0.537
Log Likelihood -18,283.460 -18,284.630
theta 40,571.870 (38,604.010)
Akaike Inf. Crit. 36,576.930 36,579.260
Residual Std. Error 1.311 (df = 10231)
F Statistic 2,971.519*** (df = 4; 10231)
====================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Across all three model families the key coefficients are remarkably consistent in sign and significance. STARS and LabelAppeal are positive and highly significant everywhere, AcidIndex is negative, and STARS_missing carries the largest effect in all three models. The Poisson and Negative Binomial models are virtually identical (same coefficients, same standard errors, nearly identical AIC: 36,577 vs 36,579), which confirms the earlier finding that there's no meaningful overdispersion in the data. The linear model coefficients aren't directly comparable since they're on a different scale, but the direction and relative importance of each variable line up with the count models.
Model Selection
The parsimonious Poisson model (p2) is the preferred choice. Its AIC of 36,577 is only marginally higher than the full Poisson model's 36,549 despite using 4 predictors instead of 20, so the chemistry variables buy essentially nothing. The Negative Binomial adds no value since theta is enormous, confirming there is no overdispersion to model. The linear model is easy to interpret but is technically misspecified for count data since it can produce negative predictions. The parsimonious Poisson model with STARS, LabelAppeal, AcidIndex, and STARS_missing strikes the right balance of performance, parsimony, and statistical appropriateness.
Predictions on Test Set
# Predict expected cases on the test set and round to whole cases
test$predicted <- round(predict(p2, newdata = test, type = "response"))
table(test$predicted)
1 2 3 4 5 6 7 8
604 239 765 623 201 102 23 2
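Exact-count accuracy is a harsh yardstick for a count model, so it is worth also measuring how far off the predictions are on average; a minimal sketch:
# Average prediction error on the test set
c(MAE = mean(abs(test$predicted - test$TARGET)),
  RMSE = sqrt(mean((test$predicted - test$TARGET)^2)))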
Confusion Matrix
library(caret)
# Align factor levels so predictions and reference share the same classes
conf <- confusionMatrix(factor(test$predicted, levels = 0:8),
                        factor(test$TARGET, levels = 0:8))
conf
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 356 30 60 91 46 17 3 0 1
2 84 10 57 53 17 10 8 0 0
3 93 4 87 272 244 59 6 0 0
4 11 0 10 108 257 186 50 1 0
5 1 0 1 8 46 103 38 4 0
6 0 0 0 0 7 40 41 12 2
7 0 0 0 0 1 8 8 6 0
8 0 0 0 0 0 0 0 2 0
Overall Statistics
Accuracy : 0.2993
95% CI : (0.2816, 0.3175)
No Information Rate : 0.2415
P-Value [Acc > NIR] : 1.446e-11
Kappa : 0.1773
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.000 0.68182 0.26512 0.5113 0.4159 0.24350
Specificity 1.000 0.77177 0.92235 0.7568 0.8114 0.95412
Pos Pred Value NaN 0.04967 0.23849 0.3556 0.4125 0.51244
Neg Pred Value 0.787 0.99284 0.93190 0.8551 0.8135 0.86429
Prevalence 0.213 0.01719 0.08402 0.2079 0.2415 0.16530
Detection Rate 0.000 0.01172 0.02227 0.1063 0.1004 0.04025
Detection Prevalence 0.000 0.23603 0.09340 0.2989 0.2435 0.07855
Balanced Accuracy 0.500 0.72679 0.59374 0.6340 0.6136 0.59881
Class: 6 Class: 7 Class: 8
Sensitivity 0.26623 0.240000 0.0000000
Specificity 0.97464 0.993291 0.9992175
Pos Pred Value 0.40196 0.260870 0.0000000
Neg Pred Value 0.95401 0.992508 0.9988268
Prevalence 0.06018 0.009769 0.0011723
Detection Rate 0.01602 0.002345 0.0000000
Detection Prevalence 0.03986 0.008988 0.0007816
Balanced Accuracy 0.62043 0.616646 0.4996088
Overall accuracy is about 30%, which sounds low but is reasonable for a 9-class prediction problem where always predicting the most common class (the no-information rate) would only reach 24%. The model performs best on the middle counts (3 and 4), which makes sense since those are the most common classes. The biggest weakness is class 0: the model predicts no zeros at all, confirming the zero-inflation problem we flagged earlier. A Kappa of 0.18 indicates the model is doing better than chance but leaves room for improvement. For a wine manufacturer the practical takeaway is still useful: the model reliably identifies high-potential wines (those with strong star ratings and appealing labels) even if it struggles with the exact count, particularly at the extremes.
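The zero-inflation gap points directly at a zero-inflated Poisson, which models "bought nothing at all" and "how many cases" as two separate processes. A minimal sketch using the pscl package (assuming pscl is installed; the zero-model specification after the | is illustrative, not tuned):
library(pscl)
# Count model | zero-inflation model
zip <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing |
                  STARS_missing + LabelAppeal,
                data = train, dist = "poisson")
summary(zip)
AIC(zip, p2)  # compare against the selected Poisson model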
Conclusion
This analysis set out to predict the number of wine cases purchased by distributors using chemical and marketing properties of roughly 12,000 wines. After exploring the data, handling missingness, and building six models across three families, a few clear takeaways emerge.
First, two variables do most of the work: expert star ratings and label appeal. A wine with a high star rating and an attractive label is going to sell significantly more cases regardless of its chemical composition. The chemistry variables matter at the margins but add little predictive value once you account for those two.
Second, the fact that a wine is unrated is itself a strong negative signal. Wines with no star rating sell far fewer cases on average, and capturing this with a missing indicator flag was one of the more impactful data preparation decisions.
Third, the Poisson and Negative Binomial models performed essentially identically, with no meaningful overdispersion in the data. The parsimonious Poisson model with four predictors is the recommended choice for deployment, balancing statistical appropriateness, interpretability, and predictive performance. The linear model is a close competitor on R² but is technically misspecified for count data and can produce negative predictions.
The main limitation of the selected model is its inability to predict zero purchases, which points to zero-inflated Poisson as a natural next step if further refinement is needed.