Wine Prediction
Introduction
This analysis builds a count regression model to predict the number of sample wine cases purchased by distributors, based on the chemical and marketing properties of roughly 12,000 commercially available wines. The target variable, cases purchased, is a non-negative integer, making count regression the natural modeling framework. We work through four stages: exploratory data analysis to understand the data, preparation to handle missingness and anomalies, model building across three families (Poisson, Negative Binomial, and Linear), and finally model selection based on fit, parsimony, and predictive performance. A key theme throughout is that two variables, expert star ratings and label appeal, dominate the prediction, while the chemical properties of the wine play a much smaller role than one might expect.
Data Exploration
Load & Split Data
# Read the data, hold out 20% for testing, and drop the INDEX identifier
df <- read.csv("wine-data-1-1.csv")
set.seed(123)
idx <- sample(nrow(df), round(0.8 * nrow(df)))
train <- df[idx, !names(df) %in% "INDEX"]
test <- df[-idx, !names(df) %in% "INDEX"]
Summary Statistics
library(stargazer)
stargazer(train, type = "text",
title = "Summary Statistics",
digits = 2,
summary.stat = c("n", "mean", "sd", "min", "median", "max"))
Summary Statistics
=================================================================
Statistic N Mean St. Dev. Min Median Max
-----------------------------------------------------------------
TARGET 10,236 3.03 1.93 0 3 8
FixedAcidity 10,236 7.09 6.33 -18.00 6.90 34.40
VolatileAcidity 10,236 0.33 0.78 -2.79 0.28 3.68
CitricAcid 10,236 0.31 0.86 -3.24 0.31 3.86
ResidualSugar 9,742 5.28 33.94 -127.80 3.90 141.15
Chlorides 9,711 0.05 0.32 -1.17 0.05 1.35
FreeSulfurDioxide 9,697 30.90 147.31 -546.00 30.00 622.00
TotalSulfurDioxide 9,704 121.82 231.04 -823.00 124.00 1,057.00
Density 10,236 0.99 0.03 0.89 0.99 1.10
pH 9,915 3.21 0.68 0.48 3.20 6.05
Sulphates 9,276 0.53 0.94 -3.13 0.50 4.24
Alcohol 9,717 10.50 3.72 -4.70 10.40 26.10
LabelAppeal 10,236 -0.01 0.90 -2 0 2
AcidIndex 10,236 7.78 1.34 4 8 17
STARS 7,548 2.05 0.90 1 2 4
-----------------------------------------------------------------
The training set has 10,236 observations and 15 variables. A few things jump out right away. STARS is missing for about 26% of observations, by far the biggest gap, and we'll see shortly that it's also the strongest predictor of TARGET. LabelAppeal is clean, fully observed, and looks like another key driver. On the chemistry side, several variables have clearly erroneous negative values (negative alcohol content, negative sulfur dioxide), which we'll deal with in Part 2. And TARGET itself has a mean of ~3 but drops all the way to 0, which hints at possible zero-inflation that will influence our model choice later.
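As a quick numeric check on the zero-inflation hunch, a minimal sketch (TARGET is fully observed, so no NA handling is needed):
# Share of zero-count wines, and the mean/variance relationship Poisson assumes
mean(train$TARGET == 0)
c(mean = mean(train$TARGET), var = var(train$TARGET))
From the summary table above, the raw variance (sd of 1.93 squared, about 3.7) runs above the mean of 3.03, consistent with the zero spike, though what the count models below actually test is the conditional variance once predictors are included.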
Missing Values
miss <- data.frame(
Variable = names(train),
Missing = sapply(train, function(x) sum(is.na(x))),
Pct = round(sapply(train, function(x) mean(is.na(x))) * 100, 1)
)
miss[miss$Missing > 0, ]
                                 Variable Missing  Pct
ResidualSugar ResidualSugar 494 4.8
Chlorides Chlorides 525 5.1
FreeSulfurDioxide FreeSulfurDioxide 539 5.3
TotalSulfurDioxide TotalSulfurDioxide 532 5.2
pH pH 321 3.1
Sulphates Sulphates 960 9.4
Alcohol Alcohol 519 5.1
STARS STARS 2688 26.3
STARS has by far the most missing data at 26.3%, while the remaining variables with missingness are all under 10%, Sulphates being the worst of that group at 9.4%. For the chemical variables we'll impute with the median, which is straightforward. STARS is a different story: wines without a rating are likely not unrated at random, so rather than relying on imputation alone we'll create a binary flag to preserve that signal. More on this in Part 2.
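A quick way to probe the not-missing-at-random hypothesis is to compare average sales for rated versus unrated wines; a minimal sketch:
# Mean TARGET split by whether STARS is missing
tapply(train$TARGET, is.na(train$STARS), mean)
If unrated wines average noticeably fewer cases, the missingness itself carries signal, which is exactly what the binary flag in Part 2 is designed to capture.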
TARGET Distribution
library(ggplot2)
ggplot(train, aes(x = TARGET)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of TARGET", x = "Cases Purchased", y = "Count") +
  theme_minimal()
TARGET ranges from 0 to 8 with a mean of ~3. The most striking feature is the large spike at 0: about 21% of wines had zero cases purchased, followed by a dip at 1 and then a second peak around 4. This bimodal shape suggests two distinct groups: wines that distributors passed on entirely, and wines that sold reasonably well. This is a classic sign of zero-inflation, meaning a standard Poisson model may not be enough and zero-inflated variants are worth considering.
Key Predictors
par(mfrow = c(1, 2))
barplot(table(train$STARS), main = "STARS Distribution",
        xlab = "Stars", ylab = "Count", col = "steelblue")
barplot(table(train$LabelAppeal), main = "LabelAppeal Distribution",
        xlab = "Label Appeal", ylab = "Count", col = "steelblue")
Most wines are rated 1 or 2 stars, with very few reaching 4, so high-rated wines are rare in this dataset. The LabelAppeal distribution is roughly symmetric and centered at 0, meaning most wines have a neutral label and relatively few sit at the extremes. Both variables will be treated as ordinal in modeling; higher values are better for sales in both cases, which aligns with the theoretical expectations laid out in the assignment.
Correlation Plot
library(corrplot)
corrplot(cor(train, use = "pairwise.complete.obs"),
         method = "color", type = "lower",
         tl.cex = 0.7, tl.col = "black")
The correlation plot confirms what we expected. STARS has the strongest positive correlation with TARGET, followed by LabelAppeal, while AcidIndex stands out as the most notable negative correlator. The chemical variables (FixedAcidity, VolatileAcidity, CitricAcid, Chlorides, and the rest) are mostly weakly correlated with TARGET but show some correlation with each other, which is expected since they all measure related properties of the same wine. The big takeaway is that STARS and LabelAppeal are doing most of the work, while the chemistry variables will likely add only marginal predictive value.
Data Preparation
Missing STARS
train$STARS_missing <- ifelse(is.na(train$STARS), 1, 0)
test$STARS_missing  <- ifelse(is.na(test$STARS), 1, 0)
Before imputing, we create a flag for missing STARS. Unrated wines are likely not missing at random; they are probably lower quality or less established, which itself predicts lower sales. Dropping or blindly imputing this would throw away a useful signal. All missing values (STARS included, now that the flag preserves its signal) are then filled with the training median, applied consistently to both train and test to avoid leakage.
Median Imputation
# Impute remaining NAs with the training median (applied to both splits)
for (col in names(train)) {
  med <- median(train[[col]], na.rm = TRUE)
  train[[col]][is.na(train[[col]])] <- med
  test[[col]][is.na(test[[col]])] <- med
}
All variables with missing values are imputed using the training median. The median is preferred over the mean here since several variables have skewed distributions and outliers that would pull the mean in the wrong direction.
Implausible Negatives
neg_vars <- c("Alcohol", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "CitricAcid", "Chlorides", "ResidualSugar")
# Flag physically impossible negative readings rather than dropping them
for (col in neg_vars) {
  train[[paste0(col, "_neg")]] <- ifelse(train[[col]] < 0, 1, 0)
  test[[paste0(col, "_neg")]]  <- ifelse(test[[col]] < 0, 1, 0)
}
Several chemical variables contain negative values that are physically impossible, such as negative alcohol content or negative sulfur dioxide. Rather than dropping these rows or replacing the values, we flag them with a binary indicator. This way the model can learn whether having an erroneous reading is itself predictive of sales, while we handle the actual values separately.
Winsorize Outliers
# Cap extreme values at the 1st/99th percentiles, using caps computed on train only
wins <- function(x, q) pmax(pmin(x, q[2]), q[1])
num_vars <- names(train)[sapply(train, is.numeric)]
num_vars <- num_vars[!num_vars %in% c("TARGET", "STARS_missing")]
num_vars <- num_vars[!grepl("_neg$", num_vars)]  # binary flags don't need capping
for (col in num_vars) {
  q <- quantile(train[[col]], c(0.01, 0.99), na.rm = TRUE)
  train[[col]] <- wins(train[[col]], q)
  test[[col]]  <- wins(test[[col]], q)
}
Extreme values in the chemical variables are winsorized at the 1st and 99th percentiles, with the percentile caps computed from the training data and applied to both splits to avoid leakage. This caps the most egregious outliers without dropping any rows, keeping the dataset intact while preventing a handful of extreme readings from distorting the model coefficients. We exclude TARGET, STARS_missing, and the negative-value flags from this step since binary and count outcomes don't need capping.
Build Models
Poisson Models
library(MASS)
p1 <- glm(TARGET ~ ., data = train, family = poisson)
p2 <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train, family = poisson)
summary(p1)
Call:
glm(formula = TARGET ~ ., family = poisson, data = train)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.811e+00 2.255e-01 8.029 9.85e-16 ***
FixedAcidity -8.632e-05 9.503e-04 -0.091 0.92762
VolatileAcidity -3.089e-02 7.582e-03 -4.074 4.63e-05 ***
CitricAcid 1.977e-02 9.877e-03 2.001 0.04536 *
ResidualSugar 1.689e-04 2.534e-04 0.667 0.50501
Chlorides -7.790e-02 2.731e-02 -2.852 0.00434 **
FreeSulfurDioxide 1.412e-04 5.877e-05 2.402 0.01630 *
TotalSulfurDioxide 1.217e-04 3.776e-05 3.223 0.00127 **
Density -2.961e-01 2.209e-01 -1.340 0.18015
pH -1.541e-02 8.904e-03 -1.731 0.08348 .
Sulphates -1.026e-02 6.635e-03 -1.547 0.12191
Alcohol 2.241e-03 1.638e-03 1.368 0.17129
LabelAppeal 1.584e-01 6.817e-03 23.243 < 2e-16 ***
AcidIndex -8.398e-02 5.239e-03 -16.030 < 2e-16 ***
STARS 1.899e-01 6.801e-03 27.914 < 2e-16 ***
STARS_missing -1.026e+00 1.904e-02 -53.876 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 1.301e-02 1.929e-02 0.674 0.50011
TotalSulfurDioxide_neg 3.507e-02 2.078e-02 1.688 0.09140 .
CitricAcid_neg 2.983e-02 1.962e-02 1.520 0.12839
Chlorides_neg -2.598e-02 1.875e-02 -1.385 0.16597
ResidualSugar_neg 8.331e-03 1.876e-02 0.444 0.65697
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18306 on 10235 degrees of freedom
Residual deviance: 10968 on 10215 degrees of freedom
AIC: 36549
Number of Fisher Scoring iterations: 6
summary(p2)
Call:
glm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
family = poisson, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.517982 0.042889 35.39 <2e-16 ***
STARS 0.192055 0.006769 28.37 <2e-16 ***
LabelAppeal 0.158521 0.006807 23.29 <2e-16 ***
AcidIndex -0.085355 0.005128 -16.64 <2e-16 ***
STARS_missing -1.034342 0.019009 -54.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18306 on 10235 degrees of freedom
Residual deviance: 11027 on 10231 degrees of freedom
AIC: 36577
Number of Fisher Scoring iterations: 6
Both Poisson models tell a consistent story. STARS and LabelAppeal are the dominant positive predictors: each additional star multiplies expected cases by exp(0.192) ≈ 1.21, roughly a 21% increase, and each unit of label appeal multiplies them by exp(0.159) ≈ 1.17, roughly 17%. AcidIndex is consistently negative across both models, meaning higher acidity is associated with fewer cases sold. Most importantly, STARS_missing has a large negative coefficient (~-1.03) and is highly significant, confirming that unrated wines sell far fewer cases and that flagging this missingness was the right call.
The full model (p1) adds the chemical variables and negative flags, but most of them are insignificant and the AIC barely changes (36,549 vs 36,577). The parsimonious model (p2) with just 4 predictors performs nearly as well, which suggests the chemistry variables add little beyond what STARS, LabelAppeal, and AcidIndex already capture.
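Since Poisson coefficients live on the log scale, exponentiating them gives the multiplicative rate ratios quoted above, and residual deviance over its degrees of freedom gives a crude overdispersion check; a minimal sketch:
# exp(beta) is the multiplicative change in expected cases per unit increase
round(exp(coef(p2)), 3)
# Values near 1 suggest the Poisson variance assumption is adequate
deviance(p2) / df.residual(p2)
For p2 the deviance ratio works out to 11,027 / 10,231 ≈ 1.08, close enough to 1 that plain Poisson looks defensible; the Negative Binomial fits below make the same point from another angle.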
Negative Binomial Models
library(MASS)
nb1 <- glm.nb(TARGET ~ ., data = train)
nb2 <- glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
summary(nb1)
Call:
glm.nb(formula = TARGET ~ ., data = train, init.theta = 40898.55359,
link = log)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.811e+00 2.256e-01 8.028 9.87e-16 ***
FixedAcidity -8.632e-05 9.504e-04 -0.091 0.92763
VolatileAcidity -3.089e-02 7.583e-03 -4.074 4.63e-05 ***
CitricAcid 1.977e-02 9.877e-03 2.001 0.04537 *
ResidualSugar 1.689e-04 2.534e-04 0.667 0.50501
Chlorides -7.790e-02 2.732e-02 -2.852 0.00434 **
FreeSulfurDioxide 1.412e-04 5.878e-05 2.402 0.01630 *
TotalSulfurDioxide 1.217e-04 3.776e-05 3.223 0.00127 **
Density -2.961e-01 2.209e-01 -1.340 0.18016
pH -1.541e-02 8.904e-03 -1.731 0.08348 .
Sulphates -1.026e-02 6.635e-03 -1.547 0.12190
Alcohol 2.241e-03 1.638e-03 1.368 0.17133
LabelAppeal 1.584e-01 6.817e-03 23.242 < 2e-16 ***
AcidIndex -8.398e-02 5.239e-03 -16.030 < 2e-16 ***
STARS 1.899e-01 6.802e-03 27.913 < 2e-16 ***
STARS_missing -1.026e+00 1.904e-02 -53.875 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 1.301e-02 1.929e-02 0.674 0.50010
TotalSulfurDioxide_neg 3.507e-02 2.078e-02 1.688 0.09140 .
CitricAcid_neg 2.984e-02 1.962e-02 1.520 0.12840
Chlorides_neg -2.598e-02 1.875e-02 -1.385 0.16596
ResidualSugar_neg 8.331e-03 1.876e-02 0.444 0.65698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(40898.55) family taken to be 1)
Null deviance: 18305 on 10235 degrees of freedom
Residual deviance: 10967 on 10215 degrees of freedom
AIC: 36552
Number of Fisher Scoring iterations: 1
Theta: 40899
Std. Err.: 38984
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -36507.74
summary(nb2)
Call:
glm.nb(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train, init.theta = 40571.86879, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.517997 0.042891 35.39 <2e-16 ***
STARS 0.192057 0.006769 28.37 <2e-16 ***
LabelAppeal 0.158520 0.006807 23.29 <2e-16 ***
AcidIndex -0.085358 0.005128 -16.64 <2e-16 ***
STARS_missing -1.034342 0.019009 -54.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(40571.87) family taken to be 1)
Null deviance: 18305 on 10235 degrees of freedom
Residual deviance: 11027 on 10231 degrees of freedom
AIC: 36579
Number of Fisher Scoring iterations: 1
Theta: 40572
Std. Err.: 38604
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -36567.26
The theta parameter is enormous (40,000+) and hit the iteration limit, which means the Negative Binomial model is essentially collapsing to Poisson. The NB models produce virtually identical coefficients and AICs to their Poisson counterparts, indicating little to no overdispersion in the data: the Poisson assumption of mean equal to variance is roughly satisfied here. This is actually good news, since it means Poisson is the appropriate model family and we don't need the extra complexity of NB. STARS, LabelAppeal, AcidIndex, and STARS_missing remain highly significant with the same signs and magnitudes across all four models so far.
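For a check more formal than eyeballing theta, the dispersiontest() function from the AER package (assuming AER is installed) tests the Poisson equidispersion assumption directly on the fitted Poisson model:
library(AER)
# H0: equidispersion (variance = mean); a large p-value means no evidence of overdispersion
dispersiontest(p2, alternative = "greater")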
Linear Regression Models
lm1 <- lm(TARGET ~ ., data = train)
lm2 <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
summary(lm1)
Call:
lm(formula = TARGET ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-4.6164 -0.8445 0.0240 0.8511 5.6747
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.503e+00 5.106e-01 8.819 < 2e-16 ***
FixedAcidity 5.186e-04 2.148e-03 0.241 0.80919
VolatileAcidity -9.781e-02 1.712e-02 -5.714 1.13e-08 ***
CitricAcid 6.596e-02 2.261e-02 2.917 0.00354 **
ResidualSugar 4.149e-04 5.767e-04 0.719 0.47194
Chlorides -2.289e-01 6.110e-02 -3.747 0.00018 ***
FreeSulfurDioxide 4.152e-04 1.346e-04 3.085 0.00204 **
TotalSulfurDioxide 3.507e-04 8.582e-05 4.086 4.41e-05 ***
Density -8.580e-01 5.008e-01 -1.713 0.08669 .
pH -4.048e-02 2.013e-02 -2.011 0.04437 *
Sulphates -2.740e-02 1.501e-02 -1.825 0.06805 .
Alcohol 8.963e-03 3.710e-03 2.416 0.01573 *
LabelAppeal 4.653e-01 1.516e-02 30.690 < 2e-16 ***
AcidIndex -2.124e-01 1.053e-02 -20.176 < 2e-16 ***
STARS 7.842e-01 1.742e-02 45.011 < 2e-16 ***
STARS_missing -2.242e+00 3.005e-02 -74.611 < 2e-16 ***
Alcohol_neg NA NA NA NA
FreeSulfurDioxide_neg 4.342e-02 4.364e-02 0.995 0.31986
TotalSulfurDioxide_neg 1.061e-01 4.694e-02 2.260 0.02387 *
CitricAcid_neg 1.038e-01 4.459e-02 2.327 0.01996 *
Chlorides_neg -6.939e-02 4.237e-02 -1.638 0.10155
ResidualSugar_neg 1.414e-02 4.258e-02 0.332 0.73977
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.305 on 10215 degrees of freedom
Multiple R-squared: 0.5422, Adjusted R-squared: 0.5413
F-statistic: 605 on 20 and 10215 DF, p-value: < 2.2e-16
summary(lm2)
Call:
lm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing,
data = train)
Residuals:
Min 1Q Median 3Q Max
-4.6975 -0.8210 0.0141 0.8645 5.7864
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.70368 0.09023 41.05 <2e-16 ***
STARS 0.79162 0.01745 45.36 <2e-16 ***
LabelAppeal 0.46420 0.01522 30.51 <2e-16 ***
AcidIndex -0.21684 0.01029 -21.06 <2e-16 ***
STARS_missing -2.26706 0.03007 -75.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.311 on 10231 degrees of freedom
Multiple R-squared: 0.5374, Adjusted R-squared: 0.5372
F-statistic: 2972 on 4 and 10231 DF, p-value: < 2.2e-16
Both linear models explain about 54% of the variance in TARGET (R² ≈ 0.54), which is solid for this kind of data. The story is the same as before: STARS, LabelAppeal, AcidIndex, and STARS_missing dominate. Each additional star adds about 0.79 cases, each unit of label appeal adds about 0.46 cases, and unrated wines sell on average 2.27 fewer cases, all highly significant. The full model (lm1) picks up a few additional significant chemistry variables such as VolatileAcidity, Chlorides, and TotalSulfurDioxide, but the R² gain over the parsimonious model (lm2) is minimal (0.542 vs 0.537). The linear model is easier to interpret than Poisson but technically misspecified for count data since it can predict negative values, something to keep in mind when selecting the final model.
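One way to make the misspecification concrete is to count how many test-set predictions from the linear model fall below zero, something a count model cannot do; a minimal sketch:
# Linear-model predictions are unbounded below; check the test set
lm_pred <- predict(lm2, newdata = test)
c(n_negative = sum(lm_pred < 0), min_prediction = min(lm_pred))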
Model Comparison Table
stargazer(p2, nb2, lm2,
type = "text",
title = "Model Comparison",
column.labels = c("Poisson", "Neg Binomial", "Linear"),
dep.var.labels = "TARGET",
keep = c("STARS", "LabelAppeal", "AcidIndex", "STARS_missing"),
digits = 3)
Model Comparison
====================================================================================
Dependent variable:
----------------------------------------------------------------
TARGET
                         Poisson      negative binomial      OLS
                         Poisson      Neg Binomial           Linear
(1) (2) (3)
------------------------------------------------------------------------------------
STARS 0.192*** 0.192*** 0.792***
(0.007) (0.007) (0.017)
LabelAppeal 0.159*** 0.159*** 0.464***
(0.007) (0.007) (0.015)
AcidIndex -0.085*** -0.085*** -0.217***
(0.005) (0.005) (0.010)
STARS_missing -1.034*** -1.034*** -2.267***
(0.019) (0.019) (0.030)
------------------------------------------------------------------------------------
Observations 10,236 10,236 10,236
R2 0.537
Adjusted R2 0.537
Log Likelihood -18,283.460 -18,284.630
theta 40,571.870 (38,604.010)
Akaike Inf. Crit. 36,576.930 36,579.260
Residual Std. Error 1.311 (df = 10231)
F Statistic 2,971.519*** (df = 4; 10231)
====================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Across all three model families the key coefficients are remarkably consistent in sign and significance. STARS and LabelAppeal are positive and highly significant everywhere, AcidIndex is negative, and STARS_missing carries the largest effect in all three models. The Poisson and Negative Binomial models are virtually identical (same coefficients, same standard errors, nearly identical AIC: 36,577 vs 36,579), which confirms the earlier finding that there's no meaningful overdispersion in the data. The linear model coefficients aren't directly comparable since they're on a different scale, but the direction and relative importance of each variable line up with the count models.
Model Selection
The parsimonious Poisson model (p2) is the preferred choice. Its AIC of 36,577 is only marginally higher than the full Poisson model's 36,549 despite using 4 predictors instead of 20, so the chemistry variables buy essentially nothing. The Negative Binomial adds no value since theta is enormous, confirming there is no overdispersion to model. The linear model is easy to interpret but is technically misspecified for count data since it can produce negative predictions. The parsimonious Poisson model with STARS, LabelAppeal, AcidIndex, and STARS_missing strikes the right balance of performance, parsimony, and statistical appropriateness.
Predictions on Test Set
# Predict expected cases on the test set and round to whole cases
test$predicted <- round(predict(p2, newdata = test, type = "response"))
table(test$predicted)
1 2 3 4 5 6 7 8
604 239 765 623 201 102 23 2
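Exact-count accuracy is a harsh yardstick for a count model, so it is worth also measuring how far off the predictions are on average; a minimal sketch:
# Average prediction error on the test set
c(MAE = mean(abs(test$predicted - test$TARGET)),
  RMSE = sqrt(mean((test$predicted - test$TARGET)^2)))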
Confusion Matrix
library(caret)
# Align factor levels so predictions and reference share the same classes
conf <- confusionMatrix(factor(test$predicted, levels = 0:8),
                        factor(test$TARGET, levels = 0:8))
conf
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 356 30 60 91 46 17 3 0 1
2 84 10 57 53 17 10 8 0 0
3 93 4 87 272 244 59 6 0 0
4 11 0 10 108 257 186 50 1 0
5 1 0 1 8 46 103 38 4 0
6 0 0 0 0 7 40 41 12 2
7 0 0 0 0 1 8 8 6 0
8 0 0 0 0 0 0 0 2 0
Overall Statistics
Accuracy : 0.2993
95% CI : (0.2816, 0.3175)
No Information Rate : 0.2415
P-Value [Acc > NIR] : 1.446e-11
Kappa : 0.1773
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.000 0.68182 0.26512 0.5113 0.4159 0.24350
Specificity 1.000 0.77177 0.92235 0.7568 0.8114 0.95412
Pos Pred Value NaN 0.04967 0.23849 0.3556 0.4125 0.51244
Neg Pred Value 0.787 0.99284 0.93190 0.8551 0.8135 0.86429
Prevalence 0.213 0.01719 0.08402 0.2079 0.2415 0.16530
Detection Rate 0.000 0.01172 0.02227 0.1063 0.1004 0.04025
Detection Prevalence 0.000 0.23603 0.09340 0.2989 0.2435 0.07855
Balanced Accuracy 0.500 0.72679 0.59374 0.6340 0.6136 0.59881
Class: 6 Class: 7 Class: 8
Sensitivity 0.26623 0.240000 0.0000000
Specificity 0.97464 0.993291 0.9992175
Pos Pred Value 0.40196 0.260870 0.0000000
Neg Pred Value 0.95401 0.992508 0.9988268
Prevalence 0.06018 0.009769 0.0011723
Detection Rate 0.01602 0.002345 0.0000000
Detection Prevalence 0.03986 0.008988 0.0007816
Balanced Accuracy 0.62043 0.616646 0.4996088
Overall accuracy is about 30%, which sounds low but is reasonable for a 9-class prediction problem where always predicting the most common class (the no-information rate) would only reach 24%. The model performs best on the middle counts (3 and 4), which makes sense since those are the most common classes. The biggest weakness is class 0: the model predicts no zeros at all, confirming the zero-inflation problem we flagged earlier. A Kappa of 0.18 indicates the model is doing better than chance but leaves room for improvement. For a wine manufacturer the practical takeaway is still useful: the model reliably identifies high-potential wines (those with strong star ratings and appealing labels) even if it struggles with the exact count, particularly at the extremes.
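The zero-inflation gap points directly at a zero-inflated Poisson, which models "bought nothing at all" and "how many cases" as two separate processes. A minimal sketch using the pscl package (assuming pscl is installed; the zero-model specification after the | is illustrative, not tuned):
library(pscl)
# Count model | zero-inflation model
zip <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + STARS_missing |
                  STARS_missing + LabelAppeal,
                data = train, dist = "poisson")
summary(zip)
AIC(zip, p2)  # compare against the selected Poisson model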
Conclusion
This analysis set out to predict the number of wine cases purchased by distributors using chemical and marketing properties of roughly 12,000 wines. After exploring the data, handling missingness, and building six models across three families, a few clear takeaways emerge.
First, two variables do most of the work: expert star ratings and label appeal. A wine with a high star rating and an attractive label is going to sell significantly more cases regardless of its chemical composition. The chemistry variables matter at the margins but add little predictive value once you account for those two.
Second, the fact that a wine is unrated is itself a strong negative signal. Wines with no star rating sell far fewer cases on average, and capturing this with a missing indicator flag was one of the more impactful data preparation decisions.
Third, the Poisson and Negative Binomial models performed essentially identically, with no meaningful overdispersion in the data. The parsimonious Poisson model with four predictors is the recommended choice for deployment, balancing statistical appropriateness, interpretability, and predictive performance. The linear model is a close competitor on R² but is technically misspecified for count data and can produce negative predictions.
The main limitation of the selected model is its inability to predict zero purchases, which points to zero-inflated Poisson as a natural next step if further refinement is needed.