The wine training dataset contains 12,795 observations and includes multiple chemical attributes of commercially available wines, a small set of marketing indicators, and the response variable TARGET, representing the number of sample cases purchased by distributors.
A structural review (str()) confirms that:
TARGET is a count variable, ranging from 0 to 8, with most wines having relatively low case purchases.
Most predictor variables (acidity, pH, density, sulphates, sulfur dioxide measures, etc.) are continuous.
LabelAppeal and STARS are discrete rating variables that are theoretically expected to strongly influence wine demand.
Summary statistics (mean, SD, median) show:
Many chemical attributes (e.g., FixedAcidity, Alcohol, Sulphates) follow skewed distributions, as is common in chemical datasets; several also contain negative values, suggesting the readings have been centered or standardized.
TARGET shows a large spike at zero alongside a roughly symmetric hump centered near 3 to 4 cases: many wines sell no cases, while most others sell a moderate number.
LabelAppeal ranges from -2 to 2, reflecting poor-to-excellent customer appeal.
STARS ranges from 1 to 4, where higher ratings reflect stronger expert preference; it is also the most incomplete variable, missing for 3,359 wines.
A missingness check shows:
Several predictors contain missing values (common in chemical testing datasets).
As with earlier assignments, the fact that a value is missing may itself be predictive.
These will be handled in Section 2 by:
Creating missing-value flags
Imputing numeric variables using medians or other appropriate methods
Distribution of TARGET
A histogram of TARGET highlights:
A pronounced spike at zero
Counts otherwise concentrated between roughly 2 and 5, with a maximum of 8
This suggests:
Poisson regression may be insufficient because Poisson assumes mean ≈ variance, and the excess zeros could push the variance above the mean.
Negative Binomial regression may be more appropriate if overdispersion is present.
We will confirm this in Section 3 (a quick mean-variance check is also included after the TARGET summary in the code below).
Boxplots and Correlations
Boxplots reveal:
Chemical variables differ widely in scale and spread.
LabelAppeal and STARS show visible associations with TARGET:
Wines with higher STARS ratings tend to sell more cases.
Wines with more appealing labels also show higher sales.
Correlation heatmaps reveal:
Many chemical properties have low to moderate correlations with each other.
Marketing attributes (LabelAppeal, STARS) show stronger direct relationships with TARGET.
Our exploration confirms that:
TARGET is a non-negative integer outcome suitable for Poisson or Negative Binomial modeling.
The dataset contains meaningful marketing and chemical features.
The TARGET distribution, with its spike at zero, hints at possible overdispersion or zero inflation; whether a Negative Binomial model is needed is tested formally in Section 3.
Missing data must be handled carefully and may even be predictive.
############################################################
# SECTION 1 — DATA EXPLORATION
############################################################
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# (ggplot2 is already attached as part of the tidyverse)
library(corrplot)
## corrplot 0.95 loaded
# Load data
train <- read.csv("wine-training-data.csv")
eval_data <- read.csv("wine-evaluation-data.csv")  # distinct name so base::eval() is not masked
# Inspect structure & dimensions
dim(train)
## [1] 12795 16
str(train)
## 'data.frame': 12795 obs. of 16 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
# Summary statistics
summary(train)
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.17100 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.03100 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.04600 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.05482 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.15300 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.35100 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
# Check missingness
missing_counts <- colSums(is.na(train))
missing_counts
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 616 638 647
## TotalSulfurDioxide Density pH Sulphates
## 682 0 395 1210
## Alcohol LabelAppeal AcidIndex STARS
## 653 0 0 3359
# TARGET distribution
table(train$TARGET)
##
## 0 1 2 3 4 5 6 7 8
## 2734 244 1091 2611 3177 2014 765 142 17
summary(train$TARGET)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.029 4.000 8.000
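# Added sketch (not part of the original run): a quick moment check.
# If var(TARGET) greatly exceeds mean(TARGET), the Poisson assumption of
# equal mean and variance is suspect; this previews the formal Pearson
# dispersion diagnostics in Section 3.
mean(train$TARGET)
var(train$TARGET)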
ggplot(train, aes(x = TARGET)) +
geom_histogram(binwidth = 1, fill = "purple", color = "black") +
labs(title = "Distribution of TARGET (Cases Purchased)",
x = "Number of Cases", y = "Count")
# Identify numeric predictors
numeric_vars <- train %>%
select(-INDEX, -TARGET) %>%
select(where(is.numeric))
# Boxplots of numeric predictors
train_long <- train %>%
pivot_longer(cols = all_of(colnames(numeric_vars)),
names_to = "Variable",
values_to = "Value")
ggplot(train_long, aes(x = Variable, y = Value)) +
geom_boxplot(fill = "steelblue") +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Boxplots of Numeric Predictor Variables")
## Warning: Removed 8200 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Correlation matrix (numeric variables only)
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
corrplot(corr_matrix,
method = "color",
type = "upper",
tl.cex = 0.7,
number.cex = 0.7,
addCoef.col = "black")
# Key relationships
ggplot(train, aes(x = as.factor(STARS), y = TARGET)) +
geom_boxplot(fill = "gold") +
labs(title = "Cases Purchased vs STARS Rating",
x = "STARS", y = "TARGET")
ggplot(train, aes(x = LabelAppeal, y = TARGET)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "TARGET vs LabelAppeal",
x = "Label Appeal Score", y = "TARGET")
## `geom_smooth()` using formula = 'y ~ x'
Several preprocessing and feature engineering steps were performed to prepare the wine data for count regression modeling. The dataset contains a mix of chemical measurements, expert ratings, and marketing indicators, many of which exhibit skewed distributions and missing values that require careful handling.
First, all numeric predictor variables were identified, excluding the identification variable (INDEX) and the response variable (TARGET). Because the assignment notes that missingness itself may be predictive, I created a missing-value indicator flag for every numeric predictor (e.g., Alcohol_MISSING, Density_MISSING). These flags allow the models to learn whether the absence of a measurement carries information about wine sales.
Missing numeric values were then imputed using the median of the training data. Median imputation was chosen because many chemical variables exhibit skewness and outliers, and the median is robust to extreme values. The same imputation values derived from the training data were applied to the evaluation dataset to avoid data leakage.
Next, several chemical variables were transformed using log(x + 1), including Alcohol, Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, ResidualSugar, Sulphates, TotalSulfurDioxide, and VolatileAcidity. These variables displayed skewed distributions, and log transformations help stabilize variance and improve model behavior, particularly for the multiple linear regression benchmarks. One caveat: many of the chemical readings are negative (they appear to be centered or standardized), and log(x + 1) is undefined for values at or below -1, so the transformation produces NaNs that are screened out before the engineered models are fit in Section 3.
To capture nonlinear and ordinal effects, key marketing variables were bucketized. The STARS variable was treated as an ordered categorical variable, while LabelAppeal was grouped into five labeled categories ranging from very low to very high appeal (although, as noted in the Section 2 code, the right-closed cut intervals leave the VeryHigh bucket unpopulated for integer scores capped at 2). These groupings reflect how consumers and distributors are likely to respond to ratings and label design rather than to small numeric changes.
Finally, an interaction term between STARS and LabelAppeal was created. This variable captures the combined marketing effect of expert ratings and label design, reflecting the intuition that wines with both strong expert reviews and appealing labels are especially likely to be sampled and sold.
The resulting prepared datasets (train2 and eval2) include imputed values, missingness indicators, transformed chemical properties, bucketized ratings, and interaction terms. These features provide a robust foundation for building Poisson, Negative Binomial, and linear regression models in the next section.
############################################################
# SECTION 2 — DATA PREPARATION
############################################################
# Start from original data
train2 <- train
eval2 <- eval_data
# ---------------------------------------------------------
# 1. IDENTIFY NUMERIC VARIABLES (EXCLUDING ID & TARGET)
# ---------------------------------------------------------
num_vars <- train2 %>%
select(-INDEX, -TARGET) %>%
select(where(is.numeric)) %>%
colnames()
# ---------------------------------------------------------
# 2. CREATE MISSING VALUE FLAGS
# ---------------------------------------------------------
for (v in num_vars) {
flag_name <- paste0(v, "_MISSING")
train2[[flag_name]] <- ifelse(is.na(train2[[v]]), 1L, 0L)
eval2[[flag_name]] <- ifelse(is.na(eval2[[v]]), 1L, 0L)
}
# ---------------------------------------------------------
# 3. IMPUTE MISSING NUMERIC VALUES (MEDIAN IMPUTATION)
# ---------------------------------------------------------
for (v in num_vars) {
med <- median(train2[[v]], na.rm = TRUE)
train2[[v]][is.na(train2[[v]])] <- med
eval2[[v]][is.na(eval2[[v]])] <- med
}
# ---------------------------------------------------------
# 4. TRANSFORM SKEWED VARIABLES
# (Log transforms for linear models & stability)
# ---------------------------------------------------------
skewed_vars <- c(
"Alcohol", "Chlorides", "CitricAcid", "Density",
"FixedAcidity", "FreeSulfurDioxide",
"ResidualSugar", "Sulphates",
"TotalSulfurDioxide", "VolatileAcidity"
)
skewed_vars <- intersect(skewed_vars, names(train2))
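# Note: log(x + 1) is NaN for x <= -1, and several of the centered chemical
# variables contain such values, so the NaN warnings below are expected;
# the affected rows are screened out before the engineered models are fit.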
for (v in skewed_vars) {
train2[[paste0("LOG_", v)]] <- log(train2[[v]] + 1)
eval2[[paste0("LOG_", v)]] <- log(eval2[[v]] + 1)
}
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## (this pair of warnings repeats for each of the nine skewed variables that
## contain values at or below -1)
# ---------------------------------------------------------
# 5. BUCKETIZE KEY RATING VARIABLES
# ---------------------------------------------------------
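# Caveat: LabelAppeal is an integer score from -2 to 2 and cut() uses
# right-closed intervals, so -2 and -1 map to "VeryLow", 0 to "Low",
# 1 to "Neutral", 2 to "High", and "VeryHigh" is never populated
# (the empty level is dropped later by droplevels()).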
train2 <- train2 %>%
mutate(
STARS_BIN = factor(STARS, levels = c(1, 2, 3, 4)),
LABEL_BIN = cut(
LabelAppeal,
breaks = c(-Inf, -1, 0, 1, 2, Inf),
labels = c("VeryLow", "Low", "Neutral", "High", "VeryHigh")
)
)
eval2 <- eval2 %>%
mutate(
STARS_BIN = factor(STARS, levels = c(1, 2, 3, 4)),
LABEL_BIN = cut(
LabelAppeal,
breaks = c(-Inf, -1, 0, 1, 2, Inf),
labels = c("VeryLow", "Low", "Neutral", "High", "VeryHigh")
)
)
# ---------------------------------------------------------
# 6. INTERACTION TERMS (MARKETING EFFECTS)
# ---------------------------------------------------------
train2 <- train2 %>%
mutate(
STAR_LABEL_INTERACT = STARS * LabelAppeal
)
eval2 <- eval2 %>%
mutate(
STAR_LABEL_INTERACT = STARS * LabelAppeal
)
In this section, I constructed multiple models to predict the number of wine sample cases purchased (TARGET) using three different modeling frameworks: Poisson regression, Negative Binomial regression, and multiple linear regression. Because the response variable is a non-negative count, Poisson and Negative Binomial models are the primary modeling approaches, while linear regression is included as a benchmark for comparison, as required by the assignment.
Poisson Regression Models
I first fit a baseline Poisson regression model (P1) using a core set of predictors consisting of chemical properties of the wine (e.g., Alcohol, Chlorides, Density, Sulphates, pH) and the two most interpretable marketing variables: STARS and LabelAppeal. In a Poisson model, coefficients represent changes in the log of the expected count; exponentiating a coefficient yields the multiplicative change in expected sales for a one-unit increase in the predictor (illustrated numerically after the P1 summary in the Section 3 code).
Next, I fit an engineered Poisson regression model (P2) that expanded the feature set to include missing-value indicator flags, bucketized versions of key marketing variables (STARS_BIN, LABEL_BIN), log-transformed chemical variables, and an interaction term between STARS and LabelAppeal. This specification was designed to capture nonlinear effects, potential marketing synergies, and the possibility that missingness in chemical measurements is itself predictive of demand.
Before fitting the engineered models, I explicitly removed rows containing non-finite values (NA, NaN, or Inf) in any variable used by the engineered formula and dropped unused factor levels. This step ensures numerical stability and is standard practice when fitting generalized linear models with large engineered feature sets. Even so, the engineered fits report seven coefficients as NA because of singularities: the missingness flags for the six variables that were never missing (LabelAppeal, AcidIndex, CitricAcid, Density, FixedAcidity, VolatileAcidity) are constant zero, and STARS_BIN4 is linearly dependent on the numeric STARS term and the other STARS_BIN dummies.
Negative Binomial Regression Models
Because count data frequently exhibit overdispersion (variance exceeding the mean), I also fit Negative Binomial regression models, which relax the Poisson assumption by introducing an additional dispersion parameter.
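In the glm.nb parameterization used here, the variance is mu + mu^2/theta, so a larger theta implies less extra-Poisson variation and theta tending to infinity recovers the Poisson model exactly; this is worth keeping in mind when reading the very large theta estimates reported below.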
A baseline Negative Binomial model (NB1) was fit using the same predictors as Poisson model P1, allowing for a direct comparison between Poisson and Negative Binomial specifications. I then fit an engineered Negative Binomial model (NB2) using the same expanded feature set and cleaned modeling dataset as Poisson P2. Comparing these models helps determine whether accounting for overdispersion meaningfully improves model fit and inference.
Across both Poisson and Negative Binomial models, the coefficients for STARS and LabelAppeal were consistently positive, aligning with the theoretical expectation that higher expert ratings and more appealing labels increase distributor demand. In practice, the Poisson and Negative Binomial estimates agree to several decimal places, a direct consequence of the very large estimated theta, so the results are insensitive to this distributional choice.
Multiple Linear Regression Models
Although linear regression is not designed for count outcomes and can produce negative predictions, I fit two multiple linear regression models as benchmarks, consistent with the assignment requirements. Linear model L1 used the baseline predictor set, while Linear model L2 used the engineered predictor set after removing non-finite values.
These models provide a useful comparison point for understanding how coefficient direction and magnitude differ between count-based and linear frameworks. For example, some chemical variables exhibited similar directional effects across all models, while others showed differences in magnitude or sign, reflecting the different assumptions underlying linear and count regression.
Overdispersion Assessment and Interpretation
To assess the suitability of the Poisson models, I computed dispersion statistics as the sum of squared Pearson residuals divided by the residual degrees of freedom; values near 1 are consistent with the Poisson assumption. Both Poisson specifications yielded dispersion values slightly below 1 (0.96 for P1 and 0.86 for P2), so there is no evidence of overdispersion once covariates are accounted for. Consistent with this, glm.nb estimates an extremely large theta (roughly 39,000 to 42,000), and the Negative Binomial fits collapse to the Poisson fits; the Negative Binomial models are retained as a robustness check on the Poisson assumption.
Overall, Section 3 establishes a set of competing models with varying levels of complexity and interpretability. These models provide the foundation for formal comparison and selection in the next section, where I evaluate performance using information criteria and predictive accuracy and select a final count regression model for deployment.
############################################################
# SECTION 3 — BUILD MODELS
############################################################
library(tidyverse)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# ---------------------------------------------------------
# DEFINE PREDICTOR SETS
# ---------------------------------------------------------
base_vars <- c(
"STARS", "LabelAppeal",
"AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density",
"FixedAcidity", "FreeSulfurDioxide", "ResidualSugar",
"Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH"
)
base_vars <- base_vars[base_vars %in% names(train2)]
engineered_vars <- c(
base_vars,
paste0(base_vars, "_MISSING"),
"STARS_BIN", "LABEL_BIN",
"STAR_LABEL_INTERACT",
paste0("LOG_", c(
"Alcohol", "Chlorides", "CitricAcid", "Density",
"FixedAcidity", "FreeSulfurDioxide", "ResidualSugar",
"Sulphates", "TotalSulfurDioxide", "VolatileAcidity"
))
)
engineered_vars <- engineered_vars[engineered_vars %in% names(train2)]
formula_base <- as.formula(
paste("TARGET ~", paste(base_vars, collapse = " + "))
)
formula_engineered <- as.formula(
paste("TARGET ~", paste(engineered_vars, collapse = " + "))
)
# ---------------------------------------------------------
# CLEAN DATA FOR ENGINEERED MODELS
# ---------------------------------------------------------
model_vars <- all.vars(formula_engineered)
train2_model <- train2[, model_vars]
# Replace non-finite numeric values with NA
for (v in names(train2_model)) {
if (is.numeric(train2_model[[v]])) {
train2_model[[v]][!is.finite(train2_model[[v]])] <- NA
}
}
# Remove rows with NA in any model variable
train2_model <- na.omit(train2_model)
# Drop unused factor levels
train2_model <- droplevels(train2_model)
# ---------------------------------------------------------
# 3A. POISSON REGRESSION MODELS
# ---------------------------------------------------------
poisson_P1 <- glm(
formula_base,
data = train2,
family = poisson(link = "log")
)
summary(poisson_P1)
##
## Call:
## glm(formula = formula_base, family = poisson(link = "log"), data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.017e+00 1.957e-01 10.305 < 2e-16 ***
## STARS 2.198e-01 6.468e-03 33.986 < 2e-16 ***
## LabelAppeal 1.965e-01 6.021e-03 32.634 < 2e-16 ***
## AcidIndex -1.222e-01 4.463e-03 -27.381 < 2e-16 ***
## Alcohol 5.368e-03 1.410e-03 3.807 0.000141 ***
## Chlorides -6.005e-02 1.645e-02 -3.650 0.000262 ***
## CitricAcid 1.331e-02 5.892e-03 2.258 0.023925 *
## Density -4.315e-01 1.921e-01 -2.247 0.024651 *
## FixedAcidity -4.515e-04 8.195e-04 -0.551 0.581697
## FreeSulfurDioxide 1.424e-04 3.513e-05 4.054 5.04e-05 ***
## ResidualSugar 1.442e-04 1.545e-04 0.934 0.350499
## Sulphates -1.877e-02 5.739e-03 -3.271 0.001073 **
## TotalSulfurDioxide 1.065e-04 2.268e-05 4.695 2.67e-06 ***
## VolatileAcidity -5.045e-02 6.492e-03 -7.771 7.82e-15 ***
## pH -2.389e-02 7.639e-03 -3.128 0.001762 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18419 on 12780 degrees of freedom
## AIC: 50391
##
## Number of Fisher Scoring iterations: 5
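# Added illustration (a quick sketch, not part of the fitted output above):
# exponentiating a Poisson coefficient gives the multiplicative change in
# expected cases. From the summary above, exp(0.2198) is about 1.25, i.e.,
# each additional STARS point is associated with roughly 25% more expected
# cases, all else equal.
exp(coef(poisson_P1)[c("STARS", "LabelAppeal")])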
poisson_P2 <- glm(
formula_engineered,
data = train2_model,
family = poisson(link = "log")
)
summary(poisson_P2)
##
## Call:
## glm(formula = formula_engineered, family = poisson(link = "log"),
## data = train2_model)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7665455 9.9858748 -0.177 0.859583
## STARS 0.1833407 0.0137115 13.371 < 2e-16 ***
## LabelAppeal 0.2280013 0.0694794 3.282 0.001032 **
## AcidIndex -0.0582679 0.0078539 -7.419 1.18e-13 ***
## Alcohol 0.0096049 0.0060570 1.586 0.112797
## Chlorides -0.1643307 0.0799578 -2.055 0.039858 *
## CitricAcid 0.0054621 0.0222505 0.245 0.806083
## Density -7.3838116 26.2206865 -0.282 0.778248
## FixedAcidity -0.0060109 0.0033657 -1.786 0.074108 .
## FreeSulfurDioxide -0.0004539 0.0001870 -2.427 0.015215 *
## ResidualSugar -0.0018482 0.0008457 -2.185 0.028854 *
## Sulphates -0.0034563 0.0209725 -0.165 0.869102
## TotalSulfurDioxide -0.0002670 0.0001046 -2.553 0.010682 *
## VolatileAcidity -0.0243170 0.0244615 -0.994 0.320179
## pH 0.0046211 0.0136426 0.339 0.734814
## STARS_MISSING -1.0756637 0.0318688 -33.753 < 2e-16 ***
## LabelAppeal_MISSING NA NA NA NA
## AcidIndex_MISSING NA NA NA NA
## Alcohol_MISSING 0.0120966 0.0422350 0.286 0.774563
## Chlorides_MISSING 0.0146352 0.0394191 0.371 0.710435
## CitricAcid_MISSING NA NA NA NA
## Density_MISSING NA NA NA NA
## FixedAcidity_MISSING NA NA NA NA
## FreeSulfurDioxide_MISSING 0.0143471 0.0355484 0.404 0.686511
## ResidualSugar_MISSING 0.0338066 0.0349108 0.968 0.332859
## Sulphates_MISSING -0.0136384 0.0297321 -0.459 0.646443
## TotalSulfurDioxide_MISSING 0.0096797 0.0366741 0.264 0.791827
## VolatileAcidity_MISSING NA NA NA NA
## pH_MISSING -0.0161408 0.0489418 -0.330 0.741555
## STARS_BIN2 0.1421490 0.0237738 5.979 2.24e-09 ***
## STARS_BIN3 0.0787195 0.0282268 2.789 0.005290 **
## STARS_BIN4 NA NA NA NA
## LABEL_BINLow -0.0447566 0.0778991 -0.575 0.565598
## LABEL_BINNeutral -0.1681597 0.1454520 -1.156 0.247633
## LABEL_BINHigh -0.3321980 0.2174893 -1.527 0.126656
## STAR_LABEL_INTERACT 0.0133249 0.0136891 0.973 0.330357
## LOG_Alcohol -0.0584851 0.0539977 -1.083 0.278762
## LOG_Chlorides 0.1088846 0.0687263 1.584 0.113121
## LOG_CitricAcid -0.0094450 0.0225067 -0.420 0.674739
## LOG_Density 14.1873813 52.2295159 0.272 0.785902
## LOG_FixedAcidity 0.0439842 0.0237534 1.852 0.064069 .
## LOG_FreeSulfurDioxide 0.0746857 0.0191797 3.894 9.86e-05 ***
## LOG_ResidualSugar 0.0325048 0.0169642 1.916 0.055355 .
## LOG_Sulphates -0.0169094 0.0222265 -0.761 0.446790
## LOG_TotalSulfurDioxide 0.0780127 0.0213528 3.654 0.000259 ***
## LOG_VolatileAcidity -0.0146600 0.0226184 -0.648 0.516891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7729.2 on 4355 degrees of freedom
## Residual deviance: 4520.9 on 4317 degrees of freedom
## AIC: 15492
##
## Number of Fisher Scoring iterations: 6
# ---------------------------------------------------------
# 3B. NEGATIVE BINOMIAL REGRESSION MODELS
# ---------------------------------------------------------
nb_NB1 <- glm.nb(
formula_base,
data = train2
)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb_NB1)
##
## Call:
## glm.nb(formula = formula_base, data = train2, init.theta = 39421.16808,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.017e+00 1.957e-01 10.305 < 2e-16 ***
## STARS 2.198e-01 6.468e-03 33.984 < 2e-16 ***
## LabelAppeal 1.965e-01 6.021e-03 32.633 < 2e-16 ***
## AcidIndex -1.222e-01 4.464e-03 -27.380 < 2e-16 ***
## Alcohol 5.368e-03 1.410e-03 3.806 0.000141 ***
## Chlorides -6.005e-02 1.645e-02 -3.650 0.000262 ***
## CitricAcid 1.331e-02 5.892e-03 2.258 0.023931 *
## Density -4.315e-01 1.921e-01 -2.247 0.024655 *
## FixedAcidity -4.515e-04 8.196e-04 -0.551 0.581705
## FreeSulfurDioxide 1.424e-04 3.513e-05 4.054 5.04e-05 ***
## ResidualSugar 1.443e-04 1.545e-04 0.934 0.350498
## Sulphates -1.877e-02 5.739e-03 -3.271 0.001073 **
## TotalSulfurDioxide 1.065e-04 2.269e-05 4.695 2.67e-06 ***
## VolatileAcidity -5.045e-02 6.493e-03 -7.770 7.83e-15 ***
## pH -2.389e-02 7.639e-03 -3.128 0.001763 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39421.17) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18419 on 12780 degrees of freedom
## AIC: 50394
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39421
## Std. Err.: 59706
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50361.62
nb_NB2 <- glm.nb(
formula_engineered,
data = train2_model
)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb_NB2)
##
## Call:
## glm.nb(formula = formula_engineered, data = train2_model, init.theta = 41983.57677,
## link = log)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7666722 9.9863113 -0.177 0.859580
## STARS 0.1833415 0.0137122 13.371 < 2e-16 ***
## LabelAppeal 0.2280004 0.0694810 3.281 0.001033 **
## AcidIndex -0.0582696 0.0078542 -7.419 1.18e-13 ***
## Alcohol 0.0096051 0.0060573 1.586 0.112808
## Chlorides -0.1643333 0.0799607 -2.055 0.039862 *
## CitricAcid 0.0054619 0.0222516 0.245 0.806098
## Density -7.3841469 26.2218280 -0.282 0.778248
## FixedAcidity -0.0060111 0.0033658 -1.786 0.074114 .
## FreeSulfurDioxide -0.0004539 0.0001870 -2.427 0.015217 *
## ResidualSugar -0.0018483 0.0008457 -2.185 0.028854 *
## Sulphates -0.0034563 0.0209735 -0.165 0.869108
## TotalSulfurDioxide -0.0002670 0.0001046 -2.553 0.010683 *
## VolatileAcidity -0.0243183 0.0244626 -0.994 0.320173
## pH 0.0046207 0.0136432 0.339 0.734847
## STARS_MISSING -1.0756630 0.0318695 -33.752 < 2e-16 ***
## LabelAppeal_MISSING NA NA NA NA
## AcidIndex_MISSING NA NA NA NA
## Alcohol_MISSING 0.0120967 0.0422368 0.286 0.774571
## Chlorides_MISSING 0.0146367 0.0394208 0.371 0.710420
## CitricAcid_MISSING NA NA NA NA
## Density_MISSING NA NA NA NA
## FixedAcidity_MISSING NA NA NA NA
## FreeSulfurDioxide_MISSING 0.0143467 0.0355499 0.404 0.686533
## ResidualSugar_MISSING 0.0338058 0.0349124 0.968 0.332893
## Sulphates_MISSING -0.0136402 0.0297334 -0.459 0.646413
## TotalSulfurDioxide_MISSING 0.0096810 0.0366757 0.264 0.791808
## VolatileAcidity_MISSING NA NA NA NA
## pH_MISSING -0.0161418 0.0489440 -0.330 0.741550
## STARS_BIN2 0.1421487 0.0237748 5.979 2.25e-09 ***
## STARS_BIN3 0.0787192 0.0282284 2.789 0.005293 **
## STARS_BIN4 NA NA NA NA
## LABEL_BINLow -0.0447563 0.0779010 -0.575 0.565611
## LABEL_BINNeutral -0.1681607 0.1454554 -1.156 0.247641
## LABEL_BINHigh -0.3322002 0.2174944 -1.527 0.126662
## STAR_LABEL_INTERACT 0.0133250 0.0136897 0.973 0.330373
## LOG_Alcohol -0.0584876 0.0540002 -1.083 0.278763
## LOG_Chlorides 0.1088858 0.0687287 1.584 0.113129
## LOG_CitricAcid -0.0094448 0.0225078 -0.420 0.674761
## LOG_Density 14.1880415 52.2317923 0.272 0.785902
## LOG_FixedAcidity 0.0439854 0.0237543 1.852 0.064072 .
## LOG_FreeSulfurDioxide 0.0746876 0.0191805 3.894 9.86e-05 ***
## LOG_ResidualSugar 0.0325064 0.0169649 1.916 0.055353 .
## LOG_Sulphates -0.0169106 0.0222275 -0.761 0.446779
## LOG_TotalSulfurDioxide 0.0780155 0.0213535 3.654 0.000259 ***
## LOG_VolatileAcidity -0.0146596 0.0226194 -0.648 0.516920
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41983.58) family taken to be 1)
##
## Null deviance: 7728.8 on 4355 degrees of freedom
## Residual deviance: 4520.7 on 4317 degrees of freedom
## AIC: 15494
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41984
## Std. Err.: 60721
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -15413.81
# ---------------------------------------------------------
# 3C. MULTIPLE LINEAR REGRESSION MODELS
# ---------------------------------------------------------
lm_L1 <- lm(
formula_base,
data = train2
)
summary(lm_L1)
##
## Call:
## lm(formula = formula_base, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2211 -0.7540 0.3598 1.1254 4.3550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.355e+00 5.517e-01 9.707 < 2e-16 ***
## STARS 7.478e-01 1.946e-02 38.431 < 2e-16 ***
## LabelAppeal 5.945e-01 1.686e-02 35.250 < 2e-16 ***
## AcidIndex -3.259e-01 1.117e-02 -29.169 < 2e-16 ***
## Alcohol 1.883e-02 3.972e-03 4.739 2.17e-06 ***
## Chlorides -1.931e-01 4.638e-02 -4.164 3.15e-05 ***
## CitricAcid 3.976e-02 1.673e-02 2.377 0.017476 *
## Density -1.274e+00 5.427e-01 -2.347 0.018959 *
## FixedAcidity -1.168e-03 2.315e-03 -0.505 0.613911
## FreeSulfurDioxide 4.286e-04 9.941e-05 4.312 1.63e-05 ***
## ResidualSugar 4.716e-04 4.371e-04 1.079 0.280670
## Sulphates -5.485e-02 1.623e-02 -3.380 0.000728 ***
## TotalSulfurDioxide 3.098e-04 6.387e-05 4.851 1.25e-06 ***
## VolatileAcidity -1.549e-01 1.838e-02 -8.429 < 2e-16 ***
## pH -6.387e-02 2.154e-02 -2.965 0.003028 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.626 on 12780 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2871
## F-statistic: 369.1 on 14 and 12780 DF, p-value: < 2.2e-16
lm_L2 <- lm(
formula_engineered,
data = train2_model
)
summary(lm_L2)
##
## Call:
## lm(formula = formula_engineered, data = train2_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8016 -0.8473 0.0251 0.8394 5.6720
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.692e+00 2.223e+01 -0.391 0.69586
## STARS 7.028e-01 3.586e-02 19.598 < 2e-16 ***
## LabelAppeal 1.328e-01 1.172e-01 1.132 0.25749
## AcidIndex -1.398e-01 1.542e-02 -9.064 < 2e-16 ***
## Alcohol 3.382e-02 1.383e-02 2.444 0.01455 *
## Chlorides -3.705e-01 1.505e-01 -2.461 0.01387 *
## CitricAcid 3.251e-02 5.090e-02 0.639 0.52304
## Density -2.954e+01 5.826e+01 -0.507 0.61221
## FixedAcidity -1.805e-02 7.303e-03 -2.471 0.01351 *
## FreeSulfurDioxide -1.058e-03 3.998e-04 -2.646 0.00816 **
## ResidualSugar -4.653e-03 1.832e-03 -2.540 0.01111 *
## Sulphates 5.064e-04 4.753e-02 0.011 0.99150
## TotalSulfurDioxide -6.355e-04 2.211e-04 -2.874 0.00407 **
## VolatileAcidity -7.333e-02 5.467e-02 -1.341 0.17984
## pH 1.103e-02 3.015e-02 0.366 0.71438
## STARS_MISSING -2.343e+00 5.499e-02 -42.599 < 2e-16 ***
## LabelAppeal_MISSING NA NA NA NA
## AcidIndex_MISSING NA NA NA NA
## Alcohol_MISSING 2.108e-02 9.364e-02 0.225 0.82189
## Chlorides_MISSING 5.044e-02 8.842e-02 0.570 0.56842
## CitricAcid_MISSING NA NA NA NA
## Density_MISSING NA NA NA NA
## FixedAcidity_MISSING NA NA NA NA
## FreeSulfurDioxide_MISSING 2.130e-02 7.857e-02 0.271 0.78630
## ResidualSugar_MISSING 9.376e-02 8.004e-02 1.171 0.24150
## Sulphates_MISSING -4.650e-02 6.559e-02 -0.709 0.47844
## TotalSulfurDioxide_MISSING 1.786e-02 8.184e-02 0.218 0.82724
## VolatileAcidity_MISSING NA NA NA NA
## pH_MISSING -2.077e-02 1.064e-01 -0.195 0.84527
## STARS_BIN2 3.748e-01 5.860e-02 6.396 1.76e-10 ***
## STARS_BIN3 2.157e-01 7.953e-02 2.712 0.00672 **
## STARS_BIN4 NA NA NA NA
## LABEL_BINLow 9.393e-02 1.318e-01 0.713 0.47604
## LABEL_BINNeutral 9.823e-02 2.398e-01 0.410 0.68206
## LABEL_BINHigh 8.091e-03 3.616e-01 0.022 0.98215
## STAR_LABEL_INTERACT 1.597e-01 3.247e-02 4.917 9.12e-07 ***
## LOG_Alcohol -2.031e-01 1.229e-01 -1.652 0.09864 .
## LOG_Chlorides 2.118e-01 1.226e-01 1.727 0.08432 .
## LOG_CitricAcid -3.511e-02 5.156e-02 -0.681 0.49593
## LOG_Density 5.724e+01 1.161e+02 0.493 0.62206
## LOG_FixedAcidity 1.358e-01 5.108e-02 2.660 0.00785 **
## LOG_FreeSulfurDioxide 1.827e-01 3.964e-02 4.609 4.17e-06 ***
## LOG_ResidualSugar 8.040e-02 3.722e-02 2.160 0.03083 *
## LOG_Sulphates -6.021e-02 5.147e-02 -1.170 0.24210
## LOG_TotalSulfurDioxide 1.881e-01 4.266e-02 4.410 1.06e-05 ***
## LOG_VolatileAcidity -4.629e-02 5.225e-02 -0.886 0.37575
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.288 on 4317 degrees of freedom
## Multiple R-squared: 0.5556, Adjusted R-squared: 0.5517
## F-statistic: 142 on 38 and 4317 DF, p-value: < 2.2e-16
# ---------------------------------------------------------
# OVERDISPERSION DIAGNOSTICS (POISSON)
# ---------------------------------------------------------
dispersion_P1 <- sum(residuals(poisson_P1, type = "pearson")^2) /
poisson_P1$df.residual
dispersion_P2 <- sum(residuals(poisson_P2, type = "pearson")^2) /
poisson_P2$df.residual
dispersion_P1
## [1] 0.9623067
dispersion_P2
## [1] 0.8631152
In this section, I compared the competing Poisson, Negative Binomial, and multiple linear regression models to select a final model for deployment. Because the response variable (TARGET) represents a non-negative count, a count regression model must be selected for final use, even if alternative models show competitive performance.
Model Comparison Criteria
To compare models, I relied primarily on the Akaike Information Criterion (AIC), which balances model fit and complexity. Lower AIC values indicate better overall performance while penalizing excessive model complexity. I also examined dispersion statistics for the Poisson models to assess whether the Poisson assumption of equal mean and variance was reasonable.
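For reference, AIC = 2k - 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood; AIC differences of less than about two between models fit to the same data are generally treated as negligible.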
Poisson vs. Negative Binomial Models
The dispersion statistics for the baseline and engineered Poisson models were 0.96 and 0.86, slightly below 1 rather than above it. There is therefore no evidence of overdispersion, and the Poisson assumption of equal conditional mean and variance appears reasonable for these data.
The Negative Binomial models relax that assumption through an additional dispersion parameter, but with no overdispersion present they converge to the Poisson fits: theta is estimated in the tens of thousands, the coefficients match the Poisson coefficients to several decimal places, and each Negative Binomial AIC sits about two points above its Poisson counterpart (50,393.6 vs. 50,391.5 for the baseline pair; 15,493.8 vs. 15,491.7 for the engineered pair), the cost of one extra parameter. Note also that the engineered models were fit on the cleaned 4,356-row subset, so their AIC values are not directly comparable with those of the baseline models fit on all 12,795 rows.
Linear Regression Comparison
The multiple linear regression models were included as benchmarks, as required by the assignment. The engineered linear model actually attains the lowest AIC on the cleaned subset, but linear regression is poorly suited to count data: it can produce negative predictions (a quick count of negative fitted values appears in the Section 4 code) and ignores the discrete, non-negative nature of the response. As a result, linear regression was not selected for deployment despite its interpretability.
Final Model Selection
Because the Poisson and Negative Binomial fits are essentially identical here, the choice between them has little practical consequence. I selected the engineered Negative Binomial model (NB2) as the final model: it retains the count-appropriate framework required by the assignment, matches the best-fitting count specification, and remains robust should future data exhibit the overdispersion that count data often display. This model provides a sound balance of predictive performance, interpretability, and statistical validity for the wine sales data.
Evaluation Dataset Predictions
Using the selected Negative Binomial model, I generated predictions for the evaluation dataset. Because TARGET is blank in the evaluation file, it must be excluded when screening rows for missing values, and the INDEX column must be retained so each prediction can be matched back to its wine; rows whose log-transformed predictors are non-finite are dropped, so those wines receive no prediction under this approach. The predicted values represent the expected number of sample cases purchased for each remaining wine, conditional on its chemical properties and marketing characteristics. These predictions satisfy the assignment requirement to produce case-count forecasts for the evaluation data and can be used by the manufacturer to inform production and marketing decisions.
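Because the model uses a log link, predict() with type = "response" returns expected counts, exp(X beta-hat), on the original scale rather than the log-scale linear predictor.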
############################################################
# SECTION 4 — MODEL SELECTION AND EVALUATION
############################################################
# ---------------------------------------------------------
# 4A. MODEL COMPARISON USING INFORMATION CRITERIA
# ---------------------------------------------------------
model_comparison <- data.frame(
Model = c(
"Poisson P1 (Baseline)",
"Poisson P2 (Engineered)",
"NegBin NB1 (Baseline)",
"NegBin NB2 (Engineered)",
"Linear L1 (Baseline)",
"Linear L2 (Engineered)"
),
AIC = c(
AIC(poisson_P1),
AIC(poisson_P2),
AIC(nb_NB1),
AIC(nb_NB2),
AIC(lm_L1),
AIC(lm_L2)
)
)
model_comparison
## Model AIC
## 1 Poisson P1 (Baseline) 50391.49
## 2 Poisson P2 (Engineered) 15491.67
## 3 NegBin NB1 (Baseline) 50393.62
## 4 NegBin NB2 (Engineered) 15493.81
## 5 Linear L1 (Baseline) 48774.96
## 6 Linear L2 (Engineered) 14609.00
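# Added side-check (output not shown in the original run): one reason the
# linear benchmarks are not deployable for counts is that they can yield
# negative fitted values; this counts how many in-sample predictions fall
# below zero for each linear model.
sum(fitted(lm_L1) < 0)
sum(fitted(lm_L2) < 0)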
# ---------------------------------------------------------
# 4B. OVERDISPERSION CONFIRMATION
# ---------------------------------------------------------
dispersion_summary <- data.frame(
Model = c("Poisson P1", "Poisson P2"),
Dispersion = c(dispersion_P1, dispersion_P2)
)
dispersion_summary
## Model Dispersion
## 1 Poisson P1 0.9623067
## 2 Poisson P2 0.8631152
# ---------------------------------------------------------
# 4C. FINAL MODEL SELECTION
# ---------------------------------------------------------
# The Negative Binomial engineered model is selected for deployment
final_count_model <- nb_NB2
# ---------------------------------------------------------
# 4D. PREDICTIONS ON EVALUATION DATA
# ---------------------------------------------------------
# Build the evaluation design matrix with the same predictors as the
# training model; TARGET is blank in the evaluation file (it would make
# na.omit() drop every row), so exclude it and keep INDEX for the output
pred_vars <- setdiff(all.vars(formula_engineered), "TARGET")
eval_model <- eval2[, c("INDEX", pred_vars)]
# Replace non-finite numeric values (NaNs from the log transforms) with NA
for (v in pred_vars) {
if (is.numeric(eval_model[[v]])) {
eval_model[[v]][!is.finite(eval_model[[v]])] <- NA
}
}
# Remove rows with missing values
eval_model <- na.omit(eval_model)
eval_model <- droplevels(eval_model)
# Predict expected number of cases
eval_predictions <- predict(
final_count_model,
newdata = eval_model,
type = "response"
)
# Final prediction table
final_predictions <- data.frame(
INDEX = eval_model$INDEX,
TARGET_PRED = eval_predictions
)
head(final_predictions)